RELATED CASES
This application discloses subject matter also disclosed in the following
applications, filed herewith and assigned to Digital Equipment
Corporation, the assignee of this invention:
Ser. No. 547,589, filed Jun. 29, 1990, now abandoned, entitled BRANCH
PREDICTION IN HIGH-PERFORMANCE PROCESSOR, by Richard L. Sites and Richard
T. Witek, inventors;
Ser. No. 547,630, filed Jun. 29, 1990, entitled IMPROVING PERFORMANCE IN
REDUCED INSTRUCTION SET PROCESSOR, by Richard L. Sites and Richard T.
Witek, inventors;
Ser. No. 547,629, filed Jun. 29, 1990, now abandoned, entitled IMPROVING
BRANCH PERFORMANCE IN HIGH SPEED PROCESSOR, by Richard L. Sites and
Richard T. Witek, inventors;
Ser. No. 547,600, filed Jun. 29, 1990, now abandoned, entitled GRANULARITY
HINT FOR TRANSLATION BUFFER IN HIGH PERFORMANCE PROCESSOR, by Richard L.
Sites and Richard T. Witek, inventors;
Ser. No. 547,618, filed Jun. 29, 1990, now U.S. Pat. No. 5,193,167,
entitled ENSURING DATA INTEGRITY IN MULTIPROCESSOR OR PIPELINED PROCESSOR
SYSTEM, by Richard L. Sites and Richard T. Witek, inventors;
Ser. No. 547,684, filed Jun. 29, 1990, now abandoned, entitled IMPROVING
COMPUTER PERFORMANCE BY ELIMINATING BRANCHES, by Richard L. Sites and
Richard T. Witek, inventors; and
Ser. No. 547,992, filed Jun. 29, 1990, now abandoned, entitled BYTE-COMPARE
OPERATION FOR HIGH-PERFORMANCE PROCESSOR, by Richard L. Sites and Richard
T. Witek, inventors.
BACKGROUND OF THE INVENTION
This invention relates to digital computers, and more particularly to a
high-performance processor executing a reduced instruction set.
Complex instruction set or CISC processors are characterized by having a
large number of instructions in their instruction set, often including
memory-to-memory instructions with complex memory accessing modes. The
instructions are usually of variable length, with simple instructions
being only perhaps one byte in length, but the length ranging up to dozens
of bytes. The VAX™ instruction set is a primary example of CISC and
employs instructions having one to two byte opcodes plus from zero to six
operand specifiers, where each operand specifier is from one byte to many
bytes in length. The size of the operand specifier depends upon the
addressing mode, size of displacement (byte, word or longword), etc. The
first byte of the operand specifier describes the addressing mode for that
operand, while the opcode defines the number of operands: one, two or
three. When the opcode itself is decoded, however, the total length of the
instruction is not yet known to the processor because the operand
specifiers have not yet been decoded. Another characteristic of processors
of the VAX type is the use of byte or byte string memory references, in
addition to quadword or longword references; that is, a memory reference
may be of a length variable from one byte to multiple words, including
unaligned byte references.
Reduced instruction set or RISC processors are characterized by a smaller
number of instructions which are simple to decode, and by requiring that
all arithmetic/logic operations be performed register-to-register. Another
feature is that of allowing no complex memory accesses; all memory
accesses are register load/store operations, and there are a small number
of relatively simple addressing modes, i.e., only a few ways of specifying
operand addresses. Instructions are of only one length, and memory
accesses are of a standard data width, usually aligned. Instruction
execution is of the direct hardwired type, as distinct from microcoding.
There is a fixed instruction cycle time, and the instructions are defined
to be relatively simple so that they all execute in one short cycle (on
average, since pipelining will spread the actual execution over several
cycles).
One advantage of CISC processors is in writing source code. The variety of
powerful instructions, memory accessing modes and data types should result
in more work being done for each line of code (actually, compilers do not
produce code taking full advantage of this), but whatever gain in
compactness of source code is achieved comes at the expense of execution
time. Particularly as pipelining of instruction execution has become
necessary to achieve performance levels demanded of systems presently, the
data or state dependencies of successive instructions, and the vast
differences in memory access time vs. machine cycle time, produce
excessive stalls and exceptions, slowing execution. The advantage of RISC
processors is the speed of execution of code, but the disadvantage is that
less is accomplished by each line of code, and the code to accomplish a
given task is much more lengthy. One line of VAX code can accomplish the
same as many lines of RISC code.
When CPUs were much faster than memory, it was advantageous to do more work
per instruction, because otherwise the CPU would always be waiting for the
memory to deliver instructions--this factor led to more complex
instructions that encapsulated what would be otherwise implemented as
subroutines. When CPU and memory speed became more balanced, a simple
approach such as that of the RISC concepts becomes more feasible, assuming
the memory system is able to deliver one instruction and some data in each
cycle. Hierarchical memory techniques, as well as faster access cycles,
provide these faster memory speeds. Another factor that has influenced the
CISC vs. RISC choice is the change in relative cost of off-chip vs.
on-chip interconnection resulting from VLSI construction of CPUs.
Construction on chips instead of boards changes the economics--first it
pays to make the architecture simple enough to be on one chip, then more
on-chip memory is possible (and needed) to avoid going off-chip for memory
references. A further factor in the comparison is that adding more complex
instructions and addressing modes as in a CISC solution complicates (thus
slows down) stages of the instruction execution process. The complex
function might make the function execute faster than an equivalent
sequence of simple instructions, but it can lengthen the instruction cycle
time, making all instructions execute slower; thus an added function must
increase the overall performance enough to compensate for the decrease in
the instruction execution rate.
The performance advantages of RISC processors, taking into account these
and other factors, are considered to outweigh the shortcomings, and, were
it not for the existing software base, most new processors would probably
be designed using RISC features. A problem is that business enterprises
have invested many years of operating background, including operator
training as well as the cost of the code itself, in applications programs
and data structures using the CISC type processors which were the most
widely used in the past ten or fifteen years. The expense and disruption
of operations to rewrite all of the code and data structures to
accommodate a new processor architecture may not be justified, even though
the performance advantages ultimately expected to be achieved would be
substantial.
Accordingly, the objective is to accomplish all of the performance
advantages of a RISC-type processor architecture, but yet allow the data
structures and code previously generated for existing CISC-type processors
to be translated for use in a high-performance processor.
SUMMARY OF THE INVENTION
In accordance with one embodiment of the invention, a high-performance
processor is provided which is of the RISC type, using a standardized,
fixed instruction size, and permitting only a simplified memory access
data width, using simple addressing modes. The instruction set is limited
to register-to-register operations (for arithmetic and logic type
operations using the ALU, etc.) and register load/store operations where
memory is referenced; there are no memory-to-memory operations, nor
register-to-memory operations in which the ALU or other logic functions
are done. The functions performed by instructions are limited to allow
non-microcoded implementation, simple to decode and execute in a short
cycle. On-chip floating point processing is provided, and on-chip
instruction and data caches are employed in an example embodiment.
Byte manipulation instructions are included to permit use of
previously-established data structures. These instructions include the
facility for doing in-register byte extract, insert and masking, along
with non-aligned load and store instructions, so that byte addresses can
be made use of even though the actual memory operations are aligned
quadword in nature.
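By way of illustration only, the following Python sketch models how byte addresses might be served by aligned quadword memory operations combined with in-register extract, mask and insert steps. The function names, the dictionary used as memory, and the little-endian numbering of bytes within a quadword are assumptions made for this sketch; they are not the instruction definitions of the processor.

```python
MASK64 = (1 << 64) - 1

def extract_byte(quad, byte_addr):
    """Return the byte that byte_addr selects within an aligned quadword."""
    shift = (byte_addr & 7) * 8          # low 3 address bits pick the byte lane
    return (quad >> shift) & 0xFF

def mask_byte(quad, byte_addr):
    """Clear the selected byte lane, leaving the other seven bytes intact."""
    shift = (byte_addr & 7) * 8
    return quad & ~(0xFF << shift) & MASK64

def insert_byte(value, byte_addr):
    """Place the low byte of value into the selected lane of a zero quadword."""
    shift = (byte_addr & 7) * 8
    return (value & 0xFF) << shift

def load_byte(memory, byte_addr):
    """Byte load built from an aligned quadword load plus an extract."""
    quad = memory[byte_addr & ~7]        # the memory access itself is aligned
    return extract_byte(quad, byte_addr)

def store_byte(memory, byte_addr, value):
    """Byte store built from load, mask, insert, OR, and an aligned store."""
    base = byte_addr & ~7
    quad = memory[base]
    memory[base] = mask_byte(quad, byte_addr) | insert_byte(value, byte_addr)

# Toy memory holding one aligned quadword at address 0x1000.
memory = {0x1000: 0x8877665544332211}
```

In this model every access to `memory` uses an address with its low three bits cleared, so the memory system never sees a byte-granular reference.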
The provision of load/locked and store/conditional instructions permits the
implementation of atomic byte writes. To write to a byte address in a
multibyte (e.g., quadword) aligned memory, the CPU loads a quadword (or
longword) and locks this location, writes to the byte address in register
while leaving the remainder of the quadword undisturbed, then stores the
updated quadword in memory conditionally, depending upon whether the
quadword has been written by another processor since the load/locked
operation.
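The load/locked, modify, store/conditional sequence just described can be sketched as a retry loop. The following Python model is a simplified illustration: the `Memory` class, its per-processor lock table, and the `interfere` hook that simulates a write by another processor are all hypothetical, and the real locking mechanism is not specified at this level of detail here.

```python
class Memory:
    """Toy quadword memory with one lock record per processor (assumed model)."""
    def __init__(self):
        self.quads = {}
        self.locks = {}                  # cpu id -> locked aligned address

    def load_locked(self, cpu, addr):
        base = addr & ~7
        self.locks[cpu] = base
        return self.quads.get(base, 0)

    def store(self, cpu, addr, quad):
        """Ordinary store: clears any other processor's lock on this quadword."""
        base = addr & ~7
        self.quads[base] = quad
        for other, locked in list(self.locks.items()):
            if other != cpu and locked == base:
                del self.locks[other]

    def store_conditional(self, cpu, addr, quad):
        """Store only if this cpu's lock survived; return True on success."""
        base = addr & ~7
        if self.locks.get(cpu) != base:
            return False
        del self.locks[cpu]
        self.store(cpu, base, quad)
        return True

def atomic_byte_write(mem, cpu, byte_addr, value, interfere=None):
    """Retry load-locked / modify / store-conditional until the store succeeds.
    Returns the number of attempts that were needed."""
    shift = (byte_addr & 7) * 8
    attempts = 0
    while True:
        attempts += 1
        quad = mem.load_locked(cpu, byte_addr)
        if interfere is not None and attempts == 1:
            interfere()                  # simulate another processor's write
        quad = (quad & ~(0xFF << shift)) | ((value & 0xFF) << shift)
        if mem.store_conditional(cpu, byte_addr, quad):
            return attempts
```

When another processor writes the quadword between the load and the conditional store, the first attempt fails and the loop reloads the fresh quadword, so the intervening write is not lost.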
Another byte manipulation instruction, according to one feature of the
invention, is a byte compare instruction. All bytes of a quadword in a
register are compared to corresponding bytes in another register. The
result is a single byte (one bit for each byte compared) in a third
register. Since this operation is done to a general purpose register
(rather than to a special hardware location), several of the byte compares
can be done in sequence, and no added state must be accounted for upon
interrupt or the like. This byte compare can be used to advantage with a
byte zeroing instruction in which selected bytes of a quadword are zeroed,
with the bytes being selected by bits in a low-order byte of a register.
That is, the result of a byte compare can be used to zero bytes of another
register.
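As a hedged illustration of how a byte compare result might drive a byte zeroing operation, the following Python sketch models both on 64-bit integers. The comparison predicate is not stated in the paragraph above, so an unsigned "greater than or equal" test is assumed; the function names and the little-endian byte numbering are likewise assumptions of the sketch.

```python
MASK64 = (1 << 64) - 1

def byte_compare_ge(ra, rb):
    """Return an 8-bit result: bit i is set when byte i of ra >= byte i of rb."""
    result = 0
    for i in range(8):
        a = (ra >> (8 * i)) & 0xFF
        b = (rb >> (8 * i)) & 0xFF
        if a >= b:                       # assumed predicate: unsigned >=
            result |= 1 << i
    return result

def zero_bytes(ra, select):
    """Zero byte i of ra wherever bit i of the low-order byte of select is set."""
    out = ra & MASK64
    for i in range(8):
        if (select >> i) & 1:
            out &= ~(0xFF << (8 * i)) & MASK64
    return out

# Comparing an all-zero register against x flags every zero byte of x,
# because 0 >= b holds only when b is 0.
x = 0x4142004344000046
zero_mask = byte_compare_ge(0, x)
```

Because the compare result lands in an ordinary register, it can be passed directly as the selector of the zeroing step, as in `zero_bytes(y, zero_mask)`.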
BRIEF DESCRIPTION OF THE DRAWINGS
The novel features believed characteristic of the invention are set forth
in the appended claims. The invention itself, however, as well as other
features and advantages thereof, will be best understood by reference to
the detailed description of specific embodiments which follows, when read
in conjunction with the accompanying drawings, wherein:
FIG. 1 is an electrical diagram in block form of a computer system
employing a CPU which may employ features of the invention;
FIG. 2 is a diagram of data types used in the processor of FIG. 1;
FIG. 3 is an electrical diagram in block form of the instruction unit or
I-box of the CPU of FIG. 1;
FIG. 4 is an electrical diagram in block form of the integer execution unit
or E-box in the CPU of FIG. 1;
FIG. 5 is an electrical diagram in block form of the addressing unit or
A-box in the CPU of FIG. 1;
FIG. 6 is an electrical diagram in block form of the floating point
execution unit or F-box in the CPU of FIG. 1;
FIG. 7 is a timing diagram of the pipelining in the CPU of FIGS. 1-6;
FIG. 8 is a diagram of the instruction formats used in the instruction set
of the CPU of FIGS. 1-6;
FIG. 9 is a diagram of the format of a virtual address used in the CPU of
FIGS. 1-6;
FIG. 10 is a diagram of the format of a page table entry used in the CPU of
FIGS. 1-6; and
FIG. 11 is a diagram of the addressing translation mechanism used in the
CPU of FIGS. 1-6.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENT
Referring to FIG. 1, a computer system which may use features of the
invention, according to one embodiment, includes a CPU 10 connected by a
system bus 11 to a main memory 12, with an I/O unit (not shown) also
accessed via the system bus. The system may be of various levels, from a
stand-alone workstation up to a mid-range multiprocessor, in which case
other CPUs such as a CPU 15 also access the main memory 12 via the system
bus 11.
The CPU 10 is preferably a single-chip integrated circuit device, although
features of the invention could be employed in a processor constructed in
multi-chip form. Within the single chip an integer execution unit 16
(referred to as the "E-box") is included, along with a floating point
execution unit 17 (referred to as the "F-box"). Instruction fetch and
decoding is performed in an instruction unit 18 or "I-box", and an address
unit or "A-box" 19 performs the functions of address generation, memory
management, write buffering and bus interface. The memory is hierarchical,
with on-chip instruction and data caches being included in the instruction
unit 18 and address unit 19 in one embodiment, while a larger,
second-level cache 20 is provided off-chip, being controlled by a cache
controller in the address unit 19.
The CPU 10 employs an instruction set as described below in which all
instructions are of a fixed size, in this case 32-bit or one longword. The
instruction and data types employed are for byte, word, longword and
quadword, as illustrated in FIG. 2. As used herein, a byte is 8 bits, a
word is 16 bits or two bytes, a longword is 32 bits or four bytes, and a
quadword is 64 bits or eight bytes. The data paths and registers within
the CPU 10 are generally 64-bit or quadword size, and the memory 12 and
caches use the quadword as the basic unit of transfer. Performance is
enhanced by allowing only quadword or longword loads and stores, although,
in order to be compatible with data types used in prior software
development, byte manipulation is allowed by certain unique instructions,
still maintaining the feature of only quadword or longword loads and
stores.
Referring to FIG. 3, the instruction unit 18 or I-box is shown in more
detail. The primary function of the instruction unit 18 is to issue
instructions to the E-box 16, A-box 19 and F-box 17. The instruction unit
18 includes an instruction cache 21 which stores perhaps 8 Kbytes of
instruction stream data, and a quadword (two instructions) of this
instruction stream data is loaded to an instruction register 22 in each
cycle where the pipeline advances. The instruction unit 18, in a preferred
embodiment, decodes two instructions in parallel in decoders 23 and 24,
then checks that the required resources are available for both
instructions by check circuitry 25. If resources are available and dual
issue is possible then both instructions may be issued by applying
register addresses on busses 26 and 27 and control bits on microcontrol
busses 28 and 29 to the appropriate elements in the CPU 10. If the
resources are available for only the first instruction or the instructions
cannot be dual issued then the instruction unit 18 issues only the first
instruction from the decoder 23. The instruction unit 18 does not issue
instructions out of order, even if the resources are available for the
second instruction (from decoder 24) and not for the first instruction.
The instruction unit 18 does not issue instructions until the resources
for the first instruction become available. If only the first of a pair of
instructions issues (from the decoder 23), the instruction unit 18 does
not advance another instruction into the instruction register 22 to
attempt to dual issue again. Dual issue is only attempted on aligned
quadword pairs as fetched from memory (or instruction cache 21) and loaded
to instruction register 22 as an aligned quadword.
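The in-order issue policy described above can be summarized in a few lines. The following Python sketch is illustrative only; the function names and the boolean inputs standing in for the outputs of the check circuitry 25 are assumptions.

```python
def issue_pair(first_ready, second_ready, can_pair):
    """Return how many instructions of an aligned pair issue this cycle (0, 1, or 2).

    first_ready / second_ready: resources are available for that instruction.
    can_pair: the two instructions are of classes that may dual issue.
    """
    if not first_ready:
        return 0                 # never issue the second ahead of the first
    if second_ready and can_pair:
        return 2                 # dual issue
    return 1                     # only the first instruction issues

def cycles_for_stream(pair_flags):
    """Issue cycles for a stream of aligned pairs, assuming resources are
    always free. A pair that cannot dual issue takes two cycles, because the
    second instruction issues alone before the next pair is loaded."""
    return sum(1 if can_pair else 2 for can_pair in pair_flags)
```

For example, `cycles_for_stream([True, False, True])` counts one cycle for each pairable pair and two for the pair that must be split.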
The instruction unit 18 contains a branch prediction circuit 30 responsive
to the instructions in the instruction stream to be loaded into register
22. The prediction circuit 30 along with a subroutine return stack 31 is
used to predict branch addresses and to cause address generating circuitry
32 to prefetch the instruction stream before needed. The subroutine return
stack 31 (having four entries, for example) is controlled by the hint bits
in the jump, jump to subroutine and return instructions as will be
described. The virtual PC (program counter) 33 is included in the address
generation circuitry 32 to produce addresses for instruction stream data
in the selected order.
One branch prediction method is the use of the value of the sign bit of the
branch displacement to predict conditional branches, so the circuit 30 is
responsive to the sign bit of the displacement appearing in the branch
instructions appearing at inputs 35. If the sign bit is negative, it
predicts the branch is taken, and addressing circuit 32 adds the
displacement to register Ra (one of the registers of register set 43, as
selected by field Ra of the instruction) to produce the first address of
the new address sequence to be fetched. If the sign is positive it
predicts not taken, and the present instruction stream is continued in
sequence.
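The sign-bit prediction rule can be sketched as follows. This Python fragment is a minimal illustration; the function names are hypothetical, and the field width is passed in as a parameter because the size of the displacement field is not stated at this point in the description (the 21-bit width in the usage lines is purely an example value).

```python
def sign_extend(field, width):
    """Interpret a width-bit field as a two's-complement integer."""
    sign_bit = 1 << (width - 1)
    return (field & (sign_bit - 1)) - (field & sign_bit)

def predict_taken(disp_field, width):
    """Predict taken when the displacement's sign bit is set (backward branch)."""
    return bool(disp_field & (1 << (width - 1)))

# Example values only: a backward displacement of -4 and a forward one of +4.
backward = (1 << 21) - 4
forward = 4
```

A negative displacement points backward in the instruction stream, so `predict_taken(backward, 21)` yields a taken prediction while `predict_taken(forward, 21)` does not.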
The instruction unit 18 contains an 8-entry fully associative translation
buffer (TB) 36 to cache recently used instruction-stream address
translations and protection information for 8 Kbyte pages. Although 64-bit
addresses are nominally possible, as a practical matter 43-bit addresses
are adequate for the present. Every cycle the 43-bit virtual program
counter 33 is presented to the instruction stream TB 36. If the page table
entry (PTE) associated with the virtual PC is cached in the TB 36 then the
page frame number (PFN) and protection bits for the page which contains
the virtual PC are used by the instruction unit 18 to complete the address
translation and access checks. A physical address is thus applied to the
address input 37 of the instruction cache 21, or if there is a cache miss
then this instruction stream physical address is applied by the bus 38
through the address unit 19 to the cache 20 or memory 12. In a preferred
embodiment, the instruction stream TB 36 supports any of the four
granularity hint block sizes as defined below, so that the probability of
a hit in the TB 36 is increased.
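A translation buffer lookup of the kind described above might be modeled as follows. This Python sketch assumes 8 Kbyte pages (a 13-bit page offset) and a 43-bit virtual address as stated in the text; the class and method names, the string used for protection bits, and the first-in-first-out replacement are assumptions, and granularity hints are not modeled.

```python
PAGE_BITS = 13                       # 8 Kbyte pages
VA_BITS = 43

class TranslationBuffer:
    """Fully associative buffer mapping virtual page numbers to (PFN, protection)."""
    def __init__(self, entries=8):
        self.entries = entries
        self.slots = []              # list of (vpn, pfn, prot), oldest first

    def fill(self, vpn, pfn, prot):
        if len(self.slots) == self.entries:
            self.slots.pop(0)        # FIFO replacement: an assumption
        self.slots.append((vpn, pfn, prot))

    def translate(self, vaddr, access):
        """Return the physical address, or None on a TB miss.
        Raises PermissionError if the cached protection bits forbid the access."""
        vaddr &= (1 << VA_BITS) - 1
        vpn = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        for slot_vpn, pfn, prot in self.slots:   # every entry is compared
            if slot_vpn == vpn:
                if access not in prot:
                    raise PermissionError(access)
                return (pfn << PAGE_BITS) | offset
        return None
```

On a hit the page offset passes through unchanged and only the page number is replaced by the page frame number; a `None` result stands for the miss case in which the page table entry must be fetched.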
The execution unit or E-box 16 is shown in more detail in FIG. 4. The
execution unit 16 contains the 64-bit integer execution datapath including
an arithmetic/logic unit (ALU) 40, a barrel shifter 41, and an integer
multiplier 42. The execution unit 16 also contains the 32-register 64-bit
wide register file 43, containing registers R0 to R31, although R31 is
hardwired as all zeros. The register file 43 has four read ports and two
write ports which allow the sourcing (sinking) of operands (results) to
both the integer execution datapath and the address unit 19. A bus
structure 44 connects two of the read ports of the register file 43 to the
selected inputs of the ALU 40, the shifter 41 or the multiplier 42 as
specified by the control bits of the decoded instruction on busses 28 or
29 from the instruction unit 18, and connects the output of the
appropriate function to one of the write ports to store the result. That
is, the address fields from the instruction are applied by the busses 26
or 27 to select the registers to be used in executing the instruction, and
the control bits 28 or 29 define the operation in the ALU, etc., and
define which internal busses of the bus structure 44 are to be used when,
etc.
The A-box or address unit 19 is shown in more detail in FIG. 5. The A-box
19 includes five functions: address translation using a translation buffer
48, a load silo 49 for incoming data, a write buffer 50 for outgoing write
data, an interface 51 to a data cache, and the external interface 52 to
the bus 11. The address translation datapath has the displacement adder 53
which generates the effective address (by accessing the register file 43
via the second set of read and write ports, and the PC), the data TB 48
which generates the physical address on address bus 54, and muxes and
bypassers needed for the pipelining.
The 32-entry fully associative data translation buffer 48 caches
recently-used data-stream page table entries for 8 Kbyte pages. Each entry
supports any of the four granularity hint block sizes, and a detector 55
is responsive to the granularity hint as described below to change the
number of low-order bits of the virtual address passed through from
virtual address bus 56 to the physical address bus 54.
For load and store instructions, the effective 43-bit virtual address is
presented to TB 48 via bus 56. If the PTE of the supplied virtual address
is cached in the TB 48, the PFN and protection bits for the page which
contains the address are used by the address unit 19 to complete the
address translation and access checks.
The write buffer 50 has two purposes: (1) To minimize the number of CPU
stall cycles by providing a high bandwidth (but finite) resource for
receiving store data. This is required since the CPU 10 can generate store
data at the peak rate of one quadword every CPU cycle which may be greater
than the rate at which the external cache 20 can accept the data; and (2)
To attempt to aggregate store data into aligned 32-byte cache blocks for
the purpose of maximizing the rate at which data may be written from the
CPU 10 into the external cache 20. The write buffer 50 has eight entries.
A write buffer entry is invalid if it does not contain data to be written
or is valid if it contains data to be written. The write buffer 50
contains two pointers: the head pointer 57 and the tail pointer 58. The
head pointer 57 points to the valid write buffer entry which has been
valid the longest period of time. The tail pointer 58 points to the valid
buffer entry slot which will next be validated. If the write buffer 50 is
completely full (empty) the head and tail pointers point to the same valid
(invalid) entry. Each time the write buffer 50 is presented with a new
store instruction the physical address generated by the instruction is
compared to the address in each valid write buffer entry. If the address
is in the same aligned 32-byte block as an address in a valid write buffer
entry then the store data is merged into that entry and the entry's
longword mask bits are updated. If no matching address is found in the
write buffer then the store data is written into the entry designated by
the tail pointer 58, the entry is validated, and the tail pointer 58 is
incremented to the next entry.
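The merge-or-allocate behavior of the write buffer can be sketched as below. This Python model is a simplified illustration: it accepts one longword per call, represents each entry as a dictionary, and adds a hypothetical `drain_one` helper for retiring the oldest entry. Full and empty are distinguished by the valid bit of the entry under the pointers, following the description above.

```python
BLOCK_BYTES = 32
ENTRIES = 8

class WriteBuffer:
    def __init__(self):
        # Each entry: valid flag, aligned block address, longword mask, data.
        self.entries = [{"valid": False, "block": None, "mask": 0, "data": {}}
                        for _ in range(ENTRIES)]
        self.head = 0                # entry that has been valid the longest
        self.tail = 0                # entry that will next be validated

    def store_longword(self, paddr, value):
        """Merge into a matching valid entry, else allocate at the tail.
        Returns False when the buffer is full and the store must wait."""
        block = paddr & ~(BLOCK_BYTES - 1)
        lw = (paddr & (BLOCK_BYTES - 1)) >> 2      # longword index 0..7
        for e in self.entries:
            if e["valid"] and e["block"] == block:
                e["data"][lw] = value
                e["mask"] |= 1 << lw               # update longword mask bits
                return True
        if self.entries[self.tail]["valid"]:       # tail slot still valid: full
            return False
        e = self.entries[self.tail]
        e.update(valid=True, block=block, mask=1 << lw, data={lw: value})
        self.tail = (self.tail + 1) % ENTRIES
        return True

    def drain_one(self):
        """Retire the oldest entry toward the external cache (hypothetical helper)."""
        e = self.entries[self.head]
        if not e["valid"]:                         # head slot invalid: empty
            return None
        out = (e["block"], e["mask"], dict(e["data"]))
        e.update(valid=False, block=None, mask=0, data={})
        self.head = (self.head + 1) % ENTRIES
        return out
```

Note that a store to a block already held in the buffer merges even when all eight entries are valid, since merging does not need a free entry.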
The address unit 19 contains a fully folded memory reference pipeline which
may accept a new load or store instruction every cycle until a fill of a
data cache 59 ("D-cache") is required. Since the data cache 59 lines are
only allocated on load misses, the address unit 19 may accept a new
instruction every cycle until a load miss occurs. When a load miss occurs
the instruction unit 18 stops issuing all instructions that use the load
port of the register file 43 (load, store, jump subroutine, etc.,
instructions).
Since the result of each data cache 59 lookup is known late in the pipeline
(stage S7 as will be described) and instructions are issued in pipe stage
S3, there may be two instructions in the address unit 19 pipeline behind a
load instruction which misses the data cache 59. These two instructions
are handled as follows: First, loads which hit the data cache 59 are
allowed to complete, hit under miss. Second, load misses are placed in the
silo 49 and replayed in order after the first load miss completes. Third,
store instructions are presented to the data cache 59 at their normal time
with respect to the pipeline. They are silo'ed and presented to the write
buffer 50 in order with respect to load misses.
The on-chip pipelined floating point unit 17 or F-box as shown in more
detail in FIG. 6 is capable of executing both DEC and IEEE floating point
instructions according to the instruction set to be described. The
floating point unit 17 contains a 32-entry, 64-bit, floating point
register file 61, and a floating point arithmetic and logic unit 62.
Divides and multiplies are performed in a multiply/divide circuit 63. A
bus structure 64 interconnects two read ports of the register file 61 to
the appropriate functional circuit as directed by the control bits of the
decoded instruction on busses 28 or 29 from the instruction unit 18. The
registers selected for an operation are defined by the output buses 26 or
27 from the instruction decode. The floating point unit 17 can accept an
instruction every cycle, with the exception of floating point divide
instructions, which can be accepted only every several cycles. A latency
of more than one cycle is exhibited for all floating point instructions.
In an example embodiment, the CPU 10 has an 8 Kbyte data cache 59, and 8
Kbyte instruction cache 21, with the size of the caches depending on the
available chip area. The on-chip data cache 59 is write-through, direct
mapped, read-allocate physical cache and has 32-byte (1-hexaword) blocks.
The system may keep the data cache 59 coherent with memory 12 by using an
invalidate bus, not shown. The data cache 59 has longword parity in the
data array 66 and there is a parity bit for each tag entry in tag store
67.
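For the example sizes given above (8 Kbytes, 32-byte blocks, direct mapped), a physical address divides into a block offset, a line index and a tag. The following Python sketch illustrates that split together with the read-allocate, write-through policy; the class and method names are assumptions, and parity and the invalidate bus are not modeled.

```python
CACHE_BYTES = 8 * 1024
BLOCK_BYTES = 32
LINES = CACHE_BYTES // BLOCK_BYTES       # 256 lines when direct mapped

class DataCache:
    """Direct-mapped, write-through, read-allocate cache model (tags only)."""
    def __init__(self):
        self.tags = [None] * LINES

    @staticmethod
    def split(paddr):
        """Split a physical address into (tag, line index, block offset)."""
        offset = paddr % BLOCK_BYTES
        index = (paddr // BLOCK_BYTES) % LINES
        tag = paddr // CACHE_BYTES
        return tag, index, offset

    def load(self, paddr):
        """Return True on a hit; a load miss allocates the line."""
        tag, index, _ = self.split(paddr)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False

    def store(self, paddr):
        """Return True on a hit; a store miss does not allocate a line.
        In either case the data would also be sent on (write-through)."""
        tag, index, _ = self.split(paddr)
        return self.tags[index] == tag
```

Two addresses that differ by a multiple of 8 Kbytes map to the same line, so in this model loading one evicts the other.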
The instruction cache 21 may be 8 Kbytes, or 16 Kbytes, for example, or may
be larger or smaller, depending upon die area. Although described above as
using physical addressing with a TB 36, it may also be a virtual cache, in
which case it will contain no provision for maintaining its coherence with
memory 12. If the cache 21 is a physical addressed cache the chip will
contain circuitry for maintaining its coherence with memory: (1) when the
write buffer 50 entries are sent to the external interface 52, the address
will be compared against a duplicate instruction cache 21 tag, and the
corresponding block of instruction cache 21 will be conditionally
invalidated; (2) the invalidate bus will be connected to the instruction
cache 21.
The main data paths and registers in the CPU 10 are all 64 bits wide. That
is, each of the integer registers 43, as well as each of the floating
point registers 61, is a 64-bit register, and the ALU 40 has two 64-bit
inputs 40a and 40b and a 64-bit output 40c. The bus structure 44 in the
execution unit 16, which actually consists of more than one bus, has
64-bit wide data paths for transferring operands between the integer
registers 43 and the inputs and output of the ALU 40. The instruction
decoders 23 and 24 produce register address outputs 26 and 27 which are
applied to the addressing circuits of the integer registers 43 and/or
floating point registers 61 to select which register operands are used as
inputs to the ALU 40 or 62, and which of the registers 43 or registers 61
is the destination for the ALU (or other functional unit) output.
The dual issue decision is made by the circuitry 25 according to the
following requirement, where only one instruction from the first column
and one instruction from the second column can be issued in one cycle:
______________________________________
Column A              Column B
______________________________________
Integer Operate       Floating Operate
Floating Load/Store   Integer Load/Store
Floating Branch       Integer Branch
                      JSR
______________________________________
That is, the CPU 10 can allow dual issue of an integer load or store
instruction with an integer operate instruction, but not an integer branch
with an integer load or store. Of course, the circuitry 25 also checks to
see if the resources are available before allowing two instructions to
issue in the same cycle.
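The column rule can be expressed as a simple set-membership test. The following Python sketch is illustrative: the class-name strings and the function name are assumptions, JSR is omitted from the sketch, and the separate resource check performed by the circuitry 25 is not modeled.

```python
COLUMN_A = {"integer operate", "floating load/store", "floating branch"}
COLUMN_B = {"floating operate", "integer load/store", "integer branch"}

def can_dual_issue(first, second):
    """True when the two instruction classes come from different columns."""
    for name in (first, second):
        if name not in COLUMN_A | COLUMN_B:
            raise ValueError(f"unknown instruction class: {name}")
    return (first in COLUMN_A) != (second in COLUMN_A)
```

With these sets, an integer load/store pairs with an integer operate, while an integer branch and an integer load/store fall in the same column and cannot issue together.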
An important feature is the RISC characteristic of the CPU 10 of FIGS. 1-6.
The instructions executed by this CPU 10 are always of the same size, in
this case 32 bits, instead of allowing variable-length instructions. The
instructions execute on average in one machine cycle (pipelined as
described below, and assuming no stalls), rather than a variable number of
cycles. The instruction set includes only register-to-register
arithmetic/logic type of operations, or register-to-memory (or
memory-to-register) load/store type of operations, and there are no
complex memory addressing modes such as indirect, etc. An instruction
performing an operation in the ALU 40 always gets its operands from the
register file 43 (or from a field of the instruction itself) and always
writes the result to the register file 43; these operands are never
obtained from memory and the result is never written to memory. Loads from
memory are always to a register in register files 43 or 61, and stores to
memory are always from a register in the register files.
Referring to FIG. 7, the CPU 10 has a seven stage pipeline for integer
operate and memory reference instructions. The instruction unit 18 has a
seven stage pipeline to determine instruction cache 21 hit/miss. FIG. 7 is
a pipeline diagram for the pipeline of execution unit 16, instruction unit
18 and address unit 19. The floating point unit 17 defines a pipeline in
parallel with that of the execution unit 16, but ordinarily employs more
stages to execute. The seven stages are referred to as S0-S6, where a
stage is to be executed in one machine cycle (clock cycle). The first four
stages S0, S1, S2 and S3 are executed in the instruction unit 18, and the
last three stages S4, S5 and S6 are executed in one or the other of the
execution unit 16 or address unit 19, depending upon whether the
instruction is an operate or a load/store. There are bypassers in all of
the boxes that allow the results of one instruction to be used as operands
of a following instruction without having to be written to the register
file 43 or 61.
The first stage S0 of the pipeline is the instruction fetch or IF stage,
during which the instruction unit 18 fetches two new instructions from the
instruction cache 21, using the PC 33 address as a base. The second stage
S1 is the swap stage, during which the two fetched instructions are
evaluated by the circuit 25 to see if they can be issued at the same time.
The third stage S2 is the decode stage, during which the two instructions
are decoded in the decoders 23 and 24 to produce the control signals 28
and 29 and register addresses 26 and 27. The fourth stage S3 is the
register file 43 access stage for operate instructions, and also is the
issue check decision point for all instructions, and the instruction issue
stage. The fifth stage S4 is cycle one of the computation (in ALU 40, for
example) if it is an operate instruction, and also the instruction unit 18
computes the new PC 33 in address generator 32; if it is a memory
reference instruction the address unit 19 calculates the effective data
stream address using the adder 53. The sixth stage S5 is cycle two of the
computation (e.g., in ALU 40). In the last stage S6, the result is written
to the register file 43 via the write port; S6 is also the data cache 59
or instruction cache 21 hit/miss decision point for instruction stream or
data stream references.
The CPU 10 pipeline divides these seven stages S0-S6 of instruction
processing into four static and three dynamic stages of execution. The
first four stages S0-S3 consist of the instruction fetch, swap, decode and
issue logic as just described. These stages S0-S3 are static in that
instructions may remain valid in the same pipeline stage for multiple
cycles while waiting for a resource or stalling for other reasons. These
stalls are also referred to as pipeline freezes. A pipeline freeze may
occur while zero instructions issue, or while one instruction of a pair
issues and the second is held at the issue stage. A pipeline freeze
implies that a valid instruction or instructions is (are) presented to be
issued but cannot proceed.
Upon satisfying all issue requirements, instructions are allowed to
continue through the pipeline toward completion. After issuing in S3,
instructions cannot be held in a given pipe stage S4-S6. It is up to the
issue stage S3 (circuitry 25) to ensure that all resource conflicts are
resolved before an instruction is allowed to continue. The only means of
stopping instructions after the issue stage S3 is an abort condition.
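The distinction between the static stages S0-S3 and the dynamic stages S4-S6 can be illustrated with a toy one-cycle step function. This Python sketch is a simplification under stated assumptions: it tracks a single instruction label per stage (dual issue is ignored), it does not model fetch into S0 or abort conditions, and the `step` function and dictionary representation are hypothetical.

```python
STAGES = ["S0", "S1", "S2", "S3", "S4", "S5", "S6"]
STATIC = set(STAGES[:4])     # fetch, swap, decode, issue: may freeze
DYNAMIC = set(STAGES[4:])    # after issue: advance every cycle

def step(pipeline, freeze):
    """Advance the pipeline by one cycle.

    pipeline maps stage name -> instruction label or None.
    freeze=True holds S0-S3 in place while S4-S6 keep draining.
    Returns the new pipeline and the instruction that completed, if any.
    """
    done = pipeline["S6"]
    new = {s: None for s in STAGES}
    new["S6"] = pipeline["S5"]
    new["S5"] = pipeline["S4"]
    if freeze:
        for s in STATIC:
            new[s] = pipeline[s]         # held in place; a bubble enters S4
    else:
        new["S4"] = pipeline["S3"]
        new["S3"] = pipeline["S2"]
        new["S2"] = pipeline["S1"]
        new["S1"] = pipeline["S0"]       # S0 left empty; fetch is not modeled
    return new, done
```

During a freeze the instructions already past S3 continue to completion, which is why the issue stage must resolve every resource conflict before letting an instruction proceed.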
Aborts may result from a number of causes. In general, they may be grouped
into two classes, namely exceptions (including interrupts) and
non-exceptions. The basic difference between the two is that exceptions
require that the pipeline be flushed of all instructions which were
fetched subsequent to the instruction which caused the abort condition,
including dual issued instructi