|
References  |
|
|
| *references marked with an asterisk below are user-added references |
|
U.S. References |
|
|
|
|
|
|
U.S. References |
|
|
Foreign References |
|
|
|
|
|
|
Foreign References |
|
|
Other References |
|
|
| Add a new Other reference: |
| Post related web sites and other references in this section |
| | Reference | Relevancy | Comments | Perleberg, C.H. et al., Branch Target Buffer Design and Optimization, Apr. 1993, IEEE Transactions on Computers, vol. 42, No. 4, pp. 396-412.
cited by examiner
. Oct,2006 |      Your vote accepted [0 after 0 votes] | | IBM, Partial Address Recording in Branch History Tables, IBM Technical Disclosure Bulletin,Jun. 1, 1993, vol. 36, Issue 6B, pp. 379-380. cited by examiner
. Oct,2006 |      Your vote accepted [0 after 0 votes] | | FOLDOC, Free Online Dictionary of Computing, http://wombat.doc.ic.ac.uk/foldoc/foldoc.cgi?query=second+level+cache, Jun. 25, 1997. cited by examiner
. Oct,2006 |      Your vote accepted [0 after 0 votes] | | FOLDOC, Free Online Dictionary of Computing, http://wombat.doc.ic.ac.uk/foldoc/foldoc/cgi?cache, Jun. 25, 1997. cited by examiner
. Oct,2006 |      Your vote accepted [0 after 0 votes] | | U.S. Appl. No. 09/441,630, filed Nov. 16, 1999. cited by other
. Oct,2006 |      Your vote accepted [0 after 0 votes] | | "Branch Buffering"; IBM Technical Disclosure Bulletin, IBM Corp., New York, US; vol. 36, No. 5, May 1, 1993; pp. 129-131. cited by other
. Oct,2006 |      Your vote accepted [0 after 0 votes] | | "Branch Target Buffer Design and Optimization"; Perleberg, Chris H., et al.; IEEE Inc., New York, U.S.; vol. 42, No. 4, Apr. 1, 1993; pp. 396-412. cited by other
. Oct,2006 |      Your vote accepted [0 after 0 votes] | | International Search Reprot, Application No. PCT/US02/20481, International Filing Date Jun. 27, 2002. cited by other. Oct,2006 |      Your vote accepted [0 after 0 votes] | | |
|
|
|
|
Other References |
|
|
|
|
|
References  |
|
|
Claims  |
|
|
What is claimed is:
1. A branch prediction method comprising: receiving a fetch address; detecting a first level cache does not contain a first branch prediction information corresponding to
the fetch address; determining whether a second level cache contains a second branch prediction information corresponding to said fetch address, said second branch prediction information comprising a subset of said first branch prediction information;
rebuilding said first branch prediction information in response to determining said second level cache contains said second branch prediction information, wherein said rebuilding comprises: receiving said second branch prediction information; receiving
a group of instructions corresponding to the fetch address; utilizing the second branch prediction information to identify one or more predicted taken branches within the group of instructions; generating third branch prediction information by decoding
each of the identified one or more predicted taken branches to determine a type of each of the one or more predicted taken branches, and determining a size of any immediate or displacement data for each of the one or more predicted taken branches; and
combining said second branch prediction information with said third branch prediction information; storing said combined second and third branch prediction information as said first branch prediction information in a first entry of said first level
cache, wherein said first entry corresponds to said fetch address.
2. The method of claim 1, further comprising: determining if said first entry of said first level cache is available; evicting contents of said first entry in response to detecting said first entry is not available; and storing a subset of
said contents in said second level cache responsive to said eviction.
3. The method of claim 1, wherein generating said third branch prediction information further comprises determining whether each of the one or more predicted taken branches ends on an even addressed byte or an odd addressed byte.
4. The method of claim 3, wherein said branch instruction is fetched from said second level cache.
5. The method of claim 1, wherein said subset comprises a dynamic bit.
6. The method of claim 5, wherein said subset further comprises a branch marker bit.
7. The method of claim 6, wherein said branch prediction further comprises an end adjustment bit.
8. The method of claim 1, wherein said second level cache and said first level cache do not store duplicate information.
9. A branch prediction mechanism comprising: a first level cache configured to store branch prediction information; a second level cache configured to store a subset of said branch prediction information; circuitry coupled to said first level
cache and said second level cache, wherein said circuitry is configured to: detect said first level cache does not contain a first branch prediction information corresponding to a fetch address; determine whether said second level cache contains a
second branch prediction information corresponding to said fetch address, said second branch prediction information comprising a subset of said first branch prediction information; and rebuild said first branch prediction information in response to
determining said second level cache contains said second branch prediction information, wherein in order to rebuild said first branch prediction information, said circuitry is configured to: receive said second branch prediction information; receive a
group of instructions corresponding to the fetch address; utilize the second branch prediction information to identify one or more predicted taken branches within the group of instructions; generate third branch prediction information by decoding each
of the identified one or more predicted taken branches to determine a type of each of the one or more predicted taken branches, and determining a size of any immediate or displacement data for each of the one or more predicted taken branches; and
combine said second branch prediction information with said third branch prediction information; store said combined second and third branch prediction information as said first branch prediction information in a first entry of said first level cache,
wherein said first entry corresponds to said fetch address.
10. The mechanism of claim 9, wherein said circuitry is further configured to: determine if said first entry of said first level cache is available; evict contents of said first entry in response to detecting said first entry is not available; and store a subset of said contents in said second level cache responsive to said eviction.
11. The mechanism of claim 9, wherein generating said third branch prediction information further comprises determining whether each of the one or more predicted taken branches ends on an even addressed byte or an odd addressed byte.
12. The mechanism of claim 11, wherein said branch instruction is fetched from said second level cache.
13. The mechanism of claim 9, wherein said subset comprises a dynamic bit.
14. The mechanism of claim 13, wherein said subset further comprises a branch marker bit.
15. The mechanism of claim 14, wherein said branch prediction further comprises an end adjustment bit.
16. The mechanism of claim 9, wherein said second level cache and said first level cache do not store duplicate information.
17. A computer system comprising: an interconnect; a memory coupled to said interconnect; a second level cache configured to store branch prediction information; a processor including a first level cache, wherein said processor is configured
to: detect said first level cache does not contain a first branch prediction information corresponding to a fetch address; determine whether said second level cache contains a second branch prediction information corresponding to said fetch address,
said second branch prediction information comprising a subset of said first branch prediction information; rebuild said first branch prediction information in response to determining said second level cache contains said second branch prediction
information, wherein in order to rebuild said first branch prediction information, said processor is configured to: receive said second branch prediction information; receive a group of instructions corresponding to the fetch address; utilize the
second branch prediction information to identify one or more predicted taken branches within the group of instructions; generate third branch prediction information by decoding each of the identified one or more predicted taken branches to determine a
type of each of the one or more predicted taken branches, and determining a size of any immediate or displacement data for each of the one or more predicted taken branches; and combine said second branch prediction information with said third branch
prediction information; store said combined second and third branch prediction information as said first branch prediction in a first entry of said first level cache, wherein said first entry corresponds to said first address.
18. The system of claim 17, wherein said processor is further configured to determine if said first entry of said first level cache is available; evict contents of said first entry in response to detecting said first entry is not available;
and store a subset of said contents in said second level cache responsive to said eviction.
19. The system of claim 17, wherein generating said third branch prediction information further comprises determining whether each of the one or more predicted taken branches ends on an even addressed byte or an odd addressed byte.
20. The system of claim 17, wherein said second level cache and said first level cache do not store duplicate information. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention is related to the field of superscalar microprocessors and, more particularly, to a method and mechanism for branch prediction.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of
time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according
to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to
refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises
fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
An important feature of a superscalar microprocessor (and a superpipelined microprocessor as well) is its branch prediction mechanism. The branch prediction mechanism indicates a predicted direction (taken or not-taken) for a branch instruction,
allowing subsequent instruction fetching to continue within the predicted instruction stream indicated by the branch prediction. A branch instruction is an instruction which causes subsequent instructions to be fetched from one of at least two
addresses: a sequential address identifying an instruction stream beginning with instructions which directly follow the branch instruction; and a target address identifying an instruction stream beginning at an arbitrary location in memory.
Unconditional branch instructions always branch to the target address, while conditional branch instructions may select either the sequential or the target address based on the outcome of a prior instruction. Instructions from the predicted instruction
stream may be speculatively executed prior to execution of the branch instruction, and in any case are placed into the instruction processing pipeline prior to execution of the branch instruction. If the predicted instruction stream is correct, then the
number of instructions executed per clock cycle is advantageously increased. However, if the predicted instruction stream is incorrect (i.e. one or more branch instructions are predicted incorrectly), then the instructions from the incorrectly predicted
instruction stream are discarded from the instruction processing pipeline and the number of instructions executed per clock cycle is decreased.
In order to be effective, the branch prediction mechanism must be highly accurate such that the predicted instruction stream is correct as often as possible. Frequently, a history of prior executions of a branch is used to form a more accurate
behavior for a particular branch. Such a branch prediction history typically requires maintaining data corresponding to the branch instruction in a storage. In the event the branch prediction data is evicted from the storage, or otherwise lost, it may
be necessary to recreate the execution history for the branch instruction at a later time. One solution to the above problem may be to increase the size of the branch prediction storage. However, increasing the size of branch prediction storage may
require a significant increase in gate area and the size of the branch prediction mechanism. Consequently, valuable data regarding the behavior of a branch may be lost and must be recreated. Consequently, a mechanism for improving branch prediction
capability is desired which does not require a significant increase in the gate count or size of the branch prediction mechanism.
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a microprocessor and method as described herein. In one embodiment, a processor is configured with a first level branch prediction cache which is configured to store branch prediction
information corresponding to a group of instructions. In addition, a second level branch prediction cache is utilized to store branch prediction information which is evicted from the first level cache. The second level branch prediction cache is
configured to store only a subset of the information which is evicted from the first level cache. Branch prediction information which is evicted from the first level cache and not stored in the second level cache is discarded. Upon a miss in the first
level cache, a determination is made as to whether the second level cache contains branch prediction information corresponding to the miss. If corresponding branch prediction information is detected in the second level cache, the detected branch
prediction information is fetched from the second level cache and is used to rebuild complete branch prediction information which may then be used in making a prediction. In one embodiment, decode circuitry may be included to perform a decode of
instructions fetched from the second level cache. This decode of instructions from the second level cache may be utilized in the rebuilding of the complete branch prediction information. Advantageously, a reduced size cache may be used to store branch
prediction information evicted from the first level branch prediction cache. Further, when a miss occurs in the first level cache, a complete branch prediction may be quickly rebuilt from the data stored in the second level cache.
BRIEF
DESCRIPTION OF THE DRAWINGS
Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:
FIG. 1 is a block diagram of one embodiment of a microprocessor.
FIG. 2 is a block diagram showing one embodiment of a branch prediction unit.
FIG. 3 is a flowchart illustrating a method for branch prediction.
FIG. 4 is a flowchart illustrating a method for utilizing a level two branch prediction cache.
FIG. 5 is a diagram showing a contiguous group of program instructions and corresponding branch prediction entry.
FIG. 6 is a diagram illustrating a relationship between branch marker bits and address offsets.
FIG. 7 is a diagram illustrating a relationship between program instructions, branch marker bits and address offsets.
FIG. 8 is a diagram showing one embodiment of prediction logic.
FIG. 9 is a diagram showing a dynamic logic corollary to the prediction logic shown in FIG. 8.
FIG. 10 is a diagram illustrating branch marker bit utilization.
FIG. 11 is a diagram illustrating branch marker bit utilization.
FIG. 12 is a diagram illustrating branch marker bit utilization.
FIG. 13 is a diagram illustrating branch marker bit utilization.
FIG. 14 is a diagram illustrating a missed prediction.
FIG. 15 is a diagram illustrating branch target information.
FIG. 16 is a block diagram showing one embodiment of a prediction logic unit.
FIG. 17 is a diagram illustrating one embodiment of a target select circuit.
FIG. 18 is a block diagram illustrating one embodiment of a select signal circuit.
FIG. 19 is a diagram illustrating one embodiment of a branch address calculation unit.
FIG. 20 is a block diagram illustrating a relationship between a level one branch prediction storage and a level two branch prediction storage.
FIG. 21 is a block diagram illustrating a relationship between a level one branch prediction storage and a level two branch prediction storage.
FIG. 22 is a block diagram illustrating a relationship between a level one branch prediction storage and a level two branch prediction storage.
FIG. 23 is a block diagram of one embodiment of a branch prediction unit.
FIG. 24 is a diagram of one embodiment of a branch address calculation unit.
FIG. 25 is a block diagram of one embodiment of a missed prediction circuit.
FIG. 26 is a block diagram of a computer system.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the
drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the
present invention as defined by the appended claims.
DETAILED DESCRIPTION
Processor Overview
Turning now to FIG. 1, a block diagram of one embodiment of a processor 10 is shown. Other embodiments are possible and contemplated. As shown in FIG. 1, processor 10 includes a prefetch unit 12, a branch prediction unit 14, an instruction
cache 16, an instruction alignment unit 18, a plurality of decode units 20A 20C, a plurality of reservation stations 22A 22C, a plurality of functional units 24A 24C, a load/store unit 26, a data cache 28, a register file 30, a reorder buffer 32, an MROM
unit 34, and a bus interface unit 37. Elements referred to herein with a particular reference number followed by a letter will be collectively referred to by the reference number alone. For example, decode units 20A 20C will be collectively referred to
as decode units 20.
Prefetch unit 12 is coupled to receive instructions from bus interface unit 37, and is further coupled to instruction cache 16 and branch prediction unit 14. Similarly, branch prediction unit 14 is coupled to instruction cache 16. Still
further, branch prediction unit 14 is coupled to decode units 20 and functional units 24. Instruction cache 16 is further coupled to MROM unit 34 and instruction alignment unit 18. Instruction alignment unit 18 is in turn coupled to decode units 20.
Each decode unit 20A 20C is coupled to load/store unit 26 and to respective reservation stations 22A 22C. Reservation stations 22A 22C are further coupled to respective functional units 24A 24C. Additionally, decode units 20 and reservation stations 22
are coupled to register file 30 and reorder buffer 32. Functional units 24 are coupled to load/store unit 26, register file 30, and reorder buffer 32 as well. Data cache 28 is coupled to load/store unit 26 and to bus interface unit 37. Bus interface
unit 37 is further coupled to an L2 interface to an L2 cache and a bus. Finally, MROM unit 34 is coupled to decode units 20.
Instruction cache 16 is a high speed cache memory provided to store instructions. Instructions are fetched from instruction cache 16 and dispatched to decode units 20. In one embodiment, instruction cache 16 is configured to store up to 64
kilobytes of instructions in a 2 way set associative structure having 64 byte lines (a byte comprises 8 binary bits). Alternatively, any other desired configuration and size may be employed. For example, it is noted that instruction cache 16 may be
implemented as a fully associative, set associative, or direct mapped configuration.
Instructions are stored into instruction cache 16 by prefetch unit 12. Instructions may be prefetched prior to the request thereof from instruction cache 16 in accordance with a prefetch scheme. A variety of prefetch schemes may be employed by
prefetch unit 12. Instructions fetched from the instruction cache are passed to the scanner/aligner. When instructions are fetched for the first time, they are not marked by predecode tags. In this case, the scanner/aligner passes 4 bytes per clock to
the decode unit 20. As decode unit 20 dispatches unpredecoded instructions to the core, the decode unit may generate predecode data corresponding to the instructions which indicates the instruction boundaries.
One encoding of the predecode tags for an embodiment of processor 10 employing a variable byte length instruction set will next be described. A variable byte length instruction set is an instruction set in which different instructions may occupy
differing numbers of bytes. An exemplary variable byte length instruction set employed by one embodiment of processor 10 is the x86 instruction set.
In the exemplary encoding, if a given byte is the last byte of an instruction, the end bit for that byte is set. Instructions which may be directly decoded by decode units 20 are referred to as "fast path" instructions. The remaining x86
instructions are referred to as MROM instructions, according to one embodiment. For example, a fast path instruction including two prefix bytes, a Mod R/M byte, and an immediate byte would have end bits as follows:
End bits 00001
MROM instructions are instructions which are determined to be too complex for decode by decode units 20. MROM instructions are executed by invoking MROM unit 34. More specifically, when an MROM instruction is encountered, MROM unit 34 parses
and issues the instruction into a subset of defined fast path instructions to effectuate the desired operation. MROM unit 34 dispatches the subset of fast path instructions to decode units 20.
Processor 10 employs branch prediction in order to speculatively fetch instructions subsequent to conditional branch instructions. Branch prediction unit 14 is included to perform branch prediction operations. In one embodiment, branch
prediction unit 14 employs a branch target buffer which caches up to three branch target addresses and corresponding taken/not taken predictions per 16 byte portion of a cache line in instruction cache 16. The branch target buffer may, for example,
comprise 2048 entries or any other suitable number of entries. Prefetch unit 12 determines initial branch targets when a particular line is predecoded. Subsequent updates to the branch targets corresponding to a cache line may occur due to the
execution of instructions within the cache line. Instruction cache 16 provides an indication of the instruction address being fetched, so that branch prediction unit 14 may determine which branch target addresses to select for forming a branch
prediction. Decode units 20 and functional units 24 provide update information to branch prediction unit 14. Decode units 20 detect branch instructions which were not predicted by branch prediction unit 14. Functional units 24 execute the branch
instructions and determine if the predicted branch direction is incorrect. The branch direction may be "taken", in which subsequent instructions are fetched from the target address of the branch instruction. Conversely, the branch direction may be "not
taken", in which subsequent instructions are fetched from memory locations consecutive to the branch instruction. When a mispredicted branch instruction is detected, instructions subsequent to the mispredicted branch are discarded from the various units
of processor 10. In an alternative configuration, branch prediction unit 14 may be coupled to reorder buffer 32 instead of decode units 20 and functional units 24, and may receive branch misprediction information from reorder buffer 32. A variety of
suitable branch prediction algorithms may be employed by branch prediction unit 14.
Instructions fetched from instruction cache 16 are conveyed to instruction alignment unit 18. As instructions are fetched from instruction cache 16, the corresponding predecode data is scanned to provide information to instruction alignment unit
18 (and to MROM unit 34) regarding the instructions being fetched. Instruction alignment unit 18 scans the predecode data to align an instruction to each of decode units 20. In one embodiment, instruction alignment unit 18 aligns instructions from two
sets of sixteen instruction bytes to decode units 20. Decode unit 20A receives an instruction which is prior to instructions concurrently received by decode units 20B and 20C (in program order). Similarly, decode unit 20B receives an instruction which
is prior to the instruction concurrently received by decode unit 20C in program order.
Decode units 20 are configured to decode instructions received from instruction alignment unit 18. Register operand information is detected and routed to register file 30 and reorder buffer 32. Additionally, if the instructions require one or
more memory operations to be performed, decode units 20 dispatch the memory operations to load/store unit 26. Each instruction is decoded into a set of control values for functional units 24, and these control values are dispatched to reservation
stations 22 along with operand address information and displacement or immediate data which may be included with the instruction. In one particular embodiment, each instruction is decoded into up to two operations which may be separately executed by
functional units 24A 24C.
Processor 10 supports out of order execution, and thus employs reorder buffer 32 to keep track of the original program sequence for register read and write operations, to implement register renaming, to allow for speculative instruction execution
and branch misprediction recovery, and to facilitate precise exceptions. A temporary storage location within reorder buffer 32 is reserved upon decode of an instruction that involves the update of a register to thereby store speculative register states. If a branch prediction is incorrect, the results of speculatively-executed instructions along the mispredicted path can be invalidated in the buffer before they are written to register file 30. Similarly, if a particular instruction causes an exception,
instructions subsequent to the particular instruction may be discarded. In this manner, exceptions are "precise" (i.e. instructions subsequent to the particular instruction causing the exception are not completed prior to the exception). It is noted
that a particular instruction is speculatively executed if it is executed prior to instructions which precede the particular instruction in program order. Preceding instructions may be a branch instruction or an exception-causing instruction, in which
case the speculative results may be discarded by reorder buffer 32.
The instruction control values and immediate or displacement data provided at the outputs of decode units 20 are routed directly to respective reservation stations 22. In one embodiment, each reservation station 22 is capable of holding
instruction information (i.e., instruction control values as well as operand values, operand tags and/or immediate data) for up to five pending instructions awaiting issue to the corresponding functional unit. It is noted that for the embodiment of FIG.
1, each reservation station 22 is associated with a dedicated functional unit 24. Accordingly, three dedicated "issue positions" are formed by reservation stations 22 and functional units 24. In other words, issue position 0 is formed by reservation
station 22A and functional unit 24A. Instructions aligned and dispatched to reservation station 22A are executed by functional unit 24A. Similarly, issue position 1 is formed by reservation station 22B and functional unit 24B; and issue position 2 is
formed by reservation station 22C and functional unit 24C.
Upon decode of a particular instruction, if a required operand is a register location, register address information is routed to reorder buffer 32 and register file 30 simultaneously. In one embodiment, reorder buffer 32 includes a future file
which receives operand requests from decode units as well. Those of skill in the art will appreciate that the x86 register file includes eight 32 bit real registers (i.e., typically referred to as EAX, EBX, ECX, EDX, EBP, ESI, EDI and ESP). In
embodiments of processor 10 which employ the x86 processor architecture, register file 30 comprises storage locations for each of the 32 bit real registers. Additional storage locations may be included within register file 30 for use by MROM unit 34.
Reorder buffer 32 contains temporary storage locations for results which change the contents of these registers to thereby allow out of order execution. A temporary storage location of reorder buffer 32 is reserved for each instruction which, upon
decode, is determined to modify the contents of one of the real registers. Therefore, at various points during execution of a particular program, reorder buffer 32 may have one or more locations which contain the speculatively executed contents of a
given register. If following decode of a given instruction it is determined that reorder buffer 32 has a previous location or locations assigned to a register used as an operand in the given instruction, the reorder buffer 32 forwards to the
corresponding reservation station either: 1) the value in the most recently assigned location, or 2) a tag for the most recently assigned location if the value has not yet been produced by the functional unit that will eventually execute the previous
instruction. If reorder buffer 32 has a location reserved for a given register, the operand value (or reorder buffer tag) is provided from reorder buffer 32 rather than from register file 30. If there is no location reserved for a required register in
reorder buffer 32, the value is taken directly from register file 30. If the operand corresponds to a memory location, the operand value is provided to the reservation station through load/store unit 26.
In one particular embodiment, reorder buffer 32 is configured to store and manipulate concurrently decoded instructions as a unit. This configuration will be referred to herein as "line-oriented". By manipulating several instructions together,
the hardware employed within reorder buffer 32 may be simplified. For example, a line-oriented reorder buffer included in the present embodiment allocates storage sufficient for instruction information pertaining to three instructions (one from each
decode unit 20) whenever one or more instructions are issued by decode units 20. By contrast, a variable amount of storage is allocated in conventional reorder buffers, dependent upon the number of instructions actually dispatched. A comparatively
larger number of logic gates may be required to allocate the variable amount of storage. When each of the concurrently decoded instructions has executed, the instruction results are stored into register file 30 simultaneously. The storage is then free
for allocation to another set of concurrently decoded instructions. Additionally, the amount of control logic circuitry employed per instruction is reduced because the control logic is amortized over several concurrently decoded instructions. A reorder
buffer tag identifying a particular instruction may be divided into two fields: a line tag and an offset tag. The line tag identifies the set of concurrently decoded instructions including the particular instruction, and the offset tag identifies which
instruction within the set corresponds to the particular instruction. It is noted that storing instruction results into register file 30 and freeing the corresponding storage is referred to as "retiring" the instructions. It is further noted that any
reorder buffer configuration may be employed in various embodiments of processor 10, including using a future file to store the speculative state of register file 30.
As noted earlier, reservation stations 22 store instructions until the instructions are executed by the corresponding functional unit 24. An instruction is selected for execution if: (i) the operands of the instruction have been provided; and
(ii) the operands have not yet been provided for instructions which are within the same reservation station 22A 22C and which are prior to the instruction in program order. It is noted that when an instruction is executed by one of the functional units
24, the result of that instruction is passed directly to any reservation stations 22 that are waiting for that result at the same time the result is passed to update reorder buffer 32 (this technique is commonly referred to as "result forwarding"). An
instruction may be selected for execution and passed to a functional unit 24A 24C during the clock cycle that the associated result is forwarded. Reservation stations 22 route the forwarded result to the functional unit 24 in this case. In embodiments
in which instructions may be decoded into multiple operations to be executed by functional units 24, the operations may be scheduled separately from each other.
In one embodiment, each of the functional units 24 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. The operations are performed in
response to the control values decoded for a particular instruction by decode units 20. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations. The floating point unit may be operated as a
coprocessor, receiving instructions from MROM unit 34 or reorder buffer 32 and subsequently communicating with reorder buffer 32 to complete the instructions. Additionally, functional units 24 may be configured to perform address generation for load and
store memory operations performed by load/store unit 26. In one particular embodiment, each functional unit 24 may comprise an address generation unit for generating addresses and an execute unit for performing the remaining functions. The two units
may operate independently upon different instructions or operations during a clock cycle.
Each of the functional units 24 also provides information regarding the execution of conditional branch instructions to the branch prediction unit 14. If a branch prediction was incorrect, branch prediction unit 14 flushes instructions
subsequent to the mispredicted branch that have entered the instruction processing pipeline, and causes fetch of the required instructions from instruction cache 16 or main memory. It is noted that in such situations, results of instructions in the
original program sequence which occur after the mispredicted branch instruction are discarded, including those which were speculatively executed and temporarily stored in load/store unit 26 and reorder buffer 32. It is further noted that branch
execution results may be provided by functional units 24 to reorder buffer 32, which may indicate branch mispredictions to functional units 24.
Results produced by functional units 24 are sent to reorder buffer 32 if a register value is being updated, and to load/store unit 26 if the contents of a memory location are changed. If the result is to be stored in a register, reorder buffer
32 stores the result in the location reserved for the value of the register when the instruction was decoded. A plurality of result buses 38 are included for forwarding of results from functional units 24 and load/store unit 26. Result buses 38 convey
the result generated, as well as the reorder buffer tag identifying the instruction being executed.
Load/store unit 26 provides an interface between functional units 24 and data cache 28. In one embodiment, load/store unit 26 is configured with two load/store buffers. The fi | | |