|
Description  |
|
|
Related applications filed concurrently herewith are entitled "Two-Level
Branch Prediction Cache", U.S. patent application Ser. No. 07/485,306, now
abandoned, "Integrated Single-Structure Branch Prediction Cache", U.S.
patent application Ser. No. 07/485,307, now U.S. Pat. No. 5,093,778 and
"Integrated Instruction Queue and Branch Target Cache"U.S. patent
application Ser. No. 485,304.
BACKGROUND OF THE INVENTION
The present invention relates to branch prediction caches, and in
particular to methods and apparatus for handling the modification of an
instruction in the branch prediction cache or already provided to an
instruction pipeline by an instruction farther along the pipeline.
As computer designers have designed increasingly higher performance
implementations of various computer architectures, a number of classes of
techniques have been developed to achieve these increases in performance.
Broadly speaking, many of these techniques can be categorized as forms of
pipelining, caching, and hardware parallelism. Some of these techniques
are generally applicable to and effective in the implementation of most
types of computer architectures, while others are most appropriate in the
context of speeding up the implementations of Complex Instruction Set
Computers (CISC).
The purpose of creating a high-performance implementation of a CISC
architecture system is to achieve the appearance of each instruction
having a processing time of one or a few processor/clock cycles. However,
such timing typically is only approximated. One of the principal reasons
for not achieving a one-cycle processing time is the existence of various
types of dependencies between neighboring instructions. The dependencies
may result in the occurrence of processing delays.
One critical area in the achievement of a one-cycle processing time is in
the handling of control dependencies, i.e. branch-type instructions. In
the context of a CISC architecture implementation these instructions tend
to be difficult insofar as being able to quickly calculate or otherwise
determine the target address of a branch, to quickly resolve the proper
path of subsequent instruction processing in the case of conditional
branches, and in all cases to then quickly restart the fetching of
instructions at the new address. Pipeline processing delays result when
these operations cannot be performed quickly.
To minimize the actual impact of these delays on processing throughput,
various types of prediction and caching techniques are available. The
purpose of a designer in applying these various types of techniques is to
always accurately predict the information to be produced by the above
operations, i.e. branch target address, conditional branch direction, and
the first one or more instructions at the branch target address. The
percentage success rates of these prediction techniques then reduce the
effective delay penalties incurred by the above three operations in
corresponding amounts.
Generally speaking, existing techniques are based on the retention or
caching of information from the prior processing of branch instructions.
When a branch instruction is encountered again, and information from
previous processing of this instruction is still to be found in the
prediction cache structure, the cached information is then used to make an
"intelligent" dynamic prediction for the current occurrence of the branch.
When no such information is to be found in the prediction cache structure,
either a "dumber" static prediction must be made, or normal processing,
with the attendant possibility of incurring delays, must be performed.
Existing high-performance CISC designs use forms of cache structures to
hold various combinations of information with the intention of predicting
one or more of the three types of information mentioned above. In an
aggressive all-encompassing design each entry holds: a record of the
actual target address associated with the last occurrence of the branch; a
copy of the first several instructions at this target address; and, in the
case of conditional branches, a history record of the direction taken by
each of the past couple of branch occurrences.
In parallel with the fetching and/or decoding of a branch instruction, the
instruction is also looked up in the branch prediction cache. Generally,
this look-up is based on the fetch address of the branch or on a closely
related address. As the instruction is being decoded, the branch history
information is used to predict the direction of conditional branches. The
history information determines whether subsequent instruction processing
should continue with the instructions sequentially following the branch,
or with the sequence of instructions starting at the target address.
Whether the branch is conditional or unconditional, if processing is to
continue with the target instruction stream, then the processing of
successive instructions proceed without delay using the branch target
instructions from the cache. At the same time, fetching of further
non-cached instructions is immediately initiated using the predicted
branch target address, plus an appropriate increment.
While this branch prediction design offers the possibility of fast,
efficient processing of the predicted branches, the possibility of
processing a branch instruction based on erroneous information is
introduced. In general, handling mispredicted aspects of processing a
branch must be an integral part of the overall central processing unit
(CPU) design.
There are subtler issues stemming from the nature of the implemented
architecture. In the case of many CISC architectures, one part of a
program may modify other parts. These modified program parts are then
executed. The result is that the modified image of these instructions,
instead of the original image, is then executed.
For some CISC architectures which allow programs to modify itself, this
type of programming practice has become an established practice within a
significant portion of the existing software base. Consequently, to
maintain backward software compatibility, new CPU implementations often
must not only implement the direct semantics of the architecture's
instruction set, but also maintain the appearances of this expected
secondary semantic behavior. In the case of higher performance
implementations this can become a significant, and potentially difficult,
requirement to satisfy.
When cache-based dynamic branch prediction in incorporated into the design
difficulties with high-performance arise. It is possible that the target
address of a branch instruction, which might otherwise be a constant value
for that instruction instance in memory, may be modified by what is
normally an instruction that writes data to memory. Or, for that matter,
the branch instruction opcode may be changed to that of a different type
of branch instruction or to that of a non-branch type of instruction.
Further, even if none of the bytes of a branch instruction itself are
modified, one of the target instructions may be modified. To the extent
that these instructions are fetched from main memory after any such
modification has taken affect, there is no problem. But if, for example, a
copy of the modified instruction is held in a branch prediction cache such
as described above, and is fetched from there instead of main memory, then
a consistency problem exists.
For both branch and target instruction modifications, the design of a
branch prediction cache and associated control circuitry must maintain a
sufficient degree of consistency to ensure proper processing of
instructions. The maintenance of consistency must encompass not only
conventional data/instruction cache consistency, but also consistency with
respect to memory store instructions modifying other instructions which
are executed shortly thereafter.
The consistency problem is essentially similar to that encountered with
more conventional data/instruction cache structures used in
high-performance CPU designs. Data writes by the CPU must be appropriately
reflected in the state and/or contents of any affected cache entries. But
the scope of the problem is more general than this. When other devices
within the system, such as Direct Memory Address (DMA) devices or CPU's,
modify main memory, the issue of cache/main memory consistency again
arises. For a branch prediction cache these other devices are additional
sources or causes of inconsistencies which must also be covered.
In extreme "store-into-instruction-stream" cases, such as a modifying
instruction immediately followed by a branch and then a modified target
instruction, this can be difficult. Particularly for highly pipelined,
high-performance CPU designs, implementing this can prove to be expensive
in terms of additional hardware circuitry, complex to design, and/or
forcing compromise in the overall performance attainable by the design.
Pipelining, particularly the deep pipelining that is common in
high-performance implementations of CISC architectures, results in large
instruction processing latencies and high degrees of overlap between the
processing of successive instructions. Access of a branch prediction cache
and usage of the resultant prediction information and target instructions
generally occurs early in such pipelines. Execution of memory writes by
instructions storing to memory, on the other hand, generally takes place
late in such pipelines.
Consequently, actions such as fetching target instructions from a branch
prediction cache (BPC) can easily occur before architecturally preceding
memory writes modifying such target instructions have actually been
performed. Such actions may even occur before the store addresses have
been generated, this being the earliest point at which potential
consistency problems could be checked for and detected. The result is that
consistency must be maintained with respect to not only the explicit
contents of the branch prediction cache, but also with respect to the
implicit contents associated with instructions currently being processed.
Insofar as these consistency issues apply to target instructions fetched
from the cache, they also apply to instructions temporarily stored in and
then taken from instruction pre-fetch queues. As a transient form of
cache, a pre-fetch queue can lead to similar inconsistencies.
Overall, the need to maintain branch prediction cache and more general
fetched instruction consistency with respect to cases of
"store-into-instruction-stream", as well as with respect to more generic
modifications of main memory blocks, is a difficult problem.
SUMMARY OF THE INVENTION
The present invention provides for the updating of both the instructions in
a branch prediction cache and instructions recently provided to an
instruction pipeline from the cache when an instruction being executed
attempts to change such instructions ("Store-Into-Instruction-Stream").
The branch prediction cache (BPC) includes a tag identifying the address
of instructions causing a branch, a record of the target address which was
branched to on the last occurrence of each branch instruction, and a copy
of the first several instructions beginning at this target address. A
separate instruction cache is provided for normal execution of
instructions. and all of the instructions written into the branch
prediction cache from the system bus must also be stored in the
instruction cache. The instruction cache monitors the system bus for
attempts to write to the address of an instruction contained in the
instruction cache. Upon such a detection, that entry in the instruction
cache is invalidated, and the corresponding entry in the branch prediction
cache is invalidated. A subsequent attempt to use an instruction in the
branch prediction cache which has been invalidated will detect that it is
not valid, and will instead go to main memory to fetch the instruction,
where it has been modified.
The present invention also provides for invalidating an instruction already
provided to an instruction pipeline when that instruction is attempted to
be changed by an instruction currently being executed. This is done by
maintaining a copy of the target instruction address of the branch after
the target instructions are provided to the instruction pipeline. These
instructions (if still present) are then invalidated upon a detection of
an attempt to write to them in main memory, and an error signal is
generated. A control circuit for the instruction pipeline receives this
error signal, and flushes the instruction pipeline of all instructions and
results from the branch target instruction onward, and restarts the
instruction pipeline at the branch target instruction. This stored group
of instructions is referred to as a pre-fetch queue.
The branch target cache contains the basic hardware necessary for
implementing the pre-fetch queue. One or more 24 byte portions of the
branch target cache are simply designated as being used for the pre-fetch
queue. These instructions can then be invalidated just like any other
instruction in the branch target cache.
For a fuller understanding of the nature and advantages of the invention,
reference should be made to the ensuing detailed description taken in
conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of the computer system incorporating the present
invention;
FIG. 2 is a block diagram of the decoder block of FIG. 1;
FIG. 3 is a block diagram of the front end block of FIG. 2, showing the
branch prediction cache of the present invention;
FIG. 4 is a block diagram of the operation or the instruction cache of the
present invention;
FIg. 5 is a diagram showing the contents of the branch prediction cache the
of the present invention;
FIG. 6 is a diagram of the consistency check comparisons done in the
present invention;
FIG. 7 is a block diagram of the branch prediction cache according to the
present invention;
FIG. 8 is a block diagram of the tag RAM of FIG. 7;
FIG. 9 is a diagram of the target address RAM of FIG. 7;
FIG. 10 is a block diagram of the branch target instruction sequence RAM of
FIG. 7;
FIG. 11 is a block diagram of the target instruction valid and branch
history memory of FIG. 7;
FIG. 12 is a diagram of the address selection logic of FIG. 7;
FIG. 13 is a diagram illustrating the pre-fetch queue in the branch target
instruction sequence RAM of the present invention;
FIG. 14 is a flowchart of the operation of the control circuit for
monitoring the pre-fetch queue for store-into instruction stream errors;
FIG. 15 is a diagram of the store-into-stream detection logic; and
FIG. 16 is a diagram of the multiple instruction streams resulting from
multiple branches.
DESCRIPTION OF THE PREFERRED EMBODIMENT
The following System Overview and Pipeline Control System Overview describe
a pipeline system which can be flushed and restarted when a branch error
is detected several cycles after the branch is taken. This is a necessary
element of aspects of the invention set forth in Maintaining Branch
Prediction Cache Consistency.
System Overview
FIG. 1 is a block diagram of a CPU 10 incorporating the present invention.
The CPU is designed to execute an instruction set compatible with that of
the Intel 80386, as described in the Intel 80386 Programmer's Reference
Manual published by Intel Corporation, Santa Clara, California, 1986. Each
block in the diagram corresponds generally to a separate integrated
circuit chip or group of chips in a current embodiment.
An Instruction Decoder (DEC) 12 performs instruction fetch, instruction
decode, and pipeline control. DEC 12 optionally interleaves instruction
pre-fetch of up to three simultaneous instruction streams. DEC 12 contains
a fully associative Branch Prediction Cache (BPC) according to the present
invention. The BPC is an integrated structure which contains dynamic
branch history data, a physical branch target address, and a branch target
buffer for each cache entry. As branch instructions are decoded, the BPC
is consulted for information about that branch. Independent of the
direction predicted, branches are executed in a single cycle and do not
cause pipeline bubbles.
On each cycle, an instruction is selected from one of the three instruction
buffers or a branch target buffer in the BPC. The instruction is decoded,
assembled into an internal 96-bit decoded instruction word, referred to as
a pseudo-op (p-op), and dispatched to the various functional units.
Instruction decode generally proceeds at a single cycle rate. Each p-op
issued by DEC 12 is given a tag which uniquely identifies each instruction
currently outstanding in the machine. Tags are issued in increasing order,
allowing easy determination of relative age of any two outstanding tags.
Bus transactions between chips include the tag of the originating
instruction. Functional units pair up instructions, addresses, and
operands with these tags.
DEC 12 is also responsible for tracking the status of outstanding
instructions, pipeline control, and for invoking exception processing when
needed.
An address Preparation Unit (AP) 15 calculates effective addresses,
performs segment relocation, and implements a demand paged memory
management system. It contains a translation lookaside buffer (TLB).
An Integer Execution Unit (IEU) 17 performs single cycle execution of most
integer instructions. It contains an 8.times.32 multiplier and accumulator
array, as well as microcode for multiply and divide instructions. The
pipeline control architecture allows the IEU to perform parallel and/or
out-of-order execution of integer instructions.
A Numerics Processor (NP) 20 may optionally be included in the CPU. It is a
high performance implementation of the IEEE floating point standard. The
NP is integrated into the pipeline and does not incur any special overhead
for the transfer of instructions and operands. Integer (IEU) and floating
point (NP) instructions execute concurrently.
A Memory and Cache Controller (MCC) 25 is responsible for controlling the
instruction and data caches and implements the cache coherency protocol.
The MCC controls the interface to the System Bus, supporting high speed
single and block mode transfers between cache and memory. The MCC also
contains write reservation tables for integer, floating point, and system
writes, and includes read after write short-circuit paths.
An Instruction Tag (ITAG) chip 27 contains the tag RAM for an Instruction
Cache (ICache) 30. The tag RAM contains the address tag, a "Valid" bit,
and an "Attention" bit for each line in the ICache. The Attention bit
indicates that the DEC chip may also have data from this line cached in
the BPC.
A Data Tag (DTAG) chip 32 contains the tag RAM for a Data Cache (DCache)
35. The tag RAM contains the address tag and line state bits for each line
in the DCache. The possible line states are Absent, Shared Read, Owned
Clean, and Owned Dirty, supporting a write back multiprocessor cache
coherency protocol (modified write once). The tag RAM is dual ported to
allow both CPU and bus snooping cache look-ups in a single cycle.
Each functional unit chip is packaged in a custom ceramic pin grid array
(PGA) which contains power and ground planes and associated decoupling
capacitors Roughly 25% of the pins are dedicated to power and ground. For
0.8 micron to 1.2 micron processes, I/0 delays are comparable to on-chip
critical paths. Inter-chip I/0 is incorporated into the pipeline, and thus
does not add to the machine cycle time. ICache 30 and DCache 35 use
conventional static RAM's.
Communications between the various functional units are carried out over a
number of internal buses. (These include: a 64-bit IFETCH.sub.-- DATA bus
for instruction fetches; a 96-bit P-OP bus for communicating issued p-ops
to the ApU, the lEU, and the Np; a 52-bit pADR bus for communicating
physical addresses; a 64-bit (32 bits in each direction) data cache bus
DIOBUS for data cache transfers; a 32-bit bus DXBUS for inter-chip
exchange; a 64-bit bus for cache/memory updates; and a number of
termination buses (AP.sub.-- TERM, IEU.sub.-- TERM, NP.sub.-- TERM, and
MCC.sub.-- TERM) from the functional units to DEC 12. Some of these buses
are full width and some (e.g. P-OP bus) are half-width (time multiplexed).
Interactions between functional units are generally limited to well
defined transactions on the internal processor buses.
Pipeline Control System Overview
Pipeline control of the processor is distributed across the functional
units mentioned above. No centralized scheduling or score boarding of the
pipeline is performed. DEC 12 does observe certain overall resource
constraints in the architecture and will occasionally hold off on issuing
a decoded instruction which would violate resource limitations. Each
functional unit is responsible for scheduling its own internal operations.
Interlock checking is performed at a local level.
In a deeply pipelined machine, exception detection at various stages in the
pipeline creates significant control difficulties. Each stage must be
careful to hold off modification of state while any other stage may yet
detect an exception on a previous instruction. Special purpose control
logic is common, and careful pipeline simulations must be performed.
The processor deals with this complexity | | |