|
Description  |
|
|
CROSS REFERENCE TO RELATED APPLICATIONS
This application incorporates by reference the disclosures of copending
application Ser. No. 291,888 entitled "Cache Bypass Apparatus" filed on
Dec. 29, 1988 concurrently now abandoned, herewith, copending and of
application Ser. No. 159,016, entitled "Store Queue For A Tightly Coupled
Multiple Processor Configuration With A Two-Level Cache Buffer Storage",
filed Feb. 22, 1988 now U.S. Pat. No. 5,023,776.
BACKGROUND OF THE INVENTION
The present invention pertains to computing systems, and more particularly
to a store buffer apparatus in an improved, high performance,
multiprocessor or uniprocessor computer system.
In the design and development of computer systems, increasing emphasis is
being placed on performance of such systems. The performance is very often
a function of the technology used in manufacturing the integrated circuit
chips which comprise the computer system. One such technology, new in the
development of computer systems, is Complementary Metal Oxide
Semiconductor (CMOS) technology. CMOS technology provides a greater degree
of reliability, serviceability, and availability than seen before in prior
computer systems, due mostly to a reduction in the physical number of
chips which comprise the computer system. Since a scarcity of input/output
pins on chips has been a problem with prior computer systems, a reduction
in the number of chips reduces the number of interconnections
(input/output pins) between chips In addition, performance may also be a
function of the number of processors which comprise the computer system.
In such processors, master storage facilities store data needed by the
central processor units (CPU). In a multiprocessor system, each processor
often included a small cache memory in addition to the master storage
facilities. The cache was used by a particular processor to store blocks
of instructions or data. If the data or instructions were needed by the
processor, a particular item of data or an instruction was retrieved from
the cache in lieu of the master storage facility, since the time required
to perform a fetch from cache is much smaller than the time required to
fetch from master storage. However, if a particular processor retrieved
data and/or instructions from cache and executed the instructions and/or
operated on such data yielding a set of results which required storage in
the master storage facility, it is necessary that the particular processor
store the set of results directly to the master storage, not via the
cache. However, if a first CPU of the multiprocessor system is using the
master storage facility, a second CPU cannot store the set of results in
the master storage facility until the first CPU is finished using the
facility.
In a uniprocessor mode, the master storage facility can also be tied up by
the channel facilities. This way the processor itself can not store the
result back into the master storage facility until the master storage is
freed up by the channel facilities Thus the processor can not execute the
next instruction until the channel facilities finish using the master
storage facility. Therefore, the execution of another instruction in the
second CPU is delayed until the master storage facility is freed up by the
first CPU or by the channel facilities.
Furthermore in a pipeline machine, the processor has the capability to
store results back into the master storage facility every machine cycle.
Usually the rate of storing data into the master storage facility is much
slower than the rate the processor generates the data to be stored away.
Hence the processor has to stop executing instructions while the master
storage facility is still busy storing the result for the current
instruction. In other words, the master storage facility can not keep up
with the processor.
SUMMARY OF THE INVENTION
It is an object of the present invention to introduce an improved computer
system which provides a high level of performance due to its use of a
store buffer for temporarily holding data or instructions from the current
CPU when the master storage facility is being utilized by the current CPU,
or by another CPU, or by the channel facilities, thereby freeing up the
current CPU for execution of other instructions.
In a multiprocessor system including at least two processors and a main
memory, each processor includes a cache and a separate and distinct store
buffer facility in accordance with the present invention. If one processor
is using the master storage facility when the other processor seeks to
store data in the master storage facility, in a multiprocessor system
without a store buffer, the other processor is forced into a hold
condition until such time that the one processor completes its use of
master storage. When the one processor completes its use of the memory,
the other processor may then begin the store operation. Valuable time is
lost during the hold condition. However, in accordance with the present
invention, a store buffer is incorporated into each processor in addition
to its own cache. Therefore, when the other processor attempts to store
data in master storage when master storage is being used by the one
processor (or by the other processor or by the channel facilities), the
other processor stores the data in its store buffer facility
simultaneously with storage of the data in its cache. This store buffer
acts to temporarily hold the data during the time when the master storage
is being tied up by other resources. When the master storage is free, the
data in the store buffer of the other processor is transmitted to master
storage. The store buffer of each processor contains eight entries. Each
entry includes an address portion comprising an effective address and an
absolute address, a status bit portion comprising a plurality of status
bits, and a data portion including data and a plurality of write flags.
The status bits are used to determine if the data stored in an entry of
the store buffer pertains to a current instruction or a previous
instruction and whether or not the data has already been stored in main
memory. Other status bits indicate whether the data in a particular entry
pertains to a sequential store or a non-sequential store and indicate how
many data entries pertain to one particular instruction. Whenever data is
stored in a store buffer of a processor, all the status bits are set
accordingly, that is, one bit is set to indicate if the data is associated
with a present or a previous instruction, one bit is set to indicate
whether the data has already been stored in master storage, and one bit is
set to indicate if the store in the store buffer is made during sequential
or non-sequential store mode. When conditions change within the processor,
the status bits associated with a particular data entry in a store buffer
are changed again. The store buffer is used during sequential and
non-sequential store mode, during fetch conflicts for data and
instructions, when a merge operation is required, when the contents of a
store buffer is dumped into master storage, and during instruction retry
situations.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a uniprocessor computer system;
FIG. 2 illustrates a triadic computer system;
FIG. 3 illustrates a detailed construction of the I/D Caches 9 (L1), the
I-unit, E-unit, and Control Store (C/S) illustrated in FIGS. 1 and 2;
FIG. 4 illustrates a further configuration of the I-unit, E-unit, and
Control Store (C/S) of FIGS. 1-3, further including a store buffer
apparatus of the present invention;
FIG. 5 illustrates the contents of the store buffer of FIG. 4;
FIG. 6 illustrates a timing chart of a fetch conflict;
FIG. 7 illustrates a timing chart depicting an L1 cache miss with store
buffer data merge; and
FIG. 8 illustrates the data portion of the store buffer.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
FIG. 1 illustrates a prior-art uniprocessor computer system which may serve
as an environment for the present invention.
In FIG. 1, the uniprocessor system comprises a main-storage or L3 memory 10
connected to a storage controller (SCL) 12. On one end, the storage
controller 12 is connected to integrated I/0 subsystem controls 14, the
controls 14 being connected to integrated adapters and single card
channels 16. On the other end, the storage controller 12 is connected via
storage bus 17 to a cache storage unit 18, which comprises an instruction
cache and a data cache, collectively termed the L1 cache. The I/D caches
18 are connected to an instruction unit (I-unit), Execution unit (E-unit),
and control store of a processor 20. The processor 20 is also connected to
the storage controller 12 through the address/command (AD/CMD or A/C) bus
30. The uniprocessor system of FIG. 1 also includes the multisystem
channel communication unit 24.
The L3 memory 10 comprises 2 intelligent memory cards. The cards are
intelligent due to the existence of certain specific features: error
checking and correction, extended error checking and correction (ECC)
refresh address registers and counters, and bit spare capability. The
interface to the L3 memory 10 is 8 bytes wide. Memory sizes are 8, 16, 32,
and 64 megabytes. The L3 memory is connected to a storage controller (SCL)
12.
The storage controller 12 comprises three conventional bus arbiters for
granting access to the L3 memory 10, to the I/0 subsystem controls 14, and
to the I/D caches 18. The storage controller further includes a directory
which is used to determine whether to invalidate the lines of the
instruction and data caches of the L1 cache 18 for data. If the desired
data is located in the L1 caches 18 but the data is obsolete, the storage
controller 12 invalidates the obsolete data in the L1 caches 18, thereby
allowing the I/0 subsystem controls 14 to update the data in the L3 memory
10. Thereafter, instruction and execution units 20 must obtain the updated
data from the L3 memory 10. The storage controller 12 further includes a
plurality of buffers for buffering data being input to L3 memory 10 from
the I/0 subsystem controls 14 and for buffering data being input to L3
memory 10 from instruction/execution units 20. The buffer associated with
the instruction/execution units 20 is a 256 byte line buffer which allows
the building of entries 8 bytes at a time for certain types of
instructions, such as sequential operations. This line buffer, when filled
by all the data from one instruction, will cause a block transfer of data
to L3 memory to occur. Therefore, memory operations are reduced from a
number of individual store operations to a much smaller number of line
transfers.
The instruction cache/data caches of unit 18 are each 16K byte caches. The
storage bus 17 to the storage controller 12 is 8 bytes wide; thus, an
in-page operation from the storage controller 12 takes 8 data transfer
cycles. The data cache 18 is a "store through" cache, which means that
data from the instruction/execution units 20 are stored in L3 memory and,
if the corresponding obsolete data is not present in the L1 caches 18, the
data is not brought into and stored in the L1 caches. To assist this
operation, a "store buffer" is present with the L1 data cache 18 which is
capable of buffering up to 8 store operations.
The integrated I/0 subsystem 14 is connected to the storage controller 12
via an 8-byte bus. The storage controller 12 comprises three 64-byte
buffers used to synchronize data coming from the integrated I/0 subsystem
14. That is, the instruction/execution unit 20 and the I/0 subsystem 14
operate on different clocks, the synchronization of the two clocks being
achieved by the three 64-byte buffer structure.
The multisystem channel communication unit 24 is a 4-port channel to
channel adapter, packaged externally to the system.
The uniprocessor system of FIG. 1 operates as follows.
Normally, instructions are resident in the instruction cache (L1 cache) 18,
waiting to be executed. The instruction/execution unit 20 searches a
directory within the L1 cache 18 to determine if the next instruction is
stored therein. If the instruction is not stored in the L1 cache 18, the
instruction/execution unit 20 will generate a storage request to the
storage controller 12. Both the absolute address of the instruction and
the in-page request will be provided to the storage controller 12 through
the AD/CMD bus 30. The storage controller 12 will arbitrate (according to
a conventional algorithm) for access to the bus connected to the L3 memory
10. Eventually, the in-page request and the absolute address from the
instruction/execution unit 20 will be passed to the L3 memory 10, the
request comprising a command indicating that an instruction in L3 memory
is to be fetched for transfer to the instruction/execution unit 20. The L3
memory will latch the request and the absolute address, decode the
request, and select the location in the memory card wherein the
instruction is stored. After a few cycles of delay, the instruction will
be delivered to the storage controller 12 from the L3 memory in 8-byte
segments. The instruction is then transmitted from the storage controller
12 through storage bus 17 to the instruction cache (L1 cache) 18, wherein
it is temporarily stored. The instruction is then transmitted from the
instruction cache 18 to the instruction register within the
instruction/execution unit 20. The instruction is decoded via a
conventional decoder within the instruction unit 20. Quite often, an
operand from memory 10 is needed in order to execute the instruction. The
instruction/execution unit 20 searches the directory in the data cache 18;
if the operand is not found in the directory of the data cache 18, another
storage access is issued by the instruction/execution unit 20 to access
the L3 memory 10 via AD/CMD bus 30, exactly as described above with
respect to the instruction cache. The operand is eventually sent back to
the instruction/execution 20 for execution while at the same time it is
stored back into the data cache 18. If the instruction/execution unit 20
decodes an I/0 instruction in the instruction cache 18, information is
stored in an auxiliary portion of L3 memory 10, sectioned off from
instruction execution. The instruction/execution unit 20 informs the
integrated I/0 subsystem 14 that such information is stored in L3 memory,
and the subsystem 14 processors accesses the L3 memory 10 to fetch the
information.
FIG. 2 shows a triadic (multiprocessor) system. Two L3 memories 10a,10b are
connected to a bus switching unit (BSU) 26, the BSU including an L2 cache
27. The BSU 26 is connected to the integrated I/0 subsystem 14, to shared
channel processors 28, and to three processor units: a first processor
unit including instruction/execution units/control store 20a, a second
processor including instruction/execution units/control store 20b, and a
third processor including instruction/execution units/control store 20c
Each processor unit has an associated instruction/data L1 cache 18a, 18b,
18c. The cache in the BSU 26 is termed the L2 cache 27, and the main
memory 10a/10b is termed the L3 memory.
The BSU 26 connects the three L1 caches 18a, 18b, 18c of the three
processors 20a, 20b, 20c, two L3 memory ports 10a,10b, two shared channel
processors 28, and an integrated I/0 subsystem 14. The BSU 26 comprises
circuits which decide the priority for requests to be handled, such as
storage requests from each of the three processors 20a-20c, requests from
the I/0 subsystem 14 and shared channel processors 28, from circuits which
operate the interfaces, and from circuits to access the L2 cache 27. The
L2 cache 27 is a "store in" cache: when operations occur that access the
L2 cache for data, and when that data is modified during the course of the
operation, the data which is resident in the L2 cache must also be
modified accordingly. (The only exception is that, if the operation
originates from the I/0 subsystem 14, and if the data is resident only in
L3 memory 10a,10b and not in L2 cache 27, the data is modified only in L3
memory, and not in L2 cache).
The interface between the BSU 26 and L3 memories 10a,10b comprises two
16-byte lines in lieu of the single 8-byte port in FIG. 1. However, the
memory 10 of FIG. 1 is identical to the memory cards 10a,10b of FIG. 2.
The two memory cards 10a,10b of FIG. 2 are accessed in parallel.
The shared-channel processor 28 is connected to the BSU 26 via two 8-byte
ports. The shared channel processor 28 is operated at a frequency which is
independent of the BSU 26; the clocks within the BSU are synchronized with
the clocks in the shared channel processor 28 in a manner which is similar
to the clock synchronization between the storage controller 12 and the
integrated I/0 subsystem 14 of FIG. 1.
The multi-processor computer system of FIG. 2 operates in the following
manner.
Assume that a particular instruction/execution unit, say 20a requires an
instruction and searches its associated L1 cache, 18a for the desired
instruction. Assume further that the desired instruction is not resident
in the L1 cache. The instruction execution unit will then request access
to the BSU 26 via AD/CMD bus 30a to search the directory of the L2 cache
27. The BSU 26 contains a conventional arbiter which receives requests
from each of the instruction/execution units 20a, 20b, 20c and from the
shared channel processor 28 and from the integrated I/0 subsystem 14,
granting access to only one of these units at a time. When the particular
instruction/execution unit 20a is granted access to the BSU to search the
L2 cache 27, the BSU 26 searches the directory of the L2 cache 27 for the
desired instruction. If the instruction is found in the L2 cache, it is
returned to the instruction/execution unit 20a through storage bus 17a. If
the L2 cache directory indicates that the desired instruction is not
located within the L2 cache, a request is made to the L3 memory 10a or 10b
for the desired instruction. If the desired instruction is located in the
L3 memory, it is immediately transmitted to the BSU 26, 16 bytes at a
time, and is bypassed to the requesting instruction/execution unit via
storage bus 17a, while simultaneously being stored in the L2 cache 27 of
the BSU 26. Additional functions of the BSU relate to storage consistency
in a multiprocessor system. For example, when a particular
instruction/execution unit such as 20c modifies data, that data must be
made visible to all other instruction/execution units 20a, 20b in the
system. If unit 20c modifies data presently stored in its L1 cache 18c, a
search for that particular data is made in the L2 cache directory 27 of
the BSU 26. If found, the particular data is modified to reflect the
modification in the L1 cache 18c. Furthermore, the other processors 20a
and 20b are permitted to see the modified, correct data now resident in
the L2 cache 27 in order to permit such other processors to modify the
corresponding data resident in their L1 caches 18a and 18b. The unit 20c
cannot re-access the particular data until the other processors 20a and
20b have had a chance to modify their corresponding data accordingly.
FIG. 3 shows the detailed construction of each instruction/execution unit
(20 in FIG. 1 or 20a-20c in FIG. 2) and its corresponding L1 cache (18 in
FIG. 1 or 18a-18c in FIG. 2).
In FIG. 1, and in FIG. 2, the instruction/execution unit 20, 20a, 20b, and
20c is disposed in a block labelled "I-unit E-unit C/S (92KB)". This block
may be termed the "processor", the "instruction processing unit", the
"central processing unit (CPU)", or, as indicated above, the
"instruction/execution unit". For the sake of simplicity in the
description provided below, the block 20, 20a-20c will be called the
"processor". In addition, the "I/D caches (L1)" will be called the "L1
cache".
FIG. 3 provides a detailed construction of the processor 20 (or 20a, 20b,
20c) and of the L1 cache 18 (or 18a, 18b, 18c).
The processor 20 comprises several elements. A control store subsystem 20.1
includes a high-speed fixed control store 20.11 of 84k bytes and an 8k
byte, 2k word, 4-way associative pageable area 20.12. Machine state
controls 20.2 include the global controls 20.21 for the processor, an op
branch table 20.22 connected to the CSAR via the control store origin
address bus and used to generate the initial address for microcoded
instructions. An address generation unit 20.3 comprises 3 chips: an
instruction cache DLAT and directory chip 20.31, a data cache DLAT and
directory chip 20.32, and an address generation chip 20.33 connected to
the L1 cache 18 via the address bus 19.1. The instruction DLAT and
directory chip 20.31 is connected to the instruction cache portion 18.11
of the L1 cache 18 via four "hit" lines 19.2 which indicate that the
requested instruction will be found in the instruction cache portion 18.11
of the L1 cache. Likewise, four "hit" lines 19.3 connect the data DLAT and
directory 20.32 to the data cache 18.22 portion of the L1 cache 18
indicating that the requested data will be found in the data cache 18.22
portion of the L1 cache.
Furthermore the instruction DLAT/Directory chip 20.31 contains the
effective address portion and some status bits of the store buffer EA
20.31b. The data DLAT and Directory chip 30.32 contains the absolute
address portion and all of the status bits of the store buffer AA 20.32b
in accordance with the present invention. The address generation chip 20.3
contains copies 20.34 of the 16 general purpose registers used to generate
addresses, and also includes three storage address registers (SARS) 20.35
used to provide addresses to the microcode for instruction execution. A
fixed point instruction execution unit 20.4 is connected to the data cache
18.2 via the data bus (D-bus) 19.4. A local store stack (local store)
20.41 contains the 16 general-purpose registers mentioned above as well as
a number of working registers used exclusively by the microcode. Condition
registers 20.42 contain the results of a number of arithmetic and shift
operations, and a condition code. Numeral 20.43 denotes a conventional
four-byte arithmetic logic unit (ALU); 20.44 is an 8-byte rotate merge
unit. Branch bit select hardware 20.45 allows the selection of bits from
various registers to determine the direction of a branch operation; the
bits are selected from general purpose registers, working registers, and
the condition registers. A floating-point unit 20.5 executes
floating-point instructions. The floating point processor 20.5 is
disclosed in pending patent application Ser. No. 102,985, entitled
"Dynamic Multiple Instruction Stream Multiple Data Multiple Pipeline
Apparatus for Floating Point Single Instruction Stream Single Data
Architectures", filed on Sep. 30, 1987. The ALU 20.43 contains an adder as
disclosed in pending patent application Ser. No. 066,580, filed Jun. 26,
1987, entitled " A High Performance Parallel Binary Byte Adder". An
externals chip 20.6 includes timers and interrupt hardware for interrupts
from the I/0 subsystem 14 and other units.
In an L1 cache 18, an instruction cache 18.1 comprises a 16k byte/4-way
cache 18.11, a 16-byte instruction buffer 18.12 at the output thereof, and
an 8-byte in-page register 18.13 at the input 17.1 from storage. The
eight-bytes storage bus 17.1 is connected to the instruction cache 18.1
via the in-page register 18.13. The in-page register 18.13 is connected to
the control store subsystem 20.1. It provides data to the subsystem when a
pageable control store miss requires that new data be brought into the
control store. A data cache 18.2 comprises an in-page buffer 18.21 also
connected to the storage bus 17.1. The data cache 18.22 is a 16k-byte
4-way cache. A cache-dataflow unit 18.23 comprises a series of input and
output registers connected to the processor via an 8-byte data bus (D-bus)
19.4, and an 8-element store buffer data 18.24 in accordance with the
present invention.
The following describes the functional operation of one of the processors
20a-20c operating in conjunction with its associated L1 cache, 18a, 18b,
or 18c.
Assume that an instruction to be executed is located in the instruction
cache 18.11. The instruction is fetched from the instruction cache 18.11
and is stored in the instruction buffer 18.12. Every attempt is made to
keep the instruction buffer full at all times. The instruction is fetched
from the instruction buffer 18.12 and is stored in the instruction
registers (INSTR REG) 20.36 of the address-generation unit 20.3, registers
20.46 of the fixed-point execution unit 20.4, and registers 20.23 of the
machine-state controls 20.2. At this point, the instruction decoding
begins. If one or more operands are required, they are fetched from the
GPR COPY 20.34 in the address generation unit 20.3. (Normally, GPR COPY
20.34 is accessed if operands are required for the base and index
registers for an RX instruction.)
The address generation process begins in the next cycle. The base and index
register contents are added to a displacement field from the instruction,
and the effective address is generated and sent to the data cache 18.2
and/or to the instruction cache 18.1. In this example, an operand is
sought, so the effective address will be sent to the data cache 18.2. The
address is also sent to the data DLAT and directory chip 20.32, since, in
this example, an operand is sought. Access to the cache, DLAT, and the
directories begins in the third cycle. The DLAT 20.32a determines if the
address is translatable from an effective address to an absolute address.
Assuming that this translation has been previously performed, it will have
been recorded. The translated address from the DLAT output is compared
with the output of the cache directory 20.32a. If they are equal, the data
DLAT and directory chip 20.32 raises one of the four "hit" lines 19.3. The
hit lines are connected to the data cache 18.22; a "hit" line indicates
which of the four associativity classes contains the data that the
processor 20 wishes to retrieve. On the next cycle, the data-cache 18.22
output is gated through a fetch alignment shifter, in the cache dataflow
18.23. It is shifted appropriately, transmitted along the D-BUS 19.4 to
the fixed point execution unit 20.4, and latched into the ALU 20.43. In
parallel with the shifting process, the first operand is accessed from the
general-purpose registers in local store 20.41. As a result, two operands
are latched in the input of the ALU 20.43, if necessary. In the fifth
cycle, the ALU 20.43 will process (add, subtract, divide, etc.) the two
operands according to the instruction opcode. The output of the ALU 20.43
and the condition registers 20.42 are latched at the end of the fifth
cycle, to indicate an overflow or zero condition. In the sixth cycle, the
output of the ALU 20.43 is written back into the local store 20.41 and
into the GPR copy 20.34 of the address generation unit 20.3 in order to
keep the GPR copy 20.34 synchronized with the content of the local store
20.41. When the decode cycle of this instruction is complete, the decode
cycle of the next instruction may begin. Up to six instructions may be
either decoding or executing at any one time. Certain instructions require
the use of microcode to complete execution. Therefore, during the decode
cycle, the op | | |