Description
The present invention relates generally to multiprocessor computer systems
in which the processors share memory resources, and particularly to a
multiprocessor computer system that utilizes an interconnect architecture
and cache coherence methodology to minimize memory access latency so as to
maximize computational throughput.
BACKGROUND OF THE INVENTION
The need to maintain "cache coherence" in multiprocessor systems is well
known. Maintaining "cache coherence" means, at a minimum, that whenever
data is written into a specified location in a shared address space by one
processor, the caches for any other processors which store data for the
same address location are either invalidated, or updated with the new
data.
There are two primary system architectures used for maintaining cache
coherence. One, herein called the cache snoop architecture, requires that
each data processor's cache include logic for monitoring a shared address
bus and various control lines so as to detect when data in shared memory
is being overwritten with new data, determining whether its data
processor's cache contains an entry for the same memory location, and
updating its cache contents and/or the corresponding cache tag when data
stored in the cache is invalidated by another processor. Thus, in the
cache snoop architecture, every data processor is responsible for
maintaining its own cache in a state that is consistent with the state of
the other caches.
In a second cache coherence architecture, herein called the memory
reference architecture, main memory includes a set of status bits for
every block of data that indicate which data processors, if any, have the
data block stored in cache. The main memory's status bits may store
additional information, such as which processor is considered to be the
"owner" of the data block if the cache coherence architecture requires
storage of such information.
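The per-block status bits of the memory reference architecture can be pictured with a minimal sketch. The names `BlockStatus`, `sharers`, and `owner`, and the use of one presence bit per processor, are assumptions made only for illustration; the text does not specify a layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BlockStatus:
    """Status bits kept with main memory for one data block (hypothetical layout)."""
    sharers: int = 0               # bit p set => processor p has the block cached
    owner: Optional[int] = None    # optional "owner" processor, if the protocol tracks one

    def add_sharer(self, proc: int) -> None:
        """Record that processor `proc` now holds the block in its cache."""
        self.sharers |= 1 << proc

    def holders(self) -> List[int]:
        """List the processors whose presence bit is set."""
        return [p for p in range(self.sharers.bit_length()) if (self.sharers >> p) & 1]

    def write_by(self, proc: int) -> List[int]:
        """Record a write by `proc`; return the processors whose copies are now stale."""
        stale = [p for p in self.holders() if p != proc]
        self.sharers = 1 << proc
        self.owner = proc
        return stale
```

In this sketch a write returns the list of other holders, which is the set of caches that would have to be invalidated or updated.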
The present invention utilizes a different cache coherence architecture,
in which each data processor that has a cache memory maintains a master
cache index and a system controller maintains a duplicate cache index for
each such cache memory. For each memory transaction by a data processor in
which there is a cache miss or other state change requiring communication
with either main memory or other cache memories in order to maintain cache
coherence, the System Controller does a cache index lookup on all the
duplicate cache indexes. Based on the results of those lookups, the system
controller selects the sequence of actions needed to perform the memory
transaction.
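The duplicate cache index lookup can be sketched as follows. This is an illustration under stated assumptions, not the patented logic: the caches are assumed to be direct-mapped, `NUM_LINES` is an invented cache size, and the class and function names are hypothetical. Only the 64-byte line size comes from the text.

```python
LINE_BYTES = 64     # cache line size stated for the preferred embodiment
NUM_LINES = 1024    # assumption: lines per cache (not given in the text)


def split_address(pa):
    """Return (tag, index) for a direct-mapped cache of NUM_LINES lines."""
    block = pa // LINE_BYTES
    return block // NUM_LINES, block % NUM_LINES


class DuplicateTags:
    """One duplicate cache index per caching processor, held by the controller."""

    def __init__(self, num_ports):
        # dtags[port][index] is the tag of the valid block cached there, or None.
        self.dtags = [[None] * NUM_LINES for _ in range(num_ports)]

    def record_fill(self, port, pa):
        """Note that `port` has loaded the block containing address `pa`."""
        tag, index = split_address(pa)
        self.dtags[port][index] = tag

    def lookup_all(self, pa):
        """Return the ports whose duplicate index shows a copy of the block at `pa`."""
        tag, index = split_address(pa)
        return [p for p, tags in enumerate(self.dtags) if tags[index] == tag]
```

An empty result from `lookup_all` would correspond to a block that no cache holds; a non-empty result names the caches that may have to be involved in the transaction.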
In prior art "snoop bus" cache coherence systems, it is impossible for two
overlapping memory transactions to simultaneously access the same address
because only one cache coherent memory transaction can be performed at a
time, and that transaction is broadcast to all data processors so that
they can "snoop" on the address and control busses and thereby keep their
local cache memories consistent with memory transactions performed by
other data processors.
In prior art memory reference architecture cache coherent systems it is
also impossible for two overlapping memory transactions to simultaneously
access the same address because the memory reference logic inherently
serializes such memory transactions.
In the "duplicate cache tag" architecture of the present invention in which
multiple data processors can initiate memory transactions and in which the
interconnect can process multiple memory transactions simultaneously,
there needs to be a mechanism to avoid "coherence hazards." In particular,
the pipelined execution of transactions in the present invention results
in multiple transactions being active simultaneously in the System
Controller. This would lead to coherence hazards in the system if multiple
active transactions shared the same cache index in the Dtags. To avoid
such hazards, the System Controller utilizes special transaction
activation logic that blocks a first memory transaction from becoming
active if a second memory transaction that is already active is using the
same cache index as would be used by the first memory transaction. One
important exception to this transaction activation blocking is that
writeback transactions do not need to be blocked and do not cause other
transactions to be blocked.
SUMMARY OF THE INVENTION
In summary, the present invention is a multiprocessor computer system that
has a multiplicity of sub-systems and a main memory coupled to a system
controller. An interconnect module interconnects the main memory and
sub-systems in accordance with interconnect control signals received from
the system controller.
At least two of the sub-systems are data processors, each having a
respective cache memory that stores multiple blocks of data and a
respective set of master cache tags (Etags), including one cache tag for
each data block stored by the cache memory.
Each data processor includes a master interface for sending memory access
requests to the system controller and for receiving cache access requests
from the system controller corresponding to memory access requests by
other ones of the data processors. The system controller includes memory
access request logic for processing each memory access request by a data
processor, for determining which one of the cache memories and main memory
to couple to the requesting data processor, for sending corresponding
interconnect control signals to the interconnect module so as to couple
the requesting data processor to the determined one of the cache memories
and main memory, and for sending a reply message to the requesting data
processor to prompt the requesting data processor to transmit or receive
one data packet to or from the determined one of the cache memories and
main memory.
The system controller includes transaction activation logic for activating
each memory transaction request when it meets predefined activation
criteria, and for blocking each memory transaction request until the
predefined activation criteria are met. An active transaction status table
stores status data representing memory transaction requests that have been
activated, including an address value for each activated transaction. The
transaction activation logic includes comparator logic for comparing each
memory transaction request with the active transaction status data for all
activated memory transaction requests so as to detect whether activation
of a particular memory transaction request would violate the predefined
activation criteria. With certain exceptions concerning writeback
transactions, an incoming transaction for accessing a data block that maps
to the same cache line as a pending, previously activated transaction
will be blocked until the pending transaction that maps to the same cache
line is completed.
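The activation rule summarized above can be sketched in a few lines. This is a simplified model under stated assumptions: the cache index is derived from the address as for a direct-mapped cache with an invented `NUM_LINES`, blocked requests wait in a simple arrival-order list, and writebacks are treated as never blocked and never blocking. The names `ActivationLogic`, `request`, and `complete` are hypothetical.

```python
LINE_BYTES = 64     # cache line size stated for the preferred embodiment
NUM_LINES = 1024    # assumption: number of lines per cache index


def cache_index(pa):
    """Cache line index that the block containing `pa` maps to."""
    return (pa // LINE_BYTES) % NUM_LINES


class ActivationLogic:
    def __init__(self):
        self.active = {}     # transaction id -> (address, is_writeback)
        self.pending = []    # blocked (transaction id, address) pairs, in arrival order

    def _conflicts(self, pa):
        """Comparator: does any active non-writeback transaction use the same index?"""
        idx = cache_index(pa)
        return any(not wb and cache_index(a) == idx for a, wb in self.active.values())

    def request(self, txn_id, pa, is_writeback=False):
        """Activate the request if allowed, else queue it. Returns True if activated."""
        if is_writeback or not self._conflicts(pa):
            self.active[txn_id] = (pa, is_writeback)
            return True
        self.pending.append((txn_id, pa))
        return False

    def complete(self, txn_id):
        """Retire a transaction, then activate queued requests that no longer conflict."""
        del self.active[txn_id]
        activated, still_blocked = [], []
        for tid, pa in self.pending:
            if self._conflicts(pa):
                still_blocked.append((tid, pa))
            else:
                self.active[tid] = (pa, False)
                activated.append(tid)
        self.pending = still_blocked
        return activated
```

Here `active` plays the role of the active transaction status table and `_conflicts` the role of the comparator logic.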
BRIEF DESCRIPTION OF THE DRAWINGS
Additional objects and features of the invention will be more readily
apparent from the following detailed description and appended claims when
taken in conjunction with the drawings, in which:
FIG. 1 is a block diagram of a computer system incorporating the present
invention.
FIG. 2 is a block diagram of a computer system showing the data bus and
address bus configuration used in one embodiment of the present invention.
FIG. 3 depicts the signal lines associated with a port in a preferred
embodiment of the present invention.
FIG. 4 is a block diagram of the interfaces and port ID register found in a
port in a preferred embodiment of the present invention.
FIG. 5 is a block diagram of a computer system incorporating the present
invention, depicting request and data queues used while performing data
transfer transactions.
FIG. 6 is a block diagram of the System Controller Configuration register
used in a preferred embodiment of the present invention.
FIG. 7 is a block diagram of a caching UPA master port and the cache
controller in the associated UPA module.
FIGS. 8, 8A, 8B, 8C, and 8D show a simplified flow chart of typical
read/write data flow transactions in a preferred embodiment of the present
invention.
FIG. 9 depicts the writeback buffer and Dtag Transient Buffers used for
handling coherent cache writeback operations.
FIGS. 10A, 10B, 10C, 10D and 10E show the data packet formats for various
transaction request packets.
FIG. 11 is a state transition diagram of the cache tag line states for each
cache entry in an Etag array in a preferred embodiment of the present
invention.
FIG. 12 is a state transition diagram of the cache tag line states for each
cache entry in a Dtag array in a preferred embodiment of the present
invention.
FIG. 13 depicts the logic circuitry for activating transactions.
FIGS. 14A-14D are block diagrams of status information data structures used
by the system controller in a preferred embodiment of the present
invention.
FIG. 15 is a block diagram of the Dtag lookup and update logic in the
system controller in a preferred embodiment of the present invention.
FIG. 16 is a block diagram of the S_Request and S_Reply logic
in the system controller in a preferred embodiment of the present
invention.
FIG. 17 is a block diagram of the datapath scheduler in a preferred
embodiment of the present invention.
FIG. 18 is a block diagram of the S_Request and S_Reply logic
in the system controller in a second preferred embodiment of the present
invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The following is a glossary of terms used in this document.
Cache Coherence: keeping all copies of each data block consistent.
Tag: a tag is a record in a cache index for indicating the status of one
cache line and for storing the high order address bits of the address for
the data block stored in the cache line.
Etag: the primary array of cache tags for a cache memory. The Etag array is
accessed and updated by the data processor module in a UPA port.
Dtag: a duplicate array of cache tags maintained by the system controller.
Interconnect: The set of system components that interconnect data
processors, I/O processors and their ports. The "interconnect" includes
the system controller 110, interconnect module 112, data busses 116,
address busses 114, and reply busses 120 (for S_REPLY's), 122 (for
P_REPLY's) in the preferred embodiment.
Victim: a data block displaced from a cache line.
Dirty Victim: a data block that was updated by the associated data
processor prior to its being displaced from the cache by another data
block. Dirty victims must normally be written back to main memory, except
that in the present invention the writeback can be canceled if the same
data block is invalidated by another data processor prior to the writeback
transaction becoming "Active."
Line: the unit of memory in a cache memory used to store a single data
block.
Invalidate: changing the status of a cache line to "invalid" by writing the
appropriate status value in the cache line's tag.
Master Class: an independent request queue in the UPA port for a data
processor. A data processor having a UPA port with K master classes can
issue transaction requests in each of the K master classes. Each master
class has its own request FIFO buffer for issuing transaction requests to
the System Controller as well as its own distinct inbound data buffer for
receiving data packets in response to transaction requests and its own
outbound data buffer for storing data packets to be transmitted.
Writeback: copying modified data from a cache memory into main memory.
The following is a list of abbreviations used in this document:
DVMA: direct virtual memory access (same as DMA, direct memory access for
purposes of this document)
DVP: dirty victim pending
I/O: input/output
IVP: Invalidate me Advisory
MOESI: the five Etag states: Exclusive Modified (M), Shared Modified (O),
Exclusive Clean (E), Shared Clean (S), Invalid (I).
MOSI: the four Dtag states: Exclusive and Potentially Modified (M), Shared
Modified (O), Shared Clean (S), Invalid (I).
NDP: no data tag present
PA[xxx]: physical address [xxx]
SC: System Controller
UPA: Universal Port Architecture
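The two sets of tag states listed above can be written out as enumerations. The mapping from Etag states to Dtag states shown here is an assumption inferred from the wording "Exclusive and Potentially Modified"; the state transition diagrams of FIGS. 11 and 12 are authoritative.

```python
from enum import Enum


class Etag(Enum):
    """The five MOESI states of a master cache tag."""
    M = "Exclusive Modified"
    O = "Shared Modified"
    E = "Exclusive Clean"
    S = "Shared Clean"
    I = "Invalid"


class Dtag(Enum):
    """The four MOSI states of a duplicate cache tag."""
    M = "Exclusive and Potentially Modified"
    O = "Shared Modified"
    S = "Shared Clean"
    I = "Invalid"


# Assumed correspondence: the duplicate tag does not distinguish Exclusive
# Clean from Exclusive Modified, so both are taken to map to Dtag.M.
ETAG_TO_DTAG = {
    Etag.M: Dtag.M,
    Etag.E: Dtag.M,
    Etag.O: Dtag.O,
    Etag.S: Dtag.S,
    Etag.I: Dtag.I,
}


def dtag_for(etag):
    """Return the Dtag state assumed to correspond to a given Etag state."""
    return ETAG_TO_DTAG[etag]
```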
Referring to FIG. 1, there is shown a multiprocessor computer system 100
incorporating the computer architecture of the present invention. The
multiprocessor computer system 100 includes a set of "UPA modules." UPA
modules 102 include data processors as well as slave devices such as I/O
handlers and the like. Each UPA module 102 has a port 104, herein called a
UPA port, where "UPA" stands for "universal port architecture." For
simplicity, UPA modules and their associated ports will often be called,
collectively, "ports" or "UPA ports," with the understanding that the port
or UPA port being discussed includes both a port and its associated UPA
module.
The system 100 further includes a main memory 108, which may be divided
into multiple memory banks 109 (Bank_0 to Bank_m), a system
controller 110, and an interconnect module 112 for interconnecting the
ports 104 and main memory 108. The interconnect module 112, under the
control of datapath setup signals from the System Controller 110, can form
a datapath between any port 104 and any other port 104 or between any port
104 and any memory bank 109. The interconnect module 112 can be as simple
as a single, shared data bus with selectable access ports for each UPA
port and memory module, or can be a somewhat more complex crossbar switch
having m ports for m memory banks and n ports for n UPA ports, or can be a
combination of the two. The present invention is not dependent on the type
of interconnect module 112 used, and thus the present invention can be
used with many different interconnect module configurations.
A UPA port 104 interfaces with the system controller 110 and the
interconnect module 112 via a packet switched address bus 114 and a packet
switched data bus 116, respectively, each of which operates independently. A UPA
module logically plugs into a UPA port. The UPA module 102 may contain a
data processor, an I/O controller with interfaces to I/O busses, or a
graphics frame buffer. The UPA interconnect architecture in the preferred
embodiment supports up to thirty-two UPA ports, and multiple address and
data busses in the interconnect. Up to four UPA ports 104 can share the
same address bus 114, and arbitrate for its mastership with a distributed
arbitration protocol.
The System Controller 110 is a centralized controller and performs the
following functions:
Coherence control;
Memory and Datapath control; and
Address crossbar-like connectivity for multiple address busses.
The System Controller 110 controls the interconnect module 112, and
schedules the transfer of data between two UPA ports 104, or between a UPA
port 104 and memory 108. The architecture of the present invention
supports an arbitrary number of memory banks 109. The System Controller
110 controls memory access timing in conjunction with datapath scheduling
for maximum utilization of both resources.
The System Controller 110, the interconnect module 112, and memory 108 are
in the "interconnect domain," and are coupled to UPA modules 102 by their
respective UPA ports 104. The interconnect domain is fully synchronous
with a centrally distributed system clock signal, generated by a System
Clock 118, which is also sourced to the UPA modules 102. If desired, each
UPA module 102 can synchronize its private internal clock with the system
interconnect clock. All references to clock signals in this document refer
to the system clock, unless otherwise noted.
Each UPA address bus 114 is a 36-bit bidirectional packet switched request
bus, and includes 1-bit odd-parity. It carries address bits PA[40:4] of a
41-bit physical address space as well as transaction identification
information.
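As a small worked example, the address field PA[40:4] can be derived from a physical address as sketched below. The function names are hypothetical, and how the field is packed into 36-bit bus cycles together with the transaction identification information is not modeled.

```python
PA_BITS = 41   # width of the physical address space, from the text


def address_field(pa):
    """Return PA[40:4], the address bits carried in a request packet."""
    if not 0 <= pa < (1 << PA_BITS):
        raise ValueError("address outside the 41-bit physical address space")
    return pa >> 4   # drop PA[3:0], the byte offset within a 16-byte quad-word


def quadword_base(field):
    """Rebuild the 16-byte-aligned physical address from PA[40:4]."""
    return field << 4
```

Because the low four bits are dropped, every address carried on the bus identifies a 16-byte quad-word rather than an individual byte.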
Referring to FIGS. 1 and 2, there may be multiple address busses 114 in the
system, with up to four UPA ports 104 on each UPA address bus 114. The
precise number of UPA address busses is variable, and will generally be
dependent on system speed requirements. Since putting more ports on an
address bus 114 will slow signal transmissions over the address bus, the
maximum number of ports per address bus will be determined by the signal
transmission speed required for the address bus.
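The limits stated above (up to thirty-two UPA ports, up to four ports per address bus) imply a lower bound on the number of address busses. The helper below is only an illustrative calculation; its name and the `ports_per_bus` parameter are not from the text.

```python
MAX_PORTS = 32           # maximum UPA ports in the preferred embodiment
MAX_PORTS_PER_BUS = 4    # maximum UPA ports sharing one address bus


def min_address_busses(num_ports, ports_per_bus=MAX_PORTS_PER_BUS):
    """Smallest number of address busses that can serve `num_ports` ports."""
    if not 1 <= num_ports <= MAX_PORTS:
        raise ValueError("the preferred embodiment supports 1 to 32 UPA ports")
    if not 1 <= ports_per_bus <= MAX_PORTS_PER_BUS:
        raise ValueError("at most four UPA ports can share an address bus")
    return -(-num_ports // ports_per_bus)   # ceiling division
```

Choosing a smaller `ports_per_bus` than the maximum trades more busses for faster signal transmission on each bus, which is the speed/cost tradeoff the text describes.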
The datapath circuitry (i.e., the interconnect module 112) and the address
busses 114 are independently scaleable. As a result, the number of address
busses can be increased, or decreased, for a given number of processors so
as to optimize the speed/cost tradeoff for the transmission of transaction
requests over the address busses totally independently of decisions
regarding the speed/cost tradeoffs associated with the design of the
interconnect module 112.
FIG. 3 shows the full set of signals received and transmitted by a UPA port
having all four interfaces (described below) of the preferred embodiment.
Table 1 provides a short description of each of the signals shown in FIG.
3.
TABLE 1
______________________________________________________________________
UPA Port Interface Signal Definitions
Signal Name             Description
______________________________________________________________________
Data Bus Signals
UPA_Databus[128]        128-bit data bus. Depending on speed
                        requirements and the bus technology used, a
                        system can have as many as one 128-bit data
                        bus for each UPA port, or each data bus can
                        be shared by several ports.
UPA_ECC[16]             Bus for carrying error correction codes.
                        UPA_ECC<15:8> carries the ECC for
                        UPA_Databus<127:64>.
                        UPA_ECC<7:0> carries the ECC for
                        UPA_Databus<63:0>.
UPA_ECC_Valid           ECC valid. A unidirectional signal from the
                        System Controller to each UPA port, driven by
                        the System Controller to indicate whether the
                        ECC is valid for the data on the data bus.

Address Bus Signals
UPA_Addressbus[36]      36-bit packet switched transaction request
                        bus. See packet format in FIGS. 9A, 9B, 9C.
UPA_Req_In[3]           Arbitration request lines for up to three
                        other UPA ports that might be sharing this
                        UPA_Addressbus.
UPA_Req_Out             Arbitration request from this UPA port.
UPA_SC_Req_In           Arbitration request from System Controller.
UPA_Arb_Reset_L         Arbitration Reset, asserted at the same time
                        that UPA_Reset_L is asserted.
UPA_AddrValid           There is a separate, bidirectional, address
                        valid signal line between the System
                        Controller and each UPA port. It is driven by
                        the port which wins the arbitration or by the
                        System Controller when it drives the address
                        bus.
UPA_Data_Stall          Data stall signal, driven by the System
                        Controller to each UPA port to indicate,
                        during transmission of a data packet, whether
                        there is a data stall in between quad-words
                        of a data packet.

Reply Signals
UPA_P_Reply[5]          Port's reply packet, driven by a UPA port
                        directly to the System Controller. There is a
                        dedicated UPA_P_Reply bus for each UPA port.
UPA_S_Reply[6]          System Controller's reply packet, driven by
                        System Controller directly to the UPA port.
                        There is a dedicated UPA_S_Reply bus for each
                        UPA port.

Miscellaneous Signals
UPA_Port_ID[5]          Five bit hardwired UPA Port Identification.
UPA_Reset_L             Reset. Driven by System Controller at
                        power-on and on any fatal system reset.
UPA_Sys_Clk[2]          Differential UPA system clock, supplied by
                        the system clock to all UPA ports.
UPA_CPU_Clk[2]          Differential processor clock, supplied by the
                        system clock controller only to processor UPA
                        ports.
UPA_Speed[3]            Used only for processor UPA ports, this
                        hardwired three bit signal encodes the
                        maximum speed at which the UPA port can
                        operate.
UPA_IO_Speed            Used only by IO UPA ports, this signal
                        encodes the maximum speed at which the UPA
                        port can operate.
UPA_Ratio               Used only for processor UPA ports, this
                        signal encodes the ratio of the system clock
                        to the processor clock, and is used by the
                        processor to internally synchronize the
                        system clock and processor clock if it uses a
                        synchronous internal interface.
UPA_JTAG[5]             JTAG scan control signals, TDI, TMS, TCLK,
                        TRST_L and TDO. TDO is output by the UPA
                        port, the others are inputs.
UPA_Slave_Int_L         Interrupt, for slave-only UPA ports. This is
                        a dedicated line from the UPA port to the
                        System Controller.
UPA_XIR_L               XIR reset signal, asserted by the System
                        Controller to signal XIR reset.
______________________________________________________________________
A valid packet on the UPA address bus 114 is identified by the driver
(i.e., the UPA port 104 or the System Controller 110) asserting the
UPA_Addr_valid signal.
The System Controller 110 is connected to each UPA address bus 114 in the
system 100. The UPA ports 104 and System Controller 110 arbitrate for use
of each UPA address bus 114 using a distributed arbitration protocol. The
arbitration protocol is described in patent application Ser. No.
08/414,559, filed Mar. 31, 1995, now U.S. Pat. No. 5,710,891, which is
hereby incorporated by reference.
UPA ports do not communicate directly with other UPA ports on a shared UPA
address bus 114. Instead, when a requesting UPA port generates a request
packet that requests access to an addressed UPA port, the System
Controller 110 forwards a slave access to the addressed UPA port by
retransmitting the request packet and qualifying the destination UPA port
with its UPA_Addr_valid signal.
A UPA port also does not "snoop" on the UPA address bus to maintain cache
coherence. The System Controller 110 performs snooping on behalf of those
UPA ports whose respective UPA modules include cache memory using a
write-invalidate cache coherence protocol described below. The UPA address
bus 114 and UPA data bus 116 coupled to any UPA port 104 are independent.
An address is associated with its data through ordering rules discussed
below.
The UPA data bus is a 128-bit quad-word bidirectional data bus, plus 16
additional ECC (error correction code) bits. A "word" is defined herein to
be a 32-bit, 4-byte datum. A quad-word consists of four words, or 16
bytes. In some embodiments, all or some of the data busses 116 in the
system 100 can be 64-bit double-word bidirectional data busses, plus 8
additional bits for ECC. The ECC bits are divided into two 8-bit halves
for the 128-bit wide data bus. Although the 64-bit wide UPA data bus has
half as many signal lines, it carries the same number of bytes per
transaction as the 128-bit wide UPA data bus, but in twice the number of
clock cycles. In the preferred embodiment, the smallest unit of coherent
data transfer is 64 bytes, requiring four transfers of 16 bytes during
four successive system clock cycles over the 128-bit UPA data bus.
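The transfer arithmetic above can be checked with a short calculation. This sketch assumes no data stalls and a whole number of quad-words per clock pattern as described in the text; the function name is hypothetical.

```python
QUADWORD_BYTES = 16   # one quad-word is four 32-bit words


def transfer_cycles(num_bytes, bus_bits):
    """System clock cycles needed to move `num_bytes`, ignoring any data stalls."""
    if bus_bits not in (64, 128):
        raise ValueError("the text describes only 64-bit and 128-bit data busses")
    quadwords = -(-num_bytes // QUADWORD_BYTES)   # ceiling division
    cycles_per_quadword = 128 // bus_bits         # 1 on a 128-bit bus, 2 on a 64-bit bus
    return quadwords * cycles_per_quadword
```

For a 64-byte coherent block this gives four cycles on the 128-bit bus and eight cycles on the 64-bit bus, matching the "twice the number of clock cycles" statement.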
A "master" UPA port, also called a UPA master port, is herein defined to be
one which can initiate data transfer transactions. All data processor UPA
modules must have a master UPA port 104.
Note that graphics devices, which may include some data processing
capabilities, typically have only a slave interface. Slave interfaces are
described below. For the purposes of this document, a "data processor" is
defined to be a programmable computer or data processing device (e.g., a
microprocessor) that both reads and writes data from and to main memory.
Most, but not necessarily all, "data processors" have an associated cache
memory. For instance, an I/O controller is a data processor and its UPA
port will be a master UPA port. However, in many cases an I/O controller
will not have a cache memory (or at least not a cache memory for storing
data in the coherence domain).
A caching UPA master port is a master UPA port for a data processor that
also has a coherent cache. The caching UPA master port participates in the
cache coherence protocol.
A "slave" UPA port is herein defined to be one which cannot initiate data
transfer transactions, but is the recipient of such transactions. A slave
port responds to requests from the System Controller. A slave port has an
address space associated with it for programmed I/O. A "slave port" within
a master UPA port (i.e., a slave interface within a master UPA port) also
handles copyback requests for cache blocks, and handles interrupt
transactions in a UPA port which contains a data processor.
Each set of 8 ECC bits carries Shigeo Kaneda's 64-bit SEC-DED-S4ED code. The
interconnect does not generate or check ECC. Each UPA port sourcing data
generates the corresponding ECC bits, and the UPA port receiving the data
checks the ECC bits. UPA ports with master capability support ECC.
A slave-only UPA port containing a graphics framebuffer need not support
ECC (see the UPA_ECC_Valid signal).
The UPA data bus 116 is not a globally shared common data bus. As shown in
FIGS. 1 and 2, there may be more than one UPA data bus 116 in the system,
and the precise number is implementation specific. Data is always
transferred in units of 16 bytes per clock-cycle on the 128-bit wide UPA
data bus, and in units of 16 bytes per two clock-cycles on the 64-bit wide
UPA data bus.
The size of each cache line in the preferred embodiment is 64 bytes, or
sixteen 32-bit words. As will be described below, 64 bytes is the minimum
unit of data transfer for all transactions involving the transfer of
cached data. That is, each data packet of cached data transferred via the
interconnect is 64 bytes. Transfers of non-cached data can transfer 1 to
16 bytes within a single quad-word transmission, qualified with a 16-bit
bytemask to indicate which bytes within the quad-word contain the data
being transferred.
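A 16-bit bytemask for a non-cached transfer might be formed as in the following sketch. Two points are assumptions, since the text does not state them: that bit i of the mask corresponds to byte i of the quad-word, and that the bytes transferred are contiguous. The function name is hypothetical.

```python
def bytemask(offset, length):
    """16-bit mask with one bit set per transferred byte of a quad-word.

    Assumes bit i marks byte i and that the bytes are contiguous.
    """
    if not (0 <= offset < 16 and 1 <= length <= 16 - offset):
        raise ValueError("transfer must cover 1 to 16 bytes within one quad-word")
    return ((1 << length) - 1) << offset
```

For example, a 2-byte transfer starting at byte 4 yields the mask 0x0030 under these assumptions, and a full quad-word yields 0xFFFF.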
System Controller 110 schedules a data transfer on a UPA data bus 116 using
a signal herein called the S_REPLY. For block transfers, if
successive quadwords cannot be read or written in successive clock cycles
from memory, the UPA_Data_Stall signal is asserted by System
Controller 110 to the UPA port.
For coherent block read and copyback transactions of 64-byte data blocks,
the quad-word (16 bytes) addressed on physical address bits PA[5:4] is
delivered first, and the successive quad words are delivered in the wrap
order shown in Table 2. The addressed quad-word is delivered first so that
the requesting data processor can receive and begin processing the
addressed quad-word prior to receipt of the last quad-word in the
associated data block. In this way, latency associated with the cache
update transaction is reduced. Non-cached block read and block writes of
64 byte data blocks are always aligned on a 64-byte block boundary
(PA[5:4]=0x0).
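The role of address bits PA[5:4] can be illustrated with two small helpers. Only the selection of the first quad-word and the alignment test are shown; the order of the remaining quad-words is defined by Table 2 and is deliberately not modeled here. The function names are hypothetical.

```python
def first_quadword(pa):
    """Index (0 to 3) of the quad-word within its 64-byte block, taken from PA[5:4]."""
    return (pa >> 4) & 0x3


def starts_on_block_boundary(pa):
    """True when PA[5:4] is 0, as required for non-cached block reads and writes."""
    return first_quadword(pa) == 0
```

A coherent read of address 0x1020, for instance, would have quad-word 2 of its block delivered first.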
Note that these 64-byte data packets are delivered without an attached
address, address tag, or transaction tag. Address information and data are
transmitted independently over independent busses. While this is
efficient, in order to match up incoming data packets