Claims
What is claimed is:
1. A computer system, comprising:
a system controller;
a main memory coupled to the system controller;
a plurality of data processors, each respective data processor having a
cache memory having a plurality of cache lines for storing a like
plurality of data blocks, and a like plurality of master cache tags (Etags),
including one Etag for each cache line in the cache memory of the
respective data processor, and a writeback buffer for storing a dirty
victim data block displaced from the cache memory; the Etag for each cache
line storing an address index and an Etag state value that indicates
whether the data block stored in the cache line includes data modified by
the respective data processor;
each respective data processor including a master interface, coupled to the
system controller, for sending memory transaction requests to the system
controller, the memory transaction requests including read requests and
writeback requests; each memory transaction request specifying an address
for an associated data block to be read or written;
the master interface for each respective data processor further including
cache coherence logic for responding to a cache miss on any cache line in
the cache memory by (A) generating a read request, and (B) when the cache
miss requires a cache line to be victimized which includes modified data,
according to the Etag state value in the corresponding Etag, storing the
data block having the modified data in the writeback buffer and generating
a writeback request;
the system controller including a set of duplicate cache tags (Dtags) for
each respective data processor, each Dtag corresponding to one of the
Etags and storing a Dtag state value and the same address index as the
corresponding Etag; the Dtag state value indicating whether the data block
stored in the corresponding cache line includes data modified by the
corresponding data processor;
the system controller including memory transaction request logic for
processing each memory transaction request by any of the respective data
processors, the memory transaction request logic including transaction
execution circuitry for pipelining and executing the memory transaction
requests from the data processors,
the transaction execution circuitry including Dtag updating circuitry for
updating the contents of a Dtag, if any, corresponding to each data
processor where the Dtag has an address index matching an address
associated with the memory transaction request being executed, wherein the
Dtag updating circuitry can invalidate Dtags corresponding to one or more
data processors, including data processors other than the data processor
that made the memory transaction request being executed by the transaction
execution circuitry;
the transaction execution circuitry further including writeback circuitry
for processing each writeback request from each data processor prior to
activation to determine if the Dtag corresponding to the data processor
that submitted the writeback request and the address specified by the
writeback request is invalid, the writeback circuitry activating the
writeback request if the corresponding Dtag is not invalid and canceling
the writeback request if the corresponding Dtag is invalid;
the system controller's memory transaction request logic including
writeback processing logic for processing the activated writeback request
by writing the data block in the writeback buffer into the main memory and
invalidating the corresponding Dtag.
2. The computer system of claim 1,
the master interface of each data processor including at least two parallel
outgoing request queues for storing memory transaction requests to be sent
to the system controller;
the master interface of each data processor further including cache
coherence logic for responding to a cache miss on any cache line in the
cache memory of the respective data processor by (A) storing a read
request in a first one of the outgoing request queues, and (B) when the
cache miss occurs on a cache line storing a data block that, according to
the Etag state value in the corresponding Etag, includes modified data,
storing the data block having the modified data in the writeback buffer
and storing a writeback request in a second one of the outgoing request
queues.
3. A method of canceling writeback transactions in a packet switched cache
coherent multiprocessor system having a system controller coupled to a
main memory and to a plurality of data processors each having a cache
memory, comprising the steps of:
storing master cache tags (Etags) in each data processor, including one
Etag for each cache line in the cache memory of the respective data
processor, the Etag for each cache line storing an address index and an
Etag state value that indicates whether a data block stored in the cache
line includes data modified by the respective data processor;
storing in a writeback buffer in each data processor a dirty victim data
block displaced from the cache memory of the respective data processor;
storing in the system controller duplicate tags (Dtags) for each respective
data processor, each Dtag corresponding to one of the Etags in a
respective data processor and storing a Dtag state value and the same
address index as the corresponding Etag;
sending memory transaction requests from the data processors to the system
controller, the memory transaction requests including read requests and
writeback requests; each memory transaction request specifying an address
for an associated data block to be read or written;
upon processing a memory transaction request from one of the data
processors requiring Dtag invalidation, invalidating Dtags having an address
index matching an address associated with the memory transaction request
being executed, wherein the invalidated Dtags include Dtags
corresponding to one or more data processors, including data processors
other than the data processor that made the memory transaction request
being executed;
activating a writeback request for processing by the system controller by:
determining if the Dtag corresponding to the data processor that made the
writeback request and the address specified by the writeback request is
invalid;
canceling the writeback request if the Dtag is invalid;
activating the writeback request if the Dtag is not invalid; and
processing activated writeback requests by writing the data block in
respective ones of the writeback buffers into the main memory and
invalidating the corresponding Dtags.
4. The method of claim 3,
wherein the activating step cancels the writeback request by sending a
first reply message to the data processor that instructs the data
processor not to send the dirty victim data block to the main memory, and
activates the writeback request by sending a second reply message to the
data processor that instructs the data processor to send the dirty victim
data block to the main memory.
Description
The present invention relates generally to multiprocessor computer systems
in which the processors share memory resources, and particularly to a
multiprocessor computer system that utilizes an interconnect architecture
and cache coherence methodology to minimize memory access latency by
parallelizing read and writeback transactions and providing for a
mechanism to cancel pending writeback transactions upon the subsequent
modification of the pending writeback data.
BACKGROUND OF THE INVENTION
The need to maintain "cache coherence" in multiprocessor systems is well
known. Maintaining "cache coherence" means, at a minimum, that whenever
data is written into a specified location in a shared address space by one
processor, the caches for any other processors which store data for the
same address location are either invalidated, or updated with the new
data.
There are two primary system architectures used for maintaining cache
coherence. One, herein called the cache snoop architecture, requires that
each data processor's cache include logic for monitoring a shared address
bus and various control lines so as to detect when data in shared memory
is being overwritten with new data, determining whether its data
processor's cache contains an entry for the same memory location, and
updating its cache contents and/or the corresponding cache tag when data
stored in the cache is invalidated by another processor. Thus, in the
cache snoop architecture, every data processor is responsible for
maintaining its own cache in a state that is consistent with the state of
the other caches.
In a second cache coherence architecture, herein called the memory
directory architecture, main memory includes a set of status bits for
every block of data that indicate which data processors, if any, have the
data block stored in cache. The main memory's status bits may store
additional information, such as which processor is considered to be the
"owner" of the data block if the cache coherence architecture requires
storage of such information.
In these cache coherence architectures, read-writeback transaction pairs
arise when a read miss requires victimizing a cache line which has
modified data, thereby necessitating a writeback to main memory. In the
prior art, these transactions normally are strictly ordered, with the
victimizing read transaction executing prior to the writeback transaction
in order to allow the requesting processor to receive the data right away.
In addition to the strict ordering, cache coherence architectures of the
prior art required these read and writeback transactions be sequentially
executed, not allowing for any other coherent transactions to be executed
from the same processor between the read and the writeback transactions,
even when transactions are directed to a different cache index.
One problem in these architectures arises when a writeback transaction is
scheduled and the data associated with the writeback becomes invalid, due
to a subsequent data write to the same address (such new data being stored
in a cache line of one of the data processors). In these cases, the
writeback transaction is no longer required, since the data is no longer
valid. However, in these prior art architectures no simple way exists to
cancel the writeback transactions once they are generated. This is because
there does not exist a single shared address bus which a given processor
can snoop on to look for all of these invalidating transactions.
Similarly, cancellation of unexecuted or unscheduled writeback
transactions (ones that are pending as part of a read-writeback
transaction pair), cannot be easily accomplished because of the lack of
visibility to these transactions. Accordingly, unnecessary writebacks are
characteristic of these prior art systems, resulting in reduced system
performance.
Accordingly, an architecture which provides for an easy mechanism to cancel
pending writeback transactions upon the occurrence of an invalidating
transaction would yield an improvement in the overall transaction
throughput.
SUMMARY OF THE INVENTION
In summary, the present invention is a system and method for cancelling
unnecessary writebacks of dirty victims displaced from cache memory in a
cache coherent multiprocessor computer system. The multiprocessor system
has a multiplicity of sub-systems and a main memory coupled to a system
controller. An interconnect module interconnects the main memory and
sub-systems in accordance with interconnect control signals received from
the system controller.
At least two of the sub-systems are data processors, each having a
respective cache memory that stores multiple blocks of data and a
respective master cache index. Each master cache index has a set of N
master cache tags (Etags), including one cache tag for each data block
stored by the cache memory.
Each data processor includes a master interface for sending memory
transaction requests, including read and writeback transactions, to the
system controller and for receiving cache access requests from the system
controller corresponding to memory transaction requests by other ones of
the data processors. The data processors each further include a writeback
buffer for storing a victimized cache line until an associated writeback
transaction is completed.
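By way of illustration only, the miss handling performed by the master interface (a read request on every miss, plus a writeback request when the victim is dirty) can be sketched in Python as follows. The class name MasterInterface, its fields, and the tuple layout of requests and Etags are hypothetical conveniences and are not taken from the described embodiment; the two queues merely stand in for two outgoing request queues.

```python
from collections import deque


class MasterInterface:
    """Hypothetical model of one data processor's master interface."""

    def __init__(self, num_lines):
        # One Etag per cache line: (address, modified) or None when empty.
        self.etags = [None] * num_lines
        self.data = [None] * num_lines
        self.writeback_buffer = None      # holds one dirty victim
        self.read_queue = deque()         # first outgoing request queue
        self.writeback_queue = deque()    # second outgoing request queue

    def on_cache_miss(self, line, miss_addr):
        # (A) Always queue a read request for the missing block.
        self.read_queue.append(("READ", miss_addr))
        etag = self.etags[line]
        # (B) If the victim holds modified data, park it in the writeback
        # buffer and queue a writeback request for the victim's address.
        if etag is not None and etag[1]:
            victim_addr = etag[0]
            self.writeback_buffer = (victim_addr, self.data[line])
            self.writeback_queue.append(("WRITEBACK", victim_addr))
        # The line is now empty until the read data arrives (simplification).
        self.etags[line] = None
        self.data[line] = None
```

In this sketch a clean victim is simply dropped, so only a miss on a modified line produces a read-writeback transaction pair.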
The system controller includes memory transaction request logic for
processing each memory transaction request by a data processor, for
determining which one of the cache memories and main memory to couple to
the requesting data processor, for sending corresponding interconnect
control signals to the interconnect module so as to couple the requesting
data processor to the determined one of the cache memories and main
memory, and for sending a reply message to the requesting data processor
to prompt the requesting data processor to transmit or receive one data
packet to or from the determined one of the cache memories and main
memory.
The system controller maintains a duplicate cache index having a set of N
duplicate cache tags (Dtags) for each of the data processors, the set of N
duplicate cache tags for each data processor having the same number of
cache tags as the corresponding set of master cache tags. Each master
cache tag denotes a master cache state and an address tag; the duplicate
cache tag corresponding to each master cache tag denotes a second cache
state and the same address tag as the corresponding master cache tag. The
system controller includes an (N+1)th entry in the duplicate cache index,
the (N+1)th entry corresponding to the cache state of a victimized cache
line stored in the writeback buffer of an associated data processor.
The system controller includes logic circuitry for executing memory
transaction requests from the data processors, and includes invalidation
circuitry for processing each writeback request from a given data
processor to determine if the Dtag index corresponding to the victimized
cache line is invalid. In particular, if a first data processor has
performed a memory transaction that invalidates the same cache line that
is the subject of the writeback request from a second data processor, the
writeback is unnecessary because the data in that cache line to be written
back has been invalidated and will be overwritten by the first data
processor. The invalidation circuitry allows the data transfer to main
memory associated with the writeback request to be executed only if the
Dtag index for the address specified in the writeback request is not
invalid and cancels the writeback request if the Dtag index is invalid.
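The cancel-or-activate decision summarized above can be sketched, under simplifying assumptions, as follows. The sketch keys Dtags by a (port, address) pair rather than by cache index, omits the (N+1)th transient entry, and uses the hypothetical return values "ACTIVATE" and "CANCEL" in place of the actual reply messages; none of these names come from the described embodiment.

```python
class SystemController:
    """Hypothetical, heavily simplified model of the Dtag-based check."""

    def __init__(self):
        self.dtags = {}    # (port, address) -> one of "M", "O", "S", "I"
        self.memory = {}   # address -> data block

    def invalidate_other_copies(self, requester, addr):
        # A transaction by `requester` invalidates matching Dtags that
        # belong to other data processors.
        for (port, a) in list(self.dtags):
            if a == addr and port != requester:
                self.dtags[(port, a)] = "I"

    def process_writeback(self, port, addr, victim_data):
        # A missing entry is treated like an invalid one in this sketch.
        if self.dtags.get((port, addr), "I") == "I":
            return "CANCEL"            # victim was invalidated; drop it
        self.memory[addr] = victim_data
        self.dtags[(port, addr)] = "I"  # writeback done; Dtag invalidated
        return "ACTIVATE"
```

The point of the design is that the requesting processor never needs to observe the invalidating transaction itself; the system controller sees every transaction and can therefore decide on the processor's behalf.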
BRIEF DESCRIPTION OF THE DRAWINGS
Additional objects and features of the invention will be more readily
apparent from the following detailed description and appended claims when
taken in conjunction with the drawings, in which:
FIG. 1 is a block diagram of a computer system incorporating the present
invention.
FIG. 2 is a block diagram of a computer system showing the data bus and
address bus configuration used in one embodiment of the present invention.
FIG. 3 depicts the signal lines associated with a port in a preferred
embodiment of the present invention.
FIG. 4 is a block diagram of the interfaces and port ID register found in a
port in a preferred embodiment of the present invention.
FIG. 5 is a block diagram of a computer system incorporating the present
invention, depicting request and data queues used while performing data
transfer transactions.
FIG. 6 is a block diagram of the System Controller Configuration register
used in a preferred embodiment of the present invention.
FIG. 7 is a block diagram of a caching UPA master port and the cache
controller in the associated UPA module.
FIGS. 8, 8A, 8B, 8C and 8D are a simplified flow chart of typical read/write
data flow transactions in a preferred embodiment of the present invention.
FIG. 9 depicts the writeback buffer and Dtag Transient Buffers used for
handling coherent cache writeback operations.
FIGS. 10A, 10B, 10C, 10D and 10E show the data packet formats for various
transaction request packets.
FIG. 11 is a state transition diagram of the cache tag line states for each
cache entry in an Etag array in a preferred embodiment of the present
invention.
FIG. 12 is a state transition diagram of the cache tag line states for each
cache entry in a Dtag array in a preferred embodiment of the present
invention.
FIG. 13 depicts the logic circuitry for activating transactions.
FIGS. 14A-14D are block diagrams of status information data structures used by
the system controller in a preferred embodiment of the present invention.
FIG. 15 is a block diagram of the Dtag lookup and update logic in the
system controller in a preferred embodiment of the present invention.
FIG. 16 is a block diagram of the S_Request and S_Reply logic
in the system controller in a preferred embodiment of the present
invention.
FIG. 17 is a block diagram of the datapath scheduler in a preferred
embodiment of the present invention.
FIG. 18 is a block diagram of the S_Request and S_Reply logic
in the system controller in a second preferred embodiment of the present
invention.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
The following is a glossary of terms used in this document.
Cache Coherence: keeping all copies of each data block consistent.
Tag: a tag is a record in a cache index for indicating the status of one
cache line and for storing the high order address bits of the address for
the data block stored in the cache line.
Etag: the primary array of cache tags for a cache memory. The Etag array is
accessed and updated by the data processor module in a UPA port.
Dtag: a duplicate array of cache tags maintained by the system controller.
Interconnect: The set of system components that interconnect data
processors, I/O processors and their ports. The "interconnect" includes
the system controller 110, interconnect module 112, data busses 116,
address busses 114, and reply busses 120 (for S_REPLYs), 122 (for
P_REPLYs) in the preferred embodiment.
Victim: a data block displaced from a cache line.
Dirty Victim: a data block that was updated by the associated data
processor prior to its being displaced from the cache by another data
block. Dirty victims must normally be written back to main memory, except
that in the present invention the writeback can be canceled if the same
data block is invalidated by another data processor prior to the writeback
transaction becoming "Active."
Line: the unit of memory in a cache memory used to store a single data
block.
Invalidate: changing the status of a cache line to "invalid" by writing the
appropriate status value in the cache line's tag.
Master Class: an independent request queue in the UPA port for a data
processor. A data processor having a UPA port with K master classes can
issue transaction requests in each of the K master classes. Each master
class has its own request FIFO buffer for issuing transaction requests to
the System Controller as well as its own distinct inbound data buffer for
receiving data packets in response to transaction requests and its own
outbound data buffer for storing data packets to be transmitted.
Writeback: copying modified data from a cache memory into main memory.
The following is a list of abbreviations used in this document:
DVMA: direct virtual memory access (same as DMA, direct memory access for
purposes of this document)
DVP: dirty victim pending
I/O: input/output
IVP: Invalidate me Advisory
MOESI: the five Etag states: Exclusive Modified (M), Shared Modified (O),
Exclusive Clean (E), Shared Clean (S), Invalid (I).
MOSI: the four Dtag states: Exclusive and Potentially Modified (M), Shared
Modified (O), Shared Clean (S), Invalid (I).
NDP: no data tag present
PA[xxx]: physical address [xxx]
SC: System Controller
UPA: Universal Port Architecture
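For readers who prefer a concrete representation, the MOESI and MOSI state sets listed above can be written as two Python enumerations. The mapping ETAG_TO_DTAG is an assumption inferred only from the Dtag state name "Exclusive and Potentially Modified"; the text above does not spell out the correspondence, and the authoritative transitions are those of the state diagrams in FIGS. 11 and 12.

```python
from enum import Enum


class EtagState(Enum):
    M = "Exclusive Modified"
    O = "Shared Modified"
    E = "Exclusive Clean"
    S = "Shared Clean"
    I = "Invalid"


class DtagState(Enum):
    M = "Exclusive and Potentially Modified"
    O = "Shared Modified"
    S = "Shared Clean"
    I = "Invalid"


# ASSUMPTION (not stated in the text): both exclusive Etag states are
# tracked by the single Dtag state M, since a duplicate tag held outside
# the processor cannot observe a clean line being modified locally.
ETAG_TO_DTAG = {
    EtagState.M: DtagState.M,
    EtagState.E: DtagState.M,
    EtagState.O: DtagState.O,
    EtagState.S: DtagState.S,
    EtagState.I: DtagState.I,
}
```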
Referring to FIG. 1, there is shown a multiprocessor computer system 100
incorporating the computer architecture of the present invention. The
multiprocessor computer system 100 includes a set of "UPA modules." UPA
modules 102 include data processors as well as slave devices such as I/O
handlers and the like. Each UPA module 102 has a port 104, herein called a
UPA port, where "UPA" stands for "universal port architecture." For
simplicity, UPA modules and their associated ports will often be called,
collectively, "ports" or "UPA ports," with the understanding that the port
or UPA port being discussed includes both a port and its associated UPA
module.
The system 100 further includes a main memory 108, which may be divided
into multiple memory banks 109 Bank_0 to Bank_m, a system
controller 110, and an interconnect module 112 for interconnecting the
ports 104 and main memory 108. The interconnect module 112, under the
control of datapath setup signals from the System Controller 110, can form
a datapath between any port 104 and any other port 104 or between any port
104 and any memory bank 109. The interconnect module 112 can be as simple
as a single, shared data bus with selectable access ports for each UPA
port and memory module, or can be a somewhat more complex crossbar switch
having m ports for m memory banks and n ports for n UPA ports, or can be a
combination of the two. The present invention is not dependent on the type of
interconnect module 112 used, and thus the present invention can be used
with many different interconnect module configurations.
A UPA port 104 interfaces with the interconnect module 112 and the system
controller 110 via a packet switched address bus 114 and packet switched
data bus 116 respectively, each of which operates independently. A UPA
module logically plugs into a UPA port. The UPA module 102 may contain a
data processor, an I/O controller with interfaces to I/O busses, or a
graphics frame buffer. The UPA interconnect architecture in the preferred
embodiment supports up to thirty-two UPA ports, and multiple address and
data busses in the interconnect. Up to four UPA ports 104 can share the
same address bus 114, and arbitrate for its mastership with a distributed
arbitration protocol.
The System Controller 110 is a centralized controller and performs the
following functions:
Coherence control;
Memory and Datapath control; and
Address crossbar-like connectivity for multiple address busses.
The System Controller 110 controls the interconnect module 112, and
schedules the transfer of data between two UPA ports 104, or between UPA
port 104 and memory 108. The architecture of the present invention
supports an arbitrary number of memory banks 109. The System Controller
110 controls memory access timing in conjunction with datapath scheduling
for maximum utilization of both resources.
The System Controller 110, the interconnect module 112, and memory 108 are
in the "interconnect domain," and are coupled to UPA modules 102 by their
respective UPA ports 104. The interconnect domain is fully synchronous
with a centrally distributed system clock signal, generated by a System
Clock 118, which is also sourced to the UPA modules 102. If desired, each
UPA module 102 can synchronize its private internal clock with the system
interconnect clock. All references to clock signals in this document refer
to the system clock, unless otherwise noted.
Each UPA address bus 114 is a 36-bit bidirectional packet switched request
bus, and includes 1-bit odd-parity. It carries address bits PA[40:4] of a
41-bit physical address space as well as transaction identification
information.
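As a reminder of what odd parity means, the following sketch computes a parity bit such that the total number of 1 bits, parity bit included, is odd. It assumes that the single parity bit protects the remaining 35 bits of the 36-bit bus; the text above does not say which bits the parity covers, and the function name is hypothetical.

```python
def odd_parity_bit(payload, width=35):
    """Return the bit that makes the total count of 1s odd.

    ASSUMPTION: parity covers `width` payload bits (35 of the 36 bus bits).
    """
    ones = bin(payload & ((1 << width) - 1)).count("1")
    # If the payload already has an odd number of 1s, the parity bit is 0.
    return 0 if ones % 2 == 1 else 1
```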
Referring to FIGS. 1 and 2, there may be multiple address busses 114 in the
system, with up to four UPA ports 104 on each UPA address bus 114. The
precise number of UPA address busses is variable, and will generally be
dependent on system speed requirements. Since putting more ports on an
address bus 114 will slow signal transmissions over the address bus, the
maximum number of ports per address bus will be determined by the signal
transmission speed required for the address bus.
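The limit of four UPA ports per address bus puts a floor on the number of address busses for a given number of ports. A minimal sketch of that arithmetic follows; the function name is hypothetical, and an actual design may use more busses than this minimum for speed reasons, as noted above.

```python
def min_address_busses(num_ports, ports_per_bus=4):
    """Smallest number of address busses that can host `num_ports` ports."""
    if num_ports < 1 or ports_per_bus < 1:
        raise ValueError("port counts must be positive")
    return -(-num_ports // ports_per_bus)  # ceiling division
```

For the thirty-two ports supported by the preferred embodiment, this gives a minimum of eight address busses.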
The datapath circuitry (i.e., the interconnect module 112) and the address
busses 114 are independently scaleable. As a result, the number of address
busses can be increased, or decreased, for a given number of processors so
as to optimize the speed/cost tradeoff for the transmission of transaction
requests over the address busses totally independently of decisions
regarding the speed/cost tradeoffs associated with the design of the
interconnect module 112.
FIG. 3 shows the full set of signals received and transmitted by a UPA port
having all four interfaces (described below) of the preferred embodiment.
Table 1 provides a short description of each of the signals shown in FIG.
3.
TABLE 1
______________________________________________________________________
UPA Port Interface Signal Definitions
Signal Name          Description
______________________________________________________________________
Data Bus Signals
UPA_Databus[128]     128-bit data bus. Depending on speed requirements
                     and the bus technology used, a system can have as
                     many as one 128-bit data bus for each UPA port, or
                     each data bus can be shared by several ports.
UPA_ECC[16]          Bus for carrying error correction codes.
                     UPA_ECC<15:8> carries the ECC for
                     UPA_Databus<127:64>. UPA_ECC<7:0> carries the ECC
                     for UPA_Databus<63:0>.
UPA_ECC_Valid        ECC valid. A unidirectional signal from the System
                     Controller to each UPA port, driven by the System
                     Controller to indicate whether the ECC is valid for
                     the data on the data bus.
Address Bus Signals
UPA_Addressbus[36]   36-bit packet switched transaction request bus.
                     See packet format in FIGS. 9A, 9B, 9C.
UPA_Req_In[3]        Arbitration request lines for up to three other UPA
                     ports that might be sharing this UPA_Addressbus.
UPA_Req_Out          Arbitration request from this UPA port.
UPA_SC_Req_In        Arbitration request from System Controller.
UPA_Arb_Reset_L      Arbitration Reset, asserted at the same time that
                     UPA_Reset_L is asserted.
UPA_AddrValid        There is a separate, bidirectional, address valid
                     signal line between the System Controller and each
                     UPA port. It is driven by the Port which wins the
                     arbitration or by the System Controller when it
                     drives the address bus.
UPA_Data_Stall       Data stall signal, driven by the System Controller
                     to each UPA port to indicate, during transmission
                     of a data packet, whether there is a data stall in
                     between quad-words of a data packet.
Reply Signals
UPA_P_Reply[5]       Port's reply packet, driven by a UPA port directly
                     to the System Controller. There is a dedicated
                     UPA_P_Reply bus for each UPA port.
UPA_S_Reply[6]       System Controller's reply packet, driven by System
                     Controller directly to the UPA port. There is a
                     dedicated UPA_S_Reply bus for each UPA port.
Miscellaneous Signals
UPA_Port_ID[5]       Five bit hardwired UPA Port Identification.
UPA_Reset_L          Reset. Driven by System Controller at power-on and
                     on any fatal system reset.
UPA_Sys_Clk[2]       Differential UPA system clock, supplied by the
                     system clock to all UPA ports.
UPA_CPU_Clk[2]       Differential processor clock, supplied by the
                     system clock controller only to processor UPA
                     ports.
UPA_Speed[3]         Used only for processor UPA ports, this hardwired
                     three bit signal encodes the maximum speed at which
                     the UPA port can operate.
UPA_IO_Speed         Used only by IO UPA ports, this signal encodes the
                     maximum speed at which the UPA port can operate.
UPA_Ratio            Used only for processor UPA ports, this signal
                     encodes the ratio of the system clock to the
                     processor clock, and is used by the processor to
                     internally synchronize the system clock and
                     processor clock if it uses a synchronous internal
                     interface.
UPA_JTAG[5]          JTAG scan control signals, TDI, TMS, TCLK, TRST_L
                     and TDO. TDO is output by the UPA port, the others
                     are inputs.
UPA_Slave_Int_L      Interrupt, for slave-only UPA ports. This is a
                     dedicated line from the UPA port to the System
                     Controller.
UPA_XIR_L            XIR reset signal, asserted by the System Controller
                     to signal XIR reset.
______________________________________________________________________
A valid packet on the UPA address bus 114 is identified by the driver
(i.e., the UPA port 104 or the System Controller 110) asserting the
UPA_Addr_valid signal.
The System Controller 110 is connected to each UPA address bus 114 in the
system 100. The UPA ports 104 and System Controller 110 arbitrate for use
of each UPA address bus 114 using a distributed arbitration protocol. The
arbitration protocol is described in patent application Ser. No.
08/414,559, filed Mar. 31, 1995, which is hereby incorporated by
reference.
UPA ports do not communicate directly with other UPA ports on a shared UPA
address bus 114. Instead, when a requesting UPA port generates a request
packet that requests access to an addressed UPA port, the System
Controller 110 forwards a slave access to the addressed UPA port by
retransmitting the request packet and qualifying the destination UPA port
with its UPA_Addr_valid signal.
A UPA port also does not "snoop" on the UPA address bus to maintain cache
coherence. The System Controller 110 performs snooping on behalf of those
UPA ports whose respective UPA modules include cache memory using a
write-invalidate cache coherence protocol described below.
The UPA address bus 114 and UPA data bus 116 coupled to any UPA port 104
are independent. An address is associated with its data through ordering
rules discussed below.
The UPA data bus is a 128-bit quad-word bidirectional data bus, plus 16
additional ECC (error correction code) bits. A "word" is defined herein to
be a 32-bit, 4-byte datum. A quad-word consists of four words, or 16
bytes. In some embodiments, all or some of the data busses 116 in the
system 100 can be 64-bit double-word bidirectional data busses, plus 8
additional bits for ECC. The ECC bits are divided into two 8-bit halves
for the 128-bit wide data bus. Although the 64-bit wide UPA data bus has
half as many signal lines, it carries the same number of bytes per
transaction as the 128-bit wide UPA data bus, but in twice the number of
clock cycles. In the preferred embodiment, the smallest unit of coherent
data transfer is 64 bytes, requiring four transfers of 16 bytes during
four successive system clock cycles over the 128-bit UPA data bus.
A "master" UPA port, also called a UPA master port, is herein defined to be
one which can initiate data transfer transactions. All data processor UPA
modules must have a master UPA port 104.
Note that graphics devices, which may include some data processing
capabilities, typically have only a slave interface. Slave interfaces are
described below. For the purposes of this document, a "data processor" is
defined to be a programmable computer or data processing device (e.g., a
microprocessor) that both reads and writes data from and to main memory.
Most, but not necessarily all, "data processors" have an associated cache
memory. For instance, an I/O controller is a data processor and its UPA
port will be a master UPA port. However, in many cases an I/O controller
will not have a cache memory (or at least not a cache memory for storing
data in the coherence domain).
A caching UPA master port is a master UPA port for a data processor that
also has a coherent cache. The caching UPA master port participates in the
cache coherence protocol.
A "slave" UPA port is herein defined to be one which cannot initiate data
transfer transactions, but is the recipient of such transactions. A slave
port responds to requests from the System Controller. A slave port has an
address space associated with it for programmed I/O. A "slave port" within
a master UPA port (i.e., a slave interface within a master UPA port) also
handles copyback requests for cache blocks, and handles interrupt
transactions in a UPA port which contains a data processor.
Each set of 8 ECC bits carries Shigeo Kaneda's 64-bit SEC-DED-S4ED code. The
interconnect does not generate or check ECC. Each UPA port sourcing data
generates the corresponding ECC bits, and the UPA port receiving the data
checks the ECC bits. UPA ports with master capability support ECC.
Slave-only UPA ports containing a graphics framebuffer need not support ECC
(see the UPA_ECC_Valid signal).
The UPA data bus 116 is not a globally shared common data bus. As shown in
FIGS. 1 and 2, there may be more than one UPA data bus 116 in the system,
and the precise number is implementation specific. Data is always
transferred in units of 16 bytes per clock-cycle on the 128-bit wide UPA
data bus, and in units of 16 bytes per two clock-cycles on the 64-bit wide
UPA data bus.
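The transfer rates stated above translate directly into clock-cycle counts per packet. The following sketch is illustrative arithmetic only; the function name is hypothetical, and it ignores any stall cycles.

```python
def transfer_cycles(num_bytes, bus_width_bits):
    """Clock cycles needed to move `num_bytes` with no stalls.

    The 128-bit bus moves 16 bytes per cycle; the 64-bit bus moves
    16 bytes per two cycles, i.e. 8 bytes per cycle.
    """
    if bus_width_bits not in (64, 128):
        raise ValueError("UPA data busses are 64 or 128 bits wide")
    bytes_per_cycle = bus_width_bits // 8
    return -(-num_bytes // bytes_per_cycle)  # ceiling division
```

A 64-byte coherent block therefore takes four cycles on the 128-bit bus and eight cycles on the 64-bit bus.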
The size of each cache line in the preferred embodiment is 64 bytes, or
sixteen 32-bit words. As will be described below, 64 bytes is the minimum
unit of data transfer for all transactions involving the transfer of
cached data. That is, each data packet of cached data transferred via the
interconnect is 64 bytes. Transfers of non-cached data can transfer 1 to
16 bytes within a single quad-word transmission, qualified with a 16-bit
bytemask to indicate which bytes within the quad-word contain the data
being transferred.
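A 16-bit bytemask of the kind described above can be built as shown in the following sketch. Two assumptions are made that the text does not confirm: bit i of the mask corresponds to byte i of the quad-word, and the valid bytes are contiguous. The function name is hypothetical.

```python
def quadword_bytemask(offset, length):
    """16-bit mask with one bit set per valid byte in a quad-word.

    ASSUMPTIONS: bit i <-> byte i, and the valid bytes are contiguous,
    starting at `offset` and spanning `length` bytes.
    """
    if not (0 <= offset < 16 and 1 <= length <= 16 - offset):
        raise ValueError("byte range must lie within one 16-byte quad-word")
    return ((1 << length) - 1) << offset
```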
System Controller 110 schedules a data transfer on a UPA data bus 116 using
a signal herein called the S_REPLY. For block transfers, if
successive quadwords cannot be read or written in successive clock cycles
from memory, the UPA_Data_Stall signal is asserted by System
Controller 110 to the UPA port.
For coherent block read and copyback transactions of 64-byte data blocks,
the quad-word (16 bytes) addressed on physical address bits PA[5:4] is
delivered first, and the successive quad words are delivered in the wrap
order shown in Table 2. The addressed quad-word is delivered first so that
the requesting data processor can receive and begin processing the
addressed quad-word prior to receipt of the last quad-word in the
associated data block. In this way, latency associated with the cache
update transaction is reduced. Non-cached block read and block writes of
64 byte data blocks are always aligned on a 64-byte block boundary
(PA[5:4]=0x0).
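Selecting the first quad-word from physical address bits 5:4 amounts to a shift and a mask, as the following sketch shows. The function names are hypothetical, and the sketch deliberately covers only the first quad-word and the alignment test; the order of the remaining three quad-words is given by Table 2 and is not reproduced here.

```python
def first_quadword_index(pa):
    """Index (0-3) of the quad-word delivered first, taken from bits 5:4."""
    return (pa >> 4) & 0x3


def is_block_aligned(pa):
    """True when bits 5:4 are zero, as required for non-cached block
    reads and writes."""
    return first_quadword_index(pa) == 0
```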
Note that these 64-byte data packets are delivered without an attached
address, address tag, or transaction tag. Address information and data are
transmitted independently over independent busses. While this is
efficient, in order to match up incoming data packets with cache miss data
requests an ordering constraint must be applied: data packets must be
transmitted to a UPA port in the same order as the corresponding requests
within each master class. (There is no ordering requirement for data
requests in different master classes.) When this ordering constraint is
followed, each incoming data packet must be in response to the longest
outstanding cache miss transaction request for the corresponding master
class.
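Because data packets carry no address or transaction tag, a receiving port can match them to requests with nothing more than one FIFO per master class. The following sketch illustrates the idea; the class and method names are hypothetical, and it assumes the port is told, by means not modeled here, which master class an arriving packet belongs to.

```python
from collections import deque


class MissTracker:
    """Hypothetical per-port bookkeeping for outstanding read requests."""

    def __init__(self, num_master_classes):
        # One FIFO of outstanding request addresses per master class.
        self.outstanding = [deque() for _ in range(num_master_classes)]

    def issue_read(self, master_class, addr):
        self.outstanding[master_class].append(addr)

    def on_data_packet(self, master_class):
        # The packet answers the longest-outstanding request of its class.
        return self.outstanding[master_class].popleft()
```

Requests in different master classes sit in separate FIFOs, so a reply in one class never disturbs the matching in another.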
TABLE 2
______________________________________
Quad-word wrap order for block reads on the UPA data bus
Address
First Qword on
Second Qword