Claims
What is claimed is:
1. A multiprocessing computer system comprising:
a plurality of processing nodes interconnected through an interconnect
structure, wherein said plurality of processing nodes includes:
a first processing node configured to generate a read command to read a
designated memory location in a memory associated with a second processing
node,
wherein said second processing node is configured to receive said read command
and to responsively transmit a probe command to each of at least one
remaining processing node in said plurality of processing nodes regardless
of whether a copy of said designated memory location is cached within any
of said at least one remaining processing node, and to responsively
transmit a first read response to said first processing node including a
first data packet containing data from said designated memory location,
wherein said at least one remaining processing node includes a third
processing node configured to transmit, in response to said probe command
and upon detecting a modified copy of said designated memory location
cached within said third processing node, a memory cancel response to said
second processing node and a second read response to said first processing
node, said second read response including a second data packet containing
said modified copy of said designated memory location, and
wherein said memory cancel response causes said second processing node to
cancel transmission of said first read response to said first processing
node.
2. The multiprocessing computer system of claim 1, wherein said
interconnect structure includes a first plurality of dual-unidirectional
links.
3. The multiprocessing computer system as in claim 2, wherein each dual
unidirectional link in said first plurality of dual-unidirectional links
interconnects a respective pair of processing nodes from said plurality of
processing nodes.
4. The multiprocessing computer system according to claim 3, further
comprising a plurality of I/O devices, wherein said interconnect structure
further includes a second plurality of dual-unidirectional links, and
wherein each of said plurality of I/O devices is coupled to a respective
processing node through a corresponding one of said second plurality of
dual-unidirectional links.
5. The multiprocessing computer system of claim 4, wherein each
dual-unidirectional link in said first and said second plurality of
dual-unidirectional links performs packetized information transfer and
includes a pair of unidirectional buses comprising:
a transmission bus carrying a first plurality of binary packets; and
a receiver bus carrying a second plurality of binary packets.
6. The multiprocessing computer system according to claim 5 wherein each of
said plurality of processing nodes includes a plurality of interface
ports, wherein at least one of said plurality of interface ports in said
each of said plurality of processing nodes is connected to a corresponding
dual-unidirectional link selected from the group consisting of said first
and said second plurality of dual-unidirectional links.
7. The multiprocessing computer system of claim 1, further comprising:
a plurality of system memories; and
a plurality of memory buses, wherein each of said plurality of system
memories is coupled to a corresponding one of said plurality of processing
nodes through a respective one of said plurality of memory buses.
8. The multiprocessing computer system as in claim 7, wherein each of said
plurality of memory buses is bi-directional.
9. The multiprocessing computer system according to claim 7, wherein a
first memory from said plurality of system memories is coupled to said
second processing node, and wherein said first memory includes data
corresponding to said designated memory location.
10. The multiprocessing computer system according to claim 1, wherein said
memory cancel response causes said second processing node to cancel
transmission of said first read response when said second processing node
receives said memory cancel response prior to transmitting said first read
response.
11. The multiprocessing computer system as in claim 1, wherein said first
read response is transmitted concurrently with said probe command.
12. The multiprocessing computer system according to claim 11, wherein size
of said first data packet is dependent on a type of said read command.
13. The multiprocessing computer system of claim 11, wherein said probe
command causes said each of said at least one remaining processing node to
transmit a corresponding status response to said first processing node.
14. The multiprocessing computer system as in claim 13, wherein said each
of said at least one remaining processing node includes a respective
internal cache memory, and wherein said corresponding status response
comprises one of the following:
(a) a probe response including one of the following:
(i) a first indication of an absence of a cached copy of said designated
memory location in said respective internal cache memory, and
(ii) a second indication when said cached copy of said designated memory
location is in a shared state in said respective internal cache memory;
and
(b) a second read response from said third processing node that includes a
second data packet containing said modified copy of said designated memory
location cached in said respective internal cache memory.
15. The multiprocessing computer system according to claim 14, wherein said
third processing node is configured to transmit said second read response
concurrently with said memory cancel response.
16. The multiprocessing computer system of claim 13, wherein said second
processing node is configured to transmit a target done response to said
first processing node upon receiving said memory cancel response from said
third processing node, wherein said target done response is transmitted
regardless of whether said first read response is transmitted.
17. The multiprocessing computer system as in claim 16, wherein said first
processing node transmits a source done message to said second processing
node upon receiving said target done response and said corresponding
status response from said each of said at least one remaining processing
node.
18. The multiprocessing computer system of claim 17, wherein said source
done message signifies completion of execution of said read command
according to a predetermined data transfer protocol and allows said second
processing node to respond to a subsequent data transfer request addressed
to said designated memory location.
19. In a multiprocessing computer system comprising a plurality of
processing nodes interconnected through an interconnect structure, wherein
said plurality of processing nodes includes a first processing node, a
second processing node, and at least one remaining processing node
including a third processing node, a method for reading data from a memory
location in a memory associated with said second processing node, said
method comprising:
transmitting a read command from said first processing node to said second
processing node to read said data from said memory location;
sending a probe command from said second processing node to each of said at
least one remaining processing node in response to receiving said read
command by said second processing node regardless of whether a copy of
said memory location is cached within any of said at least one remaining
processing node, and transmitting a first read response to said first
processing node in response to said read command, said first read response
including a first data packet containing data from said memory location,
said third processing node transmitting, in response to said probe command
and upon detecting a modified copy of said memory location cached within
said third processing node, a memory cancel response to said second
processing node and a second read response to said first processing node,
said second read response including a second data packet containing said
modified copy of said memory location, and
said memory cancel response causing said second processing node to cancel
transmission of said first read response to said first processing node.
20. The method of claim 21, wherein said memory cancel response causes
said second processing node to cancel transmission of said first read
response when said second processing node receives said memory cancel
response prior to said transmitting said first read response.
21. The method according to claim 19, further comprising:
said probe command causing said each of said at least one remaining
processing node to transmit a corresponding status response to said first
processing node.
22. The method as in claim 21, wherein said corresponding status response
comprises one of the following:
(a) a probe response including one of the following:
(i) a first indication of an absence of a cached copy of said memory
location, and
(ii) a second indication when said cached copy of said memory location is
in a shared state; and
(b) a second read response from said third processing node including a
second data packet containing said modified copy of said memory location.
23. The method of claim 21, wherein said third processing node transmits
said corresponding status response concurrently with said memory cancel
response.
24. The method according to claim 21, further comprising:
said second processing node transmitting a target done response to said
first processing node upon receiving said memory cancel response from said
third processing node, wherein said target done response is transmitted
regardless of whether said first read response is transmitted.
25. The method as in claim 24, further comprising:
said first processing node transmitting a source done message to said
second processing node upon receiving said target done response and said
corresponding status response from said each of said at least one
remaining processing node.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention broadly relates to computer systems, and more
particularly, to a messaging scheme to accomplish cache-coherent data
transfers in a multiprocessing computing environment.
2. Description of the Related Art
Generally, personal computers (PCs) and other types of computer systems
have been designed around a shared bus system for accessing memory. One or
more processors and one or more input/output (I/O) devices are coupled to
memory through the shared bus. The I/O devices may be coupled to the
shared bus through an I/O bridge, which manages the transfer of
information between the shared bus and the I/O devices. The processors are
typically coupled directly to the shared bus or through a cache hierarchy.
Unfortunately, shared bus systems suffer from several drawbacks. For
example, since there are multiple devices attached to the shared bus, the
bus is typically operated at a relatively low frequency. Further, system
memory read and write cycles through the shared system bus take
substantially longer than information transfers involving a cache within a
processor or involving two or more processors. Another disadvantage of the
shared bus system is a lack of scalability to a larger number of devices. As
mentioned above, the amount of bandwidth is fixed (and may decrease if
adding additional devices reduces the operable frequency of the bus). Once
the bandwidth requirements of the devices attached to the bus (either
directly or indirectly) exceed the available bandwidth of the bus,
devices will frequently be stalled when attempting to access the bus.
Overall performance may be decreased unless a mechanism is provided to
conserve the limited system memory bandwidth.
A read or a write operation addressed to a non-cache system memory takes
more processor clock cycles than similar operations between two processors
or between a processor and its internal cache. The limitations on bus
bandwidth, coupled with the lengthy access time to read or write to a
system memory, negatively affect the computer system performance.
One or more of the above problems may be addressed using a distributed
memory system. A computer system employing a distributed memory system
includes multiple nodes. Two or more of the nodes are connected to memory,
and the nodes are interconnected using any suitable interconnect. For
example, each node may be connected to each other node using dedicated
lines. Alternatively, each node may connect to a fixed number of other
nodes, and transactions may be routed from a first node to a second node
to which the first node is not directly connected via one or more
intermediate nodes. The memory address space is assigned across the
memories in each node.
Nodes may additionally include one or more processors. The processors
typically include caches that store cache blocks of data read from the
memories. Furthermore, a node may include one or more caches external to
the processors. Since the processors and/or nodes may be storing cache
blocks accessed by other nodes, a mechanism for maintaining coherency
within the nodes is desired.
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a computer system
as described herein. The computer system may include multiple processing
nodes, two or more of which may be coupled to separate memories which may
form a distributed memory system. The processing nodes may include caches,
and the computer system may maintain coherency between the caches and the
distributed memory system.
In one embodiment, the present invention relates to a multiprocessing
computer system where the processing nodes are interconnected through a
plurality of dual unidirectional links. Each pair of unidirectional links
forms a coherent link structure that connects only two of the processing
nodes. One unidirectional link in the pair of links sends signals from a
first processing node to a second processing node connected through that
pair of unidirectional links. The other unidirectional link in the pair
carries a reverse flow of signals, i.e. it sends signals from the second
processing node to the first processing node. Thus, each unidirectional
link forms a point-to-point interconnect that is designed for
packetized information transfer. Communication between two processing
nodes may be routed through one or more of the remaining nodes in the system.
Each processing node may be coupled to a respective system memory through a
memory bus. The memory bus may be bidirectional. Each processing node
comprises at least one processor core and may optionally include a memory
controller for communicating with the respective system memory. Other
interface logic may be included in one or more processing nodes to allow
connectivity with various I/O devices through one or more I/O bridges.
In one embodiment, one or more I/O bridges may be coupled to their
respective processing nodes through a set of non-coherent dual
unidirectional links. These I/O bridges communicate with their host
processors through this set of non-coherent dual unidirectional links in
much the same way as two directly-linked processors communicate with each
other through a coherent dual unidirectional link.
In one embodiment, when a first processing node sends a read command to a
second processing node to read data from a designated memory location
associated with the second processing node, the second processing node
responsively transmits a probe command to all the remaining processing
nodes in the system. The probe command is transmitted regardless of
whether one or more of the remaining nodes have a copy of the data cached
in their respective cache memories. Each processing node that has a cached
copy of the designated memory location updates its cache tag associated
with that cached data to reflect the current status of the data. Each
processing node that receives a probe command sends, in return, a probe
response indicating whether that processing node has a cached copy of the
data. In the event that a processing node has a cached copy of the
designated memory location, the probe response from that processing node
further includes the state of the cached data--i.e. modified, shared etc.
The target processing node, i.e. the second processing node, sends a read
response to the source processing node, i.e. the first processing node.
This read response contains the data requested by the source node through
the read command. The first processing node acknowledges receipt of the
data by transmitting a source done response to the second processing node.
When the second processing node receives the source done response it
removes the read command (received from the first processing node) from
its command buffer queue. The second processing node may, at that point,
start to respond to a command to the same designated memory location. This
sequence of messaging is one step in maintaining cache-coherent system
memory reads in a multiprocessing computer system. The data read from the
designated memory location may be less than the whole cache block in size
if the read command specifies so.
Upon receiving the probe command, all of the remaining nodes check the
status of the cached copy, if any, of the designated memory location as
described before. In the event that a processing node, other than the
source and the target nodes, finds a cached copy of the designated memory
location that is in a modified state, that processing node responds with a
memory cancel response sent to the target node, i.e. the second processing
node. This memory cancel response causes the second processing node to
abort further processing of the read command, and to stop transmission of
the read response, if it hasn't sent the read response yet. All the other
remaining processing nodes still send their probe responses to the first
processing node. The processing node that has the modified cached data
sends that modified data to the first processing node through its own read
response. The messaging scheme involving probe responses and read
responses thus maintains cache coherency during a system memory read
operation.
The memory cancel response further causes the second processing node to
transmit a target done response to the first processing node regardless of
whether it earlier sent the read response to the first processing node.
The first processing node waits for all the responses to arrive--i.e. the
probe responses, the target done response, and the read response from the
processing node having the modified cached data--prior to completing the
data read cycle by sending a source done response to the second processing
node. In this embodiment, the memory cancel response conserves system
memory bandwidth by causing the second processing node to abort a
time-consuming memory read operation when a modified copy of the requested
data is cached at a different processing node. Reduced data transfer
latencies are thus achieved when it is observed that a data transfer
between two processing nodes over the high-speed dual unidirectional link
is substantially faster than a similar data transfer between a processing
node and a system memory that involves a relatively slow speed system
memory bus.
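The message sequence summarized above can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not the claimed implementation: the function name read_transaction, the message labels (RdCmd, Probe, ProbeResp, MemCancel, RdResponse, TgtDone, SrcDone), and the dictionary of cache states are hypothetical, and the memory cancel response is assumed to reach the target node before the target has sent its own read response.

```python
def read_transaction(source, target, caches, memory_value):
    """Return (data received by source, ordered message list) for one read.

    caches maps each remaining node to (state, value); state is one of
    "invalid", "shared" or "modified" (a hypothetical encoding).
    """
    messages = [("RdCmd", source, target)]
    cancelled = False
    data = None

    # The target probes every remaining node, whether or not it caches the data.
    for node in caches:
        messages.append(("Probe", target, node))

    for node, (state, value) in caches.items():
        if state == "modified":
            # The node holding the modified copy cancels the memory read
            # and supplies the data itself.
            messages.append(("MemCancel", node, target))
            messages.append(("RdResponse", node, source))
            cancelled = True
            data = value
        else:
            messages.append(("ProbeResp", node, source))

    if cancelled:
        # Target done is sent whether or not memory data was sent earlier.
        messages.append(("TgtDone", target, source))
    else:
        messages.append(("RdResponse", target, source))
        data = memory_value

    # The source ends the transaction once all responses have arrived.
    messages.append(("SrcDone", source, target))
    return data, messages


data, msgs = read_transaction(
    "A", "B", {"C": ("modified", 99), "D": ("invalid", None)}, memory_value=7
)
```

In this run the source receives the modified value held by node C, and the target never sends its own read response.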
BRIEF DESCRIPTION OF THE DRAWINGS
A better understanding of the present invention can be obtained when the
following detailed description of the preferred embodiment is considered
in conjunction with the following drawings, in which:
FIG. 1 is a block diagram of one embodiment of a computer system.
FIG. 2 shows in detail one embodiment of the interconnect between a pair of
processing nodes from FIG. 1.
FIG. 3 is a block diagram of one embodiment of an information packet.
FIG. 4 is a block diagram of one embodiment of an address packet.
FIG. 5 is a block diagram of one embodiment of a response packet.
FIG. 6 is a block diagram of one embodiment of a command packet.
FIG. 7 is a block diagram of one embodiment of a data packet.
FIG. 8 is a table illustrating exemplary packet types that may be employed
in the computer system of FIG. 1.
FIG. 9 is a diagram illustrating an example flow of packets corresponding
to a memory read operation.
FIG. 10A is a block diagram of one embodiment of a probe command packet.
FIG. 10B is a block diagram for one embodiment of the encoding for the
NextState field in the probe command packet of FIG. 10A.
FIG. 11A is a block diagram of one embodiment of a read response packet.
FIG. 11B shows in one embodiment the relationship between the Probe, Tgt
and Type fields of the read response packet of FIG. 11A.
FIG. 12 is a block diagram of one embodiment of a probe response packet.
FIG. 13 is a diagram illustrating an example flow of packets involving a
memory cancel response.
FIG. 14 is a diagram illustrating an example flow of packets showing a
messaging scheme that combines probe commands and memory cancel response.
FIG. 15 is an exemplary flowchart for the transactions involved in a memory
read operation.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Turning now to FIG. 1, one embodiment of a multiprocessing computer system
10 is shown. In the embodiment of FIG. 1, computer system 10 includes
several processing nodes 12A, 12B, 12C, and 12D. Each processing node is
coupled to a respective memory 14A-14D via a memory controller 16A-16D
included within each respective processing node 12A-12D. Additionally,
processing nodes 12A-12D include one or more interface ports 18, also
known as interface logic, to communicate among the processing nodes
12A-12D, and to also communicate between a processing node and a
corresponding I/O bridge. For example, processing node 12A includes
interface logic 18A for communicating with processing node 12B, interface
logic 18B for communicating with processing node 12C, and a third
interface logic 18C for communicating with yet another processing node
(not shown). Similarly, processing node 12B includes interface logic 18D,
18E, and 18F; processing node 12C includes interface logic 18G, 18H, and
18I; and processing node 12D includes interface logic 18J, 18K, and 18L.
Processing node 12D is coupled to communicate with an I/O bridge 20 via
interface logic 18L. Other processing nodes may communicate with other I/O
bridges in a similar fashion. I/O bridge 20 is coupled to an I/O bus 22.
The interface structure that interconnects processing nodes 12A-12D
includes a set of dual-unidirectional links. Each dual-unidirectional link
is implemented as a pair of packet-based unidirectional links to
accomplish high-speed packetized information transfer between any two
processing nodes in the computer system 10. Each unidirectional link may
be viewed as a pipelined, split-transaction interconnect. Each
unidirectional link 24 includes a set of coherent unidirectional lines.
Thus, each pair of unidirectional links may be viewed as comprising one
transmission bus carrying a first plurality of binary packets and one
receiver bus carrying a second plurality of binary packets. The content of
a binary packet will primarily depend on the type of operation being
requested and the processing node initiating the operation. One example of
a dual-unidirectional link structure is links 24A and 24B. The
unidirectional lines 24A are used to transmit packets from processing node
12A to processing node 12B and lines 24B are used to transmit packets from
processing node 12B to processing node 12A. Other sets of lines 24C-24H
are used to transmit packets between their corresponding processing nodes
as illustrated in FIG. 1.
A similar dual-unidirectional link structure may be used to interconnect a
processing node and its corresponding I/O device, or a graphic device or
an I/O bridge as is shown with respect to the processing node 12D. A
dual-unidirectional link may be operated in a cache coherent fashion for
communication between processing nodes or in a non-coherent fashion for
communication between a processing node and an external I/O or graphic
device or an I/O bridge. It is noted that a packet to be transmitted from
one processing node to another may pass through one or more remaining
nodes. For example, a packet transmitted by processing node 12A to
processing node 12D may pass through either processing node 12B or
processing node 12C in the arrangement of FIG. 1. Any suitable routing
algorithm may be used. Other embodiments of computer system 10 may include
more or fewer processing nodes than those shown in FIG. 1.
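Since any suitable routing algorithm may be used, the following is merely one possible sketch of routing a packet through intermediate nodes. The LINKS table is an assumption inferred from the statement that a packet from node 12A to node 12D may pass through node 12B or node 12C; the function name route and the choice of breadth-first search are likewise illustrative only.

```python
from collections import deque

# Assumed point-to-point links between nodes (hypothetical topology).
LINKS = {
    "12A": ["12B", "12C"],
    "12B": ["12A", "12D"],
    "12C": ["12A", "12D"],
    "12D": ["12B", "12C"],
}


def route(src, dst, links=LINKS):
    """Breadth-first search for a shortest sequence of nodes from src to dst."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in links[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path exists


path = route("12A", "12D")
```

With this link table the search returns the path through node 12B, because 12B is listed first among the neighbors of 12A. A path through 12C would be equally short.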
Processing nodes 12A-12D, in addition to a memory controller and interface
logic, may include other circuit elements such as one or more processor
cores, an internal cache memory, a bus bridge, a graphics logic, a bus
controller, a peripheral device controller, etc. Broadly speaking, a
processing node comprises at least one processor and may optionally
include a memory controller for communicating with a memory and other
logic as desired. Further, each circuit element in a processing node may
be coupled to one or more interface ports depending on the functionality
being performed by the processing node. For example, some circuit elements
may only couple to the interface logic that connects an I/O bridge to the
processing node, some other circuit elements may only couple to the
interface logic that connects two processing nodes, etc. Other
combinations may be easily implemented as desired.
Memories 14A-14D may comprise any suitable memory devices. For example, a
memory 14A-14D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous
DRAMs (SDRAMs), static RAM, etc. The memory address space of the computer
system 10 is divided among memories 14A-14D. Each processing node 12A-12D
may include a memory map used to determine which addresses are mapped to
which memories 14A-14D, and hence to which processing node 12A-12D a
memory request for a particular address should be routed. In one
embodiment, the coherency point for an address within computer system 10
is the memory controller 16A-16D coupled to the memory that is storing the
bytes corresponding to the address. In other words, the memory controller
16A-16D is responsible for ensuring that each memory access to the
corresponding memory 14A-14D occurs in a cache coherent fashion. Memory
controllers 16A-16D may comprise control circuitry for interfacing to
memories 14A-14D. Additionally, memory controllers 16A-16D may include
request queues for queuing memory requests.
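A memory map of the kind described above might be sketched as follows. The equal-sized contiguous regions, the 40-bit address width, and the names MEMORY_MAP and home_node are assumptions chosen for illustration; an actual system may divide the address space differently.

```python
ADDRESS_BITS = 40  # assumed address width
NODES = ["12A", "12B", "12C", "12D"]
REGION = (1 << ADDRESS_BITS) // len(NODES)

# Each entry is (base address, limit address, owning processing node).
MEMORY_MAP = [
    (i * REGION, (i + 1) * REGION - 1, node) for i, node in enumerate(NODES)
]


def home_node(address):
    """Return the node whose memory controller is the coherency point."""
    for base, limit, node in MEMORY_MAP:
        if base <= address <= limit:
            return node
    raise ValueError("address outside the mapped address space")
```

A memory request for a given address would then be routed to the node returned by home_node.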
Generally, interface logic 18A-18L may comprise a variety of buffers for
receiving packets from one unidirectional link and for buffering packets
to be transmitted upon another unidirectional link. Computer system 10 may
employ any suitable flow control mechanism for transmitting packets. For
example, in one embodiment, each transmitting interface logic 18 stores a
count of the number of each type of buffers within the receiving interface
logic at the other end of the link to which the transmitting interface
logic is connected. The interface logic does not transmit a packet unless
the receiving interface logic has a free buffer to store the packet. As a
receiving buffer is freed by routing a packet onward, the receiving
interface logic transmits a message to the sending interface logic to
indicate that the buffer has been freed. Such a mechanism may be referred
to as a "coupon-based" system.
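The "coupon-based" mechanism can be illustrated with a minimal sketch. The class name Transmitter, its method names, and the buffer types "command" and "data" are hypothetical; only the counting behavior follows the description above.

```python
class Transmitter:
    """Transmitting interface logic that tracks free receiver buffers."""

    def __init__(self, free_buffers):
        # Count of free buffers of each type at the receiving interface logic.
        self.credits = dict(free_buffers)

    def send(self, packet_type):
        """Send a packet only if the receiver has a free buffer for it."""
        if self.credits.get(packet_type, 0) == 0:
            return False  # the packet must wait
        self.credits[packet_type] -= 1
        return True

    def buffer_freed(self, packet_type):
        """Called when the receiver reports that it has freed a buffer."""
        self.credits[packet_type] += 1


tx = Transmitter({"command": 1, "data": 2})
first = tx.send("command")    # consumes the only command buffer
second = tx.send("command")   # refused: no free command buffer remains
tx.buffer_freed("command")    # the receiver routed the packet onward
third = tx.send("command")    # allowed again
```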
Turning next to FIG. 2, a block diagram illustrating processing nodes 12A
and 12B is shown to illustrate in more detail one embodiment of the dual
unidirectional link structure connecting the processing nodes 12A and 12B.
In the embodiment of FIG. 2, lines 24A (the unidirectional link 24A)
include a clock line 24AA, a control line 24AB, and a command/address/data
bus 24AC. Similarly, lines 24B (the unidirectional link 24B) include a
clock line 24BA, a control line 24BB, and a command/address/data bus 24BC.
A clock line transmits a clock signal that indicates a sample point for its
corresponding control line and the command/address/data bus. In one
particular embodiment, data/control bits are transmitted on each edge
(i.e. rising edge and falling edge) of the clock signal. Accordingly, two
data bits per line may be transmitted per clock cycle. The amount of time
employed to transmit one bit per line is referred to herein as a "bit
time". The above-mentioned embodiment includes two bit times per clock
cycle. A packet may be transmitted across two or more bit times. Multiple
clock lines may be used depending upon the width of the
command/address/data bus. For example, two clock lines may be used for a
32 bit command/address/data bus (with one half of the command/address/data
bus referenced to one of the clock lines and the other half of the
command/address/data bus and the control line referenced to the other one
of the clock lines).
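The bit time arithmetic above may be worked through in a short sketch. The function names are hypothetical, and the cycle count assumes a packet that begins on the first bit time of a clock cycle.

```python
import math

BIT_TIMES_PER_CLOCK = 2  # one bit time on each clock edge


def clock_cycles(bit_times):
    """Clock cycles needed to transmit a packet of the given bit times."""
    return math.ceil(bit_times / BIT_TIMES_PER_CLOCK)


def bits_per_clock(bus_width):
    """Bits carried by the command/address/data bus in one clock cycle."""
    return bus_width * BIT_TIMES_PER_CLOCK
```

Under these assumptions, a packet of eight bit times occupies four clock cycles, and a 32 bit command/address/data bus carries 64 bits per clock cycle.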
The control line indicates whether the data transmitted upon the
command/address/data bus is a bit time of a control packet or a bit
time of a data packet. The control line is asserted to indicate a control
packet, and deasserted to indicate a data packet. Certain control packets
indicate that a data packet follows. The data packet may immediately
follow the corresponding control packet. In one embodiment, other control
packets may interrupt the transmission of a data packet. Such an
interruption may be performed by asserting the control line for a number
of bit times during transmission of the data packet and transmitting the
bit times of the control packet while the control line is asserted.
Control packets that interrupt a data packet may not indicate that a data
packet will be following.
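The role of the control line can be sketched from the receiver's side. In this illustrative model, each sample is a hypothetical pair of the control line value and the value on the command/address/data bus for one bit time; the function name split_bit_times and the sample values are assumptions.

```python
def split_bit_times(samples):
    """Separate control-packet bit times from data-packet bit times."""
    control, data = [], []
    for control_line, value in samples:
        # Asserted control line: control packet; deasserted: data packet.
        (control if control_line else data).append(value)
    return control, data


# A data packet (0xA0..0xA3) interrupted midway by a two-bit-time control packet.
samples = [
    (1, 0x10), (1, 0x11),   # control packet announcing the data packet
    (0, 0xA0), (0, 0xA1),   # first half of the data packet
    (1, 0x20), (1, 0x21),   # interrupting control packet
    (0, 0xA2), (0, 0xA3),   # remainder of the data packet
]
control, data = split_bit_times(samples)
```

The data bit times are recovered in order even though a control packet interrupted the transfer.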
The command/address/data bus comprises a set of lines for transmitting the
data, command, response and address bits. In one embodiment, the
command/address/data bus may comprise 8, 16, or 32 lines. Each processing
node or I/O bridge may employ any one of the supported numbers of lines
according to design choice. Other embodiments may support other sizes of
command/address/data bus as desired.
According to one embodiment, the command/address/data bus lines and the
clock line may carry inverted data (i.e. a logical one is represented as a
low voltage on the line, and a logical zero is represented as a high
voltage). Alternatively, these lines may carry non-inverted data (in which
a logical one is represented as a high voltage on the line, and logical
zero is represented as a low voltage). A suitable positive and negative
logic combination may also be implemented.
Turning now to FIGS. 3-7, exemplary packets employed in a cache-coherent
communication (i.e., the communication between processing nodes) according
to one embodiment of computer system 10 are shown. FIGS. 3-6 illustrate
control packets and FIG. 7 illustrates a data packet. Other embodiments
may employ different packet definitions. The control packets and the data
packet may collectively be referred to as binary packets. Each packet is
illustrated as a series of bit times enumerated under the "bit time"
heading. The bit times of the packet are transmitted according to the bit
time order listed. FIGS. 3-7 illustrate packets for an eight-bit
command/address/data bus implementation. Accordingly, eight bits (numbered
seven through zero) of control or data information are transferred over the
eight-bit command/address/data bus during each bit time. Bits for which no
value is provided in the figures may either be reserved for a given
packet, or may be used to transmit packet-specific information.
FIG. 3 illustrates an information packet (info packet) 30. Info packet 30
comprises two bit times on an eight bit link. The command encoding is
transmitted during bit time one, and comprises six bits--denoted by the
command field CMD[5:0]--in the present embodiment. An exemplary command
field encoding is shown in FIG. 8. Each of the other control packets shown
in FIGS. 4, 5 and 6 includes the command encoding in the same bit
positions during bit time 1. Info packet 30 may be used to transmit
messages between processing nodes when the messages do not include a
memory address.
FIG. 4 illustrates an address packet 32. Address packet 32 comprises eight
bit times on an eight bit link. The command encoding is transmitted during
bit time 1, along with a portion of the destination node number denoted by
the field DestNode. The remainder of the destination node number and the
source node number (SrcNode) are transmitted during bit time two. A node
number unambiguously identifies one of the processing nodes 12A-12D within
computer system 10, and is used to route the packet through computer
system 10. Additionally, the source of the packet may assign a source tag
(SrcTag) transmitted during bit times 2 and 3. The source tag identifies
packets corresponding to a particular transaction initiated by the source
node (i.e. each packet corresponding to a particular transaction includes
the same source tag). Thus, for example, when the SrcTag field is of 7-bit
length, the corresponding source node can have up to 128 (2^7)
different transactions in progress in the system. Responses from other
nodes in the system are associated with their corresponding transactions
through the SrcTag field in the responses. Bit times four through eight
are used to transmit the memory address--denoted by the address field Addr
[39:0]--affected by the transaction. Address packet 32 may be used to
initiate a transaction, e.g., a read or a write transaction.
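The role of the source tag may be illustrated by the following minimal Python sketch. It is not part of the described embodiment: the class name SrcTagTracker, its methods, and the free-list allocation policy are assumptions made for illustration only. The sketch shows how a 7-bit SrcTag bounds a source node to 128 transactions in progress, and how a response is matched to its transaction by tag alone.

```python
class SrcTagTracker:
    """Hypothetical bookkeeping for one source node's outstanding transactions."""

    def __init__(self, tag_bits=7):
        # A 7-bit SrcTag field yields 2**7 = 128 distinct tags.
        self.capacity = 1 << tag_bits
        self.free_tags = list(range(self.capacity))
        self.outstanding = {}  # SrcTag -> description of the transaction

    def start(self, description):
        """Assign a free SrcTag to a new transaction and return the tag."""
        if not self.free_tags:
            raise RuntimeError("all SrcTag values are in use")
        tag = self.free_tags.pop(0)
        self.outstanding[tag] = description
        return tag

    def complete(self, tag):
        """Match a response to its transaction by SrcTag and release the tag."""
        description = self.outstanding.pop(tag)
        self.free_tags.append(tag)
        return description


tracker = SrcTagTracker(tag_bits=7)
read_tag = tracker.start("read of block A")
write_tag = tracker.start("write of block B")
# Responses may arrive in any order; the tag identifies the transaction.
first_done = tracker.complete(write_tag)
second_done = tracker.complete(read_tag)
```

Every packet of a given transaction would carry the same tag, so a single table lookup suffices in this sketch.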
FIG. 5 illustrates a response packet 34. Response packet 34 includes the
command encoding, destination node number, source node number, and source
tag similar to the address packet 32. The SrcNode (source node) field
preferably identifies the node that originated the transaction that
prompted the generation of the response packet. The DestNode (destination
node) field, on the other hand, identifies the processing node--the source
node or the target node (described later)--that is the final receiver of
the response packet. Various types of response packets may include
additional information. For example, a read response packet, described
later with reference to FIG. 11A, may indicate the amount of read data
provided in a following data packet. Probe responses, described later with
reference to FIG. 12, may indicate whether or not a hit was detected for
the requested cache block. Generally, response packet 34 is used for
commands that do not require transmission of the address during the
carrying out of a transaction. Furthermore, response packet 34 may be used
to transmit positive acknowledgement packets to terminate a transaction.
FIG. 6 shows an example of a command packet 36. As mentioned earlier, each
unidirectional link is a pipelined, split-transaction interconnect in
which transactions are tagged by the source node and responses can return
to the source node out of order depending on the routing of packets at any
given instant. A source node sends a command packet to initiate a
transaction. Source nodes contain address-mapping tables and place the
target node number (TgtNode field) within the command packet to identify
the processing node that is the destination of the command packet 36. The
command packet 36 has CMD, SrcNode, SrcTag and Addr fields that are
similar to the ones shown and described with reference to the address
packet 32 (FIG. 4).
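The address-mapping lookup performed by a source node may be sketched as follows. The sketch is illustrative only: the table contents, the even four-way split of the 40-bit address space, and the names ADDRESS_MAP and target_node are hypothetical and are not taken from the described embodiment.

```python
# Hypothetical address map: (first address, last address, node number).
# The 40-bit address space is split evenly across four nodes purely
# for illustration.
ADDRESS_MAP = [
    (0x00_0000_0000, 0x3F_FFFF_FFFF, 0),
    (0x40_0000_0000, 0x7F_FFFF_FFFF, 1),
    (0x80_0000_0000, 0xBF_FFFF_FFFF, 2),
    (0xC0_0000_0000, 0xFF_FFFF_FFFF, 3),
]


def target_node(addr):
    """Return the node number to place in the TgtNode field for addr."""
    for first, last, node in ADDRESS_MAP:
        if first <= addr <= last:
            return node
    raise ValueError("address is outside the 40-bit mapped range")


tgt = target_node(0x40_0000_1000)  # falls in the second range of the table
```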
One distinct feature of the command packet 36 is the presence of the Count
field. In a non-cacheable read or write operation, the size of data may be
less than the size of a cache block. Thus, for example, a non-cacheable
read operation may request just one byte or one quad word (64-bit length)
of data from a system memory or an I/O device. This type of sized read or
write operation is facilitated with the help of the Count field. The Count
field, in the present example, is three bits long. Hence, a data element
of a given size (byte, quad word, etc.) may be transferred a maximum of
eight times. For example, on an 8-bit link, when the value of the count field
is zero (binary 000), the command packet 36 may indicate transfer of just
one byte of data over one bit time; whereas, when the value of the count
field is seven (binary 111), a quad word, i.e. eight bytes, may be
transferred for a total of eight bit times. The CMD field may identify
when a cache block is being transferred. In that case, the count field
will have a fixed value: seven when the cache block is 64 bytes in
size, because eight quad words need to be transferred to read or
write a cache block. In the case of an 8-bit wide unidirectional link,
this may require transfer of eight complete data packets (FIG. 7) over 64
bit times. Preferably, the data packet (described later with reference to
FIG. 7) may immediately follow a write command packet or a read response
packet (described later) and the data bytes may be transferred in an
increasing address order. Data transfers of a single byte or a quad word
may not cross a naturally aligned 8 or 64 byte boundary, respectively.
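The arithmetic implied by the Count field may be sketched in Python as follows. The function name sized_transfer and its parameters are hypothetical. The sketch assumes that a Count value of n denotes n + 1 elements, as in the examples above, and that a transfer smaller than the link width still occupies one whole bit time.

```python
def sized_transfer(count, element_bytes, link_bits=8):
    """Return (total_bytes, bit_times) for a sized transfer.

    count: value of the three-bit Count field (0..7); count + 1 elements move.
    element_bytes: 1 for a byte, 8 for a quad word.
    link_bits: width of the unidirectional link in bits (8, 16 or 32).
    """
    if not 0 <= count <= 7:
        raise ValueError("Count is a three-bit field")
    total_bytes = (count + 1) * element_bytes
    # Ceiling division: a partially filled bit time still counts as one.
    bit_times = -(-total_bytes * 8 // link_bits)
    return total_bytes, bit_times


one_byte = sized_transfer(0, 1)        # a single byte on an 8-bit link
eight_bytes = sized_transfer(7, 1)     # eight bytes, i.e. one quad word
cache_block = sized_transfer(7, 8)     # eight quad words, a 64-byte block
```

Under these assumptions a 64-byte cache block occupies 64 bit times on an 8-bit link, which agrees with the figure given above.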
FIG. 7 illustrates a data packet 38. Data packet 38 includes eight bit
times on an eight bit link in the embodiment of FIG. 7. Data packet 38 may
comprise a 64-byte cache block, in which case it would take 64 bit times
(on an 8-bit link) to complete the cache block transfer. Other embodiments
may define a cache block to be of a different size, as desired.
Additionally, data may be transmitted in less than a cache block size for
non-cacheable reads and writes as mentioned earlier with reference to the
command packet 36 (FIG. 6). Data packets for transmitting data less than
cache block size require fewer bit times.
FIGS. 3-7 illustrate packets for an eight-bit link. Packets for 16 and 32
bit links may be formed by concatenating consecutive bit times illustrated
in FIGS. 3-7. For example, bit time one of a packet on a 16-bit link may
comprise the information transmitted during bit times one and two on the
eight-bit link. Similarly, bit time one of the packet on a 32-bit link may
comprise the information transmitted during bit times one through four on
the eight-bit link. Formulas 1 and 2 below illustrate the formation of bit
time one of a 16 bit link and bit time one of a 32 bit link in terms of
bit times for an eight bit link.
$BT1_{16}[15:0] = BT2_{8}[7:0] \parallel BT1_{8}[7:0]$ (1)
$BT1_{32}[31:0] = BT4_{8}[7:0] \parallel BT3_{8}[7:0] \parallel BT2_{8}[7:0] \parallel BT1_{8}[7:0]$ (2)
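The concatenation expressed by Formulas 1 and 2 may be sketched in Python as follows. The function name widen_bit_times is hypothetical. The sketch assumes that the packet length in eight-bit bit times is a multiple of the number of bytes per wide bit time; per the formulas, the earliest eight-bit bit time occupies the least significant byte.

```python
def widen_bit_times(bit_times_8, link_bits):
    """Pack a list of 8-bit bit times into bit times for a wider link."""
    if link_bits not in (16, 32):
        raise ValueError("only 16-bit and 32-bit links are sketched here")
    group = link_bits // 8  # number of 8-bit bit times per wide bit time
    if len(bit_times_8) % group:
        raise ValueError("packet length is not a multiple of the group size")
    widened = []
    for start in range(0, len(bit_times_8), group):
        word = 0
        for offset, byte in enumerate(bit_times_8[start:start + group]):
            # The earlier 8-bit bit time lands in the lower-order byte.
            word |= (byte & 0xFF) << (8 * offset)
        widened.append(word)
    return widened


packet_8 = [0x11, 0x22, 0x33, 0x44]            # BT1..BT4 on an 8-bit link
packet_16 = widen_bit_times(packet_8, 16)      # two 16-bit bit times
packet_32 = widen_bit_times(packet_8, 32)      # one 32-bit bit time
```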
Turning now to FIG. 8, a table 40 is shown illustrating commands employed
for one exemplary embodiment of the dual-unidirectional link structure
within computer system 10. Table 40 includes a command code column
illustrating the command encodings (the CMD field) assigned to each
command, a command column naming the command, and a packet type column
indicating which of the command packets 30-38 (FIGS. 3-7) is used for that
command. A brief functional explanation for some of the commands in FIG. 8
is given below.
A read transaction is initiated using one of the Rd(Sized), RdBlk, RdBlkS
or RdBlkMod commands. The sized read command, Rd(Sized), is used for
non-cacheable reads or reads of data other than a cache block in size.