Claims
What is claimed is:
1. A multiprocessing computer system comprising:
a plurality of processing nodes interconnected through an interconnect structure, wherein said plurality of processing nodes includes:
a first processing node configured to initiate a first read operation to read data from a designated memory location;
a second processing node configured to respond to said first read operation by initiating a second read operation to read and transfer said data from said designated memory location to said first processing node; and
a third processing node configured to transmit a memory cancel response to said second processing node upon detecting within said third processing node a modified copy of said designated memory location, and wherein said memory cancel response
causes said second processing node to abort further processing of said second read operation;
wherein said second processing node is configured to transfer said data read during said second read operation by transmitting a first read response to said first processing node, and wherein said memory cancel response causes said second
processing node to cancel transmission of said first read response when said second processing node receives said memory cancel response prior to transmitting said first read response.
2. The multiprocessing computer system as in claim 1, wherein said interconnect structure includes a first plurality of dual-unidirectional links.
3. The multiprocessing computer system of claim 2, wherein each dual-unidirectional link in said first plurality of dual-unidirectional links interconnects a respective pair of processing nodes from said plurality of processing nodes.
4. The multiprocessing computer system according to claim 3, further comprising a plurality of I/O devices, wherein said interconnect structure further includes a second plurality of dual-unidirectional links, and wherein each of said plurality
of I/O devices is coupled to a respective processing node through a corresponding one of said second plurality of dual-unidirectional links.
5. The multiprocessing computer system of claim 4, wherein each dual-unidirectional link in said first and said second plurality of dual-unidirectional links performs packetized information transfer and includes a pair of unidirectional buses
comprising:
a transmission bus carrying a first plurality of binary packets; and
a receiver bus carrying a second plurality of binary packets.
6. The multiprocessing computer system of claim 5, wherein each of said plurality of processing nodes includes:
a plurality of circuit elements comprising:
a processor core,
a cache memory,
a memory controller,
a bus bridge,
a graphics logic,
a bus controller, and
a peripheral device controller; and
a plurality of interface ports, wherein each of said plurality of circuit elements is coupled to at least one of said plurality of interface ports.
7. The multiprocessing computer system according to claim 6, wherein at least one of said plurality of interface ports in said each of said plurality of processing nodes is coupled to a corresponding dual-unidirectional link selected from the
group consisting of said first and said second plurality of dual-unidirectional links.
8. The multiprocessing computer system as in claim 1, further comprising:
a plurality of system memories; and
a plurality of memory buses, wherein each of said plurality of system memories is coupled to a corresponding one of said plurality of processing nodes through a respective one of said plurality of memory buses.
9. The multiprocessing computer system of claim 8, wherein each of said plurality of memory buses is bi-directional.
10. The multiprocessing computer system according to claim 8, wherein a first memory from said plurality of system memories is coupled to said second processing node, wherein said first memory includes said designated memory location, and
wherein said second processing node accesses said first memory during said second read operation.
11. The multiprocessing computer system as in claim 1, wherein said second processing node is configured to respond to said first read operation by transmitting a probe command to said third processing node.
12. The multiprocessing computer system of claim 11, wherein said second processing node is configured to transmit said probe command regardless of whether a copy of said designated memory location is cached within said third processing node.
13. The multiprocessing computer system according to claim 11, wherein said probe command causes said third processing node to transmit a read response to said first processing node.
14. The multiprocessing computer system as in claim 13, wherein said read response includes a data packet containing said modified copy of said designated memory location cached within said third processing node.
15. The multiprocessing computer system as in claim 1, wherein a size of said data read during said second read operation is dependent on a type of said first read operation.
16. The multiprocessing computer system according to claim 1, wherein said first read response includes a data packet containing said data read during said second read operation.
17. The multiprocessing computer system as in claim 1, wherein said third processing node is configured to transmit a second read response concurrently with said memory cancel response, wherein said second read response is transmitted to said
first processing node.
18. The multiprocessing computer system of claim 17, wherein said second read response includes a data packet containing said modified copy of said designated memory location cached within said third processing node.
19. The multiprocessing computer system according to claim 18, wherein said second processing node is configured to transmit a target done response to said first processing node upon receiving said memory cancel response from said third
processing node, wherein said target done response is transmitted regardless of whether said first read response is transmitted.
20. The multiprocessing computer system as in claim 19, wherein said first processing node is configured to transmit a source done message to said second processing node upon receiving said target done response and said second read response.
21. The multiprocessing computer system of claim 20, wherein said source done message signifies completion of said first read operation according to a predetermined data transfer protocol and allows said second processing node to respond to a
subsequent data transfer operation involving said designated memory location.
22. In a multiprocessing computer system comprising a plurality of processing nodes interconnected through an interconnect structure, wherein said plurality of processing nodes includes a first processing node, a second processing node, and a
third processing node, a method for selectively reading a content of a memory location in a memory associated with said second processing node, said method comprising:
initiating a first read operation by said first processing node to read said content of said memory location;
further initiating a second read operation by said second processing node to respond to said first read operation, wherein said second processing node reads and transfers said content of said memory location to said first processing node during
said second read operation, the second read operation including a first read response from said second processing node to said first processing node, wherein said first read response includes a first data packet for said content of said memory location;
said third processing node transmitting a memory cancel response to said second processing node upon detecting within said third processing node a modified copy of said memory location;
said memory cancel response causing said second processing node to abort further processing of said second read operation; and
said memory cancel response causing said second processing node to cancel transmission of said first read response when said second processing node receives said memory cancel response prior to said transmitting said first read response.
23. The method as in claim 22, wherein a size of said first data packet is dependent on a type of said first read operation.
24. The method according to claim 22, further comprising:
said third processing node transmitting a second read response concurrently with said memory cancel response, wherein said second read response is transmitted to said first processing node.
25. The method as in claim 24, wherein said second read response includes a second data packet containing said modified copy of said memory location cached within said third processing node.
26. The method of claim 25, further comprising:
said second processing node transmitting a target done response to said first processing node upon receiving said memory cancel response from said third processing node, wherein said target done response is transmitted regardless of whether said
first read response is transmitted.
27. The method according to claim 26, further comprising:
said first processing node transmitting a source done message to said second processing node upon receiving said target done response and said second read response.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention broadly relates to computer systems, and more particularly, to a messaging scheme to accomplish cache-coherent data transfers in a multiprocessing computing environment.
2. Description of the Related Art
Generally, personal computers (PCs) and other types of computer systems have been designed around a shared bus system for accessing memory. One or more processors and one or more input/output (I/O) devices are coupled to memory through the
shared bus. The I/O devices may be coupled to the shared bus through an I/O bridge, which manages the transfer of information between the shared bus and the I/O devices. The processors are typically coupled directly to the shared bus or through a cache
hierarchy.
Unfortunately, shared bus systems suffer from several drawbacks. For example, since there are multiple devices attached to the shared bus, the bus is typically operated at a relatively low frequency. Further, system memory read and write cycles
through the shared system bus take substantially longer than information transfers involving a cache within a processor or involving two or more processors. Another disadvantage of the shared bus system is a lack of scalability to a larger number of
devices. The amount of available bandwidth is fixed (and may decrease if adding additional devices reduces the operable frequency of the bus). Once the bandwidth requirements of the devices attached to the bus (either directly or indirectly)
exceed the available bandwidth of the bus, devices will frequently be stalled when attempting to access the bus. Overall performance may be decreased unless a mechanism is provided to conserve the limited system memory bandwidth.
A read or a write operation addressed to a non-cache system memory takes more processor clock cycles than similar operations between two processors or between a processor and its internal cache. The limitations on bus bandwidth, coupled with the
lengthy access time to read or write to a system memory, negatively affect the computer system performance.
One or more of the above problems may be addressed using a distributed memory system. A computer system employing a distributed memory system includes multiple nodes. Two or more of the nodes are connected to memory, and the nodes are
interconnected using any suitable interconnect. For example, each node may be connected to each other node using dedicated lines. Alternatively, each node may connect to a fixed number of other nodes, and transactions may be routed from a first node to
a second node to which the first node is not directly connected via one or more intermediate nodes. The memory address space is assigned across the memories in each node.
Nodes may additionally include one or more processors. The processors typically include caches that store cache blocks of data read from the memories. Furthermore, a node may include one or more caches external to the processors. Since the
processors and/or nodes may be storing cache blocks accessed by other nodes, a mechanism for maintaining coherency within the nodes is desired.
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a computer system as described herein. The computer system may include multiple processing nodes, two or more of which may be coupled to separate memories which may form a distributed
memory system. The processing nodes may include caches, and the computer system may maintain coherency between the caches and the distributed memory system.
In one embodiment, the present invention relates to a multiprocessing computer system where the processing nodes are interconnected through a plurality of dual unidirectional links. Each pair of unidirectional links forms a coherent link
structure that connects only two of the processing nodes. One unidirectional link in the pair of links sends signals from a first processing node to a second processing node connected through that pair of unidirectional links. The other unidirectional
link in the pair carries a reverse flow of signals, i.e. it sends signals from the second processing node to the first processing node. Thus, each unidirectional link forms a point-to-point interconnect that is designed for packetized information
transfer. Communication between two processing nodes may be routed through one or more of the remaining nodes in the system.
Each processing node may be coupled to a respective system memory through a memory bus. The memory bus may be bidirectional. Each processing node comprises at least one processor core and may optionally include a memory controller for
communicating with the respective system memory. Other interface logic may be included in one or more processing nodes to allow connectivity with various I/O devices through one or more I/O bridges.
In one embodiment, one or more I/O bridges may be coupled to their respective processing nodes through a set of non-coherent dual unidirectional links. These I/O bridges communicate with their host processors through this set of non-coherent
dual unidirectional links in much the same way as two directly-linked processors communicate with each other through a coherent dual unidirectional link.
In one embodiment, when a first processing node sends a read command to a second processing node to read data from a designated memory location associated with the second processing node, the second processing node responsively transmits a probe
command to all the remaining processing nodes in the system. The probe command is transmitted regardless of whether one or more of the remaining nodes have a copy of the data cached in their respective cache memories. Each processing node that has a
cached copy of the designated memory location updates its cache tag associated with that cached data to reflect the current status of the data. Each processing node that receives a probe command sends, in return, a probe response indicating whether that
processing node has a cached copy of the data. In the event that a processing node has a cached copy of the designated memory location, the probe response from that processing node further includes the state of the cached data--i.e. modified, shared,
etc.
The target processing node, i.e. the second processing node, sends a read response to the source processing node, i.e. the first processing node. This read response contains the data requested by the source node through the read command. The
first processing node acknowledges receipt of the data by transmitting a source done response to the second processing node. When the second processing node receives the source done response it removes the read command (received from the first
processing node) from its command buffer queue. The second processing node may, at that point, start to respond to a command to the same designated memory location. This sequence of messaging is one step in maintaining cache-coherent system memory
reads in a multiprocessing computer system. The data read from the designated memory location may be less than the whole cache block in size if the read command specifies so.
Upon receiving the probe command, all of the remaining nodes check the status of the cached copy, if any, of the designated memory location as described before. In the event that a processing node, other than the source and the target nodes,
finds a cached copy of the designated memory location that is in a modified state, that processing node responds with a memory cancel response sent to the target node, i.e. the second processing node. This memory cancel response causes the second
processing node to abort further processing of the read command, and to stop transmission of the read response, if it hasn't sent the read response yet. All the other remaining processing nodes still send their probe responses to the first processing
node. The processing node that has the modified cached data sends that modified data to the first processing node through its own read response. The messaging scheme involving probe responses and read responses thus maintains cache coherency during a
system memory read operation.
The memory cancel response further causes the second processing node to transmit a target done response to the first processing node regardless of whether it earlier sent the read response to the first processing node. The first processing node
waits for all the responses to arrive--i.e. the probe responses, the target done response, and the read response from the processing node having the modified cached data--prior to completing the data read cycle by sending a source done response to the
second processing node. In this embodiment, the memory cancel response conserves system memory bandwidth by causing the second processing node to abort a time-consuming memory read operation when a modified copy of the requested data is cached at a
different processing node. Data transfer latency is also reduced, because a data transfer between two processing nodes over the high-speed dual unidirectional link is substantially faster than a similar data transfer between
a processing node and a system memory over a relatively slow system memory bus.
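The message sequence summarized above can be traced with a small simulation. The following Python sketch is illustrative only and is not part of the disclosed embodiment: the class `Node`, the function `read_transaction`, the message labels, the cache-state letters, and the `cancel_arrives_first` flag are all hypothetical names chosen for this example. The order of entries in the returned log is not timing-accurate, since several of the messages travel concurrently.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A processing node with a toy cache: address -> (state, data)."""
    name: str
    cache: dict = field(default_factory=dict)


def read_transaction(source, target, others, memory, addr, cancel_arrives_first=True):
    """Return (data, log) for one read; log holds (sender, receiver, message).

    `cancel_arrives_first` is a hypothetical knob modelling whether the memory
    cancel response reaches the target before its read response is sent.
    """
    log = [(source.name, target.name, "Read")]

    # The target probes every remaining node, whether or not it caches the line.
    for node in others:
        log.append((target.name, node.name, "Probe"))

    # Find a remaining node that holds the line in the modified ("M") state.
    owner = next((n for n in others if n.cache.get(addr, ("I", None))[0] == "M"), None)

    # The target's own read response goes out unless it was cancelled in time.
    if owner is None or not cancel_arrives_first:
        log.append((target.name, source.name, "RdResponse"))

    for node in others:
        if node is owner:
            # Modified copy found: cancel the memory read and supply the data.
            log.append((node.name, target.name, "MemCancel"))
            log.append((node.name, source.name, "RdResponse"))
        else:
            log.append((node.name, source.name, "ProbeResp"))

    if owner is not None:
        # Target done is sent whether or not the target's read response went out.
        log.append((target.name, source.name, "TgtDone"))

    # The source closes the transaction once all responses have arrived.
    log.append((source.name, target.name, "SrcDone"))

    data = owner.cache[addr][1] if owner is not None else memory[addr]
    return data, log
```

Calling `read_transaction` with a third node whose cache holds the address in state `"M"` returns that node's data and a log in which the target sends a target done response but no read response, matching the cancelled case described above.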
BRIEF DESCRIPTION OF THE DRAWINGS
A better understanding of the present invention can be obtained when the following detailed description of the preferred embodiment is considered in conjunction with the following drawings, in which:
FIG. 1 is a block diagram of one embodiment of a computer system.
FIG. 2 shows in detail one embodiment of the interconnect between a pair of processing nodes from FIG. 1.
FIG. 3 is a block diagram of one embodiment of an information packet.
FIG. 4 is a block diagram of one embodiment of an address packet.
FIG. 5 is a block diagram of one embodiment of a response packet.
FIG. 6 is a block diagram of one embodiment of a command packet.
FIG. 7 is a block diagram of one embodiment of a data packet.
FIG. 8 is a table illustrating exemplary packet types that may be employed in the computer system of FIG. 1.
FIG. 9 is a diagram illustrating an example flow of packets corresponding to a memory read operation.
FIG. 10A is a block diagram of one embodiment of a probe command packet.
FIG. 10B is a block diagram for one embodiment of the encoding for the NextState field in the probe command packet of FIG. 10A.
FIG. 11A is a block diagram of one embodiment of a read response packet.
FIG. 11B shows in one embodiment the relationship between the Probe, Tgt and Type fields of the read response packet of FIG. 11A.
FIG. 12 is a block diagram of one embodiment of a probe response packet.
FIG. 13 is a diagram illustrating an example flow of packets involving a memory cancel response.
FIG. 14 is a diagram illustrating an example flow of packets showing a messaging scheme that combines probe commands and memory cancel response.
FIG. 15 is an exemplary flowchart for the transactions involved in a memory read operation.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Turning now to FIG. 1, one embodiment of a multiprocessing computer system 10 is shown. In the embodiment of FIG. 1, computer system 10 includes several processing nodes 12A, 12B, 12C, and 12D. Each processing node is coupled to a respective
memory 14A-14D via a memory controller 16A-16D included within each respective processing node 12A-12D. Additionally, processing nodes 12A-12D include one or more interface ports 18, also known as interface logic, to communicate among the processing
nodes 12A-12D, and to also communicate between a processing node and a corresponding I/O bridge. For example, processing node 12A includes interface logic 18A for communicating with processing node 12B, interface logic 18B for communicating with
processing node 12C, and a third interface logic 18C for communicating with yet another processing node (not shown). Similarly, processing node 12B includes interface logic 18D, 18E, and 18F; processing node 12C includes interface logic 18G, 18H, and
18I; and processing node 12D includes interface logic 18J, 18K, and 18L. Processing node 12D is coupled to communicate with an I/O bridge 20 via interface logic 18L. Other processing nodes may communicate with other I/O bridges in a similar fashion.
I/O bridge 20 is coupled to an I/O bus 22.
The interface structure that interconnects processing nodes 12A-12D includes a set of dual-unidirectional links. Each dual-unidirectional link is implemented as a pair of packet-based unidirectional links to accomplish high-speed packetized
information transfer between any two processing nodes in the computer system 10. Each unidirectional link may be viewed as a pipelined, split-transaction interconnect. Each unidirectional link 24 includes a set of coherent unidirectional lines. Thus,
each pair of unidirectional links may be viewed as comprising one transmission bus carrying a first plurality of binary packets and one receiver bus carrying a second plurality of binary packets. The content of a binary packet will primarily depend on
the type of operation being requested and the processing node initiating the operation. One example of a dual-unidirectional link structure is links 24A and 24B. The unidirectional lines 24A are used to transmit packets from processing node 12A to
processing node 12B and lines 24B are used to transmit packets from processing node 12B to processing node 12A. Other sets of lines 24C-24H are used to transmit packets between their corresponding processing nodes as illustrated in FIG. 1.
A similar dual-unidirectional link structure may be used to interconnect a processing node and its corresponding I/O device, or a graphic device or an I/O bridge as is shown with respect to the processing node 12D. A dual-unidirectional link may
be operated in a cache coherent fashion for communication between processing nodes or in a non-coherent fashion for communication between a processing node and an external I/O or graphic device or an I/O bridge. It is noted that a packet to be
transmitted from one processing node to another may pass through one or more remaining nodes. For example, a packet transmitted by processing node 12A to processing node 12D may pass through either processing node 12B or processing node 12C in the
arrangement of FIG. 1. Any suitable routing algorithm may be used. Other embodiments of computer system 10 may include more or fewer processing nodes than those shown in FIG. 1.
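Since any suitable routing algorithm may be used, one simple possibility is a shortest-path search over the link topology. The Python sketch below is illustrative only. The `LINKS` table is an assumption inferred from the statement that a packet from node 12A to node 12D may pass through node 12B or node 12C; the function name `route` is hypothetical.

```python
from collections import deque

# Assumed link topology for FIG. 1; each adjacency is one dual-unidirectional link.
LINKS = {
    "12A": ["12B", "12C"],
    "12B": ["12A", "12D"],
    "12C": ["12A", "12D"],
    "12D": ["12B", "12C"],
}


def route(src, dst, links=LINKS):
    """Return a shortest hop-by-hop path from src to dst, or None if unreachable."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        # Breadth-first expansion guarantees the first path found is a shortest one.
        for nxt in links[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

With this table, `route("12A", "12D")` yields a two-hop path through one intermediate node, and directly linked nodes yield a one-hop path.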
Processing nodes 12A-12D, in addition to a memory controller and interface logic, may include other circuit elements such as one or more processor cores, an internal cache memory, a bus bridge, a graphics logic, a bus controller, a peripheral
device controller, etc. Broadly speaking, a processing node comprises at least one processor and may optionally include a memory controller for communicating with a memory and other logic as desired. Further, each circuit element in a processing node
may be coupled to one or more interface ports depending on the functionality being performed by the processing node. For example, some circuit elements may only couple to the interface logic that connects an I/O bridge to the processing node, some other
circuit elements may only couple to the interface logic that connects two processing nodes, etc. Other combinations may be easily implemented as desired.
Memories 14A-14D may comprise any suitable memory devices. For example, a memory 14A-14D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), static RAM, etc. The memory address space of the computer system 10 is divided
among memories 14A-14D. Each processing node 12A-12D may include a memory map used to determine which addresses are mapped to which memories 14A-14D, and hence to which processing node 12A-12D a memory request for a particular address should be routed.
In one embodiment, the coherency point for an address within computer system 10 is the memory controller 16A-16D coupled to the memory that is storing the bytes corresponding to the address. In other words, the memory controller 16A-16D is responsible
for ensuring that each memory access to the corresponding memory 14A-14D occurs in a cache coherent fashion. Memory controllers 16A-16D may comprise control circuitry for interfacing to memories 14A-14D. Additionally, memory controllers 16A-16D may
include request queues for queuing memory requests.
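The memory map lookup described above can be sketched as a table of address ranges. The following Python example is a minimal illustration under stated assumptions: the address ranges in `MEMORY_MAP` and the function name `home_node` are hypothetical, and an actual memory map may be organized differently.

```python
# Hypothetical memory map: (base, limit, home node); each range is half-open.
MEMORY_MAP = [
    (0x0000_0000, 0x4000_0000, "12A"),
    (0x4000_0000, 0x8000_0000, "12B"),
    (0x8000_0000, 0xC000_0000, "12C"),
    (0xC000_0000, 0x1_0000_0000, "12D"),
]


def home_node(addr, memory_map=MEMORY_MAP):
    """Return the node whose memory controller is the coherency point for addr."""
    for base, limit, node in memory_map:
        if base <= addr < limit:
            return node
    raise ValueError(f"address {addr:#x} is not mapped to any memory")
```

A request for an address is then routed to the node returned by `home_node`, whose memory controller orders accesses to that address.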
Generally, interface logic 18A-18L may comprise a variety of buffers for receiving packets from one unidirectional link and for buffering packets to be transmitted upon another unidirectional link. Computer system 10 may employ any suitable flow
control mechanism for transmitting packets. For example, in one embodiment, each transmitting interface logic 18 stores a count of the number of each type of buffers within the receiving interface logic at the other end of the link to which the
transmitting interface logic is connected. The interface logic does not transmit a packet unless the receiving interface logic has a free buffer to store the packet. As a receiving buffer is freed by routing a packet onward, the receiving interface
logic transmits a message to the sending interface logic to indicate that the buffer has been freed. Such a mechanism may be referred to as a "coupon-based" system.
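The "coupon-based" mechanism described above can be sketched in a few lines. The Python example below is a simplified model rather than the disclosed implementation: the class names `ReceivingInterface` and `TransmittingInterface`, the method names, and the buffer types used in the usage example are hypothetical.

```python
class ReceivingInterface:
    """Receiving end of a link, with a fixed number of buffers per type."""

    def __init__(self, buffers):
        # buffers: mapping of buffer type -> number of buffers of that type
        self.free = dict(buffers)
        self.held = []

    def accept(self, kind, packet):
        if self.free[kind] == 0:
            raise RuntimeError("sender violated flow control")
        self.free[kind] -= 1
        self.held.append((kind, packet))

    def route_onward(self):
        """Forward the oldest held packet, freeing its buffer."""
        kind, packet = self.held.pop(0)
        self.free[kind] += 1
        return kind, packet


class TransmittingInterface:
    """Sending end; keeps a count of free buffers of each type at the receiver."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.coupons = dict(receiver.free)

    def send(self, kind, packet):
        # Do not transmit unless the receiver is known to have a free buffer.
        if self.coupons[kind] == 0:
            return False
        self.coupons[kind] -= 1
        self.receiver.accept(kind, packet)
        return True

    def on_buffer_freed(self, kind):
        # Called when the receiver reports that a buffer of this type was freed.
        self.coupons[kind] += 1
```

For example, with one command buffer at the receiver, a second command packet is held back until the receiver routes the first one onward and reports the freed buffer.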
Turning next to FIG. 2, a block diagram illustrating processing nodes 12A and 12B is shown to illustrate in more detail one embodiment of the dual unidirectional link structure connecting the processing nodes 12A and 12B. In the embodiment of
FIG. 2, lines 24A (the unidirectional link 24A) include a clock line 24AA, a control line 24AB, and a command/address/data bus 24AC. Similarly, lines 24B (the unidirectional link 24B) include a clock line 24BA, a control line 24BB, and a
command/address/data bus 24BC.
A clock line transmits a clock signal that indicates a sample point for its corresponding control line and the command/address/data bus. In one particular embodiment, data/control bits are transmitted on each edge (i.e. rising edge and falling
edge) of the clock signal. Accordingly, two data bits per line may be transmitted per clock cycle. The amount of time employed to transmit one bit per line is referred to herein as a "bit time". The above-mentioned embodiment includes two bit times
per clock cycle. A packet may be transmitted across two or more bit times. Multiple clock lines may be used depending upon the width of the command/address/data bus. For example, two clock lines may be used for a 32-bit command/address/data bus (with
one half of the command/address/data bus referenced to one of the clock lines and the other half of the command/address/data bus and the control line referenced to the other one of the clock lines).
The control line indicates whether the data transmitted upon the command/address/data bus is a bit time of a control packet or a bit time of a data packet. The control line is asserted to indicate a control packet, and deasserted
to indicate a data packet. Certain control packets indicate that a data packet follows. The data packet may immediately follow the corresponding control packet. In one embodiment, other control packets may interrupt the transmission of a data packet.
Such an interruption may be performed by asserting the control line for a number of bit times during transmission of the data packet and transmitting the bit times of the control packet while the control line is asserted. Control packets that interrupt
a data packet may not indicate that a data packet will be following.
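The role of the control line in separating interleaved packets can be illustrated as follows. This Python sketch is an assumption-laden model: each bit time is represented as a `(ctl, value)` pair, and the function name `split_bit_times` is hypothetical. It separates control bit times from data bit times only; it does not determine packet boundaries, which depend on packet formats not modelled here.

```python
def split_bit_times(bit_times):
    """Separate a stream of bit times into control and data bit times.

    bit_times: iterable of (ctl, value), where ctl is True when the control
    line is asserted during that bit time.
    """
    control, data = [], []
    for ctl, value in bit_times:
        # Asserted control line -> control packet; deasserted -> data packet.
        (control if ctl else data).append(value)
    return control, data
```

A receiver built this way recovers a data packet intact even when a control packet interrupts it, because the interrupting bit times are steered away by the asserted control line.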
The command/address/data bus comprises a set of lines for transmitting the data, command, response and address bits. In one embodiment, the command/address/data bus may comprise 8, 16, or 32 lines. Each processing node or I/O bridge may employ
any one of the supported numbers of lines according to design choice. Other embodiments may support other sizes of command/address/data bus as desired.
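Combining the supported bus widths with the embodiment in which two bit times occur per clock cycle, the transfer time of a packet can be estimated with simple arithmetic. The Python sketch below is illustrative; the function names and the packet size used in the usage note are assumptions, not values taken from the description.

```python
def bit_times(packet_bits, bus_lines):
    """Number of bit times needed to send packet_bits over bus_lines lines.

    One bit time carries one bit per line, so the result is the ceiling of
    packet_bits / bus_lines.
    """
    return -(-packet_bits // bus_lines)


def clock_cycles(packet_bits, bus_lines, bit_times_per_clock=2):
    """Clock cycles needed, assuming data is sent on both clock edges by default."""
    return -(-bit_times(packet_bits, bus_lines) // bit_times_per_clock)
```

For a hypothetical 64-bit packet, an 8-line bus needs 8 bit times (4 clock cycles), while a 32-line bus needs 2 bit times (1 clock cycle).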
According to one embodiment, the command/address/data bus lines and the clock line may carry inverted data (i.e. a logical one is represented as a low voltage on the line, and a log