|
Claims  |
|
|
What we claim is:
1. A multi-processor computer system having a plurality of nodes coupled via a switch, each of the multi-processing nodes further comprising at least two processors, the
multi-processor computer system comprising:
a shared memory apportioned into a plurality of blocks, each one of the plurality of nodes for storing a portion of the shared memory;
a directory, at each one of the plurality of nodes, coupled to the portion of shared memory of the node, for storing status information for each of the plurality of blocks in the portion of the shared memory of the associated node, the status
information for indicating which of the plurality of nodes stores a copy of the associated block of shared memory;
a bus, at each one of the plurality of nodes, for coupling the at least two processors to the directory, the bus for carrying information including memory reference commands including an ownership command for gaining ownership of one of the
blocks of the shared memory;
table, at each of the plurality of nodes, to track a status of the memory reference commands as to allow multiple memory reference commands to be in transit at a given time; and
means, at each of the plurality of nodes, for inhibiting the transmission of an ownership command, issued by one of the least two processors of the respective node, requesting access to a given block of the shared memory in another one of the
plurality of nodes by monitoring the table.
2. The multi-processor computing system according to claim 1, wherein the means for inhibiting precludes the ownership command from being transmitted to the another one of the plurality of nodes when there is an outstanding read to the given
block.
3. The multi-processing computing system according to claim 1, wherein the table further comprises:
a plurality of entries, each of the entries of the table having:
an address field, for storing an address of a memory reference command issued by one of the at least two processors associated with the node to the portions of shared memory not associated with the respective node;
a transaction field, for indicating a type of transaction associated with the memory reference command; and
a status field comprising at least one bits for indicating a status of the transaction associated with the memory reference command.
4. The computer system according to claim 3, further comprising:
means for comparing the type of transaction of a memory reference command to a block against the types of transactions of entries in the table having an address corresponding to the address of the block to determine whether to issue the memory
reference command to the block.
5. The computer system according to claim 4, wherein the memory reference command is an ownership command, and wherein the transmission of the memory reference command is delayed until all memory reference commands in the table having addresses
corresponding to the address of the block that are read type transactions are satisfied.
6. The computer system according to claim 5, wherein the delayed memory reference command is dropped if an invalidate type transaction having an address corresponding to the address of the memory reference command is received at the table.
7. The computer system according to claim 5, wherein the memory reference command is failed if the transaction type of one of the entries in the table having the address corresponding to the memory reference command is an ownership command.
8. The multi-processing computer system according to claim 1, wherein the ownership command is a change to dirty request.
9. A method for maintaining order between a plurality of requests issued to a common address in a multi-processor computer system, the multi-processor computer system comprising a plurality of multi-processor nodes coupled via a switch, each of
the multi-processor nodes comprising at least two processors and a portion of a s shared memory, the method comprising the steps of:
coupling a directory, at each one of the plurality of nodes, to the portion of shared memory of the node, to store status information for each of a plurality of blocks in the portion of the shared memory of the associated node, the status
information for indicating which of the plurality of nodes stores a copy of the associated block of shared memory;
issuing, by the at least two processors, memory reference command including an ownership command for gaining ownership of one of the blocks of the shared memory;
coupling a table, at each of the plurality of nodes, to track a status of the memory reference commands as to allow multiple memory reference commands to be in transit at a given time; and
inhibiting, at each one of the plurality of nodes, the transmission of an ownership command requesting access to a given block of the portion of shared memory in another one of the plurality nodes by monitoring the table.
10. The method according to claim 9, wherein the step of inhibiting precludes the ownership command from being transmitted to the another one of the plurality of nodes when there is an outstanding read to the given block.
11. The method according to claim 10, wherein the table comprises a plurality of entries, each of the entries of the table having:
an address field, for storing an address of a memory reference command issued by one of the at least two processors associated with the node to the portions of shared memory not associated with the respective node;
a transaction field, for indicating a type of transaction associated with the memory reference command; and
a status field comprising at least one bits for indicating a status of the transaction associated with the memory reference command.
12. The method according to claim 11, further comprising the step of:
comparing the type of transaction of a memory reference command to a block against the types of transactions of entries in the table having an address corresponding to the address of the block to determine whether to issue the memory reference
command to the block.
13. The method according to claim 12, wherein the memory reference command is an ownership command, and wherein the transmission of the memory reference command is delayed until all memory reference commands in the table having addresses
corresponding to the address of the block that are read type transactions are satisfied.
14. The method according to claim 13, wherein the delayed memory reference command is dropped if an invalidate type transaction having an address corresponding to the address of the memory reference command is received at the table.
15. The method according to claim 12, wherein the memory reference command is failed if the transaction type of one of the entries in the table having the address corresponding to the memory reference command is an ownership command.
16. The method according to claim 15, wherein each of the memory reference commands comprise a plurality of transactions and wherein a corresponding
one of a plurality of channels is assigned to each of the plurality of transactions of each of the memory reference commands, and wherein memory reference commands are forwarded to the table in the same channel and wherein the channel is
ordered.
17. The method according to claim 11 further comprising the step of:
selectively forwarding a given request from each of the multi-processor nodes responsive to a type of other requests in the table having the same addresses.
18. The method according to claim 17, wherein the step of selectively forwarding a given request delays the issuance of the given request when the given request is a request for ownership of data associated with the address and there are read
type requests to the address in the table.
19. The method according to claim 17, wherein the given request is dropped from the list if an invalidate is received for the address.
20. The method according to claim 17, wherein the step of selectively forwarding a given request drops the given request if any of the other requests in the table having an address corresponding the address of the given request are ownership
commands.
21. The method according to claim 9, wherein the ownership command is a change to dirty request. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
FIELD OF THE INVENTION
This invention relates in general to the field of computer architecture and more specifically to distributed shared-memory multi-processing systems.
BACKGROUND OF THE INVENTION
As it is known in the art, symmetric multi-processing computers allow for high performance application processing. Typical symmetric multi-processing computer systems include a number of processors coupled together by a bus. One characteristic
of a symmetric multi-processing system is that memory space is shared among all of the processors. One or more operating systems are stored in memory and control the distribution of processes or threads among the various processors.
By allowing many different processors to execute different processes or threads simultaneously, the execution speed of a given application may be greatly increased. In theory the performance of a system could be improved by simply increasing the
number of processors in the multi-processing system. In reality, the continued addition of processors past a certain saturation point serves merely to increase communication bottlenecks and thereby limit the overall performance of the system.
For example, referring now to FIG. 1A, a typical prior art multi-processor system 2 including eight processors coupled together via a common interconnect bus is shown. During operation, each of the processors 3a-3h communicate with the other
processors and with a shared memory 4 via a shared interconnect bus 5. The symmetric multi-processing arrangement of FIG. 1A has been adequate for multi-processors built to date. However, with the advent of faster microprocessors, a common shared
interconnect is not capable of sufficiently exercising the full performance potential of the coupled microprocessors. Because the only communication link between the processors and memory is the shared bus, the bus may rapidly become saturated with
requests from the processors, thereby increasing delays as each processor attempts to gain access to the system bus. Therefore, although the processors may be able to operate at enhanced speeds, the limiting factor in terms of performance is the
available bandwidth of the system bus.
Communication bandwidth is a key factor in the performance of SMP systems. Since bandwidth may not be uniform between pairs or subsets of nodes in the SMP system, the industry uses a "bisection bandwidth" measurement for determining the
communication bandwidth of an SMP system. Bisection bandwidth is determined in the following manner. All possible ways of partitioning the system into two portions of equal compute power (equal
number of processors) are ascertained. For each partition, the sustainable bandwidth between the two partitions is determined. The minimum of all of the sustainable bandwidths is the bisection bandwidth of the interconnect. The minimum
bandwidth between the two partitions indicates the communication bandwidth sustainable by the multiprocessor system in the presence of worst-case communication patterns. Thus, a large bisection bandwidth is desirable.
Several interconnection architectures or "topologies" have been used in the prior art to overcome bus saturation problems. These topologies include meshes, touri, hypercubes and enhanced hypercubes.
As an example, a mesh interconnect is shown as system 7 in FIG. 1B. The major advantage of the mesh network is its simplicity and ease of wiring. Each node is connected to a small number of other neighboring nodes. However, the mesh
interconnect has three significant drawbacks. First, messages must on average traverse a large number of nodes to get to their destination, and as a result the communication latency is high. Second, the bisection bandwidth does not scale as well for a
mesh topology as it does for other topologies. Finally, because each of the messages may traverse different paths within the mesh, there are no natural ordering points within an SMP system, and therefore the cache coherence protocols required to
implement the mesh topology are often quite complex.
The torus, hypercube, and enhanced hypercube topologies are all topologies wherein the nodes are interconnected in various complex arrangements, for example in a torus arrangement or a cube arrangement. The torus, hypercube and enhanced
hypercube interconnects are more complex than the mesh interconnect, but offer better latency and bandwidth than the mesh interconnect. However, like the mesh interconnect, the torus, hypercube and enhanced hypercube topologies do not provide natural
ordering points, and thus a complex cache coherence protocol must be implemented for each of those systems.
In shared-memory multiprocessor systems, processors typically employ private caches to store data determined likely to be accessed in the future. Since processors may read data from their private cache and may update data in the private cache
without writing it back to memory, a mechanism is needed to ensure that the private chaches of each of the processors are kept consistent, or coherent. The mechanism that is used to ensure coherency of data in the SMP system is referred to as the cache
coherence protocol.
Besides the topology, bandwidth, and latency of the physical interconnect the efficiency of the cache coherence protocol is a key factor in system performance. Cache coherency protocols may introduce latencies, bottlenecks, inefficiencies or
complexity in several ways.
The latency of load and store operations is often directly affected by the protocol of the design. For example, in some protocols, a store operation is not considered complete until all invalidate messages have made it to their target processors
and acknowledgment messages have made it all the way back to the original processor. The latency of stores here is much higher than a protocol wherein the original processor does not have to wait for the Invalidates to make it to their destination.
Further, the acknowledgments consume a significant fraction of the system bandwidth.
Bottlenecks are often introduced due to high occupancy of controllers. "Occupancy" is a term of art; it indicates the amount of time a controller is unavailable after it receives a request. In some protocols, when a directly controller receives
a request corresponding to a memory location, it becomes unavailable for other requests to the same memory location until certain acknowledgments corresponding to the former command arrive at the directory. If the controller receives conflicting
requests at a higher than average rate, it becomes a bottleneck.
The design of the cache coherence protocol also affects hardware complexity. For instance, some protocols introduce deadlock and fairness problems, which are then addressed with additional mechanisms. This results in added hardware complexity.
It is desirable to provide a symmetric multiprocessing system that minimizes the latency of operations, provides large communication bandwidth, provides low controller occupancy, and can scale to a large number of processors.
SUMMARY OF THE INVENTION
The present invention is advantageously employed in a symmetric multi-processing system where multiple multi-processor nodes including at least one processor and a portion of a shared memory are coupled together via a switch. In each of the
multi-processor nodes a transaction tracking table (TTT) is maintained. The TTT may reside in a global port of the node, connecting the node to the switch, or alternatively may reside at each of the at least one processors of the multi-processing node.
The TTT may be used to enforce an order of requests that are forwarded out of the multi-processing node. In particular, if one of the requests is a request for ownership of data associated with a block, such as a change to dirty request, the TTT
may delay or drop the request for ownership depending upon the status of other outstanding read requests to the same address. The TTT may also be used to drop the change to dirty requests if a subsequent invalidate for that same address is received at
the associated node. In addition, the TTT may fail other requests to the same address that are received after the request for ownership, thereby ensuring data coherency. By ensuring that change to dirty requests from local processors at a node are
handled correctly with regard to remote accesses, it can be guaranteed that the local accesses are never given priority over remote accesses to the same address but are handled in a proper order to ensure memory coherency.
According to one aspect of the invention, a multi-processor computer system having a plurality of nodes coupled via a switch is provided. Each of the multi-processing nodes includes at least two processors. The multi-processor computer system
includes a shared memory apportioned into a plurality of blocks, with each one of the plurality of nodes for storing a portion of the shared memory. A directory is included at each one of the plurality of nodes and is coupled to the portion of shared
memory of the node. The directory is for storing status information for each of the plurality of blocks in the portion of the shared memory of the associated node, wherein the status information indicates which of the plurality of nodes stores a copy of
the associated block of shared memory. A bus, at each one of the plurality of nodes, couples the at least two processors to the directory. The bus carries information including memory reference commands for the shared memory. The memory reference
commands include an ownership command for gaining ownership of one of the blocks of the shared memory. The multi-processor computer system further includes means, at each of the plurality of nodes, for inhibiting the transmission of an ownership
command, issued by one of the at least two processors of the respective node, requesting access to a given block of the shared memory in another one of the plurality of nodes.
According to another aspect of the invention, a method for maintaining order between a plurality of requests issued to a common address in a multi-processor computer system is provided. The multi-processor computer system includes a plurality of
multi-processor nodes coupled via a switch with each of the multi-processor nodes comprising at least two processors and a portion of a shared memory. The method includes the steps of coupling a directory, at each one of the plurality of nodes, to the
portion of shared memory of the node, to store status information for each of a plurality of blocks in the portion of the shared memory of the associated node, the status information for indicating which of the plurality of nodes stores a copy of the
associated block of shared memory, issuing, by the at least two processors, memory reference commands including an ownership command for gaining ownership of one of the blocks of the shared memory and inhibiting, at each one of the plurality of nodes,
the transmission of an ownership command requesting access to a given block of the portion of shared memory in another one of the plurality of nodes.
BRIEF DESCRIPTION OF THE DRAWINGS
The above-mentioned and other features of the invention will now become more apparent by reference to the following description taken in connection with the accompanying drawings in which:
FIGS. 1A-1B are block diagrams of two prior art symmetric multi-processor computer systems;
FIG. 2 is a block diagram of one embodiment of a multi-processor computer node of one embodiment of the invention comprising a switch;
FIG. 3 is a block diagram illustrating the data path of the switch of FIG. 1 composing a number of Simultaneous Insertion Buffers;
FIG. 4A is a block diagram of one embodiment of one of the Simultaneous Insertion Buffers of FIG. 3;
FIG. 4B is a block diagram of one implementation of logic for controlling one of the Simultaneous Input Buffers of FIG. 4;
FIG. 5 is a block diagram of a second embodiment of one of the Simultaneous Insertion Buffers of FIG. 3;
FIG. 6 is a block diagram of the multi-processor computer node of FIG. 2, augmented for connection into a larger network of similar nodes;
FIG. 7A is one embodiment of an SMP system implemented using multiple nodes similar to the multi-processor node of FIG. 6;
FIG. 7B is another embodiment of an SMP system implemented using multiple nodes similar to the multi-processor node of FIG. 6;
FIG. 8 is a block diagram of a global port of FIG. 6;
FIG. 9 illustrates an entry in a directory of the multi-processor node of FIG. 6;
FIG. 10 illustrates a Transaction Tracking Table (TTT) for use in the global port of FIG. 8;
FIG. 11 is a block diagram of a hierarchical switch for coupling the multiple nodes in FIG. 7A;
FIG. 12A is a block diagram of one embodiment of interconnect logic for the hierarchical switch that eliminates deadlock;
FIG. 12B is a flow diagram of the operation of the interconnect logic of FIG. 12A;
FIG. 13 is a flow diagram of the method used in the interconnect logic of FIG. 12A to assert flow control to stop data being transmitted from one of the multi-processing nodes;
FIG. 14 is a timing diagram illustrating the transfer of address and data packets on the busses to and from the hierarchical switch;
FIG. 15 is a block diagram of one embodiment of buffer logic for maintaining order at the hierarchical switch;
FIG. 16 is a block diagram of another embodiment of buffer logic for maintaining order for the hierarchical switch;
FIG. 17 is a flow diagram illustrating one method of operating the buffer logic of FIG. 16;
FIG. 18 is a block diagram of another embodiment of buffer logic for maintaining order at the hierarchical switch;
FIG. 19 is a table illustrating the translation of processor instructions to network instructions for use in the SMP of FIGS. 7A or 7B;
FIGS. 20A, 20B, 20C, 20D, 20E, 20F, 20G, 20H, 20I and 20J illustrate a number of communication flows for transferring packets between nodes in the SMP of FIGS. 7A or 7B;
FIG. 21 is a block diagram illustrating the layout of a memory module for use in the multi-processor system of FIGS. 2 or 6;
FIG. 22 is a timing diagram illustrating the control logic used by the memory module of FIG. 21 for delayed write operations;
FIG. 23 is a flow diagram illustrating the use of discrete transactions that are mapped to channels for maintaining cache coherency in one embodiment of the invention;
FIG. 24 is a block diagram illustrating one implementation of a shared queue structure for handling virtual channels in the SMP of FIGS. 7A or 7B;
FIG. 25 is a block diagram illustrating an implementation of individual channel buffering in the nodes and hierarchical switches of the SMP of FIGS. 7A or 7B;
FIG. 26 is a block diagram for illustrating the problems that may arise if some amount of ordering between virtual channels in not maintained;
FIGS. 27A-20C are block diagrams illustrating the flow and ordering constraints on the Q1 channel for providing coherent communication in the SMP of FIGS. 7A or 7B;
FIGS. 28A and 28B are a block diagram illustrating the ambiguity problems that arise because of the coarse vector presence bits of the directory entries of the SMP of FIGS. 7A and 7B;
FIG. 29 is a block diagram illustrating the method used to prevent data ambiguity from arising as a result of the problem described in FIG. 28;
FIG. 30 is a block diagram for illustrating a coherency issue that arises from packets on different channels being received out of sequence;
FIG. 31 is a block diagram illustrating the use of Fill Markers for preventing the coherency problem described in FIG. 29;
FIG. 32 is an entry in the TTT reflecting the status of an instruction during the flow described with regard to FIG. 31;
FIGS. 33A-33B are block diagrams illustrating the operation of Change to Dirty commands in the SMP system;
FIG. 34 is a block diagram illustrating the use of Shadow commands for remedying the problem described with regard to FIG. 33;
FIG. 35 is an entry in the TTT reflecting the status of an instruction during the flow described with regard to FIG. 34; and
FIG. 36 is a flow diagram illustrating permissible sequential ordering of instructions in the example described in FIG. 35.
DESCRIPTION OF THE PREFERRED EMBODIMENT
According to one embodiment of the invention, a hierarchical Symmetric Multi-Processing (SMP) system includes a number of SMP nodes coupled together via a high performance switch. Thus, each of the SMP nodes act as a building block in the SMP
system. Below, the components and operation of one SMP node building block is first described, followed by a description of the operation of the SMP system and subsequently a description of a cache coherence protocol that is used to maintain memory
coherency in the large SMP system.
SMP NODE BUILDING BLOCK
Referring now to FIG. 2, a multi-processor node 10 includes four processor modules 12a, 12b, 12c, and 12d. Each processor module comprises a central processing unit (CPU). In a preferred embodiment, Alpha.RTM. 21264 processor chips,
manufactured by Digital Equipment Corporation.RTM. are used, although other types of processor chips capable of supporting the below described coherency protocol may alternatively be used.
Multi-processor node 10 includes a memory 13, which may include a number of memory modules 13a-13d. The memory may provide 32 GBytes of storage capacity, with each of the 4 memory modules storing 8 Gigabytes. Each of the memory modules is
apportioned into a number of blocks of memory, where each block may include, for example 64 bytes of data. Data is generally retrieved from memory in blocks.
In addition, multi-processing node 10 includes an I/O processor (IOP) module 14 for controlling transfer of data between external devices (not shown) and the multi-processor node 10 via a coupled I/O bus 14a. In one embodiment of the invention,
the I/O bus may operate according to the Peripheral Computer Interconnect (PCI) protocol. The IOP 14 includes an IOP cache 14c and an IOP tag store 14b. The IOP cache 14c provides temporary storage for data from memory 13 that is transferred to
external devices on the PCI bus 14a. The IOP tag store 14b is a 64 entry tag store for storing coherency information for data being moved between external devices, processors and memory.
The coherency of data stored in the memory 13 of the multi-processor node is maintained by means of a Duplicate Tag store (DTAG) 20. The DTAG 20 is shared by all of the processors 12a-12d, and is apportioned into 4 banks, where each bank is
dedicated to storing status information corresponding
to data used by an associated one of the processors.
The DTAG, Memory and IOP are coupled to a logical bus referred to as the Arb bus 17. Memory block requests issued by the processor are routed via the local switch 15 to the Arb bus 17. The DTAG 20 and IOP 14 look up the state of the block in
the processors' and IOP's caches and atomically update their state for the memory block. The Arb bus 17 acts as a serialization point for all memory references. The order in which memory request appear on the Arb bus is the order in which processors
perceive the results of the requests.
The processor modules 12a-12d, memory modules 13a-13d and IOP module 14 are coupled together via a local, 9 port switch 15. Each of the interfacing modules 12a-12d, 13a-13d and 14 are connected to the local switch by means of a like number of
bi-directional, clock forwarded data links 16a-16i. In one embodiment, each of the data links forwards 64 bits of data and 8 bits of error correcting code (ECC) one each edge of a system clock operating at a rate of 150 MHZ. Thus, the data bandwidth of
each of the data links 16a-16i is 2.4 Gigabytes/sec.
Local switch 15 includes an Quad Switch Address control chip (QSA chip) 18 and a Quad Switch data slice chip (QSD chip) 19. QSA chip 18 includes an arbiter (QS Arb) 11 for controlling address paths between the processor modules, IOP, and memory. In addition, QSA chip 18 provides control to the QSD chip 19 to control the flow of data through the local switch 15 as described below.
QSD chip 19 provides a switch interconnect for all data paths between the processor modules, memory modules and IOP. Although not shown in FIG. 2, as will be described below, if the multi-processor node 10 were coupled to other multi-processor
nodes via a global port, the QSD and OSA would additionally provide a switch interconnect for the global port. Each of the processors may request data from one of the available resources, such as the memory devices 13a-13d, other processors 12a-12d, IOP
14 or alternatively resources in other multi-processor nodes via the global port. Thus, the local switch 15 should be able to accommodate simultaneous input from a variety of resources while maintaining the high bus bandwidth of 2.4 GBytes.
The local switch is able to handle mulitple concurrent transactions. Since each transaction typically uses multiple resources (such as memory banks, datapaths, queues), the control functions of the local switch can be very complex. For
instance, a transaction may require a memory bank to be available in stage 0 of the transaction, the datapath from memory bank to processor port be available in stage 1, and the datapath from processor port to processor be available in stage 2. The
local switch arbiter (QSA Arb 11 in the QSA 18) arbitrates among requests in such a manner that once a transaction is initiated, resources required by a transaction in each stage are available as required.
More significantly, the arbiter guarantees that all requests and processors get fair access to the resources by ensuring that particular requests do not fail to win arbitration for a long time (potentially indefinitely) while others make
progress. For instance, consider a transaction T that requires three resources A, B, and C. Transaction T may not win arbitration until all three resources are guaranteed to be available in the appropriate stages of the transaction. If the artiber
bases its decision only on the availability of resources, then it is possible that T may not succeed for a long time while other transactions which consume one of A, B, or C (along with other resources D, E, etc). continue to win arbitration.
Guaranteeing fair arbitration in a switch with a large number of concurrent requests, each using multiple resources to complete, is computationally complex and likely to increase delays in the high speed datapath. In the apparatus disclosed
herein, the QSA arb 11 arbitrates for only one resource (the memory bank) before scheduling a particular transaction. A second resource, which is a queue leading up to the processors, does not need to be checked for availability at the time of
arbitration by the QSA arb 11 for the first resource. This is because the architecture of the QSD guarantees that datapaths and queue slots leading up to the queue are always available. The fair arbitration for resources may be provided without much
complexity in the QSA arb 11.
According to one embodiment of the invention, the QSD is able to simultaneously receive input from all of the sources (processors, memory, IOP and global port) without requiring any upfront arbitration for the buffers leading up to corresponding
destinations. All sources of data may then independently forward data to the switch without having to arbitrate for access to the datapath or queue slots in the switch because the QSD includes a number of simultaneous insertion buffers capable of
receiving, substantially simultaneously, data from all of the sources. Two embodiments of simultaneous insertion buffers are described below.
SIMULTANEOUS INSERTION BUFFER SWITCH
As described above, the processor 12a-12d, IOP 14 and memory devices 13a-13d in the multi-processing node each serve as resources for handling requests from the processors and IOP in the multi-processing node. Data is transferred between each of
the resource elements and the requesting elements in the form of packets. Each packet comprises 512 bits of data and 64 bits of ECC. As described above, each of the data links carries 64 bits of data and 8 bits of ECC on each edge of a 150 MHZ clock.
Thus, external to the QSD there are 8 data transfer cycles per packet. Internal to the QSD, however, data is gathered only on one edge of the clock. Thus, for each clocking cycle of logic internal to the QSD, there are potentially 128 bits of data
received from the data links. Since each packet comprises 512 bits of data and 64 bits of ECC, internal to the QSD there are 4 data transfer cycles for each packet, with 128 bits of data and 16 bits of ECC being transferred from a processor, IOP or
memory device to the QSD each QSD clocking cycle.
Referring now to FIG. 3, the QSD 19 is shown in more detail to include five Simultaneous Insertion Buffers (SIBs) 25a-25e. Each SIB is dedicated to one of the requester elements, i.e., processors 12a-12d or the IOP. Each SIB controls the data
path for transfer of packets between its associated requestor element and the other resource elements in the node; i.e., processors 12a-12d, memory 13a-13d, IOP 14 and advantageously a global port. The global port acts as an interconnect to other
multi-processor nodes and is described in detail below. The SIBs allow for the simultaneous receipt of packets by the requester from any of the resources coupled to the switch without requiring arbitration between the requesters for access to the
switch.
As described previously, the QSA Arb 11 is coupled to provide control to the switch 19. Included in QSA Arb 11 is a main arbiter 27. The main arbiter 27 manages the data movement between the resources (the IOP, processors 12a-12d and memory
13a-13d) and the switch 19. Each of the processors 12a-12d and IOP 14 issues requests for access to one of the resources on lines 28a-28e that are forwarded to the main arbiter 27. The main arbiter in turn forwards these requests to the associated
resources when each resource is able to receive a request. Once the resource has received the request, no arbitration for the switch 19 is required because each of the SIBs are capable of receiving input from all of the inputs substantially
simultaneously, i.e., within the same data cycle.
Also included in the QSA Arb 11 is a number of individual arbiters 23a-23d. Each of the arbiters 23a-23d is used to manage a datapath between an associated one of the processors 12a-12d and their corresponding SIB 25b-25e, respectively. A
similar arbitrer (not shown) is included in the IOP 14 for managing the datapath between IOP 14 and SIB 25a. As each processor is able to receive data from their associated SIB, the associated arbiter forwards the data on the coupled datapath.
Accordingly, by using simultaneous insertion buffers within the switch 19, the arbitration pathway between a requestor and a resource may be broken up into two distinct sections; a first arbitration section where the main arbiter 27 arbitrates
for a resource in response to a request from a processor independent of the availability of the requesting processor to receive data from the coupled resource, and a second arbitration section where the arbiter associated with the processor arbitrates
for access to the processor for forwarding data from the switch. With such an arrangement, because the arbitration is segregated it can be ensured that fair access to each of the coupled resources is provided.
Referring now to FIG. 4A, a more detailed diagram of one embodiment of the SIB 25a is shown to include an input arbiter 36 coupled to provide mux select signals <31:0> on line 36a to eight coupled multiplexers 34a-34h, where four of the mux
select signals are forwarded to each of the eight multiplexers to select one of nine inputs at each multiplexer. All of the SIBs 25a-25d are similarly architected, and thus only one is described in detail. As described above, there are potentially ten
resources coupled to the SIB. One of the ten resources is a requestor device that receives output from the SIB, while the other nine resources provide input to the SIB. Therefore, each of the multiplexers 34a-34h receives input from nine resources
coupled to the SIB. The inputs from three of the coupled processors are received on lines Px, Py, and Pz. Another input, from either the fourth processor (when the SIB is associated with the IOP device) or from the IOP device (when the SIB is
associated with one of the processors) is received on line PW/IOP. The inputs from memory banks 13a-13d are received on lines mem0, mem1, mem2 and mem3, respectively, and input from the global port is received on line global port.
Each output from each of the multiplexers 34a-34h is coupled to one of eight banks of a buffer 32. Each bank has eight entries, with each entry storing 128 bits of data and 16 bits of ECC. Thus, each packet of data that is received by the SIB
is written to four different banks in the same row of the buffer 32. As described below, the input arbiter 36 maintains status bits for indicating the banks of the buffer that are available for storing data. Thus, each cycle that 128 bits of packet
data are received from one or more resources, the input arbiter 36 selects one of the possible nine resource inputs at each multiplexer 34a-34h for forwarding the cycle of packet data to the associated bank 32a-32h depending upon the availability status
of the banks. The input arbiter also provides bypass data on line 36b to a multiplexer 30. When the status bits in the input arbiter indicate that all of the banks 32a-32h are empty, one of the nine resource inputs may be bypassed directly to the
associated requester via the input arbiter 36.
Each of the banks 32a-32h are coupled to multiplexer 30. Multiplexer 30 is controlled by an output arbiter 38. When the requestor associated with the SIB 25a is ready to receive data from the SIB, and a portion of a packet has been written into
an entry in the SIB, the output arbiter forwards one of the eight entries from the banks 32a-32h to the requester. Alternatively, the output arbiter forwards the bypass data on line 36b to the requestor if none of the banks have data pending transfer
and data is available on line 36b from the input arbiter.
During operation, when the first 128 bits of packet data are received at the SIB, one of the eight banks is selected for storing the first 128 bits of packet data. According to one embodiment of the invention, during each of the next three
cycles that 128 bits of packet data are received, the bank adjacent to the bank that was used to perform the previous write is selected for writing the next 128 of packet data. For example, if bank 32a were selected as an available bank for writing a
first cycle of packet data from source mem0, the second cycle of packet data would be written to bank 32b, the third to bank 32c, and the fourth to bank 32d. The selection of which bank to use for writing the subsequent cycles of packet data is thus
performed on a rotating basis, starting at a bank selected by the input arbiter and continuing at an adjacent bank for each successive packet write. As a result, the received packet is spread across four banks in a common row of the buffer 32.
Because eight banks are provided, and because, in one embodiment of the invention, the maximum number of resource reads that may be outstanding at any one requestor is eight, it can be ensured that at least one bank will be available to every
resource for every write cycle. Therefore, if, at a given instant in time, all eight outstanding read responses were received by the switch, banks 32a-32h could each be used to accommodate the first packet data cycle of the write, with the selection of
banks rotating for the next three write cycles.
In one embodiment of the invention, each buffer in a SIB operates under the First-In, First-Out (FIFO) protocol. Because two portions of packets may be received simultaneously, an order is selected for them to be `read` into the switch. Since
logic in the requestor that arbitrates for the resource does not communicate with the SIB and does not communicate with other requestors for arbitrating for the resource, a standard rule is followed to ensure data integrity. For example, a rule such as
`data from a lower number input resource is always written to the switch before data from a higher number input resource` may be followed, where the resources are assigned a fixed priority number.
As mentioned above, in the embodiment of the SIB shown in FIG. 4A, the use of eight banks has been described because eight corresponds to the number of outstanding memory requests that a requestor can have at any given instant of time. If,
however, the design constraints require that fewer banks be provided, the design could easily be modified by one of skill in the art to allow for multiple chunks of data to be written to different locations in a common bank simultaneously using
interleaving or a similar technique. Therefore, the present invention is not limited to the particular embodiment illustrated in FIG. 4A.
As described above, during operation the input arbiter maintains status information regarding the availability of entries in the bank to select an appropriate bank for writing data from a resource. An example embodiment of an input arbiter 36
for controlling the inputs to the SIB is shown in FIG. 4B. In FIG. 4B, although nine input resources were described above, for clarity purposes, logic for controlling the writing of only two resource inputs is shown. When input packet data is received
on lines 35, an indication signal, such as `input1`, is forwarded to a latch chain 40, which comprises 4 latches, flip flops, or similar state devices. The | | |