FIELD OF THE INVENTION
The invention relates to multiprocessor systems and, more particularly, to the generation of commit-signals in a multiprocessing system.
BACKGROUND OF THE INVENTION
Multiprocessing systems, such as symmetric multi-processors, provide a computer environment wherein software applications may operate on a plurality of processors using a single address space or shared memory abstraction. In a shared memory
system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value; this frees the programmer to focus on program development, e.g., algorithms, rather than managing partitioned data
sets and communicating values. Interprocessor synchronization is typically accomplished in a shared memory system between processors performing read and write operations to "synchronization variables" either before or after accesses to "data
variables".
For instance, consider the case of a processor P1 updating a data structure and processor P2 reading the updated structure after synchronization. Typically, this is accomplished by P1 updating data values and subsequently setting a semaphore or
flag variable to indicate to P2 that the data values have been updated. P2 checks the value of the flag variable and, if set, subsequently issues read operations (requests) to retrieve the new data values. Note the significance of the term
"subsequently" used above; if P1 sets the flag before it completes the data updates or if P2 retrieves the data before it checks the value of the flag, synchronization is not achieved. The key is that each processor must individually impose an order on
its memory references for such synchronization techniques to work. The order described above is referred to as a processor's inter-reference order. Commonly used synchronization techniques require that each processor be capable of imposing an
inter-reference order on its issued memory reference operations.
P1:                        P2:
Store Data, New-Value      L1: Load Flag
Store Flag, 0              BNZ L1
The inter-reference order imposed by a processor is defined by its memory reference ordering model or, more commonly, its consistency model. The consistency model for a processor architecture specifies, in part, a means by which the
inter-reference order is specified. Typically, the means is realized by inserting a special memory reference ordering instruction, such as a Memory Barrier (MB) or "fence", between sets of memory reference instructions. Alternatively, the means may be
implicit in other opcodes, such as in "test-and-set". In addition, the model specifies the precise semantics (meaning) of the means. Two commonly used consistency models include sequential consistency and weak-ordering, although those skilled in the
art will recognize that there are other models that may be employed, such as release consistency.
Sequential Consistency
In a sequentially consistent system, the order in which memory reference operations appear in an execution path of the program (herein referred to as the "I-stream order") is the inter-reference order. Additional instructions are not required to
denote the order simply because each load or store instruction is considered ordered before its succeeding operation in the I-stream order.
Consider the program example above. The program performs as expected on a sequentially consistent system because the system imposes the necessary inter-reference order. That is, P1's first store instruction is ordered before P1's store-to-flag
instruction. Similarly, P2's load flag instruction is ordered before P2's load data instruction. Thus, if the system imposes the correct inter-reference ordering and P2 retrieves the value 0 for the flag, P2 will also retrieve the new value for data.
Weak Ordering
In a weakly-ordered system, an order is imposed between selected sets of memory reference operations, while other operations are considered unordered. One or more MB instructions are used to indicate the required order. In the case of an MB
instruction defined by the Alpha® 21264 processor instruction set, the MB denotes that all memory reference instructions above the MB (i.e., pre-MB instructions) are ordered before all reference instructions after the MB (i.e., post-MB
instructions). However, no order is required between reference instructions that are not separated by an MB.
P1                         P2
Store Data1, New-value1    L1: Load Flag
Store Data2, New-value2    BNZ L1
MB                         MB
Store Flag, 0              Load Data1
                           Load Data2
In the above example, the MB instruction implies that each of P1's two pre-MB store instructions is ordered before P1's store-to-flag instruction. However, there is no logical order required between the two pre-MB store instructions. Similarly,
P2's two post-MB load instructions are ordered after the Load flag; however, there is no order required between the two post-MB loads. It can thus be appreciated that weak ordering reduces the constraints on logical ordering of memory references,
thereby allowing a processor to gain higher performance by potentially executing the unordered sets concurrently.
The prior art includes other types of barriers as described in literature and as implemented on commercial processors. For example, a write-MB (WMB) instruction on an Alpha microprocessor requires only that pre-WMB store instructions be
logically ordered before any post-WMB stores. In other words, the WMB instruction places no constraints at all on load instructions occurring before or after the WMB.
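The MB and WMB semantics described above can be sketched as a small predicate. This is a minimal illustration, not part of the described hardware: the tuple encoding of instructions and the function name `ordered_before` are assumptions made for the example, and the sketch models only barrier-imposed order (it ignores any order a real processor imposes between references to the same address).

```python
from typing import List, Tuple

# Hypothetical encoding: (kind, operand), where kind is "load", "store", "MB" or "WMB".
Instr = Tuple[str, str]

def ordered_before(program: List[Instr], i: int, j: int) -> bool:
    """Return True if the reference at index i must be logically ordered
    before the reference at index j because of an intervening barrier."""
    if not i < j:
        return False
    first, second = program[i][0], program[j][0]
    if first not in ("load", "store") or second not in ("load", "store"):
        return False  # barriers themselves are not memory references
    for kind, _ in program[i + 1:j]:
        if kind == "MB":
            return True  # MB orders every pre-MB reference before every post-MB one
        if kind == "WMB" and first == "store" and second == "store":
            return True  # WMB orders only stores with respect to stores
    return False

# P1's program from the weak-ordering example.
p1 = [("store", "Data1"), ("store", "Data2"), ("MB", ""), ("store", "Flag")]
print(ordered_before(p1, 0, 3))  # Data1 store vs. Flag store, separated by MB
print(ordered_before(p1, 0, 1))  # the two pre-MB stores, no barrier between them
```

With this encoding, the two pre-MB stores are reported as unordered relative to each other, while each is ordered before the store to the flag.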
In order to increase performance, modern processors do not execute memory reference instructions one at a time. It is desirable that a processor keep a large number of memory references outstanding and issue, as well as complete, memory
reference operations out-of-order. This is accomplished by viewing the consistency model as a "logical order", i.e., the order in which memory reference operations appear to happen, rather than the order in which those references are issued or
completed. More precisely, a consistency model defines only a logical order on memory references; it allows for a variety of optimizations in implementation. It is thus desired to increase performance by reducing latency and allowing (on average) a
large number of outstanding references, while preserving the logical order implied by the consistency model.
In prior systems, a memory barrier instruction is typically passed upon "completion" of an operation. For example, when a source processor issues a read operation, the operation is considered complete when data is received at the source
processor. When executing a store instruction, the source processor issues a memory reference operation to acquire exclusive ownership of the data; in response to the issued operation, system control logic generates "probes" to invalidate old copies of
the data at other processors and to request forwarding of the data from the owner processor to the source processor. Here the operation completes only when all probes reach their destination processors and the data is received at the source processor.
Broadly stated, these prior systems rely on completion to impose inter-reference ordering. For instance, in a weakly-ordered system employing MB instructions, all pre-MB operations must be complete before the MB is passed and post-MB operations
may be considered. Essentially, "completion" of an operation requires actual completion of all activity, including receipt of data, corresponding to the operation. Such an arrangement is inefficient and, in the context of inter-reference ordering,
adversely affects latency.
Therefore, the present invention is directed to increasing the efficiency of a shared memory multiprocessor system by relaxing the completion requirement while preserving the consistency model. The invention is further directed to improving the
performance of a shared memory system by reducing the latency associated with memory barriers.
SUMMARY OF THE INVENTION
The invention relates to a mechanism for reducing the latency of inter-reference ordering between sets of memory reference operations in a distributed shared memory multiprocessor system including a plurality of symmetric multiprocessor (SMP)
nodes interconnected by a hierarchical switch. The mechanism generally comprises a structure for optimizing the generation of a commit-signal by control logic of the multiprocessor system in response to a local memory reference operation issued by a
processor of a SMP node. The mechanism generally enables generation of commit-signals for local commands by control logic at the local node, rather than at the hierarchical switch. By reducing the latency of commit-signals for local operations, the
latency of inter-reference ordering is reduced, thereby enhancing the performance of the system.
Typically, commit-signals for all memory reference operations are generated by the control logic after ordering at an ordering point which, for the distributed shared memory system, is associated with the hierarchical switch. Since the latency
of commit-signals after ordering at the hierarchical switch is substantial, it is desirable that ordering be accomplished at the local node. In accordance with the invention, the optimizing structure facilitates ordering at the local node.
In the illustrative embodiment, the novel structure is a LoopComSig table that indicates whether the memory reference operation issued by the processor affects any non-local processor of the system. The memory reference operation affects other
processors if the operation has an invalidate component (i.e., a probe) generated by the ordering point of the local switch and sent over the hierarchical switch to invalidate copies of the data in those processors' caches. If the operation has an
invalidate component, an entry is created in the table for the memory reference operation. When the invalidate component is totally ordered at the hierarchical switch, an invalidate acknowledgment (Inval-Ack) is returned from the hierarchical switch to
the local ordering point. The Inval-Ack is used to remove the entry from the LoopComSig table.
According to the invention, generation of the commit-signal for a local memory reference operation is optimized when there is no entry in the LoopComSig table for the same address as the memory reference operation. For example, when a memory
reference operation to local address x is issued from the processor to the SMP node control logic, the table is examined using the address x. If there is no entry (i.e., no match on the address x), the SMP node control logic generates a commit-signal and
immediately sends it back to the processor. This is an example of a type 0 commit-signal which corresponds to a local command that is characterized by no external probes generated for address x and no external probes outstanding for address x.
If, on the other hand, the SMP node control logic finds an entry in the table for address x, a "loop" commit-signal is forwarded out to the hierarchical switch for total ordering. The commit-signal is totally ordered and then returned to the
node. This is an example of a type 3 commit-signal, which corresponds to a local command characterized by having no external probes generated to address x but having outstanding probes to that address.
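The table lookup that selects between a type 0 and a type 3 commit-signal can be sketched as follows. This is a minimal software model under stated assumptions: the class and method names are hypothetical, and keeping a per-address count of outstanding invalidates is an assumption of the sketch (the text says only that an entry is created for the operation and removed by the Inval-Ack).

```python
class LoopComSigTable:
    """Hypothetical model of the LoopComSig table, keyed by address."""

    def __init__(self):
        # address -> number of invalidates still awaiting an Inval-Ack (assumed)
        self._entries = {}

    def record_invalidate(self, address):
        """An operation with an invalidate component was sent to the hierarchical switch."""
        self._entries[address] = self._entries.get(address, 0) + 1

    def inval_ack(self, address):
        """The hierarchical switch returned an Inval-Ack after total ordering."""
        remaining = self._entries.get(address, 0) - 1
        if remaining > 0:
            self._entries[address] = remaining
        else:
            self._entries.pop(address, None)

    def has_entry(self, address):
        return address in self._entries


def commit_signal_type(table, address):
    """Classify a local command that generates no external probes of its own.

    Returns 0 when the commit-signal can be generated at the local node,
    or 3 when a "loop" commit-signal must go out to the hierarchical switch.
    """
    return 3 if table.has_entry(address) else 0


table = LoopComSigTable()
x = 0x40
print(commit_signal_type(table, x))  # no entry for x: type 0
table.record_invalidate(x)
print(commit_signal_type(table, x))  # probes outstanding for x: type 3
table.inval_ack(x)
print(commit_signal_type(table, x))  # entry removed: type 0 again
```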
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numbers indicate identical or functionally similar
elements:
FIG. 1 is a schematic block diagram of a first multiprocessor node embodiment comprising a plurality of processors coupled to an input/output processor and a memory by a local switch;
FIG. 2 is a schematic block diagram of the local switch comprising a plurality of ports coupled to the respective processors of FIG. 1;
FIG. 3 is a schematic diagram of an embodiment of a commit-signal implemented as a commit-signal packet;
FIG. 4 is a schematic block diagram of a second multiprocessing system embodiment comprising a plurality of multiprocessor nodes interconnected by a hierarchical switch;
FIG. 5 is a schematic block diagram of the hierarchical switch of FIG. 4;
FIG. 6 is a schematic block diagram of an augmented multiprocessor node comprising a plurality of processors interconnected with a shared memory, an IOP and a global port interface via a local switch;
FIG. 7 illustrates an embodiment of a LoopComSig table in accordance with the present invention;
FIG. 8 is a schematic diagram of an incoming command packet modified with a multicast-vector;
FIG. 9 is a schematic block diagram illustrating a total ordering property of an illustrative embodiment of the hierarchical switch;
FIG. 10 is a flowchart illustrating the sequence of steps for generating and issuing a type 0 commit-signal according to the present invention;
FIG. 11 is a flowchart illustrating the sequence of steps for generating and issuing a type 1 commit-signal according to the present invention;
FIG. 12 is a flowchart illustrating the sequence of steps for generating and issuing a type 2 commit-signal according to the present invention;
FIG. 13 is a flowchart illustrating the sequence of steps for generating and issuing a type 3 commit-signal according to the invention; and
FIG. 14 is a schematic top-level diagram of an alternate embodiment of the second multiprocessing system which may be advantageously used with the present invention.
DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS
As described herein, a symmetric multi-processing (SMP) system includes a number of SMP nodes interconnected via a high performance switch. Each SMP node thus functions as a building block in the SMP system. Below, the structure and operation
of an SMP node embodiment that may be advantageously used with the present invention is first described, followed by a description of the SMP system embodiment.
SMP Node:
FIG. 1 is a schematic block diagram of a first multiprocessing system embodiment, such as a small SMP node 100, comprising a plurality of processors (P) 102-108 coupled to an input/output (I/O) processor 130 and a memory 150 by a local switch
200. The memory 150 is preferably organized as a single address space that is shared by the processors and apportioned into a number of blocks, each of which may include, e.g., 64 bytes of data. The I/O processor, or IOP 130, controls the transfer of
data between external devices (not shown) and the system via an I/O bus 140. Data is transferred between the components of the SMP node in the form of packets. As used herein, the term "system" refers to all components of the SMP node excluding the
processors and IOP. In an embodiment of the invention, the I/O bus may operate according to the conventional Peripheral Component Interconnect (PCI) protocol.
Each processor is a modern processor comprising a central processing unit (CPU), denoted 112-118, that preferably incorporates a traditional reduced instruction set computer (RISC) load/store architecture. In the illustrative embodiment
described herein, the CPUs are Alpha® 21264 processor chips manufactured by Digital Equipment Corporation, although other types of processor chips may be advantageously used. The load/store instructions executed by the processors are issued to
the system as memory reference, e.g., read and write, operations. Each operation may comprise a series of commands (or command packets) that are exchanged between the processors and the system. As described further herein, characteristics of modern
processors include the ability to issue memory reference operations out-of-order, to have more than one memory reference outstanding at a time and to accommodate completion of the memory reference operations in arbitrary order.
In addition, each processor and IOP employs a private cache (denoted 122-128 and 132, respectively) for storing data determined likely to be accessed in the future. The caches are preferably organized as write-back caches apportioned into, e.g.,
64-byte cache lines accessible by the processors; it should be noted, however, that other cache organizations, such as write-through caches, may be used in connection with the principles of the invention. It should be further noted that memory reference
operations issued by the processors are preferably directed to a 64-byte cache line granularity. Since the IOP 130 and processors 102-108 may update data in their private caches without updating shared memory 150, a cache coherence protocol is utilized
to maintain consistency among the caches.
The cache coherence protocol of the illustrative embodiment is preferably a conventional write-invalidate, ownership-based protocol. "Write-Invalidate" implies that when a processor modifies a cache line, it invalidates stale copies in other
processors' caches rather than updating them with the new value. The protocol is termed an "ownership protocol" because there is always an identifiable owner for a cache line, whether it is shared memory, one of the processors or the IOP entities of the
system. The owner of the cache line is responsible for supplying the up-to-date value of the cache line when requested. A processor/IOP may own a cache line in one of two states: "exclusively" or "shared". If a processor has exclusive ownership of a
cache line, it may update it without informing the system. Otherwise, it must inform the system and potentially invalidate copies in the other caches.
A shared data structure 160 is provided for capturing and maintaining status information corresponding to the states of data used by the system. In the illustrative embodiment, the shared data structure is configured as a conventional duplicate
tag store (DTAG) 160 that cooperates with the individual caches of the system to define the coherence protocol states of the data in the system. The protocol states of the DTAG 160 are administered by a coherence controller 180, which may be implemented
as a plurality of hardware registers and combinational logic configured to produce a sequential logic circuit, such as a state machine. It should be noted, however, that other configurations of the controller and shared data structure may be
advantageously used herein.
The DTAG 160, coherence controller 180, IOP 130 and shared memory 150 are interconnected by a logical bus referred to as an Arb bus 170. Memory reference operations issued by the processors are routed via the local switch 200 to the Arb bus 170.
The order in which the actual memory reference commands appear on the Arb bus is the order in which processors perceive the results of those commands. In accordance with this embodiment of the invention, though, the Arb bus 170 and the coherence
controller 180 cooperate to provide an ordering point, as described herein.
The commands described herein are defined by the Alpha® memory system interface and may be classified into three types: requests, probes, and responses. Requests are commands that are issued by a processor when, as a result of executing a
load or store instruction, it must obtain a copy of data. Requests are also used to gain exclusive ownership to a data item (cache line) from the system. Requests include Read (Rd) commands, Read/Modify (RdMod) commands, Change-to-Dirty (CTD) commands,
Victim commands, and Evict commands, the latter of which specify removal of a cache line from a respective cache.
Probes are commands issued by the system to one or more processors requesting data and/or cache tag status updates. Probes include Forwarded Read (FRd) commands, Forwarded Read Modify (FRdMod) commands and Invalidate (Inval) commands. When a
processor P issues a request to the system, the system may issue one or more probes (via probe packets) to other processors. For example, if P requests a copy of a cache line (a Rd request), the system sends a probe to the owner processor (if any). If P
requests exclusive ownership of a cache line (a CTD request), the system sends Inval probes to one or more processors having copies of the cache line. If P requests both a copy of the cache line as well as exclusive ownership of the cache line (a RdMod
request) the system sends a FRd probe to a processor currently storing a dirty copy of a cache line of data. In response to the FRd probe, the dirty copy of the cache line is returned to the system. A FRdMod probe is also issued by the system to a
processor storing a dirty copy of a cache line. In response to the FRdMod probe, the dirty cache line is returned to the system and the dirty copy stored in the cache is invalidated. An Inval probe may be issued by the system to a processor storing a
copy of the cache line in its cache when the cache line is to be updated by another processor.
Responses are commands from the system to processors/IOPs which carry the data requested by the processor or an acknowledgment corresponding to a request. For Rd and RdMod requests, the response is a Fill and FillMod response, respectively, each
of which carries the requested data. For a CTD request, the response is a CTD-Success (Ack) or CTD-Failure (Nack) response, indicating success or failure of the CTD, whereas for a Victim request, the response is a Victim-Release response.
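The request-to-response pairing described above can be summarized as a lookup table. This is only an illustrative sketch: the dictionary and function names are hypothetical, and Evict is omitted because the text names no response for it.

```python
# Responses the system may return for each request type, per the description above.
RESPONSES = {
    "Rd": ("Fill",),
    "RdMod": ("FillMod",),
    "CTD": ("CTD-Success", "CTD-Failure"),  # Ack or Nack
    "Victim": ("Victim-Release",),
}

def possible_responses(request):
    """Return the tuple of response commands that may answer a request."""
    return RESPONSES[request]

def carries_data(response):
    """Fill and FillMod carry the requested data; the others are acknowledgments."""
    return response in ("Fill", "FillMod")

for request in RESPONSES:
    print(request, "->", ", ".join(possible_responses(request)))
```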
FIG. 2 is a schematic block diagram of the local switch 200 comprising a plurality of ports 202-210, each of which is coupled to a respective processor (P1-P4) 102-108 and IOP 130 via a full-duplex, bi-directional clock forwarded data link. Each
port includes a respective input queue 212-220 for receiving, e.g., a memory reference request issued by its processor and a respective output queue 222-230 for receiving, e.g., a memory reference probe issued by system control logic associated with the
switch. An arbiter 240 arbitrates among the input queues to grant access to the Arb bus 170 where the requests are ordered into a memory reference request stream. In the illustrative embodiment, the arbiter selects the requests stored in the input
queues for access to the bus in accordance with an arbitration policy, such as a conventional round-robin algorithm.
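A conventional round-robin policy of the kind mentioned above might look like the following sketch. The class name, the port numbering from zero and the request strings are assumptions for illustration; the text does not specify how arbiter 240 is implemented.

```python
from collections import deque

class RoundRobinArbiter:
    """Hypothetical model of an arbiter choosing among per-port input queues."""

    def __init__(self, num_ports):
        self.queues = [deque() for _ in range(num_ports)]
        self._next = 0  # port that has priority on the next grant

    def enqueue(self, port, request):
        self.queues[port].append(request)

    def grant(self):
        """Return (port, request) for the next non-empty queue in round-robin
        order, or None if every input queue is empty."""
        n = len(self.queues)
        for offset in range(n):
            port = (self._next + offset) % n
            if self.queues[port]:
                self._next = (port + 1) % n  # the port after the winner goes first next time
                return port, self.queues[port].popleft()
        return None

arbiter = RoundRobinArbiter(5)  # four processors plus the IOP
arbiter.enqueue(0, "Rd x")
arbiter.enqueue(0, "Rd y")
arbiter.enqueue(2, "CTD z")
grant = arbiter.grant()
while grant is not None:
    print(grant)
    grant = arbiter.grant()
```

The sequence of grants is the order in which requests would reach the bus, which is what forms the memory reference request stream in this model.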
The following example illustrates the typical operation of a multiprocessing system including switch 200. A Rd request for data item x is received at the switch 200 from P1 and loaded into input queue 212. The arbiter 240 selects the request in
accordance with the arbitration algorithm. Upon gaining access to the Arb bus 170, the selected request is routed to the ordering point 250 wherein the states of the corresponding cache lines are interrogated in the DTAG 160. Specifically, the
coherence controller 180 examines the DTAG to determine which entity of the system "owns" the cache line and which entities have copies of the line. If processor P3 is the owner of the cache line x and P4 has a copy, the coherence controller generates
the necessary probes (e.g., a Fill x and Inval x) and forwards them to the output queues 226 and 228 for transmission to the processors.
Because of operational latencies through the switch and data paths of the system, memory reference requests issued by P1 may complete out-of-order. In some cases, out-of-order completion may affect the consistency of data in the system,
particularly for updates to a cache line. Memory consistency models provide formal specifications of how such updates become visible to the entities of the multiprocessor system. In the illustrative embodiment of the present invention, a weak ordering
consistency model is described, although it will be apparent to those skilled in the art that other consistency models may be used.
In a weakly-ordered system, inter-reference ordering is typically imposed by a memory barrier (MB) instruction inserted between memory reference instructions of a program executed by a processor. The MB instruction separates and groups those
instructions of a program that need ordering from the rest of the instructions. The semantics of weak ordering mandate that all pre-MB memory reference operations are logically ordered before all post-MB references. For example, the following program
instructions are executed by P1 and P2:
P1               P2
St x             Ld flag, 0
St y             MB
St z             Rd x
MB               Rd y
St flag, 0       Rd z
In the case of P1's program, it is desired to store (via a write operation) all of the data items x, y and z before modifying the value of the flag; the programmer indicates this intention by placing the MB instruction after St z. According to
the weak-ordering semantics, the programmer doesn't care about the order in which the pre-MB store instructions issue as memory reference operations, nor does she care about the order in which the post-MB references appear to the system. Essentially,
the programmer only cares that every pre-MB store instruction appears before every post-MB instruction. At P2, a load (via a read operation) flag is performed to test for the value 0. Testing of the flag is ordered with respect to acquiring the data
items x, y and z as indicated by the MB instruction. Again, it is not necessary to impose order on the individual post-MB instructions.
To ensure correct implementation of the consistency model, prior systems inhibit program execution past the MB instruction until actual completion of all pre-MB operations has been confirmed to the processor. Maintaining inter-reference order
from all pre-MB operations to all post-MB operations typically requires acknowledgment responses and/or return data to signal completion of the pre-MB operations. The acknowledgment responses may be gathered and sent to the processor issuing the
operations. The pre-MB operations are considered completed only after all responses and data are received by the requesting processor. Thus, referring to the example above with respect to operation of a prior multiprocessing system, once P1 has
received the data and acknowledgment responses (e.g., an Inval acknowledgment) corresponding to an operation, the operation is considered complete.
Since each memory reference operation may consist of a number of commands, the latency of inter-reference ordering is a function of the extent to which each command must complete before the reference is considered ordered. The present invention
relates to a mechanism for reducing the latency of inter-reference ordering between sets of memory reference operations in a multiprocessor system having a shared memory that is distributed among a plurality of processors configured to issue and complete
those operations out-of-order.
The mechanism generally comprises a commit-signal that is generated by the ordering point 250 of the multiprocessor system in response to a memory reference operation issued by a requesting processor for particular data. FIG. 3 is a schematic
diagram of a commit-signal that is preferably implemented as a commit-signal packet structure 300 characterized by the assertion of a single, commit-signal ("C") bit 310 to the processor. It will be apparent to those skilled in the art that the
commit-signal may be manifested in a variety of forms, including a discrete signal on a wire, and in another embodiment, a packet identifying the operation corresponding to the commit-signal. Program execution may proceed past the MB instruction once
commit-signals for all pre-MB operations have been received by the processor, thereby increasing the performance of the system. The commit-signal facilitates inter-reference ordering by indicating the apparent completion of the memory reference
operation to the entities of the system.
Referring again to the above example including the program instructions executed by P1, generation of a commit-signal by the ordering point 250 in response to each RdMod request for data items x, y and z (corresponding to each store instruction
for those data items) issued by P1 occurs upon successful arbitration and access to the Arb bus 170, and total ordering of those requests with respect to all memory reference requests appearing on the bus. Total ordering of each memory reference request
constitutes a commit-event for the requested operation. The commit-signal 300 is preferably transmitted to P1 upon the occurrence of, or after, the commit-event.
The ordering point 250 determines the state of the data items throughout the system and generates probes (i.e., probe packets) to invalidate copies of the data and to request forwarding of the data from the owner to the requesting processor P1.
For example, the ordering point may generate a FRdMod probe to P3 (i.e., the owner) and Inval probes to P2 and P4. The ordering point also generates the commit-signal at this time for transmission to P1. The commit-signal and probe packets are loaded
into the output queues and forwarded to the respective processors in single, first-in, first-out (FIFO) order; in the case of P1, the commit-signal is loaded into queue 222 and forwarded to P1 along with any other probes pending in the queue. As an
optimization, the commit-signal 300 may be "piggy backed" on top of one of these probe packets; in the illustrative embodiment of such an optimization, the C-bit of a probe packet may be asserted to indicate that a commit-signal is being sent.
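The handling of a RdMod request at the ordering point can be sketched as below. The sketch rests on several assumptions: the DTAG is modeled as a dictionary of owner and sharer names, the packet tuples and the function name `order_rdmod` are hypothetical, and DTAG state updates are left out because the passage does not describe them.

```python
from collections import deque

def order_rdmod(requester, address, dtag, output_queues):
    """Model the ordering point's reaction to a totally ordered RdMod request.

    dtag maps address -> {"owner": processor name or None, "sharers": set of names}.
    output_queues maps processor name -> FIFO deque of packets for that processor.
    """
    state = dtag[address]
    owner = state["owner"]
    if owner is not None and owner != requester:
        # Ask the owner to forward the data and invalidate its own copy.
        output_queues[owner].append(("FRdMod", address))
    for sharer in sorted(state["sharers"]):
        if sharer != requester and sharer != owner:
            output_queues[sharer].append(("Inval", address))
    # The commit-signal is generated at the same time and queued behind any
    # probes already pending for the requester, preserving FIFO order.
    output_queues[requester].append(("Commit", address))

queues = {p: deque() for p in ("P1", "P2", "P3", "P4")}
dtag = {"x": {"owner": "P3", "sharers": {"P2", "P4"}}}
queues["P1"].append(("Inval", "w"))  # a probe that was ordered earlier
order_rdmod("P1", "x", dtag, queues)
for name in sorted(queues):
    print(name, list(queues[name]))
```

In this model the earlier probe to P1 stays ahead of the commit-signal in P1's queue, so P1 handles it first.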
As a further example using an illustrative processor algorithm, issuance of each memory reference operation by P1 increments a counter 270 and receipt by P1 of each commit-signal responsive to the issued memory reference decrements the counter.
When program execution reaches the MB instruction and the counter realizes a value of zero, the previously issued operations are considered committed and execution of the program may proceed past the MB. Those probes originating before the commit-signal
are ordered by the ordering point 250 before the commit-signal and, thus, are dealt with before the commit-signal is received by the processor.
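The counter-based algorithm just described can be sketched in a few lines. The class and method names are hypothetical, and the underflow check is an addition for the sketch rather than something the text specifies.

```python
class CommitCounter:
    """Hypothetical model of counter 270 in a processor."""

    def __init__(self):
        self.outstanding = 0

    def issue(self):
        """A memory reference operation is issued to the system."""
        self.outstanding += 1

    def commit_signal(self):
        """A commit-signal for a previously issued operation is received."""
        if self.outstanding == 0:
            raise RuntimeError("commit-signal received with no operation outstanding")
        self.outstanding -= 1

    def may_pass_mb(self):
        """Execution may proceed past an MB only when the counter is zero."""
        return self.outstanding == 0

counter = CommitCounter()
for _ in ("St x", "St y", "St z"):
    counter.issue()
print(counter.may_pass_mb())  # three operations still uncommitted
for _ in range(3):
    counter.commit_signal()
print(counter.may_pass_mb())  # all pre-MB operations committed
```

Note that the check depends only on commit-signals, not on the arrival of data or Inval acknowledgments.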
In the prior art, the Inval probes forwarded to P2 and P4 return Inval acknowledgments that travel over control paths to the switch where they are loaded into P1's output queue 222. The FillMod data provided by P3 in response to the FRdMod
probe is also forwarded to the switch 200 and enqueued in queue 222 for delivery to P1. If delivery of the acknowledgments or FillMod data to P1 is delayed because of operational latencies through the paths and queues of the system, P1 must still wait
for these responses before proceeding with further program instruction execution, thus decreasing the performance of the prior art system.
However, since the novel commit-signal is sent to P1 in parallel with the probes sent to P2-P4, P1 may receive the commit-signal before receiving any acknowledgments or data. According to the invention, P1 does not have to wait to receive the
data or acknowledgments before proceeding past the MB instruction; it only has to wait on the commit-signal for the RdMod request. That is, P1 has committed once it receives the appropriate commit-signal and it may proceed past the MB to commence
executing the next program instruction. This feature of the invention provides a substantial performance enhancement for the system.
SMP System:
FIG. 4 is a schematic block diagram of a second multiprocessing system embodiment, such as a large SMP system 400, comprising a plurality of SMP nodes 602-616 interconnected by a hierarchical switch 500. Each of the nodes is coupled to the
hierarchical switch by a respective full-duplex, bi-directional, clock forwarded hierarchical switch (HS) link 622-636. Data is transferred between the nodes in the form of packets. In order to couple to the hierarchical switch, each SMP node is
augmented to include a global port interface. Also, in order to provide a distributed shared memory environment, each node is configured with an address space and a directory for that address space. The address space is generally partitioned into
memory space and IO space. The processors and IOP of each node utilize private caches to store data strictly for memory-space addresses; IO space data is not cached in private caches. Thus, the cache coherency protocol employed in system 400 is
concerned solely with memory space commands.
As used herein with the large SMP system embodiment, all commands originate from either a processor or an IOP, where the issuing processor or IOP is referred to as the "source processor." The address contained in a request command is referred to
as the "requested address." The "home node" of the address is the node whose address space maps to the requested address. The request is termed "local" if the source processor is on the home node of the requested address; otherwise, the request is
termed a "global" request. The Arb bus at the home node is termed the "home Arb bus". The "home directory" is the directory corresponding to the requested address. The home directory and memory are thus coupled to the home Arb bus for the requested
address.
A memory reference operation (request) emanating from a processor or IOP is first routed to the home Arb bus. The request is routed via the local switch if the request is local; otherwise, it is considered a global request and is routed over the
hierarchical switch. In this latter case, the request traverses the local switch and the GP link to the global port, passes over the HS link to the hierarchical switch, and is then forwarded over the GP link and local switch of the home node to the home
Arb bus.
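The local/global distinction and the routing path just described can be sketched as follows. The address-to-node mapping (equal, contiguous 1 GiB slices per node) and all function names are assumptions made purely for illustration; the description does not specify how the address space is partitioned among nodes, and the sketch lists only the hops named in the text.

```python
NODE_SPACE_BYTES = 1 << 30  # hypothetical: each node owns a 1 GiB slice


def home_node_of(requested_address):
    # Assumed mapping: contiguous, equal-sized slices, one per node.
    return requested_address // NODE_SPACE_BYTES


def route_to_home_arb_bus(source_node, requested_address):
    """Return (request kind, hops) for a request headed to the home Arb bus."""
    home_node = home_node_of(requested_address)
    if source_node == home_node:
        # Local request: routed via the local switch only.
        return "local", ["local switch", "home Arb bus"]
    # Global request: routed over the hierarchical switch.
    return "global", [
        f"local switch of node {source_node}",
        f"GP link of node {source_node}",
        f"global port of node {source_node}",
        "HS link",
        "hierarchical switch",
        f"GP link of node {home_node}",
        f"local switch of node {home_node}",
        "home Arb bus",
    ]


local_kind, local_hops = route_to_home_arb_bus(0, 0x10)
global_kind, global_hops = route_to_home_arb_bus(0, 3 * NODE_SPACE_BYTES + 8)
print(local_kind, local_hops)
print(global_kind, global_hops)
```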
FIG. 5 is a schematic block diagram of the hierarchical switch 500 comprising a plurality of input ports 502-516 and a plurality of output ports 542-556. The input ports 502-516 receive command packets from the global ports of the nodes coupled
to the switch, while the output ports 542-556 forward packets to those global ports. In the illustrative embodiment of the hierarchical switch 500, associated with each input port is an input (queue) buffer 522-536 for temporarily storing the received
commands. Although the drawing illustrates one buffer for each input port, buffers may be alternatively shared among any number of input ports. An example of a hierarchical switch (including the logic associated with the ports) that is suitable for use
in the illustrative embodiment of the invention is described in copending and commonly-assigned U.S. patent application Ser. No. 08/957,298, filed Oct. 24, 1997 and titled, Order Supporting Mechanism For Use In A Switch-Based Multi-Processor System,
which application is hereby incorporated by reference as though fully set forth herein.
In the large SMP system, the ordering point is associated with the hierarchical switch 500. According to the present invention, the hierarchical switch 500 is configured to support novel ordering properties in order that commit signals may be
gainfully employed. The ordering properties are imposed by generally controlling the order of command packets passing through the switch. For example, command packets from any of the input buffers 522-536 may be forwarded in various specified orders to
any of the output ports 542-556 via multiplexer circuits 562-576.
As described herein, the ordering properties apply to commands that contain probe components (Invals, FRds, and FrdMods). These commands are referred to as probe-type commands. One ordering property of the hierarchical switch is that it imposes
an order on incoming probe-type commands. That is, it enqueues them into a logical FIFO queue based on time of arrival. For packets that arrive concurrently (in the same clock), it picks an arbitrary order and places them in the FIFO queue. A second
ordering property of the switch is its ability to "atomically" multicast all probe-type packets. All probe-type packets are multicast to target nodes as well as to the home node and the source node. In this context, "atomic multicast" means that for
any pair of probe-type commands A and B, either all components of A appear before all components of B or vice versa. Together, these two properties result in a total ordering of all probe-type packets. The total ordering is accomplished using the input
buffers in conjunction with control logic and multiplexers.
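How the two ordering properties combine into a total order may be illustrated by a small software model. This is a sketch under stated assumptions, not the switch logic itself: the class name, the tuple layout of a command, and the use of list order to break same-clock ties are all hypothetical.

```python
from collections import deque


class HierarchicalSwitchModel:
    """Toy model of the FIFO ordering and atomic multicast properties."""

    def __init__(self, num_ports):
        self.fifo = deque()                            # logical FIFO queue
        self.outputs = [[] for _ in range(num_ports)]  # per-output-port streams

    def arrive(self, same_clock_commands):
        # Property 1: enqueue by time of arrival. Commands arriving in the
        # same clock get an arbitrary order; list order is used here.
        for cmd in same_clock_commands:
            self.fifo.append(cmd)

    def drain(self):
        while self.fifo:
            name, source, home, targets = self.fifo.popleft()
            # Property 2: atomic multicast. Every component of this command
            # is emitted before the next command is dequeued.
            for port in sorted({source, home, *targets}):
                self.outputs[port].append(name)


switch = HierarchicalSwitchModel(num_ports=4)
# Each command: (name, source node, home node, target nodes)
switch.arrive([("A", 0, 1, {2, 3}), ("B", 2, 1, {0})])
switch.drain()
print(switch.outputs)  # [['A', 'B'], ['A', 'B'], ['A', 'B'], ['A']]
```

Every output port that sees both commands sees them in the same relative order, which is the total ordering property relied upon by the commit-signal mechanism.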
FIG. 6 is a schematic block diagram of an augmented SMP node 600 comprising a plurality of processors (P) 102-108 interconnected with a shared memory 150, an IOP 130 and a global port interface 610 via a local switch 625. The processor, shared
memory and IOP entities are similar to those entities of FIG. 1. The local switch 625 is augmented (with respect to switch 200) to include an additional port coupling the interface 610 by way of a full-duplex, clock forwarded global port (GP) data
link 612. In addition to the DTAG 160, an additional shared data structure, or directory (DIR) 650, is coupled to Arb bus 170 to administer the distributed shared memory environment of the large system 400.
The global port interface 610 includes a loop commit-signal (LoopComSig) table 700 for monitoring outstanding probe-type commands from the SMP node. All probe-type commands are multicast by the hierarchical switch to all target nodes as well as
to the home node and the source node. The component sent to the source node serves as the commit signal whereas the one to the home node (when the home node is not the source node) serves as the probe-delivery-acknowledgment (probe-ack). In the
illustrative embodiment, the LoopComSig table 700 is implemented as a content addressable memory device, although other configurations and structures of the table may be used. Each time a probe-type command is sent to the global port, an entry is
created in the LoopComSig table; when a corresponding probe-ack returns to the node's Arb bus, the entry is cleared.
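The entry lifecycle of the LoopComSig table can be sketched in software as a dictionary keyed by address, which loosely mirrors a content addressable lookup. The class and method names are hypothetical, and counting multiple outstanding commands per address is an assumption of this sketch rather than something the description states.

```python
class LoopComSigTable:
    """Toy model of the table tracking outstanding probe-type commands."""

    def __init__(self):
        # address -> number of outstanding probe-type commands (assumed)
        self._outstanding = {}

    def probe_sent(self, address):
        # An entry is created when a probe-type command goes to the global port.
        self._outstanding[address] = self._outstanding.get(address, 0) + 1

    def probe_ack_returned(self, address):
        # The entry is cleared when the matching probe-ack reaches the Arb bus.
        count = self._outstanding.get(address, 0)
        if count <= 1:
            self._outstanding.pop(address, None)
        else:
            self._outstanding[address] = count - 1

    def is_outstanding(self, address):
        # Content-addressed lookup by address.
        return address in self._outstanding


table = LoopComSigTable()
table.probe_sent(0x1000)
print(table.is_outstanding(0x1000))  # True
table.probe_ack_returned(0x1000)
print(table.is_outstanding(0x1000))  # False
```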
Thus, the LoopComSig table is used to determine if a probe-type command corresponding to a particular address x is outstanding from the node at any specific time. This information is used to optimize the generation of commit signals for local
commands as follows: In the case of a local command appearing on the home Arb bus, if the coherence controller determines that no probe-type commands need to be sent to other nodes and if there are no outstanding probe-type commands as indicated by the
LoopComSig table, then the commit-signal is sent directly to the source processor. In the embodiment that does not include the LoopComSig table, commit signals for local commands always originate at the hierarchical switch. Using the LoopComSig table,
the coherence controller is able to generate commit signals locally and hence reduce the latency of commit signals for a substantial fraction of loc | | |