|
Claims  |
|
|
What is claimed is:
1. Apparatus for reducing latency of inter-reference ordering between memory reference operations of a shared memory multiprocessor system having a consistency model, the
system having a plurality of processors, each processor capable of issuing a memory reference operation to the system, the apparatus comprising:
a commit-signal structure generated by control logic of the multiprocessor system in response to each issued memory reference operation, the commit-signal generated substantially sooner than completion of the memory reference operation, the
commit-signal indicating apparent completion of the memory reference operation rather than actual completion of the operation;
wherein each processor employs the commit signal to impose inter-reference ordering.
2. The apparatus of claim 1 wherein the consistency model is weak-ordering and wherein inter-reference ordering is imposed by a memory barrier (MB) instruction inserted between sets of the memory reference instructions of a program executed by
the issuing processor, the sets of memory reference instructions issued to the system as pre-MB memory reference operations and post-MB memory reference operations.
3. The apparatus of claim 2 wherein ordering of an issued memory reference operation constitutes a commit-event for the issued operation and wherein the commit-signal is transmitted to the issuing processor upon the occurrence of, or after, the
commit-event.
4. The apparatus of claim 3 wherein program execution is permitted to proceed past the MB instruction once commit signals for all pre-MB memory reference operations have been received by the issuing processor.
5. The apparatus of claim 4 further comprising a counter of the issuing processor, the processor (i) incrementing the counter upon the issuance of each pre-MB memory reference operation and (ii) decrementing the counter upon receipt of each
commit-signal responsive to an issued pre-MB memory reference operation, such that when execution of the program reaches the MB instruction and a value of the counter is zero, program execution is permitted to proceed past the MB instruction.
6. The apparatus of claim 1 wherein the consistency model is sequential consistency and wherein inter-reference ordering is generally imposed by requiring completion of a current memory reference operation prior to issuing a next memory
reference operation.
7. The apparatus of claim 6 wherein ordering of an issued memory reference operation constitutes a commit-event for the issued operation and wherein the commit-signal is transmitted to the issuing processor upon the occurrence of, or after, the
commit-event.
8. The apparatus of claim 7 wherein receipt of the transmitted commit-signal by the issuing processor constitutes commitment of the issued memory reference operation and wherein the next memory reference operation is permitted to issue once
commitment of the current memory reference operation has been received by the issuing processor.
9. Apparatus for reducing the latency of inter-reference ordering between memory reference operations issued by a processor to a symmetric multiprocessing (SMP) node having a shared memory that is distributed among a plurality of processors,
each processor having a private cache and a programming interface to the distributed shared memory, the apparatus comprising:
a local switch of the SMP node, the local switch interconnecting the processors and shared memory;
an ordering point coupled to the local switch and configured to totally order the memory reference operations issued to the node; and
a commit-signal structure generated by the ordering point in response to each memory reference operation issued to the node, the commit-signal indicating the apparent completion of the memory reference operation to the processor issuing the
operation.
10. The apparatus of claim 9 wherein the ordering point comprises:
an arbiter (Arb) bus coupled to the switch, the Arb bus functioning as a serialization point for the memory reference operations issued to the node; and
a coherence controller coupled to the Arb bus, the coherence controller generating memory reference probes in response to issued memory reference operations.
11. The apparatus of claim 10 wherein the local switch comprises:
a plurality of ports, each coupled to a respective processor via a full-duplex, bi-directional clock forwarded data link;
a plurality of input queues, each associated with a respective port, for receiving memory reference operations issued by its respective processor; and
a plurality of output queues, each associated with a respective port, for receiving the memory reference probes generated by the coherence controller, each of the output queues configured to receive the commit-signal generated by the ordering
point, the output queues further configured to transmit the probes and commit-signal to the respective processors in first-in, first-out (FIFO) order.
12. The apparatus of claim 11 wherein the local switch further comprises an arbiter configured to implement an arbitration policy for arbitrating among the input queues to grant access to the Arb bus.
13. The apparatus of claim 12 wherein the arbitration policy is a round-robin algorithm.
14. Apparatus for reducing the latency of inter-reference ordering between memory reference operations issued by a processor to a symmetric multiprocessing (SMP) system comprising a plurality of SMP nodes, each node having a shared memory that
is distributed among a plurality of processors, each processor having a private cache and a programming interface to the distributed shared memory, the apparatus comprising:
a hierarchical switch of the SMP system, the hierarchical switch interconnecting the plurality of SMP nodes;
an ordering point of the hierarchical switch configured to atomically multicast and totally order the memory reference operations issued to the system; and
a commit-signal structure generated by the ordering point in response to each memory reference operation issued to the system, the commit-signal comprising one of a plurality of commit-signal types, each type of commit-signal indicating the
apparent completion of the memory reference operation to the processor issuing the operation.
15. The apparatus of claim 14 wherein the hierarchical switch comprises:
a plurality of input ports, each coupled to a respective node via a full-duplex, bi-directional clock forwarded link, for receiving memory reference operations issued by a processor of its respective node, the memory reference operations issued
as commands by the processor;
a plurality of output ports, each coupled to the respective port via the bi-directional link, for forwarding the commands to at least one SMP node; and
a plurality of input buffers, each associated with a respective input port, for temporarily storing the received commands.
16. The apparatus of claim 15 wherein the hierarchical switch further comprises a plurality of multiplexer circuits having a plurality of inputs and an output, each of input of each multiplexer coupled to a respective input buffer of the
hierarchical switch such that a command received from any of the input buffers may be forwarded in any order to any of the output ports.
17. The apparatus of claim 16 wherein the ordering point of the hierarchical switch comprises the input buffers and multiplexer circuits.
18. The apparatus of claim 17 wherein one of the commit-signal types generated by the ordering point of the hierarchical switch comprises a type 1 commit-signal corresponding to a global command issued by a processor to an address that is
assigned to a SMP node other than the node on which the issuing processor resides.
19. The apparatus of claim 17 wherein one of the commit-signal types generated by the ordering point of the hierarchical switch comprises a type 2 commit-signal corresponding to a local command issued by a processor to an address that is
assigned to the SMP node on which the issuing processor resides, the SMP node comprising a local ordering point for generating external memory reference probes to the address in response to the issued local command, the type 2 commit-signal characterized
by generating the external probes and having outstanding probes.
20. The apparatus of claim 17 wherein one of the commit-signal types generated by the ordering point of the hierarchical switch comprises a type 3 commit-signal corresponding to a local command issued by a processor to an address that is
assigned to the SMP node on which the issuing processor resides, the SMP node comprising a local ordering point for generating external memory reference probes to the address in response to the issued local command, the type 2 commit-signal characterized
by having no generated external probes but having outstanding probes to the address.
21. Apparatus for reducing latency of inter-reference ordering between memory reference operations of a shared memory multiprocessor system having a consistency model, the system having a plurality of processors, each processor capable of
issuing a memory reference operation to the system, the apparatus comprising:
a commit-signal structure generated by control logic of the multiprocessor system in response to each issued memory reference operation, the commit-signal generated substantially sooner than completion of the memory reference operation, the
commit-signal indicating apparent completion of the memory reference operation rather than actual completion of the operation;
a memory barrier (MB) instruction inserted between sets of the memory reference instructions of a program executed by the issuing processor, the sets of memory reference instructions issued to the system as pre-MB memory reference operations and
post-MB memory reference operations to impose inter-reference ordering in cooperation with said commit signal. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
FIELD OF THE INVENTION
The invention relates to multiprocessor systems and, more particularly, to the efficient ordering of memory reference operations issued by a processor of a multiprocessor system.
BACKGROUND OF THE INVENTION
Multiprocessing systems, such as symmetric multi-processors, provide a computer environment wherein software applications may operate on a plurality of processors using a single address space or shared memory abstraction. In a shared memory
system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value; this frees the programmer to focus on program development, e.g., algorithms, rather than managing partitioned data
sets and communicating values. Interprocessor synchronization is typically accomplished in a shared memory system between processors performing read and write operations to "synchronization variables" either before and after accesses to "data
variables".
For instance, consider the case of a processor P1 updating a data structure and processor P2 reading the updated structure after synchronization. Typically, this is accomplished by P1 updating data values and subsequently setting a semaphore or
flag variable to indicate to P2 that the data values have been updated. P2 checks the value of the flag variable and, if set, subsequently issues read operations (requests) to retrieve the new data values. Note the significance of the term
"subsequently" used above; if P1 sets the flag before it completes the data updates or if P2 retrieves the data before it checks the value of the flag, synchronization is not achieved. The key is that each processor must individually impose an order on
its memory references for such synchronization techniques to work. The order described above is referred to as a processor's inter-reference order. Commonly used synchronization techniques require that each processor be capable of imposing an
inter-reference order on its issued memory reference operations.
______________________________________ P1 P2 Store Data, New-value L1: Load Flag Store Flag, 0 BNZ L1 Load Data ______________________________________
*The inter-reference order imposed by a processor is defined by its memory reference ordering model or, more commonly, its consistency model. The consistency model for a processor architecture specifies, in part, a means by which the
inter-reference order is specified. Typically, the means is realized by inserting a special memory reference ordering instruction, such as a Memory Barrier (MB) or "fence", between sets of memory reference instructions. Alternatively, the means may be
implicit in other opcodes, such as in "test-and-set". In addition, the model specifies the precise semantics (meaning) of the means. Two commonly used consistency models include sequential consistency and weak-ordering, although those skilled in the
art will recognize that there are other models that may be employed, such as release consistency.
Sequential Consistency
In a sequentially consistent system, the order in which memory reference operations appear in an execution path of the program (herein referred to as the "I-stream order") is the inter-reference order. Additional instructions are not required to
denote the order simply because each load or store instruction is considered ordered before its succeeding operation in the I-stream order.
Consider the program example below. The program performs as expected on a sequentially consistent system because the system imposes the necessary inter-reference order. That is, P1's first store instruction is ordered before P1's store-to-flag
instruction. Similarly, P2's load flag instruction is ordered before P2's load data instruction. Thus, if the system imposes the correct inter-reference ordering and P2 retrieves the value 0 for the flag, P2 will also retrieve the new value for data.
Weak Ordering
In a weakly-ordered system, an order is imposed between selected sets of memory reference operations, while other operations are considered unordered. One or more MB instructions are used to indicate the required order. In the case of an MB
instruction defined by the Alpha.RTM. 21264 processor instruction set, the MB denotes that all memory reference instructions above the MB (i.e., pre-MB instructions) are ordered before all reference instructions after the MB (i.e., post-MB
instructions). However, no order is required between reference instructions that are not separated by an MB.
______________________________________ P1: P2: Store Data1, New-value1 L1: Load Flag Store Data2, New-value2 MB BNZ L1 Store Flag, 0 MB Load Data1 Load Data2 ______________________________________
In above example, the MB instruction implies that each of P1's two pre-MB store instructions are ordered before P1's store-to-flag instruction. However, there is no logical order required between the two pre-MB store instructions. Similarly,
P2's two post-MB load instructions are ordered after the Load flag; however, there is no order required between the two post-MB loads. It can thus be appreciated that weak ordering reduces the constraints on logical ordering of memory references,
thereby allowing a processor to gain higher performance by potentially executing the unordered sets concurrently.
The prior art includes other types of barriers as described in literature and as implemented on commercial processors. For example, a write-MB (WMB) instruction on an Alpha microprocessor requires only that pre-WMB store instructions be
logically ordered before any post-WMB stores. In other words, the WMB instruction places no constraints at all on load instructions occurring before or after the WMB.
In order to increase performance, modern processors do not execute memory reference instructions one at a time. It is desirable that a processor keep a large number of memory references outstanding and issue, as well as complete, memory
reference operations out-of-order. This is accomplished by viewing the consistency model as a "logical order", i.e., the order in which memory reference operations appear to happen, rather than the order in which those references are issued or
completed. More precisely, a consistency model defines only a logical order on memory references; it allows for a variety of optimizations in implementation. It is thus desired to increase performance by reducing latency and allowing (on average) a
large number of outstanding references, while preserving the logical order implied by the consistency model.
In prior systems, a memory barrier instruction is typically contingent upon "completion" of an operation. For example, when a source processor issues a read operation, the operation is considered complete when data is received at the source
processor. When executing a store instruction, the source processor issues a memory reference operation to acquire exclusive ownership of the data; in response to the issued operation, system control logic generates "probes" to invalidate old copies of
the data at other processors and to request forwarding of the data from the owner processor to the source processor. Here the operation completes only when all probes reach their destination processors and the data is received at the source processor.
Broadly stated, these prior systems rely on completion to impose inter-reference ordering. For instance, in a weakly-ordered system employing MB instructions, all pre-MB operations must be complete before the MB is passed and post-MB operations
may be considered. Essentially, "completion" of an operation requires actual completion of all activity, including receipt of data and acknowledgments for probes, corresponding to the operation. Such an arrangement is inefficient and, in the context of
inter-reference ordering, adversely affects latency.
Therefore, the present invention is directed to increasing the efficiency of a shared memory multiprocessor system by relaxing the completion requirement while preserving the consistency model. The invention is further directed to improving the
performance of a shared memory system by reducing the latency associated with memory barriers.
SUMMARY OF THE INVENTION
The invention relates to a mechanism for reducing the latency of inter-reference ordering between sets of memory reference operations in a multiprocessor system having a shared memory. The mechanism generally comprises a commit-signal that is
generated by control logic of the multiprocessor system in response to an issued memory reference operation. According to the invention, the commit-signal facilitates inter-reference ordering; moreover, the commit signal indicates the apparent
completion of the memory reference operation, rather than actual completion of the operation. Notably, the apparent completion of an operation occurs substantially sooner than the actual completion of an operation, thereby improving performance of the
multiprocessor system.
In the illustrative embodiment, inter-reference ordering may be imposed by a memory barrier (MB) instruction inserted between memory reference instructions of a program executed by a processor. Execution of these instructions within the
processor may cause out-of-order issuance and completion of external memory reference operations due to operational latencies throughout the system. To ensure correct implementation of the consistency model, prior systems inhibit program execution past
the MB instruction until actual completion of all pre-MB operations have been confirmed to the processor. According to the present invention, however, program execution may proceed past the MB instruction once the commit signals for the operations have
been received by the processor.
As described herein, the multiprocessing system may comprise (i) a symmetric multiprocessing (SMP) node wherein the processor and shared memory entities are interconnected by a local switch or (ii) a SMP system wherein a plurality of nodes are
interconnected by a hierarchical switch. Each processor preferably has a private cache for storing data and changes to the data as a result of the memory reference operations are reflected among the entities via the transmission of probe commands in
accordance with a conventional cache coherence protocol. Notably, associated with the system is an ordering point. Specifically, in the SMP node, the ordering point is associated with the local switch. In the SMP system, the ordering point is
associated with the hierarchical switch.
As an example of a SMP node with an ownership-based, write-invalidate cache coherence protocol, a requesting processor issues a memory reference operation to the system requesting particular data. Upon determining which entity is the owner of
the data and which entities have valid copies of the data, the ordering point totally orders the memory reference operation with respect to the other issued references using, e.g., a conventional arbitration or prioritization policy. Thereafter, the
ordering point generates probes to invalidate any copies of the data at appropriate processors and/or to request forwarding of the data from the owner processor to the requesting processor, as required by the memory operation. Significantly, the
ordering point also generates the commit-signal at this time for transmission to the requesting processor. Probes and commit signals are generated atomically for transmission to the appropriate processors. The net result is that all memory operations
appear totally ordered.
Ordering of the requested memory reference operation with respect to memory references issued by other processors of the system constitutes a commit-event for the requested operation. For the SMP node embodiment, the commit-event is the point at
which the memory reference operation is ordered at the local switch, whereas for the SMP system the commit-event occurs when the memory reference operation is ordered at the hierarchical switch. In accordance with the present invention, the
commit-signal is preferably transmitted to the requesting processor upon the occurrence of, or after, such a commit-event.
As a further example using an illustrative processor algorithm, issuance of each memory reference operation by the processor preferably increments a counter and receipt by the processor of each commit-signal responsive to the issued memory
reference decrements the counter. When program execution reaches the MB instruction and the counter realizes a value of zero, the previously issued operations are considered committed and execution of the program may proceed past the MB.
BRIEF
DESCRIPTION OF THE DRAWINGS
The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numbers indicate identical or functionally similar
elements:
FIG. 1 is a schematic block diagram of a first multiprocessor node embodiment comprising a plurality of processors coupled to an input/output processor and a memory by a local switch;
FIG. 2 is a schematic block diagram of the local switch comprising a plurality of ports coupled to the respective processors of FIG. 1;
FIG. 3 is a schematic diagram of an embodiment of a novel commit-signal implemented as a commit-signal packet in accordance with the invention;
FIG. 4 is a schematic block diagram of a second multiprocessing system embodiment comprising a plurality of multiprocessor nodes interconnected by a hierarchical switch;
FIG. 5 is a schematic block diagram of the hierarchical switch according to the present invention;
FIG. 6 is a schematic block diagram of an augmented multiprocessor node comprising a plurality of processors interconnected with a shared memory, an IOP and a global port interface via a local switch;
FIG. 7 illustrates an embodiment of a LoopComSig table in accordance with the present invention;
FIG. 8 is a schematic diagram of an incoming command packet modified with a multicast-vector in accordance with the present invention;
FIG. 9 is a schematic block diagram illustrating a total ordering property of an illustrative embodiment of the hierarchical switch;
FIG. 10 is a flowchart illustrating the sequence of steps for generating and issuing a type 0 commit-signal according to the present invention;
FIG. 11 is a flowchart illustrating the sequence of steps for generating and issuing a type 1 commit-signal according to the present invention;
FIG. 12 is a flowchart illustrating the sequence of steps for generating and issuing a type 2 commit-signal according to the present invention;
FIG. 13 is a flowchart illustrating the sequence of steps for generating and issuing a type 3 commit-signal according to the invention; and
FIG. 14 is a schematic top-level diagram of an alternate embodiment of the second multiprocessing system which may be advantageously used with the present invention.
DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS
As described herein, a symmetric multi-processing (SMP) system includes a number of SMP nodes interconnected by a hierarchical switch. Each SMP node thus functions as a building block in the SMP system. Below, the structure and operation of an
SMP node embodiment that may be advantageously used with the present invention is first described, followed by a description of the SMP system embodiment.
SMP Node
FIG. 1 is a schematic block diagram of a first multiprocessing system embodiment, such as a small SMP node 100, comprising a plurality of processors (P) 102-108 coupled to an input/output (I/O) processor 130 and a memory 150 by a local switch
200. The memory 150 is preferably organized as a single address space that is shared by the processors and apportioned into a number of blocks, each of which may include, e.g., 64 bytes of data. The I/O processor, or IOP 130, controls the transfer of
data between external devices (not shown) and the system via an I/O bus 140. Data is transferred between the components of the SMP node in the form of packets. As used herein, the term "system" refers to all components of the SMP node excluding the
processors and IOP. In an embodiment of the invention, the I/O bus may operate according to the conventional Peripheral Computer Interconnect (PCI) protocol.
Each processor is a modern processor comprising a central processing unit (CPU), denoted 112-118, that preferably incorporates a traditional reduced instruction set computer (RISC) load/store architecture. In the illustrative embodiment
described herein, the CPUs are Alpha.RTM. 21264 processor chips manufactured by Digital Equipment Corporation.RTM., although other types of processor chips may be advantageously used. The load/store instructions executed by the processors are issued to
the system as memory reference, e.g., read and write, operations. Each operation may comprise a series of commands (or command packets) that are exchanged between the processors and the system. As described further herein, characteristics of modern
processors include the ability to issue memory reference operations out-of-order, to have more than one memory reference outstanding at a time and to accommodate completion of the memory reference operations in arbitrary order.
In addition, each processor and IOP employs a private cache (denoted 122-128 and 132, respectively) for storing data determined likely to be accessed in the future. The caches are preferably organized as write-back caches apportioned into, e.g.,
64-byte cache lines accessible by the processors; it should be noted, however, that other cache organizations, such as write-through caches, may be used in connection with the principles of the invention. It should be further noted that memory reference
operations issued by the processors are preferably directed to a64-byte cache line granularity. Since the IOP 130 and processors 102-108 may update data in their private caches without updating shared memory 150, a cache coherence protocol is utilized
to maintain consistency among the caches.
The cache coherence protocol of the illustrative embodiment is preferably a conventional write-invalidate, ownership-based protocol. "Write-Invalidate" implies that when a processor modifies a cache line, it invalidates stale copies in other
processors' caches rather than updating them with the new value. The protocol is termed an "ownership protocol" because there is always an identifiable owner for a cache line, whether it is shared memory, one of the processors or the IOP entities of the
system. The owner of the cache line is responsible for supplying the up-to-date value of the cache line when requested. A processor/IOP may own a cache line in one of two states: "exclusively" or "shared". If a processor has exclusive ownership of a
cache line, it may update it without informing the system. Otherwise, it must inform the system and potentially invalidate copies in the other caches.
A shared data structure 160 is provided for capturing and maintaining status information corresponding to the states of data used by the system. In the illustrative embodiment, the shared data structure is configured as a conventional duplicate
tag store (DTAG) 160 that cooperates with the individual caches of the system to define the coherence protocol states of the data in the system. The protocol states of the DTAG 160 are administered by a coherence controller 180, which may be implemented
as a plurality of hardware registers and combinational logic configured to produce a sequential logic circuit, such as a state machine. It should be noted, however, that other configurations of the controller and shared data structure may be
advantageously used herein.
The DTAG 160, coherence controller 180, IOP 130 and shared memory 150 are interconnected by a logical bus referred to an Arb bus 170. Memory reference operations issued by the processors are routed via the local switch 200 to the Arb bus 170.
The order in which the actual memory reference commands appear on the Arb bus is the order in which processors perceive the results of those commands. In accordance with this embodiment
of the invention, though, the Arb bus 170 and the coherence controller 180 cooperate to provide an ordering point, as described herein.
The commands described herein are defined by the Alpha.RTM. memory system interface and may be classified into three types: requests, probes, and responses. Requests are commands that are issued by a processor when, as a result of executing a
load or store instruction, it must obtain a copy of data. Requests are also used to gain exclusive ownership to a data item (cache line) from the system. Requests include Read (Rd) commands, Read/Modify (RdMod) commands, Change-to-Dirty (CTD) commands,
Victim commands, and Evict commands, the latter of which specify removal of a cache line from a respective cache.
Probes are commands issued by the system to one or more processors requesting data and/or cache tag status updates. Probes include Forwarded Read (Frd) commands, Forwarded Read Modify (FRdMod) commands and Invalidate (Inval) commands. When a
processor P issues a request to the system, the system may issue one or more probes (via probe packets) to other processors. For example if P requests a copy of a cache line (a Rd request), the system sends a probe to the owner processor (if any). If P
requests exclusive ownership of a cache line (a CTD request), the system sends Inval probes to one or more processors having copies of the cache line. If P requests both a copy of the cache line as well as exclusive ownership of the cache line (a RdMod
request) the system sends a FRd probe to a processor currently storing a dirty copy of a cache line of data. In response to the Frd probe, the dirty copy of the cache line is returned to the system. A FRdMod probe is also issued by the system to a
processor storing a dirty copy of a cache line. In response to the FRdMod probe, the dirty cache line is returned to the system and the dirty copy stored in the cache is invalidated. An Inval probe may be issued by the system to a processor storing a
copy of the cache line in its cache when the cache line is to be updated by another processor.
Responses are commands from the system to processors/IOPs which carry the data requested by the processor or an acknowledgment corresponding to a request. For Rd and RdMod requests, the response is a Fill and FillMod response, respectively, each
of which carries the requested data. For a CTD request, the response is a CTD-Success (Ack) or CTD-Failure (Nack) response, indicating success or failure of the CTD, whereas for a Victim request, the response is a Victim-Release response.
FIG. 2 is a schematic block diagram of the local switch 200 comprising a plurality of ports 202-210, each of which is coupled to a respective processor (P1-P4) 102-108 and IOP 130 via a full-duplex, bi-directional clock forwarded data link. Each
port includes a respective input queue 212-220 for receiving, e.g., a memory reference request issued by its processor and a respective output queue 222-230 for receiving, e.g., a memory reference probe issued by system control logic associated with the
switch. An arbiter 240 arbitrates among the input queues to grant access to the Arb bus 170 where the requests are ordered into a memory reference request stream. In the illustrative embodiment, the arbiter selects the requests stored in the input
queues for access to the bus in accordance with an arbitration policy, such as a conventional round-robin algorithm.
The following example illustrates the typical operation of multiprocessing system including switch 200. A Rd request for data item x is received at the switch 200 from P1 and loaded into input queue 212. The arbiter 240 selects the request in
accordance with the arbitration algorithm. Upon gaining access to the Arb bus 170, the selected request is routed to the ordering point 250 wherein the states of the corresponding cache lines are interrogated in the DTAG 160. Specifically, the
coherence controller 180 examines the DTAG to determine which entity of the system "owns" the cache line and which entities have copies of the line. If processor P3 is the owner of the cache line x and P4 has a copy, the coherence controller generates
the necessary probes (e.g., a Fill x and Inval x) and forwards them to the output queues 226 and 228 for transmission to the processors.
Because of operational latencies through the switch and data paths of the system, memory reference requests issued by P1 may complete out-of-order. In some cases, out-of-order completion may affect the consistency of data in the system,
particularly for updates to a cache line. Memory consistency models provide formal specifications of how such updates become visible to the entities of the multiprocessor system. In the illustrative embodiment of the present invention, a weak ordering
consistency model is described, although it will be apparent to those skilled in the art that other consistency models may be used.
In a weakly-ordered system, inter-reference ordering is typically imposed by a memory barrier (MB) instruction inserted between memory reference instructions of a program executed by a processor. The MB instruction separates and groups those
instructions of a program that need ordering from the rest of the instructions. The semantics of weak ordering mandate that all pre-MB memory reference operations are logically ordered before all post-MB references. For example, the following program
instructions are executed by P1 and P2:
______________________________________ P1 P2 ______________________________________ St x Ld flag, 0 St y MB St z Rd x MB Rd y St flag,0 Rd z ______________________________________
In the case of P1's program, it is desired to store (via a write operation) all of the data items x, y and z before modifying the value of the flag; the programmer indicates this intention by placing the MB instruction after St z. According to
the weak-ordering semantics, the programmer doesn't care about the order in which the pre-MB store instructions issue as memory reference operations, nor does she care about the order in which the post-MB references appear to the system. Essentially,
the programmer only cares that every pre-MB store instruction appears before every post-MB instruction. At P2, a load (via a read operation) flag is performed to test for the value 0. Testing of the flag is ordered with respect to acquiring the data
items x, y and z as indicated by the MB instruction. Again, it is not necessary to impose order on the individual post-MB instructions.
To ensure correct implementation of the consistency model, prior systems inhibit program execution past the MB instruction until actual completion of all pre-MB operations have been confirmed to the processor. Maintaining inter-reference order
from all pre-MB operations to all post-MB operations typically requires acknowledgment responses and/or return data to signal completion of the pre-MB operations. The acknowledgment responses may be gathered and sent to the processor issuing the
operations. The pre-MB operations are considered completed only after all responses and data are received by the requesting processor. Thus, referring to the example above with respect to operation of a prior multiprocessing system, once P1 has
received the data and acknowledgment responses (e.g., an Inval acknowledgment) corresponding to an operation, the operation is considered complete.
Since each memory reference operation may consist of a number of commands, the latency of inter-reference ordering is a function of the extent to which each command must complete before the reference is considered ordered. The present invention
relates to a mechanism for reducing the latency of inter-reference ordering between sets of memory reference operations in a multiprocessor system having a shared memory that is distributed among a plurality of processors configured to issue and complete
those operations out-of-order.
The mechanism generally comprises a commit-signal that is generated by the ordering point 250 of the multiprocessor system in response to a memory reference operation issued by a requesting processor for particular data. FIG. 3 is a schematic
diagram of a commit-signal that is preferably implemented as a commit-signal packet structure 300 characterized by the assertion of a single, commit-signal ("C") bit 310 to processor. It will be apparent to those skilled in the art that the
commit-signal may be manifested in a variety of forms, including a discrete signal on a wire or, in another embodiment, a packet identifying the operation corresponding to the commit-signal. Program execution may proceed past the MB instruction once
commit-signals for all pre-MB operations have been received by the processor, thereby increasing the performance of the system. The commit-signal facilitates inter-reference ordering; in addition, the novel signal indicates the apparent completion of
the memory reference operation, rather than actual completion of the operation. Notably, the apparent completion of an operation occurs substantially sooner than the actual completion of an operation, thereby improving performance of the multiprocessor
system.
Referring again to the above example including the program instructions executed by P1, generation of a commit-signal by the ordering point 250 in response to each RdMod request for data items x, y and z (corresponding to each store instruction
for those data items) issued by P1 occurs upon successful arbitration and access to the Arb bus 170, and total ordering of those requests with respect to all memory reference requests appearing on the bus. Total ordering of each memory reference request
constitutes a commit-event for the requested operation. According to the invention, the commit-signal 300 is preferably transmitted to P1 upon the occurrence of, or after, the commit-event.
The ordering point 250 determines the state of the data items throughout the system and generates probes (i.e., probe packets) to invalidate copies of the data and to request forwarding of the data from the owner to the requesting processor P1.
For example, the ordering point may generate FRdMod probe to P3 (i.e., the owner) and Inval probes to P2 and P4. The ordering point also generates the commit-signal at this time for transmission to the P1. The commit-signal and probe packets are loaded
into the output queues and forwarded to the respective processors in single, first-in, first-out (FIFO) order; in the case of P1, the commit-signal is loaded into queue 222 and forwarded to P1 along with any other probes pending in the queue. As an
optimization, the commit-signal 300 may be "piggy backed" on top of one of these probe packets; in the illustrative embodiment of such an optimization, the C-bit of a probe packet may be asserted to indicate that a commit-signal is | | |