WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Method and apparatus for reducing latency of inter-reference ordering in a multiprocessor system    
United States Patent6108737   
Link to this pagehttp://www.wikipatents.com/6108737.html
Inventor(s)Sharma; Madhumitra (Shrewsbury, MA), Van Doren; Stephen R. (Northborough, MA), Gharachorloo; Kourosh (Stanford, CA), Steely, Jr.; Simon C. (Hudson, NH)
AbstractA mechanism reduces the latency of inter-reference ordering between sets of memory reference operations in a multiprocessor system having a shared memory. The mechanism comprises a commit-signal that is generated by control logic of the multiprocessor system in response to an issued memory reference operation. The commit-signal facilitates inter-reference ordering; moreover, the commit signal indicates the apparent completion of the memory reference operation, rather than actual completion of the operation. The apparent completion of an operation occurs substantially sooner than the actual completion of an operation, thereby improving performance of the multiprocessor system.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 6108737
Method and apparatus for reducing latency of inter-reference ordering in
     a multiprocessor system - US Patent 6108737 Drawing
Method and apparatus for reducing latency of inter-reference ordering in a multiprocessor system
Inventor     Sharma; Madhumitra (Shrewsbury, MA) , Van Doren; Stephen R. (Northborough, MA) , Gharachorloo; Kourosh (Stanford, CA) , Steely, Jr.; Simon C. (Hudson, NH)
Owner/Assignee     Compaq Computer Corporation (Houston, TX)
Patent assignment
All assignments
Publication Date     August 22, 2000
Application Number     08/957,097
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     October 24, 1997
US Classification     710/107 709/213 711/147
Int'l Classification    
Examiner     Ray; Gopal C.
Assistant Examiner    
Attorney/Law Firm     Cesari and McKenna, LLP
Address
Parent Case    
Priority Data    
USPTO Field of Search     395/287 395/200.43 395/281 711/147 711/100 711/121 710/107 710/101 710/111 710/51 710/52 710/131 710/113 710/240 709/213 709/100 709/106 709/300 712/214 370/357 370/462 707/10
Patent Tags     reducing latency inter-reference ordering in multiprocessor
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5761731
Van Doren et al.

Jun,1998

[0 after 0 votes]
5546582
Brockmeyer et al.

Aug,1996

[0 after 0 votes]
5551005
Sarangdher et al.

Aug,1996

[0 after 0 votes]
5504900
Raz

Apr,1996

[0 after 0 votes]
5404464
Bennett

Apr,1995

[0 after 0 votes]
5182808
Bagnoli et al.

Jan,1993

[0 after 0 votes]
4298929
Capozzi

Nov,1981

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. Apparatus for reducing latency of inter-reference ordering between memory reference operations of a shared memory multiprocessor system having a consistency model, the system having a plurality of processors, each processor capable of issuing a memory reference operation to the system, the apparatus comprising:

a commit-signal structure generated by control logic of the multiprocessor system in response to each issued memory reference operation, the commit-signal generated substantially sooner than completion of the memory reference operation, the commit-signal indicating apparent completion of the memory reference operation rather than actual completion of the operation;

wherein each processor employs the commit signal to impose inter-reference ordering.

2. The apparatus of claim 1 wherein the consistency model is weak-ordering and wherein inter-reference ordering is imposed by a memory barrier (MB) instruction inserted between sets of the memory reference instructions of a program executed by the issuing processor, the sets of memory reference instructions issued to the system as pre-MB memory reference operations and post-MB memory reference operations.

3. The apparatus of claim 2 wherein ordering of an issued memory reference operation constitutes a commit-event for the issued operation and wherein the commit-signal is transmitted to the issuing processor upon the occurrence of, or after, the commit-event.

4. The apparatus of claim 3 wherein program execution is permitted to proceed past the MB instruction once commit signals for all pre-MB memory reference operations have been received by the issuing processor.

5. The apparatus of claim 4 further comprising a counter of the issuing processor, the processor (i) incrementing the counter upon the issuance of each pre-MB memory reference operation and (ii) decrementing the counter upon receipt of each commit-signal responsive to an issued pre-MB memory reference operation, such that when execution of the program reaches the MB instruction and a value of the counter is zero, program execution is permitted to proceed past the MB instruction.

6. The apparatus of claim 1 wherein the consistency model is sequential consistency and wherein inter-reference ordering is generally imposed by requiring completion of a current memory reference operation prior to issuing a next memory reference operation.

7. The apparatus of claim 6 wherein ordering of an issued memory reference operation constitutes a commit-event for the issued operation and wherein the commit-signal is transmitted to the issuing processor upon the occurrence of, or after, the commit-event.

8. The apparatus of claim 7 wherein receipt of the transmitted commit-signal by the issuing processor constitutes commitment of the issued memory reference operation and wherein the next memory reference operation is permitted to issue once commitment of the current memory reference operation has been received by the issuing processor.

9. Apparatus for reducing the latency of inter-reference ordering between memory reference operations issued by a processor to a symmetric multiprocessing (SMP) node having a shared memory that is distributed among a plurality of processors, each processor having a private cache and a programming interface to the distributed shared memory, the apparatus comprising:

a local switch of the SMP node, the local switch interconnecting the processors and shared memory;

an ordering point coupled to the local switch and configured to totally order the memory reference operations issued to the node; and

a commit-signal structure generated by the ordering point in response to each memory reference operation issued to the node, the commit-signal indicating the apparent completion of the memory reference operation to the processor issuing the operation.

10. The apparatus of claim 9 wherein the ordering point comprises:

an arbiter (Arb) bus coupled to the switch, the Arb bus functioning as a serialization point for the memory reference operations issued to the node; and

a coherence controller coupled to the Arb bus, the coherence controller generating memory reference probes in response to issued memory reference operations.

11. The apparatus of claim 10 wherein the local switch comprises:

a plurality of ports, each coupled to a respective processor via a full-duplex, bi-directional clock forwarded data link;

a plurality of input queues, each associated with a respective port, for receiving memory reference operations issued by its respective processor; and

a plurality of output queues, each associated with a respective port, for receiving the memory reference probes generated by the coherence controller, each of the output queues configured to receive the commit-signal generated by the ordering point, the output queues further configured to transmit the probes and commit-signal to the respective processors in first-in, first-out (FIFO) order.

12. The apparatus of claim 11 wherein the local switch further comprises an arbiter configured to implement an arbitration policy for arbitrating among the input queues to grant access to the Arb bus.

13. The apparatus of claim 12 wherein the arbitration policy is a round-robin algorithm.

14. Apparatus for reducing the latency of inter-reference ordering between memory reference operations issued by a processor to a symmetric multiprocessing (SMP) system comprising a plurality of SMP nodes, each node having a shared memory that is distributed among a plurality of processors, each processor having a private cache and a programming interface to the distributed shared memory, the apparatus comprising:

a hierarchical switch of the SMP system, the hierarchical switch interconnecting the plurality of SMP nodes;

an ordering point of the hierarchical switch configured to atomically multicast and totally order the memory reference operations issued to the system; and

a commit-signal structure generated by the ordering point in response to each memory reference operation issued to the system, the commit-signal comprising one of a plurality of commit-signal types, each type of commit-signal indicating the apparent completion of the memory reference operation to the processor issuing the operation.

15. The apparatus of claim 14 wherein the hierarchical switch comprises:

a plurality of input ports, each coupled to a respective node via a full-duplex, bi-directional clock forwarded link, for receiving memory reference operations issued by a processor of its respective node, the memory reference operations issued as commands by the processor;

a plurality of output ports, each coupled to the respective port via the bi-directional link, for forwarding the commands to at least one SMP node; and

a plurality of input buffers, each associated with a respective input port, for temporarily storing the received commands.

16. The apparatus of claim 15 wherein the hierarchical switch further comprises a plurality of multiplexer circuits having a plurality of inputs and an output, each of input of each multiplexer coupled to a respective input buffer of the hierarchical switch such that a command received from any of the input buffers may be forwarded in any order to any of the output ports.

17. The apparatus of claim 16 wherein the ordering point of the hierarchical switch comprises the input buffers and multiplexer circuits.

18. The apparatus of claim 17 wherein one of the commit-signal types generated by the ordering point of the hierarchical switch comprises a type 1 commit-signal corresponding to a global command issued by a processor to an address that is assigned to a SMP node other than the node on which the issuing processor resides.

19. The apparatus of claim 17 wherein one of the commit-signal types generated by the ordering point of the hierarchical switch comprises a type 2 commit-signal corresponding to a local command issued by a processor to an address that is assigned to the SMP node on which the issuing processor resides, the SMP node comprising a local ordering point for generating external memory reference probes to the address in response to the issued local command, the type 2 commit-signal characterized by generating the external probes and having outstanding probes.

20. The apparatus of claim 17 wherein one of the commit-signal types generated by the ordering point of the hierarchical switch comprises a type 3 commit-signal corresponding to a local command issued by a processor to an address that is assigned to the SMP node on which the issuing processor resides, the SMP node comprising a local ordering point for generating external memory reference probes to the address in response to the issued local command, the type 2 commit-signal characterized by having no generated external probes but having outstanding probes to the address.

21. Apparatus for reducing latency of inter-reference ordering between memory reference operations of a shared memory multiprocessor system having a consistency model, the system having a plurality of processors, each processor capable of issuing a memory reference operation to the system, the apparatus comprising:

a commit-signal structure generated by control logic of the multiprocessor system in response to each issued memory reference operation, the commit-signal generated substantially sooner than completion of the memory reference operation, the commit-signal indicating apparent completion of the memory reference operation rather than actual completion of the operation;

a memory barrier (MB) instruction inserted between sets of the memory reference instructions of a program executed by the issuing processor, the sets of memory reference instructions issued to the system as pre-MB memory reference operations and post-MB memory reference operations to impose inter-reference ordering in cooperation with said commit signal.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

The invention relates to multiprocessor systems and, more particularly, to the efficient ordering of memory reference operations issued by a processor of a multiprocessor system.

BACKGROUND OF THE INVENTION

Multiprocessing systems, such as symmetric multi-processors, provide a computer environment wherein software applications may operate on a plurality of processors using a single address space or shared memory abstraction. In a shared memory system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value; this frees the programmer to focus on program development, e.g., algorithms, rather than managing partitioned data sets and communicating values. Interprocessor synchronization is typically accomplished in a shared memory system between processors performing read and write operations to "synchronization variables" either before and after accesses to "data variables".

For instance, consider the case of a processor P1 updating a data structure and processor P2 reading the updated structure after synchronization. Typically, this is accomplished by P1 updating data values and subsequently setting a semaphore or flag variable to indicate to P2 that the data values have been updated. P2 checks the value of the flag variable and, if set, subsequently issues read operations (requests) to retrieve the new data values. Note the significance of the term "subsequently" used above; if P1 sets the flag before it completes the data updates or if P2 retrieves the data before it checks the value of the flag, synchronization is not achieved. The key is that each processor must individually impose an order on its memory references for such synchronization techniques to work. The order described above is referred to as a processor's inter-reference order. Commonly used synchronization techniques require that each processor be capable of imposing an inter-reference order on its issued memory reference operations.

______________________________________ P1 P2 Store Data, New-value L1: Load Flag Store Flag, 0 BNZ L1 Load Data ______________________________________

*The inter-reference order imposed by a processor is defined by its memory reference ordering model or, more commonly, its consistency model. The consistency model for a processor architecture specifies, in part, a means by which the inter-reference order is specified. Typically, the means is realized by inserting a special memory reference ordering instruction, such as a Memory Barrier (MB) or "fence", between sets of memory reference instructions. Alternatively, the means may be implicit in other opcodes, such as in "test-and-set". In addition, the model specifies the precise semantics (meaning) of the means. Two commonly used consistency models include sequential consistency and weak-ordering, although those skilled in the art will recognize that there are other models that may be employed, such as release consistency.

Sequential Consistency

In a sequentially consistent system, the order in which memory reference operations appear in an execution path of the program (herein referred to as the "I-stream order") is the inter-reference order. Additional instructions are not required to denote the order simply because each load or store instruction is considered ordered before its succeeding operation in the I-stream order.

Consider the program example below. The program performs as expected on a sequentially consistent system because the system imposes the necessary inter-reference order. That is, P1's first store instruction is ordered before P1's store-to-flag instruction. Similarly, P2's load flag instruction is ordered before P2's load data instruction. Thus, if the system imposes the correct inter-reference ordering and P2 retrieves the value 0 for the flag, P2 will also retrieve the new value for data.

Weak Ordering

In a weakly-ordered system, an order is imposed between selected sets of memory reference operations, while other operations are considered unordered. One or more MB instructions are used to indicate the required order. In the case of an MB instruction defined by the Alpha.RTM. 21264 processor instruction set, the MB denotes that all memory reference instructions above the MB (i.e., pre-MB instructions) are ordered before all reference instructions after the MB (i.e., post-MB instructions). However, no order is required between reference instructions that are not separated by an MB.

______________________________________ P1: P2: Store Data1, New-value1 L1: Load Flag Store Data2, New-value2 MB BNZ L1 Store Flag, 0 MB Load Data1 Load Data2 ______________________________________

In above example, the MB instruction implies that each of P1's two pre-MB store instructions are ordered before P1's store-to-flag instruction. However, there is no logical order required between the two pre-MB store instructions. Similarly, P2's two post-MB load instructions are ordered after the Load flag; however, there is no order required between the two post-MB loads. It can thus be appreciated that weak ordering reduces the constraints on logical ordering of memory references, thereby allowing a processor to gain higher performance by potentially executing the unordered sets concurrently.

The prior art includes other types of barriers as described in literature and as implemented on commercial processors. For example, a write-MB (WMB) instruction on an Alpha microprocessor requires only that pre-WMB store instructions be logically ordered before any post-WMB stores. In other words, the WMB instruction places no constraints at all on load instructions occurring before or after the WMB.

In order to increase performance, modern processors do not execute memory reference instructions one at a time. It is desirable that a processor keep a large number of memory references outstanding and issue, as well as complete, memory reference operations out-of-order. This is accomplished by viewing the consistency model as a "logical order", i.e., the order in which memory reference operations appear to happen, rather than the order in which those references are issued or completed. More precisely, a consistency model defines only a logical order on memory references; it allows for a variety of optimizations in implementation. It is thus desired to increase performance by reducing latency and allowing (on average) a large number of outstanding references, while preserving the logical order implied by the consistency model.

In prior systems, a memory barrier instruction is typically contingent upon "completion" of an operation. For example, when a source processor issues a read operation, the operation is considered complete when data is received at the source processor. When executing a store instruction, the source processor issues a memory reference operation to acquire exclusive ownership of the data; in response to the issued operation, system control logic generates "probes" to invalidate old copies of the data at other processors and to request forwarding of the data from the owner processor to the source processor. Here the operation completes only when all probes reach their destination processors and the data is received at the source processor.

Broadly stated, these prior systems rely on completion to impose inter-reference ordering. For instance, in a weakly-ordered system employing MB instructions, all pre-MB operations must be complete before the MB is passed and post-MB operations may be considered. Essentially, "completion" of an operation requires actual completion of all activity, including receipt of data and acknowledgments for probes, corresponding to the operation. Such an arrangement is inefficient and, in the context of inter-reference ordering, adversely affects latency.

Therefore, the present invention is directed to increasing the efficiency of a shared memory multiprocessor system by relaxing the completion requirement while preserving the consistency model. The invention is further directed to improving the performance of a shared memory system by reducing the latency associated with memory barriers.

SUMMARY OF THE INVENTION

The invention relates to a mechanism for reducing the latency of inter-reference ordering between sets of memory reference operations in a multiprocessor system having a shared memory. The mechanism generally comprises a commit-signal that is generated by control logic of the multiprocessor system in response to an issued memory reference operation. According to the invention, the commit-signal facilitates inter-reference ordering; moreover, the commit signal indicates the apparent completion of the memory reference operation, rather than actual completion of the operation. Notably, the apparent completion of an operation occurs substantially sooner than the actual completion of an operation, thereby improving performance of the multiprocessor system.

In the illustrative embodiment, inter-reference ordering may be imposed by a memory barrier (MB) instruction inserted between memory reference instructions of a program executed by a processor. Execution of these instructions within the processor may cause out-of-order issuance and completion of external memory reference operations due to operational latencies throughout the system. To ensure correct implementation of the consistency model, prior systems inhibit program execution past the MB instruction until actual completion of all pre-MB operations have been confirmed to the processor. According to the present invention, however, program execution may proceed past the MB instruction once the commit signals for the operations have been received by the processor.

As described herein, the multiprocessing system may comprise (i) a symmetric multiprocessing (SMP) node wherein the processor and shared memory entities are interconnected by a local switch or (ii) a SMP system wherein a plurality of nodes are interconnected by a hierarchical switch. Each processor preferably has a private cache for storing data and changes to the data as a result of the memory reference operations are reflected among the entities via the transmission of probe commands in accordance with a conventional cache coherence protocol. Notably, associated with the system is an ordering point. Specifically, in the SMP node, the ordering point is associated with the local switch. In the SMP system, the ordering point is associated with the hierarchical switch.

As an example of a SMP node with an ownership-based, write-invalidate cache coherence protocol, a requesting processor issues a memory reference operation to the system requesting particular data. Upon determining which entity is the owner of the data and which entities have valid copies of the data, the ordering point totally orders the memory reference operation with respect to the other issued references using, e.g., a conventional arbitration or prioritization policy. Thereafter, the ordering point generates probes to invalidate any copies of the data at appropriate processors and/or to request forwarding of the data from the owner processor to the requesting processor, as required by the memory operation. Significantly, the ordering point also generates the commit-signal at this time for transmission to the requesting processor. Probes and commit signals are generated atomically for transmission to the appropriate processors. The net result is that all memory operations appear totally ordered.

Ordering of the requested memory reference operation with respect to memory references issued by other processors of the system constitutes a commit-event for the requested operation. For the SMP node embodiment, the commit-event is the point at which the memory reference operation is ordered at the local switch, whereas for the SMP system the commit-event occurs when the memory reference operation is ordered at the hierarchical switch. In accordance with the present invention, the commit-signal is preferably transmitted to the requesting processor upon the occurrence of, or after, such a commit-event.

As a further example using an illustrative processor algorithm, issuance of each memory reference operation by the processor preferably increments a counter and receipt by the processor of each commit-signal responsive to the issued memory reference decrements the counter. When program execution reaches the MB instruction and the counter realizes a value of zero, the previously issued operations are considered committed and execution of the program may proceed past the MB.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numbers indicate identical or functionally similar elements:

FIG. 1 is a schematic block diagram of a first multiprocessor node embodiment comprising a plurality of processors coupled to an input/output processor and a memory by a local switch;

FIG. 2 is a schematic block diagram of the local switch comprising a plurality of ports coupled to the respective processors of FIG. 1;

FIG. 3 is a schematic diagram of an embodiment of a novel commit-signal implemented as a commit-signal packet in accordance with the invention;

FIG. 4 is a schematic block diagram of a second multiprocessing system embodiment comprising a plurality of multiprocessor nodes interconnected by a hierarchical switch;

FIG. 5 is a schematic block diagram of the hierarchical switch according to the present invention;

FIG. 6 is a schematic block diagram of an augmented multiprocessor node comprising a plurality of processors interconnected with a shared memory, an IOP and a global port interface via a local switch;

FIG. 7 illustrates an embodiment of a LoopComSig table in accordance with the present invention;

FIG. 8 is a schematic diagram of an incoming command packet modified with a multicast-vector in accordance with the present invention;

FIG. 9 is a schematic block diagram illustrating a total ordering property of an illustrative embodiment of the hierarchical switch;

FIG. 10 is a flowchart illustrating the sequence of steps for generating and issuing a type 0 commit-signal according to the present invention;

FIG. 11 is a flowchart illustrating the sequence of steps for generating and issuing a type 1 commit-signal according to the present invention;

FIG. 12 is a flowchart illustrating the sequence of steps for generating and issuing a type 2 commit-signal according to the present invention;

FIG. 13 is a flowchart illustrating the sequence of steps for generating and issuing a type 3 commit-signal according to the invention; and

FIG. 14 is a schematic top-level diagram of an alternate embodiment of the second multiprocessing system which may be advantageously used with the present invention.

DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS

As described herein, a symmetric multi-processing (SMP) system includes a number of SMP nodes interconnected by a hierarchical switch. Each SMP node thus functions as a building block in the SMP system. Below, the structure and operation of an SMP node embodiment that may be advantageously used with the present invention is first described, followed by a description of the SMP system embodiment.

SMP Node

FIG. 1 is a schematic block diagram of a first multiprocessing system embodiment, such as a small SMP node 100, comprising a plurality of processors (P) 102-108 coupled to an input/output (I/O) processor 130 and a memory 150 by a local switch 200. The memory 150 is preferably organized as a single address space that is shared by the processors and apportioned into a number of blocks, each of which may include, e.g., 64 bytes of data. The I/O processor, or IOP 130, controls the transfer of data between external devices (not shown) and the system via an I/O bus 140. Data is transferred between the components of the SMP node in the form of packets. As used herein, the term "system" refers to all components of the SMP node excluding the processors and IOP. In an embodiment of the invention, the I/O bus may operate according to the conventional Peripheral Computer Interconnect (PCI) protocol.

Each processor is a modern processor comprising a central processing unit (CPU), denoted 112-118, that preferably incorporates a traditional reduced instruction set computer (RISC) load/store architecture. In the illustrative embodiment described herein, the CPUs are Alpha.RTM. 21264 processor chips manufactured by Digital Equipment Corporation.RTM., although other types of processor chips may be advantageously used. The load/store instructions executed by the processors are issued to the system as memory reference, e.g., read and write, operations. Each operation may comprise a series of commands (or command packets) that are exchanged between the processors and the system. As described further herein, characteristics of modern processors include the ability to issue memory reference operations out-of-order, to have more than one memory reference outstanding at a time and to accommodate completion of the memory reference operations in arbitrary order.

In addition, each processor and IOP employs a private cache (denoted 122-128 and 132, respectively) for storing data determined likely to be accessed in the future. The caches are preferably organized as write-back caches apportioned into, e.g., 64-byte cache lines accessible by the processors; it should be noted, however, that other cache organizations, such as write-through caches, may be used in connection with the principles of the invention. It should be further noted that memory reference operations issued by the processors are preferably directed to a64-byte cache line granularity. Since the IOP 130 and processors 102-108 may update data in their private caches without updating shared memory 150, a cache coherence protocol is utilized to maintain consistency among the caches.

The cache coherence protocol of the illustrative embodiment is preferably a conventional write-invalidate, ownership-based protocol. "Write-Invalidate" implies that when a processor modifies a cache line, it invalidates stale copies in other processors' caches rather than updating them with the new value. The protocol is termed an "ownership protocol" because there is always an identifiable owner for a cache line, whether it is shared memory, one of the processors or the IOP entities of the system. The owner of the cache line is responsible for supplying the up-to-date value of the cache line when requested. A processor/IOP may own a cache line in one of two states: "exclusively" or "shared". If a processor has exclusive ownership of a cache line, it may update it without informing the system. Otherwise, it must inform the system and potentially invalidate copies in the other caches.

A shared data structure 160 is provided for capturing and maintaining status information corresponding to the states of data used by the system. In the illustrative embodiment, the shared data structure is configured as a conventional duplicate tag store (DTAG) 160 that cooperates with the individual caches of the system to define the coherence protocol states of the data in the system. The protocol states of the DTAG 160 are administered by a coherence controller 180, which may be implemented as a plurality of hardware registers and combinational logic configured to produce a sequential logic circuit, such as a state machine. It should be noted, however, that other configurations of the controller and shared data structure may be advantageously used herein.

The DTAG 160, coherence controller 180, IOP 130 and shared memory 150 are interconnected by a logical bus referred to an Arb bus 170. Memory reference operations issued by the processors are routed via the local switch 200 to the Arb bus 170. The order in which the actual memory reference commands appear on the Arb bus is the order in which processors perceive the results of those commands. In accordance with this embodiment

of the invention, though, the Arb bus 170 and the coherence controller 180 cooperate to provide an ordering point, as described herein.

The commands described herein are defined by the Alpha.RTM. memory system interface and may be classified into three types: requests, probes, and responses. Requests are commands that are issued by a processor when, as a result of executing a load or store instruction, it must obtain a copy of data. Requests are also used to gain exclusive ownership to a data item (cache line) from the system. Requests include Read (Rd) commands, Read/Modify (RdMod) commands, Change-to-Dirty (CTD) commands, Victim commands, and Evict commands, the latter of which specify removal of a cache line from a respective cache.

Probes are commands issued by the system to one or more processors requesting data and/or cache tag status updates. Probes include Forwarded Read (Frd) commands, Forwarded Read Modify (FRdMod) commands and Invalidate (Inval) commands. When a processor P issues a request to the system, the system may issue one or more probes (via probe packets) to other processors. For example if P requests a copy of a cache line (a Rd request), the system sends a probe to the owner processor (if any). If P requests exclusive ownership of a cache line (a CTD request), the system sends Inval probes to one or more processors having copies of the cache line. If P requests both a copy of the cache line as well as exclusive ownership of the cache line (a RdMod request) the system sends a FRd probe to a processor currently storing a dirty copy of a cache line of data. In response to the Frd probe, the dirty copy of the cache line is returned to the system. A FRdMod probe is also issued by the system to a processor storing a dirty copy of a cache line. In response to the FRdMod probe, the dirty cache line is returned to the system and the dirty copy stored in the cache is invalidated. An Inval probe may be issued by the system to a processor storing a copy of the cache line in its cache when the cache line is to be updated by another processor.

Responses are commands from the system to processors/IOPs which carry the data requested by the processor or an acknowledgment corresponding to a request. For Rd and RdMod requests, the response is a Fill and FillMod response, respectively, each of which carries the requested data. For a CTD request, the response is a CTD-Success (Ack) or CTD-Failure (Nack) response, indicating success or failure of the CTD, whereas for a Victim request, the response is a Victim-Release response.

FIG. 2 is a schematic block diagram of the local switch 200 comprising a plurality of ports 202-210, each of which is coupled to a respective processor (P1-P4) 102-108 and IOP 130 via a full-duplex, bi-directional clock forwarded data link. Each port includes a respective input queue 212-220 for receiving, e.g., a memory reference request issued by its processor and a respective output queue 222-230 for receiving, e.g., a memory reference probe issued by system control logic associated with the switch. An arbiter 240 arbitrates among the input queues to grant access to the Arb bus 170 where the requests are ordered into a memory reference request stream. In the illustrative embodiment, the arbiter selects the requests stored in the input queues for access to the bus in accordance with an arbitration policy, such as a conventional round-robin algorithm.

The following example illustrates the typical operation of multiprocessing system including switch 200. A Rd request for data item x is received at the switch 200 from P1 and loaded into input queue 212. The arbiter 240 selects the request in accordance with the arbitration algorithm. Upon gaining access to the Arb bus 170, the selected request is routed to the ordering point 250 wherein the states of the corresponding cache lines are interrogated in the DTAG 160. Specifically, the coherence controller 180 examines the DTAG to determine which entity of the system "owns" the cache line and which entities have copies of the line. If processor P3 is the owner of the cache line x and P4 has a copy, the coherence controller generates the necessary probes (e.g., a Fill x and Inval x) and forwards them to the output queues 226 and 228 for transmission to the processors.

Because of operational latencies through the switch and data paths of the system, memory reference requests issued by P1 may complete out-of-order. In some cases, out-of-order completion may affect the consistency of data in the system, particularly for updates to a cache line. Memory consistency models provide formal specifications of how such updates become visible to the entities of the multiprocessor system. In the illustrative embodiment of the present invention, a weak ordering consistency model is described, although it will be apparent to those skilled in the art that other consistency models may be used.

In a weakly-ordered system, inter-reference ordering is typically imposed by a memory barrier (MB) instruction inserted between memory reference instructions of a program executed by a processor. The MB instruction separates and groups those instructions of a program that need ordering from the rest of the instructions. The semantics of weak ordering mandate that all pre-MB memory reference operations are logically ordered before all post-MB references. For example, the following program instructions are executed by P1 and P2:

______________________________________ P1 P2 ______________________________________ St x Ld flag, 0 St y MB St z Rd x MB Rd y St flag,0 Rd z ______________________________________

In the case of P1's program, it is desired to store (via a write operation) all of the data items x, y and z before modifying the value of the flag; the programmer indicates this intention by placing the MB instruction after St z. According to the weak-ordering semantics, the programmer doesn't care about the order in which the pre-MB store instructions issue as memory reference operations, nor does she care about the order in which the post-MB references appear to the system. Essentially, the programmer only cares that every pre-MB store instruction appears before every post-MB instruction. At P2, a load (via a read operation) flag is performed to test for the value 0. Testing of the flag is ordered with respect to acquiring the data items x, y and z as indicated by the MB instruction. Again, it is not necessary to impose order on the individual post-MB instructions.

To ensure correct implementation of the consistency model, prior systems inhibit program execution past the MB instruction until actual completion of all pre-MB operations have been confirmed to the processor. Maintaining inter-reference order from all pre-MB operations to all post-MB operations typically requires acknowledgment responses and/or return data to signal completion of the pre-MB operations. The acknowledgment responses may be gathered and sent to the processor issuing the operations. The pre-MB operations are considered completed only after all responses and data are received by the requesting processor. Thus, referring to the example above with respect to operation of a prior multiprocessing system, once P1 has received the data and acknowledgment responses (e.g., an Inval acknowledgment) corresponding to an operation, the operation is considered complete.

Since each memory reference operation may consist of a number of commands, the latency of inter-reference ordering is a function of the extent to which each command must complete before the reference is considered ordered. The present invention relates to a mechanism for reducing the latency of inter-reference ordering between sets of memory reference operations in a multiprocessor system having a shared memory that is distributed among a plurality of processors configured to issue and complete those operations out-of-order.

The mechanism generally comprises a commit-signal that is generated by the ordering point 250 of the multiprocessor system in response to a memory reference operation issued by a requesting processor for particular data. FIG. 3 is a schematic diagram of a commit-signal that is preferably implemented as a commit-signal packet structure 300 characterized by the assertion of a single, commit-signal ("C") bit 310 to processor. It will be apparent to those skilled in the art that the commit-signal may be manifested in a variety of forms, including a discrete signal on a wire or, in another embodiment, a packet identifying the operation corresponding to the commit-signal. Program execution may proceed past the MB instruction once commit-signals for all pre-MB operations have been received by the processor, thereby increasing the performance of the system. The commit-signal facilitates inter-reference ordering; in addition, the novel signal indicates the apparent completion of the memory reference operation, rather than actual completion of the operation. Notably, the apparent completion of an operation occurs substantially sooner than the actual completion of an operation, thereby improving performance of the multiprocessor system.

Referring again to the above example including the program instructions executed by P1, generation of a commit-signal by the ordering point 250 in response to each RdMod request for data items x, y and z (corresponding to each store instruction for those data items) issued by P1 occurs upon successful arbitration and access to the Arb bus 170, and total ordering of those requests with respect to all memory reference requests appearing on the bus. Total ordering of each memory reference request constitutes a commit-event for the requested operation. According to the invention, the commit-signal 300 is preferably transmitted to P1 upon the occurrence of, or after, the commit-event.

The ordering point 250 determines the state of the data items throughout the system and generates probes (i.e., probe packets) to invalidate copies of the data and to request forwarding of the data from the owner to the requesting processor P1. For example, the ordering point may generate FRdMod probe to P3 (i.e., the owner) and Inval probes to P2 and P4. The ordering point also generates the commit-signal at this time for transmission to the P1. The commit-signal and probe packets are loaded into the output queues and forwarded to the respective processors in single, first-in, first-out (FIFO) order; in the case of P1, the commit-signal is loaded into queue 222 and forwarded to P1 along with any other probes pending in the queue. As an optimization, the commit-signal 300 may be "piggy backed" on top of one of these probe packets; in the illustrative embodiment of such an optimization, the C-bit of a probe packet may be asserted to indicate that a commit-signal is