WikiPatents - Community Patent Review
Multiprocessing computer system employing local and global address spaces and COMA and NUMA access modes    
United States Patent 5887138
Link to this page: http://www.wikipatents.com/5887138.html
Inventor(s): Hagersten; Erik E. (Palo Alto, CA); Loewenstein; Paul N. (Palo Alto, CA)
Abstract: A multiprocessing computer system employs local and global address spaces and Non-Uniform Memory Architecture (NUMA) and Cache-Only Memory Architecture (COMA) access modes. The multiprocessing computer architecture employs a plurality of processing nodes. When a processing node initiates a memory transaction, the node determines whether the address of the memory transaction is a global address or a local physical address. If the address is a global address, a NUMA coherency request is initiated. Alternatively, if the address is a local physical address, a COMA coherency request is initiated. The nodes additionally include local physical address to global address translation units. The local physical address to global address translation units are configured to translate a local physical address to a corresponding global address prior to initiating a COMA coherency request.
Inventor     Hagersten; Erik E. (Palo Alto, CA); Loewenstein; Paul N. (Palo Alto, CA)
Owner/Assignee     Sun Microsystems, Inc. (Mountain View, CA)
Publication Date     March 23, 1999
Application Number     08/675,635
Filing Date     July 1, 1996
US Classification     709/215 709/214 711/141 711/148 712/10 712/28
Int'l Classification     G06F 013/42
Examiner     Lee; Thomas C.
Assistant Examiner     Patel; Gautam R.
Attorney/Law Firm     Conley, Rose & Tayon, PC; Kivlin; B. Noel
USPTO Field of Search     395/800.28 395/800.27 395/200.43 395/200.45 395/200.46 395/200.42 395/200.44 395/475 395/480 395/468 395/800.1 711/148 711/141
U.S. References

Reference   Inventor    U.S. Class   Date
5710907     Hagersten   711/148      Jan. 1998
5692149     Lee         711/133      Nov. 1997
5617537     Yamada                   Apr. 1997
5613071     Rankin      707/10       Mar. 1997
5535116     Gupta       700/5        Jul. 1996
5428803     Chen        712/6        Jun. 1995
Claims


What is claimed is:

1. A multiprocessing computer system comprising:

a first processing node including a first processor, a first memory, and a first system interface; and

a second processing node coupled to said first processing node, said second processing node including a second memory, wherein said first memory and said second memory comprise a distributed shared memory system;

wherein said first processor is configured to initiate a first transaction having a first address, wherein a first portion of said first address is indicative of a location of a coherency unit corresponding to said first address within said distributed shared memory system, wherein said coherency unit is a number of contiguous bytes of memory,

wherein a COMA mode is selected for said first transaction if said first portion of said first address indicates that said coherency unit is stored in said first memory, and wherein a NUMA mode is selected for said first transaction if said first portion of said first address indicates that said coherency unit is stored in said second memory, and

wherein said first system interface is configured to initiate a NUMA coherency request responsive to said first transaction and said NUMA mode being selected for said first transaction, and wherein said first system interface is configured to initiate a COMA coherency request responsive to said first transaction, said COMA mode being selected and a corresponding coherency unit stored within said first memory is a copy of a third coherency unit stored within said second memory, and wherein said NUMA coherency request causes said first processor to complete said first transaction upon data stored within said second memory, wherein said COMA coherency request causes said first processor to complete said first transaction upon data stored within said first memory.

2. The multiprocessing computer system as recited in claim 1 wherein said first system interface comprises a translation unit configured to translate said first address from a local physical address specifying said first coherency unit to a corresponding global address specifying said third coherency unit prior to initiating said COMA coherency request.

3. The multiprocessing computer system as recited in claim 2 wherein said first system interface further comprises a storage for storing a plurality of coherency states including a coherency state corresponding to said local physical address.

4. The multiprocessing computer system as recited in claim 3 wherein said first system interface is configured to determine if said COMA coherency request is to be generated by examining said coherency state.

5. The multiprocessing computer system as recited in claim 3 wherein said plurality of coherency states includes a coherency state for each coherency unit stored in said first memory.

6. The multiprocessing computer system as recited in claim 1 wherein said first processing node comprises a first plurality of processors including said first processor.

7. The multiprocessing computer system as recited in claim 6 wherein said first processing node comprises a symmetric multiprocessing node.

8. The multiprocessing computer system as recited in claim 7 wherein said first plurality of processors are coupled to provide transactions upon a shared bus within said first processing node, said first system interface also coupled to said shared bus.

9. The multiprocessing computer system as recited in claim 8 wherein each of said first plurality of processors is coupled to said shared bus via a respective one of a first plurality of external caches.

10. The multiprocessing computer system as recited in claim 1 further comprising a third processing node coupled to said first processing node and said second processing node, wherein said second processing node is configured to generate a coherency demand for said third processing node in response to a coherency request from said first processing node if said third processing node is storing a second copy of said third coherency unit, and wherein said third processing node is configured to transmit a coherency reply to said first processing node in response to said coherency demand.

11. The multiprocessing computer system as recited in claim 10 wherein said coherency reply includes said third coherency unit, and wherein said first system interface is configured to provide said third coherency unit to said first processor, and wherein said first system interface is further configured to store said third coherency unit as said first coherency unit in said first memory if said coherency request is said COMA coherency request.

12. A system interface for a processing node in a multiprocessing system comprising:

a system interface logic unit coupled to receive a transaction initiated by a processor within said processing node, said system interface logic unit is configured to generate a COMA coherency request in response to said transaction if a portion of an address corresponding to said transaction indicates a COMA mode for said transaction and a corresponding coherency unit stored within said first memory is a copy of a third coherency unit stored within said second memory, wherein said COMA coherency request causes said processor to complete said transaction upon data stored within a memory within said processing node, and wherein said system interface logic unit is configured to generate a NUMA coherency request in response to said transaction if said portion of said address indicates a NUMA mode for said transaction, wherein said NUMA coherency request causes said processor to complete said transaction upon data stored within a memory in a second processing node; and

a translation unit coupled to said system interface logic unit, wherein said translation unit is configured to translate said address to a corresponding global address, said system interface logic unit is configured to receive said corresponding global address from said translation unit prior to initiating said COMA coherency request.

13. The system interface as recited in claim 12 wherein said system interface logic unit is configured to use said corresponding global address as an address for said COMA coherency request.

14. The system interface as recited in claim 12 further comprising a storage configured to store a coherency state corresponding to a coherency unit, wherein said coherency unit is a number of contiguous bytes of memory, and wherein said coherency unit corresponds to said address of said transaction if said portion of said address indicates said COMA mode.

15. The system interface as recited in claim 14 further comprising a transaction filter coupled to said storage and said system interface logic unit, wherein said transaction filter is configured to determine if an access right represented by said coherency state is sufficient for said transaction to complete within said processing node, and wherein, if said access right is insufficient for said transaction to complete within said processing node, said transaction filter is configured to convey said transaction to said system interface logic unit, and wherein, if said access right is sufficient for said transaction to complete within said processing node, said transaction filter is configured to inhibit conveyance of said transaction to said system interface logic unit.

16. A method for operating a multiprocessing computer system including a first processing node comprising a first processor and a first memory, said multiprocessing computer system further including a second processing node, the method comprising:

initiating a transaction having an address corresponding to a coherency unit, said initiating performed by said first processor, wherein said coherency unit is a number of contiguous bytes of memory;

generating a COMA coherency request if a portion of said address indicates a COMA mode for said transaction and a corresponding coherency unit stored within said first memory is a copy of a different coherency unit in a second memory within said second processing node, wherein said COMA coherency request causes said first processor to complete said transaction upon data stored within said first memory; and

generating a NUMA coherency request if said portion of said address indicates a NUMA mode for said transaction, wherein said NUMA coherency request causes said first processor to complete said transaction upon data stored within said second memory.

17. The method as recited in claim 16 further comprising receiving a coherency reply in said first processing node, said coherency reply including said coherency unit, and providing said coherency unit to said first processor.

18. The method as recited in claim 17 further comprising storing said coherency unit in said first memory, if said coherency reply is responsive to said COMA coherency request.

19. The method as recited in claim 18 further comprising inhibiting storage of said coherency unit in said first memory, if said coherency reply is responsive to said NUMA request.

20. The method as recited in claim 16 wherein said generating a COMA coherency request is performed if an access right stored in said first processing node is insufficient for said transaction to complete within said first processing node.
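
The transaction filter recited in claim 15, and the condition recited in claim 20, may be illustrated by the following minimal sketch. The access right names, the transaction kinds, the `REQUIRED` table, and the function `filter_transaction` are hypothetical and chosen only for illustration; the claims state only that a transaction is conveyed to the system interface logic unit when the stored access right is insufficient, and that conveyance is inhibited otherwise.

```python
from enum import IntEnum


class AccessRight(IntEnum):
    """Hypothetical access rights, ordered from weakest to strongest."""
    NONE = 0
    READ = 1
    WRITE = 2


# Hypothetical mapping from transaction kind to the access right it needs.
REQUIRED = {"read": AccessRight.READ, "write": AccessRight.WRITE}


def filter_transaction(kind, stored_right):
    """Return True if the transaction must be conveyed to the system
    interface logic unit (stored access right insufficient), or False if
    it may complete within the node (conveyance inhibited)."""
    return stored_right < REQUIRED[kind]
```

Under these assumptions, a write to a coherency unit held with only read access is conveyed onward, while a read of the same unit completes within the node.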
Description


CROSS REFERENCE TO RELATED PATENT APPLICATIONS

This patent application is related to the following copending, commonly assigned patent applications, the disclosures of which are incorporated herein by reference in their entirety:

1. "Extending The Coherence Domain Beyond A Computer System Bus" by Hagersten et al., Ser. No. 08/673,059, now U.S. Pat. No. 5,829,033, filed concurrently herewith.

2. "Method And Apparatus Optimizing Global Data Replies In A Computer System" by Hagersten, Ser. No. 08/675,284, filed concurrently herewith.

3. "Method And Apparatus Providing Short Latency Round-Robin Arbitration For Access To A Shared Resource" by Hagersten et al., Ser. No. 08/675,286, filed concurrently herewith.

4. "Implementing Snooping On A Split-Transaction Computer System Bus" by Singhal et al., Ser. No. 08/673,038, filed concurrently herewith.

5. "Split Transaction Snooping Bus Protocol" by Singhal et al., Ser. No. 08/673,967, filed concurrently herewith.

6. "Interconnection Subsystem For A Multiprocessor Computer System With A Small Number Of Processors Using A Switching Arrangement Of Limited Degree" by Heller et al., Ser. No. 08/675,629, filed concurrently herewith.

7. "System And Method For Performing Deadlock Free Message Transfer In Cyclic Multi-Hop Digital Computer Network" by Wade et al., Ser. No. 08/674,277, filed concurrently herewith.

8. "Synchronization System And Method For Plesiochronous Signaling" by Cassiday et al., Ser. No. 08/674,316, now U.S. Pat. No. 5,799,175, filed concurrently herewith.

9. "Methods And Apparatus For A Coherence Transformer For Connecting Computer System Coherence Domains" by Hagersten et al., Ser. No. 08/677,015, filed concurrently herewith.

10. "Methods And Apparatus For A Coherence Transformer With Limited Memory For Connecting Computer System Coherence Domains" by Hagersten et al., Ser. No. 08/677,014, now U.S. Pat. No. 5,829,034, filed concurrently herewith.

11. "Methods And Apparatus For Sharing Stored Data Objects In A Computer System" by Hagersten et al., Ser. No. 08/673,130, filed concurrently herewith.

12. "Methods And Apparatus For A Directory-Less Memory Access Protocol In A Distributed Shared Memory Computer System" by Hagersten et al., Ser. No. 08/671,303, filed concurrently herewith.

13. "Hybrid Memory Access Protocol In A Distributed Shared Memory Computer System" by Hagersten et al., Ser. No. 08/673,957, filed concurrently herewith.

14. "Methods And Apparatus For Substantially Memory-Less Coherence Transformer For Connecting Computer System Coherence Domains" by Hagersten et al., Ser. No. 08/677,012, filed concurrently herewith.

15. "A Multiprocessing System Including An Enhanced Blocking Mechanism For Read To Share Transactions In A NUMA Mode" by Hagersten, Ser. No. 08/674,271, filed concurrently herewith.

16. "Encoding Method For Directory State In Cache Coherent Distributed Shared Memory Systems" by Guzovskiy et al., Ser. No. 08/672,946, now U.S. Pat. No. 5,752,258, filed concurrently herewith.

17. "Software Use Of Address Translation Mechanism" by Nesheim et al., Ser. No. 08/673,043, filed concurrently herewith.

18. "Directory-Based, Shared-Memory, Scaleable Multiprocessor Computer System Having Deadlock-free Transaction Flow Sans Flow Control Protocol" by Loewenstein et al., Ser. No. 08/674,358, filed concurrently herewith.

19. "Maintaining A Sequential Stored Order (SSO) In A Non-SSO Machine" by Nesheim, Ser. No. 08/673,049, filed concurrently herewith.

20. "Node To Node Interrupt Mechanism In A Multiprocessor System" by Wong-Chan, Ser. No. 08/672,947, filed concurrently herewith.

21. "Deterministic Distributed Multicache Coherence Protocol" by Hagersten et al., filed Apr. 8, 1996, Ser. No. 08/630,703.

22. "A Hybrid NUMA Coma Caching System And Methods For Selecting Between The Caching Modes" by Hagersten et al., filed Dec. 22, 1995, Ser. No. 08/577,283, now U.S. Pat. No. 5,710,907.

23. "A Hybrid NUMA Coma Caching System And Methods For Selecting Between The Caching Modes" by Wood et al., filed Dec. 22, 1995, Ser. No. 08/575,787.

24. "Flushing Of Cache Memory In A Computer System" by Hagersten et al., Ser. No. 08/673,881, filed concurrently herewith.

25. "Efficient Allocation Of Cache Memory Space In A Computer System" by Hagersten et al., Ser. No. 08/675,306, filed concurrently herewith.

26. "Efficient Selection Of Memory Storage Modes In A Computer System" by Hagersten et al., Ser. No. 08/674,029, now U.S. Pat. No. 5,802,563, filed concurrently herewith.

27. "Skip-level Write-through In A Multi-level Memory Of A Computer System" by Hagersten et al., Ser. No. 08/674,560, filed concurrently herewith.

28. "A Multiprocessing System Configured to Perform Efficient Write Operations" by Hagersten, Ser. No. 08/675,634, now U.S. Pat. No. 5,749,095, filed concurrently herewith.

29. "A Multiprocessing System Configured to Perform Efficient Block Copy Operations" by Hagersten, Ser. No. 08/674,269, filed concurrently herewith.

30. "A Multiprocessing System Including An Apparatus For Optimizing Spin-Lock Operations" by Hagersten, Ser. No. 08/674,272, filed concurrently herewith.

31. "A Multiprocessing System Configured to Detect and Efficiently Provide for Migratory Data Access Patterns" by Hagersten et al., Ser. No. 08/674,330, now U.S. Pat. No. 5,734,922, filed concurrently herewith.

32. "A Multiprocessing System Configured to Store Coherency State within Multiple Subnodes of a Processing Node" by Hagersten, Ser. No. 08/674,274, filed concurrently herewith.

33. "A Multiprocessing System Configured to Perform Prefetching Operations" by Hagersten et al., Ser. No. 08/674,327, filed concurrently herewith.

34. "A Multiprocessing System Configured to Perform Synchronization Operations" by Hagersten et al., Ser. No. 08/674,328, filed concurrently herewith.

35. "A Multiprocessing System Having Coherency-Related Error Logging Capabilities" by Hagersten et al., Ser. No. 08/674,276, filed concurrently herewith.

36. "Multiprocessing System Employing A Three-Hop Communication Protocol" by Hagersten, Ser. No. 08/674,270, filed concurrently herewith.

37. "A Multiprocessing Computer System Employing Local and Global Address Spaces and Multiple Access Modes" by Hagersten, Ser. No. 08/675,635, filed concurrently herewith.

38. "Multiprocessing System Employing A Coherency Protocol Including A Reply Count" by Hagersten et al., Ser. No. 08/674,314, filed concurrently herewith.

BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to the field of multiprocessor computer systems and, more particularly, to communication protocols employed within multiprocessor computer systems having distributed shared memory architectures.

2. Description of the Relevant Art

Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.

A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).

Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or "snooped") against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
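
The invalidation variant of snooping described above may be sketched as follows. This is a simplified illustration only: the `Cache` and `Bus` classes are hypothetical, whole values stand in for cache lines, and a write-through policy is assumed so that main memory always holds the updated copy.

```python
class Cache:
    """Hypothetical cache holding only valid lines (address -> value)."""

    def __init__(self):
        self.lines = {}

    def snoop_write(self, addr):
        # Another agent wrote this address: invalidate our copy, if any.
        self.lines.pop(addr, None)


class Bus:
    """Hypothetical shared bus on which every write is snooped."""

    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def read(self, cache, addr):
        if addr not in cache.lines:
            # Miss: transfer the current copy from main memory.
            cache.lines[addr] = self.memory[addr]
        return cache.lines[addr]

    def write(self, cache, addr, value):
        # Every other cache snoops the write and invalidates its copy.
        for other in self.caches:
            if other is not cache:
                other.snoop_write(addr)
        cache.lines[addr] = value
        self.memory[addr] = value  # write-through (assumption of this sketch)
```

After one cache writes an address, a second cache that previously held the old value misses on its next read and receives the updated value from main memory.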

Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g. a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed the available bus bandwidth.
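
The saturation condition can be made concrete with a small arithmetic sketch. The function names and the bandwidth figures below are hypothetical and are not taken from any particular bus; the sketch also assumes that every processor places the same demand on the bus.

```python
def bus_demand(per_processor_bytes_per_s, processors):
    """Aggregate bandwidth the processors request from the shared bus."""
    return per_processor_bytes_per_s * processors


def is_saturated(peak_bytes_per_s, per_processor_bytes_per_s, processors):
    """True when aggregate demand exceeds the peak bus bandwidth."""
    return bus_demand(per_processor_bytes_per_s, processors) > peak_bytes_per_s


PEAK = 1_000_000_000    # hypothetical bus: 1 GB/s peak
PER_CPU = 300_000_000   # hypothetical processor: 300 MB/s demand
```

With these illustrative figures, three processors fit within the peak bandwidth while a fourth pushes the aggregate demand past it.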

Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may decrease as more processors are added.

These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.

Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.

Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.

The coherence between nodes in a distributed shared memory system is often kept using a distributed implementation of coherence protocols. Many such coherence protocols employ four-hop replies wherein a request is first sent to a home node from a requesting node. The home node responsively sends read/invalidate demands to slave nodes holding cached copies of the data. The slaves reply back to the home node according to the demands. The four-hop reply protocol is completed when the home node replies back to the requesting node.
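
The message sequence of such a four-hop protocol may be sketched as follows. The function `four_hop_messages` and the tuple layout (hop number, source, destination, message kind) are hypothetical and serve only to make the ordering of the hops explicit.

```python
def four_hop_messages(requester, home, slaves):
    """List the messages of one four-hop coherency transaction as
    (hop, source, destination, kind) tuples."""
    msgs = [(1, requester, home, "request")]
    # Hop 2: the home node sends demands to every slave holding a copy.
    msgs += [(2, home, s, "demand") for s in slaves]
    # Hop 3: each slave replies back to the home node.
    msgs += [(3, s, home, "reply") for s in slaves]
    # Hop 4: the home node replies to the requesting node.
    msgs.append((4, home, requester, "reply"))
    return msgs
```

In this sketch the requesting node learns of completion only from the final hop-4 message, after the home node has collected every slave reply.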

Unfortunately, the communication patterns generated when data must be accessed from a remote node cause a significant amount of network traffic. In addition, after all slave nodes have replied to the home node, the requesting node must wait until the home node sends a completion indication back to the requesting node before the requesting node can treat the transaction as completed. This may add to the overall latency of the critical path associated with the coherency transaction.

A multiprocessor computer system having a distributed shared memory system is thus desirable wherein network traffic is reduced and wherein the latency in replying to a requesting node is reduced.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a multiprocessor computer system employing local and global address spaces and multiple access modes in accordance with the present invention. In one embodiment, the multiprocessing computer system comprises a first processing node including a first processor, a first memory, and a first system interface, and a second processing node coupled to the first processing node. The second processing node may include a second memory. The first memory and the second memory collectively comprise a distributed shared memory system. The first processor may be configured to initiate a first transaction having a first address, and the first address may contain a first value indicative of the location, within the distributed shared memory system, of the coherency unit to which the first address corresponds. The first value may also correspond to a coherency unit stored in the first memory. In this embodiment, the first address is a local physical address if the first value identifies the location as within the first memory of the first processing node, while the first address is a global address if the first value identifies a location that is not local to the first processing node, that is, a location within the second memory. Further, in this embodiment, the first system interface may be configured to initiate a NUMA coherency request if the first address is a global address, and a COMA coherency request if the first address is a local physical address and a corresponding coherency unit stored within the first memory is a copy of a third coherency unit stored within the second memory. The NUMA coherency request causes the first processor to complete transactions upon data stored within the second processing node. The COMA coherency request causes the first processor to complete transactions upon data stored local to the first processing node.

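
The address-based selection between the two access modes may be illustrated by the following minimal sketch. The bit position `NODE_SHIFT`, the dictionary `lpa_to_ga` standing in for the translation unit, and both function names are hypothetical. The sketch also omits the further conditions under which a COMA coherency request is actually issued, such as the state of the locally stored copy.

```python
NODE_SHIFT = 32  # hypothetical: address bits above this hold the node field


def select_mode(address, local_node):
    """Select COMA for a local physical address and NUMA for a global
    address, based on the node field of the address."""
    node_field = address >> NODE_SHIFT
    return "COMA" if node_field == local_node else "NUMA"


def coherency_request(address, local_node, lpa_to_ga):
    """Return (mode, address used for the coherency request)."""
    mode = select_mode(address, local_node)
    if mode == "COMA":
        # Translate the local physical address to its global address
        # before the COMA coherency request is initiated.
        return ("COMA", lpa_to_ga[address])
    # A global address is used unchanged for a NUMA coherency request.
    return ("NUMA", address)
```

For example, with node 1 as the local node, an address whose node field is 1 is treated as a local physical address and translated before the request, while an address whose node field is 2 is used directly.
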
In one embodiment, when a request is sent by a requesting node to a home node, the home node sends read and/or invalidate demands to any slave nodes holding cached copies of the requested data. The demands from the home node to the slave nodes may each advantageously include a value indicative of the number of replies the requesting agent should expect to receive. The slaves reply back to the requesting node with either data or an acknowledge. Each reply may further include the number of replies the requester should expect. Upon receiving all expected replies, the requesting node may treat the transaction as completed and proceed with subsequent processing. In this manner, all communications may require at most a three-hop communication on the critical path of the cache coherence protocol. Accordingly, the overall network traffic as a result of the cache coherence protocol may be advantageously reduced. Furthermore, the latency of the critical path for a requesting node to complete a transaction may be reduced.
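
The reply-count mechanism may be sketched as follows. The `Requester` class, the function `three_hop_messages`, and the tuple layout (hop, source, destination, kind, reply count) are hypothetical. The sketch covers only the case in which at least one slave node holds a cached copy; the case in which the home node replies directly is not modeled.

```python
class Requester:
    """Hypothetical requesting node that counts replies."""

    def __init__(self):
        self.expected = None  # learned from the first reply received
        self.received = 0

    def on_reply(self, reply_count):
        # Every reply carries the total number of replies to expect.
        self.expected = reply_count
        self.received += 1
        return self.done

    @property
    def done(self):
        return self.expected is not None and self.received == self.expected


def three_hop_messages(requester, home, slaves):
    """List messages as (hop, source, destination, kind, reply_count)."""
    n = len(slaves)
    msgs = [(1, requester, home, "request", None)]
    # Hop 2: each demand tells the slave how many replies are expected.
    msgs += [(2, home, s, "demand", n) for s in slaves]
    # Hop 3: slaves reply directly to the requester, echoing the count.
    msgs += [(3, s, requester, "reply", n) for s in slaves]
    return msgs
```

In this sketch the requesting node treats the transaction as completed as soon as the last hop-3 reply arrives, without waiting for a further message from the home node.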

In one implementation, after the requesting node has received all expected replies, the requesting node may send a completion message back to the home. The home node may then remove a "block" placed upon the coherency unit of the completed transaction.

The requesting node may further or alternatively send data back to the home node to achieve memory reflection after receiving data from a slave node. Furthermore, in cases where the home node contains the requested data in an appropriate state, e.g., state shared for a read-to-own request, the home node does not send any demands to other nodes. Instead, the home node replies directly to the requesting node.

A system and method in accordance with the present invention may advantageously allow for an efficient and simple implementation of a global coherency protocol in a multiprocessing computer system. The protocol allows for an owner-based protocol wherein several dirty cached copies may reside in differing nodes with one of them in the owner state and a copy in the home node which is stale.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:

FIG. 1 is a block diagram of a multiprocessor computer system.

FIG. 1A is a conceptualized block diagram depicting a non-uniform memory architecture supported by one embodiment of the computer system shown in FIG. 1.

FIG. 1B is a conceptualized block diagram depicting a cache-only memory architecture supported by one embodiment of the computer system shown in FIG. 1.

FIG. 2 is a block diagram of one embodiment of a symmetric multiprocessing node depicted in FIG. 1.

FIG. 2A is an exemplary directory entry stored in one embodiment of a directory depicted in FIG. 2.

FIG. 3 is a block diagram of one embodiment of a system interface shown in FIG. 1.

FIG. 4 is a diagram depicting activities performed in response to a typical coherency operation between a request agent, a home agent, and a slave agent.

FIG. 5A is a diagram of an exemplary coherency operation performed in response to a read to own request from a processor.

FIG. 5B is a diagram depicting coherency activity in response to a read to own request when a slave agent is the current owner of the coherency unit and other slave agents have shared copies of the coherency unit.

FIG. 5C is a diagram that depicts coherency activity when a request agent has a shared copy and sends a read to own request to a home agent.

FIG. 5D is a diagram depicting coherency activity in response to a read to share request when a slave is the owner of a coherency unit.

FIG. 6 is a flowchart depicting an exemplary state machine for one embodiment of a request agent shown in FIG. 3.

FIG. 7 is a flowchart depicting an exemplary state machine for one embodiment of a home agent shown in FIG. 3.

FIG. 8 is a flowchart depicting an exemplary state machine for one embodiment of a slave agent shown in FIG. 3.

FIG. 9 is a table listing request types according to one embodiment of the system interface.

FIG. 10 is a table listing demand types according to one embodiment of the system interface.

FIG. 11 is a table listing reply types according to one embodiment of the system interface.

FIG. 12 is a table listing completion types according to one embodiment of the system interface.

FIG. 13 is a table describing coherency operations in response to various operations performed by a processor, according to one embodiment of the system interface.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION OF THE INVENTION

Turning now to FIG. 1, a block diagram of one embodiment of a multiprocessing computer system 10 is shown. Computer system 10 includes multiple SMP nodes 12A-12D interconnected by a point-to-point network 14. Elements referred to herein with a particular reference number followed by a letter will be collectively referred to by the reference number alone. For example, SMP nodes 12A-12D will be collectively referred to as SMP nodes 12. In the embodiment shown, each SMP node 12 includes multiple processors, external caches, an SMP bus, a memory, and a system interface. For example, SMP node 12A is configured with multiple processors including processors 16A-16B. The processors 16 are connected to external caches 18, which are further coupled to an SMP bus 20. Additionally, a memory 22 and a system interface 24 are coupled to SMP bus 20. Still further, one or more input/output (I/O) interfaces 26 may be coupled to SMP bus 20. I/O interfaces 26 are used to interface to peripheral devices such as serial and parallel ports, disk drives, modems, printers, etc. Other SMP nodes 12B-12D may be configured similarly.

Generally speaking, for any given transaction a particular SMP node 12 may serve as a requesting node, a home node, or a slave node. When a request is sent by a requesting node to a home node, the home node sends read and/or invalidate demands to any slave nodes holding cached copies of the requested data. The demands from the home node to the slave nodes advantageously include a value indicative of the number of replies the requesting agent should expect to receive. The slaves reply back to the requesting node with either data or an acknowledge. Each reply may further include the number of replies the requester should expect. Upon receiving all expected replies, the requesting node may treat the transaction as completed and proceed with subsequent processing. In this manner, all communications may require at most a three-hop communication on the critical path of the cache coherence protocol. Accordingly, the overall network traffic as a result of the cache coherence protocol may be advantageously reduced. Furthermore, the latency of the critical path for a requesting node to complete a transaction may be reduced.
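The reply-counting behavior described above may be illustrated with a minimal sketch. The sketch is not taken from the patent; the names Reply, RequestTracker, and home_demands are hypothetical, and the sketch assumes that every reply carries the same expected count and that at most one reply carries data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reply:
    """A reply from a slave node to the requesting node (hypothetical type)."""
    sender: str
    expected_replies: int          # count carried in every reply
    data: Optional[bytes] = None   # None models a plain acknowledge


@dataclass
class RequestTracker:
    """Requesting-node bookkeeping for one outstanding transaction."""
    expected: Optional[int] = None
    received: int = 0
    data: Optional[bytes] = None

    def on_reply(self, reply: Reply) -> bool:
        """Record one reply; return True once all expected replies arrived."""
        if self.expected is None:
            # The first reply tells the requester how many replies to await.
            self.expected = reply.expected_replies
        self.received += 1
        if reply.data is not None:
            self.data = reply.data
        return self.received == self.expected


def home_demands(slaves):
    """Home node: one demand per slave, each carrying the reply count."""
    return [(slave, len(slaves)) for slave in slaves]
```

Under these assumptions the critical path is request to home (hop one), demand to slave (hop two), and reply to requester (hop three); the requester never waits for the home node to collect the replies.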

In one implementation, after the requesting node has received all expected replies, the requesting node may send a completion message back to the home node. The home node may then remove a "block" placed upon the coherency unit of the completed transaction.

The requesting node may further or alternatively send data back to the home node to achieve memory reflection after receiving data from a slave node. Furthermore, in cases where the home node contains the requested data in an appropriate state, e.g., state shared for a read-to-own request, the home node does not send any demands to other nodes. Instead, the home node replies directly to the requesting node. Further details regarding the communication protocol associated with system 10 are provided further below.

As used herein, a memory operation is an operation causing transfer of data from a source to a destination. The source and/or destination may be storage locations within the initiator, or may be storage locations within memory. When a source or destination is a storage location within memory, the source or destination is specified via an address conveyed with the memory operation. Memory operations may be read or write operations. A read operation causes transfer of data from a source outside of the initiator to a destination within the initiator. Conversely, a write operation causes transfer of data from a source within the initiator to a destination outside of the initiator. In the computer system shown in FIG. 1, a memory operation may include one or more transactions upon SMP bus 20 as well as one or more coherency operations upon network 14.

Each SMP node 12 is essentially an SMP system having memory 22 as the shared memory. Processors 16 are high performance processors. In one embodiment, each processor 16 is a SPARC processor compliant with version 9 of the SPARC processor architecture. It is noted, however, that any processor architecture may be employed by processors 16.

Typically, processors 16 include internal instruction and data caches. Therefore, external caches 18 are labeled as L2 caches (for level 2, wherein the internal caches are level 1 caches). If processors 16 are not configured with internal caches, then external caches 18 are level 1 caches. It is noted that the "level" nomenclature is used to identify proximity of a particular cache to the processing core within processor 16. Level 1 is nearest the processing core, level 2 is next nearest, etc. External caches 18 provide rapid access to memory addresses frequently accessed by the processor 16 coupled thereto. It is noted that external caches 18 may be configured in any of a variety of specific cache arrangements. For example, set-associative or direct-mapped configurations may be employed by external caches 18.

SMP bus 20 accommodates communication between processors 16 (through caches 18), memory 22, system interface 24, and I/O interfaces 26. In one embodiment, SMP bus 20 includes an address bus and related control signals, as well as a data bus and related control signals. Because the address and data buses are separate, a split-transaction bus protocol may be employed upon SMP bus 20. Generally speaking, a split-transaction bus protocol is a protocol in which a transaction occurring upon the address bus may differ from a concurrent transaction occurring upon the data bus. Transactions involving address and data include an address phase in which the address and related control information are conveyed upon the address bus, and a data phase in which the data is conveyed upon the data bus. Additional address phases and/or data phases for other transactions may be initiated prior to the data phase corresponding to a particular address phase. An address phase and the corresponding data phase may be correlated in a number of ways. For example, data transactions may occur in the same order that the address transactions occur. Alternatively, address and data phases of a transaction may be identified via a unique tag.
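The tag-based correlation of address and data phases may be sketched as follows. This is an illustrative model only, not the bus protocol of any particular embodiment; the class SplitTransactionBus and its methods are hypothetical names, and tags are assumed to be simple increasing integers.

```python
class SplitTransactionBus:
    """Toy model of a split-transaction bus that matches phases by tag."""

    def __init__(self):
        self._next_tag = 0
        self._pending = {}   # tag -> address still awaiting its data phase

    def address_phase(self, address):
        """Convey an address; return the tag that identifies the transaction."""
        tag = self._next_tag
        self._next_tag += 1
        self._pending[tag] = address
        return tag

    def data_phase(self, tag, data):
        """Convey data for an earlier address phase, in any order."""
        address = self._pending.pop(tag)
        return address, data

    def outstanding(self):
        """Number of address phases whose data phase has not yet occurred."""
        return len(self._pending)
```

In this sketch a second address phase may be issued before the first data phase occurs, and the data phases may complete out of order because each is matched to its address by tag rather than by position.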

Memory 22 is configured to store data and instruction code for use by processors 16. Memory 22 preferably comprises dynamic random access memory (DRAM), although any type of memory may be used. Memory 22, in conjunction with similar illustrated memories in the other SMP nodes 12, forms a distributed shared memory system. Each address in the address space of the distributed shared memory is assigned to a particular node, referred to as the home node of the address. A processor within a different node than the home node may access the data at an address of the home node, potentially caching the data. Therefore, coherency is maintained between SMP nodes 12 as well as among processors 16 and caches 18 within a particular SMP node 12A-12D. System interface 24 provides internode coherency, while snooping upon SMP bus 20 provides intranode coherency.
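The assignment of each address to a home node may be pictured with a short sketch. The text above does not state how addresses are assigned, so the encoding used here (the home node number held in the high-order address bits) is purely an assumption for illustration, as are the constants NODE_BITS and OFFSET_BITS and the function names.

```python
NODE_BITS = 2      # hypothetical: enough bits for the four nodes of FIG. 1
OFFSET_BITS = 30   # hypothetical: 1 GiB of address space per node


def home_node(address: int) -> int:
    """Return the node assumed to be the home of the given address."""
    return (address >> OFFSET_BITS) & ((1 << NODE_BITS) - 1)


def is_remote(address: int, local_node: int) -> bool:
    """True when the address is homed in a node other than local_node."""
    return home_node(address) != local_node
```

Under this assumed encoding, an access for which is_remote returns True is one that may require internode coherency activity, while other accesses may be satisfied by snooping within the node.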

In addition to maintaining internode coherency, system interface 24 detects addresses upon SMP bus 20 which require a data transfer to or from another SMP node 12. System interface 24 performs the transfer, and provides the corresponding data for the transaction upon SMP bus 20. In the embodiment shown, system interface 24 is coupled to a point-to-point network 14. However, it is noted that in alternative embodiments other networks may be used. In a point-to-point network, individual connections exist between each node upon the network. A particular node communicates directly with a second node via a dedicated link. To communicate with a third node, the particular node utilizes a different link than the one used to communicate with the second node.
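The point-to-point arrangement described above, in which each pair of nodes communicates over its own link, may be sketched as follows. The helper names build_links and link_between are hypothetical, and the sketch assumes a fully connected network with one dedicated link per pair of nodes.

```python
from itertools import combinations


def build_links(nodes):
    """Map each unordered pair of nodes to a dedicated link identifier."""
    return {frozenset(pair): link_id
            for link_id, pair in enumerate(combinations(nodes, 2))}


def link_between(links, a, b):
    """Return the link used for traffic between nodes a and b."""
    return links[frozenset((a, b))]
```

For the four nodes of FIG. 1 this assumption yields six links, and the link from node 12A to node 12B is distinct from the link from node 12A to node 12C.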

It is noted that, although four SMP nodes 12 are shown in FIG. 1, embodiments of computer system 10 employing any number of nodes are contemplated.

FIGS. 1A and 1B are conceptualized illustrations of distributed memory architectures supported by one embodiment of computer system 10. Specifically, FIGS. 1A and 1B illustrate alternative ways in which each SMP node 12 of FIG. 1 may cache data and perform memory accesses. The manner in which computer system 10 supports such accesses is described in further detail below.

Turning now to FIG. 1A, a logical diagram depicting a first memory architecture 30 supported by one embodiment of computer system 10 is shown. Architecture 30 includes multiple processors 32A-32D, multiple caches 34A-34D, multiple memories 36A-36D, and an interconnect network 38. The multiple memories 36 form a distributed shared memory. Each address within the address space corresponds to a location within one of memories 36.

Architecture 30 is a non-uniform memory architecture (NUMA). In a NUMA architecture, the amount of time required to access a first memory address may be substantially different than the amount of time required to access a second memory address. The access time depends upon the origin of the access and the location of the memory 36A-36D which stores the accessed data. For example, if processor 32A accesses a first memory address stored in memory 36A, the access time may be significantly shorter than the access time for an access to a second memory address stored in one of memories 36B-36D. That is, an access by processor 32A to memory 36A may be completed locally (e.g. without transfers upon network 38), while a processor 32A access to memory 36B is performed via network 38. Typically, an access through network 38 is slower than an access completed within a local memory. For example, a local access might be completed in a few hundred nanoseconds while an access via the network might occupy a few microseconds.
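The effect of non-uniform access times may be illustrated with a simple weighted average. The function name and the example latencies in the usage note are hypothetical; they are chosen only to fall within the "few hundred nanoseconds" and "few microseconds" ranges mentioned above, and the model ignores caching and contention.

```python
def average_access_time(local_fraction, local_ns, remote_ns):
    """Mean access time when local_fraction of accesses are to local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns
```

For example, with an assumed local latency of 300 ns and an assumed remote latency of 3000 ns, a workload in which half of the accesses are local averages 1650 ns per access, more than five times the purely local figure.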

Data corresponding to addresses stored in remote nodes may be cached in any of the caches 34. However, once a cache 34 discards the data corresponding to such a remote address, a subsequent access to the remote address is completed via a transfer upon network 38.
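The consequence of discarding cached remote data may be shown with a small model. The class NodeCache is hypothetical, and the least-recently-used replacement policy is an assumption made for the sketch; the text above does not specify a replacement policy for caches 34.

```python
from collections import OrderedDict


class NodeCache:
    """Toy cache that counts network transfers caused by remote misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()    # address -> present, in LRU order
        self.network_transfers = 0

    def access(self, address, is_remote):
        """Access an address; a miss on a remote address uses the network."""
        if address in self.lines:
            self.lines.move_to_end(address)   # mark as most recently used
            return "hit"
        if is_remote:
            self.network_transfers += 1
        self.lines[address] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)    # discard least recently used
        return "miss"
```

In this model a remote address that stays cached costs one network transfer in total, whereas a remote address that is discarded and accessed again costs a further transfer on each such access.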

NUMA architectures may provide excellent performance characteristics for software applications which use addresses that correspond primarily to a particular local memory. Software applications which exhibit more random access patterns and which do not confine their memory accesses to addresses within a particular local memory, on the other hand, may experience a large amount of network traffic as a particular processor 32 performs repeated accesses to remote nodes.

Turning now to FIG. 1B, a logic diagram depicting a second memory architecture 40 supported by the computer system 10 of FIG. 1 is shown. Architecture 40 includes multiple processors 42A-42D, multiple caches 44A-44D, multiple memories 46A-46D, and network 48. However, memorie