WikiPatents - Community Patent Review
Multiprocessing system configured to store coherency state within multiple subnodes of a processing node    
United States Patent 5878268
Link to this page: http://www.wikipatents.com/5878268.html
Inventor(s): Hagersten; Erik E. (Palo Alto, CA)
Abstract: A computer system including one or more processing nodes, each of which includes one or more subnodes is provided. One of the subnodes (the controller subnode) manages the interface between the processing node and the remainder of the computer system. Other subnodes (snooper subnodes) are employed to store access rights for coherency units within the memory. The processing node's memory is logically divided into portions, and each subnode stores access rights for a particular memory portion. When a transaction is initiated within the processing node, the subnode storing the access rights for the coherency unit affected by the transaction analyzes the access rights and determines if the transaction may complete locally within the processing node. If coherency activity is required, the subnode asserts an ignore signal causing the transaction to be delayed while coherency activity is performed to acquire sufficient access rights. The access rights are updated concurrent with reissue of a transaction for which coherency activity is performed. In this manner, the updated access rights are available to subsequent transactions even though the access rights may be stored in a different subnode than the controller subnode (which performs the reissue transaction). In one embodiment, the updated access rights are conveyed within one of the address phases of the reissue transaction. A bytemask field within one of the address phases is used.
   














Inventor     Hagersten; Erik E. (Palo Alto, CA)
Owner/Assignee     Sun Microsystems, Inc.
Publication Date     March 2, 1999
Application Number     08/674,274
Filing Date     July 1, 1996
US Classification     712/28
Int'l Classification     G06F 015/16
Examiner     Ellis; Richard L.
Assistant Examiner    
Attorney/Law Firm     Kivlin; B. Noel (Conley, Rose & Tayon, PC)
Address
Parent Case    
Priority Data    
USPTO Field of Search     395/800.28 395/800.29 395/800.3 395/379
References

U.S. References
Patent No.   Inventor   U.S. Class   Date
5613071      Rankin     707/10       Mar. 1997
5606686      Tarui      711/121      Feb. 1997
5577204      Brewer     710/317      Nov. 1996
5522058      Iwasa      711/145      May 1996
5446841      Kitano     709/213      Aug. 1995
5428803      Chen       712/6        Jun. 1995
4648030      Bomba      711/141      Mar. 1987
Claims
 


What is claimed is:

1. A method for completing a transaction in a processing node of a multiprocessing computer system, comprising:

reissuing said transaction within said processing node upon completion of coherency activity performed with respect to said transaction;

broadcasting within said processing node a coherency state corresponding to a coherency unit affected by said transaction concurrent with said reissuing; and

recording said coherency state in a position within a table of coherency states within said processing node, wherein said position corresponds to said coherency unit.

2. The method as recited in claim 1 wherein said coherency state is indicative of an access right accorded to said processing node with respect to said coherency unit.

3. The method as recited in claim 1 wherein said reissuing is performed upon a bus within said processing node.

4. The method as recited in claim 3 wherein said broadcasting is initiated and delivered upon said bus within said processing node.

5. The method as recited in claim 4 wherein said reissuing is performed by a first subnode within said processing node, wherein said first subnode is configured to communicate between said processing node and other processing nodes within said multiprocessing computer system.

6. The method as recited in claim 5 wherein said broadcasting is performed by said first subnode.

7. The method as recited in claim 6 wherein said recording is performed by a second subnode coupled to said bus, wherein said second subnode includes a first processing unit, a first memory unit, and a first cache unit.

8. The method as recited in claim 7 further comprising issuing a second transaction affecting said coherency unit subsequent to said reissuing, wherein said second subnode responds in accordance with said coherency state received in said recording, wherein said second subnode includes a first processing unit, a first memory unit, and a first cache unit.

9. A system interface comprising:

a first subnode configured to communicate between a local bus of a processing node and a network, and configured to store a first plurality of coherency states corresponding to a first plurality of coherency units; and

a second subnode coupled to said local bus, wherein said second subnode is configured to store a second plurality of coherency states corresponding to a second plurality of coherency units stored within said processing node, wherein said first subnode and said second subnode each include a first processing unit, a first memory unit, and a first cache unit.

10. The system interface as recited in claim 9 further comprising a third subnode configured to store a third plurality of coherency states corresponding to a third plurality of coherency units stored within said processing node.

11. The system interface as recited in claim 9 wherein said first subnode is further configured to reissue a transaction on said local bus upon completion of coherency activity associated with said transaction via said network, and wherein said first subnode is further configured to concurrently broadcast a coherency state corresponding to a coherency unit affected by said transaction on said local bus.

12. The system interface as recited in claim 11 wherein said second subnode is further configured to capture and store said coherency state if said coherency unit is one of said first plurality of coherency units.

13. The system interface as recited in claim 11 wherein said transaction comprises a first address phase and a second address phase.

14. The system interface as recited in claim 13 wherein said first address phase and said second address phase are performed on consecutive bus clock cycles.

15. The system interface as recited in claim 14 wherein said second address phase includes said coherency state.

16. The system interface as recited in claim 15 wherein said coherency state is specified via a bytemask field of said second address phase.

17. The system interface as recited in claim 16 wherein said bytemask includes an indication that said transaction is immune to assertion of an ignore signal upon said local bus.

18. The system interface as recited in claim 9 wherein said second subnode is configured to assert an ignore signal if a transaction is performed and a corresponding coherency state stored within said second subnode indicates that said processing node has insufficient access rights to an affected coherency unit.

19. The system interface as recited in claim 18 wherein said second subnode is further configured to assert a second signal upon a second signal line coupled to said first subnode, and wherein said second signal informs said first subnode that said second subnode is asserting said ignore signal.

20. A computer system comprising:

a network; and

a first processing node coupled to said network, said first processing node including:

a first controller subnode configured to effectuate communication upon said network and to reissue a transaction for which said communication is effectuated upon completion of said communication, and further to broadcast a coherency state achieved via said communication; and

a first snooper subnode configured to store a first plurality of coherency states corresponding to a first plurality of coherency units stored within said first processing node, wherein said first snooper subnode is configured to capture said coherency state broadcast by said first controller subnode if said coherency state is one of said plurality of coherency states, wherein said first controller subnode and said first snooper subnode each includes a first processing unit, a first memory unit, and a first cache unit.

21. The computer system as recited in claim 20 wherein said controller subnode is configured to store a second plurality of coherency states corresponding to a second plurality of coherency units.
Description
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates to the field of multiprocessor computer systems and, more particularly, to storing coherency states within multiple subnodes of processing nodes in distributed shared memory multiprocessing computer systems.

2. Description of the Relevant Art

Multiprocessing computer systems include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while other processors perform unrelated computing tasks. Alternatively, components of a particular computing task may be distributed among multiple processors to decrease the time required to perform the computing task as a whole. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.

A popular architecture in commercial multiprocessing computer systems is the symmetric multiprocessor (SMP) architecture. Typically, an SMP computer system comprises multiple processors connected through a cache hierarchy to a shared bus. Additionally connected to the bus is a memory, which is shared among the processors in the system. Access to any particular memory location within the memory occurs in a similar amount of time as access to any other particular memory location. Since each location in the memory may be accessed in a uniform manner, this structure is often referred to as a uniform memory architecture (UMA).

Processors are often configured with internal caches, and one or more caches are typically included in the cache hierarchy between the processors and the shared bus in an SMP computer system. Multiple copies of data residing at a particular main memory address may be stored in these caches. In order to maintain the shared memory model, in which a particular address stores exactly one data value at any given time, shared bus computer systems employ cache coherency. Generally speaking, an operation is coherent if the effects of the operation upon data stored at a particular memory address are reflected in each copy of the data within the cache hierarchy. For example, when data stored at a particular memory address is updated, the update may be supplied to the caches which are storing copies of the previous data. Alternatively, the copies of the previous data may be invalidated in the caches such that a subsequent access to the particular memory address causes the updated copy to be transferred from main memory. For shared bus systems, a snoop bus protocol is typically employed. Each coherent transaction performed upon the shared bus is examined (or "snooped") against data in the caches. If a copy of the affected data is found, the state of the cache line containing the data may be updated in response to the coherent transaction.
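
The invalidation alternative described above can be illustrated with a small model. This is only a sketch under stated assumptions: the `Cache` and `SharedBus` classes, the write-through policy, and the default memory value of 0 are hypothetical simplifications introduced here and are not taken from the patent.

```python
# Hypothetical model of write-invalidate snooping on a shared bus.
class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> cached value

    def snoop_write(self, address):
        # Drop any stale copy when another agent writes this address.
        self.lines.pop(address, None)


class SharedBus:
    def __init__(self):
        self.memory = {}  # address -> value in main memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, address):
        # Hit in the local cache, otherwise fetch from main memory.
        if address not in cache.lines:
            cache.lines[address] = self.memory.get(address, 0)
        return cache.lines[address]

    def write(self, cache, address, value):
        # Every other cache snoops the write and invalidates its copy.
        for other in self.caches:
            if other is not cache:
                other.snoop_write(address)
        cache.lines[address] = value
        self.memory[address] = value  # write-through, assumed for simplicity


bus = SharedBus()
c0, c1 = Cache("c0"), Cache("c1")
bus.attach(c0)
bus.attach(c1)
bus.memory[0x40] = 1
first = bus.read(c1, 0x40)   # c1 caches the old value
bus.write(c0, 0x40, 2)       # c1's copy is invalidated by the snoop
second = bus.read(c1, 0x40)  # c1 refetches the updated value
```

A real snoop protocol tracks per-line states rather than simply dropping entries; the model keeps only the behavior the paragraph describes.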

Unfortunately, shared bus architectures suffer from several drawbacks which limit their usefulness in multiprocessing computer systems. A bus is capable of a peak bandwidth (e.g. a number of bytes/second which may be transferred across the bus). As additional processors are attached to the bus, the bandwidth required to supply the processors with data and instructions may exceed the peak bus bandwidth. Since some processors are forced to wait for available bus bandwidth, performance of the computer system suffers when the bandwidth requirements of the processors exceed the available bus bandwidth.

Additionally, adding more processors to a shared bus increases the capacitive loading on the bus and may even cause the physical length of the bus to be increased. The increased capacitive loading and extended bus length increase the delay in propagating a signal across the bus. Due to the increased propagation delay, transactions may take longer to perform. Therefore, the peak bandwidth of the bus may decrease as more processors are added.

These problems are further magnified by the continued increase in operating frequency and performance of processors. The increased performance enabled by the higher frequencies and more advanced processor microarchitectures results in higher bandwidth requirements than previous processor generations, even for the same number of processors. Therefore, buses which previously provided sufficient bandwidth for a multiprocessing computer system may be insufficient for a similar computer system employing the higher performance processors.

Another structure for multiprocessing computer systems is a distributed shared memory architecture. A distributed shared memory architecture includes multiple nodes within which processors and memory reside. The multiple nodes communicate via a network coupled therebetween. When considered as a whole, the memory included within the multiple nodes forms the shared memory for the computer system. Typically, directories are used to identify which nodes have cached copies of data corresponding to a particular address. Coherency activities may be generated via examination of the directories.
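
As a rough illustration of how a directory can drive coherency activity, the sketch below tracks which nodes hold a copy of each address and reports which copies a write must invalidate. The `Directory` class, its method names, and the string node identifiers are hypothetical and chosen only for this example; the directory format used by the patent (FIG. 2A) is not reproduced here.

```python
from collections import defaultdict


class Directory:
    """Hypothetical directory: address -> set of nodes holding a cached copy."""

    def __init__(self):
        self.sharers = defaultdict(set)

    def record_read(self, address, node):
        # A node that reads the address now holds a cached copy.
        self.sharers[address].add(node)

    def record_write(self, address, writer):
        # Examining the directory yields the coherency activity:
        # every other node with a copy must be invalidated.
        targets = sorted(self.sharers[address] - {writer})
        self.sharers[address] = {writer}
        return targets


directory = Directory()
directory.record_read(0x100, "A")
directory.record_read(0x100, "B")
directory.record_read(0x100, "C")
invalidated = directory.record_write(0x100, "B")  # nodes A and C lose their copies
```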

Distributed shared memory systems are scalable, overcoming the limitations of the shared bus architecture. Since many of the processor accesses are completed within a node, nodes typically have much lower bandwidth requirements upon the network than a shared bus architecture must provide upon its shared bus. The nodes may operate at high clock frequency and bandwidth, accessing the network when needed. Additional nodes may be added to the network without affecting the local bandwidth of the nodes. Instead, only the network bandwidth is affected.

Many distributed shared memory systems suffer from a limitation upon the memory which may be included within a node. The limitation arises not from the number of memory modules (such as dynamic random access memory, or DRAM, modules which are popular in the industry) which may be configured into a node to form the memory, but instead arises from the amount of memory which may be used to store the access rights of the node to a particular coherency unit within the memory. In order to maintain system-wide memory coherency, the access rights granted to a particular node must be respected by that node. However, the node typically employs high speed internal communications, such that the access rights must be accessible very quickly. DRAM is typically not suitable for high speed access. Instead, static random access memory (SRAM) modules are typically used to store the access rights.

While SRAM modules may respond with speeds suitable for use in storing access rights, SRAM modules suffer from other drawbacks. SRAM modules are not fabricated with the densities typified by DRAM. In other words, a much larger number of SRAM modules must be used to store the same number of bits as a particular number of DRAM modules. Unfortunately, the lack of density in SRAM modules leads to increased pinouts on modules housing the control logic which interfaces to the SRAM modules in order to store, retrieve, and analyze access rights corresponding to a coherency unit accessed by a transaction occurring within the node. The number of SRAM modules which may be used is therefore limited by the number of pins available on the control logic modules. Hence, the number of access rights (and therefore the number of coherency units) which may be stored in the node is limited. Additionally, SRAM modules are significantly more expensive than DRAM modules. In order to minimize the cost of the computer system, it is important to minimize the number of SRAM modules included.

For at least the above reasons, the amount of memory needed to store access rights may limit the amount of main memory which may be included within the node. Still further, if less than the maximum amount of memory is included in a node, it is desirable to reduce the memory dedicated to storing access rights accordingly. In addition, it is desirable to be able to upgrade the amount of memory in a given node subsequent to manufacture of the computer system. Therefore, the amount of memory used for storing access rights must be similarly increasable.

SUMMARY OF THE INVENTION

The problems outlined above are in large part solved by a computer system in accordance with the present invention. The computer system includes one or more processing nodes, each of which includes one or more subnodes. One of the subnodes (the controller subnode) manages the interface between the processing node and the remainder of the computer system. Other subnodes (snooper subnodes) are employed to store access rights for coherency units within the memory. The processing node's memory is logically divided into portions, and each subnode stores access rights for a particular memory portion. When a transaction is initiated within the processing node, the subnode storing the access rights for the coherency unit affected by the transaction analyzes the access rights and determines if the transaction may complete locally within the processing node. If coherency activity is required, the subnode asserts an ignore signal causing the transaction to be delayed while coherency activity is performed to acquire sufficient access rights.
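
The access-rights check described above might be modeled as follows. This is a minimal sketch, not the patented logic: the three-level `Access` encoding, the 64-byte coherency unit, the 1 MiB memory portions, and the names `Subnode` and `ignore_asserted` are all assumptions made for illustration.

```python
from enum import IntEnum

COHERENCY_UNIT_BYTES = 64  # assumed coherency unit size


class Access(IntEnum):
    # Hypothetical ordering of access rights, weakest to strongest.
    NONE = 0
    READ = 1
    WRITE = 2


class Subnode:
    """Stores access rights for one logical portion of the node's memory."""

    def __init__(self, base, size):
        self.base, self.size = base, size
        self.rights = {}  # coherency unit index -> Access

    def owns(self, address):
        return self.base <= address < self.base + self.size

    def insufficient(self, address, needed):
        unit = address // COHERENCY_UNIT_BYTES
        held = self.rights.get(unit, Access.NONE)
        return held < needed


def ignore_asserted(subnodes, address, needed):
    # Only the subnode covering the address decides; True means the
    # transaction is delayed while coherency activity is performed.
    for subnode in subnodes:
        if subnode.owns(address):
            return subnode.insufficient(address, needed)
    raise ValueError("no subnode covers this address")


MIB = 1 << 20
subnodes = [Subnode(0, MIB), Subnode(MIB, MIB)]
subnodes[1].rights[(MIB + 128) // COHERENCY_UNIT_BYTES] = Access.READ
read_ignored = ignore_asserted(subnodes, MIB + 128, Access.READ)
write_ignored = ignore_asserted(subnodes, MIB + 128, Access.WRITE)
```

Here the read completes locally, while the write would be delayed until the node acquires write access.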

The access rights are updated concurrent with reissue of a transaction for which coherency activity is performed. In this manner, the updated access rights are available to subsequent transactions even though the access rights may be stored in a different subnode than the controller subnode (which performs the reissue transaction). In one embodiment, the updated access rights are conveyed within one of the address phases of the reissue transaction. A bytemask field within one of the address phases is used.

By dividing the access rights storage into multiple subnodes, subnodes may be added to increase the number of access rights which may be stored within a processing node. Consequently, the amount of memory (measured in coherency units) may be increased beyond the number of coherency units manageable by one subnode. Advantageously, the computer system exhibits a high degree of flexibility and reconfigurability. For example, the computer system may be purchased with a small amount of memory and later upgraded to a larger amount of memory relatively easily.

Additionally, the division of access rights storage solves the physical problems of storing the access rights in fast but sparse SRAM-type memory. Each subnode may be configured with a certain number of banks of SRAM (for example, two). When the number of access rights to be stored requires more than the certain number of banks, then another subnode may be added. In this manner, the number of signal lines to which the control logic within any given subnode connects is limited to a smaller number than if a single subnode were used. Advantageously, the control logic may use commercially available packaging since the number of pins required is minimized. Still further, the fast SRAM may be used, satisfying speed requirements for access during high speed intranode communications.
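
The bank-based scaling rule can be expressed as simple arithmetic. In this sketch the function name `subnodes_needed` and the 1 Mibit bank capacity are hypothetical; only the figure of two banks per subnode comes from the example in the text.

```python
def subnodes_needed(access_rights_bits, bits_per_bank, banks_per_subnode=2):
    """Number of subnodes required to hold the given access-rights storage."""
    # Ceiling division using integers only, to avoid floating-point rounding.
    banks = -(-access_rights_bits // bits_per_bank)
    return max(1, -(-banks // banks_per_subnode))


BANK_BITS = 1 << 20  # assumed capacity of one SRAM bank (1 Mibit)
fits_in_one = subnodes_needed(2 * BANK_BITS, BANK_BITS)   # exactly two banks
needs_three = subnodes_needed(5 * BANK_BITS, BANK_BITS)   # five banks -> three subnodes
```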

Broadly speaking, the present invention contemplates a method for completing a transaction in a processing node of a multiprocessing computer system. The transaction is reissued within the processing node upon completion of coherency activity performed with respect to the transaction. Concurrently, a new coherency state corresponding to a coherency unit affected by the transaction is broadcast. The coherency state is recorded in a position within a table of coherency states. The position corresponds to the coherency unit.

The present invention further contemplates a system interface comprising a first subnode and a second subnode. The first subnode is configured to communicate between a local bus of a processing node and a network. Coupled to the local bus, the second subnode is configured to store a first plurality of coherency states corresponding to a first plurality of coherency units stored within the processing node.

The present invention still further contemplates a computer system comprising a network and a first processing node. The first processing node is coupled to the network and includes a controller subnode and a snooper subnode. The controller subnode is configured to effectuate communication upon the network and to reissue a transaction for which the communication is effectuated upon completion of the communication. Furthermore, the controller subnode is configured to broadcast a coherency state achieved via the communication. The snooper subnode is configured to store a first plurality of coherency states corresponding to a first plurality of coherency units stored within the first processing node. Additionally, the snooper subnode is configured to capture the coherency state broadcast by the controller subnode if the coherency state is one of the plurality of coherency states.

BRIEF DESCRIPTION OF THE DRAWINGS

Other objects and advantages of the invention will become apparent upon reading the following detailed description and upon reference to the accompanying drawings in which:

FIG. 1 is a block diagram of a multiprocessor computer system.

FIG. 1A is a conceptualized block diagram depicting a non-uniform memory architecture supported by one embodiment of the computer system shown in FIG. 1.

FIG. 1B is a conceptualized block diagram depicting a cache-only memory architecture supported by one embodiment of the computer system shown in FIG. 1.

FIG. 2 is a block diagram of one embodiment of a symmetric multiprocessing node depicted in FIG. 1.

FIG. 2A is an exemplary directory entry stored in one embodiment of a directory depicted in FIG. 2.

FIG. 3 is a block diagram of one embodiment of a system interface shown in FIG. 1.

FIG. 4 is a diagram depicting activities performed in response to a typical coherency operation between a request agent, a home agent, and a slave agent.

FIG. 5 is an exemplary coherency operation performed in response to a read to own request from a processor.

FIG. 6 is a flowchart depicting an exemplary state machine for one embodiment of a request agent shown in FIG. 3.

FIG. 7 is a flowchart depicting an exemplary state machine for one embodiment of a home agent shown in FIG. 3.

FIG. 8 is a flowchart depicting an exemplary state machine for one embodiment of a slave agent shown in FIG. 3.

FIG. 9 is a table listing request types according to one embodiment of the system interface.

FIG. 10 is a table listing demand types according to one embodiment of the system interface.

FIG. 11 is a table listing reply types according to one embodiment of the system interface.

FIG. 12 is a table listing completion types according to one embodiment of the system interface.

FIG. 13 is a table describing coherency operations in response to various operations performed by a processor, according to one embodiment of the system interface.

FIG. 14 is a block diagram of a second embodiment of a symmetric multiprocessing node depicted in FIG. 1.

FIG. 15 is a timing diagram depicting a portion of a transaction upon a bus of the symmetric multiprocessing node shown in FIG. 14.

FIG. 16 is a timing diagram depicting a portion of two transactions upon the bus of the symmetric multiprocessing node shown in FIG. 14, highlighting update of an MTAG state with respect to the first of the two transactions.

FIG. 17 is a diagram depicting fields of a bytemask transmitted upon the bus of the symmetric multiprocessing node shown in FIG. 14, according to one embodiment of the symmetric multiprocessing node.

FIG. 18 is a diagram of control registers employed in one embodiment of the symmetric multiprocessing node shown in FIG. 14.

FIG. 19 is a diagram of an address space of one embodiment of the symmetric multiprocessing node shown in FIG. 14.

FIG. 20 is a diagram depicting an exemplary MTAG layout employed by one embodiment of the symmetric multiprocessing node shown in FIG. 14.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.

DETAILED DESCRIPTION OF THE INVENTION

Turning now to FIG. 1, a block diagram of one embodiment of a multiprocessing computer system 10 is shown. Computer system 10 includes multiple SMP nodes 12A-12D interconnected by a point-to-point network 14. Elements referred to herein with a particular reference number followed by a letter will be collectively referred to by the reference number alone. For example, SMP nodes 12A-12D will be collectively referred to as SMP nodes 12. In the embodiment shown, each SMP node 12 includes multiple processors, external caches, an SMP bus, a memory, and a system interface. For example, SMP node 12A is configured with multiple processors including processors 16A-16B. The processors 16 are connected to external caches 18, which are further coupled to an SMP bus 20. Additionally, a memory 22 and a system interface 24 are coupled to SMP bus 20. Still further, one or more input/output (I/O) interfaces 26 may be coupled to SMP bus 20. I/O interfaces 26 are used to interface to peripheral devices such as serial and parallel ports, disk drives, modems, printers, etc. Other SMP nodes 12B-12D may be configured similarly.

Generally speaking, system interface 24 comprises one or more subnodes. One of the subnodes (the controller subnode) includes an interface to network 14, while other subnodes simply maintain storage for access rights to coherency units stored within memory 22. Upon completion of coherency activity via network 14 in response to a transaction, the controller subnode reissues the transaction upon SMP bus 20. Concurrently, the controller subnode provides an updated access rights value for storage in the subnode corresponding to the affected coherency unit. Because the updated access rights value is provided concurrent with the reissued transaction, the updated access rights are available to subsequent transactions through the corresponding subnode. Advantageously, the access rights are logically updated at the same time as the transaction completes, in accordance with the memory coherency model supported by computer system 10. The update is performed even though the access rights may be stored in a subnode other than the controller subnode. It is noted that the controller subnode may include a portion of the access rights memory (referred to herein as the MTAG) as well.
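
The reissue-with-broadcast behavior can be sketched as follows. The classes `Snooper` and `Controller`, the string state names, and the unit ranges are hypothetical and serve only to show the ordering: the new state accompanies the reissued transaction, so the subnode owning the coherency unit has it recorded before any later transaction is checked.

```python
class Snooper:
    """Hypothetical snooper subnode holding MTAG entries for a range of units."""

    def __init__(self, first_unit, last_unit):
        self.units = range(first_unit, last_unit + 1)
        self.mtag = {}  # coherency unit index -> state

    def snoop_reissue(self, unit, new_state):
        # Capture the broadcast state only if this subnode covers the unit.
        if unit in self.units:
            self.mtag[unit] = new_state
            return True
        return False


class Controller:
    """Hypothetical controller subnode that reissues transactions on the bus."""

    def __init__(self, snoopers):
        self.snoopers = snoopers

    def reissue(self, transaction, unit, new_state):
        # The updated state is carried with the reissued transaction,
        # so every subnode on the bus observes it at the same time.
        captured = [s.snoop_reissue(unit, new_state) for s in self.snoopers]
        return transaction, captured


snoopers = [Snooper(0, 99), Snooper(100, 199)]
controller = Controller(snoopers)
_, captured = controller.reissue("read_to_own", 150, "modified")
state_seen_by_next_transaction = snoopers[1].mtag[150]
```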

In one embodiment, each subnode which forms a portion of system interface 24 comprises a printed circuit board which is independently inserted into a backplane comprising SMP bus 20. The number of subnodes included is configurable, and therefore is expandable if the size of memory 22 is expanded. Advantageously, the amount of MTAG memory is adjustable to match the amount needed for the size of memory 22. For example, if each coherency unit is 64 bytes and access rights comprise two bits per coherency unit, then the amount of MTAG memory is 1/256th of the size of memory 22. Computer system 10 may initially be manufactured with a particular amount of memory, and subsequently memory may be added or deleted by adding or deleting memory modules from memory 22 and adding or deleting subnodes from system interface 24.
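
The 1/256 ratio follows from 2 bits of access rights per 64 x 8 = 512 bits of memory. A short check of that arithmetic, with the helper names `mtag_fraction` and `mtag_bytes` and the 1 GiB memory size chosen here purely for illustration:

```python
from fractions import Fraction


def mtag_fraction(unit_bytes=64, rights_bits=2):
    """MTAG storage as a fraction of the memory it describes."""
    return Fraction(rights_bits, unit_bytes * 8)


def mtag_bytes(memory_bytes, unit_bytes=64, rights_bits=2):
    """MTAG storage in bytes for a memory of the given size."""
    units = memory_bytes // unit_bytes
    return units * rights_bits // 8


GIB = 1 << 30
ratio = mtag_fraction()        # Fraction(1, 256)
mtag_size = mtag_bytes(GIB)    # 4 MiB of MTAG for 1 GiB of memory
```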

Generally speaking, a memory operation is an operation causing transfer of data from a source to a destination. The source and/or destination may be storage locations within the initiator, or may be storage locations within memory. When a source or destination is a storage location within memory, the source or destination is specified via an address conveyed with the memory operation. Memory operations may be read or write operations. A read operation causes transfer of data from a source outside of the initiator to a destination within the initiator. Conversely, a write operation causes transfer of data from a source within the initiator to a destination outside of the initiator. In the computer system shown in FIG. 1, a memory operation may include one or more transactions upon SMP bus 20 as well as one or more coherency operations upon network 14.

Architectural Overview

Each SMP node 12 is essentially an SMP system having memory 22 as the shared memory. Processors 16 are high performance processors. In one embodiment, each processor 16 is a SPARC processor compliant with version 9 of the SPARC processor architecture. It is noted, however, that any processor architecture may be employed by processors 16.

Typically, processors 16 include internal instruction and data caches. Therefore, external caches 18 are labeled as L2 caches (for level 2, wherein the internal caches are level 1 caches). If processors 16 are not configured with internal caches, then external caches 18 are level 1 caches. It is noted that the "level" nomenclature is used to identify proximity of a particular cache to the processing core within processor 16. Level 1 is nearest the processing core, level 2 is next nearest, etc. External caches 18 provide rapid access to memory addresses frequently accessed by the processor 16 coupled thereto. It is noted that external caches 18 may be configured in any of a variety of specific cache arrangements. For example, set-associative or direct-mapped configurations may be employed by external caches 18.

SMP bus 20 accommodates communication between processors 16 (through caches 18), memory 22, system interface 24, and I/O interface 26. In one embodiment, SMP bus 20 includes an address bus and related control signals, as well as a data bus and related control signals. Because the address and data buses are separate, a split-transaction bus protocol may be employed upon SMP bus 20. Generally speaking, a split-transaction bus protocol is a protocol in which a transaction occurring upon the address bus may differ from a concurrent transaction occurring upon the data bus. Transactions involving address and data include an address phase in which the address and related control information is conveyed upon the address bus, and a data phase in which the data is conveyed upon the data bus. Additional address phases and/or data phases for other transactions may be initiated prior to the data phase corresponding to a particular address phase. An address phase and the corresponding data phase may be correlated in a number of ways. For example, data transactions may occur in the same order that the address transactions occur. Alternatively, address and data phases of a transaction may be identified via a unique tag.
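
The tag-based correlation of address and data phases mentioned above can be sketched as follows. This is a simplified software model, not the bus protocol itself; the class name SplitTransactionBus, its methods, and the sequential tag allocation are all assumptions made for illustration.

```python
class SplitTransactionBus:
    """Toy model: address phases and data phases matched by a unique tag."""

    def __init__(self):
        self.pending = {}   # tag -> address still awaiting its data phase
        self.next_tag = 0   # assumed allocation scheme: a simple counter

    def address_phase(self, address):
        """Convey an address and return the tag identifying the transaction."""
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = address
        return tag

    def data_phase(self, tag, data):
        """Convey data for an earlier address phase, looked up by its tag."""
        address = self.pending.pop(tag)
        return address, data


bus = SplitTransactionBus()
t0 = bus.address_phase(0x1000)
t1 = bus.address_phase(0x2000)   # a second address phase begins before any data phase
second = bus.data_phase(t1, b"B")  # data phases complete out of address order
first = bus.data_phase(t0, b"A")
print(first, second)
```

With the in-order alternative described in the text, the tag would be unnecessary and a FIFO of pending addresses would suffice.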

Memory 22 is configured to store data and instruction code for use by processors 16. Memory 22 preferably comprises dynamic random access memory (DRAM), although any type of memory may be used. Memory 22, in conjunction with similar illustrated memories in the other SMP nodes 12, forms a distributed shared memory system. Each address in the address space of the distributed shared memory is assigned to a particular node, referred to as the home node of the address. A processor within a different node than the home node may access the data at an address of the home node, potentially caching the data. Therefore, coherency is maintained between SMP nodes 12 as well as among processors 16 and caches 18 within a particular SMP node 12A-12D. System interface 24 provides internode coherency, while snooping upon SMP bus 20 provides intranode coherency.

In addition to maintaining internode coherency, system interface 24 detects addresses upon SMP bus 20 which require a data transfer to or from another SMP node 12. System interface 24 performs the transfer, and provides the corresponding data for the transaction upon SMP bus 20. In the embodiment shown, system interface 24 is coupled to a point-to-point network 14. However, it is noted that in alternative embodiments other networks may be used. In a point-to-point network, individual connections exist between each node upon the network. A particular node communicates directly with a second node via a dedicated link. To communicate with a third node, the particular node utilizes a different link than the one used to communicate with the second node.

It is noted that, although four SMP nodes 12 are shown in FIG. 1, embodiments of computer system 10 employing any number of nodes are contemplated.

FIGS. 1A and 1B are conceptualized illustrations of distributed memory architectures supported by one embodiment of computer system 10. Specifically, FIGS. 1A and 1B illustrate alternative ways in which each SMP node 12 of FIG. 1 may cache data and perform memory accesses. The manner in which computer system 10 supports such accesses is described in further detail below.

Turning now to FIG. 1A, a logical diagram depicting a first memory architecture 30 supported by one embodiment of computer system 10 is shown. Architecture 30 includes multiple processors 32A-32D, multiple caches 34A-34D, multiple memories 36A-36D, and an interconnect network 38. The multiple memories 36 form a distributed shared memory. Each address within the address space corresponds to a location within one of memories 36.

Architecture 30 is a non-uniform memory architecture (NUMA). In a NUMA architecture, the amount of time required to access a first memory address may be substantially different than the amount of time required to access a second memory address. The access time depends upon the origin of the access and the location of the memory 36A-36D which stores the accessed data. For example, if processor 32A accesses a first memory address stored in memory 36A, the access time may be significantly shorter than the access time for an access to a second memory address stored in one of memories 36B-36D. That is, an access by processor 32A to memory 36A may be completed locally (e.g. without transfers upon network 38), while a processor 32A access to memory 36B is performed via network 38. Typically, an access through network 38 is slower than an access completed within a local memory. For example, a local access might be completed in a few hundred nanoseconds while an access via the network might occupy a few microseconds.

Data corresponding to addresses stored in remote nodes may be cached in any of the caches 34. However, once a cache 34 discards the data corresponding to such a remote address, a subsequent access to the remote address is completed via a transfer upon network 38.

NUMA architectures may provide excellent performance characteristics for software applications which use addresses that correspond primarily to a particular local memory. Software applications which exhibit more random access patterns and which do not confine their memory accesses to addresses within a particular local memory, on the other hand, may experience a large amount of network traffic as a particular processor 32 performs repeated accesses to remote nodes.
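
The effect of locality on NUMA performance can be made concrete with a small model. The sketch below assumes 300 ns for a local access and 3000 ns for a network access, which are only representative picks from the "few hundred nanoseconds" and "few microseconds" ranges given above; the function name average_access_ns is likewise hypothetical.

```python
LOCAL_NS = 300     # assumed latency of an access completed in local memory
REMOTE_NS = 3000   # assumed latency of an access performed via the network


def average_access_ns(local_accesses, remote_accesses):
    """Mean access time for a mix of local and remote memory accesses."""
    total = local_accesses + remote_accesses
    return (local_accesses * LOCAL_NS + remote_accesses * REMOTE_NS) / total


print(average_access_ns(90, 10))   # mostly local accesses: 570.0
print(average_access_ns(25, 75))   # spread across four memories: 2325.0
```

Under these assumed latencies, an application that keeps 90% of its accesses local sees roughly a quarter of the average latency of one whose accesses are spread evenly over four memories.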

Turning now to FIG. 1B, a logic diagram depicting a second memory architecture 40 supported by the computer system 10 of FIG. 1 is shown. Architecture 40 includes multiple processors 42A-42D, multiple caches 44A-44D, multiple memories 46A-46D, and network 48. However, memories 46 are logically coupled between caches 44 and network 48. Memories 46 serve as larger caches (e.g. a level 3 cache), storing addresses which are accessed by the corresponding processors 42. Memories 46 are said to "attract" the data being operated upon by a corresponding processor 42. As opposed to the NUMA architecture shown in FIG. 1A, architecture 40 reduces the number of accesses upon the network 48 by storing remote data in the local memory when the local processor accesses that data.

Architecture 40 is referred to as a cache-only memory architecture (COMA). Multiple locations within the distributed shared memory formed by the combination of memories 46 may store data corresponding to a particular address. No permanent mapping of a particular address to a particular storage location is assigned. Instead, the location storing data corresponding to the particular address changes dynamically based upon the processors 42 which access that particular address. Conversely, in the NUMA architecture a particular storage location within memories 36 is assigned to a particular address. Architecture 40 adjusts to the memory access patterns performed by applications executing thereon, and coherency is maintained between the memories 46.
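
The "attraction" behavior described above can be illustrated with a toy model in which each local memory acts as a cache that is filled on first access. The class ComaSystem and its methods are hypothetical names; the sketch models read replication only and omits the coherency protocol, writes, and eviction.

```python
class ComaSystem:
    """Toy COMA model: data migrates into the memory of the accessing node."""

    def __init__(self, node_names):
        # One attraction memory (address -> data) per node.
        self.memories = {name: {} for name in node_names}
        self.network_transfers = 0

    def install(self, node, address, data):
        """Place data in one node's memory (e.g. at page allocation)."""
        self.memories[node][address] = data

    def read(self, node, address):
        """Read an address, attracting a copy into the local memory on a miss."""
        local = self.memories[node]
        if address in local:
            return local[address]            # completes without network traffic
        for other in self.memories.values():
            if address in other:
                self.network_transfers += 1  # one transfer upon the network
                local[address] = other[address]
                return local[address]
        raise KeyError(address)


system = ComaSystem(["A", "B"])
system.install("A", 0x40, "payload")
system.read("B", 0x40)   # miss in B: fetched over the network and kept locally
system.read("B", 0x40)   # hit in B's attraction memory
print(system.network_transfers)   # 1
```

In a NUMA arrangement without a local copy in a cache, both reads from node B would cross the network; here only the first does.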

In a preferred embodiment, computer system 10 supports both of the memory architectures shown in FIGS. 1A and 1B. In particular, a memory address may be accessed in a NUMA fashion from one SMP node 12A-12D while being accessed in a COMA manner from another SMP node 12A-12D. In one embodiment, a NUMA access is detected if certain bits of the address upon SMP bus 20 identify another SMP node 12 as the home node of the address presented. Otherwise, a COMA access is presumed. Additional details will be provided below.
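
The address-based distinction between NUMA and COMA accesses might be sketched as below. The text says only that "certain bits" of the address identify the home node, so the bit position (NODE_ID_SHIFT) and field width (NODE_ID_MASK) used here are purely hypothetical, as are the function names.

```python
NODE_ID_SHIFT = 40   # hypothetical position of the home-node field
NODE_ID_MASK = 0x3   # hypothetical two-bit field, enough for four nodes


def home_node(address):
    """Extract the (assumed) home-node field from an address."""
    return (address >> NODE_ID_SHIFT) & NODE_ID_MASK


def access_mode(address, local_node):
    """Classify an access as NUMA if the address names another node as home."""
    return "NUMA" if home_node(address) != local_node else "COMA"


remote_address = (2 << NODE_ID_SHIFT) | 0x1000   # home node 2
local_address = 0x1000                           # home node 0
print(access_mode(remote_address, local_node=0))  # NUMA
print(access_mode(local_address, local_node=0))   # COMA
```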

In one embodiment, the COMA architecture is implemented using a combination of hardware and software techniques. Hardware maintains coherency between the locally cached copies of pages, and software (e.g. the operating system employed in computer system 10) is responsible for allocating and deallocating cached pages.

FIG. 2 depicts details of one implementation of an SMP node 12A that generally conforms to the SMP node 12A shown in FIG. 1. Other nodes 12 may be configured similarly. It is noted that alternative specific implementations of each SMP node 12 of FIG. 1 are also possible. The implementation of SMP node 12A shown in FIG. 2 includes multiple subnodes such as subnodes 50A and 50B. Each subnode 50 includes two processors 16 and corresponding caches 18, a memory portion 56, an address controller 52, and a data controller 54. The memory portions 56 within subnodes 50 collectively form the memory 22 of the SMP node 12A of FIG. 1. Other subnodes (not shown) are further coupled to SMP bus 20 to form the I/O interfaces 26.

As shown in FIG. 2, SMP bus 20 includes an address bus 58 and a data bus 60. Address controller 52 is coupled to address bus 58, and data controller 54 is coupled to data bus 60. FIG. 2 also illustrates system interface 24, including a system interface logic block 62, a translation storage 64, a directory 66, and a memory tag (MTAG) 68. Logic block 62 is coupled to both address bus 58 and data bus 60, and asserts an ignore signal 70 upon address bus 58 under certain circumstances as will be explained further below. Additionally, logic block 62 is coupled to translation storage 64, directory 66, MTAG 68, and network 14.

For the embodiment of FIG. 2, each subnode 50 is configured upon a printed circuit board which may be inserted into a backplane upon which SMP bus 20 is situated. In this manner, the number of processors and/or I/O interfaces 26 included within an SMP node 12 may be varied by inserting or removing subnodes 50. For example, computer system 10 may initially be configured with a small number of subnodes 50. Additional subnodes 50 may be added from time to time as the computing power required by the users of computer system 10 grows.

Address controller 52 provides an interface between caches 18 and the address portion of SMP bus 20. In the embodiment shown, address controller 52 includes an out queue 72 and some number of in queues 74. Out queue 72 buffers transactions from the processors connected thereto until address controller 52 is granted access to address bus 58. Address controller 52 performs the transactions stored in out queue 72 in the order those transactions were placed into out queue 72 (i.e. out queue 72 is a FIFO queue). Transactions performed by address controller 52 as well as transactions received from address bus 58 which are to be snooped by caches 18 and caches internal to processors 16 are placed into in queue 74.

Similar to out queue 72, in queue 74 is a FIFO queue. All address transactions are stored in the in queue 74 of each subnode 50 (even within the in queue 74 of the subnode 50 which initiates the address transaction). Address transactions are thus presented to caches 18 and processors 16 for snooping in the order they occur upon address bus 58. The order that transactions occur upon address bus 58 is the order for SMP node 12A. However, the complete system is expected to have one global memory order. This ordering expectation creates a problem in both the NUMA and COMA architectures employed by computer system 10, since the global order may need to be established by the order of operations upon network 14. If two nodes perform a transaction to an address, the order that the corresponding coherency operations occur at the home node for the address defines the order of the two transactions as seen within each node. For example, if two write transactions are performed to the same address, then the second write operation to arrive at the address' home node should be the second write transaction to complete (i.e. a byte location which is updated by both write transactions stores a value provided by the second write transaction upon completion of both transactions). However, the node which performs the second transaction may actually have the second transaction occur first upon SMP bus 20. Ignore signal 70 allows the second transaction to be transferred to system interface 24 without the remainder of the SMP node 12 reacting to the transaction.
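
The interaction between the FIFO in queues and ignore signal 70 can be modeled roughly as follows. This is a behavioral sketch under stated assumptions: the class names, the needs_remote predicate standing in for the decision made by system interface logic, and the string-valued transactions are all hypothetical, and the network coherency activity itself is not modeled.

```python
from collections import deque


class Subnode:
    def __init__(self):
        self.in_queue = deque()   # FIFO of address transactions to be snooped


class AddressBus:
    """Toy model of address transactions with an ignore signal."""

    def __init__(self, subnodes, needs_remote):
        self.subnodes = subnodes
        self.needs_remote = needs_remote   # assumed stand-in for system interface logic
        self.deferred = []                 # ignored transactions awaiting reissue

    def issue(self, txn, reissue=False):
        # Ignore is asserted when coherency activity is still required;
        # a reissued transaction is assumed to have acquired its access rights.
        ignore = (not reissue) and self.needs_remote(txn)
        if ignore:
            self.deferred.append(txn)      # not stored into any in queue
            return
        for subnode in self.subnodes:      # otherwise every in queue receives it
            subnode.in_queue.append(txn)

    def reissue_deferred(self):
        """Reissue ignored transactions once coherency activity has completed."""
        while self.deferred:
            self.issue(self.deferred.pop(0), reissue=True)


subnodes = [Subnode(), Subnode()]
bus = AddressBus(subnodes, needs_remote=lambda txn: txn.startswith("remote"))
bus.issue("remote write X")   # ignored: requires coherency activity
bus.issue("local read Y")     # completes locally, ahead of the ignored one
before = [list(s.in_queue) for s in subnodes]
bus.reissue_deferred()
after = [list(s.in_queue) for s in subnodes]
print(before)
print(after)
```

Because the ignored transaction never enters an in queue, the later local transaction is not held behind it, and the reissued transaction takes its place in the node's order only once it can complete.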

Therefore, in order to operate effectively with the ordering constraints imposed by the out queue/in queue structure of address controller 52, system interface logic block 62 employs ignore signal 70. When a transaction is presented upon address bus 58 and system interface logic block 62 detects that a remote transaction is to be performed in response to the transaction, logic block 62 asserts the ignore signal 70. Assertion of the ignore signal 70 with respect to a transaction causes address controller 52 to inhibit storage of the transaction into in queues 74. Therefore, other transactions which may occur subsequent to the ignored transaction and which complete locally within SMP node 12A may complete out of order with respect to the ignored transaction without violating the ordering rules of in queue 74. In particular, transactions performed by system interface 24 in response to coherency activity upon network 14 may be performed and completed subsequent to the ignored tran