WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Method and apparatus for delaying victim writes in a switch-based multi-processor system to maintain data coherency    
United States Patent6108752   
Link to this pagehttp://www.wikipatents.com/6108752.html
Inventor(s)VanDoren; Stephen R. (Northborough, MA), Goodwin; Paul M. (Littleton, MA)
AbstractAn architecture and coherency protocol for use in a large SMP computer system includes a hierarchical switch structure which allows for a number of multi-processor nodes to be coupled to the switch to operate at an optimum performance. Within each multi-processor node, a simultaneous buffering system is provided that allows all of the processors of the multi-processor node to operate at peak performance. A memory is shared among the nodes, with a portion of the memory resident at each of the multi-processor nodes. Each of the multi-processor nodes includes a number of elements for maintaining memory coherency, including a victim cache, a directory and a transaction tracking table. The victim cache allows for selective updates of victim data destined for memory stored at a remote multi-processing node, thereby improving the overall performance of memory. Memory performance is additionally improved by including, at each memory, a delayed write buffer which is used in conjunction with the directory to identify victims that are to be written to memory. An arb bus coupled to the output of the directory of each node provides a central ordering point for all messages that are transferred through the SMP. The messages comprise a number of transactions, and each transaction is assigned to a number of different virtual channels, depending upon the processing stage of the message. The use of virtual channels thus helps to maintain data coherency by providing a straightforward method for maintaining system order. Using the virtual channels and the directory structure, cache coherency problems that would previously result in deadlock may be avoided.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 6108752
Method and apparatus for delaying victim writes in a switch-based
     multi-processor system to maintain data coherency - US Patent 6108752 Drawing
Method and apparatus for delaying victim writes in a switch-based multi-processor system to maintain data coherency
Inventor     VanDoren; Stephen R. (Northborough, MA) , Goodwin; Paul M. (Littleton, MA)
Owner/Assignee     Compaq Computer Corporation (Houston, TX)
Patent assignment
All assignments
Publication Date     August 22, 2000
Application Number     08/957,566
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     October 24, 1997
US Classification     711/117 710/19 710/21 710/39 710/52 710/55 710/59 711/143 711/144 711/152 711/156 711/168
Int'l Classification    
Examiner     Nguyen; Hiep T.
Assistant Examiner    
Attorney/Law Firm     Cesari and McKenna, LLP
Address
Parent Case    
Priority Data    
USPTO Field of Search     711/121 711/117 711/118 711/141 711/142 711/143 711/144 711/145 711/146 711/148 711/152 711/153 711/156 711/168 710/131 710/19 710/21 710/39 710/52 710/55 710/59
Patent Tags     delaying victim writes switch-based multi-processor maintain data coherency
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5905998
Ebrahim et al.

May,1999

[0 after 0 votes]
5737757
Hassoun et al.

Apr,1998

[0 after 0 votes]
5608893
Slingwine et al.

Mar,1997

[0 after 0 votes]
5584004
Aimoto et al.

Dec,1996

[0 after 0 votes]
5579504
Callander et al.

Nov,1996

[0 after 0 votes]
5551005
Sarangdhar et al.

Aug,1996

[0 after 0 votes]
5530933
Frink et al.

Jun,1996

[0 after 0 votes]
5490261
Bean et al.

Feb,1996

[0 after 0 votes]
5361342
Tone

Nov,1994

[0 after 0 votes]
5313609
Baylor et al.

May,1994

[0 after 0 votes]
5303362
Butts, Jr. et al.

Apr,1994

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What we claim is:

1. A memory subsystem comprising:

a first buffer comprising a plurality of entries corresponding in number to a subset of data blocks in a coupled memory device, each of the entries for storing status information for each of the associated data blocks;

a second buffer coupled to the first buffer, the second buffer comprising at least one entry comprising a data block field for storing a data block, an address field for indicating the address of the data block at the coupled memory device and a flag field for indicating whether the data block stored in the associated data field should be written to the coupled memory device, where a state of the flag field is determined in response to status information stored in the first buffer at the entry corresponding to the data block.

2. The memory subsystem according to claim 1, wherein the first buffer is accessed in parallel with a write operation of the data block to the second buffer and wherein the status information accessed from the first buffer is used for selectively writing the data block in the second buffer to the coupled memory device.

3. The memory subsystem according to claim 1, wherein requests issued to the coupled memory device are serviced by the second buffer responsive to the address and flag fields of the at least one entry of the second buffer.

4. The memory subsystem according to claim 1, wherein each of the entries of the first buffer includes an owner field identifying an owner of the associated data block, and wherein the flag field in the second buffer is set in response to the owner information in the first buffer.

5. The memory subsystem according to claim 4, wherein the data block in the data block field of the second buffer is written to the coupled memory device when the flag field in the second buffer is set.

6. The memory subsystem according to claim 5 wherein the flag field in the second buffer is set responsive to a source of the data in the data field of the second buffer corresponding to the owner identified in the owner field of the associated entry of the first buffer.

7. The memory subsystem according to claim 1, further comprising:

means for selectively servicing a request to the coupled memory device by either the second buffer or the coupled memory device including:

means for comparing an address of the request to the address field of the at least one entry in the second buffer;

means, responsive to a match between the address of the request and the address in the address field of the at least one entry of the second buffer, and further responsive to a set flag in the associated flag field, for servicing the request with the data block from the data field of the second buffer; and

means, responsive to either a mismatch between the address of the request and the address in the address field or a reset flag in the associated flag field, for servicing the request with the data block from the coupled memory device.

8. The memory subsystem according to claim 1 further comprising:

a cache for temporary storage of a plurality of data blocks from the coupled memory device, wherein each of the plurality of the entries of the first buffer correspond to one of the plurality of data blocks stored in the cache memory, and wherein the status information of each entry further comprises:

address information identifying the address of the data block at the cache;

a validity flag indicating that the copy of the associated data block in the coupled cache memory device is current; and

wherein the flag field of the at least one entry of the second buffer is set responsive to the address information and validity flag of the corresponding data block in the first buffer.

9. The memory subsystem according to claim 8, wherein the flag field for the data block is set in the at least one entry of the second buffer in response to the validity flag of the entry of the first buffer associated with the data block indicating that the associated data is current.

10. A memory subsystem comprising:

a first buffer comprising a plurality of entries corresponding in number to a first subset of data blocks in a coupled memory device, each of the entries for storing first status information for each of the associated data blocks;

a second buffer comprising a plurality of entries corresponding in number to a second subset of data blocks in a coupled memory device, each of the entries for storing second status information for each of the associated data blocks; and

a third buffer coupled to the first buffer and the second buffer, the third buffer comprising at least one entry comprising a data block field for storing a data block, an address field for indicating the address of the data block at the coupled memory device and a flag field for indicating whether the data block stored in the associated data field should be written to the coupled memory device, where a state of the flag field is determined in response to the first and second status information stored in the first buffer and the second buffer, at respective entries corresponding to the data block.

11. The memory subsystem according to claim 10, wherein the first and second buffers are accessed in parallel with a write operation of the data block in the third buffer, and wherein the first and second status information accessed from the first and second buffers respectively are used for selectively writing the data block in the third buffer to the coupled memory device.

12. The memory subsystem according to claim 10, wherein requests issued to the coupled memory device are serviced by the third buffer responsive to the address and flag fields of the at least one entry of the second buffer.

13. The memory subsystem according to claim 10 further comprising:

a cache for temporary storage of a plurality of data blocks from the coupled memory device, wherein each of the plurality of the entries of the second buffer correspond to one of the plurality data blocks stored in the cache memory, and wherein the status information of each entry further comprises:

address information identifying the address of data block at the coupled cache memory device; and

a validity flag indicating that the copy of the associated data block in the coupled cache memory device is current.

14. The memory subsystem according to claim 13, wherein each of the entries of the first buffer includes an owner field identifying an owner of the associated data block, and wherein the flag field in the third buffer is set responsive to the address information and validity flag of the corresponding data block in the second buffer if the data block is one of the plurality of data blocks stored in the cache, or responsive to a source of the data in the data field of the second buffer corresponding to the owner identified in the owner field of the associated entry of the first buffer if the data block is not one of the plurality of data blocks stored in the cache.

15. A method of writing a modified block of data from a cache device to a coupled memory device in response to an instruction comprising the steps of:

initiating a retrieval of status information associated with the modified block of data from a first buffer, the first buffer having a plurality of entries;

storing the modified block of data in a second buffer, the second buffer comprising at least one entry, with each of the at least one entries comprising a flag indicating whether the modified block of data at the associated entry is to be written to the coupled memory device, the step of storing including the step of setting the flag associated with the modified block of data in the buffer to indicate that the modified data block should not be written to the coupled memory device; and

writing to the coupled memory device any of the at least one entries in the second buffer having the associated flag indicating that the data entry is to be written to the coupled memory device, wherein the steps of initiating, storing and writing are performed in parallel.

16. The method according to claim 15, further comprising the step of:

updating the flag of the entry in the second buffer associated with the modified data block in response to the status information retrieved from the first buffer.

17. The method according to claim 15 wherein each of the at least one entries in the second buffer further includes an address field, and wherein the method further comprises the step of servicing a request issued to the coupled memory device by the second buffer responsive to the address and flag field of one of the at least one entries in the second buffer, wherein the address of the one of the at least one entries corresponds to the address of the request.

18. The method according to claim 15, wherein each of the plurality of entries of the first buffer includes an owner field identifying an owner of the associated data block, and wherein the flag field in the second buffer is set in response to the owner information in the first buffer.

19. The memory subsystem according to claim 18 wherein the flag field in the second buffer is set responsive to a source of the modified data the second buffer matching the owner identified in the owner field of status information of the associated entry of the first buffer.

20. The memory subsystem according to claim 15 wherein each of the plurality of the entries of the first buffer correspond to one of a plurality data blocks stored in the cache memory, and wherein the status information of each entry in the first buffer further comprises:

address information identifying the address of the data block at the coupled memory device;

a validity flag indicating that the associated data block in the coupled cache memory device is current; and

wherein the flag field of the at least one entry of the second buffer is set responsive to the address information and validity flag of the corresponding entry in the first buffer.

21. The memory subsystem according to claim 20, wherein the flag field for the data block is set in the at least one entry of the second buffer in response to the validity flag of the entry of the first buffer associated with the data block indicating that the associated data is current.

22. A method of writing a modified block of data from a cache device having a plurality of entries to a coupled memory device in response to an instruction comprising the steps of:

initiating a retrieval of first status information associated with the modified block of data from a first buffer having a first plurality of entries;

initiating a retrieval of second status information associated with the modified block of data from a second buffer, the second buffer having a second plurality of entries corresponding in number to the plurality of entries of the cache device;

storing the modified block of data in a third buffer, the third buffer comprising at least one entry, with each of the at least one entries comprising a flag indicating whether the modified block of data at the associated entry is to be written to the coupled memory device, the step of storing including the step of setting the flag associated with the modified block of data in the buffer to indicate that the modified data block should not be written to the coupled memory device; and

writing to the coupled memory device any of the at least one entries in the third buffer having the associated flag indicating that the data entry is to be written to the coupled memory device, wherein the steps of initiating a retrieval of first status information, initiating a retrieval of second status information, storing the modified block and writing any of the at least one entries are performed in parallel.

23. The method according to claim 22, further comprising the step of:

updating the flag of the entry in the third buffer associated with the modified data block in response to the first and second status information.

24. The method according to claim 23 wherein each of the at least one entries in the third buffer further includes an address field, and wherein the method further comprises the step of servicing a request issued to the coupled memory device by the third buffer responsive to the address and flag field of one of the at least one entries in the third buffer, wherein the address of the one of the at least one entries corresponds to the address of the request.

25. The method according to claim 24, wherein each of the plurality of entries of the first buffer includes an owner field identifying an owner of the associated data block.

26. The memory subsystem according to claim 25 wherein the status information of each of the plurality of entries in the second buffer further comprises:

address information identifying the address of a data block associated with the entry at the coupled memory device;

a validity flag indicating that the associated data block in the coupled cache memory device is current.

27. The method according to claim 26, wherein the flag field of the at least one entry of the third buffer is set responsive to the address information and validity flag of the corresponding entry in the second buffer if the second status information is retrieved before the first status information, and wherein the flag field in the second buffer is set responsive to a source of the modified block of data matching the owner identified in the owner field of status information of the associated entry of the first buffer if the first status information is retrieved before the second status information.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

This invention relates in general to the field of computer architecture and more specifically to distributed shared-memory multi-processing systems.

BACKGROUND OF THE INVENTION

As it is known in the art, symmetric multi-processing computers allow for high performance application processing. Typical symmetric multi-processing computer systems include a number of processors coupled together by a bus. One characteristic of a symmetric multi-processing system is that memory space is shared among all of the processors. One or more operating systems are stored in memory and control the distribution of processes or threads among the various processors.

By allowing many different processors to execute different processes or threads simultaneously, the execution speed of a given application may be greatly increased. In theory the performance of a system could be improved by simply increasing the number of processors in the multi-processing system. In reality, the continued addition of processors past a certain saturation point serves merely to increase communication bottlenecks and thereby limit the overall performance of the system.

For example, referring now to FIG. 1A, a typical prior art multi-processor system 2 including eight processors coupled together via a common interconnect bus is shown. During operation, each of the processors 3a-3h communicate with the other processors and with a shared memory 4 via a shared interconnect bus 5. The symmetric multi-processing arrangement of FIG. 1A has been adequate for multi-processors built to date. However, with the advent of faster microprocessors, a common shared interconnect is not capable of sufficiently exercising the full performance potential of the coupled microprocessors. Because the only communication link between the processors and memory is the shared bus, the bus may rapidly become saturated with requests from the processors, thereby increasing delays as each processor attempts to gain access to the system bus. Therefore, although the processors may be able to operate at enhanced speeds, the limiting factor in terms of performance is the available bandwidth of the system bus.

Communication bandwidth is a key factor in the performance of SMP systems. Since bandwidth may not be uniform between pairs or subsets of nodes in the SMP system, the industry uses a "bisection bandwidth" measurement for determining the communication bandwidth of an SMP system. Bisection bandwidth is determined in the following manner. All possible ways of partitioning the system into two portions of equal compute power (equal number of processors) are ascertained. For each partition, the sustainable bandwidth between the two partitions is determined. The minimum of all of the sustainable bandwidths is the bisection bandwidth of the interconnect. The minimum bandwidth between the two partitions indicates the communication bandwidth sustainable by the multiprocessor system in the presence of worst-case communication patterns. Thus, a large bisection bandwidth is desirable.

Several interconnection architectures or "topologies" have been used in the prior art to overcome bus saturation problems. These topologies include meshes, touri, hypercubes and enhanced hypercubes.

As an example, a mesh interconnect is shown as system 7 in FIG. 1B. The major advantage of the mesh network is its simplicity and ease of wiring. Each node is connected to a small number of other neighboring nodes. However, the mesh interconnect has three significant drawbacks. First, messages must on average traverse a large number of nodes to get to their destination, and as a result the communication latency is high. Second, the bisection bandwidth does not scale as well for a mesh topology as it does for other topologies. Finally, because each of the messages may traverse different paths within the mesh, there are no natural ordering points within an SMP system, and therefore the cache coherence protocols required to implement the mesh topology are often quite complex.

The torus, hypercube, and enhanced hypercube topologies are all topologies wherein the nodes are interconnected in various complex arrangements, for example in a torus arrangement or a cube arrangement. The torus, hypercube and enhanced hypercube interconnects are more complex than the mesh interconnect, but offer better latency and bandwidth than the mesh interconnect. However, like the mesh interconnect, the torus, hypercube and enhanced hypercube topologies do not provide natural ordering points, and thus a complex cache coherence protocol must be implemented for each of those systems.

In shared-memory multiprocessor systems, processors typically employ private caches to store data determined likely to be accessed in the future. Since processors may read data from their private cache and may update data in the private cache without writing it back to memory, a mechanism is needed to ensure that the private caches of each of the processors are kept consistent, or coherent. The mechanism that is used to ensure coherency of data in the SMP system is referred to as the cache coherence protocol.

Besides the topology, bandwidth, and latency of the physical interconnect the efficiency of the cache coherence protocol is a key factor in system performance. Cache coherency protocols may introduce latencies, bottlenecks, inefficiencies or complexity in several ways.

The latency of load and store operations is often directly affected by the protocol of the design. For example, in some protocols, a store operation is not considered complete until all invalidate messages have made it to their target processors and acknowledgment messages have made it all the way back to the original processor. The latency of stores here is much higher than a protocol wherein the original processor does not have to wait for the Invalidates to make it to their destination. Further, the acknowledgments consume a significant fraction of the system bandwidth.

Bottlenecks are often introduced due to high occupancy of controllers. "Occupancy" is a term of art; it indicates the amount of time a controller is unavailable after it receives a request. In some protocols, when a directly controller receives a request corresponding to a memory location, it becomes unavailable for other requests to the same memory location until certain acknowledgments corresponding to the former command arrive at the directory. If the controller receives conflicting requests at a higher than average rate, it becomes a bottleneck.

The design of the cache coherence protocol also affects hardware complexity. For instance, some protocols introduce deadlock and fairness problems, which are then addressed with additional mechanisms. This results in added hardware complexity.

It is desirable to provide a symmetric multiprocessing system that minimizes the latency of operations, provides large communication bandwidth, provides low controller occupancy, and can scale to a large number of processors.

SUMMARY OF THE INVENTION

The present invention is advantageously employed in a system where multiple multi-processing nodes are coupled together via a switch. A memory control subsystem includes a delayed write buffer, coupled to a memory device, for temporary storage of data that is to be conditionally written to the memory device. The delayed write buffer includes at least one entry storing a cache line size data block and a flag, where the flag is set when the data block should be written to memory. The memory control system also includes a buffer for storing status information about each of the data blocks of the coupled memory device. The flag is set depending upon the status information (including owner information) associated with the data block in the delayed write buffer. While the data block is stored in the delayed write buffer, the status information may be accessed to determine whether the data block that was written into the delayed write buffer is still owned by the source seeking to write the data block to memory. If it was not owned, it is not written to memory. If it is owned, it is written to memory when the next data block is received at the delayed write buffer. References to the data block that are received while it is stored in the first buffer may be serviced by the first buffer. With such an arrangement, a memory write operation is capable of being performed each time a block of data is received by the delayed write buffer. By accessing the status information while the data block is stored in the delayed write buffer, in parallel with writing the data block into the delayed write buffer and writing the previous data block from the delayed write buffer into memory, memory cycles are not wasted pending status accesses.

According to one aspect of the invention, a memory subsystem includes a first buffer comprising a plurality of entries corresponding in number to a subset of data blocks in a coupled memory device. Each of the entries stores status information for each of the associated data blocks. The subsystem further includes and a second buffer coupled to the first buffer, the second buffer comprising at least one entry comprising a data block field for storing a data block, an address field for indicating the address of the data block at the coupled memory device and a flag field for indicating whether the data block stored in the associated data field should be written to the coupled memory device. A state of the flag field is determined in response to status information stored in the first buffer at the entry corresponding to the data block.

According to another aspect of the invention, a memory subsystem includes a first buffer comprising a plurality of entries corresponding in number to a first subset of data blocks in a coupled memory device. Each of the entries for storing first status information for each of the associated data blocks. The memory subsystem further includes a second buffer comprising a plurality of entries corresponding in number to a second subset of data blocks in a coupled memory device, where each of the entries for store second status information for each of the associated data blocks. A third buffer is coupled to the first buffer and the second buffer. The third buffer includes at least one entry comprising a data block field for storing a data block, an address field for indicating the address of the data block at the coupled memory device and a flag field for indicating whether the data block stored in the associated data field should be written to the coupled memory device. A state of the flag field is determined in response to the first and second status information stored in the first buffer and the second buffer, at respective entries corresponding to the data block.

According to a further aspect of the invention, a method of writing a modified block of data from a cache device to a coupled memory device in response to an instruction includes the steps of initiating a retrieval of status information associated with the modified block of data from a first buffer, the first buffer having a plurality of entries, storing the modified block of data in a second buffer, the second buffer comprising at least one entry, with each of the at least one entries comprising a flag indicating whether the modified block of data at the associated entry is to be written to the coupled memory device, the step of storing including the step of setting the flag associated with the modified block of data in the buffer to unwriteable, and writing any of the at least one entries in the second buffer having the associated flag indicating that the data entry is to be written to the coupled memory device to the coupled memory device, wherein the steps of initiating, storing and writing are performed in parallel.

BRIEF DESCRIPTION OF THE DRAWINGS

The above-mentioned and other features of the invention will now become more apparent by reference to the following description taken in connection with the accompanying drawings in which:

FIGS. 1A-1B are block diagrams of two prior art symmetric multi-processor computer systems;

FIG. 2 is a block diagram of one embodiment of a multi-processor computer node of one embodiment of the invention comprising a switch;

FIG. 3 is a block diagram illustrating the data path of the switch of FIG. 1 comprising a number of Simultaneous Insertion Buffers;

FIG. 4A is a block diagram of one embodiment of one of the Simultaneous Insertion Buffers of FIG. 3;

FIG. 4B is a block diagram of one implementation of logic for controlling one of the Simultaneous Input Buffers of FIG. 4;

FIG. 5 is a block diagram of a second embodiment of one of the Simultaneous Insertion Buffers of FIG. 3;

FIG. 6 is a block diagram of the multi-processor computer node of FIG. 2, augmented for connection into a larger network of similar nodes;

FIG. 7A is one embodiment of an SMP system implemented using multiple nodes similar to the multi-processor node of FIG. 6;

FIG. 7B is another embodiment of an SMP system implemented using multiple nodes similar to the multi-processor node of FIG. 6;

FIG. 8 is a block diagram of a global port of FIG. 6;

FIG. 9 illustrates an entry in a directory of the multi-processor node of FIG. 6;

FIG. 10 illustrates a Transaction Tracking Table (TTT) for use in the global port of FIG. 8;

FIG. 11 is a block diagram of a hierarchical switch for coupling the multiple nodes in FIG. 7A;

FIG. 12A is a block diagram of one embodiment of interconnect logic for the hierarchical switch that eliminates deadlock;

FIG. 12B is a flow diagram of the operation of the interconnect logic of FIG. 12A;

FIG. 13 is a flow diagram of the method used in the interconnect logic of FIG. 12A to assert flow control to stop data being transmitted from one of the multi-processing nodes;

FIG. 14 is a timing diagram illustrating the transfer of address and data packets on the busses to and from the hierarchical switch;

FIG. 15 is a block diagram of one embodiment of buffer logic for maintaining order at the hierarchical switch;

FIG. 16 is a block diagram of another embodiment of buffer logic for maintaining order for the hierarchical switch;

FIG. 17 is a flow diagram illustrating one method of operating the buffer logic of FIG. 16;

FIG. 18 is a block diagram of another embodiment of buffer logic for maintaining order at the hierarchical switch;

FIG. 19 is a table illustrating the translation of processor instructions to network instructions for use in the SMP of FIGS. 7A or 7B;

FIGS. 20A-20J illustrate a number of communication flows for transferring packets between nodes in the SMP of FIGS. 7A or 7B;

FIG. 21 is a block diagram illustrating the layout of a memory module for use in the multi-processor system of FIGS. 2 or 6;

FIG. 22 is a timing diagram illustrating the control logic used by the memory module of FIG. 21 for delayed write operations;

FIG. 23 is a flow diagram illustrating the use of discrete transactions that are mapped to channels for maintaining cache coherency in one embodiment of the invention;

FIG. 24 is a block diagram illustrating one implementation of a shared queue structure for handling virtual channels in the SMP of FIGS. 7A or 7B;

FIG. 25 is a block diagram illustrating an implementation of individual channel buffering in the nodes and hierarchical switches of the SMP of FIGS. 7A or 7B;

FIG. 26 is a block diagram for illustrating the problems that may arise if some amount of ordering between virtual channels in not maintained;

FIGS. 27A-27C are block diagrams illustrating the flow and ordering constraints on the Q1 channel for providing coherent communication in the SMP of FIGS. 7A or 7B;

FIGS. 28A and 28B are a block diagram illustrating the ambiguity problems that arise because of the coarse vector presence bits of the directory entries of the SMP of FIGS. 7A and 7B;

FIG. 29 is a block diagram illustrating the method used to prevent data ambiguity from arising as a result of the problem described in FIG. 28;

FIG. 30 is a block diagram for illustrating a coherency issue that arises from packets on different channels being received out of sequence;

FIG. 31 is a block diagram illustrating the use of Fill Markers for preventing the coherency problem described in FIG. 29;

FIG. 32 is an entry in the TTT reflecting the status of an instruction during the flow described with regard to FIG. 31;

FIGS. 33A-33B are block diagrams illustrating the operation of Change to Dirty commands in the SMP system;

FIG. 34 is a block diagram illustrating the use of Shadow commands for remedying the problem described with regard to FIG. 33;

FIG. 35 is an entry in the TTT reflecting the status of an instruction during the flow described with regard to FIG. 34; and

FIG. 36 is a flow diagram illustrating permissible sequential ordering of instructions in the example described in FIG. 35.

DESCRIPTION OF THE PREFERRED EMBODIMENT

According to one embodiment of the invention, a hierarchical Symmetric Multi-Processing (SMP) system includes a number of SMP nodes coupled together via a high performance switch. Thus, each of the SMP nodes act as a building block in the SMP system. Below, the components and operation of one SMP node building block is first described, followed by a description of the operation of the SMP system and subsequently a description of a cache coherence protocol that is used to maintain memory coherency in the large SMP system.

SMP Node Building Block

Referring now to FIG. 2, a multi-processor node 10 includes four processor modules 12a, 12b, 12c, and 12d. Each processor module comprises a central processing unit (CPU). In a preferred embodiment, Alpha.RTM. 21264 processor chips, manufactured by Digital Equipment Corporation.RTM. are used, although other types of processor chips capable of supporting the below described coherency protocol may alternatively be used.

Multi-processor node 10 includes a memory 13, which may include a number of memory modules 13a-13d. The memory may provide 32 GBytes of storage capacity, with each of the 4 memory modules storing 8 Gigabytes. Each of the memory modules is apportioned into a number of blocks of memory, where

each block may include, for example 64 bytes of data. Data is generally retrieved from memory in blocks.

In addition, multi-processing node 10 includes an I/O processor (IOP) module 14 for controlling transfer of data between external devices (not shown) and the multi-processor node 10 via a coupled I/O bus 14a. In one embodiment of the invention, the I/O bus may operate according to the Peripheral Computer Interconnect (PCI) protocol. The IOP 14 includes an IOP cache 14c and an IOP tag store 14b. The IOP cache 14c provides temporary storage for data from memory 13 that is transferred to external devices on the PCI bus 14a. The IOP tag store 14b is a 64 entry tag store for storing coherency information for data being moved between external devices, processors and memory.

The coherency of data stored in the memory 13 of the multi-processor node is maintained by means of a Duplicate Tag store (DTAG) 20. The DTAG 20 is shared by all of the processors 12a-12d, and is apportioned into 4 banks, where each bank is dedicated to storing status information corresponding to data used by an associated one of the processors.

The DTAG, Memory and IOP are coupled to a logical bus referred to as the Arb bus 17. Memory block requests issued by the processor are routed via the local switch 15 to the Arb bus 17. The DTAG 20 and IOP 14 look up the state of the block in the processors' and IOP's caches and atomically update their state for the memory block. The Arb bus 17 acts as a serialization point for all memory references. The order in which memory request appear on the Arb bus is the order in which processors perceive the results of the requests.

The processor modules 12a-12d, memory modules 13a-13d and IOP module 14 are coupled together via a local, 9 port switch 15. Each of the interfacing modules 12a-12d, 13a-13d and 14 are connected to the local switch by means of a like number of bi-directional, clock forwarded data links 16a-16i. In one embodiment, each of the data links forwards 64 bits of data and 8 bits of error correcting code (ECC) one each edge of a system clock operating at a rate of 150 MHZ. Thus, the data bandwidth of each of the data links 16a-16i is 2.4 Gigabytes/sec.

Local switch 15 includes an Quad Switch Address control chip (QSA chip) 18 and a Quad Switch data slice chip (QSD chip) 19. QSA chip 18 includes an arbiter (QS Arb) 11 for controlling address paths between the processor modules, IOP, and memory. In addition, QSA chip 18 provides control to the QSD chip 19 to control the flow of data through the local switch 15 as described below.

QSD chip 19 provides a switch interconnect for all data paths between the processor modules, memory modules and IOP. Although not shown in FIG. 2, as will be described below, if the multi-processor node 10 were coupled to other multi-processor nodes via a global port, the QSD and OSA would additionally provide a switch interconnect for the global port. Each of the processors may request data from one of the available resources, such as the memory devices 13a-13d, other processors 12a-12d, IOP 14 or alternatively resources in other multi-processor nodes via the global port. Thus, the local switch 15 should be able to accommodate simultaneous input from a variety of resources while maintaining the high bus bandwidth of 2.4 GBytes.

The local switch is able to handle mulitple concurrent transactions. Since each transaction typically uses multiple resources (such as memory banks, datapaths, queues), the control functions of the local switch can be very complex. For instance, a transaction may require a memory bank to be available in stage 0 of the transaction, the datapath from memory bank to processor port be available in stage 1, and the datapath from processor port to processor be available in stage 2. The local switch arbiter (QSA Arb 11 in the QSA 18) arbitrates among requests in such a manner that once a transaction is initiated, resources required by a transaction in each stage are available as required.

More significantly, the arbiter guarantees that all requests and processors get fair access to the resources by ensuring that particular requests do not fail to win arbitration for a long time (potentially indefinitely) while others make progress. For instance, consider a transaction T that requires three resources A, B, and C. Transaction T may not win arbitration until all three resources are guaranteed to be available in the appropriate stages of the transaction. If the artiber bases its decision only on the availability of resources, then it is possible that T may not succeed for a long time while other transactions wh