WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Distributed data dependency stall mechanism    
United States Patent6249846   
Link to this pagehttp://www.wikipatents.com/6249846.html
Inventor(s)Van Doren; Stephen (Northborough, MA); Razdan; Rahul (Princeton, MA)
AbstractA method and apparatus for preventing system wide data dependent stalls is provided. Requests that reach the top of a probe queue and which target data that is not contained in an attached cache memory, are stalled until the data is filled into the appropriate location in cache memory. Only the associated central processor unit's probe queue is stalled and not the entire system. Accordingly, the present invention allows a system to chain together two or more concurrent operations for the same data block without adversely affecting system performance.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 6249846
Distributed data dependency stall mechanism - US Patent 6249846 Drawing
Distributed data dependency stall mechanism
Inventor     Van Doren; Stephen (Northborough, MA); Razdan; Rahul (Princeton, MA)
Owner/Assignee     Compaq Computer Corporation (Houston, TX)
Patent assignment
All assignments
Publication Date     June 19, 2001
Application Number     09/547,163
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     April 11, 2000
US Classification     711/144 710/39 710/40 711/121 711/143 711/148
Int'l Classification     G06F 012/02
Examiner     Nguyen; Hiep T.
Assistant Examiner    
Attorney/Law Firm     Hamilton, Brook, Smith & Reynolds, P.C.
Address
Parent Case     RELATED APPLICATION This application is a continuation of application Ser. No. 08/957,129 filed Oct. 24, 1997, now U.S. Pat. No. 6,085,294 the entire teachings of which are incorporated herein by reference.
Priority Data    
USPTO Field of Search     711/144 711/143 711/121 711/148 710/39 710/40
Patent Tags     distributed data dependency stall mechanism
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
6085294
Van Doren

Jul,2000

[0 after 0 votes]
5881253
Seeman
710/310
Mar,1999

[0 after 0 votes]
5530933
Frink
711/141
Jun,1996

[0 after 0 votes]
5522058
Iwasa
711/145
May,1996

[0 after 0 votes]
5404483
Stamm
711/144
Apr,1995

[0 after 0 votes]
5325510
Frazier
711/118
Jun,1994

[0 after 0 votes]
5222224
Flynn
711/144
Jun,1993

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A method for distributing data dependency stalls in a multiprocessor computer system, said method comprising the steps of:

a) maintaining a plurality of entries in a miss address file coupled to a first one of a plurality of central processing units, each of the entries associated with one of a first set of commands issued by the first central processing unit;

b) stalling a probe queue storing a first one of a set of probe messages coupled to the first central processing unit, responsive to one of the plurality of entries in the miss address file associated with a data element targeted by the first one of the set of probe messages; and

c) restarting the probe queue storing the first one of the set of probe messages responsive to a change in one of the entries in the miss address file associated with the data element targeted by the first probe message.

2. The method of claim 1 wherein further the probe queue is one of a plurality of probe queues, each one of the plurality of probe queues coupled to a different one of the plurality of central processing units and responsive to a different one of the entries in the miss address file, and

wherein a central arbitration circuit coupled to the plurality of central processing units monitors a state of each of the plurality of probe queues, the state indicating whether each of said plurality of probe queues is stalled, and stalling the central arbitration circuit responsive to the state of said plurality of probe queues.

3. A method for distributing data dependency stalls in a multiprocessor computer system, comprising the steps of:

stalling a probe queue having a first one of a central processor specific ordered set of probe messages targeting particular ones of a plurality of data elements when the central processing unit associated with the first one of the ordered set of probe messages has an outstanding reference to one of the plurality of data elements; and

restarting the probe queue having the first one of the ordered set of probe messages when the referenced data element of the plurality of data elements is resolved in the specific central processor to which it the probe message is associated.

4. The method of claim 3 wherein the outstanding reference is a read command and the resolution is the return of a copy of the data element.

5. The method of claim 3 wherein the outstanding reference is a read with modify intent command.

6. An apparatus for distributing data dependency stalls in a multiprocessor computer system, comprising:

means for stalling a probe queue having a first one of a central processor specific ordered set of probe messages targeting particular ones of a plurality of data elements when the central processing unit associated with the first one of the ordered set of probe messages has an outstanding reference to one of the plurality of data elements; and

means for restarting the probe queue having the first one of the ordered set of probe messages when the referenced data element of the plurality of data elements is resolved in the specific central processor to which it the probe message is associated.

7. The apparatus of claim 6 wherein the outstanding reference is a read command and the resolution is the return of a copy of the data element.

8. The apparatus of claim 6 wherein the outstanding reference is a read with intent to modify command.
 Description Submit all comments and votes
 


BACKGROUND OF THE INVENTION

This invention relates generally to computer systems and more specifically to the flow control of subsequent references to data elements in computer systems.

As it is known in the art, a multiprocessing computer system includes a plurality of central processing units (CPUs), a main memory and system control logic. Each CPU typically includes a cache for storing data elements that are accessed most frequently by the CPU. The system control logic provides a communication interconnect for data and commands sent between the CPUs and between the CPUs and main memory. The system control logic often includes a duplicate tag store and an arbitration circuit. The arbitration circuit produces a serial stream of command references which are derived from commands from all CPUs and is applied to the duplicate tag store and main memory. The duplicate tag store holds status information pertaining to data stored in the caches coupled to each of the CPUs. The duplicate tag store is coupled with the arbitration circuit so that it may operate on the serial stream of commands. It is therefore implemented remote from the CPUs.

Each CPU may issue a variety of commands to the system control logic dependent upon the current cache status of a given data element and the operation the CPU needs to perform on that data element. If a CPU needs to access a copy of a data element that is not already in its cache, it issues a "readmiss" command to the system control logic. That command will retrieve an up-to-date copy of the requested data element and store it in the CPU's cache. The associated status information will indicate that the data is in an unmodified state. If the CPU needs to modify a data element that is not already in its cache, it issues a "read-miss-modify" command to the system control logic. That command will retrieve an up-to-date copy of the requested data element and store it in the CPU's cache. The associated status information for this data block will indicate that the data is in an exclusive, modified state. When a data element is in this exclusive modified state, it is considered the "most up to date" copy of the data element in the system.

The system control logic receives commands from a plurality of CPUs. The system control logic includes an arbitration circuit through which these commands arbitrate for access to the system's duplicate tag store and main memory resources. The output stage of this arbitration circuit, referred to as the "system serialization point", produces a serial stream of CPU commands which are issued directly to the duplicate tag and the main memory. For each command in this serial stream, the system control logic performs a duplicate tag store lookup operation. This operation returns the cache status for each CPU, for the specific data element referenced by the command.

Specifically, the lookup operation will return status information indicating which CPUs have copies of the referenced data element and which of these copies is the most up-to-date version of the data element. Therefore, if memory does not have the most up to date version of the data in the system, the duplicate tag store will indicate which CPU does. When the system is processing a readmiss command or a read-miss-modify command it uses this information to determine whether to fetch data from main memory or another CPU. If it must fetch data from another CPU, it does so by issuing a message referred to as a "forwarded-read" or a "probe read" to that other CPU. Probe messages like the forwarded-read are issued to their target CPUs through a set of "probe queues" in the system control logic. Each CPU is associated with one probe queue from that set of probe queues.

The system control logic also executes a duplicate tag store update operation for each command in the serial stream. When the system control logic is processing read-miss or read-miss-modify commands, it updates the duplicate tag store state of the issuing CPU to indicate that the referenced block is now an element of the issuing CPU's cache. In response to a read-miss-modify command the system control logic also updates the duplicate tag to indicate that the copy of the data element in the issuing CPU's cache is the most up-to-date copy of the element in the system.

When the arbitration circuit of the system control logic issues a command to the duplicate tag store, it simultaneously issues the same command to the main memory of the computer system. If the command is a readmiss command or a read-miss-modify command and the duplicate tag indicates that the most up-to-date copy of the data element is in memory, then the system control logic will return a copy of the data element from main memory to the requesting CPU via a "fill" message. Similarly, if the duplicate tag indicates that the most up-to-date copy of the data element is in another CPU's cache, then the system control logic will return a copy of the data element from the other CPU to the requesting CPU, also via a fill message.

Readmiss and read-miss-modify commands that are serviced from data stored in another CPU, require the issuance and servicing of a probe read message. Because of these added operations it can take a longer amount of time to return a fill message for a readmiss or read-miss-modify command serviced from another CPU's cache than it does for a readmiss or read-miss-modify command that is serviced from main memory. All fill messages are returned to their issuing CPUs via a set of "fill queues" in the system control logic. Each CPU is associated with one fill queue from that set.

Since fill messages are returned to the issuing CPU with variable timing and since the fill queue and the probe queue associated with a given CPU operate independent from each other, it is possible for a probe message to reach the top of a CPU's probe queue before a fill message that was generated in response to a command issued to the system serialization point before the command that caused generation of the probe.

In such a computer system it is possible that a first CPU issues a read-miss-modify command to the system control logic, concurrently with a readmiss or read-miss-modify command issued from a second CPU that references that same data element, and wherein the most up-to-date copy of the data element resides in a third CPU's cache. If the read-miss-modify command from the first CPU is issued to the system serialization point before the readmiss command from the second CPU, then the duplicate tag store lookup for the read-miss-modify command from the first CPU will cause a first probe read message to be issued to the third CPU. The system control logic then updates the duplicate tag store to indicate that the first CPU now has the most up-to-date copy of the data in the system. When the arbitration circuit in the system control logic issues the second CPU's readmiss command, the associated duplicate tag store lookup will detect the duplicate tag store update from the first CPU's read-miss-modify command and a second probe read message will be issued to the first CPU. This second probe read message may reach the top of the first CPU's probe queue before the fill message associated with the first probe read message reaches that same CPU. Since the fill message associated with the first probe read contains the data required by the second probe read, the second probe read cannot be serviced.

Prior art systems have resolved this issue through a variety of means. Many computer systems, such as the AlphaServer 8000 series of computers manufactured by Digital Equipment Corporation, include arbitration circuits in the system control logic that block the issuance of the readmiss command from the second CPU until the fill message for the first CPU has reached the first CPU, or until that fill message is guaranteed to reach the first CPU before the second probe read message reaches the top of the probe queue. Implementations of such "arbitration blocking" mechanisms are either complex, e.g. where the mechanism blocks only references to a given data element, or present limitations to system performance, e.g. where the mechanism blocks access to an entire region of memory.

SUMMARY OF THE INVENTION

The invention resides in allowing each probe queue in a multiprocessor computer system to be individually stalled when a probe message, that targets data not yet stored in an associated cache or victim data buffer, reaches the top of that queue. Such an arrangement improves the performance of the computer system by allowing all regions of memory and all probe queues that are not associated with the stall condition, to remain in operation.

More specifically, a multiprocessor computer system utilizing the present invention operates as follows. When a first CPU in the computer system requires an element of a data block not stored in its cache, it issues a readmiss command to the system control logic to retrieve a specified data block. When the command wins arbitration in the system control logic, the entries of the duplicate tag store are compared with the address of the data block to determine if another CPU has a most up-to-date copy stored in its cache. Because the duplicate tag store entries are updated when each command wins arbitration rather than when the command is actually completed, it is possible for the duplicate tag store to indicate that a CPU has a most up-to-date copy of a data element that is still in the process of being retrieved. Therefore, the duplicate tag store could indicate that the cache of a second CPU has a most up-to-date copy of the specified data when, in fact, it is still in the process of retrieving that data. The system control logic will responsively place a probe read message on the second CPU's probe queue. If the data has not been retrieved by the second CPU before the probe message reaches that top of the probe queue, that queue will be stalled. The rest of the computer system will continue to operate as if that queue were not stalled, and therefore the performance of the other CPUs is not hampered.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing features of this invention, as well as the invention itself, may be more fully understood from the following detailed description when read in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of a computer system including multiple central processing units;

FIG. 2 depicts one of the central processing units of FIG. 1;

FIG. 3 depicts a block diagram of several central processing units of the computer system of FIG. 1;

FIG. 4 is a flow diagram of the distributed data dependency stall mechanism implemented by a CPU of the computer system of FIG. 1;

FIG. 5 is a flow diagram of an embodiment of the distributed data dependency stall mechanism implemented by a CPU of the computer system of FIG. 1;

FIG. 6 is a flow diagram of a method for independently deallocating a victim data buffer coupled to a CPU of the computer system of FIG. 1;

FIG. 7 is a block diagram of a computer system which does not include a duplicate tag store;

FIG. 8 depicts a flow diagram of a method for independently deallocating a victim data buffer coupled to a CPU of the computer system of FIG. 7;

FIG. 9 depicts a single processor computer system which does not implement a duplicate tag store;

FIGS. 10A and 10B depict flow diagrams of a separate probe and victim buffer read and release mechanism implemented by a CPU of the computer system of FIG. 1;

FIGS. 11, 11A and 11B depict one of the central processing units of the computer system of FIG. 1, together with a probe counter and a plurality of victim release counters associated with that CPU and included in the address control chip;

FIG. 12 depicts a flow diagram of the operation of the probe counter and the victim release counters of FIG. 11;

FIG. 13 depicts a block diagram of several central processing units of the computer system of FIG. 1 together with probe counters and victim release counters associated with those CPUs and included in the address control chip;

FIG. 14 depicts a flow diagram illustrating a problem solved by clean-victim commands executed on the computer system of FIG. 1;

FIG. 15 depicts a flow diagram of the operation of clean-victim commands executed on the computer system of FIG. 1;

FIG. 16 depicts a further enhancement to the operation of clean-victim commands as implemented on one of the central processing units of the computer system of FIG. 1; and

FIG. 17 depicts a flow diagram of the enhancement to the operation of clean-victim commands as implemented on one of the central processing units of the computer system of FIG. 1.

DESCRIPTION OF A PREFERRED EMBODIMENT

Referring to FIG. 1, a multiprocessor computer system 10 is shown to include four processor modules 11a, 11b, 11c, and 11d, each including a central processing unit (CPU). In the preferred embodiment, Alpha.RTM. 21264 central processing unit chips manufactured by Digital Equipment Corporation.RTM., are used however other types of processor chips capable of supporting the invention may alternatively be used.

The multiprocessor computer system 10 includes a memory 42 which may comprise a number of memory modules 42a-42d, a system control logic 18 and an I/O processor module (IOP) 14. The IOP 14 is coupled to an I/O bus 14a, such as a Peripheral Computer Interconnect (PCI) bus for transferring data to and from the multiprocessor computer system 10 and external devices as set forth in the applicable PCI standards. Associated with the IOP 14 is an IOP tag store 14b, for storing addresses and coherency status information relating to data that is being used by the external devices.

The IOP 14, processor modules 11a-11d and memory modules 42 are coupled to the system control logic 18 by bi-directional data links 16a-16i and address links 20a-20e. The QSD devices 15 of the system control logic 18 provide a switch interconnect for data sent between the processor modules 11a-11d, memory modules 42a-42d and IOP 14. The QSD devices 15 are slaved to the control signal sent from the address control chip 17. FIG. 1 shows four QSD devices included in the system control logic 18, with each QSD device controlling a portion of the data path. Other embodiments may use more QSD devices depending on the width of the data path that is implemented.

The system control logic 18 includes an address control chip (QSA) 17 and data slice chips (QSDs) 15. The address control 17 is a master controller for all command paths to processor modules 11a-11d, and to the IOP module 14. The address control chip 17 provides control signals to the QSD devices 15 for controlling the switch interconnect and the data interconnect between the QSDs and each of the processor modules. The address control chip 17 also includes a central arbitration circuit 60 for determining the order in which processor requests are serialized and receive access to a remote duplicate tag store 23 and to main memory 42. The address control chip 17 serializes commands, such that one per cycle wins arbitration and is asserted on the arbitration (Arb) bus 21.

Address control chip 17 further includes fill queues 80a-80d and probe queues 79a-79d. Each probe and fill queue is associated with one of the processor modules 11a-11d. Each probe and fill queue provides probe messages and fill messages, respectively, in a first in first out (FIFO) manner to the processor module to which it is coupled. A probe message, or simply a probe, is issued by the system control logic in response to a request by a processor module or the IOP 14 to retrieve (probe read message) or change the status of (probe invalidate message) a most recent version of a data element that is stored in another processor's cache memory. For example, as each command wins arbitration in the arbitration circuit 60, the address control chip 17 performs a duplicate tag store lookup operation for the data element targeted by the command. This lookup operation indicates which CPU has a most up-to-date copy of the targeted data element stored in its cache or, alternatively, indicates that main memory 42 contains the most up-to-date copy. If the command that won arbitration is a request to retrieve a copy of the targeted data (a readmiss command) the address control chip 17 uses the result of the duplicate tag store lookup operation to determine whether to retrieve it from main memory or from another CPU's cache. If it must retrieve the data from another CPU's cache it does so by issuing a probe read message to that CPU through the associated probe queue. When the probe message reaches a point in the CPU where the target address of the probe can be compared against the entries of its local tag store, it is referred to as being at the "top" of that probe queue and can be processed by the CPU.

In reply to a probe message that has reached the top of the probe queue, the associated processor initiates a probe response in which the system control logic 18 is notified that the probe message's target address has been compared against the entries of the local tag store. When the probe message is a probe read message, the probe response indicates that the data targeted by the probe message is ready to be transferred to the system control logic 18. The system control logic 18 then commands the central processing unit to transfer the data and incorporates it in a fill message. The fill message is placed on the fill queue of the central processing unit that issued the associated readmiss command, i.e. the requesting CPU, such that the data will be stored in the cache memory of the requesting CPU.

The shared, serially accessed duplicate tag store (dtag) 23 is used to maintain data coherency within the multiprocessor computer system 10. The duplicate tag store 23 receives commands from the address control chip 17 via arb bus 21 and transfers information to the address control chip 17 via bus 19. The duplicate tag store 23 is further coupled to memory modules 42a-42d by arb bus 21 and is partitioned into a plurality of storage locations for retaining the tag addresses of data elements stored in the backup cache memories of each processor module 11a-11d. These tag addresses are referred to as backup cache tags or duplicate tags and allow the address control chip 17 to quickly determine the state of each data element stored in a given processor module's cache memory. Based on this information, the address control chip 17 will issue probe messages only to processors that have a most up-to-date copy of the requested data.

Referring now to FIG. 2 processor module 11a, representative of processor modules 11b-11d of multiprocessor computer system 10, is shown in more detail. Processor module 11a is shown coupled to its associated probe queue 79a and fill queue 80a via bus 20a. In addition, processor module 11a includes an internal probe queue 81a that functions as an extension of probe queue 79a. Processor module 11a also includes a central processing unit (CPU) 12a, a backup cache 29a, and a backup cache tag store 30a. Data cache 22a is typically smaller and faster than backup cache 29a. The tag portion of each backup cache entry's address, as well as its status flags, are stored in tag store 30a. The status flags include a dirty bit, a valid bit, and a shared bit. The valid bit indicates that the data is the most recent version of the particular data element. The dirty bit indicates that the data has been modified since it was retrieved and thus indicates that the CPU coupled to the cache is the "owner" of the data. Being the owner of a data element means that the coupled CPU is responsible for servicing all requests that target that data until another processor takes ownership of the data, or until the command to write the data back to main memory wins arbitration. The shared bit indicates that another CPU also has an identical copy of the data element stored in its cache.

The status flags stored in tag store 30a are similar to the status "code" stored in duplicate tag store 23 (shown in FIG. 1). That status code indicates whether the entry is valid, invalid, dirty-probed, or dirty-not-probed. As in tag store 30, the valid and invalid portions of the status code indicate whether or not the associated CPU has an unmodified copy of the data. The dirty-probed portion of the status code indicates that the associated processor has a dirty copy of the data and that a probe read message has been previously issued to that CPU to retrieve a copy of it. Likewise, the dirty-not-probed portion of the status code indicates that the associated CPU has a dirty copy of the data but a probe read message has not previously been issued to the CPU to retrieve it. Accordingly, the status information stored by tag store 30 and by duplicate tag store 23 are not identical.

CPU 12a includes several groups of logic that enable it to perform the major operations that the computer system 10 requires. The Ibox 34a, or instruction fetch and decode unit, controls instruction pre-fetching, instruction decoding, branch prediction, instruction issuance, and interrupt handling. The Ebox 36a, or integer execution unit, handles the functions of addition, shifting, byte manipulation, logic operations, and multiplication for integer values stored in the system. These same operations, for floating point values, are controlled by the Fbox 38a, or floating point execution unit. The Mbox 40a, or memory address translation unit, translates virtual addresses, generated by programs running on the system, into physical addresses which are used to access locations in the computer system 10. The Ebox 36a and Fbox 38a operate on data items and are primarily coupled to Data cache 22a via busses 25a and 26a respectively. Also, Mbox 40a and Ibox 34a are coupled to the Instruction cache 24a via busses 27a and 28a respectively.

Lastly the Cbox 30a, or cache control and bus interface unit, includes logic for controlling the backup cache 29a, memory related external interface functions, and all accesses initiated by the Mbox 40a. The Cbox 30a also includes victim data buffers 78a, a victim address file (VAF) 87a, and a miss-address file (MAF) 86a for operations related to retrieving data elements. The victim data buffers 78a store data elements that have been evicted from backup cache 29a. The VAF 87a stores the addresses of data elements stored in each victim data buffer. Also, the MAF 86a stores a copy of each command that central processing unit 11a issues to the system but which has not yet completed. Further, the Cbox 30a also includes a path for probe and fill messages to enter the central processing unit 11a and operate on specified data stored in che memory or in the victim data buffers. The path is an extension of probe queue 79a and fill queue 80a and includes an internal probe queue 81a for probe messages, and a bypass path for fill messages, as will be described in more detail below.

I. Distributed Data Dependency Stall Mechanism

Referring now to FIG. 3, a simplified depiction of a multiprocessor computer system 10 is shown to include a plurality of processor modules 11a-11d in relation to address control chip 17. Each processor module 11a-11d is minimally shown to include a Central Processor Unit 12a-12d, and a backup cache 29a-29d. Each CPU 12a-12d is shown to include a miss-address file (MAF) 86a-86d, a set of victim data buffers (VDB) 78a-78d, an internal probe queue 81a-81d and a primary data cache 22a-22d as described above. The processor modules 11a-11d are coupled to the address control chip 17 which includes the central arbitration circuit 60, and a probe and fill queue pair (79a-79d and 80a-80d) for each processor module 11a-11d.

During normal operation, a central processing unit 12a-12d will attempt to retrieve data elements from its primary data cache 22a-22d and backup cache 29a-29d before issuing a command to the system control logic to retrieve the requested data. If the memory block that contains the requested data is not stored in the CPU's cache, a cache miss occurs and the data must be retrieved from another source such as main memory 42 or another CPU's cache.

In order to retrieve the data from a source other than an attached cache, the CPU issues a command to the system control logic 18. If the CPU only needs to access the data, it issues a readmiss command to the system control logic 18. That command will cause the system control logic 18 to retrieve the most up-to-date copy of the requested data element and store it in the CPU's cache. The associated status information will be updated to indicate that the data is in an unmodified state. If the CPU needs to modify the data, it issues a read-miss-modify command to the system control logic 18. The read-miss-modify command will cause the system control logic 18 to retrieve an up-to-date copy of the data, invalidate all other copies of that data via a probe invalidate message, store it in the requesting CPU's cache and update the associated status information to indicate that the data is in an exclusive modified state. When a data element is in an exclusive state it means that it is stored only in that cache and, therefore, is considered the most up-to-date version in the computer system. When a readmiss or read-miss-modify command is issued, it is input to central arbitration circuit 60.

Central arbitration circuit 60 includes an arbitration algorithm for determining the order in which the command will gain access to the duplicate tag store 23 and main memory 42. Such arbitration algorithms can include round-robin arbitration which is well known in the art and will not be explained further. Further, central arbitration circuit 60 operates responsive to the number of probe messages stored on each of the probe queues. When one of the probe queues becomes "full", i.e. when each of its entries is filled with a pending probe message, the central arbitration logic is stalled by the address control chip 17. When the central arbitration logic 17 is stalled, commands that are issued to the address control chip 17, from the central processing units, will not be input to the central arbitration circuit until the full probe queue has a predetermined number of free locations. Also, the central arbitration circuit 60 will be prevented from issuing any more probe messages to the system serialization point. For example, when the probe queue is full, a full flag coupled to the probe queue, is set. At this point the central arbitration circuit 60 will be prevented from issuing further probe messages to the system serialization point. Eventually, as the associated central processing unit processes probe messages that reach the top of the probe queue, entries of the probe queue will be made available to hold further probe messages. When a predetermined number of entries are available, an almost-full flag will be set. Responsively, central arbitration circuit 60 will resume normal operation.

When a readmiss or read-miss-modify command wins arbitration, the system control logic 18 performs a lookup operation on the entries of duplicate tag store 23 to determine if a most up-to-date version of the requested data is stored in one of the backup cache memories 29a-29d of the Central Processor Units 11a-11d. Concurrently, an access of main memory 42 is initiated. If none of the backup cache memories 29a-29d have a most up-to-date copy of the data, it is retrieved from main memory 42. Alternatively, if an up-to-date copy of the requested data is stored in another CPU's cache, a corresponding probe read message (and for read-miss-modify commands, a probe read-invalidate message) is placed on the probe queue 79a-79d of the central processing unit that has the requested data stored in its backup cache 29a-29d. After comparing the address of the requested data with the addresses of data stored in its cache, the central processing unit storing the requested data replies to the probe read message by initiating a probe response. The probe response indicates to the system control logic 18 that the requested data is ready to be accessed. Subsequently, the system control logic obtains a copy of the data and incorporates it in a fill message placed on the fill queue of the requesting CPU. When the fill message reaches the top of the fill queue, the data is stored in cache.

It should be noted that if the command was a read-miss-modify command, the status of any other entry in the duplicate tag store 23 that matches the data targeted by the command, is changed to invalid via a probe invalidate message issued to each CPU other than the one which is supplying the data. When a read-miss-modify command wins arbitration, a probe read-invalidate message is issued to the CPU that owns the data and a probe invalidate message is issued to other CPUs that are storing copies of the data. A probe read-invalidate message retrieves a copy of the data from a CPU and also changes its status in that CPU to invalid. The copies of the data in that CPU and all other CPUs are invalidated because the requested data is to be modified by the requesting CPU. Accordingly, the modified value will thereafter be the only valid copy of the data and, because of the invalidation of the other CPU's entries in duplicate tag store 23, the system will no longer issue probe read messages to retrieve the data from those sources. In addition, future probe messages are directed to the CPU that issued the read-miss-modify command so that the up-to-date version of the data is retrieved from its cache memory.

Before a readmiss or read-miss-modify command is to be issued to retrieve a requested data element, a determination is made as to the location where the requested data element will be stored in cache. When a data element is already stored at that location the CPU needs to displace the stored data to make room for the requested data. Such evicted data is referred to as "victim" data. When the status of the victim data is modified, or dirty, it should be written into main memory to preserve that modified value. The CPU writes victim data back to memory by issuing a "victim" command along with the associated readmiss or read-miss-modify command. The combination of commands is referred to as a readmiss/victim or read-miss-modify/victim command pair. While waiting to be written back to main memory, the victim data is stored in a victim data buffer in the CPU. By displacing the modified data to a victim data buffer 77, storage space is released in the cache while allowing a copy of the victim data to be accessed until all probe messages that were issued to the CPU before the associated victim command won arbitration have passed through the probe queue 79.

Probe messages and fill messages pass through separate and independently operating probe queues and fill queues before being input to the CPU to which they were issued. Probe messages and fill messages also require access to different sets of system resources. Fill messages require access to the CPU's command bus, data bus and cache. Probe messages, on the other hand, require access to the CPU's address bus, data bus and internal buffer resources. Although these sets of system resources overlap to some extent, the specific dependencies are unique, i.e. probe messages use resources that drive data to the system control logic while fill messages use resources that drive data from the system control logic to a CPU. As a result of this difference in resources, probe messages in the probe queue may make progress slower than fill messages in the fill queue, and vice versa. The most likely condition is that the probe messages will progress slower than the fill messages because of the probe message's dependence on the internal processor buffer resources. Accordingly, by segregating the two types of messages, the faster executing fill messages are not delayed by the slower executing probe messages and, in certain circumstances, vice versa. Further, since the speed of a computer system is largely determined by the time it takes to retrieve data, such an architecture greatly improves the rate at which requested data is stored in cache and therefore improves system performance.

As described above, the probe messages and the fill messages can execute at different speeds. However, a problem arises whenever a probe read message reaches the top of the probe queue 79 before a fill message containing the requested data reaches the top of the fill queue with that data. In this situation, the data is not present to satisfy the probe message, and therefore the probe queue 79 is stalled until the data is retrieved and stored in the cache. System performance can be improved by allowing individual probe queues to be stalled without stalling the entire system, i.e. by distributing such data dependent probe queue stalls. With such an arrangement, a series of probe messages targeting the same data can be chained together, e.g. a first probe message can be issued to a first CPU which has an associated second probe message that is in the process of retrieving the requested data from a second CPU.

In such a computer system it is possible that a first CPU issues a read-miss-modify command to the system control logic, concurrently with a readmiss or read-miss-modify command issued from a second CPU that references that same data element, and wherein the most up-to-date copy of the data element resides in a third CPU's cache. If the read-miss-modify command from the first CPU is issued to the system serialization point before the readmiss command from the second CPU, then the duplicate tag store lookup for the read-miss-modify command from the first CPU will cause a first probe read message to be issued to the third CPU. The system control logic then updates the duplicate tag store to indicate that the first CPU now has the most up-to-date copy of the data in the system. When the arbitration circuit in the system control logic issues the second CPU's readmiss command, the associated duplicate tag store lookup will detect the duplicate tag store update from the first CPU's read-miss-modify command and a second probe read message will be issued to the first CPU. This second probe read message may reach the top of the first CPU's probe queue before the fill message associated with the first probe read message reaches that same CPU. Since the fill message associated with the first probe read contains the data required by the second probe read, the second probe read cannot be serviced.

Delays caused by the above mentioned situation are prevented in the computer system shown in FIG. 3 where the individual probe queues can be stalled by the central processing unit 12a-12d attached thereto. As previously stated, each CPU 11a-11d includes a miss address file which stores references to each data element for which that CPU has issued a readmiss or read-miss-modify command that has not completed. The attached processor module 11a-11d will not process a probe message that reaches the top of probe queue 79 if the associated miss address file 86a-86d contains a reference for the data targeted by that probe message. Accordingly, the central processing unit 12a-12d compares the target address of the probe message with each entry in its MAF 86a-86d. If a match is detected the processor stalls the probe queue by allowing the probe message to remain at the top of the probe queue without being processed until the requested data is returned via a fill message and the corresponding entry in the MAF is deleted. Meanwhile, the other probe queues in the system continue to operate normally. Therefore the architecture of the computer system 10 prevents the above mentioned problem from occurring by allowing individual probe queues to be stalled by their associated CPUs without stalling the entire system.

Referring now to FIG. 4, a flow diagram depicts the functionality of the computer system's central processing units and their associated logic with respect to issuance of a readmiss or read-miss-modify command and the corresponding probe read message. Consider a multiprocessor computer system 10, such as depicted in FIG. 3, comprised of a plurality of CPUs 12a-12d (hereinafter CPU.sub.1, CPU.sub.2, CPU.sub.3, and CPU.sub.4 respectively). A sequence of commands is issued wherein CPU.sub.1 first issues a read-miss-modify command for data block A (step 100) to the system control logic 18. After the command wins arbitration, the address control chip 17 determines that the most up-to-date copy of data block A is stored in main memory 42. The address control chip 17 updates the corresponding entry in the duplicate tag store 23, retrieves a copy of data block A from main memory 42 and incorporates it in a fill message placed on fill queue 80a (step 102). Following these events, CPU.sub.2 also issues a read-miss-modify command targeting data block A (step 104). When that read-miss-modify command for data block A wins arbitration, the address control chip 17 performs a lookup operation wherein the duplicate tag store 23 compares the data block's address with each of its entries, and sends the result to