WikiPatents - Community Patent Review
Writeback cancellation processing system for use in a packet switched cache coherent multiprocessor system    
United States Patent 5684977
Link to this page: http://www.wikipatents.com/5684977.html
Inventor(s): Van Loo; William C. (Palo Alto, CA); Ebrahim; Zahir (Mountain View, CA); Nishtala; Satyanarayana (Cupertino, CA); Normoyle; Kevin (San Jose, CA); Loewenstein; Paul (Palo Alto, CA); Coffin, III; Louis F. (San Jose, CA)
Abstract: A multiprocessor computer system is provided having a multiplicity of sub-systems and a main memory coupled to a system controller. An interconnect module interconnects the main memory and sub-systems in accordance with interconnect control signals received from the system controller. At least two of the sub-systems are data processors, each having a respective cache memory that stores multiple blocks of data and a set of master cache tags (Etags), including one cache tag for each data block stored by the cache memory. Each data processor includes a master interface for sending memory transaction requests to the system controller. The system controller processes each memory transaction and maintains a set of duplicate cache tags (Dtags) for each data processor. Finally, the system controller contains transaction execution circuitry for activating a transaction for servicing by the interconnect. The transaction execution circuitry pipelines memory access requests from the data processors, and includes invalidation circuitry for processing each writeback request from a given data processor prior to activation to determine if the Dtag index corresponding to the victimized cache line is invalid. Thereafter, the invalidation circuitry activates writeback requests only if the Dtag index is not invalid and cancels the writeback request if the Dtag index is invalid.
   














Owner/Assignee     Sun Microsystems, Inc. (Mountain View, CA)
Publication Date     November 4, 1997
Application Number     08/415,040
Filing Date     March 31, 1995
US Classification     711/143 711/120
Int'l Classification     G06F 012/08
Examiner     Chan; Eddie P.
Assistant Examiner     Yip; Vincent
Attorney/Law Firm     Williams; Gary S. Flehr Hohbach Test Albritton & Herbert LLP
USPTO Field of Search     395/447 395/470 395/471 395/473 395/474
Patent Tags     writeback cancellation processing packet switched cache coherent multiprocessor
   

References

U.S. References
Patent No.   Inventor          US Class   Date
5537569      Masubuchi         711/121    Jul 1996
5537575      Foley             711/141    Jul 1996
5530835      Vashi             711/147    Jun 1996
5428799      Woods             710/266    Jun 1995
5404480      Suzuki            711/117    Apr 1995
5375220      Ishikawa          711/141    Dec 1994
5319753      MacKenna          710/48     Jun 1994
5119485      Ledbetter, Jr.    711/146    Jun 1992
5036459      den Haan          709/237    Jul 1991
4228503      Waite             711/121    Oct 1980
Claims
 


What is claimed is:

1. A computer system, comprising:

a system controller;

a main memory coupled to the system controller;

a plurality of data processors, each respective data processor having a cache memory having a plurality of cache lines for storing a like plurality of data blocks, and a like plurality of master cache tags (Etags), including one Etag for each cache line in the cache memory of the respective data processor, and a writeback buffer for storing a dirty victim data block displaced from the cache memory; the Etag for each cache line storing an address index and an Etag state value that indicates whether the data block stored in the cache line includes data modified by the respective data processor;

each respective data processor including a master interface, coupled to the system controller, for sending memory transaction requests to the system controller, the memory transaction requests including read requests and writeback requests; each memory transaction request specifying an address for an associated data block to be read or written;

the master interface for each respective data processor further including cache coherence logic for responding to a cache miss on any cache line in the cache memory by (A) generating a read request, and (B) when the cache miss requires a cache line to be victimized which includes modified data, according to the Etag state value in the corresponding Etag, storing the data block having the modified data in the writeback buffer and generating a writeback request;

the system controller including a set of duplicate cache tags (Dtags) for each respective data processor, each Dtag corresponding to one of the Etags and storing a Dtag state value and the same address index as the corresponding Etag; the Dtag state value indicating whether the data block stored in the corresponding cache line includes data modified by the corresponding data processor;

the system controller including memory transaction request logic for processing each memory transaction request by any of the respective data processors, the memory transaction request logic including transaction execution circuitry for pipelining and executing the memory transaction requests from the data processors,

the transaction execution circuitry including Dtag updating circuitry for updating the contents of a Dtag, if any, corresponding to each data processor where the Dtag has an address index matching an address associated with the memory transaction request being executed, wherein the Dtag updating circuitry can invalidate Dtags corresponding to one or more data processors, including data processors other than the data processor that made the memory transaction request being executed by the transaction execution circuitry;

the transaction execution circuitry further including writeback circuitry for processing each writeback request from each data processor prior to activation to determine if the Dtag corresponding to the data processor that submitted the writeback request and the address specified by the writeback request is invalid, the writeback circuitry activating the writeback request if the corresponding Dtag is not invalid and canceling the writeback request if the corresponding Dtag is invalid;

the system controller's memory transaction request logic including writeback processing logic for processing the activated writeback request by writing the data block in the writeback buffer into the main memory and invalidating the corresponding Dtag.

2. The computer system of claim 1,

the master interface of each data processor including at least two parallel outgoing request queues for storing memory transaction requests to be sent to the system controller;

the master interface of each data processor further including cache coherence logic for responding to a cache miss on any cache line in the cache memory of the respective data processor by (A) storing a read request in a first one of the outgoing request queues, and (B) when the cache miss occurs on a cache line storing a data block that, according to the Etag state value in the corresponding Etag, includes modified data, storing the data block having the modified data in the writeback buffer and storing a writeback request in a second one of the outgoing request queues.

3. A method of canceling writeback transactions in a packet switched cache coherent multiprocessor system having a system controller coupled to a main memory and to a plurality of data processors each having a cache memory, comprising the steps of:

storing master cache tags (Etags) in each data processor, including one Etag for each cache line in the cache memory of the respective data processor, the Etag for each cache line storing an address index and an Etag state value that indicates whether a data block stored in the cache line includes data modified by the respective data processor;

storing in a writeback buffer in each data processor a dirty victim data block displaced from the cache memory of the respective data processor;

storing in the system controller duplicate tags (Dtags) for each respective data processor, each Dtag corresponding to one of the Etags in a respective data processor and storing a Dtag state value and the same address index as the corresponding Etag;

sending memory transaction requests from the data processors to the system controller, the memory transaction requests including read requests and writeback requests; each memory transaction request specifying an address for an associated data block to be read or written;

upon processing a memory transaction request from one of the data processors requiring Dtag invalidation, invalidating Dtags having an address index matching an address associated with the memory transaction request being executed, wherein the invalidated Dtags include Dtags corresponding to one or more data processors, including data processors other than the data processor that made the memory transaction request being executed;

activating a writeback request for processing by the system controller by:

determining if the Dtag corresponding to the data processor that made the writeback request and the address specified by the writeback request is invalid;

canceling the writeback request if the Dtag is invalid;

activating the writeback request if the Dtag is not invalid; and

processing activated writeback requests by writing the data block in respective ones of the writeback buffers into the main memory and invalidating the corresponding Dtags.

4. The method of claim 3,

wherein the activating step cancels the writeback request by sending a first reply message to the data processor that instructs the data processor not to send the dirty victim data block to the main memory, and activates the writeback request by sending a second reply message to the data processor that instructs the data processor to send the dirty victim data block to the main memory.
Description
 


The present invention relates generally to multiprocessor computer systems in which the processors share memory resources, and particularly to a multiprocessor computer system that utilizes an interconnect architecture and cache coherence methodology to minimize memory access latency by parallelizing read and writeback transactions and providing for a mechanism to cancel pending writeback transactions upon the subsequent modification of the pending writeback data.

BACKGROUND OF THE INVENTION

The need to maintain "cache coherence" in multiprocessor systems is well known. Maintaining "cache coherence" means, at a minimum, that whenever data is written into a specified location in a shared address space by one processor, the caches for any other processors which store data for the same address location are either invalidated, or updated with the new data.

There are two primary system architectures used for maintaining cache coherence. One, herein called the cache snoop architecture, requires that each data processor's cache include logic for monitoring a shared address bus and various control lines so as to detect when data in shared memory is being overwritten with new data, determining whether its data processor's cache contains an entry for the same memory location, and updating its cache contents and/or the corresponding cache tag when data stored in the cache is invalidated by another processor. Thus, in the cache snoop architecture, every data processor is responsible for maintaining its own cache in a state that is consistent with the state of the other caches.

In a second cache coherence architecture, herein called the memory directory architecture, main memory includes a set of status bits for every block of data that indicate which data processors, if any, have the data block stored in cache. The main memory's status bits may store additional information, such as which processor is considered to be the "owner" of the data block if the cache coherence architecture requires storage of such information.
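
As an illustration only, the per-block status bits of a memory directory can be sketched as a bitmask of caching processors plus an optional owner field. The class and method names below (DirectoryEntry, MemoryDirectory, record_read, record_write) are hypothetical and are not taken from any particular prior art system.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DirectoryEntry:
    sharers: int = 0              # bit i set: processor i holds the block in its cache
    owner: Optional[int] = None   # "owner" processor, if the protocol tracks one


class MemoryDirectory:
    """Hypothetical directory: one entry of status bits per data block."""

    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def record_read(self, block: int, cpu: int) -> None:
        # Mark that this processor now caches the block.
        entry = self.entries.setdefault(block, DirectoryEntry())
        entry.sharers |= 1 << cpu

    def record_write(self, block: int, cpu: int) -> List[int]:
        # Return the other processors whose cached copies must be invalidated
        # (or updated), then make the writer the sole sharer and the owner.
        entry = self.entries.setdefault(block, DirectoryEntry())
        stale = [i for i in range(entry.sharers.bit_length())
                 if (entry.sharers >> i) & 1 and i != cpu]
        entry.sharers = 1 << cpu
        entry.owner = cpu
        return stale
```

For example, if processors 0 and 2 have both read a block and processor 2 then writes it, record_write reports that processor 0 holds a stale copy.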

In these cache coherence architectures, read-writeback transaction pairs arise when a read miss requires victimizing a cache line which has modified data, thereby necessitating a writeback to main memory. In the prior art, these transactions normally are strictly ordered, with the victimizing read transaction executing prior to the writeback transaction in order to allow the requesting processor to receive the data right away. In addition to the strict ordering, cache coherence architectures of the prior art required these read and writeback transactions be sequentially executed, not allowing for any other coherent transactions to be executed from the same processor between the read and the writeback transactions, even when transactions are directed to a different cache index.

One problem in these architectures arises when a writeback transaction is scheduled and the data associated with the writeback becomes invalid, due to a subsequent data write to the same address (such new data being stored in a cache line of one of the data processors). In these cases, the writeback transaction is no longer required, since the data is no longer valid. However, in these prior art architectures no simple way exists to cancel the writeback transactions once they are generated. This is because there does not exist a single shared address bus which a given processor can snoop on to look for all of these invalidating transactions. Similarly, cancellation of unexecuted or unscheduled writeback transactions (ones that are pending as part of a read-writeback transaction pair), cannot be easily accomplished because of the lack of visibility to these transactions. Accordingly, unnecessary writebacks are characteristic of these prior art systems resulting in reduced system performance.

Accordingly, an architecture which provides for an easy mechanism to cancel pending writeback transactions upon the occurrence of an invalidating transaction would yield an improvement in the overall transaction throughput.

SUMMARY OF THE INVENTION

In summary, the present invention is a system and method for cancelling unnecessary writebacks of dirty victims displaced from cache memory in a cache coherent multiprocessor computer system. The multiprocessor system has a multiplicity of sub-systems and a main memory coupled to a system controller. An interconnect module interconnects the main memory and sub-systems in accordance with interconnect control signals received from the system controller.

At least two of the sub-systems are data processors, each having a respective cache memory that stores multiple blocks of data and a respective master cache index. Each master cache index has a set of N master cache tags (Etags), including one cache tag for each data block stored by the cache memory.

Each data processor includes a master interface for sending memory transaction requests, including read and writeback transactions, to the system controller and for receiving cache access requests from the system controller corresponding to memory transaction requests by other ones of the data processors. The data processors each further include a writeback buffer for storing a victimized cache line until an associated writeback transaction is completed.
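
The processor-side behavior described above can be sketched as follows. This is a minimal model under stated assumptions, not the circuitry of the preferred embodiment: the DataProcessor class, its field names, and the use of two Python deques for the outgoing request queues (compare claim 2) are all hypothetical.

```python
from collections import deque

# Etag states that indicate modified data (M and O in the MOESI scheme).
DIRTY_STATES = {"M", "O"}


class DataProcessor:
    """Hypothetical model of a caching master port."""

    def __init__(self):
        self.lines = {}                 # cache index -> (address, etag_state, data)
        self.writeback_buffer = None    # (address, data) of the pending dirty victim
        self.read_queue = deque()       # first outgoing request queue
        self.writeback_queue = deque()  # second outgoing request queue

    def handle_miss(self, index, new_address):
        victim = self.lines.get(index)
        # (A) Always issue a read request for the missing block.
        self.read_queue.append(("READ", new_address))
        # (B) If the displaced line is dirty, park it in the writeback
        # buffer and issue a writeback request for it.
        if victim is not None and victim[1] in DIRTY_STATES:
            address, _state, data = victim
            self.writeback_buffer = (address, data)
            self.writeback_queue.append(("WRITEBACK", address))
        # Assumption: the line is treated as empty until the read data arrives.
        self.lines.pop(index, None)
```

The victim stays in the writeback buffer until the system controller either activates or cancels the writeback request.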

The system controller includes memory transaction request logic for processing each memory transaction request by a data processor, for determining which one of the cache memories and main memory to couple to the requesting data processor, for sending corresponding interconnect control signals to the interconnect module so as to couple the requesting data processor to the determined one of the cache memories and main memory, and for sending a reply message to the requesting data processor to prompt the requesting data processor to transmit or receive one data packet to or from the determined one of the cache memories and main memory.

The system controller maintains a duplicate cache index having a set of N duplicate cache tags (Dtags) for each of the data processors, the set of N duplicate cache tags for each data processor having the same number of cache tags as the corresponding set of master cache tags. Each master cache tag denotes a master cache state and an address tag; the duplicate cache tag corresponding to each master cache tag denotes a second cache state and the same address tag as the corresponding master cache tag. The system controller includes an (N+1)th entry in the duplicate cache index, the (N+1)th entry corresponding to the cache state of a victimized cache line stored in the writeback buffer of an associated data processor.
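
A duplicate cache index with its extra entry might be modeled as below. This is a sketch only: the DtagArray class, the victimize method, and the choice to copy the displaced tag into the extra entry at the moment of victimization are assumptions made for illustration; this summary does not specify when or how the extra entry is filled.

```python
from dataclasses import dataclass
from enum import Enum


class DtagState(Enum):
    M = "Exclusive and Potentially Modified"
    O = "Shared Modified"
    S = "Shared Clean"
    I = "Invalid"


@dataclass
class Dtag:
    address_tag: int = 0
    state: DtagState = DtagState.I


class DtagArray:
    """Hypothetical duplicate cache index for one data processor."""

    def __init__(self, n_lines: int) -> None:
        self.n_lines = n_lines
        # N entries mirror the Etags; one extra entry tracks the victim
        # held in the processor's writeback buffer.
        self.tags = [Dtag() for _ in range(n_lines + 1)]

    @property
    def victim_entry(self) -> Dtag:
        return self.tags[self.n_lines]

    def victimize(self, index: int, new_tag: int, new_state: DtagState) -> None:
        # Assumption: the displaced line's tag moves to the extra entry
        # and the incoming block takes over the indexed entry.
        old = self.tags[index]
        self.tags[self.n_lines] = Dtag(old.address_tag, old.state)
        self.tags[index] = Dtag(new_tag, new_state)
```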

The system controller includes logic circuitry for executing memory transaction requests from the data processors, and includes invalidation circuitry for processing each writeback request from a given data processor to determine if the Dtag index corresponding to the victimized cache line is invalid. In particular, if a first data processor has performed a memory transaction that invalidates the same cache line that is the subject of the writeback request from a second data processor, the writeback is unnecessary because the data in that cache line to be written back has been invalidated and will be overwritten by the first data processor. The invalidation circuitry allows the data transfer to main memory associated with the writeback request to be executed only if the Dtag index for the address specified in the writeback request is not invalid and cancels the writeback request if the Dtag index is invalid.
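
The activate-or-cancel decision can be sketched in a few lines. The SystemController class below is a hypothetical software model, not the invalidation circuitry itself; it keeps each processor's Dtags in a dictionary keyed by block address, treats a missing key as the Invalid state, and uses the illustrative return values "CANCELED" and "WRITTEN".

```python
class SystemController:
    """Hypothetical model of Dtag-based writeback cancellation."""

    def __init__(self, n_cpus):
        # dtags[cpu] maps block address -> Dtag state ("M", "O" or "S");
        # an absent address is treated as Invalid ("I").
        self.dtags = [dict() for _ in range(n_cpus)]
        self.memory = {}

    def invalidating_write(self, requester, addr):
        # A transaction that takes exclusive ownership invalidates the
        # matching Dtag of every other processor, including the Dtag of a
        # victim still waiting in a writeback buffer.
        for cpu, tags in enumerate(self.dtags):
            if cpu != requester:
                tags.pop(addr, None)
        self.dtags[requester][addr] = "M"

    def writeback_request(self, cpu, addr, victim_data):
        # Check the requester's Dtag before activating the writeback.
        if self.dtags[cpu].get(addr, "I") == "I":
            return "CANCELED"           # data is stale; skip the transfer
        self.memory[addr] = victim_data  # activated: write victim to memory
        del self.dtags[cpu][addr]        # then invalidate the Dtag
        return "WRITTEN"
```

In this model, if processor 0 performs an invalidating write to an address that processor 1 is about to write back, processor 1's writeback request is canceled and main memory is left untouched.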

BRIEF DESCRIPTION OF THE DRAWINGS

Additional objects and features of the invention will be more readily apparent from the following detailed description and appended claims when taken in conjunction with the drawings, in which:

FIG. 1 is a block diagram of a computer system incorporating the present invention.

FIG. 2 is a block diagram of a computer system showing the data bus and address bus configuration used in one embodiment of the present invention.

FIG. 3 depicts the signal lines associated with a port in a preferred embodiment of the present invention.

FIG. 4 is a block diagram of the interfaces and port ID register found in a port in a preferred embodiment of the present invention.

FIG. 5 is a block diagram of a computer system incorporating the present invention, depicting request and data queues used while performing data transfer transactions.

FIG. 6 is a block diagram of the System Controller Configuration register used in a preferred embodiment of the present invention.

FIG. 7 is a block diagram of a caching UPA master port and the cache controller in the associated UPA module.

FIGS. 8, 8A, 8B, 8C and 8D are a simplified flow chart of typical read/write data flow transactions in a preferred embodiment of the present invention.

FIG. 9 depicts the writeback buffer and Dtag Transient Buffers used for handling coherent cache writeback operations.

FIGS. 10A, 10B, 10C, 10D and 10E show the data packet formats for various transaction request packets.

FIG. 11 is a state transition diagram of the cache tag line states for each cache entry in an Etag array in a preferred embodiment of the present invention.

FIG. 12 is a state transition diagram of the cache tag line states for each cache entry in a Dtag array in a preferred embodiment of the present invention.

FIG. 13 depicts the logic circuitry for activating transactions.

FIGS. 14A-14D are block diagrams of status information data structures used by the system controller in a preferred embodiment of the present invention.

FIG. 15 is a block diagram of the Dtag lookup and update logic in the system controller in a preferred embodiment of the present invention.

FIG. 16 is a block diagram of the S_Request and S_Reply logic in the system controller in a preferred embodiment of the present invention.

FIG. 17 is a block diagram of the datapath scheduler in a preferred embodiment of the present invention.

FIG. 18 is a block diagram of the S_Request and S_Reply logic in the system controller in a second preferred embodiment of the present invention.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following is a glossary of terms used in this document.

Cache Coherence: keeping all copies of each data block consistent.

Tag: a tag is a record in a cache index for indicating the status of one cache line and for storing the high order address bits of the address for the data block stored in the cache line.

Etag: the primary array of cache tags for a cache memory. The Etag array is accessed and updated by the data processor module in a UPA port.

Dtag: a duplicate array of cache tags maintained by the system controller.

Interconnect: The set of system components that interconnect data processors, I/O processors and their ports. The "interconnect" includes the system controller 110, interconnect module 112, data busses 116, address busses 114, and reply busses 120 (for S_REPLY's), 122 (for P_REPLY's) in the preferred embodiment.

Victim: a data block displaced from a cache line.

Dirty Victim: a data block that was updated by the associated data processor prior to its being displaced from the cache by another data block. Dirty victims must normally be written back to main memory, except that in the present invention the writeback can be canceled if the same data block is invalidated by another data processor prior to the writeback transaction becoming "Active."

Line: the unit of memory in a cache memory used to store a single data block.

Invalidate: changing the status of a cache line to "invalid" by writing the appropriate status value in the cache line's tag.

Master Class: an independent request queue in the UPA port for a data processor. A data processor having a UPA port with K master classes can issue transaction requests in each of the K master classes. Each master class has its own request FIFO buffer for issuing transaction requests to the System Controller as well as its own distinct inbound data buffer for receiving data packets in response to transaction requests and its own outbound data buffer for storing data packets to be transmitted.

Writeback: copying modified data from a cache memory into main memory.

The following is a list of abbreviations used in this document:

DVMA: direct virtual memory access (same as DMA, direct memory access for purposes of this document)

DVP: dirty victim pending

I/O: input/output

IVP: Invalidate me Advisory

MOESI: the five Etag states: Exclusive Modified (M), Shared Modified (O), Exclusive Clean (E), Shared Clean (S), Invalid (I).

MOSI: the four Dtag states: Exclusive and Potentially Modified (M), Shared Modified (O), Shared Clean (S), Invalid (I).

NDP: no data tag present

PA[xxx]: physical address [xxx]

SC: System Controller

UPA: Universal Port Architecture
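
The two state sets listed above can be written out as enumerations. The sketch below is illustrative: the names EtagState, DtagState, ETAG_TO_DTAG and is_dirty are hypothetical, and the mapping of both Exclusive Etag states onto the single Dtag M state is an assumption inferred from the Dtag M state being described as "Exclusive and Potentially Modified."

```python
from enum import Enum


class EtagState(Enum):
    M = "Exclusive Modified"
    O = "Shared Modified"
    E = "Exclusive Clean"
    S = "Shared Clean"
    I = "Invalid"


class DtagState(Enum):
    M = "Exclusive and Potentially Modified"
    O = "Shared Modified"
    S = "Shared Clean"
    I = "Invalid"


# Assumption: the system controller cannot tell Exclusive Clean from
# Exclusive Modified, so both Etag states map to Dtag M.
ETAG_TO_DTAG = {
    EtagState.M: DtagState.M,
    EtagState.E: DtagState.M,
    EtagState.O: DtagState.O,
    EtagState.S: DtagState.S,
    EtagState.I: DtagState.I,
}


def is_dirty(state: EtagState) -> bool:
    # A victim in either Modified state must normally be written back.
    return state in (EtagState.M, EtagState.O)
```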

Referring to FIG. 1, there is shown a multiprocessor computer system 100 incorporating the computer architecture of the present invention. The multiprocessor computer system 100 includes a set of "UPA modules." UPA modules 102 include data processors as well as slave devices such as I/O handlers and the like. Each UPA module 102 has a port 104, herein called a UPA port, where "UPA" stands for "universal port architecture." For simplicity, UPA modules and their associated ports will often be called, collectively, "ports" or "UPA ports," with the understanding that the port or UPA port being discussed includes both a port and its associated UPA module.

The system 100 further includes a main memory 108, which may be divided into multiple memory banks 109 Bank_0 to Bank_m, a system controller 110, and an interconnect module 112 for interconnecting the ports 104 and main memory 108. The interconnect module 112, under the control of datapath setup signals from the System Controller 110, can form a datapath between any port 104 and any other port 104 or between any port 104 and any memory bank 109. The interconnect module 112 can be as simple as a single, shared data bus with selectable access ports for each UPA port and memory module, or can be a somewhat more complex crossbar switch having m ports for m memory banks and n ports for n UPA ports, or can be a combination of the two. The present invention is not dependent on the type of interconnect module 112 used, and thus the present invention can be used with many different interconnect module configurations.
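
For illustration, the crossbar variant of the interconnect module can be modeled as a set of endpoints in which several disjoint datapaths may exist at once. The Crossbar class and its connect and release methods are hypothetical names, and the rule that an endpoint carries only one datapath at a time is an assumption of this sketch.

```python
class Crossbar:
    """Hypothetical crossbar with n UPA ports and m memory banks."""

    def __init__(self, n_ports, n_banks):
        self.endpoints = ({("port", i) for i in range(n_ports)}
                          | {("bank", j) for j in range(n_banks)})
        self.busy = set()  # endpoints currently part of a datapath

    def connect(self, a, b):
        if a not in self.endpoints or b not in self.endpoints:
            raise ValueError("unknown endpoint")
        # Datapaths are port-to-port or port-to-bank, never bank-to-bank.
        if a == b or (a[0] == "bank" and b[0] == "bank"):
            raise ValueError("unsupported datapath")
        if a in self.busy or b in self.busy:
            return False   # the controller must schedule this transfer later
        self.busy |= {a, b}
        return True

    def release(self, a, b):
        self.busy -= {a, b}
```

With a single shared data bus, by contrast, only one datapath could be active at a time.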

A UPA port 104 interfaces with the system controller 110 and the interconnect module 112 via a packet switched address bus 114 and a packet switched data bus 116, respectively, each of which operates independently. A UPA module logically plugs into a UPA port. The UPA module 102 may contain a data processor, an I/O controller with interfaces to I/O busses, or a graphics frame buffer. The UPA interconnect architecture in the preferred embodiment supports up to thirty-two UPA ports, and multiple address and data busses in the interconnect. Up to four UPA ports 104 can share the same address bus 114, and arbitrate for its mastership with a distributed arbitration protocol.

The System Controller 110 is a centralized controller and performs the following functions:

Coherence control;

Memory and Datapath control; and

Address crossbar-like connectivity for multiple address busses.

The System Controller 110 controls the interconnect module 112, and schedules the transfer of data between two UPA ports 104, or between UPA port 104 and memory 108. The architecture of the present invention supports an arbitrary number of memory banks 109. The System Controller 110 controls memory access timing in conjunction with datapath scheduling for maximum utilization of both resources.

The System Controller 110, the interconnect module 112, and memory 108 are in the "interconnect domain," and are coupled to UPA modules 102 by their respective UPA ports 104. The interconnect domain is fully synchronous with a centrally distributed system clock signal, generated by a System Clock 118, which is also sourced to the UPA modules 102. If desired, each UPA module 102 can synchronize its private internal clock with the system interconnect clock. All references to clock signals in this document refer to the system clock, unless otherwise noted.

Each UPA address bus 114 is a 36-bit bidirectional packet switched request bus, and includes 1-bit odd-parity. It carries address bits PA[40:4] of a 41-bit physical address space as well as transaction identification information.
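
Two small calculations follow from this description: selecting bits 40 through 4 of a 41-bit physical address, and generating an odd-parity bit. The sketch below uses hypothetical helper names (pa_40_4, odd_parity_bit). It does not model how the address and transaction identification bits are laid out in a request packet, since that layout is given only in the packet format figures.

```python
def pa_40_4(physical_address: int) -> int:
    """Return bits 40..4 of a 41-bit physical address (the low 4 bits are dropped)."""
    if not 0 <= physical_address < (1 << 41):
        raise ValueError("address does not fit in 41 bits")
    return physical_address >> 4


def odd_parity_bit(value: int) -> int:
    """Return the parity bit that makes the total number of 1 bits odd."""
    if value < 0:
        raise ValueError("value must be non-negative")
    ones = bin(value).count("1")
    return 0 if ones % 2 == 1 else 1
```

Dropping the low four address bits means each request addresses memory with 16-byte granularity.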

Referring to FIGS. 1 and 2, there may be multiple address busses 114 in the system, with up to four UPA ports 104 on each UPA address bus 114. The precise number of UPA address busses is variable, and will generally be dependent on system speed requirements. Since putting more ports on an address bus 114 will slow signal transmissions over the address bus, the maximum number of ports per address bus will be determined by the signal transmission speed required for the address bus.
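
As a worked example of the limits stated above (at most thirty-two UPA ports in the preferred embodiment, at most four ports per address bus), the smallest number of address busses for a given port count is a ceiling division. The function name min_address_busses is hypothetical; a real design might use more busses than this minimum to meet its signal speed requirements.

```python
def min_address_busses(n_ports: int, ports_per_bus: int = 4) -> int:
    """Smallest number of address busses that can serve n_ports UPA ports."""
    if not 1 <= n_ports <= 32:
        raise ValueError("the preferred embodiment supports 1 to 32 UPA ports")
    if not 1 <= ports_per_bus <= 4:
        raise ValueError("an address bus is shared by at most four UPA ports")
    return -(-n_ports // ports_per_bus)  # ceiling division
```

A fully populated system of thirty-two ports therefore needs at least eight address busses, and lowering ports_per_bus trades more busses for faster signaling on each one.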

The datapath circuitry (i.e., the interconnect module 112) and the address busses 114 are independently scaleable. As a result, the number of address busses can be increased, or decreased, for a given number of processors so as to optimize the speed/cost tradeoff for the transmission of transaction requests over the address busses totally independently of decisions regarding the speed/cost tradeoffs associated with the design of the interconnect module 112.

FIG. 3 shows the full set of signals received and transmitted by a UPA port having all four interfaces (described below) of the preferred embodiment. Table 1 provides a short description of each of the signals shown in FIG. 3.

TABLE 1
UPA Port Interface Signal Definitions
______________________________________
Signal Name
    Description
______________________________________
Data Bus Signals

UPA_Databus[128]
    128-bit data bus. Depending on speed requirements and the bus technology used, a system can have as many as one 128-bit data bus for each UPA port, or each data bus can be shared by several ports.
UPA_ECC[16]
    Bus for carrying error correction codes. UPA_ECC<15:8> carries the ECC for UPA_Databus<127:64>. UPA_ECC<7:0> carries the ECC for UPA_Databus<63:0>.
UPA_ECC_Valid
    ECC valid. A unidirectional signal from the System Controller to each UPA port, driven by the System Controller to indicate whether the ECC is valid for the data on the data bus.

Address Bus Signals

UPA_Addressbus[36]
    36-bit packet switched transaction request bus. See packet format in FIGS. 10A, 10B, 10C.
UPA_Req_In[3]
    Arbitration request lines for up to three other UPA ports that might be sharing this UPA_Addressbus.
UPA_Req_Out
    Arbitration request from this UPA port.
UPA_SC_Req_In
    Arbitration request from System Controller.
UPA_Arb_Reset_L
    Arbitration Reset, asserted at the same time that UPA_Reset_L is asserted.
UPA_AddrValid
    There is a separate, bidirectional, address valid signal line between the System Controller and each UPA port. It is driven by the Port which wins the arbitration or by the System Controller when it drives the address bus.
UPA_Data_Stall
    Data stall signal, driven by the System Controller to each UPA port to indicate, during transmission of a data packet, whether there is a data stall in between quad-words of a data packet.

Reply Signals

UPA_P_Reply[5]
    Port's reply packet, driven by a UPA port directly to the System Controller. There is a dedicated UPA_P_Reply bus for each UPA port.
UPA_S_Reply[6]
    System Controller's reply packet, driven by System Controller directly to the UPA port. There is a dedicated UPA_S_Reply bus for each UPA port.

Miscellaneous Signals

UPA_Port_ID[5]
    Five bit hardwired UPA Port Identification.
UPA_Reset_L
    Reset. Driven by System Controller at power-on and on any fatal system reset.
UPA_Sys_Clk[2]
    Differential UPA system clock, supplied by the system clock to all UPA ports.
UPA_CPU_Clk[2]
    Differential processor clock, supplied by the system clock controller only to processor UPA ports.
UPA_Speed[3]
    Used only for processor UPA ports, this hardwired three bit signal encodes the maximum speed at which the UPA port can operate.
UPA_IO_Speed
    Used only by IO UPA ports, this signal encodes the maximum speed at which the UPA port can operate.
UPA_Ratio
    Used only for processor UPA ports, this signal encodes the ratio of the system clock to the processor clock, and is used by the processor to internally synchronize the system clock and processor clock if it uses a synchronous internal interface.
UPA_JTAG[5]
    JTAG scan control signals, TDI, TMS, TCLK, TRST_L and TDO. TDO is output by the UPA port, the others are inputs.
UPA_Slave_Int_L
    Interrupt, for slave-only UPA ports. This is a dedicated line from the UPA port to the System Controller.
UPA_XIR_L
    XIR reset signal, asserted by the System Controller to signal XIR reset.
______________________________________

A valid packet on the UPA address bus 114 is identified by the driver (i.e., the UPA port 104 or the System Controller 110) asserting the UPA_Addr_valid signal.

The System Controller 110 is connected to each UPA address bus 114 in the system 100. The UPA ports 104 and System Controller 110 arbitrate for use of each UPA address bus 114 using a distributed arbitration protocol. The arbitration protocol is described in patent application Ser. No. 08/414,559, filed Mar. 31, 1995, which is hereby incorporated by reference.

UPA ports do not communicate directly with other UPA ports on a shared UPA address bus 114. Instead, when a requesting UPA port generates a request packet that requests access to an addressed UPA port, the System Controller 110 forwards a slave access to the addressed UPA port by retransmitting the request packet and qualifying the destination UPA port with its UPA_Addr_valid signal.

A UPA port also does not "snoop" on the UPA address bus to maintain cache coherence. The System Controller 110 performs snooping on behalf of those UPA ports whose respective UPA modules include cache memory using a write-invalidate cache coherence protocol described below.

The UPA address bus 114 and UPA data bus 116 coupled to any UPA port 104 are independent. An address is associated with its data through ordering rules discussed below.

The UPA data bus is a 128-bit quad-word bidirectional data bus, plus 16 additional ECC (error correction code) bits. A "word" is defined herein to be a 32-bit, 4-byte datum. A quad-word consists of four words, or 16 bytes. In some embodiments, all or some of the data busses 116 in the system 100 can be 64-bit double-word bidirectional data busses, each with 8 additional bits for ECC. The ECC bits are divided into two 8-bit halves for the 128-bit wide data bus. Although the 64-bit wide UPA data bus has half as many signal lines, it carries the same number of bytes per transaction as the 128-bit wide UPA data bus, but in twice the number of clock cycles. In the preferred embodiment, the smallest unit of coherent data transfer is 64 bytes, requiring four transfers of 16 bytes during four successive system clock cycles over the 128-bit UPA data bus.

A "master" UPA port, also called a UPA master port, is herein defined to be one which can initiate data transfer transactions. All data processor UPA modules must have a master UPA port 104.

Note that graphics devices, which may include some data processing capabilities, typically have only a slave interface. Slave interfaces are described below. For the purposes of this document, a "data processor" is defined to be a programmable computer or data processing device (e.g., a microprocessor) that both reads and writes data from and to main memory. Most, but not necessarily all, "data processors" have an associated cache memory. For instance, an I/O controller is a data processor and its UPA port will be a master UPA port. However, in many cases an I/O controller will not have a cache memory (or at least not a cache memory for storing data in the coherence domain).

A caching UPA master port is a master UPA port for a data processor that also has a coherent cache. The caching UPA master port participates in the cache coherence protocol.

A "slave" UPA port is herein defined to be one which cannot initiate data transfer transactions, but is the recipient of such transactions. A slave port responds to requests from the System Controller. A slave port has an address space associated with it for programmed I/O. A "slave port" within a master UPA port (i.e., a slave interface within a master UPA port) also handles copyback requests for cache blocks, and handles interrupt transactions in a UPA port which contains a data processor.

Each set of 8 ECC bits carries Shigeo Kaneda's 64-bit SEC-DED-S4ED code. The interconnect does not generate or check ECC. Each UPA port sourcing data generates the corresponding ECC bits, and the UPA port receiving the data checks the ECC bits. UPA ports with master capability support ECC. A slave-only UPA port containing a graphics framebuffer need not support ECC (see the UPA_ECC_Valid signal).

The UPA data bus 116 is not a globally shared common data bus. As shown in FIGS. 1 and 2, there may be more than one UPA data bus 116 in the system, and the precise number is implementation specific. Data is always transferred in units of 16 bytes per clock-cycle on the 128-bit wide UPA data bus, and in units of 16 bytes per two clock-cycles on the 64-bit wide UPA data bus.
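The transfer rates stated above can be checked with a short calculation. The following is a minimal sketch, not part of the described embodiment; the function name `transfer_cycles` and the assumption that no data stall occurs are illustrative only.

```python
# Illustrative sketch: clock cycles needed to move a block over a UPA data bus.
# Assumes no data stalls occur between quad-words.

QUADWORD_BYTES = 16  # one quad-word = four 32-bit words


def transfer_cycles(num_bytes: int, bus_width_bits: int) -> int:
    """Return the system clock cycles needed to transfer num_bytes."""
    if bus_width_bits not in (64, 128):
        raise ValueError("bus width must be 64 or 128 bits")
    if num_bytes <= 0 or num_bytes % QUADWORD_BYTES != 0:
        raise ValueError("block size must be a positive multiple of 16 bytes")
    # 16 bytes take one cycle on the 128-bit bus and two on the 64-bit bus.
    cycles_per_quadword = 128 // bus_width_bits
    return (num_bytes // QUADWORD_BYTES) * cycles_per_quadword


if __name__ == "__main__":
    print(transfer_cycles(64, 128))  # 64-byte cache line, 128-bit bus
    print(transfer_cycles(64, 64))   # same cache line, 64-bit bus
```

Under these assumptions a 64-byte block occupies four cycles on the 128-bit bus and eight cycles on the 64-bit bus, consistent with the statement that the narrower bus carries the same number of bytes in twice the number of clock cycles.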

The size of each cache line in the preferred embodiment is 64 bytes, or sixteen 32-bit words. As will be described below, 64 bytes is the minimum unit of data transfer for all transactions involving the transfer of cached data. That is, each data packet of cached data transferred via the interconnect is 64 bytes. Transfers of non-cached data can transfer 1 to 16 bytes within a single quad-word transmission, qualified with a 16-bit bytemask to indicate which bytes within the quad-word contain the data being transferred.
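As an illustration of how a 16-bit bytemask can qualify a non-cached transfer, the following sketch builds a mask for a contiguous run of bytes within one quad-word. The bit ordering (bit i of the mask selecting byte i of the quad-word) and the restriction to contiguous runs are assumptions made for the example; the text above does not specify either, and the helper name `quadword_bytemask` is hypothetical.

```python
# Illustrative sketch: build a 16-bit bytemask for a non-cached transfer.
# ASSUMPTION: bit i of the mask corresponds to byte i of the quad-word.
# ASSUMPTION: the bytes transferred form one contiguous run.

QUADWORD_BYTES = 16


def quadword_bytemask(offset: int, length: int) -> int:
    """Return a mask with `length` bits set, starting at byte `offset`."""
    if not 0 <= offset < QUADWORD_BYTES:
        raise ValueError("offset must lie within the quad-word")
    if not 1 <= length <= QUADWORD_BYTES - offset:
        raise ValueError("transfer must fit within a single quad-word")
    return ((1 << length) - 1) << offset


if __name__ == "__main__":
    print(f"{quadword_bytemask(0, 16):#06x}")  # all 16 bytes valid
    print(f"{quadword_bytemask(4, 4):#06x}")   # one 32-bit word at byte offset 4
```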

System Controller 110 schedules a data transfer on a UPA data bus 116 using a signal herein called the S_REPLY. For block transfers, if successive quadwords cannot be read or written in successive clock cycles from memory, the UPA_Data_Stall signal is asserted by System Controller 110 to the UPA port.

For coherent block read and copyback transactions of 64-byte data blocks, the quad-word (16 bytes) addressed on physical address bits PA[5:4] is delivered first, and the successive quad-words are delivered in the wrap order shown in Table 2. The addressed quad-word is delivered first so that the requesting data processor can receive and begin processing the addressed quad-word prior to receipt of the last quad-word in the associated data block. In this way, latency associated with the cache update transaction is reduced. Non-cached block reads and block writes of 64-byte data blocks are always aligned on a 64-byte block boundary (PA[5:4]=0x0).
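The selection of the first quad-word from the physical address can be sketched as follows. This is an illustrative example only; the function names are hypothetical, and the sketch deliberately covers only which quad-word is delivered first and the block alignment check, not the order of the remaining quad-words.

```python
# Illustrative sketch: which quad-word of a 64-byte block is delivered first.
# PA[5:4] selects one of the four 16-byte quad-words in the block.


def first_quadword_index(pa: int) -> int:
    """Return PA[5:4], the index (0-3) of the addressed quad-word."""
    return (pa >> 4) & 0x3


def is_block_aligned(pa: int) -> bool:
    """Return True if pa lies on a 64-byte block boundary."""
    return pa & 0x3F == 0


if __name__ == "__main__":
    pa = 0x1028  # hypothetical physical address
    print(first_quadword_index(pa))  # quad-word delivered first
    print(is_block_aligned(pa))      # non-cached block accesses need True
```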

Note that these 64-byte data packets are delivered without an attached address, address tag, or transaction tag. Address information and data are transmitted independently over independent busses. While this is efficient, in order to match up incoming data packets with cache miss data requests an ordering constraint must be applied: data packets must be transmitted to a UPA port in the same order as the corresponding requests within each master class. (There is no ordering requirement for data requests in different master classes.) When this ordering constraint is followed, each incoming data packet must be in response to the longest outstanding cache miss transaction request for the corresponding master class.
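One way to picture this ordering constraint is as a first-in, first-out queue of outstanding requests per master class: because data packets carry no address or tag, an arriving packet is simply paired with the oldest outstanding request in its class. The following is a minimal sketch under that reading; the class name `OutstandingRequests` and its methods are hypothetical and do not describe the circuitry of the preferred embodiment.

```python
# Illustrative sketch: pairing untagged data packets with outstanding requests.
# One FIFO queue per master class; no ordering is kept across classes.
from collections import defaultdict, deque


class OutstandingRequests:
    def __init__(self):
        # Maps master class -> queue of request addresses, oldest first.
        self._queues = defaultdict(deque)

    def issue(self, master_class: int, address: int) -> None:
        """Record a request in the order it was issued."""
        self._queues[master_class].append(address)

    def match_data(self, master_class: int) -> int:
        """Return the address of the longest outstanding request in the class."""
        queue = self._queues[master_class]
        if not queue:
            raise LookupError("data packet arrived with no outstanding request")
        return queue.popleft()


if __name__ == "__main__":
    tracker = OutstandingRequests()
    tracker.issue(0, 0x100)
    tracker.issue(1, 0x200)
    tracker.issue(0, 0x140)
    # Data for class 1 may arrive before data for class 0.
    print(hex(tracker.match_data(1)))
    print(hex(tracker.match_data(0)))
    print(hex(tracker.match_data(0)))
```

Within a class the pairing is strictly oldest-first, while the two classes are drained independently, mirroring the statement that there is no ordering requirement between different master classes.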

TABLE 2 ______________________________________ Quad-word wrap order for block reads on the UPA data bus Address First Qword on Second Qword