WikiPatents - Community Patent Review
Performance optimization and system bus duty cycle reduction by I/O bridge partial cache line writes    
United States Patent 6128711
Link to this page: http://www.wikipatents.com/6128711.html
Inventor(s): Duncan; Samuel Hammond (Arlington, MA), Herdeg; Glenn Arthur (Leominster, MA), Hetherington; Ricky Charles (Westboro, MA), Keefer; Craig Durand (Nashua, NH), Steinman; Maurice Bennet (Marlboro, MA), Guglielmi; Paul Michael (Westboro, MA)
Abstract: A multiprocessor having improved bus efficiency is shown to include a number of processing units and a memory coupled to a system bus. Also coupled to the system bus is at least one I/O bridge system. A method for improving partial cache line writes from I/O devices to the central processing units incorporates a cache coherency protocol and an enhanced invalidation scheme to ensure atomicity while minimizing bus utilization. In addition, a method for allowing peer-to-peer communication between I/O devices coupled to the system bus via different I/O bridges includes a command and address space configuration that allows for communication without the involvement of any central processing device. Interrupt performance is improved through the storage of an interrupt data structure in main memory. The I/O bridges maintain the data structure, and when the CPU is available the interrupts can be accessed by a fast memory read, thereby reducing the requirement of I/O reads for interrupt handling.
   














Owner/Assignee: Compaq Computer Corporation (Houston, TX)
Publication Date: October 3, 2000
Application Number: 08/745,553
Filing Date: November 12, 1996
US Classification: 711/155 711/119 711/120 711/121 711/124 711/133 711/141 711/146 711/159
Examiner: Peikari; B. James
Attorney/Law Firm: Hamilton, Brook, Smith & Reynolds, P.C.
USPTO Field of Search: 395/856 395/894 395/858 711/120 711/121 711/124 711/141 711/146 711/155 711/1 711/5 711/118 711/119 711/138 711/122 711/133 711/159
U.S. References

5737758, Merchant, Apr 1998
5737759, Merchant, Apr 1998
5659709, Quach, Aug 1997
5652859, Mulla et al., Jul 1997
5651137, MacWilliams et al., Jul 1997
5630095, Snyder, May 1997
5623633, Zeller et al., Apr 1997
5572702, Sarangdhar et al., Nov 1996
5572703, MacWilliams et al., Nov 1996
5561799, Jackson et al., Oct 1996
5526512, Arimilli et al., Jun 1996
5515514, Dhuey et al., May 1996
5511226, Zilka, Apr 1996
5428761, Herlihy et al., Jun 1995
5335335, Jackson et al., Aug 1994
5325503, Stevens et al., Jun 1994
5276851, Thacker et al., Jan 1994
5265235, Sindhu et al., Nov 1993
5185861, Valencia, Feb 1993
5155831, Emma et al., Oct 1992
5072369, Theus et al., Dec 1991
4622631, Frank et al., Nov 1986
Claims
 


What is claimed is:

1. A method for communicating between at least one non-cacheable device and a multi-processor computer system, said multi-processor computer system comprising a first memory and a plurality of cacheable devices coupled by a bus, said cacheable devices capable of temporarily storing and modifying data from said memory, said non-cacheable device also coupled to said bus, said method comprising the steps of:

said non-cacheable device issuing, on said bus, a request for write access to data from said memory, said data comprising a portion of an associated cache line;

said non-cacheable device completing said write access to said data from said memory;

said cacheable devices each determining whether they have stored the latest version of said requested data in an associated cache;

responsive to one of said cacheable devices determining that they are storing the latest version of said requested data in said associated cache, said one of said cacheable devices issuing an indicating signal to said devices coupled to said bus;

said non-cacheable device issuing a read/modify command in response to said indicating signal being asserted in a predetermined period of time;

said one of said cacheable devices, having said latest version of said requested data, responsively transferring said latest version of said requested data to said non-cacheable device; and

performing a write operation to the entirety of said cache line in response to said assertion of said indicating signal, said write operation writing data, generated by said read/modify command, to said memory.

2. The method according to claim 1, wherein said cacheable devices each comprise a second memory, said second memory being relatively smaller and faster than said first memory, said second memory comprising a plurality of entries, each of said entries having a fixed data length, and wherein said write access updates a portion of said entry for each transaction, said portion being less than said fixed data length of said entry.

3. The method of claim 1, wherein said multiprocessor computer system further comprises one or more bridge devices, coupled to said bus and to said non-cacheable devices, for providing a communication link between said non-cacheable devices and said bus.
Description
 


FIELD OF THE INVENTION

This invention relates generally to computer systems and more specifically to a method for improving the performance of external device access.

BACKGROUND OF THE INVENTION

As it is known in the art, multi-processor computer systems are designed to accommodate a number of central processing units, coupled via a common system bus or switch to a memory and a number of external Input/Output devices. The purpose of providing multiple central processing units is to increase the performance of operations by sharing tasks between the processors. Such an arrangement allows the computer to simultaneously support a number of different applications while supporting I/O devices that are communicating over a network and displaying images on attached display devices.

To enhance performance, all of the devices coupled to the bus must communicate efficiently. Idle cycles on the system bus represent time periods in which an application is not being supported, and therefore represent reduced performance.

A number of situations arise in multi-processor computer system design in which the bus, although not idle, is not being used efficiently by the processors coupled to the bus. Some of these situations arise due to the differing nature of the devices that are coupled to the bus. For example, central processing units typically include cache logic for temporary storage of data from the memory. A coherency protocol is implemented to ensure that each central processor unit only retrieves the most up to date version of data from the cache. Therefore, central processing units are commonly referred to as `cacheable` devices.

However, external Input/Output (I/O) devices are non-cacheable devices. They typically do not implement the same cache coherency protocol that is used by the CPUs, although measures must also be taken to ensure that they only retrieve valid data for their operations. Typically I/O devices retrieve data from memory, or a cacheable device, via a Direct Memory Access (DMA) operation, in which data is retrieved in a large block. Typically I/O devices also store data to memory via DMA; when the block of data to be stored is less than a cache block the bridge in the coherent domain reads the block and modifies portions of the data, then writes it back to memory via a DMA as a large block. One mechanism used to ensure coherency is to place a `lock` on the data block that is used by the I/O device. When a lock is placed on a data block, other cacheable devices in the system do not have access to that data block for the duration of the lock period. If the I/O device is only updating a portion of the block, then restricting the other cacheable devices from using that block results in unnecessary delay that reduces performance. Thus it would be desirable to provide a method for allowing communication between CPUs and I/O devices at increased performance levels.

Similarly, situations may arise in which one I/O device seeks to communicate with other I/O devices coupled to the system. For example, a graphics device or a network device may require data that is stored on a disk. If that device is coupled to the same I/O bus as the original device, then the transfer may be performed by straightforward transfer between the devices over the I/O bus.

However, typically in large multi-processor systems, there may be more than one I/O bus coupled to the system to accommodate more I/O devices. When an I/O device wants to communicate with an I/O device on another bus it must be accomplished via a system bus transfer. Typically, in such a situation, the I/O device issues a DMA transaction to the system, which stores the data in system memory temporarily. Then one of the CPUs issues an I/O write to transfer the contents of the system memory to the I/O device on the second I/O bus. Such an arrangement utilizes system bus bandwidth and CPU compute cycles in an undesirable manner.

A further performance problem arises as a result of system interrupts. An interrupt is a mechanism used by the system to indicate to the CPU that an event has occurred that requires attention or repair. Typically, interrupts are used to indicate to the CPU that a transaction has completed, that a service has been requested or, on rare occasions, that a hard or soft error has occurred at the I/O device. In addition, interrupts can be used to mark the occurrence of an event, such as the end of a time interval. When the interrupt event occurs, an interrupt signal is forwarded to the CPU. At the end of an instruction sequence, if the interrupt signal is asserted, the CPU will halt execution of further instructions and service the interrupt.

Usually there are a number of interrupt event conditions, and each of the conditions is saved as one bit of an interrupt vector that is stored in an interrupt register. The occurrence of an interrupt event causes a signal to be asserted, and the signal assertion is logged in the appropriate location of the interrupt register. The interrupt signal is monitored by the CPU to determine which interrupts have occurred and their priority relative to the active process executing on the CPU.

If the interrupt is associated with the CPU, the interrupt register is readily available for examination and determination of the proper interrupt handling process. However, if the interrupt is associated with an I/O device the interrupt register is stored at the I/O device. The I/O device issues an interrupt signal to the I/O interface, which stores an interrupt status bit for each device. The CPU must periodically examine the interrupt status register of the I/O interfaces to determine which device had an interrupt. The CPU then fetches the interrupt vector from the indicated I/O device and handles the interrupt. This process for determining interrupt conditions suffers performance disadvantages because valuable compute cycles are wasted while the CPU fetches the interrupt vector.
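The polling sequence described above can be sketched in a few lines of Python. This is only an illustration of the prior-art flow, not logic taken from the patent: the names `IODevice`, `IOInterface` and `poll_and_service` are hypothetical, and the I/O read counter is a simplified stand-in for the compute cycles the passage says are wasted.

```python
class IODevice:
    """Hypothetical I/O device that holds its own interrupt register."""

    def __init__(self):
        self.interrupt_register = 0  # one bit per interrupt condition

    def raise_event(self, bit):
        # Log the event in the appropriate bit of the interrupt register.
        self.interrupt_register |= 1 << bit


class IOInterface:
    """Hypothetical I/O interface keeping one status bit per attached device."""

    def __init__(self, devices):
        self.devices = devices

    def status(self):
        bits = 0
        for index, device in enumerate(self.devices):
            if device.interrupt_register:
                bits |= 1 << index
        return bits


def poll_and_service(interface):
    """Return (serviced, io_reads) for one polling pass by the CPU."""
    io_reads = 1  # the CPU reads the interface's interrupt status register
    status = interface.status()
    serviced = []
    for index, device in enumerate(interface.devices):
        if status & (1 << index):
            io_reads += 1  # the CPU fetches the vector from the device
            serviced.append((index, device.interrupt_register))
            device.interrupt_register = 0  # interrupt handled
    return serviced, io_reads
```

In this simplified model every pending device costs one additional I/O read on top of the status read, which is the overhead the background section identifies.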

Accordingly, it can be seen that there are a number of situations that may arise during the operation of a multi-processor computer system that decrease the efficiency of the system bus. Therefore it would be desirable to determine a method or apparatus that would provide increased multi-processor performance through improved utilization of system bus bandwidth.

SUMMARY OF THE INVENTION

According to one aspect of the invention, a method for communicating between at least one non-cacheable device and a multi-processor computer system is described. The multi-processor computer system includes a first memory and a plurality of cacheable devices coupled by a bus or switch, where the cacheable devices are capable of temporarily storing and modifying data from the memory. The non-cacheable device is also coupled to the bus. To provide communication between a non-cacheable device and the multiprocessor computer system, the following steps are performed. The non-cacheable device issues, on the bus, a request for write access to data from the memory. The cacheable devices monitor the bus and each check the request to determine whether they have stored the latest version of the requested data. If one of the cacheable devices determines that it is storing the latest version of the requested data, that cacheable device issues an indicating signal to the devices coupled to the bus. In response to the indicating signal, the non-cacheable device issues a read/modify command. If none of the cacheable devices determines that it is storing the latest version of the data, the non-cacheable device simply completes its access to memory.

With such an arrangement, existing cache coherency logic may be used to increase efficiency of write operations by non-cacheable devices. The integrity of the data block is ensured by the atomicity of the I/O operation and an appropriate invalidation protocol. As a result, performance is increased for partial cache line writes by I/O devices.
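The decision flow summarized above can be modeled roughly as follows. This is a sketch under stated assumptions rather than the patented implementation: the command names (`WRITE`, `DIRTY_HIT`, `READ_MODIFY`, `WRITE_LINE`), the `Cache` class, the 64-byte line size and the choice to invalidate every cached copy after the write are all illustrative.

```python
LINE_SIZE = 64  # bytes per cache line (illustrative assumption)


class Cache:
    """Hypothetical CPU cache: line address -> [data, dirty flag]."""

    def __init__(self):
        self.lines = {}

    def has_latest(self, addr):
        # Snoop check: does this cache hold a modified (dirty) copy?
        entry = self.lines.get(addr)
        return entry is not None and entry[1]


def partial_line_write(memory, caches, addr, offset, data):
    """Perform a partial cache line write and return the bus commands used."""
    assert offset + len(data) <= LINE_SIZE
    commands = ["WRITE"]  # request for write access, seen by every snooper
    owner = next((c for c in caches if c.has_latest(addr)), None)
    if owner is None:
        # No indicating signal: the write simply completes against memory.
        line = bytearray(memory[addr])
    else:
        # Indicating signal asserted: fetch the latest copy from the owner.
        commands += ["DIRTY_HIT", "READ_MODIFY"]
        line = bytearray(owner.lines[addr][0])
    line[offset:offset + len(data)] = data  # merge in the partial update
    for cache in caches:
        cache.lines.pop(addr, None)  # assumed invalidation of stale copies
    memory[addr] = bytes(line)
    if owner is not None:
        commands.append("WRITE_LINE")  # full-line write of the merged data
    return commands
```

The point of the sketch is that the longer read/modify path is taken only when a cache signals that it holds the latest copy; otherwise the write finishes with a single bus command.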

BRIEF DESCRIPTION OF THE DRAWINGS

The above-mentioned and other features of the invention will now become more apparent by reference to the following description taken in connection with the accompanying drawings in which:

FIG. 1 is a block diagram of a multi-processor computer system according to the present invention;

FIG. 2 is a block diagram of a cache control subsystem of a central processor unit of the multi-processor computer system of FIG. 1;

FIG. 3A is a flow diagram illustrating a cache write protocol for providing improved coherency of writes from non-cacheable devices to cacheable devices;

FIGS. 3B and 3C are timing diagrams for illustrating the operation of the cache write protocol of FIG. 3A;

FIG. 4 is a block diagram of a multi-processor computer system similar to FIG. 1 for illustrating a prior art communication technique between I/O devices coupled to different I/O bridges;

FIG. 5 illustrates the address space encodings for the memory and I/O subsystems of the computer systems of FIG. 1;

FIG. 6 illustrates a translation flow from a system space address to an I/O space address in the computer system of FIG. 1;

FIG. 7 is a high-level block diagram of a computer system similar to FIG. 1, for illustrating a communication between I/O devices coupled to different I/O bridges;

FIG. 8 illustrates logic included in a PCI bridge of the computer system of FIGS. 7 or 1, the logic determining a target hit for I/O write purposes;

FIG. 9 illustrates logic included in a PCI bridge of the computer system of FIGS. 7 or 1, the logic for translating via PCI address space and dense system address space;

FIG. 10 illustrates logic included in a PCI bridge of the computer system of FIGS. 7 or 1, the logic for translating via PCI address space and sparse system address space;

FIG. 11 is a flow diagram for illustrating the protocol used for allowing peer-to-peer communication between I/O devices coupled to different I/O busses;

FIGS. 12A and 12B illustrate the address space allocation for an interrupt data structure stored in a system memory of a multi-processor computer system such as that shown in FIGS. 7 or 1;

FIG. 13 illustrates one entry of the data structure allocated according to the address space considerations of FIG. 12;

FIG. 14 is a high-level diagram for illustrating the interrupt mechanisms provided in the multiprocessor computer systems of FIG. 7 or FIG. 1;

FIG. 15 is a block diagram of a portion of the multi-processor system of FIG. 7, including an exploded view of the logic in the Command and Address Path (CAP) unit of the PCI bridge; and

FIG. 16 is a flow diagram illustrating the process used for signalling interrupts to the central processing units in the multi-processor computer system of FIGS. 7 or 1.

DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring now to FIG. 1, a multi-processor computer system 10 is shown to include a plurality of Central Processor Units (CPU) 12-18 coupled together via a system bus 20. The system bus is shown to include two portions, a data and error correction code (ECC) portion 20a and a command/address portion 20b. Also coupled to the system bus is Input/Output (I/O) module 22. I/O module 22 is shown to include two (or more) I/O bridge units 24 and 26, which are each individually coupled to the system bus 20, also referred to as the MC bus.

The I/O bridge chips 24 and 26 here interface the system bus 20 to respective buses 24a and 26a, each of which operates according to the Peripheral Component Interconnect (PCI) bus protocol. As is known in the art, the PCI bus typically comprises 64 or 32 bits, for transferring data at a rate of either 267 MB/sec or 133 MB/sec respectively. Here, each of the PCI buses 24a and 26a comprises 64 bits of data.

The PCI buses are forwarded to an external device module 31. The external device module comprises a number of expansion slots into which external devices that communicate via the PCI protocol may be connected. In addition, PCI bus 26a is coupled to an EISA bridge chip 32. The EISA bus is another well known bus protocol to which various other external devices are designed. PCI and EISA expansion slots 36b are included on the external device module 31 to support 4 PCI devices and 3 EISA devices.

The output from the EISA bridge chip is fed to a transceiver 38 to provide xbus 39. Coupled to xbus 39 are various devices communicating via the EISA protocol, such as a real time clock, a keyboard, a mouse, and an operator control panel. In addition, also coupled to xbus 39 is a combo chip 42. The combo chip 42 may be used to provide data to a floppy drive and to receive network data from a serial and a parallel network port.

Coupled to PCI bus 24a is a Small Computer System Interface (SCSI) chip 34. This SCSI chip is used to couple PCI bus 24a to other external devices such as tape drives or CD-ROM drives.

Multi-processor computer system 10 is shown to be a self contained unit comprising a plurality of processors that may be used to handle accesses from the devices coupled to external device module 31 via the I/O module 22. It should be noted, however, that the present invention is not limited to the arrangement or existence of certain ones of the external devices on the external device module 31. Rather, it will become apparent after a thorough reading of this specification that the techniques described may be equally advantageous in other multi-processor system configurations.

Also coupled to system bus 20 is main memory 30. The main memory is a resource that may be accessed by any of the central processing units 12-18 or the external devices via the I/O interface 22. The system bus 20 comprises 128 bits of data and 16 bits of error correction code. Thus, for each bus transaction, the main memory must be able to provide, or be able to consume, 128 bits of data per cycle in order to maximize system bus utilization and thereby maximize performance.

SYSTEM BUS COMMUNICATION PROTOCOL

The system bus 20 runs synchronously with the CPU and the memory, with the bus cycle time being a multiple of the CPU clock. Central arbitration of the system bus 20 is controlled by arbitration logic 35. Associated with each of the devices coupled to the system bus is a module identifier (MID). As each component forwards data onto the system bus, its module identifier is read by the arbitration logic 35. The arbitration logic determines, based on the module identifier and the pending request, which component has control of the system bus 20 for a given transaction.

The protocol used to control arbitration is modifiable. This is because various arrangements and numbers of processor cards, or PCI cards, may be coupled to the bus, and as a result various types of arbitration protocols may be more appropriate to provide the highest level of performance.

For example, one arbitration mode is strict round robin, where device order is granted first to the PCI bridge and then to each successive PCI device in order. Another type of arbitration is referred to as the modified round robin type of arbitration. In this type of arbitration, the PCI bridge wins control of the PCI bus for every other transaction; thus the order would be the PCI device then the MC-PCI bus bridge, then the PCI device and then the MC-PCI bus bridge, etc.
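The two grant orders described above can be illustrated with a short Python sketch. It assumes, for simplicity, that every requester always has a request pending; the function names and the string labels for the bridge and devices are hypothetical.

```python
from itertools import cycle, islice


def strict_round_robin(bridge, devices):
    """Grant the bridge first, then each device in order, and repeat."""
    return cycle([bridge, *devices])


def modified_round_robin(bridge, devices):
    """Grant the bridge every other transaction, alternating with the devices."""
    for device in cycle(devices):
        yield device
        yield bridge


# Example: the first six grants under each mode.
strict = list(islice(strict_round_robin("bridge", ["dev0", "dev1"]), 6))
modified = list(islice(modified_round_robin("bridge", ["dev0", "dev1"]), 6))
```

Under these assumptions the bridge receives one grant in three with two devices in strict mode, but one grant in two in modified mode regardless of how many devices are attached.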

Other methods of arbitration could similarly be used in the present invention as will become apparent upon further reading of this specification, and therefore it should be recognized that the method of arbitration should not be viewed as a limiting element.

Suffice it to say that data is transferred on the system bus in four consecutive cycles during either a write, read, or fill transaction. Write data is always driven in four preassigned data cycles relative to the start of the write transaction. Read data is returned either in four preassigned data cycles relative to the start of the read transaction (non-pended read) or at a later time during a separate system bus fill transaction (pended read). Fill data is always driven in four preassigned data cycles relative to the start of the fill transaction.

A dead cycle will usually be inserted on the system bus after each set of four data cycles to allow for tri-state turn-on and turn-off at each of the elements coupled to the bus. One exception to this dead cycle occurs when two non-pended reads are returned from the same RAM device in main memory 30. In this case, the dead cycle is not inserted and the data can be issued at the maximum system bus bandwidth of 1.066 gigabytes/second.
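The bandwidth figures can be checked with a little arithmetic. The sketch below takes the 128-bit data width and the 1.066 gigabyte/second peak from the text; the bus clock is not stated in the source and is only inferred here from those two numbers, and the one-dead-cycle-per-four-data-cycles pattern is the usual case described above.

```python
BUS_WIDTH_BITS = 128          # data width stated in the text
PEAK_BYTES_PER_SEC = 1.066e9  # peak bandwidth stated in the text
DATA_CYCLES = 4               # data cycles per transaction
DEAD_CYCLES = 1               # usual dead cycle between transactions

bytes_per_cycle = BUS_WIDTH_BITS // 8
bytes_per_transaction = bytes_per_cycle * DATA_CYCLES

# Inferred, not stated in the source: the clock implied by the peak figure.
implied_clock_hz = PEAK_BYTES_PER_SEC / bytes_per_cycle

# With one dead cycle after every four data cycles, only 4 of 5 cycles carry data.
effective_bytes_per_sec = PEAK_BYTES_PER_SEC * DATA_CYCLES / (DATA_CYCLES + DEAD_CYCLES)
```

Four data cycles of 16 bytes each move 64 bytes per transaction, and the dead cycle lowers sustained throughput to about 80% of the peak, roughly 0.85 gigabytes/second.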

During operation, when any of the elements coupled to system bus 20 require access to the bus, each such node sends a request signal to the arbitration logic 35 on MC_REQ lines 36. The arbiter sends a grant signal to each of the eight nodes on the corresponding signals MC_Grant_L 7-0.

Each of the elements coupled to the bus drives only one of the request lines and reads only one of the grant lines. When none of the request lines are asserted, the central arbiter 35 asserts the grant line to the last element that was granted the system bus. This behavior is referred to as bus parking. Note that even when the bus is granted to a node, a new transaction will not begin until the request line is asserted.

When a new transaction begins on the system bus, the arbiter 35 asserts the signal MC_CA. On an idle system bus, this signal may be asserted either one or three cycles after the request line is asserted. If the grant line was already asserted as the result of the bus parking feature described above, then when the request line is asserted, the arbiter 35 will assert the MC_CA signal one cycle later. If the grant signal was not asserted on an idle bus cycle, then when the request line is asserted the arbiter will assert the MC_CA signal three cycles later.
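The effect of bus parking on start-up latency can be modeled with a small Python sketch. It covers only the idle-bus case described above; the `Arbiter` class and its `request` method are hypothetical names, and contention between simultaneous requesters is deliberately left out.

```python
class Arbiter:
    """Toy model of bus parking on an otherwise idle system bus."""

    def __init__(self):
        self.parked_node = None  # node whose grant line is currently asserted

    def request(self, node):
        """Return the cycles from request assertion to command/address assertion."""
        # A parked node already holds the grant, so the transaction starts sooner.
        latency = 1 if node == self.parked_node else 3
        self.parked_node = node  # the grant stays with the last node granted
        return latency
```

For example, a node that issues back-to-back requests pays three cycles for the first and one cycle for each subsequent request, until another node takes the bus.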

PARTIAL CACHE LINE WRITES

Referring now to FIG. 2, an example one of the CPU cards 12 is shown. The CPU card is shown to include a processor chip 40, which, for example, may be an Alpha® 21164 CPU chip manufactured by Digital Equipment Corporation™. The processor chip is shown to include processor logic 42 coupled to receive instructions from a primary cache that includes instruction store 44 and data store 46. In this version of the processor chip, a secondary cache 48 is included on chip for providing data to the respective instruction and data stores to reduce the time required to obtain data from external memory.

The processor chip additionally includes a tag store 50 and a third level cache, here referred to as B-cache 52. As in the secondary cache, the B-cache is for temporary storage of large blocks of data that are retrieved from the main memory 30 (FIG. 1). By temporarily storing large portions of data in the B-cache, the period of time that the processor must wait to retrieve data and instructions from the main memory can be reduced to provide higher processor performance.

The B-cache is apportioned into blocks of data (also referred to as cache lines), where each block of data may comprise, for example, 64 bytes of data. Associated with each block of data in the B-cache 52 is an entry in tag store 50. The tag store includes, for each entry, a group of status bits including a valid bit, a shared bit, and a dirty bit. The valid bit indicates that data in the block is valid and may be used by the associated processor. The shared bit indicates whether or not that data has been loaded into the B-cache of more than one processor. If data is shared between processors, care must be taken to ensure that one processor does not make a modification that is not reflected to the other processor. The dirty bit indicates whether or not that block of data has been modified by the associated processor.

Table I below illustrates legal assertions of the Valid, Shared and Dirty bits for each Tag store entry, how those entry values indicate that the cache should control that cache entry, and how other processors treat a cache entry with those bits set in the tag store of another cache:

TABLE I
______________________________________________________________________________
Valid  Shared  Dirty  Associated Processor              Other Processors
______________________________________________________________________________
1      0       0      may read freely,                  invalid
                      cannot write
1      0       1      may read freely,                  invalid
                      may write freely,
                      has most up-to-date copy
1      1       0      may read freely,                  may read freely from cache,
                      must broadcast writes             must broadcast write
1      1       1      may read freely,                  may read freely from cache,
                      must broadcast write,             must broadcast write
                      has most up-to-date copy
______________________________________________________________________________

Before a processor can mark an entry as DIRTY (i.e. modify an entry) it must first broadcast invalidates to all the other cacheable devices in the system. In addition, any device that attempts to read that block of data from memory must retrieve the Dirty block of data from the modifying processor in order to maintain data coherency.
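As an illustration, the rows of Table I can be encoded as a lookup keyed on the (Valid, Shared, Dirty) bits. The dictionary layout, the field names and the `owner_rights` helper are hypothetical; only the permissions themselves are taken from the table, and combinations the table does not list are treated here as having no entry.

```python
# (valid, shared, dirty) -> rights of the associated (owning) processor,
# transcribed from Table I.
TABLE_I = {
    (1, 0, 0): {"read": True, "write": "not permitted", "latest_copy": False},
    (1, 0, 1): {"read": True, "write": "free", "latest_copy": True},
    (1, 1, 0): {"read": True, "write": "broadcast", "latest_copy": False},
    (1, 1, 1): {"read": True, "write": "broadcast", "latest_copy": True},
}


def owner_rights(valid, shared, dirty):
    """Return the Table I row for these status bits, or None if it is not listed."""
    return TABLE_I.get((int(valid), int(shared), int(dirty)))


def must_supply_data(valid, shared, dirty):
    """True when this cache holds the most up-to-date copy and must answer reads."""
    row = owner_rights(valid, shared, dirty)
    return bool(row and row["latest_copy"])
```

A snooping cache could use a check like `must_supply_data` to decide whether it, rather than main memory, has to source a block that another device is reading.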

Here, in order to maintain cache coherency, each device that includes a cache and is coupled to bus 20 operates according to a snoopy protocol. According to the snoopy protocol, the bus control logic 50 monitors or `snoops` the bus to determine the type and address of requests being made of memory. Thus when processors execute requests over the system bus 20 to the arbiter 35 for access to main memory, each element that is coupled to the system bus monitors the transaction to determine whether or not these transactions will affect the content of the data stored in their B-cache. The use of a duplicate tag store 54 facilitates a determination as to the contents of the B-cache of the processor.

As discussed previously, both the PCI cards and the CPU cards are coupled to exchange data with main memory 30. However, I/O devices typically, for cost reasons, do not include the complex memory management architecture that is found on the central processing card. As such, when the I/O devices seek to access a block of data, they must make sure that they have the correct, updated version of the data and that other devices do not interfere with their transaction.

Atomic operations are typically used to ensure coherency for I/O devices. During atomic operations, no other devices are allowed to access the system bus between a read, modify and write operation. Rather, the I/O device has exclusive access to the system bus for the period of the atomic operation. However, atomic operations reduce performance because typically the I/O device is not modifying the entire block, yet other devices are prohibited from accessing memory until the end of the atomic transaction.

Performance is further reduced for instances in which the I/O device only writes a portion of a cache line for each operation. When only 16 bytes of a 64 byte line are modified by the I/O device, a Read/Modify/Write operation must be performed to accurately update the system memory contents. As a result, other devices are precluded from accessing memory for an even longer period of time for these partial cache line write operations. In addition, it is typically the case that the I/O device only updates a portion of the cache line, and thus this performance hit is incurred on the majority of I/O accesses.
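The cost of the conventional Read/Modify/Write sequence for the 16-byte example above can be made concrete with a short sketch. The function name, the dictionary standing in for main memory and the byte counter are assumptions for illustration; the 64-byte line and 16-byte update come from the text.

```python
LINE_BYTES = 64  # cache line size used in the example above


def read_modify_write(memory, line_addr, offset, data):
    """Update part of a line the conventional way; return bytes moved on the bus."""
    assert offset + len(data) <= LINE_BYTES
    old = memory[line_addr]                                     # read the full line
    merged = old[:offset] + data + old[offset + len(data):]     # modify a portion
    memory[line_addr] = merged                                  # write the full line
    return 2 * LINE_BYTES  # one full-line read plus one full-line write


memory = {0: bytes(range(LINE_BYTES))}
moved = read_modify_write(memory, 0, 16, b"\xff" * 16)
useful_fraction = 16 / moved
```

In this model 128 bytes cross the bus to change 16, so only one eighth of the transferred data is new, and under the atomic scheme described above other devices are locked out for the whole sequence.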

The present invention improves the perfo