| United States Patent | 6128711 |
| Inventor(s) | Duncan; Samuel Hammond (Arlington, MA), Herdeg; Glenn Arthur (Leominster, MA), Hetherington; Ricky Charles (Westboro, MA), Keefer; Craig Durand (Nashua, NH), Steinman; Maurice Bennet (Marlboro, MA), Guglielmi; Paul Michael (Westboro, MA) |
| Abstract | A multiprocessor having improved bus efficiency is shown to include a
number of processing units and a memory coupled to a system bus. Also
coupled to the system bus is at least one I/O bridge system. A method
for improving partial cache line writes from I/O devices to the central
processing units incorporates cache coherency protocol and an enhanced
invalidation scheme to ensure atomicity while minimizing the bus
utilization. In addition, a method for allowing peer-to-peer communication
between I/O devices coupled to the system bus via different I/O bridges
includes a command and address space configuration that allows for
communication without the involvement of any central processing device.
Interrupt performance is improved through the storage of an interrupt data
structure in main memory. The I/O bridges maintain the data structure, and
when the CPU is available the interrupts can be accessed by a fast memory
read; thereby reducing the requirement of I/O reads for interrupt
handling. |
Performance optimization and system bus duty cycle reduction by I/O
bridge partial cache line writes
| Publication Date | October 3, 2000 |
| Filing Date | November 12, 1996 |
Description  |
FIELD OF THE INVENTION
This invention relates generally to computer systems and more specifically to a method for improving the performance of external device access.
BACKGROUND OF THE INVENTION
As it is known in the art, multi-processor computer systems are designed to accommodate a number of central processing units, coupled via a common system bus or switch to a memory and a number of external Input/Output devices. The purpose of
providing multiple central processing units is to increase the performance of operations by sharing tasks between the processors. Such an arrangement allows the computer to simultaneously support a number of different applications while supporting I/O
devices that are communicating over a network and displaying images on attached display devices.
To enhance performance, all of the devices coupled to the bus must communicate efficiently. Idle cycles on the system bus represent time periods in which an application is not being supported, and therefore represent reduced performance.
A number of situations arise in multi-processor computer system design in which the bus, although not idle, is not being used efficiently by the processors coupled to the bus. Some of these situations arise due to the differing nature of the
devices that are coupled to the bus. For example, central processing units typically include cache logic for temporary storage of data from the memory. A coherency protocol is implemented to ensure that each central processor unit only retrieves the
most up to date version of data from the cache. Therefore, central processing units are commonly referred to as `cacheable` devices.
However, external Input/Output (I/O) devices are non-cacheable devices. They typically do not implement the same cache coherency protocol that is used by the CPUs, although measures must also be taken to ensure that they only retrieve valid data
for their operations. Typically I/O devices retrieve data from memory, or a cacheable device, via a Direct Memory Access (DMA) operation, in which data is retrieved in a large block. Typically I/O devices also store data to memory via DMA; when the
block of data to be stored is less than a cache block the bridge in the coherent domain reads the block and modifies portions of the data, then writes it back to memory via a DMA as a large block. One mechanism used to ensure coherency is to place a
`lock` on the data block that is used by the I/O device. When a lock is placed on a data block, other cacheable devices in the system do not have access to that data block for the duration of the lock period. If the I/O device is only updating a
portion of the block, then restricting the other cacheable devices from using that block results in unnecessary delay that reduces performance. Thus it would be desirable to provide a method for allowing communication between CPUs and I/O devices at
increased performance levels.
Similarly, situations may arise in which one I/O device seeks to communicate with other I/O devices coupled to the system. For example, a graphics device or a network device may require data that is stored on a disk. If that device is coupled
to the same I/O bus as the original device, then the transfer may be performed by straightforward transfer between the devices over the I/O bus.
However, typically in large multi-processor systems, there may be more than one I/O bus coupled to the system to accommodate more I/O devices. When an I/O device wants to communicate with an I/O device on another bus it must be accomplished via
a system bus transfer. Typically, in such a situation, the I/O device issues a DMA transaction to the system, which stores the data in system memory temporarily. Then one of the CPUs issues an I/O write to transfer the contents of the system memory to
the I/O device on the second I/O bus. Such an arrangement utilizes system bus bandwidth and CPU compute cycles in an undesirable manner.
A further performance problem arises as a result of system interrupts. Interrupts are a mechanism used by the system to indicate to the CPU that an event has occurred that requires attention or repair. Typically, interrupts are used
for indicating to the CPU that a transaction has completed, that a service has been requested or, on rare occasion, for a hard or soft error at the I/O device. In addition, interrupts can be used to mark an occurrence of an event, such as the end of a
time interval. When the interrupt event occurs, an interrupt signal is forwarded to the CPU. At the end of an instruction sequence, if the interrupt signal is asserted the CPU will halt execution of further instructions and service the interrupt.
Usually there are a number of interrupt event conditions, and each of the conditions is saved as one bit of an interrupt vector that is stored in an interrupt register. The occurrence of an interrupt event causes a signal to be asserted, and the
signal assertion is logged in the appropriate location of the interrupt register. The interrupt signal is monitored by the CPU to determine which interrupts have occurred and their priority relative to the active process executing on the CPU.
If the interrupt is associated with the CPU, the interrupt register is readily available for examination and determination of the proper interrupt handling process. However, if the interrupt is associated with an I/O device the interrupt
register is stored at the I/O device. The I/O device issues an interrupt signal to the I/O interface, which stores an interrupt status bit for each device. The CPU must periodically examine the interrupt status register of the I/O interfaces to
determine which device had an interrupt. The CPU then fetches the interrupt vector from the indicated I/O device and handles the interrupt. This process for determining interrupt conditions suffers performance disadvantages because valuable compute
cycles are wasted while the CPU fetches the interrupt vector.
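The polling sequence described above can be sketched as follows. This is a minimal illustration of the prior-art flow, not logic taken from the specification; the names `IODevice`, `IOInterface`, and `poll_interrupts`, and the way I/O reads are counted, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class IODevice:
    vector: int = 0  # interrupt vector held in the device's interrupt register


@dataclass
class IOInterface:
    devices: list = field(default_factory=list)
    status: int = 0  # one interrupt status bit per attached device

    def raise_interrupt(self, index, vector):
        # The device logs its vector and the interface sets its status bit.
        self.devices[index].vector = vector
        self.status |= 1 << index


def poll_interrupts(interface):
    """Return (handled interrupts, number of I/O reads the CPU performed)."""
    io_reads = 1  # one read of the interface's interrupt status register
    status = interface.status
    handled = []
    for index, device in enumerate(interface.devices):
        if status & (1 << index):
            handled.append((index, device.vector))  # fetch vector from device
            io_reads += 1
            interface.status &= ~(1 << index)
    return handled, io_reads
```

Every pending interrupt costs at least one additional I/O read on top of the status-register read, which is the overhead the passage identifies.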
Accordingly, it can be seen that there are a number of situations that may arise during the operation of a multi-processor computer system that decrease the efficiency of the system bus. Therefore it would be desirable to provide a method or
apparatus that would provide increased multi-processor performance through improved utilization of system bus bandwidth.
SUMMARY OF THE INVENTION
According to one aspect of the invention a method for communicating between at least one non-cacheable device and a multi-processor computer system is described. The multi-processor computer system includes a first memory and a plurality of
cacheable devices coupled by a bus or switch, where the cacheable devices are capable of temporarily storing and modifying data from the memory. The non-cacheable device is also coupled to the bus. To provide communication between a non-cacheable
device and the multiprocessor computer system, the following steps are performed. The non-cacheable device issues, on the bus, a request for write access to data from the memory. The cacheable devices monitor the bus and each checks the request to determine
whether it has stored the latest version of the requested data. If one of the cacheable devices determines that it is storing the latest version of the requested data, that cacheable device issues an indicating signal to the devices coupled to the
bus. In response to the indicating signal, the non-cacheable device issues a read/modify command. If none of the cacheable devices determines that it is storing the latest version of the data, the non-cacheable device simply completes its access to
memory.
With such an arrangement, existing cache coherency logic may be used to increase efficiency of write operations by non-cacheable devices. The integrity of the data block is ensured by the atomicity of the I/O operation and an appropriate
invalidation protocol. As a result, performance is increased for partial cache line writes by I/O devices.
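The decision made in the summarized method can be sketched in a few lines. This is a simplified model under stated assumptions, not the claimed implementation: the `Cache` class, the `partial_write` function, the dictionary used as memory, and the choice to drop the owner's copy after the merge are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    # address -> bytearray holding a line this cache has modified (assumed model)
    dirty_lines: dict = field(default_factory=dict)

    def snoop(self, address):
        """Return True (the 'indicating signal') if this cache has the latest copy."""
        return address in self.dirty_lines


def partial_write(memory, caches, address, offset, data):
    """Write `data` at `offset` in the line at `address`; return the path taken."""
    owner = next((c for c in caches if c.snoop(address)), None)
    if owner is not None:
        # A cache signalled that it holds the latest version, so the device
        # falls back to a read/modify sequence on that copy.
        line = owner.dirty_lines.pop(address)  # assumed invalidation of the owner
        path = "read/modify"
    else:
        # No cache responded: memory is current and the write simply completes.
        line = memory[address]
        path = "direct"
    line[offset:offset + len(data)] = data
    memory[address] = line
    return path
```

The point of the sketch is that the slower read/modify path is taken only when a snooping cache reports that it holds the latest data.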
BRIEF DESCRIPTION OF THE DRAWINGS
The above-mentioned and other features of the invention will now become more apparent by reference to the following description taken in connection with the accompanying drawings in which:
FIG. 1 is a block diagram of a multi-processor computer system according to the present invention;
FIG. 2 is a block diagram of a cache control subsystem of a central processor unit of the multi-processor computer system of FIG. 1;
FIG. 3A is a flow diagram illustrating a cache write protocol for providing improved coherency of writes from non-cacheable devices to cacheable devices;
FIGS. 3B and 3C are timing diagrams for illustrating the operation of the cache write protocol of FIG. 3A;
FIG. 4 is a block diagram of a multi-processor computer system similar to FIG. 1 for illustrating a prior art communication technique between I/O devices coupled to different I/O bridges;
FIG. 5 illustrates the address space encodings for the memory and I/O subsystems of the computer systems of FIG. 1;
FIG. 6 illustrates a translation flow from a system space address to an I/O space address in the computer system of FIG. 1;
FIG. 7 is a high-level block diagram of a computer system similar to FIG. 1, for illustrating a communication between I/O devices coupled to different I/O bridges;
FIG. 8 illustrates logic included in a PCI bridge of the computer system of FIGS. 7 or 1, the logic determining a target hit for I/O write purposes;
FIG. 9 illustrates logic included in a PCI bridge of the computer system of FIGS. 7 or 1, the logic for translating via PCI address space and dense system address space;
FIG. 10 illustrates logic included in a PCI bridge of the computer system of FIGS. 7 or 1, the logic for translating via PCI address space and sparse system address space;
FIG. 11 is a flow diagram for illustrating the protocol used for allowing peer-to-peer communication between I/O devices coupled to different I/O busses;
FIGS. 12A and 12B illustrate the address space allocation for an interrupt data structure stored in a system memory of a multi-processor computer system such as that shown in FIGS. 7 or 1;
FIG. 13 illustrates one entry of the data structure allocated according to the address space considerations of FIG. 12;
FIG. 14 is a high-level diagram for illustrating the interrupt mechanisms provided in the multiprocessor computer systems of FIG. 7 or FIG. 1;
FIG. 15 is a block diagram of a portion of the multi-processor system of FIG. 7, including an exploded view of the logic in the Command and Address Path (CAP) unit of the PCI bridge; and
FIG. 16 is a flow diagram illustrating the process used for signalling interrupts to the central processing units in the multi-processor computer system of FIGS. 7 or 1.
DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring now to FIG. 1, a multi-processor computer system 10 is shown to include a plurality of Central Processor Units (CPU) 12-18 coupled together via a system bus 20. The system bus is shown to include two portions, a data and error
correction code (ECC) portion 20a and a command/address portion 20b. Also coupled to the system bus is Input/Output (I/O) module 22. I/O module 22 is shown to include two (or more) I/O bridge units 24 and 26, which are each individually coupled to the
system bus 20, also referred to as the MC bus.
The I/O bridge chips 24 and 26 here interface the system bus 20 to respective buses 24a and 26a, each of which operates according to the Peripheral Component Interconnect (PCI) bus protocol. As it is known in the art, the PCI bus typically comprises
64 or 32 bits, for transferring data at a rate of either 267 MB/sec or 133 MB/sec respectively. Here, each of the PCI buses 24a and 26a comprises 64 bits of data.
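The two transfer rates quoted above follow from the bus width multiplied by the bus clock. The short calculation below reproduces them, assuming the nominal 33.33 MHz PCI clock; that clock value is not stated in the text and is supplied here as an assumption, as is the helper name `peak_mb_per_sec`.

```python
PCI_CLOCK_HZ = 33_333_333  # assumed nominal PCI clock; not given in the text


def peak_mb_per_sec(bus_width_bits, clock_hz=PCI_CLOCK_HZ):
    """Peak transfer rate in MB/sec for one transfer per clock cycle."""
    bytes_per_cycle = bus_width_bits // 8
    return bytes_per_cycle * clock_hz / 1e6


rate_64 = round(peak_mb_per_sec(64))  # 64-bit bus
rate_32 = round(peak_mb_per_sec(32))  # 32-bit bus
```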
The PCI buses are forwarded to an external device module 31. The external device module comprises a number of expansion slots into which external devices that communicate via the PCI protocol may be connected. In addition, PCI bus 26a is
coupled to an EISA bridge chip 32. The EISA bus is another well known bus protocol to which various other external devices are designed. PCI and EISA expansion slots 36b are included on the external device module 31 to support 4 PCI devices and 3 EISA
devices.
The output from the EISA bridge chip is fed to a transceiver 38 to provide xbus 39. Coupled to xbus 39 are various devices communicating via the EISA protocol, such as a real time clock, a keyboard, a mouse, and an operator control panel. In
addition, also coupled to xbus 39 is a combo chip 42. The combo chip 42 may be used to provide data to a floppy drive and to receive network data from a serial and a parallel network port.
Coupled to PCI bus 24a is a Small Computer System Interface (SCSI) chip 34. This SCSI chip is used to couple PCI bus 24a to other external devices such as tape drives or CD-ROM drives.
Multi-processor computer system 10 is shown to be a self contained unit comprising a plurality of processors that may be used to handle accesses from the devices coupled to external device module 31 via the I/O module 22. It should be noted,
however, that the present invention is not limited to the arrangement or existence of certain ones of the external devices on the external device module 31. Rather, it will become apparent after a thorough reading of this specification that the
techniques described may be equally advantageous in other multi-processor system configurations.
Also coupled to system bus 20 is main memory 30. The main memory is a resource that may be accessed by any of the central processing units 12-18 or the external devices via the I/O interface 22. The system bus 20 comprises 128 bits of data and
16 bits of error correction code. Thus, for each bus transaction, the main memory must be able to provide, or be able to consume, 128 bits of data per cycle in order to maximize system bus utilization and thereby maximize performance.
SYSTEM BUS COMMUNICATION PROTOCOL
The system bus 20 runs synchronous to the CPU and to the memory, with the bus cycle time being a multiple of the CPU clock. Central arbitration of the system bus 20 is controlled by arbitration logic 35. Associated with each of the devices
coupled to the system bus, is a module identifier (MID). As each component forwards data onto the system bus, its module identifier is read by the arbitration logic 35. The arbitration logic determines, based on the module I.D. number and the pending
request, which component has control of the system bus 20 for a given transaction.
The protocol used to control arbitration is modifiable. This is because various arrangements and numbers of processor cards, or PCI cards, may be coupled to the bus, and as a result various types of arbitration protocols may be more appropriate
to provide the highest level of performance.
For example, one arbitration mode is strict round robin, where control is granted first to the PCI bridge and then to each successive PCI device in order. Another type of arbitration is referred to as the modified round robin type of
arbitration. In this type of arbitration, the PCI bridge wins control of the PCI bus for every other transaction; thus the order would be the PCI device then the MC-PCI bus bridge, then the PCI device and then the MC-PCI bus bridge, etc.
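The two grant orders can be compared with a small sketch. It only generates the order in which requesters would win if every one of them always had a request pending, which is a simplifying assumption; the function names and the string labels are hypothetical.

```python
from itertools import cycle, islice


def strict_round_robin(bridge, devices):
    """Grant the bridge first, then each device in order, and repeat."""
    return cycle([bridge, *devices])


def modified_round_robin(bridge, devices):
    """Alternate grants: a device, then the bridge, then the next device."""
    for device in cycle(devices):
        yield device
        yield bridge


strict = list(islice(strict_round_robin("bridge", ["dev0", "dev1"]), 5))
modified = list(islice(modified_round_robin("bridge", ["dev0", "dev1"]), 6))
```

With two devices, the bridge wins one grant in three under the strict scheme and one grant in two under the modified scheme.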
Other methods of arbitration could similarly be used in the present invention as will become apparent upon further reading of this specification, and therefore it should be recognized that the method of arbitration should not be viewed as a
limiting element.
Suffice it to say that data is transferred on the system bus in four consecutive cycles during either a write, read, or fill transaction. Write data is always driven in four preassigned data cycles relative to the start of the write transaction. Read data is returned either in four preassigned data cycles relative to the start of the read transaction (non-pended read) or at a later time during a separate system bus fill transaction (pended read). Fill data is always driven in four preassigned
data cycles relative to the start of the fill transaction.
A dead cycle will usually be inserted on the system bus after each set of four data cycles to allow for tri-state turn-on and turn-off at each of the elements coupled to the bus. One exception to this dead data cycle occurs when two non-pended
reads are returned from the same element RAM device in main memory 30. In this case, the dead cycle is not inserted and the data can be issued at the maximum bandwidth of the system bus of 1.066 gigabyte/second.
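The quoted maximum bandwidth can be related to the 128-bit data path with a short calculation. The bus clock frequency below is derived from the two stated figures rather than quoted from the text, and the reduced figure assumes exactly one dead cycle for every four data cycles; both are assumptions of this sketch.

```python
BUS_WIDTH_BYTES = 128 // 8      # 128 data bits transferred per cycle
PEAK_BYTES_PER_SEC = 1.066e9    # maximum bandwidth quoted in the text

# Clock rate implied by moving 16 bytes every cycle at the peak rate.
implied_clock_hz = PEAK_BYTES_PER_SEC / BUS_WIDTH_BYTES

# Sustained rate if one dead cycle follows every four data cycles.
with_dead_cycle = PEAK_BYTES_PER_SEC * 4 / 5
```

Under these assumptions the implied clock is about 66.6 MHz, and the dead cycle lowers sustained bandwidth to roughly 853 MB/sec.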
During operation, when any of the elements coupled to system bus 20 requires access to the bus, that element sends a request signal to the arbitration logic 35 on one of the MC_REQ lines 36. The arbiter sends a grant signal to each of the eight
nodes on the corresponding signals MC_Grant_L 7-0.
Each of the elements coupled to the bus drives only one of the request lines and reads only one of the grant lines. When none of the request lines are asserted, the central arbiter 35 asserts the grant line to the last element that was granted the
system bus. This behavior is referred to as bus parking. Note that even when the bus is granted to a node, a new transaction will not begin until the request line is asserted.
When a new transaction begins on the system bus, the arbiter 35 asserts the signal MC_CA. On an idle system bus, this signal may be asserted either one or three cycles after the request line is asserted. If the grant line was already
asserted as the result of the bus parking feature described above, then when the request line is asserted, the arbiter 35 will assert the MC_CA signal one cycle later. If the grant signal was not asserted on an idle bus cycle, then when the
request line is asserted the arbiter will assert the MC_CA signal three cycles later.
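The effect of bus parking on start-up latency can be modelled as follows. The sketch covers only the idle-bus case described above; the `Arbiter` class and `cycles_to_mc_ca` function are hypothetical names, and contention between simultaneous requesters is deliberately left out.

```python
def cycles_to_mc_ca(requester, parked_on):
    """Cycles from request assertion to MC_CA assertion on an idle bus."""
    return 1 if requester == parked_on else 3


class Arbiter:
    def __init__(self):
        self.parked_on = None  # node whose grant line is currently asserted

    def request(self, node):
        latency = cycles_to_mc_ca(node, self.parked_on)
        self.parked_on = node  # grant stays with the last winner (bus parking)
        return latency


arbiter = Arbiter()
latencies = [arbiter.request(n) for n in (3, 3, 5, 3)]
```

A node that issues back-to-back transactions pays the three-cycle latency once and the one-cycle latency afterwards, until another node takes the bus.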
PARTIAL CACHE LINE WRITES
Referring now to FIG. 2, one example of the CPU cards 12 is shown. The CPU card is shown to include a processor chip 40, which, for example, may be an Alpha® 21164 CPU chip manufactured by Digital Equipment Corporation™. The
processor chip is shown to include processor logic 42 coupled to receive instructions from a primary cache that includes instruction store 44 and data store 46. In this version of the processor chip, a secondary cache 48 is included on chip for
providing data to the respective instruction store and data stores to reduce the time required to obtain data from external memory.
The processor chip additionally includes a tag store 50 and a third level cache, here referred to as B-cache 52. As in the secondary cache, the B-cache is for temporary storage of large blocks of data that are retrieved from the main memory 30
(FIG. 1). By temporarily storing large portions of data in the B-cache, the period of time that the processor must wait to retrieve data and instructions from the main memory can be reduced to provide higher processor performance.
The B-cache is apportioned into blocks of data, (also referred to as cache lines) where each block of data may comprise, for example, 64 Kbytes of data. Associated with each block of data in the B-cache 52 is an entry in tag store 50. The tag
store includes, for each entry, a group of status bits including a valid bit, a shared bit, and a dirty bit. The valid bit indicates that data in the block is valid and may be used by the associated processor. A Shared bit indicates whether or not that
data has been loaded into the B-cache of more than one processor. If data is shared between processors, care must be taken to ensure that one processor does not make a modification that isn't reflected to the other processor. The Dirty bit indicates
whether or not that block of data has been modified by the associated processor.
Table I below illustrates legal assertions of the Valid, Shared and Dirty bits for each Tag store entry, how those entry values indicate that the cache should control that cache entry, and how other processors treat a cache entry with those bits
set in the tag store of another cache:
TABLE I
___________________________________________________________________________________________
Valid  Shared  Dirty  Associated Processor                      Other Processors
___________________________________________________________________________________________
1      0       0      may read freely, cannot write             invalid
1      0       1      may read freely, may write freely,        invalid
                      has most up-to-date copy
1      1       0      may read freely, must broadcast writes    may read freely from cache,
                                                                must broadcast write
1      1       1      may read freely, must broadcast write,    may read freely from cache,
                      has most up-to-date copy                  must broadcast write
___________________________________________________________________________________________
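The rows of Table I can be expressed as a small lookup on the three status bits. The sketch below covers only the "Associated Processor" column; the `TagEntry` class, the `owner_permissions` function, and the treatment of an entry whose Valid bit is clear (which Table I does not list) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagEntry:
    valid: bool
    shared: bool
    dirty: bool


def owner_permissions(entry):
    """Describe what the associated processor may do with the cache entry."""
    if not entry.valid:
        # Not covered by Table I; assumed to permit no access.
        return {"read": False, "write": "none", "latest_copy": False}
    if entry.shared:
        write = "broadcast"  # shared data: writes must be broadcast
    else:
        write = "free" if entry.dirty else "none"
    return {"read": True, "write": write, "latest_copy": entry.dirty}
```

In this encoding the Dirty bit alone determines whether the cache holds the most up-to-date copy, which matches the second and fourth rows of the table.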
Before a processor can mark an entry as DIRTY (i.e. modify an entry) it must first broadcast invalidates to all the other cacheable devices in the system. In addition, any device that attempts to read that block of data from memory must retrieve
the Dirty block of data from the modifying processor in order to maintain data coherency.
Here, in order to maintain cache coherency, each device that includes a cache and is coupled to bus 20 operates according to a snoopy protocol. According to the snoopy protocol, the bus control logic 50 monitors or `snoops` the bus to determine
the type and address of requests being made of memory. Thus when processors execute requests over the system bus 20 to the arbiter 35 for access to main memory, each element that is coupled to the system bus monitors the transaction to determine whether
or not these transactions will affect the content of the data stored in their B-cache. The use of a duplicate tag store 54 facilitates a determination as to the contents of the B-cache of the processor.
As discussed previously, both the PCI cards and the CPU cards are coupled to exchange data with main memory 30. However, typically I/O devices, for cost reasons, do not include the complex memory management architecture that is found on the
central processing card. As such, when the I/O devices seek to access a block of data, they must make sure that they have the correct, updated version of data and they must make sure that other devices do not interfere with their transaction.
Atomic operations are typically used to ensure coherency for I/O devices. During atomic operations, no other devices are allowed to access the system bus between a read, modify and write operation. Rather, the I/O device has exclusive access to
the system bus for the period of the atomic operation. However, atomic operations reduce performance because typically the I/O device is not modifying the entire block, yet other devices are prohibited from accessing memory until the end of the atomic
transaction.
Performance is further reduced for instances in which the I/O device only writes a portion of a cache line for each operation. When only 16 bytes of a 64 byte line are modified by the I/O device, a Read/Modify/Write operation must be performed
to accurately update the system memory contents. As a result, other devices are precluded from accessing memory for an even longer period of time for these partial cache line write operations. In addition, it is typically the case that the I/O device
only updates a portion of the cache line, and thus this performance hit is incurred on the majority of I/O accesses.
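The cost of the conventional Read/Modify/Write sequence for the 16-byte example can be made concrete with a short sketch. The dictionary standing in for memory and the function name `read_modify_write` are hypothetical, and the model ignores bus timing entirely; it only shows how much of the line is moved without being changed.

```python
LINE_SIZE = 64  # bytes per cache line, as in the example in the text


def read_modify_write(memory, line_address, offset, new_bytes):
    """Merge `new_bytes` into a full line and write the whole line back."""
    if offset < 0 or offset + len(new_bytes) > LINE_SIZE:
        raise ValueError("write must fit within one cache line")
    line = bytearray(memory[line_address])            # read the whole line
    line[offset:offset + len(new_bytes)] = new_bytes  # modify part of it
    memory[line_address] = bytes(line)                # write the whole line back
    return LINE_SIZE - len(new_bytes)                 # bytes moved but unchanged


memory = {0: bytes(LINE_SIZE)}
unchanged = read_modify_write(memory, 0, 16, b"\xaa" * 16)
```

For a 16-byte update, 48 of the 64 bytes are read and written back unchanged while other devices wait for the atomic sequence to finish.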
The present invention improves the perfo | | |