BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and,
more particularly, to storing coherency states within multiple subnodes of
processing nodes in distributed shared memory multiprocessing computer
systems.
2. Description of the Relevant Art
Multiprocessing computer systems include two or more processors which may
be employed to perform computing tasks. A particular computing task may be
performed upon one processor while other processors perform unrelated
computing tasks. Alternatively, components of a particular computing task
may be distributed among multiple processors to decrease the time required
to perform the computing task as a whole. Generally speaking, a processor
is a device configured to perform an operation upon one or more operands
to produce a result. The operation is performed in response to an
instruction executed by the processor.
A popular architecture in commercial multiprocessing computer systems is
the symmetric multiprocessor (SMP) architecture. Typically, an SMP
computer system comprises multiple processors connected through a cache
hierarchy to a shared bus. Additionally connected to the bus is a memory,
which is shared among the processors in the system. Access to any
particular memory location within the memory occurs in a similar amount of
time as access to any other particular memory location. Since each
location in the memory may be accessed in a uniform manner, this structure
is often referred to as a uniform memory architecture (UMA).
Processors are often configured with internal caches, and one or more
caches are typically included in the cache hierarchy between the
processors and the shared bus in an SMP computer system. Multiple copies
of data residing at a particular main memory address may be stored in
these caches. In order to maintain the shared memory model, in which a
particular address stores exactly one data value at any given time, shared
bus computer systems employ cache coherency. Generally speaking, an
operation is coherent if the effects of the operation upon data stored at
a particular memory address are reflected in each copy of the data within
the cache hierarchy. For example, when data stored at a particular memory
address is updated, the update may be supplied to the caches which are
storing copies of the previous data. Alternatively, the copies of the
previous data may be invalidated in the caches such that a subsequent
access to the particular memory address causes the updated copy to be
transferred from main memory. For shared bus systems, a snoop bus protocol
is typically employed. Each coherent transaction performed upon the shared
bus is examined (or "snooped") against data in the caches. If a copy of
the affected data is found, the state of the cache line containing the
data may be updated in response to the coherent transaction.
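The invalidation alternative described above can be sketched as follows. This is a minimal illustration only, not the protocol of any particular bus: the class names `SnoopingCache` and `SharedBus`, the write-through policy, and the one-value cache lines are assumptions made for brevity.

```python
# Hypothetical sketch of write-invalidate snooping; all names are illustrative.
class SnoopingCache:
    def __init__(self):
        self.lines = {}  # address -> cached value (presence means "valid")

    def snoop(self, op, address):
        # A write by another agent invalidates any local copy.
        if op == "write":
            self.lines.pop(address, None)


class SharedBus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, address):
        if address not in cache.lines:
            cache.lines[address] = self.memory[address]  # miss: fetch from memory
        return cache.lines[address]

    def write(self, cache, address, value):
        # Every other cache snoops the coherent transaction.
        for other in self.caches:
            if other is not cache:
                other.snoop("write", address)
        cache.lines[address] = value
        self.memory[address] = value  # write-through, assumed for simplicity


bus = SharedBus({0x40: 1})
a, b = SnoopingCache(), SnoopingCache()
bus.attach(a)
bus.attach(b)
bus.read(b, 0x40)                    # b caches the previous data
bus.write(a, 0x40, 7)                # a's write invalidates b's copy
value_seen_by_b = bus.read(b, 0x40)  # b misses and re-fetches from main memory
```

After the write, the subsequent read by the second cache misses and obtains the updated copy from main memory, which is the behavior the shared memory model requires.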
Unfortunately, shared bus architectures suffer from several drawbacks which
limit their usefulness in multiprocessing computer systems. A bus is
capable of a peak bandwidth (e.g. a number of bytes/second which may be
transferred across the bus). As additional processors are attached to the
bus, the bandwidth required to supply the processors with data and
instructions may exceed the peak bus bandwidth. Since some processors are
forced to wait for available bus bandwidth, performance of the computer
system suffers when the bandwidth requirements of the processors exceed
available bus bandwidth.
Additionally, adding more processors to a shared bus increases the
capacitive loading on the bus and may even cause the physical length of
the bus to be increased. The increased capacitive loading and extended bus
length increase the delay in propagating a signal across the bus. Due to
the increased propagation delay, transactions may take longer to perform.
Therefore, the peak bandwidth of the bus may decrease as more processors
are added.
These problems are further magnified by the continued increase in operating
frequency and performance of processors. The increased performance enabled
by the higher frequencies and more advanced processor microarchitectures
results in higher bandwidth requirements than previous processor
generations, even for the same number of processors. Therefore, buses
which previously provided sufficient bandwidth for a multiprocessing
computer system may be insufficient for a similar computer system
employing the higher performance processors.
Another structure for multiprocessing computer systems is a distributed
shared memory architecture. A distributed shared memory architecture
includes multiple nodes within which processors and memory reside. The
multiple nodes communicate via a network coupled therebetween. When
considered as a whole, the memory included within the multiple nodes forms
the shared memory for the computer system. Typically, directories are used
to identify which nodes have cached copies of data corresponding to a
particular address. Coherency activities may be generated via examination
of the directories.
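As a rough illustration of how a directory may drive coherency activity, the sketch below records which nodes hold a cached copy of each address and reports which copies must be invalidated when a node writes. The `Directory` class, its method names, and the node labels are hypothetical and are not taken from any specific system.

```python
from collections import defaultdict


class Directory:
    """Tracks, per address, the set of nodes holding a cached copy."""

    def __init__(self):
        self.sharers = defaultdict(set)

    def record_read(self, address, node):
        self.sharers[address].add(node)

    def record_write(self, address, writer):
        # Every other sharer must give up its copy before the write completes.
        to_invalidate = sorted(self.sharers[address] - {writer})
        self.sharers[address] = {writer}
        return to_invalidate


directory = Directory()
directory.record_read(0x1000, "node_a")
directory.record_read(0x1000, "node_b")
invalidations = directory.record_write(0x1000, "node_a")
```

Here the write by `node_a` yields a single invalidation directed at `node_b`, after which `node_a` is the only recorded holder.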
Distributed shared memory systems are scalable, overcoming the limitations
of the shared bus architecture. Since many of the processor accesses are
completed within a node, nodes typically have much lower bandwidth
requirements upon the network than a shared bus architecture must provide
upon its shared bus. The nodes may operate at high clock frequency and
bandwidth, accessing the network when needed. Additional nodes may be
added to the network without affecting the local bandwidth of the nodes.
Instead, only the network bandwidth is affected.
Many distributed shared memory systems suffer from a limitation upon the
memory which may be included within a node. The limitation arises not from
the number of memory modules (such as dynamic random access memory, or
DRAM, modules which are popular in the industry) which may be configured
into a node to form the memory, but instead arises from the amount of
memory which may be used to store the access rights of the node to a
particular coherency unit within the memory. In order to maintain
system-wide memory coherency, the access rights granted to a particular
node must be respected by that node. However, the node typically employs
high speed internal communications, such that the access rights must be
accessible very quickly. DRAM is typically not suitable for high speed
access. Instead, static random access memory (SRAM) modules are typically
used to store the access rights.
While SRAM modules may respond with speeds suitable for use in storing
access rights, SRAM modules suffer from other drawbacks. SRAM modules are
not fabricated with the densities typified by DRAM. In other words, a much
larger number of SRAM modules must be used to store the same number of
bits as a particular number of DRAM modules. Unfortunately, the lack of
density in SRAM modules leads to increased pinouts on modules housing the
control logic which interfaces to the SRAM modules in order to store,
retrieve, and analyze access rights corresponding to a coherency unit
accessed by a transaction occurring within the node. The number of SRAM
modules which may be used is therefore limited by the number of pins
available on the control logic modules. Hence, the number of access rights
(and therefore the number of coherency units) which may be stored in the
node is limited. Additionally, SRAM modules are significantly more
expensive than DRAM modules. In order to minimize the cost of the computer
system, it is important to minimize the number of SRAM modules included.
For at least the above reasons, the amount of memory needed to store access
rights may limit the amount of main memory which may be included within
the node. Still further, if less than the maximum amount of memory is
included in a node, it is desirable to reduce the memory dedicated to
storing access rights accordingly. In addition, it is desirable to be able
to upgrade the amount of memory in a given node subsequent to manufacture
of the computer system. Therefore, the amount of memory used for storing
access rights must be similarly increasable.
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a computer system
in accordance with the present invention. The computer system includes one
or more processing nodes, each of which includes one or more subnodes. One
of the subnodes (the controller subnode) manages the interface between the
processing node and the remainder of the computer system. Other subnodes
(snooper subnodes) are employed to store access rights for coherency units
within the memory. The processing node's memory is logically divided into
portions, and each subnode stores access rights for a particular memory
portion. When a transaction is initiated within the processing node, the
subnode storing the access rights for the coherency unit affected by the
transaction analyzes the access rights and determines if the transaction
may complete locally within the processing node. If coherency activity is
required, the subnode asserts an ignore signal causing the transaction to
be delayed while coherency activity is performed to acquire sufficient
access rights.
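The division of access rights among subnodes may be pictured with the following sketch. It assumes, purely for illustration, that each subnode covers a contiguous range of coherency units, that a coherency unit is 64 bytes, and that access rights can be ordered as `NO_ACCESS < READ_ONLY < READ_WRITE`; none of these names or encodings is specified by the text above.

```python
COHERENCY_UNIT_BYTES = 64  # assumed unit size, for illustration only

# Hypothetical access rights, ordered so that a larger value permits more.
NO_ACCESS, READ_ONLY, READ_WRITE = 0, 1, 2


class Subnode:
    """Stores access rights for one contiguous portion of node memory."""

    def __init__(self, first_unit, unit_count):
        self.first_unit = first_unit
        self.rights = [NO_ACCESS] * unit_count

    def covers(self, unit):
        return self.first_unit <= unit < self.first_unit + len(self.rights)

    def set_rights(self, unit, rights):
        self.rights[unit - self.first_unit] = rights

    def asserts_ignore(self, unit, required):
        # Ignore is asserted when the stored rights are insufficient.
        return self.rights[unit - self.first_unit] < required


def ignore_asserted(subnodes, address, required):
    unit = address // COHERENCY_UNIT_BYTES
    owner = next(s for s in subnodes if s.covers(unit))
    return owner.asserts_ignore(unit, required)


subnodes = [Subnode(0, 4), Subnode(4, 4)]
subnodes[1].set_rights(5, READ_ONLY)
read_ignored = ignore_asserted(subnodes, 5 * 64, READ_ONLY)
write_ignored = ignore_asserted(subnodes, 5 * 64, READ_WRITE)
```

In this example a read to coherency unit 5 may complete locally, whereas a write to the same unit causes the second subnode to assert the ignore signal so that coherency activity can be performed first.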
The access rights are updated concurrent with reissue of a transaction for
which coherency activity is performed. In this manner, the updated access
rights are available to subsequent transactions even though the access
rights may be stored in a different subnode than the controller subnode
(which performs the reissue transaction). In one embodiment, the updated
access rights are conveyed within one of the address phases of the reissue
transaction. A bytemask field within one of the address phases is used.
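A simplified model of the concurrent update is given below. The `Reissue` record stands in for the address phase that carries the new access rights; its field names, the integer state encoding, and the contiguous unit ranges are assumptions for illustration, and the actual bytemask layout is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Reissue:
    unit: int        # coherency unit affected by the transaction
    new_state: int   # updated access rights, carried with the address phase


class SnooperSubnode:
    def __init__(self, first_unit, unit_count):
        self.first_unit = first_unit
        self.states = [0] * unit_count

    def observe(self, reissue):
        # Capture the broadcast state only if this subnode stores the unit.
        index = reissue.unit - self.first_unit
        if 0 <= index < len(self.states):
            self.states[index] = reissue.new_state
            return True
        return False


snoopers = [SnooperSubnode(0, 8), SnooperSubnode(8, 8)]
captured = [s.observe(Reissue(unit=10, new_state=2)) for s in snoopers]
```

Only the subnode whose range includes unit 10 records the new state, so a later transaction to that unit sees the updated access rights even though the reissue originated elsewhere.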
By dividing the access rights storage into multiple subnodes, subnodes may
be added to increase the number of access rights which may be stored
within a processing node. Consequently, the amount of memory (measured in
coherency units) may be increased beyond the number of coherency units
manageable by one subnode. Advantageously, the computer system exhibits a
high degree of flexibility and reconfigurability. For example, the
computer system may be purchased with a small amount of memory and later
upgraded to a larger amount of memory relatively easily.
Additionally, the division of access rights storage solves the physical
problems of storing the access rights in fast but low-density SRAM-type memory.
Each subnode may be configured with a certain number of banks of SRAM (for
example, two). When the number of access rights to be stored requires more
than the certain number of banks, then another subnode may be added. In
this manner, the number of signal lines to which the control logic within
any given subnode connects is limited to a smaller number than if a single
subnode were used. Advantageously, the control logic may use
commercially available packaging since the number of pins required is
minimized. Still further, the fast SRAM may be used, satisfying speed
requirements for access during high speed intranode communications.
Broadly speaking, the present invention contemplates a method for
completing a transaction in a processing node of a multiprocessing
computer system. The transaction is reissued within the processing node
upon completion of coherency activity performed with respect to the
transaction. Concurrently, a new coherency state corresponding to a
coherency unit affected by the transaction is broadcast. The coherency
state is recorded in a position within a table of coherency states. The
position corresponds to the coherency unit.
The present invention further contemplates a system interface comprising a
first subnode and a second subnode. The first subnode is configured to
communicate between a local bus of a processing node and a network.
Coupled to the local bus, the second subnode is configured to store a
first plurality of coherency states corresponding to a first plurality of
coherency units stored within the processing node.
The present invention still further contemplates a computer system
comprising a network and a first processing node. The first processing
node is coupled to the network and includes a controller subnode and a
snooper subnode. The controller subnode is configured to effectuate
communication upon the network and to reissue a transaction for which the
communication is effectuated upon completion of the communication.
Furthermore, the controller subnode is configured to broadcast a coherency
state achieved via the communication. The snooper subnode is configured to
store a first plurality of coherency states corresponding to a first
plurality of coherency units stored within the first processing node.
Additionally, the snooper subnode is configured to capture the coherency
state broadcast by the controller subnode if the coherency state is one of
the first plurality of coherency states.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects and advantages of the invention will become apparent upon
reading the following detailed description and upon reference to the
accompanying drawings in which:
FIG. 1 is a block diagram of a multiprocessor computer system.
FIG. 1A is a conceptualized block diagram depicting a non-uniform memory
architecture supported by one embodiment of the computer system shown in
FIG. 1.
FIG. 1B is a conceptualized block diagram depicting a cache-only memory
architecture supported by one embodiment of the computer system shown in
FIG. 1.
FIG. 2 is a block diagram of one embodiment of a symmetric multiprocessing
node depicted in FIG. 1.
FIG. 2A is an exemplary directory entry stored in one embodiment of a
directory depicted in FIG. 2.
FIG. 3 is a block diagram of one embodiment of a system interface shown in
FIG. 1.
FIG. 4 is a diagram depicting activities performed in response to a typical
coherency operation between a request agent, a home agent, and a slave
agent.
FIG. 5 is an exemplary coherency operation performed in response to a read
to own request from a processor.
FIG. 6 is a flowchart depicting an exemplary state machine for one
embodiment of a request agent shown in FIG. 3.
FIG. 7 is a flowchart depicting an exemplary state machine for one
embodiment of a home agent shown in FIG. 3.
FIG. 8 is a flowchart depicting an exemplary state machine for one
embodiment of a slave agent shown in FIG. 3.
FIG. 9 is a table listing request types according to one embodiment of the
system interface.
FIG. 10 is a table listing demand types according to one embodiment of the
system interface.
FIG. 11 is a table listing reply types according to one embodiment of the
system interface.
FIG. 12 is a table listing completion types according to one embodiment of
the system interface.
FIG. 13 is a table describing coherency operations in response to various
operations performed by a processor, according to one embodiment of the
system interface.
FIG. 14 is a block diagram of a second embodiment of a symmetric
multiprocessing node depicted in FIG. 1.
FIG. 15 is a timing diagram depicting a portion of a transaction upon a bus
of the symmetric multiprocessing node shown in FIG. 14.
FIG. 16 is a timing diagram depicting a portion of two transactions upon
the bus of the symmetric multiprocessing node shown in FIG. 14,
highlighting update of an MTAG state with respect to the first of the two
transactions.
FIG. 17 is a diagram depicting fields of a bytemask transmitted upon the
bus of the symmetric multiprocessing node shown in FIG. 14, according to
one embodiment of the symmetric multiprocessing node.
FIG. 18 is a diagram of control registers employed in one embodiment of the
symmetric multiprocessing node shown in FIG. 14.
FIG. 19 is a diagram of an address space of one embodiment of the symmetric
multiprocessing node shown in FIG. 14.
FIG. 20 is a diagram depicting an exemplary MTAG layout employed by one
embodiment of the symmetric multiprocessing node shown in FIG. 14.
While the invention is susceptible to various modifications and alternative
forms, specific embodiments thereof are shown by way of example in the
drawings and will herein be described in detail. It should be understood,
however, that the drawings and detailed description thereto are not
intended to limit the invention to the particular form disclosed, but on
the contrary, the intention is to cover all modifications, equivalents and
alternatives falling within the spirit and scope of the present invention
as defined by the appended claims.
DETAILED DESCRIPTION OF THE INVENTION
Turning now to FIG. 1, a block diagram of one embodiment of a
multiprocessing computer system 10 is shown. Computer system 10 includes
multiple SMP nodes 12A-12D interconnected by a point-to-point network 14.
Elements referred to herein with a particular reference number followed by
a letter will be collectively referred to by the reference number alone.
For example, SMP nodes 12A-12D will be collectively referred to as SMP
nodes 12. In the embodiment shown, each SMP node 12 includes multiple
processors, external caches, an SMP bus, a memory, and a system interface.
For example, SMP node 12A is configured with multiple processors including
processors 16A-16B. The processors 16 are connected to external caches 18,
which are further coupled to an SMP bus 20. Additionally, a memory 22 and
a system interface 24 are coupled to SMP bus 20. Still further, one or
more input/output (I/O) interfaces 26 may be coupled to SMP bus 20. I/O
interfaces 26 are used to interface to peripheral devices such as serial
and parallel ports, disk drives, modems, printers, etc. Other SMP nodes
12B-12D may be configured similarly.
Generally speaking, system interface 24 comprises one or more subnodes. One
of the subnodes (the controller subnode) includes an interface to network
14, while other subnodes simply maintain storage for access rights to
coherency units stored within memory 22. Upon completion of coherency
activity via network 14 in response to a transaction, the controller subnode
reissues the transaction upon SMP bus 20. Concurrently, the controller
subnode provides an updated access rights value for storage in the subnode
corresponding to the affected coherency unit. Because the updated access
rights value is provided concurrent with the reissued transaction, the
updated access rights are available to subsequent transactions through the
corresponding subnode. Advantageously, the access rights are logically
updated at the same time as the transaction completes, in accordance with
the memory coherency model supported by computer system 10. The update is
performed even though the access rights may be stored in a subnode other
than the controller subnode. It is noted that the controller subnode may
include a portion of the access rights memory (referred to herein as the
MTAG) as well.
In one embodiment, each subnode which forms a portion of system interface
24 comprises a printed circuit board which is independently inserted into
a backplane comprising SMP bus 20. The number of subnodes included is
configurable, and therefore is expandable if the size of memory 22 is
expanded. Advantageously, the amount of MTAG memory is adjustable to match
the amount needed for the size of memory 22. For example, if each
coherency unit is 64 bytes and access rights comprise two bits per
coherency unit, then the amount of MTAG memory is 1/256th of the size
of memory 22. Computer system 10 may initially be manufactured with a
particular amount of memory, and subsequently memory may be added or
deleted by adding or deleting memory modules from memory 22 and adding or
deleting subnodes from system interface 24.
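The ratio in the example above follows from simple arithmetic: a 64 byte coherency unit holds 512 bits, and two bits of access rights per 512 bits of data is 1/256. The short sketch below reproduces that calculation; the function name `mtag_bytes` and the 1 GiB memory size are illustrative assumptions.

```python
from fractions import Fraction


def mtag_bytes(memory_bytes, unit_bytes=64, bits_per_unit=2):
    """Bytes of MTAG storage needed to cover memory_bytes of node memory."""
    units = memory_bytes // unit_bytes
    return units * bits_per_unit // 8


ratio = Fraction(2, 64 * 8)             # 2 bits of state per 512 bits of data
one_gib = 1 << 30
mtag_for_one_gib = mtag_bytes(one_gib)  # storage for a hypothetical 1 GiB node
```

Under these assumptions, a node with 1 GiB of memory would need 4 MiB of MTAG storage, which may then be divided among the subnodes.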
Generally speaking, a memory operation is an operation causing transfer of
data from a source to a destination. The source and/or destination may be
storage locations within the initiator, or may be storage locations within
memory. When a source or destination is a storage location within memory,
the source or destination is specified via an address conveyed with the
memory operation. Memory operations may be read or write operations. A
read operation causes transfer of data from a source outside of the
initiator to a destination within the initiator. Conversely, a write
operation causes transfer of data from a source within the initiator to a
destination outside of the initiator. In the computer system shown in FIG.
1, a memory operation may include one or more transactions upon SMP bus 20
as well as one or more coherency operations upon network 14.
Architectural Overview
Each SMP node 12 is essentially an SMP system having memory 22 as the
shared memory. Processors 16 are high performance processors. In one
embodiment, each processor 16 is a SPARC processor compliant with version
9 of the SPARC processor architecture. It is noted, however, that any
processor architecture may be employed by processors 16.
Typically, processors 16 include internal instruction and data caches.
Therefore, external caches 18 are labeled as L2 caches (for level 2,
wherein the internal caches are level 1 caches). If processors 16 are not
configured with internal caches, then external caches 18 are level 1
caches. It is noted that the "level" nomenclature is used to identify
proximity of a particular cache to the processing core within processor
16. Level 1 is nearest the processing core, level 2 is next nearest, etc.
External caches 18 provide rapid access to memory addresses frequently
accessed by the processor 16 coupled thereto. It is noted that external
caches 18 may be configured in any of a variety of specific cache
arrangements. For example, set-associative or direct-mapped configurations
may be employed by external caches 18.
SMP bus 20 accommodates communication between processors 16 (through caches
18), memory 22, system interface 24, and I/O interface 26. In one
embodiment, SMP bus 20 includes an address bus and related control
signals, as well as a data bus and related control signals. Because the
address and data buses are separate, a split-transaction bus protocol may
be employed upon SMP bus 20. Generally speaking, a split-transaction bus
protocol is a protocol in which a transaction occurring upon the address
bus may differ from a concurrent transaction occurring upon the data bus.
Transactions involving address and data include an address phase in which
the address and related control information is conveyed upon the address
bus, and a data phase in which the data is conveyed upon the data bus.
Additional address phases and/or data phases for other transactions may be
initiated prior to the data phase corresponding to a particular address
phase. An address phase and the corresponding data phase may be correlated
in a number of ways. For example, data transactions may occur in the same
order that the address transactions occur. Alternatively, address and data
phases of a transaction may be identified via a unique tag.
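The tag-based correlation mentioned above may be sketched as follows. The `SplitTransactionBus` class and its sequential tag allocation are hypothetical; the sketch only shows that data phases can complete in a different order than their address phases when each carries a unique tag.

```python
class SplitTransactionBus:
    def __init__(self):
        self.pending = {}   # tag -> address awaiting its data phase
        self.next_tag = 0

    def address_phase(self, address):
        tag = self.next_tag
        self.next_tag += 1
        self.pending[tag] = address
        return tag

    def data_phase(self, tag, data):
        # The tag identifies which earlier address phase this data belongs to.
        address = self.pending.pop(tag)
        return address, data


split_bus = SplitTransactionBus()
first = split_bus.address_phase(0x100)
second = split_bus.address_phase(0x200)  # issued before the first data phase
results = [split_bus.data_phase(second, "B"), split_bus.data_phase(first, "A")]
```

If data phases instead occur in the same order as the address phases, the tag may be dispensed with and a simple FIFO of outstanding addresses suffices.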
Memory 22 is configured to store data and instruction code for use by
processors 16. Memory 22 preferably comprises dynamic random access memory
(DRAM), although any type of memory may be used. Memory 22, in conjunction
with similar illustrated memories in the other SMP nodes 12, forms a
distributed shared memory system. Each address in the address space of the
distributed shared memory is assigned to a particular node, referred to as
the home node of the address. A processor within a different node than the
home node may access the data at an address of the home node, potentially
caching the data. Therefore, coherency is maintained between SMP nodes 12
as well as among processors 16 and caches 18 within a particular SMP node
12A-12D. System interface 24 provides internode coherency, while snooping
upon SMP bus 20 provides intranode coherency.
In addition to maintaining internode coherency, system interface 24 detects
addresses upon SMP bus 20 which require a data transfer to or from another
SMP node 12. System interface 24 performs the transfer, and provides the
corresponding data for the transaction upon SMP bus 20. In the embodiment
shown, system interface 24 is coupled to a point-to-point network 14.
However, it is noted that in alternative embodiments other networks may be
used. In a point-to-point network, individual connections exist between
each node upon the network. A particular node communicates directly with a
second node via a dedicated link. To communicate with a third node, the
particular node utilizes a different link than the one used to communicate
with the second node.
It is noted that, although four SMP nodes 12 are shown in FIG. 1,
embodiments of computer system 10 employing any number of nodes are
contemplated.
FIGS. 1A and 1B are conceptualized illustrations of distributed memory
architectures supported by one embodiment of computer system 10.
Specifically, FIGS. 1A and 1B illustrate alternative ways in which each
SMP node 12 of FIG. 1 may cache data and perform memory accesses. Details
regarding the manner in which computer system 10 supports such accesses
will be described in further detail below.
Turning now to FIG. 1A, a logical diagram depicting a first memory
architecture 30 supported by one embodiment of computer system 10 is
shown. Architecture 30 includes multiple processors 32A-32D, multiple
caches 34A-34D, multiple memories 36A-36D, and an interconnect network 38.
The multiple memories 36 form a distributed shared memory. Each address
within the address space corresponds to a location within one of memories
36.
Architecture 30 is a non-uniform memory architecture (NUMA). In a NUMA
architecture, the amount of time required to access a first memory address
may be substantially different than the amount of time required to access
a second memory address. The access time depends upon the origin of the
access and the location of the memory 36A-36D which stores the accessed
data. For example, if processor 32A accesses a first memory address stored
in memory 36A, the access time may be significantly shorter than the
access time for an access to a second memory address stored in one of
memories 36B-36D. That is, an access by processor 32A to memory 36A may be
completed locally (e.g. without transfers upon network 38), while a
processor 32A access to memory 36B is performed via network 38. Typically,
an access through network 38 is slower than an access completed within a
local memory. For example, a local access might be completed in a few
hundred nanoseconds while an access via the network might occupy a few
microseconds.
Data corresponding to addresses stored in remote nodes may be cached in any
of the caches 34. However, once a cache 34 discards the data corresponding
to such a remote address, a subsequent access to the remote address is
completed via a transfer upon network 38.
NUMA architectures may provide excellent performance characteristics for
software applications which use addresses that correspond primarily to a
particular local memory. Software applications which exhibit more random
access patterns and which do not confine their memory accesses to
addresses within a particular local memory, on the other hand, may
experience a large amount of network traffic as a particular processor 32
performs repeated accesses to remote nodes.
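The sensitivity of a NUMA system to access locality can be illustrated with a simple weighted average. The latencies used below (300 ns local, 3000 ns remote) are assumed values chosen to match the orders of magnitude mentioned above, not measurements.

```python
def average_access_ns(local_percent, local_ns=300, remote_ns=3000):
    """Mean access time given the percentage of accesses served locally."""
    remote_percent = 100 - local_percent
    return (local_percent * local_ns + remote_percent * remote_ns) / 100


mostly_local = average_access_ns(90)   # application confined to local memory
mostly_remote = average_access_ns(10)  # application with scattered accesses
```

With these assumed figures, the mostly local workload averages 570 ns per access while the mostly remote workload averages 2730 ns, nearly five times slower.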
Turning now to FIG. 1B, a logical diagram depicting a second memory
architecture 40 supported by the computer system 10 of FIG. 1 is shown.
Architecture 40 includes multiple processors 42A-42D, multiple caches
44A-44D, multiple memories 46A-46D, and network 48. However, memories 46
are logically coupled between caches 44 and network 48. Memories 46 serve
as larger caches (e.g. a level 3 cache), storing addresses which are
accessed by the corresponding processors 42. Memories 46 are said to
"attract" the data being operated upon by a corresponding processor 42. As
opposed to the NUMA architecture shown in FIG. 1A, architecture 40 reduces
the number of accesses upon the network 48 by storing remote data in the
local memory when the local processor accesses that data.
Architecture 40 is referred to as a cache-only memory architecture (COMA).
Multiple locations within the distributed shared memory formed by the
combination of memories 46 may store data corresponding to a particular
address. No permanent mapping of a particular address to a particular
storage location is assigned. Instead, the location storing data
corresponding to the particular address changes dynamically based upon the
processors 42 which access that particular address. Conversely, in the
NUMA architecture a particular storage location within memories 36 is
assigned to a particular address. Architecture 40 adjusts to the memory
access patterns performed by applications executing thereon, and coherency
is maintained between the memories 46.
In a preferred embodiment, computer system 10 supports both of the memory
architectures shown in FIGS. 1A and 1B. In particular, a memory address
may be accessed in a NUMA fashion from one SMP node 12A-12D while being
accessed in a COMA manner from another SMP node 12A-12D. In one
embodiment, a NUMA access is detected if certain bits of the address upon
SMP bus 20 identify another SMP node 12 as the home node of the address
presented. Otherwise, a COMA access is presumed. Additional details will
be provided below.
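One possible form of the detection described above is sketched below. The position and width of the home node field (`NODE_ID_SHIFT`, `NODE_ID_MASK`) are hypothetical; the text states only that certain bits of the address identify the home node.

```python
NODE_ID_SHIFT = 40   # hypothetical position of the home-node field
NODE_ID_MASK = 0x3   # two bits suffice for the four nodes of FIG. 1


def home_node(address):
    return (address >> NODE_ID_SHIFT) & NODE_ID_MASK


def access_type(address, local_node):
    # Another node as home implies NUMA; otherwise COMA is presumed.
    return "NUMA" if home_node(address) != local_node else "COMA"


remote_address = (2 << NODE_ID_SHIFT) | 0x1F40
local_address = (0 << NODE_ID_SHIFT) | 0x1F40
```

Evaluated from node 0, `remote_address` names node 2 as its home and is therefore treated as a NUMA access, while `local_address` is presumed to be a COMA access.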
In one embodiment, the COMA architecture is implemented using a combination
of hardware and software techniques. Hardware maintains coherency between
the locally cached copies of pages, and software (e.g. the operating
system employed in computer system 10) is responsible for allocating and
deallocating cached pages.
FIG. 2 depicts details of one implementation of an SMP node 12A that
generally conforms to the SMP node 12A shown in FIG. 1. Other nodes 12 may
be configured similarly. It is noted that alternative specific
implementations of each SMP node 12 of FIG. 1 are also possible. The
implementation of SMP node 12A shown in FIG. 2 includes multiple subnodes
such as subnodes 50A and 50B. Each subnode 50 includes two processors 16
and corresponding caches 18, a memory portion 56, an address controller
52, and a data controller 54. The memory portions 56 within subnodes 50
collectively form the memory 22 of the SMP node 12A of FIG. 1. Other
subnodes (not shown) are further coupled to SMP bus 20 to form the I/O
interfaces 26.
As shown in FIG. 2, SMP bus 20 includes an address bus 58 and a data bus
60. Address controller 52 is coupled to address bus 58, and data
controller 54 is coupled to data bus 60. FIG. 2 also illustrates system
interface 24, including a system interface logic block 62, a translation
storage 64, a directory 66, and a memory tag (MTAG) 68. Logic block 62 is
coupled to both address bus 58 and data bus 60, and asserts an ignore
signal 70 upon address bus 58 under certain circumstances as will be
explained further below. Additionally, logic block 62 is coupled to
translation storage 64, directory 66, MTAG 68, and network 14.
For the embodiment of FIG. 2, each subnode 50 is configured upon a printed
circuit board which may be inserted into a backplane upon which SMP bus 20
is situated. In this manner, the number of processors and/or I/O
interfaces 26 included within an SMP node 12 may be varied by inserting or
removing subnodes 50. For example, computer system 10 may initially be
configured with a small number of subnodes 50. Additional subnodes 50 may
be added from time to time as the computing power required by the users of
computer system 10 grows.
Address controller 52 provides an interface between caches 18 and the
address portion of SMP bus 20. In the embodiment shown, address controller
52 includes an out queue 72 and some number of in queues 74. Out queue 72
buffers transactions from the processors connected thereto until address
controller 52 is granted access to address bus 58. Address controller 52
performs the transactions stored in out queue 72 in the order those
transactions were placed into out queue 72 (i.e. out queue 72 is a FIFO
queue). Transactions performed by address controller 52 as well as
transactions received from address bus 58 which are to be snooped by
caches 18 and caches internal to processors 16 are placed into in queue
74.
Similar to out queue 72, in queue 74 is a FIFO queue. All address
transactions are stored in the in queue 74 of each subnode 50 (even within
the in queue 74 of the subnode 50 which initiates the address
transaction). Address transactions are thus presented to caches 18 and
processors 16 for snooping in the order they occur upon address bus 58.
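The queue behavior described above can be modeled roughly as follows. The `AddressController` class and the `grant_bus` function are illustrative names, and bus arbitration is reduced to an explicit call; the sketch shows only that every subnode observes address transactions in the same bus order.

```python
from collections import deque


class AddressController:
    def __init__(self):
        self.out_queue = deque()  # transactions waiting for the address bus
        self.in_queue = deque()   # transactions to be snooped, in bus order

    def issue(self, transaction):
        self.out_queue.append(transaction)


def grant_bus(owner, controllers):
    # The oldest queued transaction goes onto the address bus and is then
    # placed into the in queue of every subnode, including the initiator.
    transaction = owner.out_queue.popleft()
    for controller in controllers:
        controller.in_queue.append(transaction)


subnode_a, subnode_b = AddressController(), AddressController()
subnode_a.issue("read X")
subnode_a.issue("write Y")
subnode_b.issue("read Z")
grant_bus(subnode_a, [subnode_a, subnode_b])
grant_bus(subnode_b, [subnode_a, subnode_b])
grant_bus(subnode_a, [subnode_a, subnode_b])
```

Both in queues end up holding the transactions in the order they were granted the bus, regardless of which subnode initiated each one.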
The order that transactions occur upon address bus 58 is the order for SMP
node 12A. However, the complete system is expected to have one global
memory order. This ordering expectation creates a problem in both the NUMA
and COMA architectures employed by computer system 10, since the global
order may need to be established by the order of operations upon network
14. If two nodes perform a transaction to an address, the order that the
corresponding coherency operations occur at the home node for the address
defines the order of the two transactions as seen within each node. For
example, if two write transactions are performed to the same address, then
the second write operation to arrive at the address' home node should be
the second write transaction to complete (i.e. a byte location which is
updated by both write transactions stores a value provided by the second
write transaction upon completion of both transactions). However, the node
which performs the second transaction may actually have the second
transaction occur first upon SMP bus 20. Ignore signal 70 allows the
second transaction to be transferred to system interface 24 without the
remainder of the SMP node 12 reacting to the transaction.
Therefore, in order to operate effectively with the ordering constraints
imposed by the out queue/in queue structure of address controller 52,
system interface logic block 62 employs ignore signal 70. When a
transaction is presented upon address bus 58 and system interface logic
block 62 detects that a remote transaction is to be performed in response
to the transaction, logic block 62 asserts the ignore signal 70. Assertion
of the ignore signal 70 with respect to a transaction causes address
controller 52 to inhibit storage of the transaction into in queues 74.
Therefore, other transactions which may occur subsequent to the ignored
transaction and which complete locally within SMP node 12A may complete
out of order with respect to the ignored transaction without violating the
ordering rules of in queue 74. In particular, transactions performed by
system interface 24 in response to coherency activity upon network 14 may
be performed and completed subsequent to the ignored tran