CROSS REFERENCE TO RELATED PATENT APPLICATIONS
This patent application is related to the following copending, commonly
assigned patent applications, the disclosures of which are incorporated
herein by reference in their entirety:
1. "Extending The Coherence Domain Beyond A Computer System Bus" by
Hagersten et al., Ser. No. 08/673,059, now U.S. Pat. No. 5,829,033, filed
concurrently herewith.
2. "Method And Apparatus Optimizing Global Data Replies In A Computer
System" by Hagersten, Ser. No. 08/675,284, filed concurrently herewith.
3. "Method And Apparatus Providing Short Latency Round-Robin Arbitration
For Access To A Shared Resource" by Hagersten et al., Ser. No. 08/675,286,
filed concurrently herewith.
4. "Implementing Snooping On A Split-Transaction Computer System Bus" by
Singhal et al., Ser. No. 08/673,038, filed concurrently herewith.
5. "Split Transaction Snooping Bus Protocol" by Singhal et al., Ser. No.
08/673,967, filed concurrently herewith.
6. "Interconnection Subsystem For A Multiprocessor Computer System With A
Small Number Of Processors Using A Switching Arrangement Of Limited
Degree" by Heller et al., Ser. No. 08/675,629, filed concurrently
herewith.
7. "System And Method For Performing Deadlock Free Message Transfer In
Cyclic Multi-Hop Digital Computer Network" by Wade et al., Ser. No.
08/674,277, filed concurrently herewith.
8. "Synchronization System And Method For Plesiochronous Signaling" by
Cassiday et al., Ser. No. 08/674,316, now U.S. Pat. No. 5,799,175, filed
concurrently herewith.
9. "Methods And Apparatus For A Coherence Transformer For Connecting
Computer System Coherence Domains" by Hagersten et al., Ser. No.
08/677,015, filed concurrently herewith.
10. "Methods And Apparatus For A Coherence Transformer With Limited Memory
For Connecting Computer System Coherence Domains" by Hagersten et al.,
Ser. No. 08/677,014, now U.S. Pat. No. 5,829,034, filed concurrently
herewith.
11. "Methods And Apparatus For Sharing Stored Data Objects In A Computer
System" by Hagersten et al., Ser. No. 08/673,130, filed concurrently
herewith.
12. "Methods And Apparatus For A Directory-Less Memory Access Protocol In A
Distributed Shared Memory Computer System" by Hagersten et al., Ser. No.
08/671,303, filed concurrently herewith.
13. "Hybrid Memory Access Protocol In A Distributed Shared Memory Computer
System" by Hagersten et al., Ser. No. 08/673,957, filed concurrently
herewith.
14. "Methods And Apparatus For Substantially Memory-Less Coherence
Transformer For Connecting Computer System Coherence Domains" by Hagersten
et al., Ser. No. 08/677,012, filed concurrently herewith.
15. "A Multiprocessing System Including An Enhanced Blocking Mechanism For
Read To Share Transactions In A NUMA Mode" by Hagersten, Ser. No.
08/674,271, filed concurrently herewith.
16. "Encoding Method For Directory State In Cache Coherent Distributed
Shared Memory Systems" by Guzovskiy et al., Ser. No. 08/672,946, now U.S.
Pat. No. 5,752,258, filed concurrently herewith.
17. "Software Use Of Address Translation Mechanism" by Nesheim et al., Ser.
No. 08/673,043, filed concurrently herewith.
18. "Directory-Based, Shared-Memory, Scaleable Multiprocessor Computer
System Having Deadlock-free Transaction Flow Sans Flow Control Protocol"
by Lowenstein et al., Ser. No. 08/674,358, filed concurrently herewith.
19. "Maintaining A Sequential Stored Order (SSO) In A Non-SSO Machine" by
Nesheim, Ser. No. 08/673,049, filed concurrently herewith.
20. "Node To Node Interrupt Mechanism In A Multiprocessor System" by
Wong-Chan, Ser. No. 08/672,947, filed concurrently herewith.
21. "Deterministic Distributed Multicache Coherence Protocol" by Hagersten
et al., filed Apr. 8, 1996, Ser. No. 08/630,703.
22. "A Hybrid NUMA Coma Caching System And Methods For Selecting Between
The Caching Modes" by Hagersten et al., filed Dec. 22, 1995, Ser. No.
08/577,283, now U.S. Pat. No. 5,710,907.
23. "A Hybrid NUMA Coma Caching System And Methods For Selecting Between
The Caching Modes" by Wood et al., filed Dec. 22, 1995, Ser. No.
08/575,787.
24. "Flushing Of Cache Memory In A Computer System" by Hagersten et al.,
Ser. No. 08/673,881, filed concurrently herewith.
25. "Efficient Allocation Of Cache Memory Space In A Computer System" by
Hagersten et al., Ser. No. 08/675,306, filed concurrently herewith.
26. "Efficient Selection Of Memory Storage Modes In A Computer System" by
Hagersten et al., Ser. No. 08/674,029, now U.S. Pat. No. 5,802,563, filed
concurrently herewith.
27. "Skip-level Write-through In A Multi-level Memory Of A Computer System"
by Hagersten et al., Ser. No. 08/674,560, filed concurrently herewith.
28. "A Multiprocessing System Configured to Perform Efficient Write
Operations" by Hagersten, Ser. No. 08/675,634, now U.S. Pat. No. 5,749,095,
filed concurrently herewith.
29. "A Multiprocessing System Configured to Perform Efficient Block Copy
Operations" by Hagersten, Ser. No. 08/674,269, filed concurrently
herewith.
30. "A Multiprocessing System Including An Apparatus For Optimizing
Spin-Lock Operations" by Hagersten, Ser. No. 08/674,272, filed
concurrently herewith.
31. "A Multiprocessing System Configured to Detect and Efficiently Provide
for Migratory Data Access Patterns" by Hagersten et al., Ser. No.
08/674,330, now U.S. Pat. No. 5,734,922, filed concurrently herewith.
32. "A Multiprocessing System Configured to Store Coherency State within
Multiple Subnodes of a Processing Node" by Hagersten, Ser. No. 08/674,274,
filed concurrently herewith.
33. "A Multiprocessing System Configured to Perform Prefetching Operations"
by Hagersten et al., Ser. No. 08/674,327, filed concurrently herewith.
34. "A Multiprocessing System Configured to Perform Synchronization
Operations" by Hagersten et al., Ser. No. 08/674,328, filed concurrently
herewith.
35. "A Multiprocessing System Having Coherency-Related Error Logging
Capabilities" by Hagersten et al., Ser. No. 08/674,276, filed concurrently
herewith.
36. "Multiprocessing System Employing A Three-Hop Communication Protocol"
by Hagersten, Ser. No. 08/674,270, filed concurrently herewith.
37. "A Multiprocessing Computer System Employing Local and Global Address
Spaces and Multiple Access Modes" by Hagersten, Ser. No. 08/675,635, filed
concurrently herewith.
38. "Multiprocessing System Employing A Coherency Protocol Including A
Reply Count" by Hagersten et al., Ser. No. 08/674,314, filed concurrently
herewith.
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to the field of multiprocessor computer systems and,
more particularly, to communication protocols employed within
multiprocessor computer systems having distributed shared memory
architectures.
2. Description of the Relevant Art
Multiprocessing computer systems include two or more processors which may
be employed to perform computing tasks. A particular computing task may be
performed upon one processor while other processors perform unrelated
computing tasks. Alternatively, components of a particular computing task
may be distributed among multiple processors to decrease the time required
to perform the computing task as a whole. Generally speaking, a processor
is a device configured to perform an operation upon one or more operands
to produce a result. The operation is performed in response to an
instruction executed by the processor.
A popular architecture in commercial multiprocessing computer systems is
the symmetric multiprocessor (SMP) architecture. Typically, an SMP
computer system comprises multiple processors connected through a cache
hierarchy to a shared bus. Additionally connected to the bus is a memory,
which is shared among the processors in the system. Access to any
particular memory location within the memory occurs in a similar amount of
time as access to any other particular memory location. Since each
location in the memory may be accessed in a uniform manner, this structure
is often referred to as a uniform memory architecture (UMA).
Processors are often configured with internal caches, and one or more
caches are typically included in the cache hierarchy between the
processors and the shared bus in an SMP computer system. Multiple copies
of data residing at a particular main memory address may be stored in
these caches. In order to maintain the shared memory model, in which a
particular address stores exactly one data value at any given time, shared
bus computer systems employ cache coherency. Generally speaking, an
operation is coherent if the effects of the operation upon data stored at
a particular memory address are reflected in each copy of the data within
the cache hierarchy. For example, when data stored at a particular memory
address is updated, the update may be supplied to the caches which are
storing copies of the previous data. Alternatively, the copies of the
previous data may be invalidated in the caches such that a subsequent
access to the particular memory address causes the updated copy to be
transferred from main memory. For shared bus systems, a snoop bus protocol
is typically employed. Each coherent transaction performed upon the shared
bus is examined (or "snooped") against data in the caches. If a copy of
the affected data is found, the state of the cache line containing the
data may be updated in response to the coherent transaction.
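The invalidation variant of snooping described above can be illustrated with a minimal sketch. The class names, the write-through behavior, and the dictionary standing in for main memory are simplifying assumptions made for illustration only; they are not taken from any particular bus protocol.

```python
class SnoopingCache:
    """A cache that watches (snoops) writes performed on a shared bus."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> cached value; only valid copies are kept

    def snoop_write(self, address):
        # Invalidate the local copy, if any, so the next read misses.
        self.lines.pop(address, None)


class SharedBus:
    """Hypothetical shared bus with write-through to a single memory."""

    def __init__(self, memory):
        self.memory = memory  # address -> value, stands in for main memory
        self.caches = []

    def attach(self, cache):
        self.caches.append(cache)

    def read(self, cache, address):
        # A miss fetches the current value from main memory.
        if address not in cache.lines:
            cache.lines[address] = self.memory[address]
        return cache.lines[address]

    def write(self, writer, address, value):
        # Every other cache snoops the write and invalidates its copy.
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_write(address)
        writer.lines[address] = value
        self.memory[address] = value  # write-through (assumption)


bus = SharedBus(memory={0x100: 1})
cache_a, cache_b = SnoopingCache("A"), SnoopingCache("B")
bus.attach(cache_a)
bus.attach(cache_b)
bus.read(cache_a, 0x100)
bus.read(cache_b, 0x100)
bus.write(cache_a, 0x100, 2)
stale_copy_present = 0x100 in cache_b.lines
value_seen_by_b = bus.read(cache_b, 0x100)
```

After the write by cache A, cache B no longer holds a copy, so its next read is served with the updated value from memory.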
Unfortunately, shared bus architectures suffer from several drawbacks which
limit their usefulness in multiprocessing computer systems. A bus is
capable of a peak bandwidth (e.g. a number of bytes/second which may be
transferred across the bus). As additional processors are attached to the
bus, the bandwidth required to supply the processors with data and
instructions may exceed the peak bus bandwidth. Since some processors are
forced to wait for available bus bandwidth, performance of the computer
system suffers when the bandwidth requirements of the processors exceed
available bus bandwidth.
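The saturation effect can be made concrete with a short arithmetic sketch. The bandwidth figures used below (2000 MB/s peak, 300 MB/s per processor) are purely hypothetical values chosen for illustration.

```python
def bus_utilization(num_processors, per_processor_demand, peak_bandwidth):
    """Fraction of peak bus bandwidth demanded by all processors together.

    A result above 1.0 means some processors must wait for the bus.
    """
    return num_processors * per_processor_demand / peak_bandwidth


def max_processors_without_stall(per_processor_demand, peak_bandwidth):
    """Largest processor count whose combined demand fits in the peak."""
    return int(peak_bandwidth // per_processor_demand)


# Hypothetical figures, both in MB/s.
PEAK_MB_S = 2000
DEMAND_MB_S = 300

limit = max_processors_without_stall(DEMAND_MB_S, PEAK_MB_S)
oversubscribed = bus_utilization(8, DEMAND_MB_S, PEAK_MB_S) > 1.0
```

Under these assumed figures, six processors fit within the peak bandwidth, while eight processors demand more than the bus can supply.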
Additionally, adding more processors to a shared bus increases the
capacitive loading on the bus and may even cause the physical length of
the bus to be increased. The increased capacitive loading and extended bus
length increase the delay in propagating a signal across the bus. Due to
the increased propagation delay, transactions may take longer to perform.
Therefore, the peak bandwidth of the bus may decrease as more processors
are added.
These problems are further magnified by the continued increase in operating
frequency and performance of processors. The increased performance enabled
by the higher frequencies and more advanced processor microarchitectures
results in higher bandwidth requirements than previous processor
generations, even for the same number of processors. Therefore, buses
which previously provided sufficient bandwidth for a multiprocessing
computer system may be insufficient for a similar computer system
employing the higher performance processors.
Another structure for multiprocessing computer systems is a distributed
shared memory architecture. A distributed shared memory architecture
includes multiple nodes within which processors and memory reside. The
multiple nodes communicate via a network coupled therebetween. When
considered as a whole, the memory included within the multiple nodes forms
the shared memory for the computer system. Typically, directories are used
to identify which nodes have cached copies of data corresponding to a
particular address. Coherency activities may be generated via examination
of the directories.
Distributed shared memory systems are scaleable, overcoming the limitations
of the shared bus architecture. Since many of the processor accesses are
completed within a node, nodes typically have much lower bandwidth
requirements upon the network than a shared bus architecture must provide
upon its shared bus. The nodes may operate at high clock frequency and
bandwidth, accessing the network when needed. Additional nodes may be
added to the network without affecting the local bandwidth of the nodes.
Instead, only the network bandwidth is affected.
The coherence between nodes in a distributed shared memory system is often
kept using a distributed implementation of coherence protocols. Many such
coherence protocols employ four-hop replies wherein a request is first
sent to a home node from a requesting node. The home node responsively
sends read/invalidate demands to slave nodes holding cached copies of the
data. The slaves reply back to the home node according to the demands. The
four-hop reply protocol is completed when the home node replies back to
the requesting node.
Unfortunately, the communication patterns generated when data must be
accessed from a remote node cause a significant amount of network
traffic. In addition, after all slave nodes have replied to the home node,
the requesting node must wait until the home node sends a completion
indication back to the requesting node before the requesting node can
treat the transaction as completed. This may add to the overall latency of
the critical path associated with the coherency transaction.
A multiprocessor computer system having a distributed shared memory system
is thus desirable wherein network traffic is reduced and wherein the
latency in replying to a requesting node is reduced.
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a multiprocessor
computer system employing local and global address spaces and multiple
access modes in accordance with the present invention. In one embodiment,
the multiprocessing computer system comprises a first processing node
including a first processor, a first memory, and a first system interface,
and a second processing node coupled to the first processing node. The
second processing node may include a second memory. The first memory and
the second memory collectively comprise a distributed shared memory
system. The first processor may be configured to initiate a first
transaction having a first address, in which the first address may contain
a first value indicative of a location. The first value
may also correspond to a coherency unit stored in the first memory. In
this embodiment, the first address is a local physical address if the
first value identifies the location as within the first memory of the
first processing node, while the first address is a global address if the
first value identifies a location that is not local to the first processing
node, such as a location within the second memory. Further, in this
embodiment, the first system interface may be configured to initiate a
NUMA coherency request if the first address is a global address, and a
COMA coherency request if the address is a local physical address and a
corresponding coherency unit stored within the first memory is a copy of a
third coherency unit stored within the second memory. The NUMA coherency
request causes the first processor to complete transactions upon data
stored within the second processing node. The COMA coherency request
causes the first processor to complete transactions upon data stored local
to the first processing node. In one embodiment, when a request is sent by
a requesting node to a home node, the home node sends read and/or
invalidate demands to any slave nodes holding cached copies of the
requested data. The demands from the home node to the slave nodes may each
advantageously include a value indicative of the number of replies the
requesting agent should expect to receive. The slaves reply back to the
requesting node with either data or an acknowledge. Each reply may further
include the number of replies the requester should expect. Upon receiving
all expected replies, the requesting node may treat the transaction as
completed and proceed with subsequent processing. In this manner, all
communications may require at most a three-hop communication on the
critical path of the cache coherence protocol. Accordingly, the overall
network traffic as a result of the cache coherence protocol may be
advantageously reduced. Furthermore, the latency of the critical path for
a requesting node to complete a transaction may be reduced.
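The choice between a NUMA coherency request and a COMA coherency request, described earlier in this summary, can be sketched as a simple classification. The function name, the boolean inputs, and the LOCAL_ONLY case are hypothetical and are introduced for illustration only; the summary itself specifies only the NUMA and COMA conditions.

```python
from enum import Enum


class RequestKind(Enum):
    NUMA = "numa"
    COMA = "coma"
    LOCAL_ONLY = "local_only"  # assumption: case not covered by the summary


def classify_request(address_is_global, unit_is_copy_of_remote):
    """Pick the kind of coherency request for a transaction address.

    address_is_global: the address value identifies a location that is
        not within the local memory.
    unit_is_copy_of_remote: the address is a local physical address and
        the local coherency unit is a copy of a unit in another node.
    """
    if address_is_global:
        return RequestKind.NUMA
    if unit_is_copy_of_remote:
        return RequestKind.COMA
    return RequestKind.LOCAL_ONLY


numa_case = classify_request(address_is_global=True, unit_is_copy_of_remote=False)
coma_case = classify_request(address_is_global=False, unit_is_copy_of_remote=True)
```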
In one implementation, after the requesting node has received all expected
replies, the requesting node may send a completion message back to the
home. The home node may then remove a "block" placed upon the coherency
unit of the completed transaction.
The requesting node may further or alternatively send data back to the home
node to achieve memory reflection after receiving data from a slave node.
Furthermore, in cases where the home node contains the requested data in
an appropriate state, e.g., state shared for a read-to-own request, the
home node does not send any demands to other nodes. Instead, the home node
replies directly to the requesting node.
A system and method in accordance with the present invention may
advantageously allow for an efficient and simple implementation of a
global coherency protocol in a multiprocessing computer system. The
protocol allows for an owner-based protocol wherein several dirty cached
copies may reside in differing nodes with one of them in the owner state
and a copy in the home node which is stale.
BRIEF DESCRIPTION OF THE DRAWINGS
Other objects and advantages of the invention will become apparent upon
reading the following detailed description and upon reference to the
accompanying drawings in which:
FIG. 1 is a block diagram of a multiprocessor computer system.
FIG. 1A is a conceptualized block diagram depicting a non-uniform memory
architecture supported by one embodiment of the computer system shown in
FIG. 1.
FIG. 1B is a conceptualized block diagram depicting a cache-only memory
architecture supported by one embodiment of the computer system shown in
FIG. 1.
FIG. 2 is a block diagram of one embodiment of a symmetric multiprocessing
node depicted in FIG. 1.
FIG. 2A is an exemplary directory entry stored in one embodiment of a
directory depicted in FIG. 2.
FIG. 3 is a block diagram of one embodiment of a system interface shown in
FIG. 1.
FIG. 4 is a diagram depicting activities performed in response to a typical
coherency operation between a request agent, a home agent, and a slave
agent.
FIG. 5A is a diagram of an exemplary coherency operation performed in
response to a read to own request from a processor.
FIG. 5B is a diagram depicting coherency activity in response to a read to
own request when a slave agent is the current owner of the coherency unit
and other slave agents have shared copies of the coherency unit.
FIG. 5C is a diagram that depicts coherency activity when a request agent
has a shared copy and sends a read to own request to a home agent.
FIG. 5D is a diagram depicting coherency activity in response to a read to
share request when a slave is the owner of a coherency unit.
FIG. 6 is a flowchart depicting an exemplary state machine for one
embodiment of a request agent shown in FIG. 3.
FIG. 7 is a flowchart depicting an exemplary state machine for one
embodiment of a home agent shown in FIG. 3.
FIG. 8 is a flowchart depicting an exemplary state machine for one
embodiment of a slave agent shown in FIG. 3.
FIG. 9 is a table listing request types according to one embodiment of the
system interface.
FIG. 10 is a table listing demand types according to one embodiment of the
system interface.
FIG. 11 is a table listing reply types according to one embodiment of the
system interface.
FIG. 12 is a table listing completion types according to one embodiment of
the system interface.
FIG. 13 is a table describing coherency operations in response to various
operations performed by a processor, according to one embodiment of the
system interface.
While the invention is susceptible to various modifications and alternative
forms, specific embodiments thereof are shown by way of example in the
drawings and will herein be described in detail. It should be understood,
however, that the drawings and detailed description thereto are not
intended to limit the invention to the particular form disclosed, but on
the contrary, the intention is to cover all modifications, equivalents and
alternatives falling within the spirit and scope of the present invention
as defined by the appended claims.
DETAILED DESCRIPTION OF THE INVENTION
Turning now to FIG. 1, a block diagram of one embodiment of a
multiprocessing computer system 10 is shown. Computer system 10 includes
multiple SMP nodes 12A-12D interconnected by a point-to-point network 14.
Elements referred to herein with a particular reference number followed by
a letter will be collectively referred to by the reference number alone.
For example, SMP nodes 12A-12D will be collectively referred to as SMP
nodes 12. In the embodiment shown, each SMP node 12 includes multiple
processors, external caches, an SMP bus, a memory, and a system interface.
For example, SMP node 12A is configured with multiple processors including
processors 16A-16B. The processors 16 are connected to external caches 18,
which are further coupled to an SMP bus 20. Additionally, a memory 22 and
a system interface 24 are coupled to SMP bus 20. Still further, one or
more input/output (I/O) interfaces 26 may be coupled to SMP bus 20. I/O
interfaces 26 are used to interface to peripheral devices such as serial
and parallel ports, disk drives, modems, printers, etc. Other SMP nodes
12B-12D may be configured similarly.
Generally speaking, for any given transaction a particular SMP node 12 may
serve as a requesting node, a home node, or a slave node. When a request
is sent by a requesting node to a home node, the home node sends read
and/or invalidate demands to any slave nodes holding cached copies of the
requested data. The demands from the home node to the slave nodes
advantageously include a value indicative of the number of replies the
requesting agent should expect to receive. The slaves reply back to the
requesting node with either data or an acknowledge. Each reply may further
include the number of replies the requester should expect. Upon receiving
all expected replies, the requesting node may treat the transaction as
completed and proceed with subsequent processing. In this manner, all
communications may require at most a three-hop communication on the
critical path of the cache coherence protocol. Accordingly, the overall
network traffic as a result of the cache coherence protocol may be
advantageously reduced. Furthermore, the latency of the critical path for
a requesting node to complete a transaction may be reduced.
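The reply-count bookkeeping at the requesting node can be sketched as follows. This is a minimal illustration under stated assumptions: the Reply and RequestTracker names, their fields, and the consistency check are hypothetical and do not reflect the actual message formats of system interface 24.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    sender: str            # slave (or home) node that sent the reply
    expected_replies: int  # count forwarded from the home node's demand
    has_data: bool         # True for a data reply, False for an acknowledge


class RequestTracker:
    """Tracks replies for one outstanding request at the requesting node."""

    def __init__(self):
        self.expected = None  # unknown until the first reply arrives
        self.received = 0
        self.data_seen = False

    def on_reply(self, reply):
        """Record a reply and return True once the request is complete."""
        if self.expected is None:
            self.expected = reply.expected_replies
        elif self.expected != reply.expected_replies:
            raise ValueError("replies disagree on the expected reply count")
        self.received += 1
        self.data_seen = self.data_seen or reply.has_data
        return self.is_complete()

    def is_complete(self):
        return self.expected is not None and self.received >= self.expected


tracker = RequestTracker()
progress = [
    tracker.on_reply(Reply("slave1", 3, has_data=False)),
    tracker.on_reply(Reply("slave2", 3, has_data=True)),
    tracker.on_reply(Reply("slave3", 3, has_data=False)),
]
```

Because every reply carries the count, the requesting node can decide locally that the transaction is complete without waiting for a further message from the home node.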
In one implementation, after the requesting node has received all expected
replies, the requesting node may send a completion message back to the
home. The home node may remove a "block" placed upon the coherency unit of
the completed transaction.
The requesting node may further or alternatively send data back to the home
node to achieve memory reflection after receiving data from a slave node.
Furthermore, in cases where the home node contains the requested data in
an appropriate state, e.g., state shared for a read-to-own request, the
home node does not send any demands to other nodes. Instead, the home node
replies directly to the requesting node. Further details regarding the
communication protocol associated with system 10 are provided further
below.
As used herein, a memory operation is an operation causing transfer of data
from a source to a destination. The source and/or destination may be
storage locations within the initiator, or may be storage locations within
memory. When a source or destination is a storage location within memory,
the source or destination is specified via an address conveyed with the
memory operation. Memory operations may be read or write operations. A
read operation causes transfer of data from a source outside of the
initiator to a destination within the initiator. Conversely, a write
operation causes transfer of data from a source within the initiator to a
destination outside of the initiator. In the computer system shown in FIG.
1, a memory operation may include one or more transactions upon SMP bus 20
as well as one or more coherency operations upon network 14.
Each SMP node 12 is essentially an SMP system having memory 22 as the
shared memory. Processors 16 are high performance processors. In one
embodiment, each processor 16 is a SPARC processor compliant with version
9 of the SPARC processor architecture. It is noted, however, that any
processor architecture may be employed by processors 16.
Typically, processors 16 include internal instruction and data caches.
Therefore, external caches 18 are labeled as L2 caches (for level 2,
wherein the internal caches are level 1 caches). If processors 16 are not
configured with internal caches, then external caches 18 are level 1
caches. It is noted that the "level" nomenclature is used to identify
proximity of a particular cache to the processing core within processor
16. Level 1 is nearest the processing core, level 2 is next nearest, etc.
External caches 18 provide rapid access to memory addresses frequently
accessed by the processor 16 coupled thereto. It is noted that external
caches 18 may be configured in any of a variety of specific cache
arrangements. For example, set-associative or direct-mapped configurations
may be employed by external caches 18.
SMP bus 20 accommodates communication between processors 16 (through caches
18), memory 22, system interface 24, and I/O interface 26. In one
embodiment, SMP bus 20 includes an address bus and related control
signals, as well as a data bus and related control signals. Because the
address and data buses are separate, a split-transaction bus protocol may
be employed upon SMP bus 20. Generally speaking, a split-transaction bus
protocol is a protocol in which a transaction occurring upon the address
bus may differ from a concurrent transaction occurring upon the data bus.
Transactions involving address and data include an address phase in which
the address and related control information is conveyed upon the address
bus, and a data phase in which the data is conveyed upon the data bus.
Additional address phases and/or data phases for other transactions may be
initiated prior to the data phase corresponding to a particular address
phase. An address phase and the corresponding data phase may be correlated
in a number of ways. For example, data transactions may occur in the same
order that the address transactions occur. Alternatively, address and data
phases of a transaction may be identified via a unique tag.
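Tag-based correlation of address and data phases can be sketched as shown below. The class and method names are hypothetical, and the sequential tag allocation is an assumption; the sketch shows only that a data phase may arrive out of order and still be matched to its address phase.

```python
class SplitTransactionBus:
    """Matches data phases to earlier address phases by a unique tag."""

    def __init__(self):
        self._pending = {}  # tag -> address awaiting its data phase
        self._next_tag = 0

    def address_phase(self, address):
        """Start a transaction and return the tag that identifies it."""
        tag = self._next_tag
        self._next_tag += 1
        self._pending[tag] = address
        return tag

    def data_phase(self, tag, data):
        """Complete the transaction identified by tag."""
        address = self._pending.pop(tag)  # KeyError if the tag is unknown
        return address, data

    def outstanding(self):
        return len(self._pending)


bus = SplitTransactionBus()
tag_a = bus.address_phase(0x1000)
tag_b = bus.address_phase(0x2000)  # second address phase before any data
first_done = bus.data_phase(tag_b, "data for 0x2000")  # out of order
second_done = bus.data_phase(tag_a, "data for 0x1000")
```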
Memory 22 is configured to store data and instruction code for use by
processors 16. Memory 22 preferably comprises dynamic random access memory
(DRAM), although any type of memory may be used. Memory 22, in conjunction
with similar illustrated memories in the other SMP nodes 12, forms a
distributed shared memory system. Each address in the address space of the
distributed shared memory is assigned to a particular node, referred to as
the home node of the address. A processor within a different node than the
home node may access the data at an address of the home node, potentially
caching the data. Therefore, coherency is maintained between SMP nodes 12
as well as among processors 16 and caches 18 within a particular SMP node
12A-12D. System interface 24 provides internode coherency, while snooping
upon SMP bus 20 provides intranode coherency.
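One way to picture the assignment of addresses to home nodes is to let a field of the address select the node. The bit widths and the use of the high-order bits below are assumptions made for illustration only; the text above does not specify how the assignment is encoded.

```python
NODE_BITS = 2     # hypothetical: enough to name four nodes
OFFSET_BITS = 30  # hypothetical: bytes of address space per node = 2**30


def home_node(address):
    """Return the node number assigned as home for this address."""
    return (address >> OFFSET_BITS) & ((1 << NODE_BITS) - 1)


def is_remote(address, local_node):
    """True if the address has a home node other than local_node."""
    return home_node(address) != local_node


first_node_home = home_node(0x0000_1234)
last_node_home = home_node((3 << OFFSET_BITS) | 0x1234)
needs_network = is_remote(1 << OFFSET_BITS, local_node=0)
```

Under this assumed layout, an access by a processor to an address whose home is another node is the case in which internode coherency must be maintained by the system interface.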
In addition to maintaining internode coherency, system interface 24 detects
addresses upon SMP bus 20 which require a data transfer to or from another
SMP node 12. System interface 24 performs the transfer, and provides the
corresponding data for the transaction upon SMP bus 20. In the embodiment
shown, system interface 24 is coupled to a point-to-point network 14.
However, it is noted that in alternative embodiments other networks may be
used. In a point-to-point network, individual connections exist between
each node upon the network. A particular node communicates directly with a
second node via a dedicated link. To communicate with a third node, the
particular node utilizes a different link than the one used to communicate
with the second node.
It is noted that, although four SMP nodes 12 are shown in FIG. 1,
embodiments of computer system 10 employing any number of nodes are
contemplated.
FIGS. 1A and 1B are conceptualized illustrations of distributed memory
architectures supported by one embodiment of computer system 10.
Specifically, FIGS. 1A and 1B illustrate alternative ways in which each
SMP node 12 of FIG. 1 may cache data and perform memory accesses. Details
regarding the manner in which computer system 10 supports such accesses
will be described in further detail below.
Turning now to FIG. 1A, a logical diagram depicting a first memory
architecture 30 supported by one embodiment of computer system 10 is
shown. Architecture 30 includes multiple processors 32A-32D, multiple
caches 34A-34D, multiple memories 36A-36D, and an interconnect network 38.
The multiple memories 36 form a distributed shared memory. Each address
within the address space corresponds to a location within one of memories
36.
Architecture 30 is a non-uniform memory architecture (NUMA). In a NUMA
architecture, the amount of time required to access a first memory address
may be substantially different than the amount of time required to access
a second memory address. The access time depends upon the origin of the
access and the location of the memory 36A-36D which stores the accessed
data. For example, if processor 32A accesses a first memory address stored
in memory 36A, the access time may be significantly shorter than the
access time for an access to a second memory address stored in one of
memories 36B-36D. That is, an access by processor 32A to memory 36A may be
completed locally (e.g. without transfers upon network 38), while a
processor 32A access to memory 36B is performed via network 38. Typically,
an access through network 38 is slower than an access completed within a
local memory. For example, a local access might be completed in a few
hundred nanoseconds while an access via the network might occupy a few
microseconds.
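The effect of the local/remote split on average access time can be estimated with a weighted average. The latencies used below (300 ns local, 3000 ns remote) are hypothetical values chosen to match the orders of magnitude mentioned above, and the model ignores caching and contention.

```python
def average_access_time_ns(local_fraction, local_ns, remote_ns):
    """Weighted average access time for a mix of local and remote accesses."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies in nanoseconds.
LOCAL_NS = 300
REMOTE_NS = 3000

mostly_local = average_access_time_ns(0.9, LOCAL_NS, REMOTE_NS)
evenly_split = average_access_time_ns(0.5, LOCAL_NS, REMOTE_NS)
```

With these assumed figures, even a modest share of remote accesses dominates the average, which is why the placement of data relative to the accessing processor matters in a NUMA architecture.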
Data corresponding to addresses stored in remote nodes may be cached in any
of the caches 34. However, once a cache 34 discards the data corresponding
to such a remote address, a subsequent access to the remote address is
completed via a transfer upon network 38.
NUMA architectures may provide excellent performance characteristics for
software applications which use addresses that correspond primarily to a
particular local memory. Software applications which exhibit more random
access patterns and which do not confine their memory accesses to
addresses within a particular local memory, on the other hand, may
experience a large amount of network traffic as a particular processor 32
performs repeated accesses to remote nodes.
Turning now to FIG. 1B, a logical diagram depicting a second memory
architecture 40 supported by the computer system 10 of FIG. 1 is shown.
Architecture 40 includes multiple processors 42A-42D, multiple caches
44A-44D, multiple memories 46A-46D, and network 48. However, memorie