| United States Patent | 5303362 |
| Link to this page | http://www.wikipatents.com/5303362.html |
| Inventor(s) | Butts, Jr.; H. Bruce (Redmond, WA);
Orbits; David A. (Redmond, WA);
Abramson; Kenneth D. (Seattle, WA) |
| Abstract | A coherent coupled memory multiprocessor computer system that includes a
plurality of processor modules (11a, 11b . . . ), a global interconnect
(13), an optional global memory (15) and an input/output subsystem (17,19)
is disclosed. Each processor module (11a, 11b . . . ) includes: a
processor (21); cache memory (23); cache memory controller logic (22);
coupled memory (25); coupled memory control logic (24); and a global
interconnect interface (27). Coupled memory (25) associated with a
specific processor (21), like global memory (15), is available to other
processors (21). Coherency between data stored in coupled (or global)
memory and similar data replicated in cache memory is maintained by either
a write-through or a write-back cache coherency management protocol. The
selected protocol is implemented in hardware, i.e., logic, form,
preferably incorporated in the coupled memory control logic (24) and in
the cache memory controller logic (22). In the write-through protocol,
processor writes are propagated directly to coupled memory while
invalidating corresponding data in cache memory. In contrast, the
write-back protocol allows data owned by a cache to be continuously
updated until requested by another processor, at which time the coupled
memory is updated and other cache blocks containing the same data are
invalidated. |
Title Information  |
Coupled memory multiprocessor computer system including cache coherency
management protocols |
| Publication Date |
April 12, 1994 |
| Filing Date |
March 20, 1991 |
Description  |
TECHNICAL AREA
This invention relates to multiprocessor computer systems and, more
particularly, to multiprocessor computer systems in which system memory is
distributed such that a portion of system memory is coupled to each
processor of the system.
BACKGROUND OF THE INVENTION
Memory latency, i.e., the time required to access data or instructions
stored in the memory of a computer, has increasingly become the bottleneck
that prevents the full realization of the speed of contemporary single and
multiprocessor computer systems. This result is occurring because the
speed of integrated processors has outstripped memory subsystem speed. In
order to operate most efficiently and effectively, fast processors require
the contradictory features of reduced memory latency and larger memory
size. Larger memory size implies greater physical size, greater
communication distances, and slower access time due to the additional
signal buffers needed to drive heavily loaded address, data and control
signal lines, all of which increase memory latency. The primary negative
effect of memory latency is its effect on processor speed. The longer it
takes to obtain data from memory, the slower a processor runs because
processors usually remain idle when they are waiting for data. This
negative effect has increased as processor speed has outstripped memory
subsystem speed. Despite the gains made in high-density, high-speed
integrated memories, the progress to date still leaves the memory
subsystem as the speed-limiting link in computer system design. This is
true regardless of whether the computer system includes a single processor
or a plurality of processors.
One way to reduce average memory latency is to add a cache subsystem to a
computer system. A cache subsystem consists of a small memory situated
adjacent to a processor that is hardware controlled rather than software
controlled. Frequently used data and instructions are replicated in cache
memories. Cache subsystems capitalize on the property that once a datum or
instruction has been fetched from system memory, it is very likely that it
will be reused in the near future. Due to the close association between
cache memory and its associated processor and the nature of the control
(hardware as opposed to software), cache memory latency is several times
less than that of system memory. Because access is much more rapid,
overall speed is improved in computer systems that include a cache
subsystem. As memory latency increases with memory size, hierarchical
caches have been developed to maintain average memory latency at a low
level. Some high-performance processors include separate instruction and
data caches that can be simultaneously accessed. (For simplicity of
description, data, instructions and any other forms of information
commonly stored in computer memories are collectively hereinafter referred
to as data.)
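
The effect of a cache on average latency can be illustrated with the standard average-memory-access-time relation. This relation is textbook material and is not stated in the patent text. The following is a minimal sketch assuming a single cache level; the function name and the cycle counts are illustrative assumptions, not values from the disclosure.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time for one cache level.

    hit_time     -- latency of a cache hit (cycles)
    miss_rate    -- fraction of accesses that miss the cache (0.0 to 1.0)
    miss_penalty -- extra latency to reach system memory on a miss (cycles)
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Illustrative numbers only (assumed, not from the patent).
# Without a cache every access pays the full system memory latency.
no_cache = average_access_time(hit_time=0, miss_rate=1.0, miss_penalty=20)
# With a cache, only the misses pay the system memory latency.
with_cache = average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=20)
```

Under these assumed numbers the cached configuration is roughly ten times faster on average, which is the property the paragraph above describes.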
While computer systems that include a cache subsystem have a number of
advantages, one disadvantage is the expense of cache memories. This
disadvantage is compounded because a cache memory does not add capacity to
system memory. Rather, cache memories are add-ons to system memory,
because, as noted above, cache memories replicate data stored in system
memory. The replication of data leads to another disadvantage of cache
memories, namely, the need to maintain coherency between data stored at
two or more locations in the memories of a computer system. More
specifically, because data stored at either location can be independently
updated, a computer system that includes a cache subsystem requires a way
of maintaining coherency between independent sources of the same data. If
coherency is not maintained, data at one location will become stale when
the same data at another location is updated. The use of stale data can
lead to errors.
Several different types of cache management algorithms have been developed
to govern what occurs when data stored in a cache are updated. The
simplest algorithm is known as a "write-through" cache coherency
management protocol. A write-through cache coherency management protocol
causes processor writes to be propagated directly to system memory. All
caches throughout the computer system are searched, and any copies of
written data are either invalidated or updated. While a write-through
cache coherency management protocol can be used with multiprocessor
computer systems that include a large number of processors, a
write-through cache coherency management protocol is better suited for
single processor computer systems or multiprocessor computer systems
incorporating a limited number, e.g., four, of processors.
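
The write-through behavior described above can be sketched as a small software model. This is only an illustration: the class name, the dictionary-based caches and the write counter are assumptions made for the example, and the sketch uses the invalidate option (rather than the update option) for copies held in other caches.

```python
class WriteThroughSystem:
    """Toy model of a write-through, write-invalidate multiprocessor."""

    def __init__(self, num_processors):
        self.memory = {}                                   # system memory
        self.caches = [dict() for _ in range(num_processors)]
        self.memory_writes = 0                             # writes reaching memory

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                              # miss: fetch from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        # Every processor write is propagated directly to system memory.
        self.memory[addr] = value
        self.memory_writes += 1
        # All other caches are searched and stale copies are invalidated.
        for other, cache in enumerate(self.caches):
            if other != cpu:
                cache.pop(addr, None)
        # Assumption: the writer keeps an up-to-date copy in its own cache.
        self.caches[cpu][addr] = value


system = WriteThroughSystem(num_processors=2)
system.write(0, 0x10, 1)
first = system.read(1, 0x10)        # processor 1 caches the value
system.write(0, 0x10, 2)            # invalidates processor 1's copy
second = system.read(1, 0x10)       # processor 1 re-fetches from memory
```

Note that every write reaches system memory in this model, which is why the write traffic grows with the number of processors.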
A more complex, but higher performance, coherency management algorithm is
known as a "write-back" cache coherency management protocol. Like a
write-through cache coherency management protocol, a write-back cache
coherency management protocol is an algorithm that is normally
incorporated in the hardware of a computer system that controls the
operation of a cache. In a write-back cache coherency management protocol,
initial processor writes are written only to cache memory. Later, as
necessary, updated data stored in a cache memory is transferred to system
memory. Updated data transfer occurs when an input/output device or
another processor requires the updated data. A write-back cache coherency
management protocol is better suited than a write-through cache coherency
management protocol for use in multiprocessor computer systems that include
a large number of processors (e.g., 24), because a write-back cache
coherency management protocol greatly reduces write traffic and thus has a
lower impact on the system interconnect.
One of the first write-back coherency management protocols was suggested by
Dr. James R. Goodman in his paper entitled "Using Cache Memory to Reduce
Processor Memory Traffic" (10th International Symposium on Computer
Architecture, 1983). Dr. Goodman's improvement is based on the observation
that if the sole copy of data associated with a specific system memory
location is stored in a cache, the cache copy can be repeatedly modified
without the need to broadcast write-invalidate messages to all other
system caches each time a modification occurs. More specifically, Dr.
Goodman's improvement requires the addition of a state bit to each cache
copy. The state bit indicates that the copy is either "shared" or "owned."
When a system memory location is first read, and data supplied to a cache,
the state bit is set to the "shared" state. If the cache copy is later
written, i.e., modified, the state bit transitions to the "owned" state.
At the same time, a write-invalidate message is broadcast, resulting in
the updated cache copy of the data being identified as the only valid copy
associated with the related system memory location. As long as the cache
location remains in an owned state, it can be rewritten, i.e., updated,
without the need to broadcast further write-invalidate messages. A remote
request for the data at the memory location associated with the cached
copy causes a transition back to the shared state. The read request is
then satisfied either by the cache supplying the data and then updating
the related system memory location, or by the cache delaying the memory
request until valid data has been rewritten to the system memory location.
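
The shared/owned scheme described above can be sketched as a two-state model. The sketch follows only the two-state description given in this section, not the full protocol in Dr. Goodman's paper. The class names, the broadcast counter and the fetch-on-write-miss behavior are assumptions made for illustration.

```python
SHARED, OWNED = "shared", "owned"


class Line:
    """One cached copy together with its state bit."""

    def __init__(self, value):
        self.value = value
        self.state = SHARED          # a freshly read copy starts out shared


class WriteBackSystem:
    def __init__(self, num_processors):
        self.memory = {}
        self.caches = [dict() for _ in range(num_processors)]
        self.invalidate_broadcasts = 0

    def _remote_owner(self, addr, requester):
        for cpu, cache in enumerate(self.caches):
            line = cache.get(addr)
            if cpu != requester and line is not None and line.state == OWNED:
                return line
        return None

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:
            owner = self._remote_owner(addr, cpu)
            if owner is not None:
                # A remote request forces the owner to update system memory
                # and drop back to the shared state.
                self.memory[addr] = owner.value
                owner.state = SHARED
            cache[addr] = Line(self.memory.get(addr, 0))
        return cache[addr].value

    def write(self, cpu, addr, value):
        if addr not in self.caches[cpu]:
            self.read(cpu, addr)     # assumption: fetch the block on a write miss
        line = self.caches[cpu][addr]
        if line.state == SHARED:
            # First write: broadcast one write-invalidate and take ownership.
            self.invalidate_broadcasts += 1
            for other, cache in enumerate(self.caches):
                if other != cpu:
                    cache.pop(addr, None)
            line.state = OWNED
        line.value = value           # later writes while owned stay local


system = WriteBackSystem(num_processors=2)
system.read(0, 5)
system.read(1, 5)
for v in (1, 2, 3):
    system.write(0, 5, v)            # only the first write broadcasts
stale = system.memory.get(5, 0)      # system memory has not been updated yet
remote_value = system.read(1, 5)     # remote read triggers the write-back
```

In this model three consecutive writes cost a single broadcast, which illustrates why the scheme reduces interconnect traffic relative to write-through.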
Recently, proposals have been made to distribute memory throughout a
multiprocessor computer system, rather than use bank(s) of global memory
accessible by all processors via a common interconnect bus. More
specifically, in distributed memory multiprocessor computer systems, a
portion of system memory is physically located adjacent to the processor
the memory portion is intended to serve. Research in this area has grown
out of attempts to find ways of creating effective multiprocessor computer
systems out of a large number of powerful workstations connected together
via a network link. In the past, distributed shared memory computer
networks have used various software-implemented protocols to share memory
space. The software-implemented protocols make the distributed memory
simulate a common global memory accessible by all of the computers
connected to the network. Memory latency is improved because the portion
of memory associated with a specific processor can be accessed by that
processor without use of the network link. An example of this research
work is the BBN Butterfly computer system developed by BBN Laboratories,
Inc. See Butterfly™ Parallel Processor Overview, BBN Report No. 6148,
Version 1, Mar. 6, 1986, and The Uniform System Approach to Programming
the Butterfly™ Parallel Processor, BBN Report No. 6149, Version 2, Jun.
16, 1986.
A drawback of the software-implemented protocols used in the BBN Butterfly
and similar computer systems is their extremely poor performance when the
amount of sharing between processors is large or when the memory
associated with a single processor is insufficient to meet the needs of a
program and it becomes necessary to use the memory associated with another
processor and/or to make data calls to storage devices, such as a hard
disk. In the past, such requirements have significantly reduced processing
speed. Such requirements have also negatively impacted the bandwidth
requirements of the network linking the processors together. A further
disadvantage has been the increased overhead associated with the
management of data stored at different locations in the distributed memory
of the computer system. More specifically, the processing speed of prior
art distributed memory multiprocessor computer systems has been improved
by replicating shared data in the memories associated with the different
processors needing the data. This has a number of disadvantages. First,
replicating data in system memory creates a high memory overhead,
particularly because system memory is stored on a page basis and page
sizes are relatively large. Recently, page sizes of 64K bytes have been
proposed. In contrast, cache memories store data in blocks of considerably
smaller size. A typical cache block of data is 64 bytes. Thus, the
"granularity" of the data replicated in system memory is considerably
larger than the granularity of data replicated in cache memory. The large
granularity size leads to other disadvantages. Greater interconnect
bandwidth is required to transfer larger data granules than smaller data
granules. Coherency problems are increased because of the likelihood that
more processors will be contending for the larger granules than the number
contending for smaller granules on a packet-to-packet basis.
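
The granularity difference can be made concrete with a short calculation. The page and block sizes are the ones cited above; the 8-byte request size and the helper name are assumptions chosen for illustration.

```python
PAGE_BYTES = 64 * 1024    # 64K-byte page size cited in the text
BLOCK_BYTES = 64          # typical cache block size cited in the text


def bytes_moved(bytes_needed, granule_bytes):
    """Bytes transferred when data can move only in whole granules."""
    granules = -(-bytes_needed // granule_bytes)   # ceiling division
    return granules * granule_bytes


ratio = PAGE_BYTES // BLOCK_BYTES          # how many blocks fit in one page
page_cost = bytes_moved(8, PAGE_BYTES)     # sharing 8 bytes at page granularity
block_cost = bytes_moved(8, BLOCK_BYTES)   # sharing 8 bytes at block granularity
```

With these sizes a page is 1024 times larger than a block, so sharing a small item at page granularity moves 1024 times as many bytes across the interconnect.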
In summary, in the last several years, very large-scale integrated circuit
(VLSI) processor speeds have been increased by roughly an order of
magnitude due to continual semiconductor improvements and due to the
introduction of reduced instruction set computer (RISC) architectures. As
processor speeds have improved, large, fast and expensive cache memories
have been needed in order to reduce average memory latency and keep
processor idle times reasonable. Even with improved cache memories and
improved ways of maintaining data coherency, average memory latency
remains the bottleneck to improving the performance of multiprocessor
computer systems.
A major portion of memory latency, i.e., memory access time, is the latency
of the network that interconnects the multiprocessors, memory, and
input/output modules of multiprocessor computer systems. Regardless of
whether the interconnect network is a fully interconnected switching
network or a shared bus, the time it takes for a memory request to travel
between a processor and system memory is directly added to the actual
memory operational latency. Interconnect latency includes not only the
actual signal propagation delays, but overhead delays such as
synchronization of the interconnect timing environment with the
interconnect arbitration, which increases rapidly as processors are added
to a multiprocessor system.
Recent attempts to improve memory latency have involved distributing system
memory so that it is closer to the processors requiring access to the
memory. This has led to sharing data in system memory, which, in turn, has
required the implementation of coherency schemes. In the past, system
memory coherency schemes have been implemented in software. Because they
have been implemented in software, they have been slow. Further, system
memory sharing requires that large data granules be transferred. Large
data granules take up large parts of system memory, require large amounts
of interconnect bandwidth to transfer from one memory location to another,
and are more likely to be referenced by a large number of processors than
smaller data granules. The present invention is directed to providing a
multiprocessor computer system that overcomes these disadvantages.
SUMMARY OF THE INVENTION
The present invention is directed to providing a multiprocessor system that
overcomes the problems outlined above. More specifically, the present
invention is directed to providing a multiprocessor computer system
wherein system memory is broken into sections, denoted coupled memory, and
distributed throughout the system such that a coupled memory is closely
associated with each processor. The close association improves memory
latency and reduces the need for system interconnect bandwidth. Coupled
memory is not cache memory. Cache memory stores replications of data
stored in system memory. Coupled memory is system memory. As a general
rule, data stored at one location in coupled memory is not replicated at
another coupled memory location. Moreover, the granular size of data
stored in caches, commonly called blocks of data, is considerably smaller
than the granular size of data stored in system memory, commonly called
pages. Coupled memory is lower in cost since it does not provide the high
performance of cache memory. Coupled memory is directly accessible by its
associated processor, i.e., coupled memory is accessible by its associated
processor without use of the system interconnect. More importantly, while
coupled memory is closely associated with a specific processor, unlike a
cache, coupled memory is accessible by other system processors via the
system interconnect. In addition to coupled memory, system memory may
include global memory, i.e., memory not associated with a processor, but
rather shared equally by all processors via the system interconnect. For
fast access, frequently used system memory data are replicated in caches
associated with each processor of the system. Cache coherency is
maintained by the system hardware.
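
The distinction between local and remote access to coupled memory can be sketched as follows. The layout is an assumption made for the example: each module is given an equal, contiguous slice of the system address space, with optional global memory placed after the coupled memories. The patent text does not specify an address map, and caches are omitted here for brevity.

```python
class CoupledMemorySystem:
    """Toy model: system memory = per-module coupled memory + global memory."""

    def __init__(self, num_modules, words_per_module, global_words=0):
        self.words_per_module = words_per_module
        self.coupled = [[0] * words_per_module for _ in range(num_modules)]
        self.global_memory = [0] * global_words
        self.interconnect_transfers = 0

    def _locate(self, addr):
        """Return (home module, offset); home is None for global memory."""
        coupled_words = len(self.coupled) * self.words_per_module
        if addr < 0 or addr >= coupled_words + len(self.global_memory):
            raise IndexError("address outside system memory")
        if addr < coupled_words:
            return divmod(addr, self.words_per_module)
        return None, addr - coupled_words

    def _access(self, cpu, addr):
        home, offset = self._locate(addr)
        if home != cpu:
            # Remote coupled memory and global memory are both reached
            # through the system interconnect.
            self.interconnect_transfers += 1
        bank = self.global_memory if home is None else self.coupled[home]
        return bank, offset

    def read(self, cpu, addr):
        bank, offset = self._access(cpu, addr)
        return bank[offset]

    def write(self, cpu, addr, value):
        bank, offset = self._access(cpu, addr)
        bank[offset] = value


system = CoupledMemorySystem(num_modules=2, words_per_module=4, global_words=2)
system.write(0, 1, 7)                 # module 0 writes its own coupled memory
after_local = system.interconnect_transfers
value = system.read(1, 1)             # module 1 reads module 0's coupled memory
after_remote = system.interconnect_transfers
system.read(0, 8)                     # global memory access
after_global = system.interconnect_transfers
```

Only the accesses that leave the home module use the interconnect in this model, which is the source of the latency and bandwidth benefit claimed above.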
More specifically, in accordance with this invention, a coherent coupled
memory multiprocessor computer system that includes a plurality of
processor modules, a global interconnect, an optional global memory and an
input/output subsystem is provided. Each processor module includes a
processor, cache memory, cache memory controller logic, coupled memory,
coupled memory control logic and a global interconnect interface. The
coupled memory associated with each specific processor and global memory,
if any, form system memory, i.e., coupled memory, like global memory, is
available to other processors. Coherency between similar (i.e.,
replicated) data stored in specific coupled memory locations and in both
local and remote caches is maintained by either write-through or
write-back cache coherency management protocols. The cache coherency
management protocols are implemented in hardware, i.e., logic, form and,
thus, constitute a part of the computer system hardware.
In embodiments of the invention incorporating a write-through cache
coherency management protocol, each time a memory reference occurs, the
protocol logic determines if the read or write is of local or remote
origin and the state of a shared bit associated with the related system
memory location. The shared bit denotes if the data or instruction at the
addressed coupled memory location has or has not been shared with a remote
processor. Based on the nature of the command (read or write), the source
of the command (local or remote) and the state of the shared data bit (set
or clear), the write-through protocol logic controls the invalidating of
cache-replicated data or instructions and the subsequent state of the
shared bit. Thereafter, the read or write operation takes place.
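
The three inputs named above (command, source and shared bit) can be modeled as a pure decision function. The passage states only which inputs the logic considers, so the specific transitions in this sketch are hypothetical and chosen for illustration; they are not the transitions defined by the patent's figures. In particular, the sketch assumes that a remote writer does not keep a cached copy.

```python
from typing import NamedTuple


class Action(NamedTuple):
    invalidate_local_cache: bool      # invalidate the home module's cached copy
    invalidate_remote_caches: bool    # send an invalidate over the interconnect
    next_shared: bool                 # state of the shared bit afterwards


def write_through_step(command, source, shared):
    """Hypothetical decision logic for one memory reference."""
    if command not in ("read", "write") or source not in ("local", "remote"):
        raise ValueError("unknown command or source")
    if command == "read":
        # Assumed: a remote read hands out a copy, so the location becomes
        # shared; a local read leaves the shared bit unchanged.
        return Action(False, False, shared or source == "remote")
    if source == "local":
        # Assumed: remote copies can exist only if the shared bit is set,
        # so an unshared local write needs no interconnect traffic.
        return Action(False, shared, False)
    # Assumed: a remote write makes the local cached copy stale, and any
    # other remote copies as well when the location was shared.
    return Action(True, shared, False)


unshared_write = write_through_step("write", "local", shared=False)
remote_read = write_through_step("read", "remote", shared=False)
shared_write = write_through_step("write", "local", shared=True)
```

Under these assumptions, the benefit of the shared bit is that a local write to an unshared location does not generate an invalidate on the interconnect.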
Embodiments of the invention incorporating a write-back cache coherency
management protocol also determine if a read or write is of local or
remote origin. The protocol logic also determines the state of shared and
exclusive bits associated with the addressed coupled memory location.
Based on the nature of the command (read or write), the source of the
command (local or remote) and the state of the shared and exclusive bits
(set or clear), the write-back protocol logic controls the invalidating of
cache-stored data, the subsequent state of the shared and exclusive bits
and the supplying of data to the source of read commands or the writing of
data to the coupled memory. The write-back cache coherency management
protocol logic also determines the state of an ownership bit associated
with replicated data stored in caches and uses the status of the ownership
bit to control the updating of replicated data stored in caches.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other features and advantages of this invention will
become better understood by reference to the following detailed
description of preferred embodiments of the invention when taken in
conjunction with the accompanying drawings, wherein:
FIG. 1 is a block diagram of a coherent coupled memory multiprocessor
system formed in accordance with this invention;
FIG. 2 is a flow diagram illustrating the write operation of a
write-through cache coherency management protocol suitable for use in
embodiments of the invention;
FIG. 3 is a flow diagram illustrating the read operation of a write-through
cache coherency management protocol suitable for use in embodiments of the
invention;
FIG. 4 is a state diagram illustrating the logic used to carry out the
write-through cache coherency management protocol illustrated in FIGS. 2
and 3;
FIG. 5 is a flow diagram illustrating a processor cache read request of a
write-back cache coherency management protocol suitable for use in
embodiments of the invention;
FIG. 6 is a flow diagram illustrating a processor cache write request of a
write-back cache coherency management protocol suitable for use in
embodiments of the invention;
FIG. 7 is a state diagram illustrating the logic used to carry out the
processor cache read and write requests illustrated in FIGS. 5 and 6;
F | | |