|
|
|
| United States Patent | 5581777 |
| Link to this page | http://www.wikipatents.com/5581777.html |
| Inventor(s) | Kim; Won S. (Fremont, CA);
Bulfer; David M. (Mountain View, CA);
Nickolls; John R. (Los Altos, CA);
Blank; W. Thomas (Palo Alto, CA);
Figel; Hannes (Milpitas, CA) |
| Abstract | A massively parallel processor is provided with a plurality of clusters.
Each cluster includes a plurality of processor elements ("PEs") and a
cluster memory. Each PE of the cluster has associated with it an address
register, a stage register, an error register, a PE enable flag, a memory
flag, and a grant request flag. A cluster data bus and an error bus
connects each of the stage registers and error registers of the cluster to
the memory. The grant request flags of the cluster are interconnected by a
polling network, which polls only one of the grant request flags at a
time. In response to a signal on the polling network and the state of the
associated memory flag, the grant request flag determines an I/O operation
between the associated data register and the cluster memory over the
cluster data bus. Both data and error bits are associated with respective
processor elements. The sequential memory operations proceed in parallel
with parallel processor operations. The sequential memory operations also
may be queued. Addressing modes include direct and indirect. In direct
address mode, a PE addresses its own address space by appending its PE
number to a broadcast partial address. The broadcast partial address is
furnished over a broadcast bus, and the PE number is furnished on a
cluster address bus. In indirect address mode, a PE addresses either its
own address space or that of other PEs in its cluster by locally
calculating a partial address, then appending to it either its own PE
number or that of another PE in its cluster. The full address is furnished
over the cluster address bus. |
|
|
|
Title Information  |
|
|
|
|
|
Drawing from US Patent 5581777 |
|
|
Parallel processor memory transfer system using parallel transfers
between processors and staging registers and sequential transfers
between staging registers and memory |
|
|
|
|
|
| Publication Date |
December 3, 1996 |
|
|
|
|
|
| Filing Date |
March 3, 1995 |
|
|
|
|
|
|
|
|
|
|
|
| Parent Case |
This application is a continuation of application Ser. No. 08/159,916,
filed Nov. 30, 1993, which is a continuation of application Ser. No.
07/461,567, filed Jan. 05, 1990 both now abandoned. |
|
|
|
|
|
|
|
|
|
|
|
|
|
Title Information  |
|
|
References  |
|
|
| *references marked with an asterisk below are user-added references |
|
U.S. References |
|
|
| Add a new US reference: |
| | Reference | Relevancy | Comments | Reference | Relevancy | Comments | 3863233
|      Your vote accepted [0 after 0 votes] | | 5165023 Gifford 710/317 Nov,1992 |      Your vote accepted [0 after 0 votes] | | 5060186 Barbagelata 711/2 Oct,1991 |      Your vote accepted [0 after 0 votes] | | 5010476 Davis 711/147 Apr,1991 |      Your vote accepted [0 after 0 votes] | | 4891787 Gifford 712/205 Jan,1990 |      Your vote accepted [0 after 0 votes] | | 4873626 Gifford 710/120 Oct,1989 |      Your vote accepted [0 after 0 votes] | | 4860249 Nicely 712/16 Aug,1989 |      Your vote accepted [0 after 0 votes] | | 4852048 Morton 712/11 Jul,1989 |      Your vote accepted [0 after 0 votes] | | 4827403 Steele, Jr. 712/13 May,1989 |      Your vote accepted [0 after 0 votes] | | 4805173 Hillis 714/765 Feb,1989 |      Your vote accepted [0 after 0 votes] | | 4805091 Thiel 712/12 Feb,1989 |      Your vote accepted [0 after 0 votes] | | 4797815 Moore 710/109 Jan,1989 |      Your vote accepted [0 after 0 votes] | | 4791641 Hillis 714/757 Dec,1988 |      Your vote accepted [0 after 0 votes] | | 4783738 Li 712/21 Nov,1988 |      Your vote accepted [0 after 0 votes] | | 4773038 Hillis 703/20 Sep,1988 |      Your vote accepted [0 after 0 votes] | | 4719621 May 370/402 Jan,1988 |      Your vote accepted [0 after 0 votes] | | 4709327 Hillis 712/14 Nov,1987 |      Your vote accepted [0 after 0 votes] | | 4598400 Hillis 370/400 Jul,1986 |      Your vote accepted [0 after 0 votes] | | 4591981 Kassabov 712/22 May,1986 |      Your vote accepted [0 after 0 votes] | | 4481580 Martin 710/305 Nov,1984 |      Your vote accepted [0 after 0 votes] | | 4462073 Grondalski 712/207 Jul,1984 |      Your vote accepted [0 after 0 votes] | | 4447877 Grondalski 713/502 May,1984 |      Your vote accepted [0 after 0 votes] | | 4316244 Grondalski 711/168 Feb,1982 |      Your vote accepted [0 after 0 votes] | | 4314349 Batcher 708/230 Feb,1982 |      Your vote accepted [0 after 0 votes] | | 3987419 Morrill 711/112 Oct,1976 |      Your vote accepted [0 after 0 votes] | | 3936806 Batcher 712/10 Feb,1976 |      Your vote accepted [0 after 0 votes] | | 3812467 Batcher 712/300 May,1974 |      Your vote accepted [0 after 0 votes] | | 3800289 Batcher 711/104 Mar,1974 |      Your vote accepted [0 after 0 votes] | | 4980854 Donaldson 710/117 Dec,1969 |      Your vote accepted [0 after 0 votes] | | | | | |
|
|
|
|
U.S. References |
|
|
Foreign References |
|
|
|
|
|
|
Foreign References |
|
|
Other References |
|
|
|
|
|
|
Other References |
|
|
|
|
|
References  |
|
|
|
|
|
| Market Size |
|
Estimate the gross annual revenues of the relevant market
sector:
|
| | |
| |
|
|
| Market Share |
|
Estimate the percentage of the relevant market sector this invention will capture:
|
| | |
| |
|
|
| Reasonable Royalty |
|
What percentage of gross sales should the inventor or assignee be paid?
|
| | |
| |
|
|
|
Public's "Guesstimation" of Royalty Value
|
| Market Size | N/A | [No votes] | | x | Market Share | N/A | [No votes] | | x | Reasonable Royalty | N/A | [No votes] |
| | N/A | |
| |
|
|
|
|
|
|
|
|
|
|
|
|
Market Review  |
|
|
Technical Review  |
|
|
Claims  |
|
|
What is claim is:
1. A method for performing a memory write operation in a parallel processor
memory system that includes a processor controller, a transfer controller,
a plurality of processor elements, and a memory, each of said processor
elements having a processor, said method comprising the steps of:
for each of a plurality of clusters in parallel with the other of said
clusters, wherein each of the clusters comprises a respective subset of
the processor elements and a plurality of stage registers respectively
connected to the processor elements of the subset, selecting a further
subset of the processor elements therein for participation in a memory
write operation;
placing said processor controller in an interrupted state; for each of said
clusters in parallel with the other of said clusters, transferring data in
parallel from the processor elements of said selected further subset of
processor elements to the stage registers of said selected further subset
of processor elements while said processor controller is in said
interrupted state;
placing said processor controller in a non-interrupted state following
completion of said processor-to-stage register data transferring step;
placing said transfer controller in a memory busy state following the
completion of said processor-to-stage register data transferring step; for
each of said clusters concurrently with the other of said clusters,
transferring data sequentially from the stage registers of said selected
further subset of processor elements to an associated memory while said
transfer controller is in said memory bUSY state; and placing said
transfer controller in a non-memory busy state following the completion of
said stage register-to-memory data transferring step.
2. A method as in claim 1, wherein for each of said clusters, said selected
further subset is all of the processor elements in said each cluster.
3. A method as in claim 1, wherein for each of said clusters, said selected
further subset is less than all of the processor elements in said each
cluster.
4. A method as in claim 1 wherein said further processor element subset
selecting step comprises the step, for each of said clusters in parallel,
of assigning values to respective associated data transfer enable flags of
the processor elements of said each cluster for indicating which of the
processor elements of said each cluster participate in said parallel data
transferring step.
5. A method for performing a memory read operation in a parallel processor
memory system that includes a processor controller, a transfer controller,
a plurality of processor elements, and a memory, each of said processor
elements having a processor, said method comprising the steps of:
for each of a plurality of clusters in parallel with the other of said
clusters, wherein each of the clusters comprises a respective subset of
the processor elements and a plurality of stage registers respectively
connected to the processor elements of the subset, selecting a further
subset of the processor elements therein for participation in a memory
read operation;
placing said transfer controller in a memory busy state; for each of said
clusters concurrently with the other of said clusters, transferring data
sequentially from an associated memory to the stage registers of said
selected further subset of processor elements while said transfer
controller is in said memory state;
placing said transfer controller in a non-memory busy state following
completion of said memory-to-stage register data transferring step;
placing said processor controller in an interrupted state followinq the
completion of said memory-to-stage register data transferring step;
for each of said clusters in parallel with the other of said clusters,
transferring data in parallel from the stage registers of said selected
further subset of processor elements to the processors of said selected
further subset of processor elements while said processor controller is in
said interrupted state; and
placing said processor controller in a non-interrupted state following the
completion of said stage register-to-processor data transferring step.
6. A method as in claim 5, wherein for each of said clusters, said selected
further subset is all of the processor elements in said each cluster.
7. A method as in claim 5, wherein for each of said clusters, said selected
further subset is less than all of the processor elements in said each
cluster. elements of said each cluster having their respective grant
request flags set and the memory associated with said each cluster.
8. A method as in claim 5 wherein said further processor element subset
selecting step comprises the step, for each of said clusters in parallel,
of assigning values to respective associated data transfer enable flags of
the processor elements of said each cluster for indicating which of the
processor elements of said each cluster participate in said parallel data
transferring step. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
CROSS REFERENCE TO RELATED APPLICATIONS
The following commonly-assigned patent applications, which are filed on
even date herewith, are hereby incorporated herein by reference: Nickolls
et al., "Scalable Inter-Processor and Processor to I/O Messaging System
for Parallel Processing Arrays, " Ser. No. 07/461,492, Taylor, "Network
and Method for Interconnecting Router elements Within Parallel Computer
System," Ser. No. 07/467,572, and Zapisek U.S. Pat. No. 5,313.590, "Router
Chip with Quad-Crossbar and Hyperbar Personalities" Ser. No. 07/461,551,
now U.S. Pat. No. 5,434,977.
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to data transfers systems for massively parallel
processors, and more specifically to data addressing and the transfer of
data between a number of SIMD parallel processors arranged in a cluster
and a common cluster memory.
2. Background of the Invention
Parallel processors have been developed that are based on the concurrent
execution of the same instruction by a large number of relatively simple
"processor elements" operating on respective data streams. These
processors, known as single instruction, multiple data ("SIMD")
processors, are useful in such applications as image processing, signal
processing, artificial intelligence, data base operations, and
simulations.
Typically, a SIMD processor includes an array of processor elements and a
routing network over which the results of calculations and operandi are
communicated among the processor elements and input/output ("I/O")
devices. The operations of the processor elements and of the routing
network are controlled by a separate control processor, in response to
instructions and data furnished from a computer subsystem.
A recent SIMD processor is described in U.S. Pat. No. 4,314,349, issued
Feb. 2, 1982 to Batcher. A processing element constitutes the basic
building block, and each processing element is connected to a uniquely
associated random access memory by a bi-directional data bus. The data bus
is the main data path for the processing element. During each machine
cycle, one bit of data can be transferred from any one of six sources,
viz. a bit read from RAM memory, the state of the B, C, P, or S register,
or the state of an equivalence function. The destination of a data bit on
the data bus may be one or more of, for example, the following: the
address location in the RAM memory, the A, G, or S register, the logic
associated with the P register, the input to a sum-OR tree, and the input
to the parity tree. It will be appreciated that during a memory I/O
operation, the bus is reserved for memory data and other operations
requiring bus access may not proceed.
The SIMD processor described in U.S. Pat. No. 4,805,173, issued Feb. 14,
1989 to Hillis et al. includes an error control and correction technique
which operates across multiple processors and multiple computer memories.
Each integrated circuit includes sixteen processors which are connected to
an associated memory through a memory interface. The memory is in the form
of 22 4K.times.1 bit RAMs. Each of 16 4K.times.1 slices functions as the
memory for a different one of the 16 processors, and the remaining 6
4K.times.1 bit slices store parity or syndrome bits for the data stored in
the memories of the 16 processors. Parallel data is read from or written
to each integrated circuit at the address specified by an address decoder.
In a memory read operation, the memory is read in parallel one row at a
time to produce data outputs on 16 output lines and parity outputs on 6
additional output lines. These signals then are applied in parallel to
error control circuitry for detection and correction of parity errors. In
a memory write operation, data is written in parallel from the 16
processors into the 16 memory slices at the given address, and the 6
syndrome outputs are written into the 6 other memory slices at the same
addresses and at the same time as the data used in generating the syndrome
outputs are stored in the 16 memory slices. It will be appreciated that
the error control and correction technique requires that all 16 memory
slices be read or written in parallel.
A SIMD processor related to the Hillis et al. '173 processor appears in
U.S. Pat. No. 4,791,641, issued Dec. 13, 1988 to Hillis. The error
correction system of the Hillis patent treats data for plural memories
associated with plural processors as a unitary data word, and generates an
error code for the unitary data word. It will be appreciated that the
error correction system contemplates that the unitary data word be read or
written as a unit.
In parallel processor systems, the size of the memory per processor element
tends to be small. Nonetheless, because of the great number of processor
elements, the amount of memory required by a parallel processor is large.
Unfortunately, memory such as SRAM with speeds comparable to that of even
simple microprocessors tends to be expensive. Unfortunately, the less
expensive memory such as DRAM is relatively slow and its use in a parallel
processor would be expected to compromise performance by causing the
processor elements to idle while the memory operation is completed.
SUMMARY OF THE INVENTION
The memory system architecture of our invention permits data transfer
operations between memory and processors to proceed concurrently with
parallel processing operations. Relatively slow discrete memory may be
used without excessively degrading the overall speed of the parallel
processor. In one embodiment having DRAMs, memory is utilized at or near
the maximum memory bandwidth in both sequential page mode and random mode.
The memory system architecture of our invention provides for addressing by
a processor element of its own memory space to be made in accordance with
an address furnished to all processor elements by an array control unit,
or computed locally by the processor element. Such addressing is possible
even as to memory words that includes both data bits and error detection
and correction bits.
These and other advantages are realized in the present invention, a memory
system suitable for, for example, a parallel processor having a plurality
of clusters, each having a plurality of processor elements. In one
embodiment, each processor element has an enable flag. The memory system
includes a number of memory flags, each associated with a respective
processor element. The memory system also includes a number of stage
registers, each associated with a respective processor element. A memory
is provided, and a cluster data bus connects each of the stage registers
of the cluster to the memory. In addition, a number of grant request flags
interconnected in a polling network are provided. Each of the grant
request flags is associated with a respective processor element and is
responsive to a signal on the polling network and the state of the
associated memory flag for determining I/O operation between the
associated data register and the memory.
In one variation of the foregoing, a number of address registers are
provided, each associated with a respective processor element. The address
registers are connected to the memory over a cluster address bus, and each
of the grant request flags is responsive to the polling network and to the
state of the associated memory flag for determining an I/O operation
between the associated data register and the memory in accordance with the
contents of the associated address register.
In another variation, an address bus is connected to the memory from the
array control unit of the parallel processor. Each of the grant request
flags is responsive to the polling network and the state of the associated
memory flag for determining an I/O operation between the associated data
register and the memory in accordance with the contents of address bus. In
yet another variation, an error bus is connected to the memory; and a
number of error registers are provided. Each error register is associated
with a respective processor element and is connected thereto and also to
the error bus. Each of the grant request flags is responsive to the
polling network and the state of the associated memory flag for
determining an I/O operation between the associated error register and
said memory.
In another embodiment of the memory system, a data bus of a preselected
width is connected to each of the processor elements within a cluster. In
addition, an error bus is connected to each of the processor elements
within the cluster, and has a preselected width as well. A memory for the
cluster also is provided, each word of which has a width equal to the sum
of the width of the data bus and the error bus. An address bus for the
cluster also is provided, wherein the cluster memory is responsive to a
enable signal for performing an I/O operation on the data bus and on the
error bus in accordance with the contents of the address bus. In one
variation, the address bus is connected to each of the processor elements
in the cluster. In another variation, the address bus is connected to an
array control unit of the parallel processor. In these variations, the
address bus may include an identification bus connected to each of the
processor elements in the cluster. The identification bus can be driven by
either a fixed or programmable register in each of the processor elements.
In another embodiment of the present invention, data is transferred in a
parallel processor as follows. A common data path is provided between a
cluster of processor elements and a cluster memory. Respective memory
flags of the processor elements are evaluated in parallel, and the values
of the memory flags are furnished to respective grant request flags. A
polling signal is applied to a selected one of the grant request flags,
which is responsive to the value of its respective memory flag and to the
polling signal for determining an I/O operation between a data register
associated with the selected grant request flag and the memory over the
common data bus.
In one variation of the foregoing, an address is provided from an address
register associated with the selected grant request flag to the memory
over a common address bus connected to the memory. The selected grant
request flag is responsive to the value of its respective memory flag and
to the polling signal for determining an I/O operation between a data
register associated with the selected grant request flag and the memory
over the common data bus in accordance with the address provided. In
another variation, an address is provided from a processor control unit to
the memory over a broadcast bus. The selected grant request flag is
responsive to the value of its respective memory flag and to the polling
signal for determining an I/O operation between a data register associated
with the selected grant request flag and the memory over the common data
bus in accordance with the address provided.
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings, where like reference numerals indicate like parts,
FIG. 1 is a diagrammatic view of a massively parallel processor which
incorporates the present invention;
FIG. 2 is a diagrammatic view of a processor element printed circuit board
in accordance with the present invention;
FIG. 3 is a flowchart of execution and data bus cycles of the parallel
processor of FIG. 1;
FIG. 4 is a block diagram of the data flow path for an exemplary processor
element in accordance with the present invention;
FIG. 5 is a block diagram of the data flow path for the exemplary
arithmetic processing unit of FIG. 4;
FIG. 6 is a block diagram of the data flow path common to all processor
elements of a cluster in accordance with the present invention;
FIG. 7 is a logic schematic diagram of the cluster memory of FIG. 6;
FIG. 8 is a logic schematic diagram of the address controller of FIG. 6;
FIG. 9 is a logic schematic diagram of the error correction logic of FIG.
6;
FIG. 10A and 10B are adjacent sections of a logic schematic diagram of the
stage register of FIG. 4;
FIG. 11 is a logic schematic diagram of the exponent/address register of
FIGS. 4 and 5;
FIG. 12 is a logic schematic diagram of the grant request flag circuit of
FIG. 4;
FIG. 13 is a logic schematic diagram of the E and M flag circuits of FIGS.
4 and 5;
FIG. 14 is a logic schematic diagram of the PE register circuit of | | |