|
Description  |
|
|
INCORPORATION BY REFERENCE
U.S. Pat. No. 4,598,400, issued Jul. 1, 1986, to W. Daniel Hillis, for
Method and Apparatus For Routing Message Packets, and assigned to the
assignee of the present application, incorporated herein by reference.
U.S. Pat. No. 4,814,973, issued Mar. 21, 1989, to W. Daniel Hillis, for
Parallel Processor, and assigned to the assignee of the present
application, incorporated herein by reference.
U.S. patent application Ser. No. 07/043,126, filed Apr. 27, 1987, now U.S.
Pat. No. 4,984,235, by W. Daniel Hillis, et al, for Method and Apparatus
For Routing Message Packets, and assigned to the assignee of the present
application, incorporated herein by reference.
U.S. patent application Ser. No. 07/179,020, filed Apr. 8, 1988, now U.S.
Pat. No. 5,148,547, by Brewster Kahle, et al., for Method and Apparatus
For Interfacing Parallel Processors To A Co-Processor, and assigned to the
assignee of the present application, incorporated herein by reference.
FIELD OF THE INVENTION
The invention relates generally to the field of massively parallel computer
systems, and more particularly to communications arrangements for
transferring data among processing nodes in such systems.
BACKGROUND OF THE INVENTION
A computer system generally includes one or more processors, a memory and
an input/output system. The memory stores data and instructions for
processing the data. The processor(s) process the data in accordance with
the instructions, and store the processed data in the memory. The
input/output system facilitates loading of data and instructions into the
system, and obtaining processed data from the system.
Most modern computer systems have been designed around a "von Neumann"
paradigm, under which each processor has a program counter that identifies
the location in the memory which contains its (that is, the processor's)
next instruction. During execution of an instruction, the processor
increments the program counter to identify the location of the next
instruction to be processed. Processors in such a system may share data
and instructions; however, to avoid interfering with each other in an
undesirable manner, such systems are typically configured so that the
processors process separate instruction streams, that is, separate series
of instructions, and sometimes complex procedures are provided to ensure
that processors' access to the data is orderly.
In Von Neumann machines, instructions in one instruction stream are used to
process data in a single data stream. Such machines are typically referred
to as SISD single instruction/single data) machines if they have one
processor, or MIMD (multiple instruction/multiple data) machines if they
have multiple processors. In a number of types of computations, such as
processing of arrays of data, the same instruction stream may be used to
process data in a number of data streams. For these computations, SISD
machines would iteratively perform the same operation or series of
operations on the data in each data stream. Recently, single
instruction/multiple data (SIMD) machines have been developed which
process the data in all of the data streams in parallel. Since SIMD
machine process all of the data streams in parallel, such problems can be
processed much more quickly than in SISD machines, and at lower cost than
with MIMD machines providing the same degree of parallelism.
The aforementioned Hillis patents and Hillis, et al., patent application
disclose an SIMD machine which includes a host computer, a
micro-controller and an array of processing elements, each including a
bit-serial processor and a memory. The host computer, inter alia,
generates commands which are transmitted to the micro-controller. In
response to a command, the micro-controller transmits one or more SIMD
instructions to the array, each SIMD instruction enabling all of the
processing elements to perform the same operation in connection with data
stored in the elements' memories.
The array disclosed in the Hillis patents and Hillis, et al., patent
application also includes two communications mechanisms which facilitate
transfer of data among the processing elements. One mechanism enables each
processing element to selectively transmit data to one of its
nearest-neighbor processing elements. The second mechanism, a global
router interconnecting integrated circuit chips housing the processing
elements in a hypercube, enables any processing element to transmit data
to any other processing element in the system. In the first mechanism,
termed "NEWS" (for the North, East, West, and South directions in which a
processing element may transmit data), the micro-controller enables all of
the processing elements to transmit, and to receive, bit-serial data in
unison, from the selected neighbor.
On the other hand, in the global router, the data is transmitted in the
form of messages, with each message containing an address that identifies
the processing element that is to receive the message. The
micro-controller controls all of the processing elements in parallel. In
particular, the micro-controller enables the processing elements to
transmit messages, in bit serial format, from particular source locations
in their respective memories, for delivery in the destination processing
elements at particular destination locations in the respective memories.
If multiple messages have the same destination processing elements,
later-delivered messages will be combined with previously-received
messages, and accordingly the messages that will be processed by the
serial processors after the message transfer operation will be a function
of messages previously received by the processor.
SUMMARY OF THE INVENTION
The invention provides a new and improved communications arrangement for
facilitating transfers of data among processing nodes in a processor
array.
In brief summary, the invention provides a massively parallel computer
system including a plurality of processing nodes under control of a system
controller. The processing nodes are interconnected by a plurality of
communications links. Each processing node comprises at least one
processor, a memory, and a router node connected to the communications
links for transferring in a series of message transfer cycles messages
over the communications links. The controller enables each processing node
to establish a message queue in its memory. The controller further enables
storage of messages received by the processing nodes for their respective
processors during a message transfer cycle to be stored in the message
queue.
BRIEF DESCRIPTION OF THE DRAWINGS
This invention is pointed out with particularity in the appended claims.
The above and further advantages of this invention may be better
understood by referring to the following description taken in conjunction
with the accompanying drawings, in which:
FIG. 1 is a block diagram of a portion of a computer system incorporating a
communication arrangement in accordance with the invention; and
FIG. 2 is a diagram of a queue used in the new communication arrangement;
FIG. 3 is a flow diagram useful in understanding the operation of the new
communication arrangement; and
FIG. 4 is a general block diagram of a parallel processing system in which
the invention can be used.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
FIG. 4 is a block diagram of a portion of a computer system incorporating a
communication arrangement in accordance with the invention. The computer
system includes a processing node array 8 comprising a plurality of
processing nodes, generally identified by reference numeral 10,
interconnected by a plurality of communications links 7(x)(y). Eight
processing nodes 10 are shown in the processing node array 8 depicted in
FIG. 1, identified by reference numerals 10(A) through 10(H), but it will
be clear from the following that the processing node array 8 may include
fewer, or many more, than eight processing nodes 10. The structure of a
processing node 10 will be described in detail below in connection with
FIG. 1.
In one embodiment, the communications links 7(x)(y) interconnect the
processing nodes 10 in the form of an "n"-dimensional hypercube. In that
embodiment, each communications link 7(x)(y) transmits messages from one
processing node 10(s) (hereinafter referred to as a "source" processing
node) to another processing node 10(d) (hereinafter referred to as a
"destination" processing node). In reference numeral 7(s)(d), the index
"s" identifies the source processing node 10 and the index "d" identifies
the destination processing node 10(d). As used herein, two processing
nodes 10 and 10(d) will be said to be "adjacent" if there is a pair of
communications links 7(s)(d) and 7(j)(i) which interconnect them.
In the array 8 depicted in FIG. 1, the hypercube forms three dimensions, as
suggested by the grid 9 proximate processing node 10(A). As is
conventional in connection with hypercubes, the dimensionality of the
hypercube and the number of processing nodes are related, with the
relation being that the number of dimensions corresponds to the logarithm,
to the base two, of the number of processing nodes in the hypercube. Since
the eight processing nodes 10 are shown in the processing node array 8 of
FIG. 1, the processing nodes 10 can be interconnected in a
three-dimensional hypercube. It will be appreciated that the processing
node array 8 may include many more processing nodes 10 which may be
interconnected by communications links 7(s)(d) to form a hypercube;
preferably the number of processing nodes 10, in the array 8 is a power of
two, which facilitates interconnecting them in a regular hypercube having
a number of dimensions corresponding to the logarithm, to the base two, of
the number of processing nodes.
The grid 13 has three arrows that are labeled "DIM 0," "DIM 1," and "DIM
2," each of which identifies one of the three dimensions DIM "i" in which
"i" identifies the dimension. The directions of the hypercube dimensions,
that is, the orientations of the particular communications links 7(s)(d)
which correspond to the particular hypercube dimensions, differ for each
processing node 10, and is determined as follows. As shown on FIG. 1, each
processing node 10 is assigned a hypercube address, which is shown in
binary form in FIG. 1. Each hypercube address has a number of binary
digits corresponding to the number of dimensions in the hypercube. Thus,
for example, processing node 10(A) is assigned hypercube address "000,"
processing node 10(B) is assigned hypercube address "001," and so on, with
processing node 10(H) being assigned hypercube address "101." The binary
addresses are assigned to the processing nodes 10 so that the binary
addresses of adjacent processing nodes differ in one bit location.
In each hypercube address, the right-most binary digit is the low-order
digit in the hypercube address, with each successive digit towards the
left being a progressively higher order digit, and the left-most binary
digit being the high-order digit in the hypercube address. The dimension
of the communications link 7(s)(d) interconnecting adjacent nodes
corresponds to the order of the digit in the binary addresses that is
different. Thus, as shown in FIG. 1, the binary addresses of processing
node 10(A) and processing node 10(B) differ in the low (zeroth) order
digit, and so the hypercube dimension from processing node 10(A) to
processing node 10(B) is the DIM 0 dimension zero, as shown in grid 13.
Similarly, the binary addresses of processing node 10(A) and processing
node 10(C) differ in the next (first) order digit, and so the hypercube
dimension from processing node 10(A) to processing node 10(C) is DIM 1
dimension one, also as shown in grid 13. Finally, the binary addresses of
processing node 10(A) and processing node 10(E) differ in the high
(second) order digit, and so the hypercube dimension from processing node
10(A) to processing node 10(E) is DIM 2 dimension two.
The hypercube dimensions from each processing node 10 to its adjacent nodes
are determined in a similar manner. It will be appreciated that, for the
communications link 7(s)(d) from a processing node 10 to another
processing node 10(d) that is associated with a particular dimension for
the node 10, the communications link 7(j)(i) from the processing node
10(d) to the processing node 10 is associated with the same dimension.
This is a result of the fact that the hypercube addresses of the
processing nodes 10 and 10(d), for each pair of adjacent nodes, will
differ in the same order hypercube address bit, which order determines the
dimension for each processing node.
In one particular embodiment, the computer system also includes a
micro-controller 5, which is controlled by a host computer 6. To
accomplish processing, the host computer 6, in response to a request from
an applications or system program being processed thereby, transmits
signals representing a command to the micro-controller 5. In response to a
command, the micro-controller 5 may transmit a number of signals, as
detailed below in connection with FIG. 2, to control the processing nodes
10 of processing node array 8 in parallel. The processing nodes 10 may
also generate status signals, which they couple to the micro-controller 5
to notify it of the status of the operations enabled by the
micro-controller. The micro-controller 5 may also provide status signals
to the host computer 6 to notify it of the status of the processing of the
command. In addition, the computer system may include one or more
input/output systems (not shown). The input/output systems may include,
for example, mass data storage devices, frame buffers, printers, or the
like, which supply data to the processing node array 8 for processing, or
which receive data therefrom for storage, display, and so forth.
FIG. 1 is a block diagram of a portion of a computer system incorporating a
communication arrangement in accordance with the invention. The computer
system includes a micro-controller 5, which is controlled by a host 6 and
which, in turn, controls an array of processing nodes, one of which,
namely, processing node 10, is shown in FIG. 1. To accomplish processing,
the host computer 6 transmits commands to the micro-controller 5. In
response to a command, the micro-controller 5 may transmit one or more
instructions or other sets of control signals which control processing and
other operations, in parallel, to all of the processing nodes
concurrently. In addition, a number of processing nodes 10 are
interconnected, as described in the aforementioned Hillis patents and
Hillis, et al., patent application to facilitate the transfer of data
among the processing nodes 10.
With reference to FIG. 1, processing node 10 includes two processing
element (PE) chips 11H and 11L (generally identified by reference numeral
11 ) connected to a memory 12 over a data bus 13. In one embodiment, the
data bus includes thirty-two data lines D(31:0) which are divided into
high-order data lines D(31:16), which connect to PE chip 11H, and
low-order data lines D(15:0), which connect to PE chip 11L. Each PE chip
11 includes a set of serial processors, generally identified by reference
numeral 14, and a router node, generally identified by reference numeral
15. The serial processors operate in response to SP INSTR serial processor
instruction signals from the micro-controller 5 to perform processing on
data stored in the memory 12. The micro-controller 5 may enable the serial
processors 14(i) to emulate a larger number of virtual processors by
essentially providing, in each memory 12, multiple sets of each item of
data, one set associated with each virtual processor and providing the SP
INSTR serial processor instruction signals multiple times to, in parallel,
enable the serial processors 14(i) to process the sets seriatim.
The memory 12 operates in response to SEL MEM ADRS selected memory address
signals, which identify storage locations in the memory 12, and MEM CTRL
memory control signals which indicate whether data is to be stored in or
transmitted from the location identified by the SEL MEM ADRS selected
memory address signals. The SEL MEM ADRS selected memory address signals
are provided by a multiplexer 16, which operates under control of the MEM
CTRL memory control signals from the micro-controller 5. The multiplexer
16 couples either MC MEM ADRS micro-controller memory address signals from
the micro-controller 5 or IND MEM ADRS indirect memory address signals to
the memory 12 as the SEL MEM ADRS selected memory address signals. The
router nodes 15 also operate in response to RTR CTRL router control
signals, also from the micro-controller 5, to transmit messages containing
data from one processing node 10 to another.
In one embodiment, each PE chip 11 includes sixteen serial processors 14,
each of which is associated with one of the data lines of the data bus 13.
That is, each serial processor 14(i) receives data bits from, and
transmits data bits onto, one of the data lines D(i) ["i" is an integer
from the set (31, . . . ,0)]. The memory 12 has storage locations
organized into thirty-two bit slices, with each slice being identified by
a particular binary-encoded value of the SEL MEM ADRS selected memory
address signals from the multiplexer 16. If data is to be transmitted from
a slice in memory identified by a particular value of the SEL MEM ADRS
selected memory address signals, the memory 12 will transmit bits 3 1
through 0 of the slice onto data lines D(31) through D(0), respectively.
On the other hand, if data is to be loaded into a slice in memory
identified by a particular value of the SEL MEM ADRS selected memory
address signals, the memory 12 will receive bits 31 through 0 of from data
lines D(31) through D(0), respectively, and load them into respective bits
of the slice.
To perform processing on multi-bit words of data in the memory 12 using the
serial processors 14, the micro-controller 5 iteratively enables
generation of SEL MEM ADRS selected memory address signals whose values
identify successive location in memory 12, and MEM CTRL memory control
signals which enable the memory 12 to transmit or store slices of data,
and SP INSTR serial processor instruction signals which enable the serial
processors 14 to perform the required operations on the bits on their
associated data lines D(i). The data in the memory 12 thus may be viewed
in two ways, namely, (i) a slice view, identified by the arrow labeled
"SLICE," representing fixed-size words of data ("data slices") that will
be transmitted from the memory 12 onto the data bus 13, or that will be
received by the memory 12 from the data bus 13, at one time in response to
the MEM ADRS memory address signals, and (ii) a processor view, identified
by the arrow labelled "PROCESSOR," which represents the organization in
memory 12 of data which may be accessed by an individual serial processor.
The router nodes 15 of all of the processing nodes 10 are interconnected to
facilitate transfer of messages among the processing nodes 10 comprising
the array. In one particular embodiment the router nodes are
interconnected in the form of a hypercube, as described in the
aforementioned Hillis patents. Each router node 15H and 15L, under control
of RTR CTRL router control signals from the micro-controller 5, transmits
messages to other router nodes 15 on other processing element chips 11
over a plurality of communications links 7, more specifically identified
by reference numerals HC.sub.-- O.sub.-- H(11:0) (connected to router node
15H) and HC.sub.-- O.sub.-- L(11:0) (connected to router node 15L).
In addition, each router node 15H and 15L receives messages from
communications links identified by reference numerals HC.sub.-- I.sub.--
H(11:0) (connected to router node 15H) and HC.sub.-- I.sub.-- L(11:0)
(connected to router node 15L). The router nodes 15 determine from the
address of each received message whether the message is intended for a
serial processor 14(i) on the processing node 10 and, if so, couples it
onto a data line D(i) of data bus 13 over which the serial processor 14(i)
that is to receive the message accesses the memory 12. The
micro-controller 13 enables generation of SEL MEM ADRS selected memory
address and MEM CTRL memory control signals to facilitate the storage of
the data from the message in the memory 12.
Each message includes an address field and a data field. The data field
includes the data to be transferred. The contents of the address field
depend upon which of two message transfer modes the computer system is
using to transfer messages during the message transfer operation. In a
"direct" message transfer mode, the address field of each message includes
an address identify that identifies a processing node 10 and serial
processor 14(i) that is the intended recipient of the message. In
addition, if each serial processor 14(i) is enabled to emulate a plurality
of virtual processors, the address field will also provide the
identification of a virtual processor. To deliver the message to the
addressed serial processor 14(i), the micro-controller enables the data
from the message to be stored in processor format in memory 12. In
enabling storage of the delivered message data, the micro-controller 5
generates successive MC MEM ADRS micro-controller memory address signals
to identify successive locations in memory and MEM CTRL memory control
signals that enable the multiplexer 16 to couple the MC MEM ADRS
micro-controller memory address signals to the memory 12 as SEL MEM ADRS
selected memory address signals.
In an "indirect" message transfer mode, the address field also includes the
identification of the processing node 10 and the serial processor 14(i)
that is the intended recipient of the data, and in addition includes the
location in memory 12 in which the data in the data field is to be stored.
In the indirect message transfer mode, the micro-controller enables the
portion of the address identifying the location in memory 12 to be coupled
to the multiplexer 16 as IND MEM ADRS indirect memory address signals, and
generates MEM CTRL memory control signals that enable the multiplexer 16
to couple them as SEL MEM ADRS selected memory address signals to identify
the location in memory 12 in which the data is to be stored.
The various communications links HC.sub.-- O.sub.-- H(11:0), HC.sub.--
O.sub.-- L(11:0), HC.sub.-- I.sub.-- H (11:0) and HC.sub.-- I.sub.--
L(11:0) connected to each processing node 10 are connected to diverse ones
of other processing nodes in a conventional manner to effect the hypercube
interconnection. Thus, the outgoing communications links identified by
reference numerals HC.sub.-- O.sub.13 H(11:0) and HC.sub.-- O.sub.--
L(11:0) correspond to various incoming communications links, which may be
identified by reference numerals HC.sub.-- I.sub.-- H(11:0) and HC.sub.--
I.sub.-- L(11:0), at router nodes 15 of other processing nodes 10. In one
embodiment, the circuitry of the router nodes 15H and 15L is similar to
that described in the aforementioned Hillis patents and Hillis, et al.
patent application and will not be described further herein.
The router nodes 15, under control of the micro-controller 5, perform
message transfers in one or more message transfer cycles. That is, one
message transfer operation, which may be initiated in response to a single
message transfer command from the host 6, may require multiple message
transfer cycles to complete. In each message transfer cycle, each
processing node 10 may transfer a message over each communications link
connected thereto to another processing node 10. For each message so
transferred, if the destination serial processor 14(i) is not located on
the receiving processing node 10, that processing node 10, during the
current or a subsequent message transfer cycle, transfers the message over
a communications link connected thereto to another processing node 10.
On the other hand, if the destination serial processor 14(i) is located on
the receiving processing node 10, the router node 15 that receives the
message will deliver the data in the message thereto during this or a
subsequent message transfer cycle. That is, the router node 15 couples the
data onto the data line D(i) associated with the destination serial
processor and stores it in a temporary destination buffer 60 in memory 12.
At some later point, the :message will be transferred from the temporary
destination buffer 60 for storage elsewhere in memory 12, which is
identified by the micro-controller 5 if the message transfer operation is
in the direct mode, or by an address contained in the message if the
message transfer operation is in the indirect mode, from which the data in
the message may be processed by the serial processor 14(i). Eventually,
all of the messages transferred during the message transfer operation will
be in memory 12 and available to the respective destination serial
processor(s) 14(i), at which point the message transfer operation will be
finished.
The processing nodes 10 may also have an auxiliary processor 20 that
processes data in memory 12 that may be organized either in slice format
or in processor format, and a transposer module 21 to interface the
auxiliary processor 20 to the data bus 13. The auxiliary processor 20 may
be, for example, a floating point processor, which may perform arithmetic
and logic operations in connection with data in floating point data
format. The auxiliary processors 20 and transposer modules 21 in the
various processing nodes 10 operate in response to AP INSTR auxiliary
processor instruction signals and XPOSER CTRL transposer control signals,
respectively, from the micro-controller 5. As is the case with the other
control signals provided by the micro-controller 5, the micro-controller 5
transmits the AP INSTR auxiliary processor instruction signals and the
XPOSER CTRL transposer control signals to control the auxiliary processor
20 and transposer module 21 of all of the processing nodes 10
concurrently, enabling them to generally perform the same operation
concurrently.
The transposer module 21 includes several transposer circuits 22A through
22M (generally identified by reference numeral 22). Each transposer 22
receives input data from an input multiplexer 24 and stores it in one of a
plurality of slots identified by the contents of a write pointer register
25. The register 25 may be provided with a pointer prior to storing each
item of data in a slot in the associated transposer 22. Alternatively, the
register may be loaded with an initial value before loading any data in
the transposer 22 and then incremented for each successive item of data
loaded therein. The input multiplexer 24, under control of the XPOSER CTRL
transposer control signals, selectively couples data signals to the
transposer 22 from either the data bus 13 or from a bus 26. Bus 26 carries
AP IN (31:0) auxiliary processor in signals representing processed data
from the auxiliary processor 20.
The transposers 22 operate in response to the XPOSER CTRL transposer
control signals to generate transpositions of the data stored therein. The
transposer module 21 also includes two output multiplexers 30 and 31, also
controlled by the XPOSER CTRL transposer control signals. Multiplexer 30
receives data signals from the output terminals of transposers 22 and
selectively couples the signals from one of the transposers onto the data
bus 13. Similarly, the multiplexer 31 receives data signals from the
output terminals of transposer 23 and selectively couples the signals from
one of the transposers as AP OUT (31:0) signals for transmission to the
auxiliary processor. The structure and operation of transposers 22, input
multiplexers 24, pointer registers 25 and output multiplexers 30 and 31
are generally described in the aforementioned Kahle, et al., patent
application.
The processing node 10 also provides a direct (that is, non-transposing)
path between the data bus 13 and the auxiliary processor 20. It will be
appreciated that the transposer module 21 facilitates the transposition of
data transmitted from the memory 12 in processor format, which would be
transmitted serially over separate lines of the data bus 13, into parallel
format for processing by the auxiliary processor 20. If the data is stored
in memory 12 in slice format, transposition is not required. In addition,
the transposer module 21 receives processed data from the auxiliary
processor 20 and, if it is required that it be stored in the memory 12 in
processor format, transposes the data for transmission serially over
predetermined lines of the data bus 13. If the processed data from the
auxiliary processor 20 is to be stored in the memory 12 in slice format,
the data may be transmitted by the auxiliary processor 20 to the memory 12
over the non-transposing path.
The transposer module 21 also includes several components which provide the
IND MEM ADRS indirect memory address signals which are coupled to the
multiplexer 16. This indirect memory addressing capability permits the
processing nodes 10 to independently provide memory addresses to their own
memories 12, so that the addressed locations in the respective memories 12
may differ as among the various processing nodes 10. The transposer module
21 includes an adder 32 which produces the IND MEM ADRS indirect memory
address signals in response to BASE signals provided from a base register
33 and OFFSET signals from multiplexer 31. Thus, the OFFSET signals may
correspond to the outputs of one of the transposers 22 or the signals on
the data bus 13. The base register 33 and maximum offset register 35 are
separately provided with values provided over bus 13 in response to
appropriate XPOSER CTRL transposer control signals from the
micro-controller 5.
The compare circuit 36 determines whether the binary-encoded value of the
signals from multiplexer 31 exceeds the binary-encoded value of the MAX
OFFSET signals from the register 35, to provide a COMP OK compare status
signal to indicate whether the offset provided by the OFFSET signals is
less than the maximum offset identified by the maximum offset register 35.
If the COMP OK compare status signal indicates that the value of the
OFFSET signal exceeds the maximum offset value contained in the maximum
offset register 35, the micro-controller 36 may inhibit storage in the
location identified by the IND MEM ADRS indirect memory address signals.
In accordance with the invention, the system depicted in FIG. 1 can perform
message transfers so that messages destined for a particular serial
processor 14(i) are received and stored in a queue 50, with successive
messages received by each serial processor 14(i) at a processing node 10
being stored in successive slots in a | | |