|
Description  |
|
|
INCORPORATION BY REFERENCE
Guy E. Blelloch, Scan Primitives and Parallel Vector Models, (Ph.D.
Dissertation, Massachusetts Institute of Technology: 1988), incorporated
herein by reference.
U.S. patent appn. Ser. No. 07/489,079, filed Mar. 5, 1990, now U.S. Pat.
No. 5,118,975 in the name of W. Daniel Hillis, et al., entitled Digital
Clock Buffer Circuit Providing Controllable Delay, and assigned to the
assignee of the present application, incorporated herein by reference.
FIELD OF THE INVENTION
The invention relates generally to the field of digital computer systems,
and more particularly to massively parallel computing systems. The
invention particularly provides arrangements for controlling processors in
a computing system having a large number of processors, for facilitating
transfer of data among the processors and for facilitating diagnosis of
faulty components in the computing system.
BACKGROUND OF THE INVENTION
A digital computer system generally comprises three basic elements, namely,
a memory element, an input/output element and a processor element. The
memory element stores information in addressable storage locations. This
information includes data and instructions for processing the data. The
processor element fetches information from the memory element, interprets
the information as either an instruction or data, processes the data in
accordance with the instructions, and returns the processed data to the
memory element. The input/output element, under control of the processor
element, also communicates with the memory element to transfer
information, including instructions and the data to be processed, to the
memory, and to obtain processed data from the memory.
Most modern computing systems are considered "von Neumann" machines, since
they are generally constructed according to a paradigm attributed to John
von Neumann. Von Neumann machines are characterized by having a processing
element, a global memory which stores all information in the system, and a
program counter that identifies the location in the global memory of the
instruction being executed. The processing element executes one
instruction at a time, that is, the instruction identified by the program
counter. When the instruction is executed, the program counter is advanced
to identify the location of the next instruction to be processed. (In many
modern systems, the program counter is actually advanced before the
processor has finished processing the current instruction.)
Von Neumann systems are conceptually uncomplicated to design and program,
since they do only one operation at a time. A number of advancements have
been made to the original von Neumann paradigm to permit the various parts
of the system, most notably the various components of the processor, to
operate relatively independently and achieve a significant increase in
processing speed. One such advancement is pipelining of the various steps
in executing an instruction, including instruction fetch, operation code
decode (a typical instruction includes an operation code which identifies
the operation to be performed, and in most cases one or more operand
specifiers, which identify the location in memory of the operands, or
data, to be used in executing the instruction), operand fetch, execution
(that is, performing the operation set forth in the operation code on the
fetched operands), and storing of processed data, which steps are
performed relatively independently by separate hardware in the processor.
In a pipelined processor, the processor's instruction fetch hardware may
be fetching one instruction while other hardware is decoding the operation
code of another instruction, fetching the operands of still another
instruction, executing yet another instruction, and storing the processed
data of a fifth instruction. Since the five steps are performed
sequentially, pipelining does not speed up processing of an individual
instruction. However, since the processor begins processing of additional
instructions before it has finished processing a current instruction, it
can speed up processing of a series of instructions.
A pipelined processor is obviously much more complicated than a simple
processor in a von Neumann system, as it requires not only the various
circuits to perform each of the operations (in a simple von Neumann
processor, many circuits could be used to perform several operations), but
also control circuits to coordinate the activities of the various
operational circuits. However, the speed-up of the system can be dramatic.
More recently, some processors have been provided with execution hardware
which includes multiple functional units each being optimized to perform a
certain type of mathematical operation. For example, some processors have
separate functional units for performing integer arithmetic and floating
point arithmetic, since they are processed very differently. Some
processors have separate hardware functional units each of which performs
one or only several types of mathematical operations, including addition,
multiplication, and division operations, and other operations such as
branch control and logical operations, all of which can be operating
concurrently. This can be helpful in speeding up certain computations,
most particularly those in which several functional units may be used
concurrently for performing parts of a single computation.
In a von Neumann processor, including those which incorporate pipelining or
multiple functional units (or both, since both may be incorporated into a
single processor), a single instruction stream operates on a single data
stream. That is, each instruction operates on data to enable one
calculation at a time. Such processors have been termed "SISD," for
single-instruction/single-data." If a program requires a segment of a
program to be used to operate on a number of diverse elements of data to
produce a number of calculations, the program causes the processor to loop
through that segment for each calculation. In some cases, in which the
program segment is short or there are only a few data elements, the time
required to perform such a calculation may not be unduly long.
However, for many types of such programs, SISD processors would require a
very long time to perform all of the calculations required. Accordingly,
processors have been developed which incorporate a large number of
processing elements all of which may operate concurrently on the same
instruction stream, but with each processing element processing a separate
data stream. These processors have been termed "SIMD" processors, for
"single-instruction/multiple-data."
SIMD processors are useful in a number of applications, such as image
processing, signal processing, artificial intelligence, database
operations, and computer simulation of a number of things, such as
electronic circuits and fluid dynamics. In image processing, each
processing element may be used to perform processing on a pixel ("picture
element") of the image to enhance the overall image. In signal processing,
the processors concurrently perform a number of the calculations required
to perform such computations as the "Fast Fourier transform" of the data
defining the signal. In artificial intelligence, the processors perform
searches on extensive rule bases representing the stored knowledge of the
particular application. Similarly, in database operations, the processors
perform searches on the data in the database, and may also perform sorting
and other operations. In computer simulation of, for example, electronic
circuits, each processor may represent one part of the circuit, and the
processor's iterative computations indicate the response of the part to
signals from other parts of the circuit. Similarly, in simulating fluid
dynamics, which can be useful in a number of applications such as weather
predication and airplane design, each processor is associated with one
point in space, and the calculations provide information about various
factors such as fluid flow, temperature, pressure and so forth.
Typical SIMD systems include a SIMD array, which includes the array of
processing elements and a router network, a control processor and an
input/output component. The input/output component, under control of the
control processor, enables data to be transferred into the array for
processing and receives processed data from the array for storage,
display, and so forth. The control processor also controls the SIMD array,
iteratively broadcasting instructions to the processing elements for
execution in parallel. The router network enables the processing elements
to communicate the results of a calculation to other processing elements
for use in future calculations.
Several routing networks have been used in SIMD arrays and others have been
proposed. In one routing network, the processing elements are
interconnected in a matrix, or mesh, arrangement. In such an arrangement,
each processing element is connected to, and communicates with, four
"nearest neighbors" to form rows and columns defining the mesh. This
arrangement can be somewhat slow if processing elements need to
communicate among themselves at random. However, the arrangement is
inexpensive and conceptually simple, and may suffice for some types of
processing, most notably image processing. The "Massively Parallel
Processor" manufactured by Goodyear Aerospace Corporation is an example of
a SIMD array having such a routing network.
In another routing network, processing elements are interconnected in a
cube or hypercube arrangement, having a selected number of dimensions, for
transferring data, in the form of messages, among the processing elements.
The arrangement is a "cube" if it only has three dimensions, and a
"hypercube" if it has more than three dimensions. U.S. Pat. No. 4,598,400,
entitled Method and Apparatus For Routing Message Packets, issued Jul. 1,
1986 to W. Daniel Hillis, and assigned to the assignee of the present
application, describes a system having a hypercube routing network. In the
system described in the '400 patent, multiple processing elements are
connected to a single routing node, and the routing nodes are
interconnected in the hypercube.
Another routing arrangement which has been proposed is a crossbar switch,
through which each processing element can communicate directly with any of
the other processing elements. The crossbar switch provides the most
efficient communications of any of the routing networks proposed. However,
a crossbar switch also has the most connections and switching elements,
and thus is the most expensive and also the most susceptible to failure
due to broken connections and faulty switching elements. Thus, crossbar
switch arrangements are rarely used, except when the number of processing
elements is fairly small, since the complexity of a crossbar switch
increases with the square of the number of processing elements.
Yet another routing arrangement is an omega network, in which switching is
performed through a number of serially-connected stages. Each stage has
two inputs, each connected to the outputs of a prior stage or processing
elements, has two outputs which may be connected to the inputs of a
subsequent stage or processing elements. The "Butterfly" computer system
manufactured by Bolt Beranek & Newman uses such a network.
SUMMARY OF THE INVENTION
The invention provides a new and improved parallel computer system.
In brief summary, the new computer includes a plurality of processing
elements, a command processor, a diagnostic processor and a communications
network. The processing elements each performs data processing and data
communications operations in connection with commands. The processing
elements also performing diagnostic operations in response to diagnostic
operation requests and providing diagnostic results in response thereto.
The command processor generates commands for the processing elements, and
also performs diagnostic operations in response to diagnostic operation
requests and providing diagnostic results in response thereto. The
diagnostic processor generates diagnostic requests. The communication
network includes three elements, including a data router, a control
network and a diagnostic network. The data router is connected to the
processing elements for facilitating the transfer of data among them
during a data communications operation. The control network is connected
to the processing elements and the command processor for transferring
commands from the command processor to the processing elements. The
diagnostic network connected to the processing elements, the command
processor and the diagnostic processor for transferring diagnostic
requests from the diagnostic processor to the processing elements and the
command processor and for transferring diagnostic results from the
processing elements and the command processor to the diagnostic processor.
BRIEF DESCRIPTION OF THE DRAWINGS
This invention is pointed out with particularity in the appended claims.
The above and further advantages of this invention may be better
understood by referring to the following description taken in conjunction
with the accompanying drawings, in which:
FIG. 1 is a general block diagram of a massively parallel computer system
constructed in accordance with the invention;
FIGS. 2A and 2B are block diagrams useful in understanding the structure
and operation of the data router of the computer system of FIG. 1;
FIG. 3 is a diagram depicting the structure of message packets transferred
over the data router;
FIGS. 4A and 4B are block diagrams useful in understanding the structure
and operation of the control network of the computer system of FIG. 1;
FIG. 5 is a diagram depicting the structure of message packets transferred
over the control network;
FIGS. 6A through 6C are block diagrams useful in understanding the
structure and operation of the diagnostic network of the computer system
of FIG. 1;
FIG. 7 is a diagram depicting the structure of message packets transferred
over the diagnostic network;
FIG. 8 is a general block diagram of a processing element in the computer
system depicted in FIG. 1;
FIG. 9A-1 comprises a general block diagram of a data router interface
circuit useful in interfacing the processing element depicted in FIG. 8 to
the data router of the computer system depicted in FIG. 1, FIGS. 9A-2A and
9A-2B contain definitions of registers in the data router interface and
FIGS. 9B-1 through 9D-7 comprise logic diagrams of the data router
interface;
FIG. 10A comprises a general block diagram of a control network interface
circuit useful in interfacing the processing element depicted in FIG. 8 to
the control network of the computer system depicted in FIG. 1, FIGS. 10A-1
contains a definitions of a register in the control network interface and
FIGS. 10B through 10G comprise logic diagrams of the control network
interface;
FIG. 11A is a general block diagram of a data router node used in the data
router described in connection with FIGS. 2A and 2B, and FIGS. 11B-1
through 11D comprise detailed block and logic diagrams of the data router
node;
FIG. 12A is a general block diagram of a control network node used in the
control network described in connection with FIGS. 4A and 4B, and FIGS.
12B-1 through 12D-1 comprise detailed block and logic diagrams of the
control router mode; and
FIG. 13A is a general block diagram of a diagnostic network node used in
the diagnostic network described in connection with FIG. 6, and FIGS.
13B-1 through 13C comprise detailed block and logic diagrams of the
diagnostic network node.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
I. General Description
A. General Description of Computer System
FIG. 1 is a general block diagram of a massively parallel computer system
10 constructed in accordance with the invention. With reference to FIG. 1,
system 10 includes a plurality of processing elements 11(0) through 11(N)
(generally identified by reference numeral 11), scalar processors 12(0)
through 12(M) (generally identified by reference numeral 12) and
input/output processors 13(0) through 13(K) (generally identified by
reference numeral 13). Input/output units (not shown), such as, for
example, disk and tape storage units, video display devices, printers and
so forth may be connected to the input/output processors to supply
information, including data and program commands, for processing by the
processing elements 11 and scalar processors 12 in the system, and may
also receive processed data for storage, display and printing. The scalar
processors 12 may also be connected to input/output units including, for
example, video display terminals which permit one or more operators to
generally control system 10.
The system 10 further includes a control network 14, a data router 15 and a
diagnostic network 16. The control network 14 permits one or more scalar
processors 12 to broadcast program commands to the processing elements 11.
The processing elements 11 execute the commands generally concurrently.
The control network 14 also permit the processing elements 11 to transfer
status information to the scalar processors 12. The control network 14 is
also used by the processing elements 11 to perform selected types of
arithmetic operations, termed "scan" and "reduce" operations, as described
below. The control network 14 may also be used to provide synchronization
among the processing elements 11.
The data router 15 transfers data among the processing elements 11, scalar
processors 12 and input/output processors 13. In particular, under control
of the scalar processors 12, the input/output processors 13 retrieve data
to be processed from the input/output units and distributes it to the
respective scalar processors 12 and processing elements 11. During
processing, the scalar processors 12 and processing elements 11 can
transfer data among themselves over the data router 15. In addition, the
processing elements 11 and scalar processors 12 can transfer processed
data to the input/output processors 13. Under control of the scalar
processors 12, the input/output processors 13 can direct the processed
data that they receive from the data router 15 to particular ones of the
input/output units for storage, display, printing, or the like.
The diagnostic network 16, under control of a diagnostic processor (not
shown), facilitates testing of other portions of the system 10 to
identify, locate and diagnose defects. The diagnostic processor may
comprise one or more of the scalar processors 12. In addition, the
diagnostic network 16 may be used to establish selected operating
conditions in the other portions of the system 10 as described below.
The system 10 is synchronous, that is, all of its elements operate in
accordance with a global SYS CLK system clock signal provided by a clock
circuit 17.
One particular embodiment of system 10 may include hundreds or many
thousands of processing elements 11 operating on a single problem in
parallel under control of commands broadcast to them by the scalar
processors 12. In that embodiment, the processing elements 11 operate in
parallel on the same command on their individual sets of data, thereby
forming a parallel computer system. In addition, the system 10 may be
dynamically logically partitioned, as described below, into multiple
subsystems which may concurrently operate on separate problems or separate
parts of a single problem. In that case, each partition includes at least
one scalar processor 12 and a plurality of processing elements 11.
B. General Description of Communications Networks
1. Data Router 15
Before proceeding to a detailed description of the system 10 and its
various components, it would be helpful to generally describe the
structures of the control network 14 and data router 15. The data router
15 and control network 14 both transfer information in the form of message
packets, which will be described in detail below in connection with FIGS.
3 and 5, respectively. FIGS. 2A and 2B depict the general structure of the
data router 15 and FIGS. 4A and 4B depict the general structure of the
control network 14.
With reference to FIG. 2A, the data router 15 is generally tree-structured,
having a plurality of data router node groups 20(i,j) ("i" and "j" are
integers) organized in a plurality of levels each identified by the index
"i" in reference numeral 20(i,j). A data router node group 20(i,j) at each
level "i" is connected to a selected number of data router node groups
20(i-1,j) in the next lower level "i-1" to form a tree. As will be
described in detail below, the data router node groups 20(i,j) perform
message switching operations to transfer data, in the form of data router
message packets, among the processing elements 11, scalar processors 12
and input/output processors 13, which are collectively identified as
leaves 21(0) through 21(N) (generally identified by reference numeral 21).
Each data router node group 20(1,j) in the lowest level is connected to
one or more leaves 21. In the reference numeral 20(i,j), the index (j)
uniquely identifies each of the data router node groups 20(i,j) at each
level "i."
In the data router 15 represented in FIG. 2A, the data router node group
20(M,0) at the highest level "M" is termed the "physical root" of the
tree. At each level "i", each data router node group 20(i,j) is termed the
"parent" of data router node groups 20(i-1,j) connected thereto, and each
data router node group 20(i-1,j) is termed a "child" of the data router
node group 20(i,j) to which it is connected. It will be appreciated that
the data router node group 20(i,j) will also be a child of the data router
node group 20(i+1,j) connected thereto. In one particular embodiment, each
data router node group 20(i,j) in a particular level "i" is connected to
four child data router node groups 20(i-1,j); in that embodiment, the
"fan-out" of the tree, that is, the number of children connected to each
parent, is four. It will be appreciated from the following that the
fan-out need not be constant, but may vary from level to level and also
among data router node groups 20(i,j) within the same level.
The structure of the data router 15 is further termed a "fat-tree", and
will be particularly described in connection with FIG. 2B. With reference
to FIG. 2B, at least some of the data router node groups 20(i,j) includes
at least one, and typically two or more data router nodes 22(i,j,k),
wherein "k" is an integer that uniquely identifies each data router node
within a data router node group 20(i,j). Each data router node 22(i,j,k)
in a data router node group 20(i,j) is connected to a plurality of data
router nodes 22(i+1,j,k) in level "i+1," with the connections being
established so that the data router nodes 22(i,j,k) in each data router
node group 20(i,j) are connected to different ones of the data router
nodes 22(i+1,j,k) in the data router node group 20(i,j) in level "i+1."
For example, in data router node group 20(1,0), data router node 22(1,0,0)
is connected to data router nodes 22(2,0,0) and 22(2,0,1) of data router
node group 20(2,0), and data router node 22(1,0,1) is connected to data
router nodes 22(2,0,2) and 22(2,0,3) of data router node group 20(2,0).
In addition, each data router node 22(i,j,k) in a parent data router node
group 20(i,j) is connected to one data router node 22(i-1,j,k) in that
parent's child data router node groups 20(i-1,j). Accordingly, as shown in
FIG. 2B, data router node (2,0,0) in data router node group 20(2,1) is
connected to one data router node 22(1,j,0), where "j" equals 0, 1, 2 and
3, in each of the data router node groups 20(1,0) through 21(1,3).
It will be appreciated that the collection of data router nodes 22(i,j,k)
from each leaf 21 to and including the data router nodes 22(m,0,k) in the
root data router node group 20(M,0) essentially forms an inverted tree.
Each leaf 21 effectively comprises the root of one inverted tree and the
data router nodes 22(M,0,k) of the root data router node group 20(M,0)
form all of the leaves of all of the inverted trees defined by the
collection of leaves 21. The number of data router nodes 22(i,j,k) in each
data router node group 20(i,j) at a particular level "i" in the tree
defining data router 15 will be determined by the fan-out at each level
from level "1" to level "i" in the inverted tree. The fan-out at a
particular level "i" is the number of data router nodes 22(i+1,j,k) at
level "i+1" to which each data router node 22(i,j,k) at level "i" is
connected. Thus, for example, since data router node 22(1,0,0) of data
router node group 20(1,0) in level "1" is connected to two data router
nodes 22(2,0,0) and 22(2,0,1) of data router node groups 20(2,0) in level
"2," the fan-out from data router node 22(1,0,0) is two. In one particular
embodiment, the fan-out from data router nodes 22(i,j,k) at a particular
level "i" is the same for the entire level, but it may differ from level
to level as described below.
As noted above, the data router 15 transfers message packets among the
processing elements 11, scalar processors 12 and input/output processors
13, all of which are represented by leaves 21. Each connection shown in
FIG. 2B between a leaf 21 and a data router node 22(1,j,k) of level 1,
which is represented by a line therebetween, actually represents two
unidirectional data paths, one for transferring a message packet in each
direction. Thus, for example, the connection between leaf 21(0) and data
router node 22(1,0,0) of data router node group 20(1,0) represents two
data paths. One data path is used by the leaf 21(0) to transmit a message
packet to the data router node 22(1,0,0) for delivery to another leaf
21(x). The other data path is used by the data router node 22(1,0,0) to
deliver message packets originating at other leaves 21 destined for the
leaf 21(0).
Similarly, each connection between a data router node 22(i,j,k) of a level
"i" and a data router node 22(i+1,j,k) of a level "i+1," which is also
represented in FIG. 2B by a line, represents two unidirectional data
paths, one for transferring a message packet in each direction. Thus, for
example, the connection between data router node 22(1,0,0) of data router
node group 20(1,0) and data router node 22(2,0,0) represents two data
paths, one used to transfer message packets from data router node
22(1,0,0) to data router node 22(2,0,0) and the other to transfer message
packets in the opposite direction, that is, from data router node
22(2,0,0) to data router node 22(1,0,0).
Transfer of a message packet from one leaf 21(x) to another leaf 21(y)
through the data router 15 message transfer proceeds in two general
operations. First, the data router nodes 22(i,j,k) transfer the message
packet first "up the tree," that is, to data router nodes in successively
higher levels, until it reaches a selected maximum level determined in
part by the separation between the source and destination leaves. After a
message packet has reached the selected maximum level, the transfer
continues "down the tree", during which the data router nodes 22(i,j,k)
transfer the message packet to data router nodes at successively lower
levels until it is delivered to the destination leaf 21(y). As will be
clear from the detailed description of the structure and operation of a
data router node 22(i,j,k) in FIGS. 11A through 11D below, the data router
15 can transfer a plurality of messages concurrently, any of the data
router nodes 22(i,j,k) can direct messages up the tree and other messages
down the tree at the same time.
Before proceeding further, it may be helpful to describe the structure of a
message packet transferred over the data router 15. With reference to FIG.
3, a data router message packet 30 includes three general portions,
including a message address portion 31, a message data portion 32, and a
checksum portion 33, each comprising one or more "flits." In one
embodiment, each flit comprises four bits, which are transferred in
parallel over a data router connection, that is, between a leaf 21 and a
data router node 22(i,j,k) or between two data router nodes 22(i,j,k).
The message data portion 32 includes several elements, including a length
flit 34, a tag flit 35 and one or more data flits 36(0) through 36(N)
(generally identified by reference numeral 36). The data flits 36
generally contain the actual message data being transferred over the data
router 15, which may vary from packet to packet. The tag flit 35 contains
control information which may be used by the destination leaf, identified
herein by reference numeral 22(y), in processing the data. The contents of
the length flit 34 are identify the number of flits in the message data
portion 32, and may vary depending on the amount of data being transferred
in a particular packet. In one particular embodiment, the contents of
length flit 34 identify the number of thirty-two bit words in the data
flits 36 of the message packet. In that embodiment, the number of data
flits 36 in the message packet is eight times the value in the length flit
34.
The checksum portion 33 contains a value which is used in detecting errors
in packet transmission over the data router 15.
The data router 15 uses the contents of the message address portion 31 to
determine the path to be traversed by the message packet 30 from the
source leaf to the destination leaf. The message address portion 31
includes a header 40, which identifies the selected maximum level to which
the message packet is to be transferred when going up the tree, and a down
path identification portion 41 which identifies the path down the tree to
the destination leaf 21(y) when going down the tree. When directing a
message packet up the tree, a data router node 22(i,j,k) at level "i,"
randomly selects one of the data router nodes 22(i+1,j,k) connected
thereto in level "i+1" in data router node group 20(i+1,j) to receive the
message packet. Other than specifying the selected maximum height for the
message packet, the packet does not otherwise specify the particular path
it is to take up the tree.
The down path identification portion 41 of message packet 30 defines the
path the packet is to take down the tree from the data router node group
20(i,j) at the selected maximum level to the destination leaf 21(y). The
down path identification portion includes one or more down path identifier
fields 42(1) through 42(M) (generally identified by reference numeral 42).
The successive down path identifier fields 42, beginning with field 42(M),
are used by the data router nodes 22(i,j,k) at successively lower levels
as they direct the packet downwardly in the tree.
The down path identifier field 42(i) for level "i" identifies the child
data router node group 20(i-1,j) to which the parent data router node
group 20(i,j) that receives the packet at level "i" is to direct the
message packet 30. It will be appreciated that the down path identifier
fields 42 need not specifically identify one of the data router nodes
22(i-1,j,k) in the data router node group 20(i,j) at each level to which
the message packet is to be directed, since the path down the tree is
effectively a traversal of the inverted tree of which the destination leaf
21(y) is the root.
In one embodiment, in which each parent data router node group 20(i,j) is
connected to four child data router node groups 20(i-1,j) or four leaves
21, each down path identifier field 42 comprises two bits that are binary
encoded to identify one of the four children to which the message is to be
directed. As indicated by FIG. 3, two fields 42 are packed into a single
four-bit flit in the message packet 30. Since one down path identifier
field 42 is used to at each level (i) in the downward traversal, the
number of down path identifier fields 42 required to define the downward
path corresponds to the selected maximum level in the path up the tree,
which, in turn, corresponds to the contents of header 40. During the
downward traversal mode, the data router nodes 22(i,j,k) through which a
message packet 30 passes decrement the contents of the header 40 and,
after both down path identifier fields 42 contained in a flit have been
used, discard the flit. Thus, the length and content of a message packet
30 may change as it is being passed down the tree.
It will be appreciated that the addressing arrangement provided by the
header 40 and down path identification portion 41 can be viewed as
follows. The selected maximum height in header 40 effectively identifies
the data router node group 20(i,j) which is the root of a sub-tree,
preferably the smallest sub-tree, of the data router 15 that contains both
the source leaf 21(x) and the destination leaf 21(y). On the other hand,
the down path identification portion 41 details the exact path from that
root to the destination leaf 21(y).
The provision of increasing numbers of data router nodes 22(i,j,k) in data
router node groups 20(i,j) at higher levels in the data router 15, thereby
resulting in a "fat-tree" design, provides several advantages. In a
massively parallel computer SIMD system, processing elements 11 typically
transfer messages during a message transfer operation, initiated by
commands from the scalar processors 12. During a message transfer
operation, a large number of processing elements 11 may transfer messages
concurrently. If the data router 15 did not have increasing numbers of
data router nodes 22(i,j,k) at higher levels to which the message packets
30 can be directed when going up the tree, the bandwidth of the data
router 15, that is, the rate at which it can transfer message packets 30,
would decrease at higher levels.
Since increasing numbers of data router nodes 22(i,j,k) are provided at
higher levels in the "fat-tree" design, the reduction in bandwidth at
higher levels can be minimized or controlled. As noted above, the fan-out
of data router node groups 20(i,j), that is, the number of data router
nodes 22(i+1,j,k) at level "i+1" connected to each data router node
22(i,j,k) at level "i" can vary from level to level, and can be selected
to maintain a desired minimum bandwidth between the respective levels "i"
and "i+1." Alternatively, the fan-outs from each level to the next higher
level can be selected so that the entire data router 15 has a selected
minimum bandwidth.
Further, as noted above, each data router node 22(i,j,k) randomly selects
the data router node 22(i+1,j,k) in the next higher level to which it
directs a message packet 30 in the path up the tree. Accordingly, the
message packets are randomly distributed through the higher levels of the
tree, which minimizes the likelihood of bottlenecks and maximizes the
bandwidth in the higher levels.
As shown in FIGS. 2A and 2B, each data router node group 20(i,j), and in
particular each data router node 22(i,j,k), in the data router 15 receives
an AFD(i,j) all-fall-down (i,j) signal. The AFD(i,j) all-fall-down (i,j)
signal is provided by the control network 14, as will be described below
in connection with FIGS. 4A and 4B, under control of the scalar processors
12 to initiate a context switch operation. The AFD(i,j) all-fall-down
(i,j) signal, when asserted, enables the data router 15 to enter an
all-fall-down mode, in which it quickly empties itself of message packets.
In response to the AFD(i,j) all-fall-down (i,j) signal, the data router 15
directs all message packets 30 directly down the tree to the leaves 21,
where they are stored until the context in which the message packets were
generated is restored. At that point, the leaves 21 which receive such
messages can transmit them over the data router 15, which will deliver
them to the intended destinations.
In contrast to normal operation described above, in which the contents of
the header 40 are decremented and flits containing down path identifier
fields 42 discarded as the message packet 30 is directed down the tree,
when the AFD(i,j) all-fall-down (i,j) signal is asserted the contents of
the header 40 are not decremented and no changes are made to the flits
containing the down path identifier fields 42. When the context is
restored and the leaves 21 return the message packets to the data router
15, they will be delivered to the proper destination leaves. This can be
seen from the following explanation.
In the following explanation, reference numerals 21(x) and 21(y) will refer
to the original source and destination leaves, respectively, for a message
packet 30 and reference numeral 21(x') will refer to the intermediate
storage leaf which receives and stores the message packet 30 while the
context in which the data router message packet 30 was generated is being
switched out. First, for those message packets that are being transferred
up the tree or that have reached the selected maximum height when the
AFD(i,j) all-fall-down (i,j) signal is asserted, the contents of the
header 40 and down path identification portion 41 are the same as when
they were originally transmitted by the source leaf 21(x). Since the
intermediate storage leaf 21(x') receives the message packet 30 it must be
part of a sub-tree of the data router 15 that includes both the source
leaf 21(x) and the destination leaf 21(y). Further, the sub-tree has the
same root data router node group 20(i,j) that the message packet 30 would
have reached had the AFD(i,j) all-fall-down (i,j) signal not been
asserted. Accordingly, when the intermediate storage leaf 21(x') transmits
the message packet over the data router 15, the packet will go up the tree
and reach the same data router node group 20(i,j) that it would have
reached if the AFD(i,j) all-fall-down (i,j) signal had not been asserted,
and from there will follow the same downward path, defined by the down
path identification portion 41, that it would have taken.
On the other hand, if a message packet is being transferred down the tree
when the AFD(i,j) all-fall-down (i,j) signal is asserted, prior to the
signal's assertion the contents of the header field 40 are decremented as
the message packet is passed from level to level. Accordingly, it will be
appreciated that, when the message packet 30 is transmitted by the
intermediate storage leaf 21(x'), in its path up the tree it will go only
to a data router node group 20(i,j) at the level indicated in the header
field 40, which, in turn, corresponds to the data router node group
20(i,j) which controlled the direction of transfer of the message packet
30 when the AFD(i,j) all-fall-down (i,j) signal signal was asserted. It
will be appreciated that the data router node group 20(i,j) that the
message packet 30 reaches may not be the root of a sub-tree that includes
the source leaf 21(x). However, it will be the root of a sub-tree that
includes both the intermediate storage leaf 21(x'), since the message
packet 30 was transferred from that data router node group 20(i,j) to the
intermediate storage leaf 21(x'), and the destination leaf 21(y), since
the message packet 30 could have been transferred from that data router
node group 20(i,j) to the destination leaf had the AFD all-fall-down (i,j)
signal not been asserted.
As will be described in further detail below, each leaf 21 maintains a
message counter that it increments when it tranmsits a message packet over
the data router 15, and that it decrements when it receives a message
packet from the data router 15. As noted above, the control network 14
performs selected arithmetic operations, whose results can be provided to
| | |