|
Description  |
|
|
FIELD OF THE INVENTION
The present invention relates to a parallel processing computer
architecture.
BACKGROUND OF THE INVENTION
Computer processing speed and efficiency in both scalar and vector machines
can be achieved through the use of multiprocessing techniques. By
increasing the number of processors and operating them in parallel, more
work can be done in a shorter period of time.
Initial attempts to increase system speed and efficiency involved the use
of a limited number of processors running in parallel. For instance, an
example of a two-processor multiprocessing vector machine is disclosed in
U.S. Pat. No. 4,636,942, issued Jan. 13, 1987 to Chen et al. Another
aspect of the two-processor machine of the Chen '942 patent is disclosed
in U.S. Pat. No. 4,661,900, issued Apr. 28, 1987 to Chen et al. A
four-processor multiprocessing vector machine is disclosed in U.S. Pat.
No. 4,745,545, issued May 17, 1988 to Schiffleger, and in U.S. Pat. No.
4,754,398, issued Jun. 28, 1988 to Pribnow. All of the above named patents
are assigned to Cray Research, Inc., the assignee of the present
invention.
As the number of processors in a computing system increase, direct
connection and close cooperation between all of the processors becomes
impossible. As a result the programming paradigm shifts from
multiprocessing to concurrent computing. In a concurrent computer a large
number of processors work independently on a pieces of a concurrent
program. The processors must still communicate in order to coordinate and
share data but they can operate independently on that data. In concurrent
computers, communication efficiency becomes critical. Communication
latency must be low but at the same time packaging density must be
optimized to limit the amount of processor-to-processor interconnect; in
addition, it is preferable in some applications to ensure deterministic
communication latency.
In response to the need to balance interconnect density against
communication latency, a variety of network topologies have been
developed. Most such network topologies limit the connections between
processors to a relatively small number of neighbors. A large class of
such topologies can be characterized as either k-ary n-cubes or as
networks such as rings, meshes, tori, binary n-cubes and Omega networks
which are isomorphic to k-ary n-cubes. Processors in this class of
topologies communicate via a message passing protocol in which information
intended for a distant processor is packetized and routed through
intermediate processors to the destination processor.
Communication latency in a network such as a k-ary n-cube depends heavily
on the choice of routing algorithm. Routing algorithms fall into two
categories: store-and-forward routing and wormhole routing. In
store-and-forward routing, a message sent from one processor to another is
captured and stored in each intermediate processor before being sent on to
the next processor. This means that each processor must have a fairly
large buffering capacity in order to store the number of messages which
may be in transit through the processor. Also, since a message must be
received in its entirety before it can be forwarded, store-and-forward
approaches to routing result in communication latencies which increase
dramatically as a function of the number of nodes in a system. On the
other hand, such an approach is amenable to the use of deadlock free
algorithms which avoid deadlock by preventing or reducing the occurrences
of blocking in message transfers.
In wormhole routing a message is divided into a number of smaller message
packets call flits. A header flit is received by a processor and examined
as to its destination. The header flit is then sent on to the next
processor indicated by the routing algorithm. Intermediate flits are
forwarded to the same processor soon after they are received. This tends
to move a message quickly through the system. Since, however, each
intermediate flit is devoid of routing information, a channel to the next
processor is considered dedicated to the message until the complete
message is transferred. This results in blocking of other messages which
might need to use that particular channel. As more messages block, the
system can become deadlocked.
A number of approaches have been offered for resolving the problem of
deadlock in wormhole routing. In virtual cut-through routing, messages
which are blocked are removed from the network and stored in buffers on
one of the intermediate processors. Therefore, blocking in virtual
cut-through networks can be avoided through the use of many of the
deadlock avoidance algorithms available for store-and-forward routing.
Virtual cut-through routing avoids deadlock but at the cost of the
additional hardware necessary to buffer blocked messages.
Two alternate approaches for avoiding deadlock in wormhole routing
communications networks are described in "Adaptive, low latency,
deadlock-free packet routing for networks of processors," published by J.
Yantchev and C. R. Jesshope in IEEE Proceedings, Vol. 136, Pt. E, No. 3,
May 1989. Yantchev et al. describe a method of avoiding deadlock in
wormhole routing in which the header flit, when blocked, coils back to the
source node. The source node then waits for a non-deterministic delay
before trying to send the message again. Yantchev et al. indicate that
such an approach is likely to prove very expensive in terms of
communications costs and that these costs will likely increase out of
proportion as network diameter increases.
Yantchev et al. also propose an improved wormhole routing algorithm which
operates to remove cycles in a network channel dependency graph by
constraining routing within the network to message transfers within a
series of virtual networks lain over the existing communications network.
Under the Yantchev method, the physical interconnection grid is
partitioned into classes according to the directions needed for message
packet routing. In a two-dimensional array of processors, these classes
would correspond to (+X, +Y), (-X, +Y), (+X, -Y) and (-X, -Y). Each class
defines a particular virtual network; the combination of two of the
virtual networks (such as (+X, +Y) and (-X, -Y)), along with a suitable
deadlock free multiplexing scheme, results in a fully connected network
which is deadlock-free. Yantchev et al. teach that the two-dimensional
scheme can be extended to an n-dimensional network in which one virtual
network is used for increasing coordinates while a second is used for
decreasing coordinates. The method of virtual networks can also be
extended to include adaptive routing.
The method taught by Yantchev et al. can be used to good effect in avoiding
deadlock in mesh networks. The Yantchev approach is not, however, as
practical for networks having wrap-around channels, such as tori.
Wrap-around channels increase the number of cycles in a network. To
eliminate these cycles Yantchev et al. teach that a toroidal network can
be decomposed into a fully unwrapped torus equivalent consisting of two or
more subarrays. Message passing is then limited to transfers within a
subarray.
Such an approach, while breaking the cycles, does so at a relatively high
cost. Under Yantchev, a large number of virtual channels must be allocated
for each node (eight for an unwrapped two-dimensional toroid) in order to
break all possible cycles. As the number of dimensions increase, the
number of virtual channels needed for deadlock free routing also
increases.
Dimension order, or e-cube routing is yet another wormhole approach to
deadlock-free routing. In dimension order routing, an ordering of
dimensions is selected and all traffic completes its routing in that
order. That is, all routing is completed in one dimension before any
routing is allowed in another dimension. This rigid routing scheme
provides deadlock free transfers by restricting the types of turns
possible in a message transfer (i.e. eliminating cycles in the acyclic
mesh). Dimension order routing is described in "Deadlock-free Message
Routing in Multiprocessor Interconnection Networks" published by William
J. Dally and Charles L. Seitz in IEEE Transactions on Computers, Vol.
C-36, No. 5, May 1987.
Dimension order routing provides a deterministic routing protocol but,
since it only provides a single path between a source and a destination
node, in mesh networks this method is not fault tolerant. In toroidal
networks, the situation is not much better. A toroid has 2.sup.n possible
paths but all paths turn on the same n-1 nodes. Because of this, a failure
in any node can cut off communication between one or more node pairs.
Each of the communications networks described above suffers limitations in
its applicability to network topologies having hundreds or thousands of
nodes. There is a need in the art for a communications network which
resolves the above-mentioned problems in an efficient and hardware limited
fashion while achieving low communications latency.
SUMMARY OF THE INVENTION
A multidimensional interconnection and routing apparatus for a parallel
processing computer connects together processing elements in a
three-dimensional structure. The interconnection and routing apparatus
includes a plurality of processing element nodes. A communication connects
at least one of the processing elements with a host system. An
interconnection network connects together the processing element nodes in
an X, Y, and Z dimension. The network includes communication paths
connecting each of the plurality of processing elements to adjacent
processing elements in the plus and minus directions of each of the X, Y,
and Z dimensions.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram of an MPP system components.
FIG. 2 is a diagram of processing element components for an MPP system.
FIG. 3 is a diagram of a processing element node for an MPP system.
FIG. 4 is a diagram of interconnected network components for an MPP system.
FIG. 5 is a diagram of communication links for an MPP system.
FIG. 6 is a diagram of communication link signals for an MPP system.
FIG. 7 is a diagram of a one dimensional torus network.
FIG. 8 is a diagram of a two dimensional torus network.
FIG. 9 is a diagram of a three dimensional torus network.
FIG. 10 is a diagram of interleaving processing nodes within an MPP system.
FIG. 11 is a diagram of +X, +Y, and +Z dimension information travel within
an MPP system.
FIG. 12 is a diagram of -X, -Y, and -Z dimension information travel within
an MPP system.
FIG. 13 is a diagram of information travel within an MPP system for
avoiding a bad communication link in the Y dimension.
FIG. 14 is a diagram of a dateline communication link within an MPP system.
FIG. 15 is a diagram of generic packet formats for information within an
MPP system.
FIG. 16 is a diagram of a processing element network router for an MPP
system.
FIG. 17 is a diagram of an X dimension switch for an MPP system.
FIG. 18 is a diagram of an input node network router for an MPP system.
FIG. 19 is a diagram of an I/O gateway for an MPP system.
FIG. 20 is a diagram of data paths through each dimension switch logic.
FIG. 21 is a diagram of logic for a random number generator.
FIG. 22 is a diagram of buffers in an MPP system.
FIG. 23 is a diagram of dateline deadlock avoidance.
FIG. 24 is a diagram of naive and optimized virtual channel allocations.
FIG. 25 is a diagram showing an example of standard and origin allocation.
FIG. 26 is a diagram showing an example of linear-lengthwise and partition
allocation.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the following detailed description of the preferred embodiment,
reference is made to the accompanying drawings which form a part hereof
and in which is shown by way of illustration a specific embodiment in
which the invention may be practiced. This embodiment is described in
sufficient detail to enable those skilled in the art to practice the
invention, and it is to be understood that other embodiments may be
utilized and that structural or logical changes may be made without
departing from the scope of the present invention. The following detailed
description is, therefore, not to be taken in a limiting sense, and the
scope of the present invention is defined by the appended claims.
INTRODUCTION
The following describes the architecture and functions a first-phase
massively parallel processing (MPP) system. The MPP system typically
contains hundreds or thousands of microprocessors, each accompanied by a
local memory. The system is designed to support two styles of MPP
programming: data parallel and message passing.
Data parallel programs, such as High Performance Fortran (HPF), are
designed to provide a programmer with ease of use while still providing a
good percentage of MPP performance. Message passing programs, such as
parallel virtual machine (PVM) messaging, provide a higher percentage of
peak MPP performance.
Cray Research, Inc., the assignee of the present application, supports both
styles of programming so that customers may take a program from another
vendor's MPP system and port it to a Cray Research, Inc. MPP system with a
minimum amount of effort. For more information on Cray Research's MPP
Programming Models, refer to the CRAY T3D Software Overview Technical Note
publication number SN-2505, which is incorporated herein by reference.
The MPP system connects to a host computer system. The host system runs the
software compilers for the MPP system. All programs written for the MPP
system are compiled on the host system, but run in the MPP system. The
host system may be, for example, any Cray Research, Inc. computer system
that has an input/output subsystem model E (IOS-E). Examples of host
systems include the CRAY Y-MP E series computer systems, the CRAY Y-MP M90
series computer systems, and the CRAY Y-MP C90 series computer systems.
The host system may reside in the same cabinet as the MPP system. This
configuration is called a single-cabinet configuration. The host system
may also reside in a separate cabinet that is cabled to the MPP system
cabinet. This configuration is called a multiple-cabinet configuration.
The MPP system preferably contains four types of components: processing
element nodes, the interconnect network, I/O gateways, and a clock. FIG. 1
shows a simplified model of the components of the MPP system. The
following sections describe preferred components of an MPP system.
PROCESSING ELEMENT NODES
An MPP computer system typically contains hundreds or thousands of
microprocessors, each accompanied by a local memory. Each microprocessor
and local memory component is called a processing element. In the MPP
system, each processing element contains a microprocessor 10, local memory
12, and support circuitry 14 (refer to FIG. 2).
The microprocessor is preferably an Alpha reduced instruction set computer
(RISC) 64-bit microprocessor developed by Digital Equipment Corporation.
The microprocessor performs arithmetic and logical operations on 64-bit
integer and 64-bit floating-point registers. The microprocessor also
preferably contains an internal instruction cache memory and data cache
memory that each store 256 lines of data or instructions. Each line in the
instruction and data cache memory is four 64-bit words wide.
Local memory preferably comprises a dynamic random access memory (DRAM)
that stores system data. A low-latency, high-bandwidth data path connects
the microprocessor to local memory in a processing element.
The MPP system memory is physically distributed because each processing
element contains local memory; however, the system memory is also
logically shared. System memory is logically shared because the
microprocessor in one processing element can access the memory of another
processing element without involving the microprocessor in that processing
element.
The support circuitry extends the control and addressing functions of the
microprocessor. This includes performing data transfers to or from local
memory.
The MPP system may contain 128, 256, 512, 1,024, or 2,048 processing
elements depending on the system configuration (excluding the processing
elements in the I/O gateways). The processing elements reside in
processing element nodes.
Each processing element node typically contains two processing elements 16
and 18, a network interface 20, and a block transfer engine 22 (refer to
FIG. 3). The following paragraphs briefly describe each of these
components. Processing elements (PEs) 16 and 18 in a processing element
node are preferably identical but function independently. Access to block
transfer engine 20 and network interface 22 is shared by the two PEs.
Network interface 20 formats information before it is sent over the
interconnect network to another processing element node or I/O gateway.
Network interface 20 also receives incoming information from another
processing element node or I/O gateway and steers the information to PE 0
or PE 1 in the processing element node.
Block transfer engine (BLT) 22 is an asynchronous direct memory access
controller that redistributes system data. BLT 22 redistributes system
data between the local memory in PE 0 or PE 1 and the memory in remote
PEs. BLT 22 can redistribute up to 65,536 64-bit words of data (or 65,536
4-word lines of data) without interrupting the PE.
INTERCONNECT NETWORK
The interconnect network provides communication paths among the processing
element nodes and the I/O gateways in the MPP system. The interconnect
network forms a three dimensional matrix of paths which connect the nodes
in the X, Y, and Z dimensions (see FIG. 1).
The interconnect network is comprised of communication links 26a-26f and
network routers 24. FIG. 4 shows how the components of the interconnect
network connect to a processing element node.
The following describes the components of the interconnect network and
describes characteristics of the interconnect network.
Communication Links
Communication links transfer data and control information between the
network routers in the interconnect network. Each communication link
connects two nodes in one dimension (see FIG. 5); for example,
communication link 28 connects nodes 30 and 32 in the X dimension.
A communication link typically comprises two unidirectional channels. Each
channel in the link preferably contains Data, Physical Unit (Phit) Type,
Virtual Channel Select, and Virtual Channel Acknowledge signals. FIG. 6
shows the signals for both unidirectional channels in one communication
link.
Data Signals
Each channel typically contains 16 Data signals. Data signals preferably
carry two types of information: requests or responses. Requests contain
information that request a node to perform an activity. For example, a
source node may send a request to a destination node to read data from
memory in the destination node. This request is sent over one channel in
the communication link.
Responses contain information that is the result of an activity. For
example, after receiving a request for read data, a destination node sends
the response back to the source node. The response contains the read data.
Requests and responses preferably must be logically separated. This is
preferably done by providing separate buffers for requests and responses.
These buffers are used to create virtual channels.
Phit Type Bits
A phit is the amount of information that can be placed on a data channel in
one clock period. In the MPP system described in the present
specification, a phit is 16 bits in size.
Each channel preferably contains two phit type bits that are controlled by
the node sending information over the channel. These bits indicate what
type of phit is on the Data signals. Table 1 lists the definitions of the
least significant bit (LSB) and most significant bit (MSB) of the phit
type bits. (More information on packets is provided at the end of this
subsection).
TABLE 1
______________________________________
Phit Type Bit Definitions
MSB LSB Data Signals Contain
______________________________________
0 0 No information
0 1 Packet routing tag phit
1 0 Packet phits
1 1 Last phit of packet
______________________________________
Virtual Channel Signals
The virtual channel signals are used to control which virtual channel the
data will use. A virtual channel is created when request and response
information transfers over the same physical communication link but is
stored in separate buffers. The virtual channel signals include the
virtual channel select bits and the virtual channel acknowledge bits.
There are two virtual channel select bits. These bits indicate which
virtual channel buffer in the receiving node the information will be
stored in. Table 2 shows the definitions of the virtual channel select
bits.
TABLE 2
______________________________________
Virtual Channel Select Bit Definitions
MSB LSB Definition Name
______________________________________
0 0 Request buffer 0
Virtual channel 0
0 1 Request buffer 1
Virtual channel 1
1 0 Response buffer 0
Virtual channel 2
1 1 Response buffer 1
Virtual channel 3
______________________________________
The most significant bit of the virtual channel select bits indicates if
the information on the Data signals is a request or a response. When set
to 0, this bit indicates the information is a request. When set to 1, this
bit indicates the information is a response.
The least significant bit of the virtual channel select bits indicates
which of the two request or two response buffers the information on the
Data signals will be stored in. When set to 0, this bit indicates the
information will be stored in buffer 0. When set to 1, this bit indicates
the information will be stored in buffer 1.
There are four virtual channel acknowledge bits. Each virtual channel
buffer controls one of the virtual channel acknowledge bits. For example,
virtual channel buffer 2 controls bit 2.sup.2 of the virtual channel
acknowledge bit. The node receiving information sets the appropriate
virtual channel acknowledge bit to 1 while the node empties the virtual
channel buffer and sends the information to another node or a PE. The node
resets the virtual channel acknowledge bit to 0 after the virtual channel
is empty and the data has been sent to another node or a PE.
Torus Interconnect Topology
The interconnect network is connected in a bidirectional torus. A torus
contains communication links that connect the smallest numbered node in a
dimension directly to the largest numbered node in the same dimension.
This type of connection forms a ring where information can transfer from
one node, through all of the nodes in the same dimension, and back to the
original node.
FIG. 7 shows a one dimensional torus network in the X dimension.
Information can transfer from node 00, through all of the nodes, and back
to node 00 in a circular fashion. Each node has a communication link in
both the plus and minus direction of the X dimension.
Torus networks offer several advantages for network communication. One
advantage is speed of information transfers. For example, in FIG. 7, node
07 can communicate directly with node 00 instead of sending information
through all of the nodes in the X dimension. Another advantage of the
torus network is the ability to avoid bad communication links. For
example, in FIG. 7, if node 00 cannot transfer information directly to
node 01 due to a bad communication link, node 00 can still communicate
with node 01 by sending the information the long way around the network
through the other nodes in the X dimension.
FIG. 8 shows a two dimensional torus network in the Y and X dimensions.
Each node has communication links in both the plus and minus directions of
the Y and X dimensions. FIG. 9 shows a three dimensional torus network in
the Z, Y, and X dimensions. Each node has communication links in both the
plus and minus directions of the Z, Y, and X dimensions.
Several of the diagrams in this specification show three dimensional
network connections. For clarity, the communication link that completes
the torus in each dimension is not shown. It is important to remember
that, although not shown in the diagrams, this communication link is
present.
Interleaving
The nodes in the interconnect network are preferably interleaved.
Interleaving is the physical placement of nodes so that the maximum wiring
distance between nodes is minimized.
FIG. 10 shows two one-dimensional torus networks. The eight nodes in upper
network 34 are not interleaved. The eight nodes in lower network 36 are
interleaved. In the interleaved network (also called a folded torus
network), the physical length of the longest communication link is shorter
than the physical length of the longest communication link in the
non-interleaved network. The X and Z dimensions of the network are
interleaved. This minimizes the length of the physical communication links
(wires) in the MPP system.
Several of the diagrams in this specification contain drawings of three
dimensional interconnect networks. For clarity, the communication links
are shown logically and do not show the interleaving. It is important to
remember that although not shown, the nodes in the network are physically
interleaved in the preferred embodiment.
Dimension Order Routing
When a node sends information to another node, the information may travel
through several communication links in the network. Each transfer of
information over a communication link is referred to as a hop. After
information leaves a node, it typically travels through the network in the
X dimension first, then through the Y dimension, and finally through the Z
dimension. When finished moving through the communication links in the Z
dimension, the information arrives at the destination node. This method of
information travel is called dimension order routing.
For example, if node A shown in FIG. 11 sends request information to node
B, the information first travels one hop in the +X direction. Since the
information does not need to travel any farther in the X dimension, it
switches direction to the Y dimension. After completing one hop in the +Y
direction, the information switches direction to the Z dimension and
completes one hop in the +Z direction. After completing one hop in the +Z
direction, the request information arrives at node B.
Information does not always travel in the positive direction of a
dimension. For example, of node B in FIG. 12 sends response information to
node A, the information completes on hop in the -X direction and then
changes direction into the Y dimension. The information completes one hop
in the -Y direction before changing direction into the Z dimension. After
completing one hop in the -Z direction, the response information arrives
at node A.
Because information can travel in either the positive or negative direction
of a dimension, bad communication links can be avoided. For example, if
node A in FIG. 13 sends information to node B, the information completes
one hop in the +X direction and then switches direction into the Y
dimension. Consider, for example, that due to a bad communication link,
the information cannot complete a hop in the +Y direction. Instead, the
information may be routed so it completes two hops in the -Y direction and
travels the long way around the torus in the Y dimension. After switching
directions into the Z dimension, the information completes one hop in the
+Z direction and arrives at node B.
An example of a system for information routing is described in patent
application Ser. No. 07/983,979 filed Nov. 30, 1992 and entitled
"DIRECTION ORDER ROUTING IN MULTIPROCESSING SYSTEMS," which is
incorporated herein by reference.
Virtual Channels
A virtual channel is created when request and response information travels
over the same physical communication link, but is stored in different
buffers. The MPP system contains four virtual channel buffers (see Table
3).
TABLE 3
______________________________________
Virtual Channel Buffers
Buffer Name Definition
______________________________________
Virtual channel 0 Request buffer 0
Virtual channel 1 Request buffer 1
Virtual channel 2 Response buffer 0
Virtual channel 3 Response buffer 1
______________________________________
The virtual channel buffers prevent two types of communication deadlock
conditions that may occur in the interconnect network. The following
describes these conditions.
Without the virtual channel buffers, a communication deadlock condition may
occur if two nodes simultaneously transfer request or response information
to each other. To prevent this condition from occurring, the MPP system
contains two types of buffers: request buffers and response buffers. These
buffers provide separate destination buffers for request and response
information.
Also without the virtual channel (VC) buffers, a communication deadlock
condition may occur if all of the nodes in one dimension send request or
response information to the next node in the dimension at the same time.
For example, a deadlock condition may occur if all of the nodes in the X
dimension send request information to the next node in the +X direction at
the same time. To prevent this condition from occurring, the MPP system
preferably contains two request buffers and t | | |