WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
System for allocating messages between virtual channels to avoid deadlock and to optimize the amount of message traffic on each type of virtual channel    
United States Patent5583990   
Link to this pagehttp://www.wikipatents.com/5583990.html
Inventor(s)Birrittella; Mark S. (Chippewa Falls, WI); Kessler; Richard E. (Eau Claire, WI); Oberlin; Steven M. (Chippewa Falls, WI); Passint; Randal S. (Chippewa Falls, WI); Thorson; Greg (Altoona, WI)
AbstractA multidimensional interconnection and routing apparatus for a parallel processing computer connects together processing elements in a three-dimensional structure. The interconnection and routing apparatus includes a plurality of processing element nodes. A communication connects at least one of the processing elements with a host system. An interconnection network connects together the processing element nodes in an X, Y, and Z dimension. The network includes communication paths connecting each of the plurality of processing elements to adjacent processing elements in the plus and minus directions of each of the X, Y, and Z dimensions.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5583990
System for allocating messages between virtual channels to avoid

     deadlock and to optimize the amount of message traffic on each type of

     virtual channel - US Patent 5583990 Drawing
System for allocating messages between virtual channels to avoid deadlock and to optimize the amount of message traffic on each type of virtual channel
Inventor     Birrittella; Mark S. (Chippewa Falls, WI); Kessler; Richard E. (Eau Claire, WI); Oberlin; Steven M. (Chippewa Falls, WI); Passint; Randal S. (Chippewa Falls, WI); Thorson; Greg (Altoona, WI)
Owner/Assignee     Cray Research, Inc. (Eagan, MN)
Patent assignment
All assignments
Publication Date     December 10, 1996
Application Number     08/165,266
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     December 10, 1993
US Classification     712/29 710/52 710/56 711/170
Int'l Classification     G06F 013/00
Examiner     Lee; Thomas C.
Assistant Examiner     Krick; Rehana Perveen
Attorney/Law Firm     Schwegman; Lundberg , Woessner & Kluth, P.A.
Address
Parent Case    
Priority Data    
USPTO Field of Search     395/800 395/800.01 395/200.01 395/872 395/876 395/497.01
Patent Tags     allocating messages between virtual channels avoid deadlock optimize amount message traffic each type of virtual channel
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5390164
Kremer
370/223
Feb,1995

[0 after 0 votes]
5341504
Mori
712/12
Aug,1994

[0 after 0 votes]
5313645
Rolfe
712/11
May,1994

[0 after 0 votes]
5218676
Ben-Ayed

Jun,1993

[0 after 0 votes]
5170482
Shu
712/12
Dec,1992

[0 after 0 votes]
5157692
Horie
375/260
Oct,1992

[0 after 0 votes]
5105424
Flaig
709/243
Apr,1992

[0 after 0 votes]
5008882
Peterson

Apr,1991

[0 after 0 votes]
4933933
Dally
370/406
Jun,1990

[0 after 0 votes]
5383191
Hobgood
709/251
Dec,1969

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A method of avoiding deadlock in a multiprocessor computer system having a plurality of processing element nodes being interconnected by an interconnection network in an n-dimensional topology, the network including physical communication paths connecting each of the plurality of processing element nodes to adjacent processing element nodes, the method comprising the steps of:

defining two types of virtual channels for each of the n-dimensions, each type of said virtual channel having virtual channel buffers assigned to each physical communication path, the virtual channel buffers being capable of storing messages communicated between the processing element nodes over the physical communication paths;

defining a dateline in each one of said virtual channels, said dateline representing a communication link between two virtual channel buffers, which complete a cyclic path in their associated virtual channel, wherein if messages cross the dateline a cyclic buffer dependency can occur which creates a deadlock;

allocating the messages to the virtual channel buffers in any one of the two types of virtual channels when the messages are being transferred among the processing element nodes in any one of the n dimensions without crossing a dateline; and

allocating the messages such that messages cannot cross a dateline in its associated type of virtual channel, but instead must use the other type of virtual channel to cross the dateline to thereby avoid deadlock.

2. The method of claim 1 further comprising the step of placing the defined datelines to decrease imbalances in the utilization of the two types of virtual channels for each processing element node and to avoid deadlock.

3. The method of claim 1 wherein the messages communicated between the processing element nodes over the physical communication paths include request and response information, and wherein the step of defining two types of virtual channels includes assigning two types of virtual channel buffers capable of storing request information and assigning two types of virtual channel buffers capable of storing response information.

4. The method of claim 1 further comprising the steps of partitioning the processing element nodes into at least two portions for each of the n dimensions and placing the defined datelines at boundaries of the at least two portions.

5. The method of claim 4 wherein the allocating steps are performed so that messages between the two types of virtual channels within the at least two portions are allocated prior to allocating messages between the two types of virtual channels crossing portion boundaries.

6. The method of claim 5 wherein allocating messages between the two types of virtual channels within the at least two portions includes alternating the allocation between the two types of virtual channels depending on the lengths of the associated physical communication paths.

7. The method of claim 5 wherein allocating messages between the two types of virtual channels crossing portion boundaries includes allocating all virtual channel buffers assigned to physical communication paths which pass one or more processing element node after crossing portion boundaries on a first side to the first type of virtual channel and allocating all other virtual channel buffers assigned to physical communication paths crossing portion boundaries on the first side to the second type of virtual channel, and allocating all virtual channel buffers assigned to physical communication paths which pass one or more processing element node after crossing portion boundaries on a second side to the second type of virtual channel and allocating all other virtual channel buffers assigned to physical communication paths crossing portion boundaries on the second side to the first type of virtual channel.

8. A multiprocessor computer system comprising:

a plurality of processing element nodes;

an interconnection network interconnecting the plurality of processing element nodes in an n-dimensional topology, the network including physical communication paths connecting each of the plurality of processing element nodes to adjacent processing element nodes, and two types of virtual channels for each of the n dimensions, each type of said virtual channel having virtual channel buffers assigned to each physical communication path, the virtual channel buffers being capable of storing messages communicated between the processing element nodes over the physical communication paths; and

a look-up table storing information indicative of a defined dateline in each one of said virtual channels, said dateline representing a communication link between two virtual channel buffers, which complete a cyclic path in their associated virtual channel, wherein if messages cross the dateline a cyclic buffer dependency can occur which creates a deadlock, the look-up table further storing information indicative of an allocation of messages to the virtual channel buffers in any one of the two types of virtual channels when the messages are to be transferred among the processing element nodes in any one of the n dimensions without crossing a dateline, and an allocation of messages such that messages cannot cross a dateline in its associated type of virtual channel, but instead must use the other type of virtual channel to cross the dateline to thereby avoid deadlock.

9. The method of claim 1 wherein the allocating steps are performed to decrease imbalances in the utilization of the two types of virtual channels for each processing element node.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

The present invention relates to a parallel processing computer architecture.

BACKGROUND OF THE INVENTION

Computer processing speed and efficiency in both scalar and vector machines can be achieved through the use of multiprocessing techniques. By increasing the number of processors and operating them in parallel, more work can be done in a shorter period of time.

Initial attempts to increase system speed and efficiency involved the use of a limited number of processors running in parallel. For instance, an example of a two-processor multiprocessing vector machine is disclosed in U.S. Pat. No. 4,636,942, issued Jan. 13, 1987 to Chen et al. Another aspect of the two-processor machine of the Chen '942 patent is disclosed in U.S. Pat. No. 4,661,900, issued Apr. 28, 1987 to Chen et al. A four-processor multiprocessing vector machine is disclosed in U.S. Pat. No. 4,745,545, issued May 17, 1988 to Schiffleger, and in U.S. Pat. No. 4,754,398, issued Jun. 28, 1988 to Pribnow. All of the above named patents are assigned to Cray Research, Inc., the assignee of the present invention.

As the number of processors in a computing system increase, direct connection and close cooperation between all of the processors becomes impossible. As a result the programming paradigm shifts from multiprocessing to concurrent computing. In a concurrent computer a large number of processors work independently on a pieces of a concurrent program. The processors must still communicate in order to coordinate and share data but they can operate independently on that data. In concurrent computers, communication efficiency becomes critical. Communication latency must be low but at the same time packaging density must be optimized to limit the amount of processor-to-processor interconnect; in addition, it is preferable in some applications to ensure deterministic communication latency.

In response to the need to balance interconnect density against communication latency, a variety of network topologies have been developed. Most such network topologies limit the connections between processors to a relatively small number of neighbors. A large class of such topologies can be characterized as either k-ary n-cubes or as networks such as rings, meshes, tori, binary n-cubes and Omega networks which are isomorphic to k-ary n-cubes. Processors in this class of topologies communicate via a message passing protocol in which information intended for a distant processor is packetized and routed through intermediate processors to the destination processor.

Communication latency in a network such as a k-ary n-cube depends heavily on the choice of routing algorithm. Routing algorithms fall into two categories: store-and-forward routing and wormhole routing. In store-and-forward routing, a message sent from one processor to another is captured and stored in each intermediate processor before being sent on to the next processor. This means that each processor must have a fairly large buffering capacity in order to store the number of messages which may be in transit through the processor. Also, since a message must be received in its entirety before it can be forwarded, store-and-forward approaches to routing result in communication latencies which increase dramatically as a function of the number of nodes in a system. On the other hand, such an approach is amenable to the use of deadlock free algorithms which avoid deadlock by preventing or reducing the occurrences of blocking in message transfers.

In wormhole routing a message is divided into a number of smaller message packets call flits. A header flit is received by a processor and examined as to its destination. The header flit is then sent on to the next processor indicated by the routing algorithm. Intermediate flits are forwarded to the same processor soon after they are received. This tends to move a message quickly through the system. Since, however, each intermediate flit is devoid of routing information, a channel to the next processor is considered dedicated to the message until the complete message is transferred. This results in blocking of other messages which might need to use that particular channel. As more messages block, the system can become deadlocked.

A number of approaches have been offered for resolving the problem of deadlock in wormhole routing. In virtual cut-through routing, messages which are blocked are removed from the network and stored in buffers on one of the intermediate processors. Therefore, blocking in virtual cut-through networks can be avoided through the use of many of the deadlock avoidance algorithms available for store-and-forward routing. Virtual cut-through routing avoids deadlock but at the cost of the additional hardware necessary to buffer blocked messages.

Two alternate approaches for avoiding deadlock in wormhole routing communications networks are described in "Adaptive, low latency, deadlock-free packet routing for networks of processors," published by J. Yantchev and C. R. Jesshope in IEEE Proceedings, Vol. 136, Pt. E, No. 3, May 1989. Yantchev et al. describe a method of avoiding deadlock in wormhole routing in which the header flit, when blocked, coils back to the source node. The source node then waits for a non-deterministic delay before trying to send the message again. Yantchev et al. indicate that such an approach is likely to prove very expensive in terms of communications costs and that these costs will likely increase out of proportion as network diameter increases.

Yantchev et al. also propose an improved wormhole routing algorithm which operates to remove cycles in a network channel dependency graph by constraining routing within the network to message transfers within a series of virtual networks lain over the existing communications network. Under the Yantchev method, the physical interconnection grid is partitioned into classes according to the directions needed for message packet routing. In a two-dimensional array of processors, these classes would correspond to (+X, +Y), (-X, +Y), (+X, -Y) and (-X, -Y). Each class defines a particular virtual network; the combination of two of the virtual networks (such as (+X, +Y) and (-X, -Y)), along with a suitable deadlock free multiplexing scheme, results in a fully connected network which is deadlock-free. Yantchev et al. teach that the two-dimensional scheme can be extended to an n-dimensional network in which one virtual network is used for increasing coordinates while a second is used for decreasing coordinates. The method of virtual networks can also be extended to include adaptive routing.

The method taught by Yantchev et al. can be used to good effect in avoiding deadlock in mesh networks. The Yantchev approach is not, however, as practical for networks having wrap-around channels, such as tori. Wrap-around channels increase the number of cycles in a network. To eliminate these cycles Yantchev et al. teach that a toroidal network can be decomposed into a fully unwrapped torus equivalent consisting of two or more subarrays. Message passing is then limited to transfers within a subarray.

Such an approach, while breaking the cycles, does so at a relatively high cost. Under Yantchev, a large number of virtual channels must be allocated for each node (eight for an unwrapped two-dimensional toroid) in order to break all possible cycles. As the number of dimensions increase, the number of virtual channels needed for deadlock free routing also increases.

Dimension order, or e-cube routing is yet another wormhole approach to deadlock-free routing. In dimension order routing, an ordering of dimensions is selected and all traffic completes its routing in that order. That is, all routing is completed in one dimension before any routing is allowed in another dimension. This rigid routing scheme provides deadlock free transfers by restricting the types of turns possible in a message transfer (i.e. eliminating cycles in the acyclic mesh). Dimension order routing is described in "Deadlock-free Message Routing in Multiprocessor Interconnection Networks" published by William J. Dally and Charles L. Seitz in IEEE Transactions on Computers, Vol. C-36, No. 5, May 1987.

Dimension order routing provides a deterministic routing protocol but, since it only provides a single path between a source and a destination node, in mesh networks this method is not fault tolerant. In toroidal networks, the situation is not much better. A toroid has 2.sup.n possible paths but all paths turn on the same n-1 nodes. Because of this, a failure in any node can cut off communication between one or more node pairs.

Each of the communications networks described above suffers limitations in its applicability to network topologies having hundreds or thousands of nodes. There is a need in the art for a communications network which resolves the above-mentioned problems in an efficient and hardware limited fashion while achieving low communications latency.

SUMMARY OF THE INVENTION

A multidimensional interconnection and routing apparatus for a parallel processing computer connects together processing elements in a three-dimensional structure. The interconnection and routing apparatus includes a plurality of processing element nodes. A communication connects at least one of the processing elements with a host system. An interconnection network connects together the processing element nodes in an X, Y, and Z dimension. The network includes communication paths connecting each of the plurality of processing elements to adjacent processing elements in the plus and minus directions of each of the X, Y, and Z dimensions.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of an MPP system components.

FIG. 2 is a diagram of processing element components for an MPP system.

FIG. 3 is a diagram of a processing element node for an MPP system.

FIG. 4 is a diagram of interconnected network components for an MPP system.

FIG. 5 is a diagram of communication links for an MPP system.

FIG. 6 is a diagram of communication link signals for an MPP system.

FIG. 7 is a diagram of a one dimensional torus network.

FIG. 8 is a diagram of a two dimensional torus network.

FIG. 9 is a diagram of a three dimensional torus network.

FIG. 10 is a diagram of interleaving processing nodes within an MPP system.

FIG. 11 is a diagram of +X, +Y, and +Z dimension information travel within an MPP system.

FIG. 12 is a diagram of -X, -Y, and -Z dimension information travel within an MPP system.

FIG. 13 is a diagram of information travel within an MPP system for avoiding a bad communication link in the Y dimension.

FIG. 14 is a diagram of a dateline communication link within an MPP system.

FIG. 15 is a diagram of generic packet formats for information within an MPP system.

FIG. 16 is a diagram of a processing element network router for an MPP system.

FIG. 17 is a diagram of an X dimension switch for an MPP system.

FIG. 18 is a diagram of an input node network router for an MPP system.

FIG. 19 is a diagram of an I/O gateway for an MPP system.

FIG. 20 is a diagram of data paths through each dimension switch logic.

FIG. 21 is a diagram of logic for a random number generator.

FIG. 22 is a diagram of buffers in an MPP system.

FIG. 23 is a diagram of dateline deadlock avoidance.

FIG. 24 is a diagram of naive and optimized virtual channel allocations.

FIG. 25 is a diagram showing an example of standard and origin allocation.

FIG. 26 is a diagram showing an example of linear-lengthwise and partition allocation.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In the following detailed description of the preferred embodiment, reference is made to the accompanying drawings which form a part hereof and in which is shown by way of illustration a specific embodiment in which the invention may be practiced. This embodiment is described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural or logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.

INTRODUCTION

The following describes the architecture and functions a first-phase massively parallel processing (MPP) system. The MPP system typically contains hundreds or thousands of microprocessors, each accompanied by a local memory. The system is designed to support two styles of MPP programming: data parallel and message passing.

Data parallel programs, such as High Performance Fortran (HPF), are designed to provide a programmer with ease of use while still providing a good percentage of MPP performance. Message passing programs, such as parallel virtual machine (PVM) messaging, provide a higher percentage of peak MPP performance.

Cray Research, Inc., the assignee of the present application, supports both styles of programming so that customers may take a program from another vendor's MPP system and port it to a Cray Research, Inc. MPP system with a minimum amount of effort. For more information on Cray Research's MPP Programming Models, refer to the CRAY T3D Software Overview Technical Note publication number SN-2505, which is incorporated herein by reference.

The MPP system connects to a host computer system. The host system runs the software compilers for the MPP system. All programs written for the MPP system are compiled on the host system, but run in the MPP system. The host system may be, for example, any Cray Research, Inc. computer system that has an input/output subsystem model E (IOS-E). Examples of host systems include the CRAY Y-MP E series computer systems, the CRAY Y-MP M90 series computer systems, and the CRAY Y-MP C90 series computer systems. The host system may reside in the same cabinet as the MPP system. This configuration is called a single-cabinet configuration. The host system may also reside in a separate cabinet that is cabled to the MPP system cabinet. This configuration is called a multiple-cabinet configuration.

The MPP system preferably contains four types of components: processing element nodes, the interconnect network, I/O gateways, and a clock. FIG. 1 shows a simplified model of the components of the MPP system. The following sections describe preferred components of an MPP system.

PROCESSING ELEMENT NODES

An MPP computer system typically contains hundreds or thousands of microprocessors, each accompanied by a local memory. Each microprocessor and local memory component is called a processing element. In the MPP system, each processing element contains a microprocessor 10, local memory 12, and support circuitry 14 (refer to FIG. 2).

The microprocessor is preferably an Alpha reduced instruction set computer (RISC) 64-bit microprocessor developed by Digital Equipment Corporation. The microprocessor performs arithmetic and logical operations on 64-bit integer and 64-bit floating-point registers. The microprocessor also preferably contains an internal instruction cache memory and data cache memory that each store 256 lines of data or instructions. Each line in the instruction and data cache memory is four 64-bit words wide.

Local memory preferably comprises a dynamic random access memory (DRAM) that stores system data. A low-latency, high-bandwidth data path connects the microprocessor to local memory in a processing element.

The MPP system memory is physically distributed because each processing element contains local memory; however, the system memory is also logically shared. System memory is logically shared because the microprocessor in one processing element can access the memory of another processing element without involving the microprocessor in that processing element.

The support circuitry extends the control and addressing functions of the microprocessor. This includes performing data transfers to or from local memory.

The MPP system may contain 128, 256, 512, 1,024, or 2,048 processing elements depending on the system configuration (excluding the processing elements in the I/O gateways). The processing elements reside in processing element nodes.

Each processing element node typically contains two processing elements 16 and 18, a network interface 20, and a block transfer engine 22 (refer to FIG. 3). The following paragraphs briefly describe each of these components. Processing elements (PEs) 16 and 18 in a processing element node are preferably identical but function independently. Access to block transfer engine 20 and network interface 22 is shared by the two PEs.

Network interface 20 formats information before it is sent over the interconnect network to another processing element node or I/O gateway. Network interface 20 also receives incoming information from another processing element node or I/O gateway and steers the information to PE 0 or PE 1 in the processing element node.

Block transfer engine (BLT) 22 is an asynchronous direct memory access controller that redistributes system data. BLT 22 redistributes system data between the local memory in PE 0 or PE 1 and the memory in remote PEs. BLT 22 can redistribute up to 65,536 64-bit words of data (or 65,536 4-word lines of data) without interrupting the PE.

INTERCONNECT NETWORK

The interconnect network provides communication paths among the processing element nodes and the I/O gateways in the MPP system. The interconnect network forms a three dimensional matrix of paths which connect the nodes in the X, Y, and Z dimensions (see FIG. 1).

The interconnect network is comprised of communication links 26a-26f and network routers 24. FIG. 4 shows how the components of the interconnect network connect to a processing element node.

The following describes the components of the interconnect network and describes characteristics of the interconnect network.

Communication Links

Communication links transfer data and control information between the network routers in the interconnect network. Each communication link connects two nodes in one dimension (see FIG. 5); for example, communication link 28 connects nodes 30 and 32 in the X dimension.

A communication link typically comprises two unidirectional channels. Each channel in the link preferably contains Data, Physical Unit (Phit) Type, Virtual Channel Select, and Virtual Channel Acknowledge signals. FIG. 6 shows the signals for both unidirectional channels in one communication link.

Data Signals

Each channel typically contains 16 Data signals. Data signals preferably carry two types of information: requests or responses. Requests contain information that request a node to perform an activity. For example, a source node may send a request to a destination node to read data from memory in the destination node. This request is sent over one channel in the communication link.

Responses contain information that is the result of an activity. For example, after receiving a request for read data, a destination node sends the response back to the source node. The response contains the read data.

Requests and responses preferably must be logically separated. This is preferably done by providing separate buffers for requests and responses. These buffers are used to create virtual channels.

Phit Type Bits

A phit is the amount of information that can be placed on a data channel in one clock period. In the MPP system described in the present specification, a phit is 16 bits in size.

Each channel preferably contains two phit type bits that are controlled by the node sending information over the channel. These bits indicate what type of phit is on the Data signals. Table 1 lists the definitions of the least significant bit (LSB) and most significant bit (MSB) of the phit type bits. (More information on packets is provided at the end of this subsection).

TABLE 1 ______________________________________ Phit Type Bit Definitions MSB LSB Data Signals Contain ______________________________________ 0 0 No information 0 1 Packet routing tag phit 1 0 Packet phits 1 1 Last phit of packet ______________________________________

Virtual Channel Signals

The virtual channel signals are used to control which virtual channel the data will use. A virtual channel is created when request and response information transfers over the same physical communication link but is stored in separate buffers. The virtual channel signals include the virtual channel select bits and the virtual channel acknowledge bits.

There are two virtual channel select bits. These bits indicate which virtual channel buffer in the receiving node the information will be stored in. Table 2 shows the definitions of the virtual channel select bits.

TABLE 2 ______________________________________ Virtual Channel Select Bit Definitions MSB LSB Definition Name ______________________________________ 0 0 Request buffer 0 Virtual channel 0 0 1 Request buffer 1 Virtual channel 1 1 0 Response buffer 0 Virtual channel 2 1 1 Response buffer 1 Virtual channel 3 ______________________________________

The most significant bit of the virtual channel select bits indicates if the information on the Data signals is a request or a response. When set to 0, this bit indicates the information is a request. When set to 1, this bit indicates the information is a response.

The least significant bit of the virtual channel select bits indicates which of the two request or two response buffers the information on the Data signals will be stored in. When set to 0, this bit indicates the information will be stored in buffer 0. When set to 1, this bit indicates the information will be stored in buffer 1.

There are four virtual channel acknowledge bits. Each virtual channel buffer controls one of the virtual channel acknowledge bits. For example, virtual channel buffer 2 controls bit 2.sup.2 of the virtual channel acknowledge bit. The node receiving information sets the appropriate virtual channel acknowledge bit to 1 while the node empties the virtual channel buffer and sends the information to another node or a PE. The node resets the virtual channel acknowledge bit to 0 after the virtual channel is empty and the data has been sent to another node or a PE.

Torus Interconnect Topology

The interconnect network is connected in a bidirectional torus. A torus contains communication links that connect the smallest numbered node in a dimension directly to the largest numbered node in the same dimension. This type of connection forms a ring where information can transfer from one node, through all of the nodes in the same dimension, and back to the original node.

FIG. 7 shows a one dimensional torus network in the X dimension. Information can transfer from node 00, through all of the nodes, and back to node 00 in a circular fashion. Each node has a communication link in both the plus and minus direction of the X dimension.

Torus networks offer several advantages for network communication. One advantage is speed of information transfers. For example, in FIG. 7, node 07 can communicate directly with node 00 instead of sending information through all of the nodes in the X dimension. Another advantage of the torus network is the ability to avoid bad communication links. For example, in FIG. 7, if node 00 cannot transfer information directly to node 01 due to a bad communication link, node 00 can still communicate with node 01 by sending the information the long way around the network through the other nodes in the X dimension.

FIG. 8 shows a two dimensional torus network in the Y and X dimensions. Each node has communication links in both the plus and minus directions of the Y and X dimensions. FIG. 9 shows a three dimensional torus network in the Z, Y, and X dimensions. Each node has communication links in both the plus and minus directions of the Z, Y, and X dimensions.

Several of the diagrams in this specification show three dimensional network connections. For clarity, the communication link that completes the torus in each dimension is not shown. It is important to remember that, although not shown in the diagrams, this communication link is present.

Interleaving

The nodes in the interconnect network are preferably interleaved. Interleaving is the physical placement of nodes so that the maximum wiring distance between nodes is minimized.

FIG. 10 shows two one-dimensional torus networks. The eight nodes in upper network 34 are not interleaved. The eight nodes in lower network 36 are interleaved. In the interleaved network (also called a folded torus network), the physical length of the longest communication link is shorter than the physical length of the longest communication link in the non-interleaved network. The X and Z dimensions of the network are interleaved. This minimizes the length of the physical communication links (wires) in the MPP system.

Several of the diagrams in this specification contain drawings of three dimensional interconnect networks. For clarity, the communication links are shown logically and do not show the interleaving. It is important to remember that although not shown, the nodes in the network are physically interleaved in the preferred embodiment.

Dimension Order Routing

When a node sends information to another node, the information may travel through several communication links in the network. Each transfer of information over a communication link is referred to as a hop. After information leaves a node, it typically travels through the network in the X dimension first, then through the Y dimension, and finally through the Z dimension. When finished moving through the communication links in the Z dimension, the information arrives at the destination node. This method of information travel is called dimension order routing.

For example, if node A shown in FIG. 11 sends request information to node B, the information first travels one hop in the +X direction. Since the information does not need to travel any farther in the X dimension, it switches direction to the Y dimension. After completing one hop in the +Y direction, the information switches direction to the Z dimension and completes one hop in the +Z direction. After completing one hop in the +Z direction, the request information arrives at node B.

Information does not always travel in the positive direction of a dimension. For example, of node B in FIG. 12 sends response information to node A, the information completes on hop in the -X direction and then changes direction into the Y dimension. The information completes one hop in the -Y direction before changing direction into the Z dimension. After completing one hop in the -Z direction, the response information arrives at node A.

Because information can travel in either the positive or negative direction of a dimension, bad communication links can be avoided. For example, if node A in FIG. 13 sends information to node B, the information completes one hop in the +X direction and then switches direction into the Y dimension. Consider, for example, that due to a bad communication link, the information cannot complete a hop in the +Y direction. Instead, the information may be routed so it completes two hops in the -Y direction and travels the long way around the torus in the Y dimension. After switching directions into the Z dimension, the information completes one hop in the +Z direction and arrives at node B.

An example of a system for information routing is described in patent application Ser. No. 07/983,979 filed Nov. 30, 1992 and entitled "DIRECTION ORDER ROUTING IN MULTIPROCESSING SYSTEMS," which is incorporated herein by reference.

Virtual Channels

A virtual channel is created when request and response information travels over the same physical communication link, but is stored in different buffers. The MPP system contains four virtual channel buffers (see Table 3).

TABLE 3 ______________________________________ Virtual Channel Buffers Buffer Name Definition ______________________________________ Virtual channel 0 Request buffer 0 Virtual channel 1 Request buffer 1 Virtual channel 2 Response buffer 0 Virtual channel 3 Response buffer 1 ______________________________________

The virtual channel buffers prevent two types of communication deadlock conditions that may occur in the interconnect network. The following describes these conditions.

Without the virtual channel buffers, a communication deadlock condition may occur if two nodes simultaneously transfer request or response information to each other. To prevent this condition from occurring, the MPP system contains two types of buffers: request buffers and response buffers. These buffers provide separate destination buffers for request and response information.

Also without the virtual channel (VC) buffers, a communication deadlock condition may occur if all of the nodes in one dimension send request or response information to the next node in the dimension at the same time. For example, a deadlock condition may occur if all of the nodes in the X dimension send request information to the next node in the +X direction at the same time. To prevent this condition from occurring, the MPP system preferably contains two request buffers and t