WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Parallel computer system with physically separate tree networks for data and control messages    
United States Patent5590283   
Link to this pagehttp://www.wikipatents.com/5590283.html
Inventor(s)Hillis; W. Daniel (Cambridge, MA); Douglas; David C. (Concord, MA); Leiserson; Charles E. (Winchester, MA); Kuszmaul; Bradley C. (Waltham, MA); Ganmukhi; Mahesh N. (Wexford, PA); Hill; Jeffrey V. (San Jose, CA); Wong-Chan; Monica C. (Cambridge, MA)
AbstractA digital computer comprises a plurality of processing elements, a communications router, and a control network. Each processing element performs data processing operations in connection with commands, at least some of the processing elements performing the data processing operations in connection with the commands in messages they receive over the control network. Each processing element also generates and receives data transfer messages, each including an address portion containing an address, for transfer to another processing element as identified by the address. At least one of the processing elements further generates the control network messages for transfer over the communications router. The communications router comprises router nodes interconnected in the form of a "fat-tree," and the control network comprises control network nodes interconnected in the form of a tree, with the processing elements being connected at the leaf nodes of the respective communications router and control network.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5590283
Parallel computer system with physically separate tree networks for data

     and control messages - US Patent 5590283 Drawing
Parallel computer system with physically separate tree networks for data and control messages
Inventor     Hillis; W. Daniel (Cambridge, MA); Douglas; David C. (Concord, MA); Leiserson; Charles E. (Winchester, MA); Kuszmaul; Bradley C. (Waltham, MA); Ganmukhi; Mahesh N. (Wexford, PA); Hill; Jeffrey V. (San Jose, CA); Wong-Chan; Monica C. (Cambridge, MA)
Owner/Assignee     Thinking Machines Corporation (Bedford, MA)
Patent assignment
All assignments
Publication Date     December 31, 1996
Application Number     08/380,854
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     January 27, 1995
US Classification     712/29 709/243 709/252
Int'l Classification     G06F 015/163 G06F 015/173
Examiner     Lee; Thomas C.
Assistant Examiner     Dinh; D.
Attorney/Law Firm    
Address
Parent Case     This application is a division of application Ser. No. 08/183,219, filed Jan. 14, 1994, now U.S. Pat. No. 5,388,214 which is a continuation of application Ser. No. 07/592,029, filed Oct. 3, 1990, now abandoned.
Priority Data    
USPTO Field of Search     395/200.02 395/800
Patent Tags     parallel computer physically separate tree networks data control messages
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5191578
Lee
370/418
Mar,1993

[0 after 0 votes]
4543630
Neches
709/252
Sep,1985

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed as new and desired to be secured by Letters Patent of the United States is:

1. A digital computer comprising

A. a plurality of processing elements each performing data processing operations in connection with commands, at least some of the processing elements performing said data processing operations in connection with said commands which they receive in control network messages, each processing element also (i) generating and receiving data transfer messages, each including an address portion containing an address, for transfer to another processing element as identified by the address and (ii) at least one of said processing elements further generating said control network messages and (iii) said at least some processing elements receiving said control network messages;

B. a communications router comprising a plurality of router node groups interconnected in a tree pattern in a series of levels from a lower leaf level to an upper root level, and each node group in levels above the leaf level including a plurality of router nodes, with router nodes in levels below the root level being connected to a plurality of router nodes in the next higher level thereby forming a fat-tree structure, each node receiving data transfer messages and coupling them to another node or to a processing element connected thereto as determined by the address in the respective address portion; and

C. a control network comprising a like plurality of control network node groups interconnected in a like tree pattern in a series of levels from a lower leaf level to an upper physical root level, each control network node group below the upper root level receiving control network messages from a processing element or a lower-level control network node group and generating a control network message in response thereto for transmission to a higher-level control network node group, and receiving control network messages from a higher level control network node group and generating control messages in response thereto for transmission to lower-level control network node groups, the control network node group at the root level generating control network messages for transmission to the lower level control network node groups in response to control network messages received therefrom.

2. A digital computer as defined in claim 1 in which each control network node group in the control network is associated with a router node group in the communications router, each control network node group providing a control signal for controlling a selected operation of the associated router node group.

3. A digital computer as defined in claim 2 in which each respective control signal provided by a said control network node group is provided in response to control information provided by at least one of said processing elements.

4. A digital computer as defined in claim 3 in which said control information is provided in a said control network message.

5. A digital computer as defined in claim 4 in which said control network node groups selectively operate in a first mode and a second mode, the control signal enabling said router node groups to operate in one of said first mode or said second mode, the router node groups when operating in said second mode transferring said data transfer messages to respective router node groups in said next lower levels thereby enabling said communications router to quickly transfer data transfer messages to said processing elements.
 Description Submit all comments and votes
 


INCORPORATION BY REFERENCE

Guy E. Blelloch, Scan Primitives and Parallel Vector Models, (Ph.D. Dissertation, Massachusetts Institute of Technology: 1988), incorporated herein by reference.

U.S. patent appn. Ser. No. 07/489,079, filed Mar. 5, 1990, now U.S. Pat. No. 5,118,975 in the name of W. Daniel Hillis, et al., entitled Digital Clock Buffer Circuit Providing Controllable Delay, and assigned to the assignee of the present application, incorporated herein by reference.

FIELD OF THE INVENTION

The invention relates generally to the field of digital computer systems, and more particularly to massively parallel computing systems. The invention particularly provides arrangements for controlling processors in a computing system having a large number of processors, for facilitating transfer of data among the processors and for facilitating diagnosis of faulty components in the computing system.

BACKGROUND OF THE INVENTION

A digital computer system generally comprises three basic elements, namely, a memory element, an input/output element and a processor element. The memory element stores information in addressable storage locations. This information includes data and instructions for processing the data. The processor element fetches information from the memory element, interprets the information as either an instruction or data, processes the data in accordance with the instructions, and returns the processed data to the memory element. The input/output element, under control of the processor element, also communicates with the memory element to transfer information, including instructions and the data to be processed, to the memory, and to obtain processed data from the memory.

Most modern computing systems are considered "von Neumann" machines, since they are generally constructed according to a paradigm attributed to John von Neumann. Von Neumann machines are characterized by having a processing element, a global memory which stores all information in the system, and a program counter that identifies the location in the global memory of the instruction being executed. The processing element executes one instruction at a time, that is, the instruction identified by the program counter. When the instruction is executed, the program counter is advanced to identify the location of the next instruction to be processed. (In many modern systems, the program counter is actually advanced before the processor has finished processing the current instruction.)

Von Neumann systems are conceptually uncomplicated to design and program, since they do only one operation at a time. A number of advancements have been made to the original von Neumann paradigm to permit the various parts of the system, most notably the various components of the processor, to operate relatively independently and achieve a significant increase in processing speed. One such advancement is pipelining of the various steps in executing an instruction, including instruction fetch, operation code decode (a typical instruction includes an operation code which identifies the operation to be performed, and in most cases one or more operand specifiers, which identify the location in memory of the operands, or data, to be used in executing the instruction), operand fetch, execution (that is, performing the operation set forth in the operation code on the fetched operands), and storing of processed data, which steps are performed relatively independently by separate hardware in the processor. In a pipelined processor, the processor's instruction fetch hardware may be fetching one instruction while other hardware is decoding the operation code of another instruction, fetching the operands of still another instruction, executing yet another instruction, and storing the processed data of a fifth instruction. Since the five steps are performed sequentially, pipelining does not speed up processing of an individual instruction. However, since the processor begins processing of additional instructions before it has finished processing a current instruction, it can speed up processing of a series of instructions.

A pipelined processor is obviously much more complicated than a simple processor in a von Neumann system, as it requires not only the various circuits to perform each of the operations (in a simple von Neumann processor, many circuits could be used to perform several operations), but also control circuits to coordinate the activities of the various operational circuits. However, the speed-up of the system can be dramatic.

More recently, some processors have been provided with execution hardware which includes multiple functional units each being optimized to perform a certain type of mathematical operation. For example, some processors have separate functional units for performing integer arithmetic and floating point arithmetic, since they are processed very differently. Some processors have separate hardware functional units each of which performs one or only several types of mathematical operations, including addition, multiplication, and division operations, and other operations such as branch control and logical operations, all of which can be operating concurrently. This can be helpful in speeding up certain computations, most particularly those in which several functional units may be used concurrently for performing parts of a single computation.

In a von Neumann processor, including those which incorporate pipelining or multiple functional units (or both, since both may be incorporated into a single processor), a single instruction stream operates on a single data stream. That is, each instruction operates on data to enable one calculation at a time. Such processors have been termed "SISD," for single-instruction/single-data." If a program requires a segment of a program to be used to operate on a number of diverse elements of data to produce a number of calculations, the program causes the processor to loop through that segment for each calculation. In some cases, in which the program segment is short or there are only a few data elements, the time required to perform such a calculation may not be unduly long.

However, for many types of such programs, SISD processors would require a very long time to perform all of the calculations required. Accordingly, processors have been developed which incorporate a large number of processing elements all of which may operate concurrently on the same instruction stream, but with each processing element processing a separate data stream. These processors have been termed "SIMD" processors, for "single-instruction/multiple-data."

SIMD processors are useful in a number of applications, such as image processing, signal processing, artificial intelligence, database operations, and computer simulation of a number of things, such as electronic circuits and fluid dynamics. In image processing, each processing element may be used to perform processing on a pixel ("picture element") of the image to enhance the overall image. In signal processing, the processors concurrently perform a number of the calculations required to perform such computations as the "Fast Fourier transform" of the data defining the signal. In artificial intelligence, the processors perform searches on extensive rule bases representing the stored knowledge of the particular application. Similarly, in database operations, the processors perform searches on the data in the database, and may also perform sorting and other operations. In computer simulation of, for example, electronic circuits, each processor may represent one part of the circuit, and the processor's iterative computations indicate the response of the part to signals from other parts of the circuit. Similarly, in simulating fluid dynamics, which can be useful in a number of applications such as weather predication and airplane design, each processor is associated with one point in space, and the calculations provide information about various factors such as fluid flow, temperature, pressure and so forth.

Typical SIMD systems include a SIMD array, which includes the array of processing elements and a router network, a control processor and an input/output component. The input/output component, under control of the control processor, enables data to be transferred into the array for processing and receives processed data from the array for storage, display, and so forth. The control processor also controls the SIMD array, iteratively broadcasting instructions to the processing elements for execution in parallel. The router network enables the processing elements to communicate the results of a calculation to other processing elements for use in future calculations.

Several routing networks have been used in SIMD arrays and others have been proposed. In one routing network, the processing elements are interconnected in a matrix, or mesh, arrangement. In such an arrangement, each processing element is connected to, and communicates with, four "nearest neighbors" to form rows and columns defining the mesh. This arrangement can be somewhat slow if processing elements need to communicate among themselves at random. However, the arrangement is inexpensive and conceptually simple, and may suffice for some types of processing, most notably image processing. The "Massively Parallel Processor" manufactured by Goodyear Aerospace Corporation is an example of a SIMD array having such a routing network.

In another routing network, processing elements are interconnected in a cube or hypercube arrangement, having a selected number of dimensions, for transferring data, in the form of messages, among the processing elements. The arrangement is a "cube" if it only has three dimensions, and a "hypercube" if it has more than three dimensions. U.S. Pat. No. 4,598,400, entitled Method and Apparatus For Routing Message Packets, issued Jul. 1, 1986 to W. Daniel Hillis, and assigned to the assignee of the present application, describes a system having a hypercube routing network. In the system described in the '400 patent, multiple processing elements are connected to a single routing node, and the routing nodes are interconnected in the hypercube.

Another routing arrangement which has been proposed is a crossbar switch, through which each processing element can communicate directly with any of the other processing elements. The crossbar switch provides the most efficient communications of any of the routing networks proposed. However, a crossbar switch also has the most connections and switching elements, and thus is the most expensive and also the most susceptible to failure due to broken connections and faulty switching elements. Thus, crossbar switch arrangements are rarely used, except when the number of processing elements is fairly small, since the complexity of a crossbar switch increases with the square of the number of processing elements.

Yet another routing arrangement is an omega network, in which switching is performed through a number of serially-connected stages. Each stage has two inputs, each connected to the outputs of a prior stage or processing elements, has two outputs which may be connected to the inputs of a subsequent stage or processing elements. The "Butterfly" computer system manufactured by Bolt Beranek & Newman uses such a network.

SUMMARY OF THE INVENTION

The invention provides a new and improved parallel computer system.

In brief summary, the new computer includes a plurality of processing elements, a command processor, a diagnostic processor and a communications network. The processing elements each performs data processing and data communications operations in connection with commands. The processing elements also performing diagnostic operations in response to diagnostic operation requests and providing diagnostic results in response thereto. The command processor generates commands for the processing elements, and also performs diagnostic operations in response to diagnostic operation requests and providing diagnostic results in response thereto. The diagnostic processor generates diagnostic requests. The communication network includes three elements, including a data router, a control network and a diagnostic network. The data router is connected to the processing elements for facilitating the transfer of data among them during a data communications operation. The control network is connected to the processing elements and the command processor for transferring commands from the command processor to the processing elements. The diagnostic network connected to the processing elements, the command processor and the diagnostic processor for transferring diagnostic requests from the diagnostic processor to the processing elements and the command processor and for transferring diagnostic results from the processing elements and the command processor to the diagnostic processor.

BRIEF DESCRIPTION OF THE DRAWINGS

This invention is pointed out with particularity in the appended claims. The above and further advantages of this invention may be better understood by referring to the following description taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a general block diagram of a massively parallel computer system constructed in accordance with the invention;

FIGS. 2A and 2B are block diagrams useful in understanding the structure and operation of the data router of the computer system of FIG. 1;

FIG. 3 is a diagram depicting the structure of message packets transferred over the data router;

FIGS. 4A and 4B are block diagrams useful in understanding the structure and operation of the control network of the computer system of FIG. 1;

FIG. 5 is a diagram depicting the structure of message packets transferred over the control network;

FIGS. 6A through 6C are block diagrams useful in understanding the structure and operation of the diagnostic network of the computer system of FIG. 1;

FIG. 7 is a diagram depicting the structure of message packets transferred over the diagnostic network;

FIG. 8 is a general block diagram of a processing element in the computer system depicted in FIG. 1;

FIG. 9A-1 comprises a general block diagram of a data router interface circuit useful in interfacing the processing element depicted in FIG. 8 to the data router of the computer system depicted in FIG. 1, FIGS. 9A-2A and 9A-2B contain definitions of registers in the data router interface and FIGS. 9B-1 through 9D-7 comprise logic diagrams of the data router interface;

FIG. 10A comprises a general block diagram of a control network interface circuit useful in interfacing the processing element depicted in FIG. 8 to the control network of the computer system depicted in FIG. 1, FIGS. 10A-1 contains a definitions of a register in the control network interface and FIGS. 10B through 10G comprise logic diagrams of the control network interface;

FIG. 11A is a general block diagram of a data router node used in the data router described in connection with FIGS. 2A and 2B, and FIGS. 11B-1 through 11D comprise detailed block and logic diagrams of the data router node;

FIG. 12A is a general block diagram of a control network node used in the control network described in connection with FIGS. 4A and 4B, and FIGS. 12B-1 through 12D-1 comprise detailed block and logic diagrams of the control router mode; and

FIG. 13A is a general block diagram of a diagnostic network node used in the diagnostic network described in connection with FIG. 6, and FIGS. 13B-1 through 13C comprise detailed block and logic diagrams of the diagnostic network node.

DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT

I. General Description

A. General Description of Computer System

FIG. 1 is a general block diagram of a massively parallel computer system 10 constructed in accordance with the invention. With reference to FIG. 1, system 10 includes a plurality of processing elements 11(0) through 11(N) (generally identified by reference numeral 11), scalar processors 12(0) through 12(M) (generally identified by reference numeral 12) and input/output processors 13(0) through 13(K) (generally identified by reference numeral 13). Input/output units (not shown), such as, for example, disk and tape storage units, video display devices, printers and so forth may be connected to the input/output processors to supply information, including data and program commands, for processing by the processing elements 11 and scalar processors 12 in the system, and may also receive processed data for storage, display and printing. The scalar processors 12 may also be connected to input/output units including, for example, video display terminals which permit one or more operators to generally control system 10.

The system 10 further includes a control network 14, a data router 15 and a diagnostic network 16. The control network 14 permits one or more scalar processors 12 to broadcast program commands to the processing elements 11. The processing elements 11 execute the commands generally concurrently. The control network 14 also permit the processing elements 11 to transfer status information to the scalar processors 12. The control network 14 is also used by the processing elements 11 to perform selected types of arithmetic operations, termed "scan" and "reduce" operations, as described below. The control network 14 may also be used to provide synchronization among the processing elements 11.

The data router 15 transfers data among the processing elements 11, scalar processors 12 and input/output processors 13. In particular, under control of the scalar processors 12, the input/output processors 13 retrieve data to be processed from the input/output units and distributes it to the respective scalar processors 12 and processing elements 11. During processing, the scalar processors 12 and processing elements 11 can transfer data among themselves over the data router 15. In addition, the processing elements 11 and scalar processors 12 can transfer processed data to the input/output processors 13. Under control of the scalar processors 12, the input/output processors 13 can direct the processed data that they receive from the data router 15 to particular ones of the input/output units for storage, display, printing, or the like.

The diagnostic network 16, under control of a diagnostic processor (not shown), facilitates testing of other portions of the system 10 to identify, locate and diagnose defects. The diagnostic processor may comprise one or more of the scalar processors 12. In addition, the diagnostic network 16 may be used to establish selected operating conditions in the other portions of the system 10 as described below.

The system 10 is synchronous, that is, all of its elements operate in accordance with a global SYS CLK system clock signal provided by a clock circuit 17.

One particular embodiment of system 10 may include hundreds or many thousands of processing elements 11 operating on a single problem in parallel under control of commands broadcast to them by the scalar processors 12. In that embodiment, the processing elements 11 operate in parallel on the same command on their individual sets of data, thereby forming a parallel computer system. In addition, the system 10 may be dynamically logically partitioned, as described below, into multiple subsystems which may concurrently operate on separate problems or separate parts of a single problem. In that case, each partition includes at least one scalar processor 12 and a plurality of processing elements 11.

B. General Description of Communications Networks

1. Data Router 15

Before proceeding to a detailed description of the system 10 and its various components, it would be helpful to generally describe the structures of the control network 14 and data router 15. The data router 15 and control network 14 both transfer information in the form of message packets, which will be described in detail below in connection with FIGS. 3 and 5, respectively. FIGS. 2A and 2B depict the general structure of the data router 15 and FIGS. 4A and 4B depict the general structure of the control network 14.

With reference to FIG. 2A, the data router 15 is generally tree-structured, having a plurality of data router node groups 20(i,j) ("i" and "j" are integers) organized in a plurality of levels each identified by the index "i" in reference numeral 20(i,j). A data router node group 20(i,j) at each level "i" is connected to a selected number of data router node groups 20(i-1,j) in the next lower level "i-1" to form a tree. As will be described in detail below, the data router node groups 20(i,j) perform message switching operations to transfer data, in the form of data router message packets, among the processing elements 11, scalar processors 12 and input/output processors 13, which are collectively identified as leaves 21(0) through 21(N) (generally identified by reference numeral 21). Each data router node group 20(1,j) in the lowest level is connected to one or more leaves 21. In the reference numeral 20(i,j), the index (j) uniquely identifies each of the data router node groups 20(i,j) at each level "i."

In the data router 15 represented in FIG. 2A, the data router node group 20(M,0) at the highest level "M" is termed the "physical root" of the tree. At each level "i", each data router node group 20(i,j) is termed the "parent" of data router node groups 20(i-1,j) connected thereto, and each data router node group 20(i-1,j) is termed a "child" of the data router node group 20(i,j) to which it is connected. It will be appreciated that the data router node group 20(i,j) will also be a child of the data router node group 20(i+1,j) connected thereto. In one particular embodiment, each data router node group 20(i,j) in a particular level "i" is connected to four child data router node groups 20(i-1,j); in that embodiment, the "fan-out" of the tree, that is, the number of children connected to each parent, is four. It will be appreciated from the following that the fan-out need not be constant, but may vary from level to level and also among data router node groups 20(i,j) within the same level.

The structure of the data router 15 is further termed a "fat-tree", and will be particularly described in connection with FIG. 2B. With reference to FIG. 2B, at least some of the data router node groups 20(i,j) includes at least one, and typically two or more data router nodes 22(i,j,k), wherein "k" is an integer that uniquely identifies each data router node within a data router node group 20(i,j). Each data router node 22(i,j,k) in a data router node group 20(i,j) is connected to a plurality of data router nodes 22(i+1,j,k) in level "i+1," with the connections being established so that the data router nodes 22(i,j,k) in each data router node group 20(i,j) are connected to different ones of the data router nodes 22(i+1,j,k) in the data router node group 20(i,j) in level "i+1." For example, in data router node group 20(1,0), data router node 22(1,0,0) is connected to data router nodes 22(2,0,0) and 22(2,0,1) of data router node group 20(2,0), and data router node 22(1,0,1) is connected to data router nodes 22(2,0,2) and 22(2,0,3) of data router node group 20(2,0).

In addition, each data router node 22(i,j,k) in a parent data router node group 20(i,j) is connected to one data router node 22(i-1,j,k) in that parent's child data router node groups 20(i-1,j). Accordingly, as shown in FIG. 2B, data router node (2,0,0) in data router node group 20(2,1) is connected to one data router node 22(1,j,0), where "j" equals 0, 1, 2 and 3, in each of the data router node groups 20(1,0) through 21(1,3).

It will be appreciated that the collection of data router nodes 22(i,j,k) from each leaf 21 to and including the data router nodes 22(m,0,k) in the root data router node group 20(M,0) essentially forms an inverted tree. Each leaf 21 effectively comprises the root of one inverted tree and the data router nodes 22(M,0,k) of the root data router node group 20(M,0) form all of the leaves of all of the inverted trees defined by the collection of leaves 21. The number of data router nodes 22(i,j,k) in each data router node group 20(i,j) at a particular level "i" in the tree defining data router 15 will be determined by the fan-out at each level from level "1" to level "i" in the inverted tree. The fan-out at a particular level "i" is the number of data router nodes 22(i+1,j,k) at level "i+1" to which each data router node 22(i,j,k) at level "i" is connected. Thus, for example, since data router node 22(1,0,0) of data router node group 20(1,0) in level "1" is connected to two data router nodes 22(2,0,0) and 22(2,0,1) of data router node groups 20(2,0) in level "2," the fan-out from data router node 22(1,0,0) is two. In one particular embodiment, the fan-out from data router nodes 22(i,j,k) at a particular level "i" is the same for the entire level, but it may differ from level to level as described below.

As noted above, the data router 15 transfers message packets among the processing elements 11, scalar processors 12 and input/output processors 13, all of which are represented by leaves 21. Each connection shown in FIG. 2B between a leaf 21 and a data router node 22(1,j,k) of level 1, which is represented by a line therebetween, actually represents two unidirectional data paths, one for transferring a message packet in each direction. Thus, for example, the connection between leaf 21(0) and data router node 22(1,0,0) of data router node group 20(1,0) represents two data paths. One data path is used by the leaf 21(0) to transmit a message packet to the data router node 22(1,0,0) for delivery to another leaf 21(x). The other data path is used by the data router node 22(1,0,0) to deliver message packets originating at other leaves 21 destined for the leaf 21(0).

Similarly, each connection between a data router node 22(i,j,k) of a level "i" and a data router node 22(i+1,j,k) of a level "i+1," which is also represented in FIG. 2B by a line, represents two unidirectional data paths, one for transferring a message packet in each direction. Thus, for example, the connection between data router node 22(1,0,0) of data router node group 20(1,0) and data router node 22(2,0,0) represents two data paths, one used to transfer message packets from data router node 22(1,0,0) to data router node 22(2,0,0) and the other to transfer message packets in the opposite direction, that is, from data router node 22(2,0,0) to data router node 22(1,0,0).

Transfer of a message packet from one leaf 21(x) to another leaf 21(y) through the data router 15 message transfer proceeds in two general operations. First, the data router nodes 22(i,j,k) transfer the message packet first "up the tree," that is, to data router nodes in successively higher levels, until it reaches a selected maximum level determined in part by the separation between the source and destination leaves. After a message packet has reached the selected maximum level, the transfer continues "down the tree", during which the data router nodes 22(i,j,k) transfer the message packet to data router nodes at successively lower levels until it is delivered to the destination leaf 21(y). As will be clear from the detailed description of the structure and operation of a data router node 22(i,j,k) in FIGS. 11A through 11D below, the data router 15 can transfer a plurality of messages concurrently, any of the data router nodes 22(i,j,k) can direct messages up the tree and other messages down the tree at the same time.

Before proceeding further, it may be helpful to describe the structure of a message packet transferred over the data router 15. With reference to FIG. 3, a data router message packet 30 includes three general portions, including a message address portion 31, a message data portion 32, and a checksum portion 33, each comprising one or more "flits." In one embodiment, each flit comprises four bits, which are transferred in parallel over a data router connection, that is, between a leaf 21 and a data router node 22(i,j,k) or between two data router nodes 22(i,j,k).

The message data portion 32 includes several elements, including a length flit 34, a tag flit 35 and one or more data flits 36(0) through 36(N) (generally identified by reference numeral 36). The data flits 36 generally contain the actual message data being transferred over the data router 15, which may vary from packet to packet. The tag flit 35 contains control information which may be used by the destination leaf, identified herein by reference numeral 22(y), in processing the data. The contents of the length flit 34 are identify the number of flits in the message data portion 32, and may vary depending on the amount of data being transferred in a particular packet. In one particular embodiment, the contents of length flit 34 identify the number of thirty-two bit words in the data flits 36 of the message packet. In that embodiment, the number of data flits 36 in the message packet is eight times the value in the length flit 34.

The checksum portion 33 contains a value which is used in detecting errors in packet transmission over the data router 15.

The data router 15 uses the contents of the message address portion 31 to determine the path to be traversed by the message packet 30 from the source leaf to the destination leaf. The message address portion 31 includes a header 40, which identifies the selected maximum level to which the message packet is to be transferred when going up the tree, and a down path identification portion 41 which identifies the path down the tree to the destination leaf 21(y) when going down the tree. When directing a message packet up the tree, a data router node 22(i,j,k) at level "i," randomly selects one of the data router nodes 22(i+1,j,k) connected thereto in level "i+1" in data router node group 20(i+1,j) to receive the message packet. Other than specifying the selected maximum height for the message packet, the packet does not otherwise specify the particular path it is to take up the tree.

The down path identification portion 41 of message packet 30 defines the path the packet is to take down the tree from the data router node group 20(i,j) at the selected maximum level to the destination leaf 21(y). The down path identification portion includes one or more down path identifier fields 42(1) through 42(M) (generally identified by reference numeral 42). The successive down path identifier fields 42, beginning with field 42(M), are used by the data router nodes 22(i,j,k) at successively lower levels as they direct the packet downwardly in the tree.

The down path identifier field 42(i) for level "i" identifies the child data router node group 20(i-1,j) to which the parent data router node group 20(i,j) that receives the packet at level "i" is to direct the message packet 30. It will be appreciated that the down path identifier fields 42 need not specifically identify one of the data router nodes 22(i-1,j,k) in the data router node group 20(i,j) at each level to which the message packet is to be directed, since the path down the tree is effectively a traversal of the inverted tree of which the destination leaf 21(y) is the root.

In one embodiment, in which each parent data router node group 20(i,j) is connected to four child data router node groups 20(i-1,j) or four leaves 21, each down path identifier field 42 comprises two bits that are binary encoded to identify one of the four children to which the message is to be directed. As indicated by FIG. 3, two fields 42 are packed into a single four-bit flit in the message packet 30. Since one down path identifier field 42 is used to at each level (i) in the downward traversal, the number of down path identifier fields 42 required to define the downward path corresponds to the selected maximum level in the path up the tree, which, in turn, corresponds to the contents of header 40. During the downward traversal mode, the data router nodes 22(i,j,k) through which a message packet 30 passes decrement the contents of the header 40 and, after both down path identifier fields 42 contained in a flit have been used, discard the flit. Thus, the length and content of a message packet 30 may change as it is being passed down the tree.

It will be appreciated that the addressing arrangement provided by the header 40 and down path identification portion 41 can be viewed as follows. The selected maximum height in header 40 effectively identifies the data router node group 20(i,j) which is the root of a sub-tree, preferably the smallest sub-tree, of the data router 15 that contains both the source leaf 21(x) and the destination leaf 21(y). On the other hand, the down path identification portion 41 details the exact path from that root to the destination leaf 21(y).

The provision of increasing numbers of data router nodes 22(i,j,k) in data router node groups 20(i,j) at higher levels in the data router 15, thereby resulting in a "fat-tree" design, provides several advantages. In a massively parallel computer SIMD system, processing elements 11 typically transfer messages during a message transfer operation, initiated by commands from the scalar processors 12. During a message transfer operation, a large number of processing elements 11 may transfer messages concurrently. If the data router 15 did not have increasing numbers of data router nodes 22(i,j,k) at higher levels to which the message packets 30 can be directed when going up the tree, the bandwidth of the data router 15, that is, the rate at which it can transfer message packets 30, would decrease at higher levels.

Since increasing numbers of data router nodes 22(i,j,k) are provided at higher levels in the "fat-tree" design, the reduction in bandwidth at higher levels can be minimized or controlled. As noted above, the fan-out of data router node groups 20(i,j), that is, the number of data router nodes 22(i+1,j,k) at level "i+1" connected to each data router node 22(i,j,k) at level "i" can vary from level to level, and can be selected to maintain a desired minimum bandwidth between the respective levels "i" and "i+1." Alternatively, the fan-outs from each level to the next higher level can be selected so that the entire data router 15 has a selected minimum bandwidth.

Further, as noted above, each data router node 22(i,j,k) randomly selects the data router node 22(i+1,j,k) in the next higher level to which it directs a message packet 30 in the path up the tree. Accordingly, the message packets are randomly distributed through the higher levels of the tree, which minimizes the likelihood of bottlenecks and maximizes the bandwidth in the higher levels.

As shown in FIGS. 2A and 2B, each data router node group 20(i,j), and in particular each data router node 22(i,j,k), in the data router 15 receives an AFD(i,j) all-fall-down (i,j) signal. The AFD(i,j) all-fall-down (i,j) signal is provided by the control network 14, as will be described below in connection with FIGS. 4A and 4B, under control of the scalar processors 12 to initiate a context switch operation. The AFD(i,j) all-fall-down (i,j) signal, when asserted, enables the data router 15 to enter an all-fall-down mode, in which it quickly empties itself of message packets. In response to the AFD(i,j) all-fall-down (i,j) signal, the data router 15 directs all message packets 30 directly down the tree to the leaves 21, where they are stored until the context in which the message packets were generated is restored. At that point, the leaves 21 which receive such messages can transmit them over the data router 15, which will deliver them to the intended destinations.

In contrast to normal operation described above, in which the contents of the header 40 are decremented and flits containing down path identifier fields 42 discarded as the message packet 30 is directed down the tree, when the AFD(i,j) all-fall-down (i,j) signal is asserted the contents of the header 40 are not decremented and no changes are made to the flits containing the down path identifier fields 42. When the context is restored and the leaves 21 return the message packets to the data router 15, they will be delivered to the proper destination leaves. This can be seen from the following explanation.

In the following explanation, reference numerals 21(x) and 21(y) will refer to the original source and destination leaves, respectively, for a message packet 30 and reference numeral 21(x') will refer to the intermediate storage leaf which receives and stores the message packet 30 while the context in which the data router message packet 30 was generated is being switched out. First, for those message packets that are being transferred up the tree or that have reached the selected maximum height when the AFD(i,j) all-fall-down (i,j) signal is asserted, the contents of the header 40 and down path identification portion 41 are the same as when they were originally transmitted by the source leaf 21(x). Since the intermediate storage leaf 21(x') receives the message packet 30 it must be part of a sub-tree of the data router 15 that includes both the source leaf 21(x) and the destination leaf 21(y). Further, the sub-tree has the same root data router node group 20(i,j) that the message packet 30 would have reached had the AFD(i,j) all-fall-down (i,j) signal not been asserted. Accordingly, when the intermediate storage leaf 21(x') transmits the message packet over the data router 15, the packet will go up the tree and reach the same data router node group 20(i,j) that it would have reached if the AFD(i,j) all-fall-down (i,j) signal had not been asserted, and from there will follow the same downward path, defined by the down path identification portion 41, that it would have taken.

On the other hand, if a message packet is being transferred down the tree when the AFD(i,j) all-fall-down (i,j) signal is asserted, prior to the signal's assertion the contents of the header field 40 are decremented as the message packet is passed from level to level. Accordingly, it will be appreciated that, when the message packet 30 is transmitted by the intermediate storage leaf 21(x'), in its path up the tree it will go only to a data router node group 20(i,j) at the level indicated in the header field 40, which, in turn, corresponds to the data router node group 20(i,j) which controlled the direction of transfer of the message packet 30 when the AFD(i,j) all-fall-down (i,j) signal signal was asserted. It will be appreciated that the data router node group 20(i,j) that the message packet 30 reaches may not be the root of a sub-tree that includes the source leaf 21(x). However, it will be the root of a sub-tree that includes both the intermediate storage leaf 21(x'), since the message packet 30 was transferred from that data router node group 20(i,j) to the intermediate storage leaf 21(x'), and the destination leaf 21(y), since the message packet 30 could have been transferred from that data router node group 20(i,j) to the destination leaf had the AFD all-fall-down (i,j) signal not been asserted.

As will be described in further detail below, each leaf 21 maintains a message counter that it increments when it tranmsits a message packet over the data router 15, and that it decrements when it receives a message packet from the data router 15. As noted above, the control network 14 performs selected arithmetic operations, whose results can be provided to