WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Task flow computer architecture    
United States Patent5574933   
Link to this pagehttp://www.wikipatents.com/5574933.html
Inventor(s)Horst; Robert W. (Champaign, IL)
AbstractA computer architecture has a plurality of processing cells interconnected to perform programming tasks. Each cell contains both memory and processing elements. Memory packets contain an instruction, a data element, and a pointer to another memory packet. Tasks are executed by following a linked list of memory packets. Transmission packets communicate instructions and register values along the linked list. A plurality of computer processes may be executed simultaneously.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5574933
Task flow computer architecture - US Patent 5574933 Drawing
Task flow computer architecture
Inventor     Horst; Robert W. (Champaign, IL)
Owner/Assignee     Tandem Computers Incorporated (Cupertino, CA)
Patent assignment
All assignments
Publication Date     November 12, 1996
Application Number     08/329,719
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     October 26, 1994
US Classification     712/28 710/30 710/33 710/52
Int'l Classification     G06F 003/00
Examiner     Bowler; Alyssa H.
Assistant Examiner     Nguyen; Dzung C.
Attorney/Law Firm     Townsend and Townsend and Crew
Address
Parent Case     This is a Continuation of application Ser. No. 07/735,594 filed Jul. 25, 1991, now U.S. Pat. No. 5,404,550.
Priority Data    
USPTO Field of Search     395/800 395/375 395/425 395/200.15 395/850 395/853 395/872 364/DIG. 1 364/232.22 364/232.91
Patent Tags     task flow computer architecture
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5179685
Nojiri
712/41
Jan,1993

[0 after 0 votes]
5133062
Joshi
703/13
Jul,1992

[0 after 0 votes]
5117489
Komori
712/201
May,1992

[0 after 0 votes]
5095480
Fenner
370/238
Mar,1992

[0 after 0 votes]
5093919
Yoshida
712/25
Mar,1992

[0 after 0 votes]
5043880
Yoshida
712/201
Aug,1991

[0 after 0 votes]
5021947
Campbell
712/25
Jun,1991

[0 after 0 votes]
4972315
Yamasaki
712/201
Nov,1990

[0 after 0 votes]
4893234
Davidson
712/27
Jan,1990

[0 after 0 votes]
4837676
Rosman

Jun,1989

[0 after 0 votes]
4153932
Dennis
712/27
May,1979

[0 after 0 votes]
4149240
Misunas
712/25
Apr,1979

[0 after 0 votes]
3962706
Dennis
718/102
Jun,1976

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A computer for executing a task, the task being segmented into a plurality of instructions to be executed sequentially, said computer comprising:

(a) an interconnection network;

(b) a first cell coupled to said interconnection network, said first cell including:

a first receive queue coupled to said interconnection network for receiving a first incoming transmission packet containing a first instruction and a first link address, said first link address identifying a first memory packet and forming a portion of a linked list of memory packets specifying an order of execution;

a first memory coupled to said first receive queue for storing said first memory packet, said first memory packet being selectively disposed in said first cell and including a second instruction, a first data element and a second link address, said second link address identifying a second memory packet and forming a portion of said linked list of memory packets specifying an order of execution;

a bypass queue coupled to said first memory for outputting a second incoming transmission packet that addresses no memory packet residing in said first memory;

a first arithmetic and logic unit (ALU) coupled to said first memory for executing said first instruction;

means coupled to said first ALU for generating a first outgoing transmission packet containing said second instruction and a link field for storing said second link address; and

(c) a second cell coupled to said first cell through said interconnection network, said second cell comprising:

a second receive queue coupled to said interconnection network for receiving said first outgoing transmission packet;

a second memory coupled to said second receive queue for storing said second memory packet, said second memory packet being selectively disposed in said second cell and including a third instruction, a second data element and a third link address, said third link address identifying a third memory packet and forming a portion of said linked list of memory packets specifying an order of execution;

a second ALU coupled to said second memory for executing said second instruction stored in said first outgoing transmission packet; and

means coupled to said second ALU for generating a second outgoing transmission packet containing said third instruction and said third link address.

2. The computer of claim 1 wherein said first input queue comprises a bidirectional shift register.

3. The computer of claim 1 wherein said interconnection network comprises a two-bus bidirectional ring.

4. The computer of claim 1 wherein said interconnection network comprises a four-bus bidirectional ring.

5. The computer of claim 1 wherein said interconnection network comprises a unidirectional Torus.

6. The computer of claim 1 further comprising a global synchronous clock coupled to said first cell and said second cell.

7. The computer of claim 6 further comprising a watchdog timer in said first cell, said timer generating a signal indicating that said first incoming receive queue has not changed state within a predetermined number of clock cycles of said global synchronous clock.

8. The computer of claim 1 wherein said interconnection network has a first width and said first incoming outgoing transmission packet has a second width, said first width being different from said second width.

9. The computer of claim 8 wherein said first width is less than said second width.

10. The computer of claim 9 wherein said interconnection network forwards a first incoming portion and a second portion of said first outgoing transmission packet serially.
 Description Submit all comments and votes
 


BACKGROUND OF THE INVENTION

This invention is in the field of computer architecture, and, more particularly it relates to parallel processing and multiprocessor computer systems.

Current approaches to parallel processing include Single Instruction Multiple Data (SIMD) machines, parallel processors, systolic arrays, and data flow machines. SIMD machines use a single sequential instruction stream to control parallel arithmetic units. Multiple Instruction Multiple Data (MIMD) computer have a large number of sequential processors which are connected to memory elements through a routing network. Systolic arrays such as WARP and iWARP pass data through a communications network in such a way that operands simultaneously arrive at a processing element. Data flow machines pass data tokens to processing elements which "fire" after all operands arrive.

In addition to the many possible architectures for parallel processing machines using known discrete microprocessors, developments in Integrated Circuit (IC) fabrication have created new possibilities for multiprocessor computer systems. The ever increasing density of Very Large Scale Integration (VLSI) components offers new challenges to computer architects to find optimal use for the added silicon area. Today it is feasible to include a processor, floating point arithmetic unit, and small, first-level cache on a single IC. Higher levels of integration have greatly improved single processor performance, largely due to the elimination of inter-circuit signal delays in the critical paths to the memory caches.

As circuit densities continue to increase, the next steps needed to improve performance are not clear. Increasing the size of the on-chip caches will improve performance but this "solution" soon reaches a point of diminishing returns. Another approach is to include multiple function units and to allow multiple instructions to be dispatched simultaneously. These "superscalar" designs also improve performance, but the techniques used do not scale beyond a few instruction dispatches per cycle.

The advances of Wafer Scale Integration (WSI) will add even more pressure to find creative uses for the vast increases in available circuit area. It is likely that the next major advance in processing speed will result from the incorporation of multiple parallel processors on the same silicon device. Many current approaches to parallel processing define the problem in terms of selecting an optimal interconnection network for multiple high-performance microprocessors.

Although much work has been done on parallel processors and their interconnection networks, the problems with known approaches include the complexity of the interconnection network and the necessity of frequent and time-consuming memory access. A parallel processing architecture which could utilize a relatively simple interconnection network and reduce the number of needed main memory accesses would be a significant advance in parallel computing.

SUMMARY OF THE INVENTION

The present invention comprises a new computer architecture which uses a plurality of processing units operating in parallel. This architecture is known herein as task flow architecture. Simple replicated cells contain both memory packets and processing logic. Each memory packet contains a data field, the next instruction and a link. Execution of a task is accomplished by following a linked list of memory packets. Register values and instructions are communicated as part of a transmission packet to the next memory packet in the linked list.

This invention involves sending computations to stationary data objects, rather than sending data from memory to stationary processors. There are multiple cells which contain both computing and memory elements. Data is partitioned across the simple identical interconnected cells. Computation is performed by a set of tasks flowing through the network. Each task is executed by following a linked-list of memory packets, each of which contains a data element, the next instruction to perform, and a link to the next memory packet.

This task flow architecture will now be described in detail with reference to the figures listed and described below.

BRIEF DESCRIPTION OF THE ILLUSTRATIONS

FIG. 1 is a high level block diagram of a first embodiment of the present invention;

FIG. 2 shows a typical memory packet used in the present invention;

FIG. 3 shows a typical transmission packet used in the present invention;

FIG. 4 illustrates how computations are performed by the present invention;

FIG. 5 shows a typical cell used in the present invention;

FIG. 6 shows a 3.times.4 array of cells coupled together by a network of multiplexers;

FIG. 7 is a machine level instruction set used by the present invention;

FIG. 8 is a description of several instructions used in the present invention;

FIG. 9 is an example of task flow execution;

FIG. 10 is an example of matrix-vector multiplication using the resent invention; and

FIG. 11 is another example of task flow execution;

FIG. 12 shows a diagram of two cells in a two-bus-per-cell bidirectional ring;

FIG. 13 shows a bidirectional ring implemented with four bus connections to each cell;

FIG. 14 shows four cells of a unidirectional torus;

FIG. 15 shows the receive queue at three sequential time steps;

FIG. 16 shows a possible encoding of variable length MPs; and

FIG. 17 illustrates the six classes of instructions in the first preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENT

FIG. 1 is a high level block diagram of a task flow computing machine 10. Machine 10 comprises a plurality of cells 12 coupled together by an interconnection network. Each cell contains a processing element (e.g., an arithmetic unit) and memory. Data is partitioned across the cells. Computation is performed by a set of tasks flowing through the network and performing operations on the data. Multiple tasks may be active simultaneously, but there is no permanent connection between tasks and processing elements. A single instruction may be executed at each cell, or a task may remain local to a cell while many instructions are executed.

In the embodiment shown in FIG. 1, the interconnection network is a unidirectional ring. Each cell is unidirectionally coupled to one other cell. Other arrangements of cells are possible, such as a bidirectional ring, a chordal ring, a mesh, a torus, a hypercube, or a crossbar. The unidirectional ring is advantageous because it requires no logic for arbitration or routing, and neighboring cells of the ring can be physically adjacent, improving overall cycle time.

Each cell has a wide shallow memory for storing memory packets. The contents of a memory packet 20 are shown in FIG. 2. A memory packet 20 contains four separate fields for storing an instruction 21, a cell/address link 22 to another memory packet, a data element 23, and a lock bit 24.

The memory packets form a linked list. The cell/address link 22 points to the next cell, and the offset within that cell. A task is executed by following the linked list.

Program execution is accomplished through the flow of transmission packets. Transmission packets are transmitted from one cell to the next, carrying the state of each executing task. Tasks may flow around the ring, or remain local to one cell. Each cell includes bypass logic to allow transmission packets to bypass an intermediate cell between the sending cell and the receiving cell, even if the intermediate cell is busy.

The contents of a typical transmission packet are shown in FIG. 3. Transmission packet 30 comprises an instruction 31, data registers 32 (two in this embodiment), a destination cell identifier 33, and a memory offset address 34 referencing a memory location at the destination cell.

The cell identifier and memory address together point to a memory packet. The destination cell receives a transmission packet, uses the address embedded in the transmission packet to address a memory packet, performs the operation indicated by the instruction in the transmission packet on data contained in the memory packet, and produces a new transmission packet using the results of the just performed operation. The instruction stored in the addressed memory packet becomes part of the new transmission packet. The new transmission packet is routed to the next memory packet in the linked list, which may be in the same or a different cell.

This architecture differs from other parallel computing architectures by sending computations (instructions) to stationary data objects, rather than sending data from memory to stationary processors. A simple example will illustrate the flow of transmission packet. FIG. 4 shows an incoming packet arriving at cell 9, address 91. Address 91 contains an instruction that adds the incoming register field Ra (currently equal to 2) to the data stored in the memory packet to be visited next. A transmission packet 47 is generated in order to send the ADD to cell 11, address 45. When the packet arrives at cell 11, it finds a 12 in the data field, adds it to the register value, then places the sum 14 in the outgoing transmission packet. The outgoing packet also picks up the STORE instruction (ST) and the link to the next memory packet at cell 2, address 13. A more detailed example will be presented later, after the instruction set is described.

FIG. 5 is a block diagram showing the construction of a single cell of a computing machine constructed according to this embodiment of the present invention. Input pipeline register 51 receives a transmission packet. Comparator 52 compares the destination cell identifier in the transmission packet to the local cell identifier. If they do not match, the transmission packet is sent through bypass queue 53 to output pipeline register 54. If an instruction is creating a new transmission packet to be sent to another cell, the packet being created has priority for the output register and any bypass packets are queued.

If the received transmission packet cell number matches the local cell number, the transmission packet is passed through receive queue 55 to the execution logic. If the cell is busy, several packets may be queued until the current instruction completes.

Overflow of the queues is prevented by propagating a busy signal to the previous cell in the network when there is just enough room in the cell's queues to store the maximum number of packets which could be generated before the previous cell responds to the busy signal. A busy cell continues to forward packets not intended for it.

The cell cycles through those instructions in its input queue which may be successfully completed. The address field in the transmission packet addresses memory 56 while decoder 57 decodes the instruction field. The data field of the addressed memory packet may be modified by a store instruction or may be used as an operand in a load or arithmetic instruction. Multiplexers select from the memory data, values in received registers, and constants to determine what is sent to Arithmetic Logic Unit (ALU) 58. The lock bit in the memory packet may prevent an instruction from executing. Each instruction which reads or writes the data field has a locking bit which indicates whether the lock bit of the memory packet (see FIG. 2) is to be honored or ignored. If the lock bit is to be honored, instructions which generate data must wait for the lock bit to be cleared before proceeding. Instructions which read data must wait for the lock bit to be set. Tasks which share data use the lock instructions to guarantee that execution will proceed correctly regardless of the order in which instructions arrive. Other locking schemes could be adapted for the task flow machine, including most of the synchronized instructions used in shared memory multiprocessors. The present scheme has the advantage of distributing the synchronization across many different cells.

When an operation is completed, output latch 59 is loaded with the outgoing transmission packet. The outgoing instruction is decoded to determine whether a new task is to be started by a "FORK" instruction. Comparator 63 compares the destination cell identification number to the local cell identification number and, if they are equal, the transmission packet is routed back to receive queue 55. Instructions which fail to complete, either due to a busy signal from the destination cell or due to an inability to access a locked variable, are sent back to receiving queue 55 for later execution or bypass transmission. Transmission packets created by instructions which are just completed are given priority for the output register and any bypass packets are queued.

Instruction fetch is automatically overlapped with data fetch, because the same memory packet stores both the data for one instruction and the opcode for the next instruction. This effectively creates a two stage pipeline which is spread across the two cells visited successively by a given task.

The current embodiment uses physical addresses. The destination cell identifier and next memory packet address are derived by extracting bit fields from the current memory packet link fields. Cell assignments of variables are fixed at compile time. In other embodiments, virtual addressing could be used, allowing relocation of data and paging of memory packets. A virtual-physical address translation table could be placed between the memory link field output as it moves from memory 56 to output latch 59. The translation table could be accessed in parallel with the ALU operation.

Memory requirements for the cell could be reduced by using relative, rather than absolute addressing, to reduce the number of bits required to store the cell number and address offset, or by using variable length memory packets.

Although one embodiment of the present invention could comprise a single cell as shown in FIG. 5, another preferred embodiment of the invention uses a plurality of cells, coupled together by a communications network. One possible arrangement of a plurality of cells is shown in FIG. 6, which comprises a 3.times.4 array of cells on a single wafer. The cells are arranged in a checkerboard pattern with adjacent cells rotated 180.degree.. The delay between pipeline registers in the ring (and thus between the output of one cell and the input of the next cell in the chain) is always the sum of four separate multiplexer delays, and does not depend on the distribution of defective cells. The clock may be propagated through the same multiplexers as the data to reduce clock skew and to tolerate defects in the clock distribution network. The wafer scale integration technique supports fast clock rates due to the fixed delay between cells, the low skew clock distribution scheme, and the lack of run time arbitration and routing decisions that would be required in higher dimension networks.

Each cell border connects to either a neighboring cell or an internal bus through a multiplexer. Defective cells may be avoided through proper setting of multiplexer control latches. In the figure, cells 61, 62 and 63 are defective. The wafer scale integration technique of this embodiment is described in more detail in the article "A Linear-Array WSI Architecture for Improved Yield and Performance", in Proc. International Conference on Wafer Scale Integration, San Francisco, Calif., Jan. 4, 1990, and in copending U.S. patent Application Ser. No. 07/346,203, filed May 2, 1989, entitled "Linear Array Wafer Scale Integration Architecture", each of which is hereby incorporated by reference into the present specification in its entirety.

Using 1 micron CMOS technology, the maximum clock frequency for data transfers is approximately 200 MHz or 5 nanoseconds. Instructions require multiple clock cycles, typically ranging from 4 clock cycles for a simple load up to approximately 17 clock cycles for an integer multiply. Instructions which flow to the next cell in the ring incur a 1 clock cycle latency penalty over tasks that remain local. The extra time spent in communications affects the time required for a single task, but does not reduce the peak processing throughput. In many applications, the cells are kept fully utilized by generating multiple tasks per cell.

In this embodiment, half of each cell is devoted to processing and half to memory. Other mixes may prove more appropriate for particular embodiments and applications.

Referring to FIG. 7, a summary of the machine level instruction set of this embodiment is shown. Two operand fields 71 and 73 use data stored in the two registers, the constants 0 or 1, or memory data. In this embodiment, each instruction is limited to one memory reference. Destination field 75 specifies whether the ALU output is directed to Ra or Rb of the outgoing transmission packet.

A summary of the operation of some representative task flow instructions is shown in FIG. 8. For each instruction, the outgoing "SND" transmission packet resulting from execution of the instruction at the receiving cell is shown. In some cases, an "SEQ" transmission packet is created by the sending cell when the instruction is originally fetched.

The "RCV" prefix in several of the instructions refers to a field received in the incoming transmission packet. The "MEM" prefix refers to a field in the local memory packet addressed by the incoming transmission packet.

Arithmetic instructions (ADD, SUB, etc.) have three available operands: the register values Ra and Rb transmitted in the transmission packets sent from the sender to the receiver, and the data in the memory packet fetched at the receiver. For example, ADD instruction 81 adds the memory data value to one of the registers, and stores the result in one of the registers. The resulting transmission packet 83 contains the sum in a register field. The other register field is unmodified. The next instruction and the link address (cell and offset) of the next memory packet have been fetched from the current memory packet and placed in the outgoing transmission packet.

Some instructions "FORK" a task into multiple tasks. When the FORK instruction 85 is fetched from a memory packet, two transmission packets are created. The sending cell dispatches a first packet to the link address and a second (SEQ) packet to the next sequential memory packet in the sending cell. The first transmission packet sends the unchanged registers and the FORK instruction to the receiving cell addressed by the link field of the local memory packet. The FORK instruction is executed as a no-operation at the receiving cell. The second "SEQ" transmission packet sends a NOP instruction to the next sequential memory packet in the same (sending) cell. Some arithmetic operations might combine the FORK with arithmetic operations to provide the destination cell with an operation.

Load and Store instructions allow reading and writing of a data field, link field, or instruction field. Store instructions send a packet to perform the Store operation at the link address. When storing data at a memory location, it is usually not desirable to continue the task that performed the store. Instead, it is desirable to continue the original task at the next sequential memory packet. The Store Via Fork (STF) instruction sends data to the memory packet addressed by the link, but continues the task at the next sequential instruction.

The Store Via Fork (STF) instruction temporarily generates two tasks when it is fetched. A first task sends a first transmission packet to store data to the memory packet addressed by the link and then terminates without generating another transmission packet. A second task sends a SEQ transmission packet to the next sequential memory packet in the same cell. The second task continues at the sending cell, and the data can be accessed at a later state in the execution.

In the instruction set of this embodiment, the termination of the storing task, and the continuation of the sending task, are both optional. The four variations are specified by encoding of the OP.sub.-- M field (FIG. 7, 71).

The GET instruction (FIG. 8, 87) allows data to be retrieved from a memory packet while ignoring the instruction and link associated with that data. When a GET instruction is fetched, a return address is placed in one of the register fields (Rb) of the transmission packet. The cell that receives the transmission packet fetches the memory data, loads it into a register field, and sends the task to the return location (rather than the target of the link field). The GET instruction always retrieves the data field at the link address; GETA and GETI retrieve the link or instruction field at the link address.

The GET instruction emulates the load instruction of a traditional computer architecture. The address is sent to the data, the data returns, and execution continues at the next location. Without the GET instruction, a separate copy of each memory packet would be required for each unique algorithm accessing the data. The GET instruction allows a memory packet to have more than one possible successor, with the most frequent successor contained in the link field. Algorithms requiring infrequent access to data may use GET, while the most frequent algorithm may use the associated instruction and link fields to avoid the round trip path which would require sending the address and returning the data.

Conditional branch instructions perform a test, and then either a "normal" SND transmission packet is sent to the target of the link field, or a SEQ packet is sent to the next sequential location in the same cell. The BZERO instruction 89 is an example of a conditional branch. As shown in FIG. 7, two sets of Branch Ops allow comparisons of either floating point or fixed point operands selected by operand M and operand K. The six conditions are the conditional tests greater than, less than, greater or equal to, less than or equal to, equal, or not equal. The comparisons may involve register values, memory values, or constants.

The CALL instruction (FIG. 17) is similar to the first part of a GET. The sending instruction places the return address in register Rb. At the destination, CALL performs operations including than fetching of the next memory packet. At the end of the subroutine, a return operation is performed by specifying that Rb should be substituted for the link field. The most significant bit of the link field indicates that Rb is to be used as the link.

The Locked Load (LLOAD) 91 and 93 Locked Store via Fork (LSTF) instructions are examples of instructions which use the lock bit to synchronize tasks.

An example of task flow execution will now be given. FIG. 9 shows an assembly language program for calculating the sum of a multiplication and division for a particular set of values. Specifically, the program calculates:

A=B*C+B/D for B=6, C=2, D=3

FIG. 9 shows the execution of this program, after it is assembled and stored in the cells. Execution begins at cell 1, address 12, where a transmission packet 99 is generated with the Load (LD) instruction and a link address of cell 2, address 23. At cell 2, memory packet 101 is fetched, the load instruction is executed by loading the value "6" from the memory cell into register Rb, and sends a transmission packet 103 with the register value and the Multiply instruction (MPY) to cell 7. At cell 7, the memory packet at address 31 is fetched, and the MPY instruction is executed, multiplying the value in Rb times the memory value from the memory packet and loading the result in register Ra. The instruction in the memory packet is a forked store instruction, STF, so cell 7 generates a SEQ transmission packet 105 to itself, in addition to sending transmission packet 107 to the link address, cell 11, address 45. At cell 11, the STF instruction causes the value in register Ra to be stored in memory location 45, without generating another transmission packet.

At cell 7, the memory packet 106 is fetched from address 32 and transmission packet 109 is sent to cell 9 carrying the Divide (DIV) instruction. At cell 9, the contents of Rb are divided by the memory data element. The result is stored in Ra, in transmission packet 111, along with the ADD instruction. When the transmission packet 111 arrives at cell 11, the data in address 45 has been stored by the STF operation of the parallel task. If the communications network always delivers packets in order, then the store operation will always be completed by the time the main task arrives at cell 11 with the ADD instruction, and it is not necessary to use a locked instruction.

After the ADD instruction is executed at cell 11, transmission packet 113 is sent to cell 2 with a store instruction and the result in register Ra.

It will be understood that the multiplication (B*C) and division (B/C) operation could have been performed by separate, parallel tasks. In that case, B*C would have been stored using the Locked STF (LSTF) instruction, and the addition would have used the Locked Add (LADD) instruction to force the addition to be queued until the store has been completed.

Referring to FIG. 10, another example will illustrate the calculation of the product of a sparse matrix 151 and a vector 152. This calculation is common in neural network simulations and other scientific applications.

This problem can be solved on a three cell, two register task flow machine executing three tasks. Each task first multiplies one of the vector elements by each of the matrix elements in a column, then accumulates the products for one row. FIG. 11 is a simplified flow diagram showing the path traversed by each task when one column of the matrix is assigned to each cell.

An advantage of the machine architecture described herein is that most of the machine consists of fairly simple identical cells. Each cell may be built as a single integrated circuit with wide input and output buses to transmit packets. The preferred embodiment is implemented with wafer scale integration. Linear array wafer scale integration techniques are preferred because such techniques generally offer better "harvest" of good cells on the wafer, and require less configuration hardware then more complex two dimensional topologies. For the unidirectional ring topology of the preferred embodiment, linear array techniques provide high bandwidth communications between neighboring cells at fast clock rates. For large rings, for example more than 100 cells, it may be preferable to use several wafers with a ring network within each wafer and a mesh or crossbar network connecting the wafers.

The wafer scale technique of the preferred embodiment allows linear arrays to be configured from the working cells on a partially good wafer. The technique uses four multiplexers per cell to communicate with the four nearest neighbors. Configuration latches controlling the multiplexers are set in such a way that all working and reachable cells are configured in a linear chain.

Although a preferred embodiment of the present invention has now been described in detail, with reference to FIGS. 1-11, it is understood that additions or modifications may be made to the preferred embodiment without deviating from the teachings of the present invention. Most modifications increase the design complexity of the basic embodiment, but offer improved performance, reduced cost, or specific application support. These design alternatives are briefly outlined in the following section.

Interconnection Networks

The first preferred embodiment of the present invention uses a unidirectional ring, which is adequate for solving several types of problems. However, other applications may benefit from a richer interconnection network which minimizes communications bottlenecks. A drawback to any network more complex than the unidirectional ring is that phase-shifted synchronous clocking would have to be abandoned in favor of global synchronous clocking as when each cell receives data from multiple cells; it is no longer possible to send the clock along with the data. Although it might be possible to use some form of self-timed asynchronous bussing, the short packet lengths in a task flow machine would demand a large penalty for resynchronizing data with the cell's clock. The most practical approach is believed to be synchronous clocking via a carefully designed clock tree to reduce the clock skew. The remaining clock skew may still have some negative impact on performance.

Two-Bus Bidirectional Ring

The capability of supporting bidirectional communications may be important for some types of systolic algorithms and for finite-difference solutions to partial differential equations. The simplest approach to bidirectional communications is to allow the intercell busses to carry information in either direction. The number of connections per cell is maintained, but the ring becomes bidirectional. FIG. 12 shows a diagram of two cells in a two-bus-per-cell bidirectional ring.

During each clock cycle, the cell is set to accept a transmission packet from either its left or right neighbor, but not both. The packet may pass through the bypass queue to the outbound bus, or may proceed to the receive queue for processing by the cell. One or more arbitration cycles may be required before each transfer to coordinate access between each pair of cells. Compared with the unidirectional approach, the bidirectional busses are likely to cut total bandwidth by at least half.

Four-Bus Bidirectional Ring

In another embodiment, separate busses are added for each direction of transfer. FIG. 13 shows a bidirectional ring implemented with four bus connections to each cell. Using this structure, the traffic can flow through a cell simultaneously in both directions. The extra busses also allow a more optimistic communications protocol. The sending cell can normally assume there is room in the next cell's bypass or receive queue. A busy signal is sent from each cell to instruct its neighbors to stop sending whenever a queue is about to overflow.

The optimistic protocol should make each bus of the four-bus cell perform approximately the same as busses in the original unidirectional approach. With twice as many busses, bandwidth is approximately doubled. The added performance comes at a cost of twice the number of external connections per cell, additional area for queues, and increased control complexity. However, if there are constraints on the maximum number of I/O lines per cell, adding more busses may require additional multiplexing or weaker I/O buffers, partially offsetting the added bandwidth.

Unidirectional Torus Network

The same cell structure as in FIG. 13 can be used to construct a unidirectional torus network. Four cells of a unidirectional torus are shown in FIG. 14. Each cell participates in one ring in the X-direction and one in the Y-direction.

The torus may be the preferred network for applications requiring higher-dimension routing. It gives three efficient directions for task propagation: internal, X and Y, that may be useful in 3-dimensional matrix calculations.

One drawback of the torus is that it is difficult to imbed in a two dimensional wafer or circuit board while maintaining equal propagation delays between cells.

Other Networks

Many different networks are known for the interconnection of multiprocessor systems, and any of them could be employed in a task flow machine. Some possibilities include the chordal ring, mesh, hypercube, crossbar, and various multistage networks. At each increment of complexity, there is likely to be some loss in performance of a single path due to arbitration,