WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Processor chip for parallel processing system    
United States Patent5535408   
Link to this pagehttp://www.wikipatents.com/5535408.html
Inventor(s)Hillis; W. Daniel (Brookline, MA)
AbstractA monolithic processing chip for a parallel processing system comprises a processor circuit and a memory circuit. The processor circuit processes data received from said associated memory circuit in accordance with processor control signals to generate processed data. The memory circuit includes a plurality of registers for storing data, each register including at least one data storage cell including at least one dynamic memory data bit store for storing a data bit. The memory circuit is responsive to memory control signals and register address signals to transmit stored data from the registers to the processor for processing and to store processed data received from the processor circuit in the register identified by the register address signals.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5535408
Processor chip for parallel processing system - US Patent 5535408 Drawing
Processor chip for parallel processing system
Inventor     Hillis; W. Daniel (Brookline, MA)
Owner/Assignee     Thinking Machines Corporation (Cambridge, MA)
Patent assignment
All assignments
Publication Date     July 9, 1996
Application Number     08/237,981
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     May 2, 1994
US Classification     712/16 710/52 711/106 712/13
Int'l Classification    
Examiner     Eng; David Y.
Assistant Examiner    
Attorney/Law Firm     Jordan; Richard A.
Address
Parent Case     CROSS REFERENCE TO RELATED APPLICATIONS This application is a continuation of U.S. patent application Ser. No. 06/626,362, filed Dec. 12, 1990, now abandoned which is a divisional of U.S. patent application Ser. No. 07/478,082, filed Feb. 9, 1990, now U.S. Pat. No. 5,152,000, issued Sep. 29, 1992, which is a divisional of U.S. patent application Ser. No. 07/184,739, filed Jun. 27, 1988, now U.S. Pat. No. 5,008,815, issued Apr. 16, 1991, which is a continuation of U.S. patent application Ser. No. 06/499,474, filed May 31, 1983, now U.S. Pat. No. 4,814,973, issued May 21, 1989.
Priority Data    
USPTO Field of Search     395/200 395/325 395/775 395/800
Patent Tags     processor chip parallel processing
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
4644496
Andrews

Feb,1987

[0 after 0 votes]
4523273
Adams

Jun,1985

[0 after 0 votes]
4459660
Bellay

Jul,1984

[0 after 0 votes]
4380046
Frosch et al.

Apr,1983

[0 after 0 votes]
4320500
Barberis

Mar,1982

[0 after 0 votes]
3979728
Reddaway

Sep,1976

[0 after 0 votes]
3582899
Semmelhaack

Jun,1971

[0 after 0 votes]
3312943
McKindles

Apr,1967

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A parallel computer system comprising:

A. a processor array comprising a plurality of integrated circuit chips each comprising a plurality of processor circuits and a plurality of memory circuits, each processor circuit having an associated memory circuit, said processor circuit for processing data received from its associated memory circuit in accordance with processor control signals to generate processed data, each said memory circuit including a plurality of registers for storing data, each register including at least one data storage cell including at least one dynamic memory data bit store for storing a data bit, said memory circuits being responsive to memory control signals and register address signals to

(i) transmit stored data from correspondingly-addressed registers, as identified by the register address signals, to their associated processors for processing during a read operation, and to store processed data received from their associated processors in correspondingly-addressed registers, as identified by said register address signals during a write operation; and

(ii) perform a refresh operation concurrently with said write operations in connection with registers which are not identified by the register address signals; and

B. a host for generating said processor control signals, said memory control signals and said register address signals for controlling the operations of the processor circuits and the memory circuits of said processor array in parallel.

2. A parallel processing system as defined in claim 1 in which:

A. in each said memory circuit, each register further includes:

i. a register data transfer path; and

ii. a cell read/write control circuit for controlling the transfer of data between selected ones of said data storage cells and said register data transfer path in response to cell address signals received by registers of all of said memory circuits in parallel; and

B. each said memory circuit includes a register selector circuit connected to the register data transfer paths of all of said registers for selectively controlling the transfer of data and processed data between register data transfer paths of selected ones of said registers as identified by said register address signals and said processor circuit.

3. A parallel processing circuit as defined in claim 2 in which, in each memory, each register further includes a data buffer circuit connected to the register's respective register data transfer path and the memory circuit's register selector circuit, the data buffer of each register:

A. buffering data from its register data transfer path during a read phase; and

B. for selected ones of said registers as identified by the register address signals, buffering processed data generated by the associated processor circuit as received by the data buffer circuit from its respective register selector circuit during a processing phase,

each said data buffer circuit coupling the buffered data onto the respective register data transfer path during a write phase, whereby, for a register in each memory circuit that is a selected one of said registers, processed data is written into said register, and, for a register in each memory circuit that is not a selected one of said registers, data transfer to said data buffer circuit during said read phase is written into said register thereby to refresh the register.

4. A parallel processing system as defined in claim 3 in which, in each memory, each said cell read/write control circuit comprises a plurality of cell transfer circuits each associated with a cell for controlling, in response to the cell address signals, the transfer of data from its associated cell to the respective register data transfer path during the read phase and from the respective register data transfer path to its associated cell for storage therein during the write phase.

5. A parallel processing system as defined in claim 3 in which, in each memory circuit, each said data buffer circuit includes:

A. a data store connected to its respective register selector circuit, said data store receiving and storing data as transferred to said register by said register selector circuit in response to said register address signals;

B. a read gate for coupling data from the register transfer data path to both said data store and said respective register selector circuit in response to a read signal defining said read phase; and

C. a write gate for coupling data from said data store to said respective register transfer data path for storage in a cell in response to a write signal defining said write phase.

6. A parallel processing system as defined in claim 5 in which said data as coupled by the data store of each respect memory has a data value represented by one of a selected number of voltage levels, including a high voltage level and a low voltage level, and in which each register further includes a pre-charge circuit for establishing, in response to a pre-charge signal generated by said host received by all of said registers of all of said memory circuits in parallel, a pre-charge voltage level corresponding to said high voltage level on said register data transfer path in advance of said read phase.

7. A parallel processing system as defined in claim 1 in which each said processor circuit further comprises a global signal generating circuit for generating a global status signal in response to processed data and said processor control signals provided by said host, the host receiving said global status signal and using it in connection with generation of said processor control circuits.

8. A parallel processing system as defined in claim 1 in which each said processor circuit further generates message packets and receives message packets, each message packet including a destination identification identifying a processor circuit to receive the message packet, each said integrated circuit chip further including a global router interface circuit responsive to routing control signals for routing each message packet over interconnection links interconnecting said integrated circuit chips in accordance with the message packet's respective destination identification, said host generating said routing control signals and coupling them to the global router interface circuits of all of said integrated circuit chips in parallel.
 Description Submit all comments and votes
 


CROSS-REFERENCE TO RELATED APPLICATION

A related application is Processor/Memory Circuit filed concurrently herewith by W. Daniel Hillis and others.

BACKGROUND OF THE INVENTION

This relates to a computer that uses parallel processors and, in particular, to one that uses a vastly greater number of parallel processors than previously.

A typical digital computer includes a central processing unit (CPU), a memory which stores data and a program for controlling the computer, and various input and output devices. The stored program is a series of instructions that directs the CPU to perform certain arithmetic, transfer or logical operations on the data available to the computer. Such data are ultimately provided to the computer from the input devices, and the results of the CPU operations are supplied to the output devices. In the typical computer this series of instructions is executed serially one-at-a-time.

In the forty or so years that digital computers have been used, the computers and the programs that run them have become more and more complex. Increasing complexity in a serial computer is usually manifested by increases in the size of its memory and the programs and/or data stored therein. In some senses, however, these more complicated serial computers have become less and less efficient. At any given time, only a very small part of the serial computer is actually being used because the instruction that is being executed by the CPU is obtained from no more than a few memory locations and affects data at only a few other locations. Moreover, as the computer becomes smarter in terms of the size of its memory, it becomes dumber in terms of its ability to produce an output from its memory because the time required to retrieve data from the memory increases with the amount of data stored in the memory

These problems with serial computers have been called the yon Neumann Bottleneck, after John von Neumann who contributed so much to the early development of the serial computer. See J. Backus, "Can Programming Be Liberated from the Von Neumann Style?", Communications of the ACM, Vol. 21, No. 8, p. 613 (August 1978).

These problems are particularly acute in the field of Artificial Intelligence where the computer is often called upon to retrieve knowledge stored in a network of interrelationships that is often referred to as a semantic network. Retrieving this knowledge may involve searching the entire network. It may also involve deducing the desired fact from other stored information. In performing such retrieval, a few simple operations are often repeated for most of the operating time of the program. Such operations include:

1. the sorting of a set of data according to some parameter such as size or numerical order;

2. the searching of ordered sets of data or graphs for sub-sets or sub-graphs with a specified structure;

3. the matching of patterns against sets of assertions;

4. the deduction of facts from the semantic networks of stored information.

Performing such operations one-at-a-time can be prohibitively expensive in terms of computer time and facilities. As a result, numerous problems in Artificial Intelligence cannot be addressed by presently available serial computers. These problems, however, are fundamental problems such as image processing for which solutions are urgently needed.

Alternatively, the time for performing such operations can be reduced if it is possible to perform such operations in parallel. The desirability of doing such is well recognized. See, for example, C. Mead and L. Conway, Introduction to VLSI Systems, ch. 8, "Highly Concurrent Systems", Addison Wesley (1980), and the references crated therein; W. D. Hillis, "The Connection Machine", Massachusetts Institute of Technology Artificial Intelligence Laboratory Memo No. 646 (September 1981) and the references cited therein; also A. Rosenfeld, "Parallel Image Processing Using Cellular Arrays", Computer, Vol. 16, No. 1, p. 14 (January 1983).

These documents also describe to varying degrees general concepts of devices for performing parallel operations on data. For example, Hillis and Rosenfeld contemplate an array of identical processor/memories, each of which contains both the hardware required to store data and that required to process it. However, the specific details of a fully operating computer including the interconnection of processor/memories and their control are not the subject of these papers.

SUMMARY OF THE INVENTION

I have devised a parallel processor array comprising an array of processor/memories and means for interconnecting these processor/memories in an n-dimensional pattern having at least 2.sup.n nodes through which data may be routed from any processor/memory in the array to any other processor/memory. Advantageously, the n-dimensional pattern is a Boolean cube of 15 dimensions.

Each processor/memory comprises a read/write memory and a processor for producing an output depending at least in part on data read from the read/write memory and on instruction information. The interconnecting means comprises means for generating an addressed message packet that is routed from one processor/memory to another in accordance with address information in the message packet and a routing circuit at each node in the n-dimensional pattern for routing message packets in accordance with the address information in the packets.

In a preferred embodiment of the invention, the processor/memories are also interconnected in a 2-dimensional pattern in which individual processor/memories are directly connected to processor/memories that are adjacent to them in the 2-dimensional pattern.

With presently available technology, more than one million such processor/memories can be operated in parallel while interconnected by these interconnecting means.

Preferably the address information in the message packet is relative to the node in which the message packet is being sent and each digit of the address represents the relative displacement of the message packet in one dimension from the node to which the message packet is being sent. For each dimension of the n-dimensional pattern, the routing circuit comprises logic for determining if the message packet has reached its destination in that dimension and for routing it on to another node in that dimension if it has not and if a connection to that node is available. When the connection from the first destination determining logic to another node is not available or when the first destination determining logic determines that the message packet has reached its destination in that dimension, the routing circuit provides the message packet to similar logic for determining if the message packet has reached its destination in a second dimension. Further, the routing circuit comprises logic for providing a message packet that has reached its destination node to a processor/memory at that node as well as means for storing message packets when they cannot be routed on because of connection conflicts.

Advantageously, the destination determining logic and the routing logic for each dimension are operated simultaneously throughout all nodes of the n-dimensional pattern. As a result, a message packet can be routed through the entire n-dimensional pattern during a single routing cycle. In addition, each routing circuit is small enough that it can be fabricated on a single integrated circuit chip along with several processor/memories.

BRIEF DESCRIPTION OF THE DRAWING

These and other objects, features and advantages of the invention will be more readily apparent from the following detailed description of the preferred embodiment of the invention in which:

FIGS. 1A and 1B are schematic depictions of a computer system using an array of parallel processing integrated circuits (ICs) in accordance with the invention;

FIGS. 2 and 3 are schematic representations useful in understanding certain of the interconnection patterns between the parallel processing ICs;

FIG. 4 depicts the format of a message that can be sent from one IC to another in the array of FIG. 1, as well as certain clock signal waveforms useful in understanding the operation of the computer system depicted in FIG. 1;

FIG. 5 is a schematic illustration of a printed-circuit board mounting several VLSI packages containing parallel processing ICs;

FIGS. 6A and 6B are block diagrams of an illustrative embodiment of one parallel processing IC of the array of FIG. 1;

FIGS. 7A and 7B are block diagrams of one of the processor/memories depicted in the block diagram of FIG. 6A;

FIG. 8 is a logic diagram of an interface unit depicted in the block diagram of FIG. 6B;

FIGS. 9 and 10 depict certain waveforms useful in understanding the operation of the circuit of FIG. 8;

FIG. 11 is a block diagram of a routing circuit depicted in the block diagram of FIG. 6B;

FIG. 12 is a logic diagram of an illustrative embodiment of a line assigner in the routing circuit of FIG. 11;

FIG. 13 is a logic diagram of a portion of the line assigner of FIG. 12;

FIG. 14 depicts certain waveforms useful in understanding the operation of the circuit of FIGS. 11-13;

FIG. 15 is a logic diagram of an illustrative embodiment of additional portions of the routing circuit depicted FIG. 11;

FIG. 16 depicts certain waveforms useful in understanding the operation of the circuit depicted in FIG. 14; and

FIG. 17 depicts a VLSI circuit layout for the circuit depicted in FIGS. 6A and 6B.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENT OF THE INVENTION

General Description of the System

As shown in FIG. 1A, the invention may be practiced in a computer system that comprises a mainframe computer 10, a microcontroller 20, an array 30 of parallel processing integrated circuits 35, a data source 40, a first buffer and multiplexer/demultiplexer 50, first, second, third and fourth bidirectional bus control circuits 60, 65, 70, 75, a second buffer and multiplexer/demultiplexer 80, and a data sink 90. Mainframe computer 10 may be a suitably programmed commercially available general purpose computer such as a VAX computer manufactured by Digital Equipment Corp. Microcontroller 20 is an instruction sequencer of conventional design for generating a sequence of instructions that are applied to array 30 by means of a thirty-two bit parallel bus 22. One of the thirty-two lines in bus 22 supplies array 30 with a RESET signal; three lines supply timing signals; and the other twenty-eight lines are available for transmission of instructions. Additional addressing signals to address individual parallel processing ICs 35 of array 30 are supplied to the array on bus 24. Microcontroller 20 receives from array 30 a signal on line 26. This signal is a general purpose or GLOBAL signal that can be used for data output and status information. Bus 22 and line 26 are connected in parallel to each IC 35. As a result signals from microcontroller 20 are applied simultaneously to each IC 35 in array 30 and the signal applied to microcontroller 20 on line 26 is formed by combining the signal outputs from all of ICs 35 of the array.

Array 30 contains 32,768 (=2.sup.15) identical ICs 35; and each IC 35 contains 32 (=2.sup.5) identical processor/memories 36. Thus the entire array 30 contains 1,048,576 (=2.sup.20) identical processor/memories 36.

Processor/memories 36 are organized and interconnected in two geometries. The first is a conventional two-dimensional grid pattern in which the processor/memories are organized in a square array and connected to their four nearest neighbors in the array. The second is a Boolean n-cube of fifteen dimensions. To connect processor/memories 36 in a two-dimensional grid pattern, ICs 35 of array 30 are organized in a rectangular array of 256 (=2.sup.8) rows and 128 (=2.sup.7) columns; and the 32 processor/memories of each IC are connected in a rectangular array of 4 (=2.sup.2) rows and 8 (=2.sup.3) columns. As a result, the 1,048,576 processor/memories 36 of array 30 are connected in a square of 1024 (=2.sup.3) rows and 1024 columns. For convenience, the sides of this square array are identified as NORTH, EAST, SOUTH and WEST. To connect each processor/memory to its four nearest neighbors, the individual processor/memories are connected by electrical conductors between adjacent processor/memories in each row and each column; and the four nearest neighbors of any IC except those on the edges of the array will be recognized to be the four ICs immediately adjacent that IC on the North, East, South and West, respectively.

The individual processor/memories 36 in the columns and rows of the two-dimensional array may be identified by systematically numbering them, using a first number to represent their column number or position in the first dimension and a second number to represent their row number or position in the second dimension. For example, if we number the columns starting with zero in the left-hand or Westernmost column and the rows starting with zero in the bottom or Southernmost row, the nine processor/memories nearest the bottom left-hand or Southwest corner are identified or addressed by:

______________________________________ 0, 2 1, 2 2, 2 0, 1 1, 1 2, 1 0, 0 1, 0 2, 0 ______________________________________

and the processor/memory in the upper right-hand or Northeast corner is identified by the numbers 1023, 1023. Each such pair of numbers will be referred to as the address of the associated processor/memory.

For this numbering scheme, it will be recognized that the four nearest neighbors of any processor/memory in the two-dimensional array have an address theft differs from the address of that processor/memory by 1 in only one of the two numbers that make up their addresses. For example, the four nearest neighbors of the processor/memory having the address 1, 1 are the four processor/memories at addresses 1, 2; 2, 1; 1, 0; 0, 1 to the North, East, South and West, respectively.

As shown schematically in FIG. 1A, the two-dimensional grid pattern of array 30 extends beyond the Northern, Eastern, Southern and Western edges of array 30 to first, second, third and fourth bidirectional bus control circuits 60, 65, 70, 75 and to first and second buffers 50, 80. In particular, each of the 1024 processor/memories 36 on each of the four edges of the array is connected by one of 1024 bidirectional leads 61, 66, 71, 76 to bus control circuits 60, 65, 70, 75, respectively.

Data source 40 supplies input data over high speed data bus 41 to buffer and multiplexer/demultiplexer 50. Data source 40 may be any source of data such as a computer terminal, a communication line, a visual, audible or tactile input, a radar or sonar system, a disk file or a combination thereof illustratively data bus 41 may be a thirty-two bit wide bus and buffer 50 may be thirty-two serial-input, parallel-output shift registers, each of which has a thirty-two bit capacity. In such a configuration, each line of bus 41 feeds a separate serial-input shift register and there is no need for conventional multiplexing or demultiplexing. Where the number of lines in bus 41 is different from the number of shift registers, multiplexing or demultiplexing circuits are used to distribute the data from the individual data lines of bus 41 to the serial inputs of the shift registers in buffer 50.

Buffer 50 supplies the data in parallel on a 1024 line bus 51 to one of bus control circuits 60, 65, 70, 75 which provides these data via busses 61, 66, 71 or 76 to the processor/memories at the outer edge of the array on the side to which it is connected.

Data from array 30 are provided in parallel on one of busses 61, 66, 71 or 76 from the processor/memories 36 along one edge of the array to one of bus control circuits 60, 65, 70, 75 which switches the data onto a bus 81 that is connected to the input to buffer 80. The output of buffer 80 is a high speed data bus 86 that is connected to data sink 90. Buffer 80 illustratively is an array of thirty-two parallel-input, serial-output shift registers, each of which has a thirty-two bit capacity; and data bus 86 may be a thirty-two bit wide bus. For this configuration, there is no need for conventional multiplexing or demultiplexing. When the number of data lines in bus 86 is different from the number of of shift registers in buffer 80, multiplexing or demultiplexing circuits are used to provide the data from the serial outputs of the shift registers to the individual data lines of bus 86. Data sink 90 may De any sink of data such as a computer terminal, a communication line, a display, a printer, a plotter, a voice synthesizer, a mechanical device, a robot, a disk file or a combination thereof.

The direction of data flow through array 30 is controlled by microcontroller 20 and bus control circuits 60, 65, 70, 75 and may be from East to West, North to South, or vice versa. As shown in FIG. 1B, each buffer 60, 65 70 or 75 contains 1024 selectors 10,001, 10,002, 10,003 . . . 11,024 Each selector has four signal inputs and four input selector lines. One of the signal inputs to each selector is one of the lines of data bus 51 from bus 50. Another signal input is ground. The other two signal inputs are outputs from array 30. In one case the input is the output from the array in the same row or column as the selector. In the other case the input is the output from the array in the row or column immediately adjacent the selector. In the case of the bottommost selector, two of the inputs to the selector are grounded. Each of the four input selector lines selects one of the four signal inputs to be the output from each selector. Signals on the four input selector lines are generated by microcontroller 20.

As a result of this arrangement, each buffer may provide to the array one of four sets of signals: the data input from buffer 50, recirculated data from array 30, recirculated data from an adjacent row or column in array 30, and all zeroes. In the case of recirculated data from an adjacent row or column, the buffer, in effect, has interconnected all the individual processor/memories of the array in a single line that spirals through the 1024 rows or columns of the array.

The above-described two-dimensional grid of interconnections is useful both for writing large amounts of data into array 30 as, for example, at the beginning of a computation and for reading out the contents of the array, for example, when it is necessary to interrupt processing and store the state of the array at such time. However, this interconnection array does not provide for rapid interchange of data in random directions between processor/memories 36 in the two-dimensional array. Moreover, to move data between an edge of the array and a specific processor/memory, it is necessary to shift it through all the processor/memories between the edge and the processor/memory of interest, which may require shifts through more than 500 processor/memories. Even where it is possible to make a single such shift at very high speeds, the need to do more than 500 such shifts makes the complete operation maddeningly slow. With the added complications of making such shifts at the same time for large numbers of processor/memories in random and independent directions, it becomes impossible to operate such a large two-dimensional grid of processor/memories at reasonable cost.

In the present invention, this problem is alleviated by also organizing and interconnecting processor/memories 36 in accordance with a second geometry. In particular, ICs 35 are organized and interconnected in the form of a Boolean n-cube of fifteen dimensions. Each IC is provided with logic circuitry to control the routing of messages through such an interconnection network; and within each IC, bus connections are provided to the thirty-two processor/memories so that every one of the more than one million processor/memories can send a message to every other. Moreover, large numbers of messages may be sent at any time and the messages may be routed in random directions.

To understand this connection pattern for ICs 35, it is helpful to number the ICs from 0 to 32,767 and to express these numbers or addresses in binary notation using fifteen binary digits as in Table I.

TABLE I ______________________________________ IC address IC address in decimal in binary notation notation ______________________________________ 0 000 000 000 000 000 1 000 000 000 000 001 2 000 000 000 000 010 3 000 000 000 000 011 4 000 000 000 000 100 . . . . . . . . . . . . 32765 111 111 111 111 101 32766 111 111 111 111 110 32767 111 111 111 111 111 ______________________________________

The concepts described above in reference to the interconnection of a two-dimensional grid can be readily extended to the interconnection of a fifteen-dimensional grid. Just as we identified each processor/memory 36 by two numbers, one of which specified its position in the first dimension of the two-dimensional grid and the other of which specified its position in the second dimension, so too we can use a-number to identify the position of a IC in each of the fifteen dimensions of the Boolean 15-cube. In an n-cube, however, an IC can have one of only two different positions, 0 and 1, in each dimension. Thus, the fifteen-digit IC address in binary notation as set forth in Table I also specifies the IC's position in the fifteen dimensions of the n-cube. For convenience, we will use the left-hand-most digit of the fifteen binary digits to specify the IC's position in the first dimension, and so in order to the right-hand-most digit which specifies the IC's position in the fifteenth dimension.

Moreover, because a binary digit can have only two values, zero or one, and because each IC is identified uniquely by fifteen binary digits, each IC has fifteen other ICs whose binary address differs by only one digit from its own address. We will refer to these fifteen ICs whose address differs by only one from that of a first IC as the first IC's nearest neighbors. Those familiar with the mathematical definition of a Hamming distance will recognize that the first IC is separated from each of its fifteen nearest neighbors by the Hamming distance one. Two examples of the addresses of an IC an its fifteen nearest neighbors are set forth in Table II.

TABLE II ______________________________________ Example I Example II ______________________________________ IC address: 000 000 000 000 000 010 101 010 101 010 Addresses of nearest neighbors: 000 000 000 000 001 010 101 010 101 011 000 000 000 000 010 010 101 010 101 000 000 000 000 000 100 010 101 010 101 110 000 000 000 001 000 010 101 010 100 010 000 000 000 010 000 010 101 010 111 010 000 000 000 100 000 010 101 010 001 010 000 000 001 000 000 010 101 011 101 010 000 000 010 000 000 010 101 000 101 010 000 000 100 000 000 010 101 110 101 010 000 001 000 000 000 010 100 010 101 010 000 010 000 000 000 010 111 010 101 010 000 100 000 000 000 010 001 010 101 010 001 000 000 000 000 011 101 010 101 010 010 000 000 000 000 000 101 010 101 010 100 000 000 000 000 110 101 010 101 010 ______________________________________

To connect ICs 35 in the form of a Boolean 15-cube, each IC is connected to its fifteen nearest neighbors. In FIG. 1, these connections are schematically represented by fifteen input lines 38 and fifteen output lines 39 although the actual connection paths are not shown because of the complexity they would add to the drawing. Each of these fifteen input lines 38 to each IC 35 is associated with a different one of the fifteen dimensions of the Boolean 15-cube and likewise each of the fifteen output lines 39 from each IC 35 is associated with a different dimension.

An appreciation of the interconnection pattern of a Boolean n-cube can be obtained from a consideration of the interconnections that would be used for an array of ICs 35' in Boolean n-cubes of three dimensions and four dimensions. FIG. 2 is a schematic illustration of the Boolean n-cube of three dimensions. This will be recognized as a conventional cube having eight vertices or nodes and twelve edges. The three dimensions of this cube are identified by Roman numerals, I, II, III. At each of the vertices is an IC 35'; and from each IC there are three output lines 39' that extend along the three dimensions of the cube to the IC's three nearest neighbors. As will be apparent, each IC 35' also has three input lines 38' that are the output lines from its three nearest neighbors. The bottom left-hand vertex is assumed to be the origin of this system and accordingly the IC at this vertex has the 0 position or address in the first, second and third dimensions of the three-dimensional cube of FIG. 2. This address will be written 000. Because each IC can be at one of only two positions in each dimension, the other ICs have addresses that are other three-digit combinations of 0 and 1 as shown in FIG. 2.

FIG. 3 illustrates a Boolean n-cube of four dimensions. In such a cube there are sixteen vertices and thirty-two edges. Again, an IC 35' is located at each vertex or node and is connected to its nearest neighbors by input lines 38' and output lines 39'. In this case, however, each IC has four nearest neighbors and therefore four input lines and four output lines extending along the four dimensions of the 4-cube. The position of each IC in the Boolean 4-cube is identified by a four-digit binary number as shown in FIG. 3; and the four dimensions of this 4-cube are identified by Roman numerals, I, II, III, IV as shown in FIG. 3.

The extrapolation of this pattern to cubes of higher dimensions will be apparent. In each case, the next higher dimension will have twice as many vertices and each IC will have one additional nearest neighbor. Accordingly, a Boolean 15-cube will have 32,768 vertices with an IC at each vertex and each IC will have fifteen nearest neighbors.

To permit communication through the interconnection pattern of the Boolean 15-cube, the computer system is operated so that it has both processing cycles and routing cycles. Computations are performed during the processing cycles. During the routing cycles, the results of the computations are organized in the form of message packets; and these packets are routed from one IC to the next by routing circuitry in each IC in accordance with address information that is part of the packet. The format of the message packet is depicted in FIG. 4 where it is seen to comprise fifteen bits of IC address, a format bit, another fifteen bits duplicating the IC address, five bits of address to the processor/memory in the IC, four bits of address to a register in the processor/memory, thirty-two bits of a message and one bit for error detection, a total of seventy-three bits. Optionally additional bits may be provided for error correction. The time duration of each bit illustratively is 0.1 to 1 microseconds corresponding to a frequency of 1 to 10 MegaHertz (MHz). FIG. 4 also illustrates the basic clock signals phi 1 and phi 2 used the system. These signals are non-overlapping two-phase clocks each having a period and a frequency that is the same as that of one bit of the message packet.

In the message packet the IC address information is relative to the address of the destination IC. Initially, it is the difference or the displacement between the address of the IC that is the source of the message and that of its destination. For example, if the address of the source IC is 010 101 010 101 010 and the address of the destination IC is 111 111 111 111 111, then the relative address that is generated at the source IC is 101 010 101 010 101. It will be apparent that this relative address is the logical EXCLUSIVE OR (XOR) of the addresses of the source and destination. It also will be apparent that 1-bits in the relative address identify the dimensions where the message packet is not in the correct position and therefore identify the dimensions through which the message packet must be moved to reach the destination IC. Thus, in the above example, where the addresses of the source and destination ICs are the same in each of the even-numbered dimensions, the message is already located in the proper position in those dimensions. However, in the odd dimensions where the addresses of the source and the destination ICs are different, the presence of 1-bits in the relative address for those dimensions indicates that it is necessary to move the message packet from one IC to another in that dimension.

As the message is routed from one IC to the next, the relative address is updated to take into account each move. This is conveniently done by complementing the bits in the duplicate IC address that are associated with the dimensions through which the message packet is moved. As a result, when the message packet arrives at the destination IC, the bits in the duplicate IC address will be all zeroes.

The routing circuitry in all the ICs is identical and operates in synchronism using the same routing cycle. For the example of FIG. 4 of a message packet of seventy-three bits with fifteen bits of IC address, the length of the routing cycle is eighty-eight cycles of the basic clock signal phi 1. In the first time period of each routing cycle, the routing circuitry at each IC tests the leading bit of the first copy of the IC address of each message packet in the routing circuitry to determine its level. If there is a 1-bit in this position and if the output line from that IC which is associated with the first dimension is not already busy, the message packet is routed down the first dimension output line to the IC's nearest neighbor in the first dimension. If the leading bit of the message packet address is a 0-bit, the message packet remains in the same IC because it is in the correct position in the first dimension. As a result, in the first time period, all message flow between the routing circuits of the ICs is along the first dimension.

The leading bit of the first copy of the IC address in the message packet is then discarded. If the message packet was routed to another IC, the corresponding address bit in the duplicate IC address is complemented in order to account for such move.

In the second address time period, the routing circuitry of each IC again tests the leading bit of the message packets present at the IC. However, this bit is the bit that indicates whether the message packet is in the proper position in the second dimension. If the bit is a 1-bit and if the second dimension output line is not already busy, the message packet is then routed out on the second dimension output line to that IC's nearest neighbor in the second dimension. If the first bit is a 0-bit, the message packet remains in the IC.

This process continues through fifteen address time periods, at the end of which the first fifteen address bits of the message packet will have been used up. However, if the needed output lines were available, a path will have been established through the Boolean 15-cube through which the remainder of the message packet can be transmitted.

An illustrative example of this routing scheme may be provided with reference to the Boolean 4-cube of FIG. 3. Assume that a message is to be sent from a source IC 35' having an address 1111 to a destination IC 35' having an address 0010. The relative address or displacement of the destination IC is obtained by taking the EXCLUSIVE OR of the address of the source and destination ICs. Accordingly, the relative address is 1101 which indicates that the message packet must be moved in the first, second and fourth dimensions but not in the third dimension. The routing circuit at the source IC then examines the first bit of the first copy of the relative address, identifies the 1-bit, routes the message along the first dimension to IC 0111 if this output line is available, discards the first 1-bit in the first copy of the IC address, and complements the first 1-bit in the duplicate IC address. In the second address time period, the routing circuit at the IC whose address is 0111 examines the first of the three remaining address bits and again finds a 1-bit. Accordingly, if the output line is available, the routing circuit sends the message packet to the IC whose address is 0011, discards the 1-bit in the first copy of the IC address that is representative of movement in the second dimension, and complements the 1-bit in the duplicate IC address in order to indicate that such movement has taken place.

In the third address time period, the routing circuit at address 0011 examines the first of the two remaining address bits and identifies a 0-bit. It therefore retains the message packet at this IC and discards the 0-bit. In the fourth address time period, the routing circuit at address 0011 examines the remaining address bit and identifies a 1-bit. Accordingly, it routes the message packet along the output line to IC 0010, discards the final bit of the first copy of the IC address and complements the final bit of the duplicate IC address.

Upon arriving at IC 0010, the routing circuit recognizes the absence of any 1-bit in the duplicate IC address that it tests and accordingly knows that the message packet has reached its destination. The message packet is then delivered to the processor/memory whose address is specified in the message packet. Further details concerning the routing process are set forth below in conjunction with FIGS. 11 through 16.

General Description of a Parallel Processing IC

Each IC 35 is fabricated as a very large scale integrated circuit (VLSI) on a single silicon chip. As shown in FIG. 5, sixty-four (=2.sup.6) of these chips are encased in individual chip packages 100 and are mounted on and interconnected through an individual printed circuit (PC) board 130. To provide for 32,768 such ICs, 512 (=2.sup.9) printed circuit boards are mounted in a suitable housing. Conventional wiring harnesses 132 interconnect these boards in both the two-dimensional grid and Boolean 15-cube geometries. For the configuration shown in FIG. 5, six of an IC's nearest neighbors will be mounted with it on the same PC board and the other nine will be on different PC boards.

Ninety-seven pins I02 are provided on each package to connect the chip to other chips on the PC board and to the rest of the system. The signals carried by these ninety-seven pins are set forth in Table III.

TABLE III ______________________________________ Pin Name No. Type Function ______________________________________ Phi1 1 Input clock pulse Phi2 1 Input clock pulse KSYNCH 1 Input Last clock in cycle RESET 1 Input Initialization on power up I0-1 2 Input ALU operation select I2 1 Input RegA source invert control, ALU operation select, InvA I3 1 Input RegB source invert control, InvB I4 1 Input Flag source invert contol, InvF I5-8 4 Input First register source and destination, Reg A0-3 I9-12 4 Input Second register source, Reg BO-3 I13-17 5 Input Register column select, Col 0-31 I18-21 4 Input Flag addresses, Source and Destination, Flag A0-3 I22-25 4 Input Flag selection for condi- tional, Cond 0-3 I26 1 Input Sense of condition test, 0=skip-on-zero, Not I27 1 Input Modify RegA with ComIn bits, Mod A CS0-1 2 Input Chip select, active low N0-7 8 In/Out Two-dimensional Grid extension to north SW0-7 8 In/Out Two-dimensional Grid extension to south and west E0-7 8 In/out Two-dimensional Grid extension to east (4 pins presently not used are available for future expansion) CubeIn0-14 15 Input N-cube input from nearest neighbors CubeOut0-14 15 Output N-cube output to nearest neighbors GLOBAL 1 Output NOR of all Global flags, open drain LED 1 Output Same as Global, open drain driver for LED VSS 3 Power Ground VDD 2 Power +5 volt power VBB 1 Power substrate bias ______________________________________

The pins named IO-127, RESET phi 1 phi 2 and KSYNCH are connected to bus 22 and receive the instruction signals, RESET signal and timing signals, phi t, phi 2 and KSYNCH from microcontroller 20. The pins CSO and CS1 are chip select pins that address the chip when the signals at both pins are low. The signals that select these pins are provided to array 30 by bus 24. Pins NO-7, SWO-7, and EO-7 provide connections to the nearest processor/memories on the adjacent chips to the North, South, West and East. Pins CubeInO-14 and CubeOutO-14 provide connections to the nearest neighbor processor/memories 36 in the Boolean 15-cube. The GLOBAL pin is connected over line 26 to microcontroller 20. The LED pin provides an output that drives a light emitting diode when active, thereby permitting the chip to generate a visual signal. This signal can be used for testing or supervisory purposes and even for display of computational results. The six around and power supply pins provide ground and power connections to the chip.

FIGS. 6A and 6B depict in block diagram form one of the 32,768 identical ICs 35 of array 30. As shown in FIG. 6A, the thirty-two processor/memories 36 of an IC are connected in a