WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Data driven processor performing parallel scalar and vector processing    
United States Patent5926643   
Link to this pagehttp://www.wikipatents.com/5926643.html
Inventor(s)Miura; Hiroki (Osaka-fu, JP)
AbstractA data driven data processing apparatus executes a data flow graph as a program for a plurality of processing elements arranged in a pipeline ring for transferring packets of scalar and vector operation data. An execution circuit on the pipeline ring controls execution of scalar and vector operations by separate scalar and vector operation circuits. The execution circuit has a single arithmetic logic circuit that performs both the scalar and vector operations, preferably processing in a time-shared, parallel manner.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5926643
Data driven processor performing parallel scalar and vector processing - US Patent 5926643 Drawing
Data driven processor performing parallel scalar and vector processing
Inventor     Miura; Hiroki (Osaka-fu, JP)
Owner/Assignee     Sanyo Electric Co. Ltd. (Osaka-Fu, JP)
Patent assignment
All assignments
Publication Date     July 20, 1999
Application Number     08/754,870
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     November 22, 1996
US Classification    
Int'l Classification    
Examiner     Kim; Kenneth S.
Assistant Examiner    
Attorney/Law Firm     Darby & Darby
Address
Parent Case     This is a division of application Ser. No. 08/551,694, filed Nov. 1, 1995, allowed Sep. 17, 1996, now issued U.S. Pat. No. 5,689,647 which is a continuation of application Ser. No. 08/370,459, filed Jan. 9, 1995, now abandoned which in turn is a continuation of application Ser. No. 08/067,268, filed May 24, 1993, now abandoned which in turn is a continuation of application Ser. No. 07,492,680, filed Mar. 13, 1990, now abandoned.
Priority Data     Mar 14, 1989 [JP] 1-63091 Mar 14, 1989 [JP] 1-63092 Jun 15, 1989 [JP] 1-153276 Jun 15, 1989 [JP] 1-153277 Oct 13, 1989 [JP] 1-266624 Oct 13, 1989 [JP] 1-266625 Oct 13, 1989 [JP] 1-266626
USPTO Field of Search    
Patent Tags     data driven processor performing parallel scalar vector processing
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A data-driven data processing apparatus (PE) which executes a data flow graph as a program therefor, comprising:

a pipeline ring which transfers thereon scalar operation packets each having scalar operation data and instruction code and vector operation instruction packets;

vector data storing means for storing vector data including element data;

execution means arranged on said pipeline ring operating in a time shared manner for executing both a scalar operation using said scalar operation data and said instruction code included in said scalar operation packets and a vector operation based upon said vector operation instruction packets and said element data;

scalar operation control means arranged on said pipeline ring to control supply of said scalar operation data to said execution means; and

vector operation control means connected to said execution means for controlling the supply of said element data to said execution means so as to execute the same operation with respect to said element data read from said vector data storing means when a vector operation packet designating the vector operation is input to said execution means.

2. A data driven data processing apparatus in accordance with claim 1, wherein said execution means includes an arithmetic logic unit, and said vector operation control means includes first supplying means for successively supplying said element data to said arithmetic logic unit.

3. A data driven data processing apparatus in accordance with claim 2, further comprising second supplying means arranged on said pipeline ring for supplying operand data included in a scalar operation packet to said arithmetic logic unit.

4. A data driven data processing apparatus which executes a data flow graph as a program therefor comprising:

a pipeline ring which transfers thereon scalar operation packets each having scalar operation data and instruction code and vector operation instruction packets;

vector data storing means for storing vector data including element data;

execution means arranged on said pipeline ring for executing a scalar operation using said scalar operation data and said instruction code included in said scalar operation packets and a vector operation based upon said vector operation instruction packets and said element data;

scalar operation control means which is arranged on said pipeline ring and controls supply said scalar operation data to said execution means:

vector operation control means connected to said execution means for controlling the supply of said element data to said execution means so as to execute the same operation with respect to said element data read from said vector data storing means when a vector operation packet designating vector operation is input to said execution means;

wherein said execution means includes an arithmetic logic unit, and said vector operation control means includes first supplying means for successively supplying said element data to said arithmetic logic unit,

second supplying means arranged on said pipeline ring for supplying operand data included in a scalar operation packet to said arithmetic logic unit; and

control means for controlling said arithmetic logic unit in a time-shared manner to process in parallel a vector operation packet and a scalar operation packet.
 Description Submit all comments and votes
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a parallel computer system including processing elements. More specifically, the present invention relates to a parallel computer system which is provided with a plurality of processing elements and data is sent and received between the processing elements in accordance with a processing element number assigned to each processing element.

2. Description of the Prior Arts

Recently, researches are advanced toward realization of a practical parallel computer. Especially, as advancement of a semiconductor technology, there are researches in which a communication control unit and a data processing unit are implemented as a processing element of a single-chip LSI and a large number of such processing element LSIs are connected to realize a parallel computer.

For example, in a parallel computer disclosed in pages 181-218 of "Nikkei Electronics" issued on Apr. 9, 1984, a data communication system in which a plurality of single-chip processing elements called as ImPP (Image Pipelined Processor) are connected in a ring manner to send and receive the data.

In addition, as disclosed in pages 1048-1049 of a collection of papers 2T-2 for the 38th national meeting (the first half year in 1989) of the Information Processing Society, the inventors et al. are advancing development of a large-scale parallel data driven computer EDDEN (Enhanced Data Driven ENgine) wherein a maximum of 1024 processing elements each of which is incorporated in a single chip are connected.

In such a parallel computer, generally, a processing element number is assigned to each of the processing elements to identify respective processing elements and the number (addressing number) of the processing element to which communication data is to be sent is added to the communication data. In each processing element, the addressing number included in the communication data arrived at a network control unit and its own processing element number are compared with each other and, if the both are coincident with each other, the communication data is fetched in a data processing unit.

In order to implement a data communication system in such a parallel computer, first, it is necessary to provide with a function for setting processing element numbers into respective processing elements. For example, in the aforementioned processing element called as ImPP, a method in which a module number indicated by 4 bits is set by means of a module setting buffer externally connected to the LSI is adopted (see FIG. 14 in page 206 of "Nikkei Electronics").

In such a case, since pins for data signals of the LSI are also used for setting the module number, it is necessary to connect an external circuit including a tri-state buffer, dip switch and etc. to each of the processing element chip. However, in each processing element chip, it is possible to separately provide with pins for setting the module number and pins for data signals. In this case, the above described tri-state buffer becomes unnecessary, but the total number of pins of the processing element LSI increases.

However, in the parallel computer system in which a large number of processing elements are connected as the aforementioned EDDEN, since a size of an LSI package of one processing element, a scale of the external circuit and etc. have a great influence on a size of a whole system, it is desired that the total number of pins of the LSI is made small and the scale of the external circuit is miniaturized as much as possible.

In addition, in the above described method wherein the processing element number is directly set from the outside to each of the processing element LSIs, if change of the processing element number become needed, all the dip switches corresponding to all the processing elements must be operated to change the processing element numbers, and thus, an extensive working amount becomes necessary.

SUMMARY OF THE INVENTION

Therefore, a principal object of the present invention is to provide a novel parallel computer system using a plurality of processing elements.

Another object of the present invention is to provide a parallel computer system in which it is possible to simply set processing element numbers to processing elements.

Another object of the present invention is to provide a parallel computer system in which change of processing element numbers being set in processing elements is easy.

Another object of the present invention is to provide a parallel computer system in which it is possible to efficiently transfer data on a network.

Another object of the present invention is to provide a parallel computer system in which it is possible to transfer data to an addressed processing element through a path of the shortest distance.

Another object of the present invention is to provide a parallel computer system in which it is possible to efficiently transfer data to a host computer.

Another object of the present invention is to provide a parallel computer system of a low cost.

Another object of the present invention is to provide a parallel computer system in which it is possible to utilize the same LSI as not only a processing element but also an interface.

Another object of the present invention is to provide a data transfer system using a self-synchronous shift register and capable of flexibly dealing with variations of data flows.

The other object of the present invention is to provide a data-driven data processing apparatus capable of efficiently performing a vector operation.

A parallel computer system in accordance with the present invention comprises: a plurality of processing elements, said plurality of elements respectively including network control units and data processing units and capable of being selectively set as a first mode or a second mode; a communication line for coupling a plurality of said network control units to construct a network; sending means for sending data including a processing element number onto the communication line; number holding means provided in each of the plurality of processing elements for holding a processing element number; and number setting means provided in each of the plurality of processing elements for setting the processing element number included in said data into the number holding means when said data is sent onto the communication line in said first mode.

For example, in response to a hardware reset signal from a host computer, each processing element is set in the first mode by means of mode indicating means such as a mode flag. The first mode is a specific mode for setting processing element numbers into respective processing elements. The mode flag is reset when the processing element number is set into the processing element and the processing element is set in the second mode, that is, a normal mode. When a packet including processing element number is sent onto the communication line, that is, network from the host computer coupled to the network in the first mode, in the processing element being set in the first (specific) mode, the packet is fetched irrespective of instructions included in the packet, and the processing element number included in a received packet is set in the number holding means such as a number register. At that time, the mode flag is reset so that the processing element is changed into the second mode.

In accordance with the above described parallel computer system, it is extremely easy to set the processing element numbers into the respective processing elements. More specifically, since in each processing element, the number data is fetched automatically when its own processing element number has not been set, by only sending the number data from the host computer, the processing element number can be set. Therefore, in comparison with a case where circuit elements are used as in the conventional ImPP, only a few additional circuit may be provided and a complex working such as an operation of a large number of dip switches is unnecessary, and therefore, the processing element numbers can be set easily.

A parallel computer system in accordance with the present invention comprises: a plurality of processing elements, said plurality of processing elements respectively including network control units and data processing units, said network control unit including a plurality of ports; a communication line for coupling the ports of a plurality of said network control units to construct a network; and routing means provided in each network control unit for determine a port through which data is to be outputted so that data outputted from said data processing unit or data transferred to a port from another processing element can be transferred to an objective processing element with the shortest distance.

Date includes an addressing processing element number indicative of that the data is to be received by what processing element, and the routing means includes means for calculating a difference between said addressing processing element number and the processing element number held in the number holding means such as a number register. If the plurality of processing elements are arranged in a form of a mesh network, the processing element number composed of a row direction number and a column direction number, and thus, said means for calculating a difference includes means for calculating a difference of the row direction numbers and means for calculating a difference of the column direction numbers. In addition, the plurality of processing elements are arranged in a form of a torus mesh network, the means for calculating a difference of the row direction numbers and the means for calculating a difference of the column direction numbers include modulo operation means, respectively.

By providing with such routing means, the transfer of the data on the network can be performed very efficiently.

As a data transfer line such as an input/output register provided at each port of the network control unit, a novel data transfer system in accordance with the present invention can be utilized.

A packet transfer system where a packet in which a plurality of words each including a data of a predetermined bits and a transfer control bit of 1 bit are sequentially arranged is transferred word by word, said transfer control bit being alternately set as "1" or "0" word by word, said packet transfer system comprises: a plurality of parallel latch means each having the bit number corresponding to a word of said packet, said plurality of parallel latch means being connected in a cascade fashion; clock means for alternately applying two clock signals having opposite phases to said plurality of latch means; comparing means for comparing respective transfer control bits of words latched in adjacent parallel latch means; and control means for enabling or disabling the clock signals given to the parallel latch means in response to an output of the comparing means.

The data transfer system can deal with variations of a data flow without using a complex simultaneously stopping circuit and simultaneously stop releasing circuit or the like.

In addition, the present invention can be implemented as a parallel computer system using a data-driven data processing apparatus set forth in the following.

A data-driven data processing apparatus which executes a data flow graph as a program, comprises: a pipeline ring For transfer a packet; instruction execution means arranged on the pipeline ring for executing instruction in accordance with said packet; vector data storing means for storing vector data including element data; and vector operation control means connected to said instruction execution means for repeatedly executing the same operation with respect to said element data of the vector data read from the vector data storing means when a vector operation packet designating a vector operation is inputted to the instruction execution means.

In the data-driven data processing apparatus, it is possible to easily and efficiently execute a vector operation in which the conventional data-driven processing apparatus was weak.

The objects and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the embodiments of the present invention when taken in conjunction with accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a data-driven parallel computer as one embodiment in accordance with the present invention.

FIG. 2 is a block diagram showing one example of a processing element (PE) of FIG. 1 embodiment.

FIGS. 3A, 3B, 3C, 3D, and 3E are illustrative views respectively showing formats of different packets used in FIG. 1 embodiment.

FIG. 4 is an illustrative view showing one example of a memory format of a program storage (PS) shown in FIG. 2.

FIG. 5 is an illustrative view showing a memory format of a filing control and color management unit (FCCM) shown in FIG. 2.

FIG. 6 is an illustrative view showing one example of a number register included in an input control unit (IC) shown in FIG. 2.

FIG. 7 is a block diagram showing one example of a network control unit (NC) shown in FIG. 2.

FIG. 8 is a block diagram showing a modified embodiment of FIG. 1 embodiment.

FIG. 9 is a block diagram showing another modified embodiment of FIG. 1 embodiment.

FIG. 10 is a block diagram showing one example of a self-synchronous shift register capable of being used as a data transfer line such as an input/output register of a network control unit (NC) of FIG. 7 embodiment.

FIG. 11 is a timing chart showing an operation of FIG. 10 embodiment.

FIG. 12 is a circuit diagram showing one example of a network and size judgment unit (NSJ) included in a network control unit shown in FIG. 2.

FIG. 13 is a block diagram showing an execution unit (EXE) and a vector operation control unit (VC) shown in FIG. 2.

FIG. 14 is a block diagram showing a vector operation control unit of FIG. 13 embodiment in more detail.

FIGS. 15A, 15B, 15C, 15D and 15E are illustrative views respectively showing formats of different packets used in an embodiment shown in FIGS. 13 and 14.

FIGS. 16 and 17 are illustrative views respectively showing examples of memory formats of an external data memory (EDM).

FIG. 18 shows FIGS. 18A and 18B, which are timing chart showing a time-shared operation of a scalar operation and a vector operation in FIG. 14 embodiment.

DETAIL DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a block diagram showing a data-driven parallel computer system as one embodiment in accordance with the present invention. In the following, a description will be made in a case where the present invention is applied to a data-driven parallel computer system, but it is pointed out in advance that the present invention can be also applied to a parallel computer system other than a data-driven type unless otherwise referred to.

With reference to FIG. 1, a parallel computer system 10 of this embodiment shown includes a host computer (this may be a high-level network) 12, and the host computer (or the high-level network) 12 is coupled to a network 16 through an interface 14. The interface 14 is a bus interface or a cluster interface.

The network 16 includes a number of processing elements PE00-PEmn arranged in a mesh network form. Each of the processing elements PE00-PEmn is specifically constructed as shown in FIG. 2. On the network 16, the processing elements PE00-PEmn are constructed as a torus mesh network. In addition, a torus mesh network means constitution wherein a large number of processing elements are arranged in a matrix form and it is possible to communicate data between arbitrary processing elements by row direction communication lines RC and column direction communication lines CC which circularly couple the processing elements in row directions to each other and the processing elements in column directions to each other, respectively. In FIG. 1 embodiment, n processing elements PE00-PE0n arranged in a first row are circularly (in a ring form) coupled to each other by a row direction communication line RC0, processing elements PE10-PE1n included in a second row are circularly coupled to each other by a row direction communication line RC1, and processing elements PEm0-PEmn of an m-th row are circularly coupled to each other by a row direction communication line RCm. Furthermore, m processing elements PE00-PEm0 arranged in a first column are circularly coupled to each other by a column direction communication line CC0, processing elements PE10-PEm1 included in a second column are circularly coupled to each other by a column direction communication line CC1, and processing elements PE1n-PEmn of an n-th column are circularly coupled to each other by a column direction communication line CCn.

In this torus mesh network 16, in order to communicate data between the host computer 12 and respective processing elements PE00-PEmn, as inserters, network interfaces NIF are respectively inserted in the respective column direction communication lines CC0, CC1, . . . , CCn.

A processing element PE shown in FIG. 2 is a constructed as an LSI of a single chip, which basically includes a network control unit (NC) 18 and a pipeline ring (PR) 20, and further includes a program storage (PS) 22, a firing control and color management unit (FCCM) 24, an execution unit (EXE) 26 and a queue memory (QM) 28. Between the PR 20 and the NC 18, the data is sent and received via an input control unit (IC) 30 and an input buffer (IBUF) 32, and an output control unit (OC) 34 and an output buffer (OBUF) 36.

In order to construct the torus mesh network 16 (FIG. 1), the NC 18 has four (4) bi-directional communication links or ports of north (N), east (E), south (S) and west (W), and controls data transfer between the PR 20 and the NC 18 and data transfer between the processing element and another processing element or the host computer 12. More specifically, the NC 18 inputs a packet inputted from any one of the four ports N, E, S and W to the IBUF 32 or outputs another port. Furthermore, the NC 18 outputs a packet from the PR 20 to any one of the four ports N, E, S and W. Then, as described later with reference to FIG. 7, the NC 18 has a self-routing function by which an inputted packet is outputted to a predetermined port so that the packet inputted to a given port can be sent to an addressed processing element with the shortest distance.

The IBUF 32 is constructed by a buffer register for a single packet so that a packet inputted from the NC 18 can be temporarily stored, and the packet inputted to the IBUF 32 is inputted to the PR 20 by the IC 30. Although not shown in detail, the IC 30 includes various registers in which a processing element number (PE) number of a processing element including the IC 30, an addressed PE number to which a dumped packet is to be sent, a status flag of the processing element, and etc, are stored. In addition, the IC 30 gives a packet to a vector operation control unit (VC) 38 as necessary.

The QM 28 is constructed as a FIFO (First-In First-Out) memory having a capacity of 64 packets, for example. The QM 28 absorbs a temporary increase of the number of packets on the PR 20 when the packets are inputted or when the packet is copied and variations of a data flow due to variations of times for various kinds of operations on the PR 20.

The PS 22 includes a program memory for storing a data-driven type program having a format as shown in FIG. 4, for example, i.e. connection information and operation codes at respective nodes and etc. of a data flow graph, and the PS 22 performs renewal of control information such as a node number, a copy of the data, application of constants, and etc.

The OC 34 outputs a packet on the PR 20 or a packet from the VC 30 to the OBUF 36. The OBUF 36 has a capacity of one packet so that an output packet from the OC 34 can be temporarily stored. As described above, a packet outputted to the NC 18 is outputted to any one of the four ports N, E, S and W from the OBUF 36.

The FCCM 24 performs queuing of a left and right operands to execute instruction, queuing of a large number of data for synchronous processing, getting of a color in calling a subroutine, or returning of a color in returning a subroutine. The FCCM 24 includes a memory having a format shown in FIG. 5, and the memory has a queue memory in which unfired operand data and an unused color are chained and stored, and a stack memory for storing vacant addresses of the queue memory. A packet outputted from the FCCM 24 is given to the EXE 26 in which various kinds of instructions such as 32-bit floating point operation, 32-bit integer operation, memory access, vector operation, input/output of structures, judgment of condition, branching and etc.

In addition, the processing element PE has an external data memory (EDM) 40 which is constructed by an SRAM of 512 Kbytes, for example, and the EDM 40 is also used as a local memory.

As described later in detail with reference to FIG. 13 and thereafter, the VC 38 controls of a vector operation instruction such as an operation of fellow vectors stored in the EDM 40, an operation of a vector with a constant, a total sum of vector data, a copy of a vector, and etc. However, in the VC 38, a normal memory access instruction is also controlled.

Packet formats used in this embodiment shown are illustrated in FIGS. 3A-3E. Packets are divided broadly into executive packets which are used for executing a program and non-executive packets used for other than execution of a program. However, in packet formats of this embodiment, a packet other than a packet holding structures is constructed as a fixed-length of 33 bits.times.2 words on the PR 20 and, on the network 18, 18 bits.times.4 words.

A format shown in FIG. 3A shows an executive packet processed in only the inside of the processing element, being constructed by 33 bits.times.2 words. A field HD is 1 bit, which is an identifier for identifying a first word (header) and a second word (tail) of 2-word packet. "1" is applied to the header and "0" is applied to the tail. A field EX is 1 bit, which is a flag for indicating that the packet is a packet to be outputted from the PR 20 to the outside of the processing element. Therefore, in a case of FIG. 3A, the flag EX is set as "0". A field MODE is 2 bits, which is an identification code for identifying a kind of packet such as an executive packet or non-executive packet. A field S-CODE is 3 bits, which is an identifying code for defining a content of process for the packet together with the identification code MODE. A field OPCODE-M is a main operation code of 5 bits. The main operation code OPCODE-M defines a kind of operation to be executed in the EX 26 and holds the number of data to be processed in synchronous with each other. A field OPCODE-S is a sub-operation code of 6 bits, which defines an instruction defined by the main operation code OPCODE-M in more detail. A field NODE# is constructed by 11 bits at maximum, which holds the node number of a data flow graph. A field COLOR is a color identification code of 4 bits, which is an identification code for identifying an execution environment such as co-use of a program by calling a sub-routine call, a process for a time-series data, and etc, in executing the same data flow graph in multiple. A field DATA is 32 bits, which holds numeral value data for an integer operation, a floating point operation, and etc. In addition, in the EXE 26, the second word of the executive packet is further increased by 32 bits. FIG. 3B shows an executive packet which is outputted to the outside from the processing element, which is constructed by 33 bits.times.2 words as similar to that of FIG. 3A. However, since FIG. 3B packet is to be outputted to the outside, the flag EX is set as "1". When the packet is transferred onto the network, the same is converted into a packet of 18 bits.times.4 words in the NC 18.

FIG. 3C shows a non-executive packet which is to be processed in only the inside of the processing element, which is constructed by 33 bits.times.2 words. A filed ADDRESS shown in FIG. 3C is 16 bits, which stores a memory address when data is loaded to the QM 28 or the EDM 40, or the data is dumped therefrom.

FIG. 3D shows an executive packet transferred onto the network, which is constructed by 18 bits.times.4 words. In FIG. 3D, a field PE# existing in the first word is 10 bits, which holds identification numbers, that is, the PE numbers for identifying a maximum of 1024 the processing elements. In addition, in a case where the packet is constructed by 4 words as shown in FIGS. 3D and 3E, flags RQ and HT each being 1 bit are added thereto. The flag RQ is added to a packet transferred on the network in a manner that the same is applied as "1" or "0" alternately word by word. Since the flag RQ is inverted at every timing when one word is transferred, by referring to the flag RQ, it is possible to recognize presence of a word. In addition, since the flag RQ is inverted for each word, as described later, the flag RQ functions as a transfer request signal for transferring a packet forward word by word in the data transfer line such as the PR 20. The flag HT is a packet recognition flag which is set as "1" in the first word, i.e. the header and in the last word, i.e. the tail which construct one packet, and "0" is set in the other words. In addition, by the flag RQ and the flag HT, it is possible to identify the header and the tail of the 4-word packet. Specifically, the flags RQ and HT are both set as "1" in a case of the header and, in a case of the tail, the flags RQ and HT are set as "0" and "1", respectively.

FIG. 3E shows a non-executive packet transferred on the network, which is constructed by 18 bits.times.4 words as similar to FIG. 3D packet.

In next tables I, II, III, IV and V, kinds or meanings of the packets which are identified by the identification codes MODE and S-CODE.

TABLE I ______________________________________ Normal executive packets MODE S-CODE Meaning ______________________________________ 11 000 Result packet processed in PS 11 001 Packet synchronously processed in FCCM 11 110 Right operand packet processed in FCCM 11 111 Left operand packet processed in FCCM 11 100 Get color packet processed in FCCM 11 101 Free color packet processed in FCCM 11 01- Operation packet processed in EXE ______________________________________

TABLE II ______________________________________ Load/dump packets MODE S-CODE Meaning ______________________________________ 10 000 Dump packet from PS 10 010 Dump packet from EDM 10 100 Dump packet from FCCM 10 110 Dump packet from various registers 10 001 Load packet to PS 10 011 Load packet to EDM 10 101 Load packet to FCCM 10 111 Load packet to various registers ______________________________________

TABLE III ______________________________________ Broadcast packets MODE S-CODE Meaning ______________________________________ 01 001 Broadcast packet of PS 01 011 Broadcast packet of EDM 01 101 Broadcast packet of FCCM 01 111 Broadcast packet of various registers ______________________________________

TABLE IV ______________________________________ Specific packets MODE S-CODE Meaning ______________________________________ 01 000 Packet for setting communication path of 01 010 structure 01 100 Structure packet 01 110 ______________________________________

TABLE V ______________________________________ Output packets MODE S-CODE Meaning ______________________________________ 00 000 Result packet of dump of PS 00 010 Result packet of dump of EDM 00 100 Result packet of dump of FCCM 00 110 Result packet of dump of various registers 00 001 00 011 End trigger(sink) packet indicative of 00 101 termination of execution of program 00 111 ______________________________________

Now, constructions of respective units shown in FIG. 2 and operations therein will be described in more detail.

A format of the PS 22 is shown in FIG. 4, and a capacity of the PS 22 is a range of 1k.times.32 bits-2k.times.32 bits, and the field NODE# is 11 bits at maximum. When a result packet is arrived at the PS 22, the program memory is read while the field NODE# of the packet is used as an address therefor, and 28 bits from the fourth bit to the 31st bit of a data as read is stored in a predetermined field of the first word of an arrived packet. Thus, a node number is renewed. At this time, if a flag CONST is "0", the arrived packet is outputted while the second word, i.e. the field DATA of the arrived packet remains. If the flag CONST is "1", since the constant data is stored in the next address, the constant data is stored in the second word of the arrived packet, being outputted. Thus, the constant is applied. In addition, if a flag COPY is "1", a next address of the program memory is further read and the above described process is repeated. Thus, a copy operation is executed.

In the FCCM 24, two packets having the same identification code (tag) which is 15 bits in total of the identification code NODE# (11 bits) and the color code COLOR (4 bits) are queued, and the FCCM 24 outputs the same to the EXE 26 as an operand pair. Therefore, if there is a queue memory in which 32k packet can be stored, it is possible to easily firing-control by accessing the queue memory by the tag of 15 bits. However, a memory capacity assigned to such a queue memory is 256 packets at most. Therefore, in this embodiment shown, 7 bits of the 15-bit tag is used as a hash address, and the coincidence of the tags is detected by comparing the remaining 8 bits by means of a comparator. Then, as shown in FIG. 5, as the queue memory, there are provided with a hash memory directly accessed by the hash address and a buffer memory which stores packets being chained by a pointer when a hash crash occurs (when no fellow exist in the hash memory), and a portion of the buffer memory is also used for color management. A stack memory (128 words) is provided as a mechanism for controlling vacant addresses of the buffer memory. In addition, the hash memory and the buffer memory are formed by dividing a memory of 256 packets. Furthermore, in the field of DATA of the hash memory or a buffer memory, operand data being queuing or unused color is stored. In a field TAG, 8 bits of the tag other than the bit used as the hash address are stored. A flag LR distinguishes that the operand data being queuing is a left operand or a right operand by means of "1" or "0". A flag FE distinguishes whether or not there is an operand being queuing by means of "1" or "0". A flag NX distinguishes whether or not a succeeding operand is chained by a pointer by means of "1" or "0".

In the FCCM 24 having such a hash memory and a buffer memory, a firing control process is executed set forth in the following. More specifically, in an initial state, the flags FE and NX of all the addresses are cleared (load "0") and vacant addresses of the buffer memory are loaded in the stack pointer. Thereafter, when a right or left operand packet is arrived, the hash memory is read by the hash address derived from the fields NODE# and COLOR of that packet. At this time, if the flag FE is "0", the packet is written in the address designated by the hash address, and thereafter, the flag FE is set as "1". If the flag FE is "1", the tag of the hash memory addressed by the hash address and the tag of the packet are compared with each other. When the both are coincident, the field DATA of the operand packet and the field DATA of the hash memory addressed by the hash address are paired and outputted. If the flag NX is "0", the flag FE is cleared and, if the flag NX is "1", the buffer memory is read by a pointer value of an address designated by the hash address, and thereafter, the same is written in an address of the hash memory designated by the hash address. In addition, the pointer value is pushed down into the stack pointer by using the pointer value as a vacant address.

In the comparison of the tags, if the both tags are not coincident with each other, the flag NX of the address designated by the hash address is referred to. If the flag NX is "0", the vacant address of the buffer memory is popped up from the stack pointer, and the vacant address is written in the pointer field designated by the hash address, and thereafter, the flag NX is set as "1" and the packet is written into the vacant address of the buffer memory. If the flag NX is "1", the buffer memory is read by the pointer value of the address designated by the hash address to compare the tags. If the both tags are coincident with each other, the field DATA of the operand packet and the field DATA of the address designated by the pointer value are paired and outputted, and thereafter, the flag FE is reset and an address that becomes vacant is pushed down in the stack. Furthermore, if the operand packet thus fired is the last packet of a queue list, the pointer is traced in a reverse direction and the flag NX is set as "0". The operand as fired is one on a midway of the queue list, the pointer is operated so that the operand is deleted from the queue list. In addition, the tags are not coincident with each other in the above described comparison, the flag NX is referred to. If the flag NX is "0", the vacant address of the buffer memory is popped up from the stack and the vacant address is written in the pointer field of the address designated by the pointer value, and the flag NX is set. In addition thereto, the packet is written in the vacant address of the buffer memory. If the flag NX is "1", the pointer value of the address designated by the pointer value is renewed as a new pointer value, and thereafter, the above described process is repeated.

Then, a fired packet which holds an operand pair is renewed as an operation packet to be processed in the EXE 26 while the most significant bit of the field S-CODE is set as "0".

In addition, in the FCCM 24, a synchronous process of multiple inputs (Nsync) is executed as set forth in the following. It is waited that N packets having the same tag are arrived, and when all the N packets are arrived at, one result packet in which the field S-CODE is set as "000" is outputted. A packet of the multiple inputs synchronous process holds the number of packets to be synchronously processed in the main operation code OPCODE-M. Therefore, although the content of the process is substantially the same as that of the above described firing control, not the operand but the number of packets is stored in the field DATA of the queue memory, and the data of the number of packets is decremented at every timing when the coincidence of the tags is detected. Then, the tags are successively coincident with each other and the number of packets becomes zero, the least significant bit of the field S-CODE of the packet as arrived lastly is cleared and outputted as a result packet.

A color process is performed as set forth in the following. In an initial state, usable colors are stored in the buffer memory in a state where the same are chained by a pointer while the 255th address of the queue memory is at the head. Vacant addresses except for the addresses in which the colors are stored are loaded in the stack and, in response thereto, a stack pointer is set in a correct value. If the get color packet or free color packet is arrived at, the head of the color list, that is, the 255th address is accessed. The aforementioned firing-control is performed while the get color packet is a right operand and the free color packet is a left operand. Such a firing-control is different from the normal firing control in a point that a firing occurs without condition even though the tags are not coincident with each other. Then, when the firing occurs, the field S-CODE is renewed as "01-" and such a operand is outputted as a data pair. In addition, a color to be returned is held in the field DATA of the free color packet and its own node number and a color are held in the field DATA of the get color packet. NOP (no operation) is held as an operation code in the get color packet and RETURN is held as an operation code in the free color packet. The operation code RETURN is an operation code for storing the right operand data in the field NODE# and COLOR of the header of the left operand data. Thus, even if any one of the get color packet and the free color packet is first arrived at the FCCM 24 and the firing occurs, at least a timing when such a packet is outputted from the EXE 26, the node number and the color become of those being held in the get color packet and a new color as gotten is held in the data field of the second word.

In addition, as described above, in the IC 30, there is provided with a PE number register as shown in FIG. 6 for storing an inherent number of the processing element, i.e. PE number, and in FIG. 6, "X" is the number of a lateral direction (east-west direction) in the torus mesh network 16 (FIG. 1), that is, the column number and "Y" is the number of a longitudinal direction (south-north direction), that is, the row number. The PE number which inherently identifies each processing element is constructed by "X" and "Y". However, a flag PEACT shown in FIG. 6 is a flag indicative of whether or not the PE number has been set, and the same is "0" if no PE number is set and becomes "1" when the PE number is set.

One aspect of the parallel computer 10 in accordance with the present invention exists in a point that a process mode of the NC 18 is changed in accordance with a state of the flag PEACT shown in FIG. 6. More specifically, the NC 18 sets a specific mode when the flag PEACT is "0", that is, no PE number is set, and a normal mode is set when the flag is "1". In the specific mode, the NC 18 regards all the packets (shown in the previous table IV and inputted from the all the ports) as the packets to be inputted to the NC 18 itself, and inputs the packets in the PR 20 and executes a predetermined process designated by the identification code included in the packet. In addition, in the normal mode, the NC 18 compares the PE number existing in the first word of the packet as arrived at with the PE number being set in the number register (FIG. 6) of itself and inputs the packet to the PR 20 only when the both are coincident with each other. If the both are not coincident, the NC 18 outputs the packet as arrived at to any one of the ports in accordance with a predetermined routing algorithm (described later in detail with reference to FIG. 7).

Now, the NC 18 included in each of the processing elements will be described in detail with reference to FIG. 7. Input shift registers RNI, REI, RSI and RWI and output shift registers RNO, REO, RSO and RWO are connected to the ports N, E, S and W in parallel with each other, respectively. Each of these shift registers is a self-synchronous shift register as shown in FIG. 10, for example, and constructed by 18 bits.times.4 stages. Outputs of the input shift registers RNI, REI, RSI and RWI are given to branch circuits R1, R2, R3 and R4, respectively. Inputs of the output shift registers RNO, REO, RWO and RSO are given from joining circuits M1, M2, M3 and M4, respectively.

The branch circuit R1 outputs an inputted packet from the input shift register RNI to the joining circuit M2, M4 or M5. As similar thereto, a packet inputted from the input shift register REI is outputted from the branch circuit R2 to the joining circuit M1, M3, M4 or M5. A packet from the input shift register RSI is inputted to the joining circuit M1, M2 or M5 through the branch circuit R4. A packet from the input shift register RWI is branched to the joining circuit M1, M2, M4 or M5 in the branch circuit R3.

The joining circuit M1 joins packets respectively given from the branch circuits R2-R5 to give the same to the output shift register RNO, that is, the port N. The joining circuit M2 joins the packets given from the branch circuits R2 and R5 and outputs the same to the output shift register REO, that is, the port E. The joining circuit M3 joins packets from the branch circuits R3, R1, R4 and R5 to output to the output shift register RWO, that is, the port W. The joining circuit M4 joins packets sent from the branch circuits R1, R3, R2 and R5 to input to the output shift register RSO, that is, the port S.

In addition, the branch circuit R5 receives a packet from the OC 34 included in the PR 20 (FIG. 2) and thus the OBUF 36 and outputs to any one of the joining circuits M1-M4. The joining circuit M5 joins packets sent from the branch circuits R1-R4 to output the IC 30 included in the PR 20 (FIG. 2) and thus the IBUF 32.

Presupposing the following condition, a routing algorithm of each of the branch circuits R1-R5 is as set forth in the following items (1)-(5).

A network is a torus mesh as shown in FIG. 1, and a network size thereof is m.times.n, an own PE number is represented by x, y, an addressed PE number is represented by X, Y, and in the NC 18 of each processing element, differences .DELTA.x and .DELTA.y between the addressed PE number and its own PE number are calculated in accordance with following formulas.

where,

In addition, the PE numbers are y=0, 1, 2, . . . , m in sequence in a direction from north (N) to south (S) and x=0, 1, 2, . . . , n in sequence in a direction from west (W) to east (E). Then, a packet having an identification code MODE of "00" is a packet to be transferred to the host computer 12 and a flag PEACT indicates whether or not the PE number has been applied.

(1) Branch circuit R1

When MODE.noteq.00 and (PEACT=0 or .DELTA.y=0), a packet is outputted to the joining circuit M5.

When MODE=00 and PEACT=1 and .DELTA.y=0, a packet is outputted to the joining circuit M2.

Otherwise, a packet is outputted to the joining circuit M4.

(2) Branch circuit R2

When MODE.noteq.00 and (PEACT=0 or .DELTA.x=.DELTA.y=0), a packet is outputted to the joining circuit M5.

When PEACT=1 and .DELTA.x=0 and .DELTA.y>0, a packet is outputted to the joining circuit M4.

When PEACT=1 and .DELTA.x=0 and .DELTA.y<0, a packet is outputted to the joining circuit M1.

Otherwise, a packet is outputted to the joining circuit M3.

(3) Branch circuit R3

When MODE.noteq.00 and (PEACT=0 or .DELTA.x=.DELTA.y=0), a packet is outputted to the joining circuit M5.

When PEACT=1 and .DELTA.x=0 and .DELTA.y>0, a packet is outputted to the joining circuit M4.

When PEACT=1 and .DELTA.x=0 and .DELTA.y<0, a packet is outputted to the joining circuit M1.

Otherwise, a packet is outputted to the joining circuit M2.

(4) Branch circuit R4

When MODE.noteq.00 and (PEACT=0 or .DELTA.y=0), a packet is outputted to the joining circuit M5.

When MODE.noteq.00 and PEACT=1 and .DELTA.y=0, a packet is outputted to the joining circuit M4.

Otherwise, a packet is outputted to the joining circuit M1.

(5) Branch circuit R5

When .DELTA.x=0 and .DELTA.y>0, a packet is outputted to the joining circuit M4.

When .DELTA.x=0 and .DELTA.y<0, a packet is outputted to the joining circuit M4.

When .DELTA.x<0, a packet is outputted to the joining circuit M3.

Otherwise, a packet is outputted to the joining circuit M2.

In addition, packets inputted to the joining circuits M1, M2, M3 and M4 are inputted to the output shift register RNO of the north port of N, output shift register REO of the east port E, output shift register RWO of the west port W and output shift register RSO of the south port R, respectively. In addition, a packet inputted to the joining circuit M5 is inputted to the PR 20 via the input control unit IC (FIG. 2).

In the NC 18 shown in FIG. 7, a routing is executed in accordance with the above described algorithm when the header of the packet is arrived at, and the succeeding data are outputted to the same route until the tail of that packet is arrived at.

Furthermore, in FIG. 7 embodiment, presupposing that the respective processing elements PE are arranged in the torus mesh network 16 as shown in FIG. 1 or FIG. 8, modulo is calculated in accordance with the above described formulas. However, in a case where the respective processing elements PE are arranged in a non-torus mesh network, that is, in a case where respective communication lines RC and CC shown in FIG. 1 are not ring form, such modulo calculation become unnecessary. In addition, the above described algorithm of the branch circuits R1-R5 are only examples, and therefore, modification or change may be possible arbitrarily.

FIG. 8 shows a modified embodiment of FIG. 1 embodiment, which includes 16 (=4.times.4) processing elements PE00-PE33 arranged in a manner of a torus mesh on the network 16. Then, a packet outputted from the host computer 12 is converted into a data format of 4-word construction as shown in FIG. 3D or 3E in the interface 14 to be inputted to the NC 18 (FIG. 2) of each processing element via the input line IN and the network interface NIF.

In FIG. 8 embodiment, at a timing when a power of this parallel computer 10 is turned on, the contents of the PE number registers (FIG. 6) of respective processing elements are indefinite. Therefore, the host computer 12 gives a hardware reset signal RST to all the processing elements PE00-PE33. In response thereto, the flags PEACT (FIG. 6) are reset as "0" in all the processing elements PE00-PE33, and as described above, an operation mode of each processing element is set as a specific mode.

Succeedingly, a load packet having "00" in the filed DATA as shown in FIG. 3E to be loaded in the PE number register from the host computer 12 is outputted. A first load packet is sent to the port W of the NC 18 (FIG. 2) of the processing element existing at a left and upper end in FIG. 8 through the interface 14, input line IN and network interface NIF. At this time, since the branch circuit R3 (FIG. 7) included in the NC of its processing element is set as a specific mode because the flag PEACT is "0", the branch circuit R3 inputs a packet as arrived at to the joining circuit M5, that is, the PR 20 (FIG. 2). Therefore, "00" is set in the number register (FIG. 6) of the processing element as its own PE number, and the flag PEACT is set as "1". Thereafter, the processing element existing at the left and upper end in FIG. 8, that is, in the first row and the first column is identified as "PE00".

Succeedingly, a second load packet in which "01" is held in the field DATA thereof is outputted from the host computer 12. The second load packet is arrived at the port W of the NC 18 of the processing element PE00 as similar to the first load packet, but the flag PEACT of the processing element PE00 has been set as "1", and therefore, in the branch circuit R3 included in the NC 18 of the processing element PE00, in accordance with a condition of .DELTA.x.noteq.0, the second load packet is outputted to the joining circuit M3, i.e. the port E as it is. Therefore, the second load packet is arrived at the port W of the NC 18 of the processing element in the first row and second column in FIG. 8. At this time, since the flag PEACT is "0", as done in the processing element PE00, the PE number "01" is set in its number register (FIG. 6) and the flag PEACT is set as "1". Thereafter, the processing element of the first row and the second column in FIG. 8 is identified as "PE01". As similar thereto, the PE number "02" is set in the processing element of the first row and the third column in FIG. 8 is set and the PE number "03" is set in the processing element of the first row and the fourth column.