|
Description  |
|
|
FIELD OF THE INVENTION
The invention relates generally to the field of digital computer systems, and more particularly to computer systems including a plurality of processors operating generally in parallel.
BACKGROUND OF THE INVENTION
A digital computer system generally comprises three basic elements, namely, a memory element, an input/output element and a processor element. The memory element stores information in addressable storage locations. This information includes
data and instructions for processing the data. The processor element fetches information from the memory element, interprets the information as either an instruction or data, processes the data in accordance with the instructions, and returns the
processed data to the memory element. The input/output element, under control of the processor element, also communicates with the memory element to transfer information, including instructions and the data to be processed, to the memory, and to obtain
processed data from the memory.
Most modern computing systems are considered "von Neumann" machines, since they are generally constructed according to a paradigm attributed to John von Neumann. Von Neumann machines are characterized by having a processing element, a global
memory which stores all information in the system, and a program counter that identifies the location in the global memory of the instruction being executed. The processing element executes one instruction at a time, that is, the instruction identified
by the program counter. When the instruction is executed, the program counter is advanced to identify the location of the next instruction to be processed. In many modern systems, the program counter is actually advanced before the processor has
finished processing the current instruction.
Von Neumann systems are conceptually uncomplicated to design and program, since they do only one operation at a time. A number of advancements have been made to the original von Neumann paradigm to permit the various parts of the system, most
notably the various components of the processor, to operate relatively independently and to achieve a significant increase in processing speed. One such advancement is pipelining of the various steps in executing an instruction, including instruction
fetch, operation code decode (a typical instruction includes an operation code which identifies the operation to be performed, and in most cases one or more operand specifiers, which identify the location in memory of the operands, or data, to be used in
executing the instruction), operand fetch, execution (that is, performing the operation set forth in the operation code on the fetched operands), and storing of processed data, which steps are performed relatively independently by separate hardware in
the processor. In a pipelined processor, the processor's instruction fetch hardware may be fetching one instruction while other hardware is decoding the operation code of another instruction, fetching the operands of still another instruction, executing
yet another instruction, and storing the processed data of a fifth instruction. Since the five steps are performed sequentially, pipelining does not speed up processing of an individual instruction. However, since the processor begins processing of
additional instructions before it has finished processing a current instruction, it can speed up processing of a series of instructions.
A pipelined processor is obviously much more complicated than a simple processor in a von Neumann system, as it requires not only the various circuits to perform each of the operations (in a simple von Neumann processor, many circuits could be
used to perform several operations), but also control circuits to coordinate the activities of the various operational circuits. However, the speed-up of the system can be dramatic.
More recently, some processors have been provided with execution hardware which includes multiple functional units each being optimized to perform a certain type of mathematical operation. For example, some processors have separate functional
units for performing integer arithmetic and floating point arithmetic, since they are processed very differently. Some processors have separate hardware functional units each of which performs one or only several types of mathematical operations,
including addition, multiplication, and division operations, and other operations such as branch control and logical operations, all of which can be operating concurrently. This can be helpful in speeding up certain computations, most particularly those
in which several functional units may be used concurrently for performing parts of a single computation.
In von Neumann processors, including those which incorporate pipelining or multiple functional units (or both, since both may be incorporated into a single processor), a single instruction stream operates on a single data stream. That is, each
instruction operates on data to enable one calculation at a time. Such processors have been termed "SISD," for "single-instruction/single-data." If a program requires a segment of a program to operate on a number of diverse elements of data to produce a
number of calculations, the program causes the processor to loop through that segment for each calculation. In some cases, in which the program segment is short or there are only a few data elements, the time required to perform such a calculation may
not be unduly long.
However, for many types of such programs, SISD processors would require a very long time to perform all of the calculations required. Accordingly, processors have been developed which incorporate a large number of processing elements all of
which may operate concurrently on the same instruction stream, but with each processing element processing a separate data stream. These processors have been termed "SIMD" processors, for "single-instruction/multipledata." An example of such a system is
disclosed in U.S. Pat. No. 4,598,400, issued Jul. 1, 1986, in the name of W. Daniel Hillis, for Method And Apparatus For Routing Message Packets.
SIMD processors are useful in a number of applications, such as image processing, signal processing, artificial intelligence, database operations, and computer simulation of a number of things, such as electronic circuits and fluid dynamics. In
image processing, each processing element may be used to perform processing on a pixel (picture element) of the image to enhance the overall image. In signal processing, the processors concurrently perform a number of the calculations required to
perform such computations as the "Fast Fourier transform" of the data defining the signal. In artificial intelligence, the processors perform searches on extensive rule bases representing the stored knowledge of the particular application. Similarly,
in database operations, the processors perform searches on the data in the database, and may also perform sorting and other operations. In computer simulation of, for example, electronic circuits, each processor may represent one part of the circuit,
and the processor's iterative computations indicate the response of the part to signals from other parts of the circuit. Similarly, in simulating fluid dynamics, which can be useful in a number of applications such as weather prediction and airplane
design, each processor is associated with one point in space, and the calculations provide information about various factors such as fluid flow, temperature, pressure and so forth.
Typical SIMD systems include a SIMD array, which includes the array of processing elements and a router network, a control processor and an input/output component. The input/output component, under control of the control processor, enables data
to be transferred into the array for processing and receives processed data from the array for storage, display, and so forth. The control processor also controls the SIMD array, iteratively broadcasting instructions to the processing elements for
execution in parallel. The router network enables the processing elements to communicate the results of a calculation to other processing elements for use in future calculations.
More recently, in massively parallel computing systems, multiple instruction/multiple data (MIMD) systems have been developed in which a plurality of processors each operates in response to its own instruction stream processes data. In a MIMD
system, each processor can process data based on its individual programming and the results of previous processing, separately from the other processors, which can be an advantage over SIMD systems in connection with some types of problems. However, a
processor often requires the results of processing by other processors, or the processors must synchronize their respective processing statuses, which can be more easily achieved in a SIMD system.
As a result, new computer architectures are being developed, generally known as "S/MIMD" (for "synchronous MIMD"). In an S/MIMD system, a control processor transmits commands to a set of processors, each of which processes data in response to
the command. In response to each command, a processor may execute one or more instructions. S/MIMD systems thus maintain a single point of control, but the control is on a command-by-command basis, rather than on an instruction-by-instruction basis.
The particular instruction or series of instructions executed by a particular processor in response to a command may depend on the command itself, as well as on results of previous processing by the particular processor, and perhaps on results of
previous processing by other processors. In any case, the control processor provides a degree of synchronization for the processors which receive commands therefrom.
SUMMARY OF THE INVENTION
The invention provides a new and improved parallel computer including an arrangement for performing in a parallel manner a context switch operation.
In brief summary, the new arrangement provides in one aspect a parallel computer comprising a plurality of processing elements and a scalar processor all interconnected by a communications network. The communications network further comprises a
data router for transferring data between processors and a control network for transferring program commands, status information, and synchronization signals between processors. The scalar processor and the processing elements process a plurality of
programs, with the scalar processor and the processing elements processing each program in parallel. The scalar processor, while processing each program, generates commands and transfers them to the processing elements over the communications network,
and each of the processing elements processes data associated with a particular program in response to each command received from the communications network. The scalar processor periodically further generates a command to enable the processing elements
to, in parallel, switch processing of a program they are currently processing and begin processing of another program.
In another aspect, the new arrangement provides a parallel computer comprising a plurality of processing elements and a control element. Each of the plurality of processing elements processes user programs in response to program commands. Each
processing element further comprises a context switch program for enabling it, in response to receipt of a context switch command, to switch from processing of a user program it is then processing to processing of another user program. The control
element generates program commands for transfer to the processing elements generally in parallel to enable the processing elements to process the user programs such that all of the processing elements are generally processing the same user program in
parallel. The control element in response to selected conditions transmits context switch commands to the processing elements to enable the processing elements to, in parallel, switch from processing of a user program it is then processing to processing
of another user program.
BRIEF DESCRIPTION OF THE DRAWINGS
This invention is pointed out with particularity in the appended claims. The above and further advantages of this invention may be better understood by referring to the following description taken in conjunction with the accompanying drawings,
in which:
FIG. 1 is a general block diagram of a massively parallel computer system constructed in accordance with the invention;
FIG. 2A and 2B, together with FIG. 2B-1 and 2B-2, are block diagrams useful in understanding the structure and operation of the data router of the computer system of FIG. 1;
FIG. 3 is a diagram depicting the structure of message packets transferred over the data router;
FIG. 4A, together with FIG. 4A-1 through 4A-4, along with FIG. 4B through 4E are block and logic diagrams useful in understanding the structure and operation of the control network of the computer system of FIG. 1;
FIG. 5 is a diagram depicting the structure of message packets transferred over the control network;
FIG. 6 is a general block diagram of a processing element in the computer system depicted in FIG. 1;
FIG. 7A-1 comprises a general block diagram of a data router interface circuit useful in interfacing the processing element depicted in FIG. 6 to the data router of the computer system depicted in FIG. 1, and FIG. 7A-2A and 7A-2B contain
definitions of registers in the data router interface;
FIG. 7B-1 comprises a general block diagram of a control network interface circuit useful in interfacing the processing element depicted in FIG. 7A-1 to the control network of the computer system depicted in FIG. 1, and FIG. 7B-2A and 7B-2B
contain definitions of registers in the control network interface;
FIG. 8SP-1 through 8SP-12, 8PE-1 through 8PE-12, and 9A through 9K detail the operations performed in connection with a context switch operation.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
I. General Description of Computer System
II. General Description of Communications Networks
A. Data Router
B. Control Network
III. General Description of Processing Element
A. General
B. Data Router Interface
C. Control Network Interface
IV. Operations of System in Connection with a Context Switch Operation
I. General Description of Computer System
The invention provides new and improved facilities for controlling a massively-parallel computing system. Prior to describing an illustrative embodiment of the particular invention, it would be helpful to describe in detail one embodiment of a
massively-parallel computing system which makes use of the invention. Further details of the embodiment are disclosed in U.S. patent application Ser. No. 07/592,029, entitled Parallel Computer System, filed Oct. 3, 1990, in the name of David C.
Douglas, et al., now abandoned in favor of several continuations-in-part including U.S. patent application Ser. No. 07/746,035 (see below) and U.S. patent application Ser. No. 07/746,038 (see below); U.S. patent application Ser. No. 07/746,035,
entitled Massively Parallel Computer Partitionable Through Switchable Fat-Tree Control Network, filed Aug. 16, 1991, in the name of David C. Douglas, et al., now matured into U.S. Pat. No. 5,353,412; and U.S. patent application Ser. No. 07/746,038,
entitled Input/Output System For Massively Parallel Computer System, filed Aug. 16, 1991, in the name of David Wells, et al., now matured into U.S. Pat. No. 5,361,363, all of which are assigned to the assignee of the present application.
FIG. 1 is a general block diagram of a massively parallel computer system 10 which makes use of the invention. With reference to FIG. 1, system 10 includes a plurality of processing elements 11(0) through 11(N) (generally identified by reference
numeral 11), scalar processors 12(0) through 12(M) (generally identified by reference numeral 12) and input/output processors 13(0) through 13(K) (generally identified by reference numeral 13). Input/output units (not shown), such as, for example, disk
and tape storage units, video display devices, printers and so forth may be connected to the input/output processors to supply information, including data and program commands, for processing by the processing elements 11 and scalar processors 12 in the
system, and may also receive processed data for storage, display and printing. The scalar processors 12 may also be connected to input/output units including, for example, video display terminals which permit one or more operators to generally control
system 10. The system 10 may also include a plurality of spare processing elements 11s(0) through 11s(J) (generally identified by reference numeral 11s) which may be used as described below. The system 10 further includes a control network 14, a data
router 15 and a diagnostic network 16. The control network 14 permits one or more scalar processors 12 to broadcast program commands to processing elements 11. The processing elements 11 which receive the commands execute them generally concurrently.
The control network 14 also permits processing elements 11 to generate status information which they may supply to the scalar processors 12. The control network 14 is also used by the processing elements 11 to perform selected types of arithmetic
operations, termed "scan" and "reduce" operations, as described below. The control network 14 may also be used to provide status and synchronization information among the processing elements 11.
The data router 15 transfers data among the processing elements 11, scalar processors 12 and input/output processors 13. In particular, under control of the scalar processors 12, the input/output processors 13 retrieve data from the input/output
units and distribute the data to the respective scalar processors 12 and processing elements 11. During processing, the scalar processors 12 and processing elements 11 can transfer data among themselves over the data router 15. In addition, the
processing elements 11 and scalar processors 12 can transfer processed data to the input/output processors 13. Under control of the scalar processors 12, the input/output processors 13 can direct the processed data that they receive from the data router
15 to particular ones of the input/output units for storage, display, printing, or the like. The data router 15 in one particular embodiment is also used to transfer input/output commands from the scalar processors 12 to the input/output processors 13
and input/output status information from the input/output processors 13 to the scalar processors 12.
The diagnostic network 16, under control of a diagnostic processor (not shown in FIG. 1), facilitates testing of other portions of the system 10 to identify, locate and diagnose defects. The diagnostic processor may comprise one or more of the
scalar processors 12. In addition, the diagnostic network 16 may be used to establish selected operating conditions in the other portions of the system 10 as described below. One embodiment of the diagnostic network 16 is described in detail in the
aforementioned Douglas, et al., and Wells, et al., patent applications and will not be repeated here.
The system 10 is synchronous. All of its elements operate in accordance with a global SYS CLK system clock signal provided by a clock circuit 17.
One particular embodiment of system 10 may include hundreds or many thousands of processing elements 11 operating on a single problem in parallel under control of commands broadcast to them by the scalar processors 12. In that embodiment, the
processing elements 11 operate in parallel on the same command on their individual sets of data, thereby forming a parallel computer system.
In addition, the system 10 may be dynamically logically partitioned, by logical partitioning of the control network 14 as described below, into multiple logical subsystems which may concurrently operate on separate problems or separate parts of a
single problem. In that case, each partition includes at least one scalar processor 12 and a plurality of processing elements 11, the scalar processor 12 supplying the commands for processing by the processing elements in its partition. The spare
processing elements 11s, which except for the positions of their connections to the control network 14 and data router 15 are otherwise similar to processing elements 11, may be used to substitute for failed processing elements 11 in a partition as
described below, to augment the number of processing elements in a partition if there are insufficient processing elements 11 to form a partition with a desired number of processing elements 11, or to provide additional processing elements which may
themselves be formed into partitions. In the following, unless otherwise stated explicitly, a reference to a processing element 11, in either the singular or plural, will also be taken as a corresponding singular or plural reference to a spare
processing element 11s; that is, the processing elements 11 and spare processing elements 11s will be jointly referred to herein generally as processing elements 11.
It should be noted from the following description that the partitioning is only in relation to the control network 14, but not the data router 15. This facilitates transfer of data between processing elements of different partitions if they are,
for example, processing different parts of a particular problem, or, more generally, for inter-process communications, if each processing element of the diverse partitions is processing correspondingly diverse, but possibly interacting, processes. This
further facilitates transfer of data from processing elements of any partition to the input/output processors 13 to permit storage or display of data, as well as transfer from the input/output processors 13 of stored data to processing elements of any
partition.
II. General Description of Communications Networks
A. Data Router 15 Before proceeding to a detailed descripti | | |