|
Claims  |
|
|
What is claimed is:
1. A system for processor barrier synchronization of a plurality of processing elements (PEs) in a distributed processing computer, comprising:
a barrier detection program means operative in each PE to determine when each individual PE has reached a processor barrier;
a plurality of boolean DONE outputs from each PE, each output controlled by the barrier detection program means, indicating when a processor barrier was reached by that PE;
a plurality of processor barrier synchronization circuits for detecting which PEs have reached a processor barrier, comprising:
a plurality of boolean logical AND gates connected to corresponding DONE outputs of the plurality of PEs to indicate when all of the PEs reach a processor barrier;
an ALLDONE signal output from the plurality of AND gates for providing an output indicating that all of the PEs have reached a particular processor barrier; and
a plurality of logical fanout devices, with inputs connected to the ALLDONE signal line and outputs connected to the multiplicity of PEs to indicate to every PE in the system that the particular processor barrier was reached; and
a plurality of barrier synchronization registers (BSRs) for storing the status of each barrier synchronization process.
2. A computer system comprising:
a plurality of processing elements (PEs) including a first processing element (PE), wherein said first PE comprises:
a barrier synchronization register one (BSR1), said BSR1 including a first BSR bit and a second BSR bit, wherein said first BSR bit and said second BSR bit are set when said first PE reaches a first barrier point and a second barrier point,
respectively;
a first BSR output; and
a coupler which couples a value representative of said first BSR bit to said first BSR output; and
a barrier synchronization mechanism (BSM) having a first physical barrier synchronization circuit (PBSC), wherein said PBSC comprises a plurality of bypass points including a first bypass point and a second bypass point, wherein each of said
bypass points comprise:
a fanin gate having a plurality of fanin inputs and a fanin output, wherein said fanin output generates an indication of whether said fanin inputs are all set;
a fanout circuit having a fanout input and a fanout output; and
a bypass switch for switchably coupling said fanin output to said fanout input;
wherein one of the fanin inputs of said first bypass point is coupled to said first BSR output, wherein one of the fanin inputs of said second bypass point is coupled to the fanin output of said first bypass point, wherein the fanout input of
said second bypass point is coupled to the fanin output of said second bypass point, wherein the fanout input of said first bypass point is coupled to the fanout output of said second bypass point, and wherein the fanout output of said first bypass point
is coupled to said first PE.
3. The computer system according to claim 2, wherein
said coupler comprises a time multiplexor (TM), wherein said TM time-multiplexes said first BSR bit and said second BSR bit on said first BSR output.
4. The computer system according to claim 2, wherein said BSM further comprises a second PBSC wherein said second PBSC comprises:
a third and a fourth bypass point, wherein each of said bypass points comprise:
a fanin gate having a plurality of fanin inputs and a fanin output which generates an indication of whether said fanin inputs are all set;
a fanout circuit having a fanout input and a fanout output; and
a bypass switch for switchably coupling said fanin output to said fanout input;
wherein said first PE further comprises a second BSR output;
wherein said coupler further couples a value representative of said second BSR bit to said second BSR output; and
wherein one of the fanin inputs of said third bypass point is coupled to said second BSR output, wherein one of the fanin inputs of said fourth bypass point is coupled to the fanin output of said third bypass point, wherein the fanout input of
said fourth bypass point is coupled to the fanin output of said fourth bypass point, wherein the fanout input of said third bypass point is coupled to the fanout output of said fourth bypass point, and wherein the fanout output of said third bypass point
is coupled to said first PE.
5. The computer system according to claim 2, wherein said BSM further comprises a second PBSC, wherein said second PBSC comprises:
a third and a fourth bypass point, wherein each of said bypass points comprise:
a fanin gate having a plurality of fanin inputs and a fanin output which generates an indication of whether said fanin inputs are all set;
a fanout circuit having a fanout input and a fanout output; and
a bypass switch for switchably coupling said fanin output to said fanout input;
wherein said BSR1 further includes a third and a fourth BSR bit which are set when said first PE reaches a third and a fourth barrier point, respectively;
wherein said first PE further comprises a second BSR output;
wherein said coupler comprises a time multiplexor, wherein said TM time-multiplexes said first BSR bit and said second BSR bit on said first BSR output, and wherein said TM time-multiplexes said third BSR bit and said fourth BSR bit on said
second BSR output; and
wherein one of the fanin inputs of said third bypass point is coupled to said second BSR output, wherein one of the fanin inputs of said fourth bypass point is coupled to the fanin output of said third bypass point, wherein the fanout input of
said fourth bypass point is coupled to the fanin output of said fourth bypass point, wherein the fanout input of said third bypass point is coupled to the fanout output of said fourth bypass point, wherein the fanout output of said third bypass point is
coupled to said first PE.
6. A computer system according to claim 2, wherein the fanout output of said first bypass point generates an interrupt to said first PE.
7. A computer system according to claim 2, wherein the fanout output of said first bypass point controls a value which is polled by said first PE.
8. A method for barrier synchronization on a computer system, said computer system having a plurality of processing elements (PEs) including a first PE, and a physical barrier synchronization circuit including a plurality of bypass points each
having a fanin gate and a fanout circuit, wherein said bypass points are arranged in a tree having a plurality of levels, each of said plurality of PEs including a barrier synchronization register one (BSR1) and a first BSR output, each said BSR1 having
a first BSR bit set when its associated PE reaches a first barrier point, the method comprising the steps of:
outputting a value representative of the first BSR bit on the first BSR output of each of said PEs;
fanning-in a value from the first BSR outputs from a first plurality of said PEs to generate a first barrier-completion signal;
selectively bypassing said first barrier completion signal at one of the plurality of levels from the fanin gate to the fanout circuit in order to partition the tree; and
fanning-out said first barrier completion signal to each of said first plurality of PEs.
9. A method according to claim 8, wherein each said BSR1 further includes a second BSR bit set when its associated PE reaches a second barrier point
wherein the step of outputting comprises the step of time multiplexing said first BSR bit and said second BSR bit.
10. A method according to claim 8, wherein each said BSR1 further includes a second, third, and fourth BSR bit set when its associated PE reaches a second, third and fourth barrier point, respectively, and a second BSR output,
wherein the step of outputting comprises the steps of:
time multiplexing said first BSR bit and said second BSR bit to said first BSR output; and
time multiplexing said third BSR bit and said fourth BSR bit to said second BSR output. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
FIELD OF THE INVENTION
The present invention relates generally to massively parallel processing systems, and more particularly to a barrier synchronization mechanism for facilitating synchronization between multiple processors in such a system.
BACKGROUND OF THE INVENTION
Massively parallel processing (MPP) systems are computing systems comprised of hundreds or thousands of processing elements (PEs). While the power of a multiple instruction-multiple data (MIMD) computer system lies in its ability to execute
independent threads of code simultaneously, the inherently asynchronous states of the PEs (with respect to each other) makes it difficult in such a system to enforce a deterministic order of events when necessary. Program sequences involving interaction
between multiple PEs such as coordinated communication, sequential access to shared resources, controlled transitions between parallel regions, etc., may require synchronization of the PEs in order to assure proper execution.
An important synchronization capability in any programming model is the barrier. Barriers are points placed in the code beyond which no processor participating in a computation may proceed before all processors have arrived. Since processors
wait at a barrier until alerted that all PEs have arrived, the latency of the barrier mechanism can be very important. The latency of a barrier mechanism is the propagation time between when the last processor arrives at a barrier, and when all
processors have been notified that the barrier has been satisfied. During this period of time, all PEs may be idle waiting for the barrier to be satisfied. Hence barriers are a serialization point in a parallel code.
Barriers can be implemented entirely by software means, but software schemes are typically encumbered by long latencies and/or limited parallelism restricting how many processors can arrive at the barrier simultaneously without artificial
serialization. Because a barrier defines a serialization point in a program, it is important to keep the latency as low as possible.
Hardware support for barriers, while addressing the latency problems associated with barriers implemented by purely software means, can have other shortcomings that limit the utility of the mechanism in a production computing system. Production
computing systems demand that the barrier resource (like all resources) be partitionable among multiple users while maintaining protective boundaries between users. In addition, the barrier resource must be rich enough to allow division between the
operating system and the user executing within the same partition. Provision must also be made for fault tolerance to insure the robust nature of the barrier mechanism.
Hardware mechanisms may also suffer from an inability to operate synchronously. This inability may require that a PE, upon discovering that a barrier has been satisfied (all PEs have arrived at that point in the program), wait until all PEs have
discovered that the barrier has been reached before it may proceed through the next section of program code. The ability to operate synchronously enables the barrier mechanism to be immediately reusable without fear of race conditions.
Hardware mechanisms may also require that a PE explicitly test a barrier flag to discover when the barrier condition has been satisfied. This can prevent a PE from accomplishing other useful work while the barrier remains unsatisfied, or force
the programmer to include periodic tests of the barrier into the program in order to accomplish other useful work while a barrier is pending. This can limit the usefulness of a barrier mechanism when used as a means of terminating speculative parallel
work (e.g., a database search) when the work has been completed (e.g. the searched-for item has been found).
SUMMARY OF THE INVENTION
To overcome the above described shortcomings in the art and provide key system resources necessary for production computing, the present invention provides a hardware mechanism that facilitates barrier synchronization in a massively parallel
processing system. The present barrier mechanism provides a partitionable, low-latency, immediately reusable, robust mechanism which can be used to alert all PEs in a partition when all of the PEs in that partition have reached a designated point in the
program. The mechanism permits explicit testing of barrier satisfaction, or can alternately interrupt the PEs when a barrier has been satisfied. The present invention also provides an alternate barrier mode that satisfies a barrier when any one PE has
reached a designated point in the program, providing the capability to terminate speculative parallel work. The present barrier mechanism provides multiple barrier resources to a partition to allow pipelining of barriers to hide the latency as well as
offering raw latencies 2-3 orders of magnitude faster than software implementations. The barrier mechanism is also partitionable for multi-users. Barriers are used to bulk-synchronize the processors in a partition between loops where producer/consumer
hazards may exist, or control entry and exit between segments of a program.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, features and advantages of the invention, as well as the presently preferred embodiments thereof, will become apparent upon reading and understanding the following detailed description and accompanying drawings in
which:
FIG. 1 shows a simplified block diagram of a representative MPP system with which the present barrier mechanism can be used;
FIG. 2 shows a simplified block diagram of a processing element (PE), including a processor, local memory and associated shell circuitry;
FIG. 3 shows the format of the barrier synchronization registers BSR0 and BSR1;
FIG. 4 shows a simplified radix-2 barrier synchronization circuit;
FIG. 5 shows a simplified radix-4 barrier synchronization circuit;
FIG. 6 shows the format of the barrier synchronization function (BSFR) register;
FIG. 7 shows the format of the barrier synchronization mask and interrupt (BSMI) register;
FIG. 8 shows the bypass points in a simplified radix-2 barrier synchronization circuit;
FIG. 9 shows the barrier synchronization circuit of FIG. 8, with some of the bypass points redirected to the fanout block;
FIGS. 10A-E show how a 512 PE system can be partitioned at each level of a barrier synchronization circuit;
FIG. 11 shows how the barrier synchronization mask and interrupt (BSMI) register is used in concert with the bypass points in the barrier synchronization circuits to achieve the desired shape and size of a particular partition;
FIG. 12 shows the format of the barrier timing register BAR.sub.-- TMG register;
FIG. 13 shows a block diagram of a bypass point; and
FIG. 14 shows the hardware state sequencing implementation for a single barrier bit.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In the following detailed description of the preferred embodiment, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration a specific embodiment in which the invention may be
practiced. It is to be understood that the present detailed description is intended as exemplary only, and that other embodiments may be utilized and structural changes made without departing from the spirit and scope of the present invention.
The preferred MPP system, for which the present invention provides hardware support for barrier synchronization, is a MIMD massively parallel multiprocessor with a physically distributed, globally addressable memory. A representative MPP system
100 is shown in FIG. 1. The MPP system 100 contains hundreds or thousands of processors, each accompanied by a local memory and associated support circuitry. Each processor, local memory and support circuity component is called a processing element
(PE). The PE's in the MPP system 100 are linked via an interconnect network.
The preferred MPP system 100 has a physically distributed memory, wherein each processor has a favored, low latency, high bandwidth path to a local memory, and a longer latency lower bandwidth access to the memory banks associated with other
processors over the interconnect network. In the preferred embodiment, the interconnect network is comprised of a 3-dimensional torus which when connected creates a 3-dimensional matrix of PEs. The torus design has several advantages, including speed
of information transfers and the ability to avoid bad communication links. The interconnect network is also scalable in all three dimensions. The interconnect network is described in more detail in the copending and commonly assigned U.S. patent
application Ser. No. 07/983,979, entitled "Direction Order Routing in Multiprocessing Systems", to Gregory M. Thorsen, filed Nov. 30, 1992, which is incorporated herein by reference.
FIG. 2 shows a simplified block diagram of a PE 200. An individual PE includes a high-performance RISC (reduced instruction set computer) microprocessor 202. In the preferred MPP system, microprocessor 202 is the DECChip 21064-AA RISC
microprocessor, available from Digital Equipment Corporation. Each PE is coupled to a local memory 204 that is a distributed portion of the globally-addressable system memory, and includes a shell of circuitry that implements synchronization and
communication functions facilitating interactions between processors.
The shell circuitry includes an interconnection network router 206, used to connect multiple PEs in the three-dimensional toroidal "fabric". The interconnection network carries all data communicated between PEs and memories that are not local.
An address centrifuge and block transfer engine 208 in the PE shell circuitry permits asynchronous (i.e., independent of the local processor) movement of data between the local memory 204 and remote memories associated with other PEs, such as block
transfers, with flexible addressing modes that permit a high degree of control over the distribution of data between the distributed portions of the system memory. The address centrifuge and block transfer engine are described in detail in the copending
and commonly assigned U.S. Patent Application entitled "RECURSIVE ADDRESS CENTRIFUGE FOR DISTRIBUTED MEMORY MASSIVELY PARALLEL PROCESSING SYSTEMS", filed on even date herewith to Fromm, which application is incorporated herein by reference.
The shell circuitry also includes a data prefetch queue 210 which allows the processor 202 to directly initiate data movement between remote memories and the local processor in a way that can hide the access latency and permit multiple remote
memory references to be outstanding.
Synchronization circuits in the shell permit synchronization at various different levels of program or data granularity in order to best match the synchronization method that is "natural" for a given parallelization technique. At the finest
granularity, data-level synchronization is facilitated by an atomic swap mechanism that permits the locking of data on an element-by-element basis. A more coarse grain data-level synchronization primitive is provided by a messaging facility, which
permits a PE to send a packet of data to another PE and cause an interrupt upon message arrival, providing for the management of message queues and low-level messaging protocol in hardware. Control-level synchronization at the program loop level is
provided by a large set of globally accessible fetch-and-increment registers that can be used to dynamically distribute work (in the form of iterations of a loop, for instance) among processors at run time. The present invention provides yet another
form of control-level synchronization, barrier synchronization, which is useful to control transitions between major program blocks (i.e., between loops performing actions on the same data sets).
Barrier Synchronization
The present invention provides hardware support for barrier synchronization which results in a low-latency method of synchronizing all or a portion of the PEs in an MPP system. The barrier synchronization mechanism of the present invention may
be used to perform two types of synchronization: barrier and eureka.
A barrier is a point in program code where, after reaching the barrier, a processor must wait until all other processors participating in the computation have also reached the barrier. After all of the processors reach the barrier, the
processors continue issuing program code.
A programmer may place a barrier in a program between distributed, parallel loops performing operations on the same data. By doing this, the programmer ensures that all of the processors associated with a loop finish the loop (all writes to
shared data complete) before any of the processors continue with other program instructions (access the new values of the shared data).
A eureka is a point in program code where a condition is established that only a single processor need satisfy, thereby causing all processors to proceed beyond the eureka point. To participate in a eureka event, all processors initialize the
eureka barrier mechanism described herein and enable an interrupt, then proceed executing program code to solve for the eureka condition. As soon as any processor completes the computation, it triggers the eureka, thus causing an interrupt to all PEs.
The interrupt indicates that the eureka has been satisfied and all PEs may continue beyond the eureka point.
Eureka synchronization has several uses, including database searches. Using eureka synchronization, a programmer can stop a database search as soon as any processor succeeds in finding the desired data rather than waiting for all of the
processors to exhaust the search.
Logical Barrier Synchronization Circuits
The preferred barrier mechanism has 16 logical barrier synchronization circuits. Each PE in the MPP system has an input to each of the 16 logical barrier synchronization circuits. The multiple barrier synchronization circuits facilitate
partitioning between users and the operating system, as well as providing redundant resources for fault tolerance.
The inputs to the 16 logical barrier synchronization circuits are 16 bits which are contained in two special registers. Each PE contains two 8-bit registers called barrier synchronization register 0 (BSR0) and barrier synchronization register 1
(BSR1). FIG. 3 shows the format of BSR0 and BSR1. Each of the 16 bits which comprise BSR0 and BSR1 is an input to one of the 16 logical barrier synchronization circuits. Thus, each PE has an input to each of the 16 logical barrier synchronization
circuits.
Barrier synchronization register 0 (BSR0) is an 8-bit, readable and writable, general access register. Preferably, BSR0 contains the eight least significant barrier bits for a PE. When read from, the value of BSR0 represents the value of bits
2.sup.7 through 2.sup.0 of BSR0. The remaining bits are not valid. Table 1 shows the bit format of BSR0 and describes each bit of the register.
TABLE 1 ______________________________________ BSR0 Format Bits Name ______________________________________ 2.sup.0 Barrier bit 2.sup.0 2.sup.1 Barrier bit 2.sup.1 2.sup.2 Barrier bit 2.sup.2 2.sup.3 Barrier bit 2.sup.3 2.sup.4 Barrier
bit 2.sup.4 2.sup.5 Barrier bit 2.sup.5 2.sup.6 Barrier bit 2.sup.6 2.sup.7 Barrier bit 2.sup.7 2.sup.63 -2.sup.8 These bits are not used. ______________________________________
Barrier synchronization register 1 (BSR1) is an 8-bit, readable and writable, privileged access register. BSR1 contains the eight most significant barrier bits for a PE. When read from, the value of BSR1 represents the value of bits 2.sup.15
through 2.sup.8 of BSR1 and 2.sup.7 through 2.sup.0 of BSR0. The remaining bits are not valid. Table 2 shows the bit format of BSR1 and describes each bit of the register.
TABLE 2 ______________________________________ BSR1 Format Bits Name ______________________________________ 2.sup.7 -2.sup.0 These bits are not used; however, when BSR1 is read, these bits contain the value of bits 2.sup.7 through 2.sup.0
of BSR0. 2.sup.8 Barrier bit 2.sup.8 2.sup.9 Barrier bit 2.sup.9 2.sup.10 Barrier bit 2.sup.10 2.sup.11 Barrier bit 2.sup.11 2.sup.12 Barrier bit 2.sup.12 2.sup.13 Barrier bit 2.sup.13 2.sup.14 Barrier bit 2.sup.14 2.sup.15 Barrier bit 2.sup.15
2.sup.63 -2.sup.16 These bits are not used. ______________________________________
All 16 of the logical barrier synchronization circuits function identically and operate independently. An example of the operation of the barrier synchronization circuits will now be given, using bit 2.sup.2 of BSR0 when it is used for barrier
synchronization and for eureka synchronization as an example for purposes of illustration. It shall be understood that the remaining bits function in the same way as bit 2.sup.2 in the following discussion.
Each logical barrier synchronization circuit is implemented as an AND-tree and fanout-tree circuit. FIG. 4 shows a barrier synchronization circuit 400 in a simplified MPP system using 2-input AND gates to form the AND-tree. For simplicity of
illustration, the MPP system shown in FIG. 4 contains only eight PEs. It shall be understood, however, that the present barrier mechanism and barrier synchronization circuits can be used with a system having any number of PEs.
Because the network in FIG. 4 is accommodating 8 processors, PE0-PE7, and the AND-tree is implemented with 2-input AND gates, log.sub.2 8=3 levels of AND-tree are required to arrive and a final barrier bit representing the logical product of all
the PEs' barrier bits.
As a starting condition, before barrier synchronization begins, bit 2.sup.2 of BSR0 is reset to 0 in all of the PEs. When a processor satisfies the barrier condition, that processor sets bit 2.sup.2 of its associated BSR0 register to 1. This
action sends a 1 to one of the AND gates in the first layer of the AND-tree.
The first layer of the AND-tree shown in FIG. 4 contains four AND gates. Each AND gate receives signals from two PEs. For example, one AND gate may receive signals from bit 2.sup.2 of BSR0 in PE0 and bit 2.sup.2 of BSR0 in PE 1. When all of
the processors reach the barrier point in the program code (satisfy the barrier) they have each set bit 2.sup.2 of their associated BSR0 to 1, causing the output of each of the four AND gates in the first level of the AND-tree to switch to 1.
The second level of the AND-tree of FIG. 4 contains two AND gates. Each AND gate receives signals from two of the AND gates in the first level of the AND-tree. When the output of all of the AND gates in the first level of the AND-tree are 1,
the output of both the AND gates in the second level of the AND-tree are 1.
The third level of the AND-tree contains the final AND gate. This AND gate receives signals from both AND gates in the second level of the AND-tree. When the output of both AND gates in the second level of the AND-tree are 1, the output of the
final AND gate is 1. The output of the final AND gate sends an input to the fanout-tree circuit. The fanout tree circuit is used to report the result of the AND-tree back to PEs.
The first fanout block in the fanout tree receives a 1 from the single level 3 AND gate. After creating two copies of the 1, the first fanout block sends the 1's to the two fanout blocks in the second level of the fanout tree.
The two fanout blocks in the second level of the fanout tree each create two copies of the 1. The two fanout blocks in the second level of the fanout tree then sends the 1's to four fanout blocks in the first level of the fanout tree.
The four fanout blocks in the first level of the fanout tree each create two copies of the 1. The fanout blocks in the first level of the fanout tree then send the 1's to the eight PEs. This signals all of the PEs in the system that all of the
processors have reached the barrier and the processor in the PE may continue with other program instructions. Because the fanout tree is dependent on the AND-tree, the fanout tree will report that all of the PEs have reached the barrier only when each
PE in the system has set bit 2.sup.2 to 1.
As will be described more fully below in connection with FIG. 14, the barrier mechanism is designed to be immediately reusable. This means that as soon as a processor detects that the barrier bit has cleared (all processors have arrived at the
barrier), it can immediately set the barrier bit again to announce its arrival at the next barrier if applicable. Doing so does not affect the notification of the prior barrier satisfaction to any other PE.
Eureka Synchronization
A logical barrier synchronization circuit 400 may also be used to perform eureka synchronization. One use of the eureka function is for the synchronization of a distributed, parallel data search. Once one of the PEs has located the targeted
information, the successful PE can set a eureka bit to inform the other PEs involved that the search is completed.
To enable the eureka function, each PE contains a register called the barrier synchronization function register (BSFR). FIG. 6 shows the format of the BSFR and how it functions in conjunction with the BSR0 and BSR1 registers. The barrier
synchronization function register (BSFR) is preferably a 16-bit, write-only, system privileged register. The BSFR contains a 16-bit mask that indicates which bits of BSR0 and BSR1 are used for barrier synchronization and which bits are used for eureka
synchronization. Bits 0-7 of the BSFR control the function of BRS0 and bits 8-15 of the BSFR control BSR1. If a bit of the BSFR is set to 1, the corresponding bit of BSR0 or BSR1 is used for eureka synchronization. If a bit of the BSFR is set to 0,
the corresponding bit of BSR0 or BSR1 is used for barrier synchronization. Table 3 shows the bit format of BSFR and describes each bit of the register.
TABLE 3 ______________________________________ BSFR Format Bits Name ______________________________________ 2.sup.7 -2.sup.0 These bits indicate which bits of BSR0 are used for eureka synchronization. For example, as shown in FIG. 6, when
bit 2.sup.2 of the BSFR is set to 1, bit 2.sup.2 of BSR0 is used for eureka synchronization. When bit 2.sup.2 of the BSFR is set to 0, bit 2.sup.2 of BSR0 is used for barrier synchronization. 2.sup.15 -2.sup.8 These bits indicate which bits of BSR1
are used for eureka synchronization. For example, as shown in FIG. 6, when bit 2.sup.13 of the BSFR is set to 1, bit 2.sup.13 of BSR1 is used for eureka synchronization. When bit 2.sup.13 of the BSFR is set to 0, bit 2.sup.13 of BSR1 is used for
barrier synchronization. 2.sup.63 -2.sup.16 These bits are not used. ______________________________________
Because each of the 16 logical barrier synchronization circuits operate completely independently of each other, any of the bits in each of the two barrier synchronization registers BSR0 and BSR1 for a particular PE can be programmed to function
in the conventional barrier mode, or in eureka mode. As will be described below in connection with FIG. 14, in eureka mode the output of the AND-tree is read directly by the PEs with no intervening synchronization logic such as the latch used in
conventional barrier synchronization.
Because in eureka mode the output of the fanout tree is read directly by the PEs, the barrier hardware is not synchronous as it is in conventional barrier mode, nor is a eureka bit immediately reusable. Using a barrier bit in eureka mode
requires the use of a second bit programmed to function in barrier mode in order to prevent race conditions. The conventional barrier is used to synchronize the entry and exit of the PEs into and out of a eureka barrier. This implementation means that
only half of the 16 barrier bits can be used for eureka synchronization.
Reading BSR0 or BSR1 returns the current state of the AND-tree. This allows the AND-tree to be used as an OR tree by using negative logic. After all PEs initialize their barrier bit to a logical 1, the output of the AND-tree can be read as a
one by all PEs. If any PE writes a zero to its barrier bit, the output of the AND-tree will read as a zero, performing the OR function.
A typical eureka sequence begins with all processors initializing the eureka bit to a logical 1 and setting the conventional barrier bit to a logical 1. When the barrier switches to a zero (meaning all processors have initialized their eureka
bit and joined the barrier), the eureka bit is "armed". The processors then begin the parallel data search or other activity that is to be terminated by the eureka. When a processor satisfies the termination conditions, it clears its eureka bit to
alert the other processors and sets its conventional barrier bit to indicate it has observed the eureka event. As each of the other PEs detect that the eureka event has occurred (because the output of the eureka AND-tree drops to a logical 0), it sets
its conventional barrier bit and waits for the barrier to complete. When all PEs have joined the final barrier, they may proceed to arm the next eureka.
Servicing A Barrier
In the preferred embodiment of the present invention, the processor monitors the output of the fanout circuitry using one of two mechanisms: periodically testing the barrier bit (such as with a continuous loop) or enabling a barrier interrupt.
Continuing to use bit 2.sup.2 as an example, in the first mechanism, after the processor sets bit 2.sup.2 of BSR0 to 1, the processor may enter a loop that continuously checks the value of bit 2.sup.2 of BSR0. After receiving a 1 from the fanout
circuitry, the support circuitry in the PE resets bit 2.sup.2 of BSR0 to 0. Because the processor is regularly checking the value of bit 2.sup.2 of BSR0, the processor may continue executing program instructions as soon as it is detected that bit
2.sup.2 of BSR0 is reset to 0.
In the barrier interrupt mechanism, after the processor satisfies the barrier and sets bit 2.sup.2 of BSR0 to 1, the processor enables a barrier interrupt. The processor may then issue program instructions that are not associated with the
barrier. After receiving a 1 from the fanout circuitry, the support circuitry in the PE resets bit 2.sup.2 of BSR0 to 0 and sets the barrier interrupt to the processor. The barrier interrupt indicates to the processor that all of the processors have
reached the barrier, and causes the processor to suspend the unrelated activity and return to executing instructions associated with the barrier. The advantage of the barrier interrupt over continuous polling is the ability to perform other useful work
while the other processors are approaching the barrier.
In the preferred embodiment, the microprocessor enables the barrier hardware interrupt using a hardware interrupt enable register (HIER) in the microprocessor system control register. For more information on the HIER and the system control
register, refer to the DECChip 21064-AA RISC Microprocessor Preliminary Data Sheet, available from Digital Equipment Corporation, which is incorporated herein by reference. The DEC microprocessor has 6 inputs for external hardware interrupts. These
inputs appear in the HIRR (Hardware Interrupt Request Register) in bit positions 5-7 and 10-12. One of these six inputs is designated as the Barrier Interrupt Request bit.
The HIRR inputs can be enabled or disabled by a mask bit located in the HIER (Hardware Interrupt Enable Register) internal to the microprocessor. For more information on the HIRR and HIER registers, refer to pages 3-26 through 3-29 in the
Digital Equipment Corporation publication: EV-3 AND EV-4 SPECIFICATION Version 2.0 May 3, 1991, which is incorporated herein by reference.
Those skilled in the art are aware that all RISC microprocessors provide comparable facilities for allowing and controlling the direct sampling of external hardware interrupt signals, and that the present invention is not limited to use with the
DEC RISC microprocessor described herein.
The interrupt input to the microprocessor is asserted whenever any of the barrier bits selected by the BSMI (barrier synchronization mask and interrupt, discussed below) make the transition from a logical 1 to a logical 0. The interrupt input to
the microprocessor is cleared by writing a bit in the system control register. To assure that no satisfied barrier events are missed, the correct programming sequence would be to clear the interrupt, then read BSR0 and BSR1 to determine which bit(s)
have toggled.
After the support circuity sets the barrier hardware interrupt, the processor must read the BSMI register (as described below) to determine if the interrupt was associated with BSR0 or BSR1. When the processor reads the value of the BSMI
register, the support circuitry clears the interrupt associated with BSR0 and the interrupt associated with BSR1.
If a barrier interrupt occurs while the processor is reading the BSMI register, the interrupt still occurs and is not cleared. The processor must then read the value of the BSMI register again to determine if the interrupt was associated with
BSR0 or BSR1 and to clear the interrupts.
Logical Partitions
Not all of the PEs in a multiprocessing system may need to be part of a barrier or eureka synchronization process. Also, it is often desirable to have several partitions of PEs operational on different tasks simultaneously. To facilitate this,
each PE also contains a barrier synchronization mask and interrupt (BSMI) register which is used to enable or disable a logical barrier synchronization circuit for a PE. The BSMI register contains a mask that indicates which bits of BSR0 and BSR1 are
enabled for the PE, thus defining which partition(s) a PE is a member of, and defining partitioning among the PEs.
The 16 independent barrier synchronization circuits visible to the user in the barrier registers can be assigned by the operating system to different groups or teams of processors. The BSMI allows the barrier bits in any given processor to be
enabled or disabled. Alone, the BSMI permits the division of the barrier resource among up to 16 different partitions with arbitrary membership. Used in conjunction with the barrier network partitioning capability described herein below, the BSMI
allows the flexible subdivision of the barrier resource among a very large number of partitions in a scalable fashion.
FIG. 7 shows the format of the BSMI register. The BSMI is a 16-bit, readable and writable, system privileged register. When written to, the BSMI register controls which bits in BSR0 and BSR1 are enabled for use by the PE. Bits 7-0 of BSMI
enable the bits in BSR0, while bits 15-8 enable the bits in BSR1. If a bit of the BSMI register is set to 1, the corresponding bit of BSR0 or BSR1 is enabled. If a bit of the BSMI register is set to 0, the corresponding bit of BSR0 or BSR1 is disabled.
The BSMI register has a different bit format when written to than it does when it is read from. Table 4 shows the bit format of the BSMI register when it is written to and describes each bit of the register.
A disabled BSR0 or BSR1 bit appears to the logical product network to be always satisfied (a logical "1"). This permits the barrier to function normally for other processors whose barrier bit is enabled. Reading a disabled barrier
synchronization register bit returns a logical "0". Writing a disabled barrier synchronization register bit has no effect.
TABLE 4 ______________________________________ BSMI Register Write Format Bits Name ______________________________________ 2.sup.7 -2.sup.0 These bits indicate which bits of BSR0 are enabled for use by the PE. For example, when bit 2.sup.2
of the BSMI register is set to 1, as shown in FIG. 7, bit 2.sup.2 of BSR0 is enabled for use by the PE. When bit 2.sup.2 of BSR0 is disabled and cannot be used by the PE. 2.sup.15 -2.sup.8 These bits indicate which bits of BSR1 are enabled for use
by the PE. For example, when bit 2.sup.13 of the BSMI is set to 1, as shown in FIG. 7, bit 2.sup.13 of BSR1 is enabled for use by the PE. When bit 2.sup.13 of the BSMI is set to 0, bit 2.sup.13 of BSR1 is disabled and cannot be used by the PE.
2.sup.63 -2.sup.16 These bits are not used. ______________________________________
Table 5 shows the bit format of the BSMI register when it is read from and describes each bit of the register. When read from, bits 2.sup.14 and 2.sup.15 of the BSMI register provide the current state of the barrier interrupts from BSR0 and
BSR1, respectively. After being read the BSMI register is cleared.
TABLE 5 ______________________________________ BSMI Register Read Format Bits Name ______________________________________ 2.sup.13 -2.sup.0 These bits are not valid. 2.sup.14 This bit reflects the current state of the barrier interrupt
associated with bits 2.sup.7 through 2.sup.0 of BSR0. 2.sup.15 This bit reflects the current state of the barrier interrupt associated with bits 2.sup.15 through 2.sup.8 of BSR1. 2.sup.63 -2.sup.16 These bits are not valid.
______________________________________
Software may use the BSMI register to allow any set number of PEs to use one of the logical barrier synchronization circuits, and thus set up a partition among the PEs. For example, software may set bit 2.sup.2 of the BSMI register to 1 in only
four of the PEs in an MPP system. In this case, only the four PEs with bit 2.sup.2 of the BSMI register set to 1 may use the logical barrier synchronization circuit associated with bit 2.sup.2 of BSR0. This creates a logical barrier partition among the
four PEs.
The BSMI and BSFR registers can be used in concert to arrive at several different barrier synchronization mechanisms. Table 6 shows the effect of one bit from the BSMI register and the BSFR on the corresponding bit in BSR0 or BSR1.
TABLE 6 ______________________________________ Barrier Mask and Barrier Function Register: BSMI BSFR Synchro- Bit Bit Writing to Reading from nization Value Value BSR0 or BSR1 BSR0 or BSR1 Type ______________________________________ 0 0
No effect Returns a 1 Disabled 0 1 No effect Returns a 1 Disabled 1 0 Writing 1 Returns a 1 if Barrier indicates the waiting for microprocessor barrier to has reached the complete. Returns barrier. Writing 0 a 0 when barrier has no effect.
is complete. 1 1 Writing 1 Returns a 1 if Eureka indicates the waiting for eureka microprocessor is to occur. Returns ready for eureka a 0 when a eureka synchronization. occurs. Writing 0 indicates the microprocessor has completed the
eureka process. ______________________________________
Physical Barrier Synchronization Circuits
Although each of the 16 bits in the BSR0 and BSR1 registers in each PE represent an input to one of 16 logical barrier synchronization circuits, the preferred embodiment of the present barrier synchronization mechanism does not contain 16
physical barrier synchronization circuits. Instead, in the preferred implementation, the system contains 4 physical barrier synchronization circuits. The 16 bits of BSR0 and BSR1 are time multiplexed into the four physical barrier synchronization
circuits.
One exemplary physical barrier synchronization circuit is shown in FIG. 5. The preferred barrier network is implemented using a radix-4 AND-tree and fanout tree. As shown, a 1024 PE system contains log.sub.4 1024=5 levels in a barrier
synchronization circuit.
Table 7 shows the input to each of the four physical barrier synchronization circuits during each clock period (CP). Four CPs are required for the physical barrier synchronization circuits to receive the input from all 16 bits in BSR0 and BSR1.
The input registers to the logical barrier synchronization circuits (as shown and described below in connection with FIG. 14) are completely parallel, so any number of PEs can set barrier bits in the same clock period without contention. All PEs
are informed that a particular barrier has been reached simultaneously, although due to the time multiplexed nature of the inputs to the circuits different processors may be at different points in a spin-loop testing the barrier bit.
TABLE 7 ______________________________________ Physical Barrier Synchronization Circuit Inputs Circuit First CP Second CP Third CP Fourth CP ______________________________________ 0 Bit 2.sup.0 Bit 2.sup.4 Bit 2.sup.8 Bit 2.sup.12 of
BSR0 of BSR0 of BSR1 of BSR1 1 Bit 2.sup.1 Bit 2.sup.5 Bit 2.sup.9 Bit 2.sup.13 of BSR0 of BSR0 of BSR1 of BSR1 2 Bit 2.sup.2 Bit 2.sup.6 Bit 2.sup.10 Bit 2.sup.14 of BSR0 of BSR0 of BSR1 of BSR1 3 Bit 2.sup.3 Bit 2.sup.7 Bit 2.sup.11 Bit
2.sup.15 of BSR0 of BSR0 of BSR1 of BSR1 ______________________________________
With the preferred radix-4 AND-tree implementation such as that shown in FIG. 5, each level of the tree takes approximately one clock period in logic and another 1 to 1.5 clock periods in wire travel time to accomplish. This assumption allows
the latency for a barrier between 1024 processors to be estimated and compared with known latencies for software barrier techniques.
Table 8 illustrates the number of barrier bits and the estimated number of clock periods at each level of a radix-4 tree such as the one shown in FIG. 5 connecting 1024 PEs. The radix-four tree reduces the barrier bit count by a factor of four
in each level.
TABLE 8 ______________________________________ Barrier bit logical product tree levels Number of barrier bits Radix-four logical product tree level ______________________________________ 1024 Level one, clock period 0 256 Level two, clock
period 2 64 Level three, clock period 4 16 Level four, clock period 6 4 Level five, clock period 8.5 (1.5 clock wire) 1 Level six, clock period 11 (1.5 clock wire) ______________________________________
From Table 8 it can be seen that eleven clock periods are consumed to perform the logical product of the barrier bits from 1024 PEs | | |