|
Description  |
|
|
FIELD OF THE INVENTION
The present invention relates to an architectural solution to the problem
of accomplishing efficient synchronization, scheduling and work allocation
in multiprocessors.
BACKGROUND OF THE INVENTION
The coordination of multiple operations in shared memory multiprocessors
often constitutes a substantial performance bottleneck. Process
synchronization and scheduling are generally performed by software, and
managed via shared memory. Execution of parallel programs on a
shared-memory, speedup-oriented multiprocessor necessitates a means for
synchronizing the activities of the individual processors. This necessity
arises due to precedence constraints within algorithms: When one
computation is dependent upon the result of other computations, it must
not commence before they finish. In the general case, such constraints are
projected onto an algorithm's parallel decomposition, and reflected as
precedence relations among its execution threads.
Synchronization is only one aspect of a broad activity, which may be termed
parallel operation coordination, whose other aspects are scheduling and
work allocation. Scheduling is selecting an execution order for the
operations of a program, out of a space of execution orders which are
feasible under the given architecture and precedence constraints, as
described in the paper entitled "The Effect of Operation Scheduling on the
Performance of a Data Flow Computer," M. Gransky et al, IEEE Trans. on
Computers, Vol. C-36 No. 9, September 1987, pp. 1019-1029. While
scheduling deals with the point of view of the tasks to be computed, work
allocation deals with the point of view of the processors which carry out
the tasks. Thus, the distinction between scheduling an allocation is not
clear-cut, and some researchers use these terms interchangeably. The
decisive questions may be posed as follows: "which ready-to-run piece of
work should be executed first ?" which is a matter of scheduling policy;
questions of the sort "to which processor should a given piece of work be
allocated ?" or "how much work should be allocated at once to a given
processor ?", are considered to be a matter of allocation policy.
Scheduling and allocation may be static, i.e. determined before program
run-time.
In fully dynamic systems, all these coordination activities are not an
inherent part of the actual computation, but are rather designed to
support it. Since they consume computational resources, they are
considered as overhead. Coordination or synchronization efficiency, refers
to the efficiency of parallel operation coordination activity itself,
excluding the indirect effects of scheduling policy.
The overall multiprocessor performance is influenced significantly by the
efficiency of coordination, as described in the book entitled
"High-Performance Computer Architecture", H. S. Stone, Addison-Wesley,
1987, and in the papers entitled "Execution of Parallel Loops on Parallel
Processor Systems," C. D. Polychronopoulos et al, Proc. Int. Conf. on
Parallel Processing, 1986, pp. 519-527: "A Technique for Reducing
Synchronization Overhead in Large Scale Multiprocessors", Z. Li et al.
Proc. of the 12th Symp. on Computer Architecture, 1985, pp. 284-291; "The
Piecewise Data Flow Architecture: Architectural Concepts," J. E. Requa et
al. IEEE Trans. on Computers, Vol. C-32 No. 5, May 1983, pp. 425-438; "A
Case Study in the Application of a Tightly Coupled Multiprocessor to
Scientific Computations," N. S. Ostlund et al, Parallel Computations, G.
Rodrigue, editor, Academic Press, 1982, pp. 315-364; "Synchronized and
Asynchronous Parallel Algorithms for Multiprocessors," H. T. Kung,
Algorithms and Complexity, Academic Press, 1976, pp. 153-200; and "A
Survey of Synchronization Methods for Prallel Computers," A Dinning, IEEE
Computer, Vol. 20 No. 19, January 1987, pp. 100- 109.
Inefficiencies in these processes are manifested in overhead-activity and
overhead-idling. The former is the activity which is required, once a task
has been computed, to obtain a new piece of productive work, while the
latter is due to contention of synchronization resources, which are
system-global by nature.
Overhead-idling is principally caused by insufficient synchronization rate
capability. As noted in the text by H. S. Stone supra, this capability
(expressed in MSYPS, Millions of Synchronizations Per Second) constitutes
an independent architectural measure; in particular, it is not necessarily
proportionate to the system's overall raw processing power, as expressed
MIPS and MFLOPS. Decompositing a given algorithm into ever finer
granularity levels will yield an ever increasing demand for
synchronization rate, and an ever bigger ratio of overhead-activity to
productive computation. Thus, at some level of granularity,
synchronization may become a bottleneck, thereby practically limiting the
exploitable level of parallelism. Consequently, it is desirable to search
for means to increase the synchronization rate capability and to reduce
the coordination overhead activity of multiprocessor systems.
Synchronization methods for multiprocessors were born out of mutual
exclusion methods, prevalent in multiprogrammed uniprocessors. Still,
synchronization is usually implemented around special synchronization data
in main memory, as described in the paper entitled "Synchronization,
Coherence, and Event Ordering in Multiprocessors," M. Dubois et al, IEEE
Computer, Vol. 21 No. 2, February 1988, pp. 9-22. These synchronization
data are either stand-alone (e.g. locks and semaphores), or attached to
regular data objects (such as presence bits). A variety of synchronization
primitives, such as Test & Set or Fetch & Add. serve to establish access
to synchronization variables and to manipulate them, as described in the
paper entitled "The NYU Ultracomputer--Designing an MIMD shared Memory
Parallel Processor," A. Gottlieb et al. IEEE Trans. on Computers, February
1983, pp. 175-89. The implementation of these primitives is based on some
special hardware support, whether rudimentary or massive. Yet the
essential levels of parallel operation coordination are implemented in
software. Some examples of prominent commercial and research
multiprocessors which are included in this framework are described in the
following papers: "Cm*--A modular multi-microprocessor," R. J. Swan et al,
AFIPS Conf. Proc., 1977 National Computer Conference, pp. 637-644;
"Architecture and Applications of the HEP Multiprocessor Computer System,"
B. J. Smith, Real Time Signal Processing IV, Proceedings of SPIE, August
1981, pp. 241-248; "The IMB RP3 Introduction and Architecture," G. F.
Pfister et al. Proc. Int. Conf. on Parallel Processing, August 1985, pp.
764-771; "Cedar", D. Gajski et al, Report No. UIUCDCS-R-83-1123.
Department of Computer Science, University of Illinois, Urbana, February
1983, pp. 1-25; "Synchronization Scheme and its Applications for Large
Multiprocessor Systems," C. Q. Zhu Proc. 4th Int. Conf. on Distributed
Computing Systems, 1984, pp. 486-493; and "The Butterfly Parallel
Processor," W. Crowther et al. Newsletter of the Computer Architecture
Technical Committee (IEEE Computer Society), September/December 1985, pp.
18-45. Within this framework, efforts are aimed at improving
synchronization efficiency were routed to the following directions:
Development of enhanced hardware support for synchronization primitives
(most notably - NYU Ultracomputer's combining network, as described in the
paper by Gottlieb, supra.); development of more powerful synchronization
primitives as described in the paper by C. Q. Zhu et al supra, and the
paper by J. R. Goodman entitled "Efficient Synchronization Primitives for
Large-Scale Cache-Coherent Multiprocessors," Proc. of the Conf. on
Architectural Support for Programming Languages and Operating Systems,
ASPLOS-III, 1989, pp. 64-75; development of inherently asynchronous
parallel algorithms, as described in the paper by H. T. Kung supra; and
development of various techniques for synchronization minimization, as
described in the paper by Z. Li et al, and in the paper entitled "Guided
Self-Scheduling: A Practical Scheduling Scheme for Parallel
Supercomputers," C. D. Polychronopoulos et al, IEEE Trans. on Computers,
Vol. C-36 No. 12, December 1987, pp. 1425-1439.
A recent survey of synchronization methods contained in the paper by
Dinning supra, describes in detail the synchronization mechanisms of seven
machines. While giving a classification for prevalent synchronization
methods, the paper by Dinning supra confirms the central and basic role of
protocols for synchronized access to shared data in all these methods
(except in "puristic" message passing).
Synchronization mechanisms which exceed the framework described above,
while promoting the role of hardware, have been proposed by various
researchers. Some of these proposals are aimed at hardware implementations
of barrier synchronization or synchronized wait, as described in the
papers entitled "A Controllable MIMD Architecture," S. F. Lundstrom et al,
Proceedings of the 1980 International Conference on Parallel Processing,
pp. 19-27 and "The Fuzzy Barrier: A Mechanism for High speed
Synchronization of Processors," R. Gupta, Proc. of the Conf. on
Architectural Support for Programming Languages and Operating Systems,
ASPLOS-III, 1989, pp. 54-63. A more general hardware mechanism, which is
aimed at arbitrary parallelism patterns, is based on routing of control
tokens, but is oriented towards essentially static work allocation, is
proposed in the paper entitled "A Hardware Task Scheduling Mechanism for
Real-Time Multi-Microprocessor Architecture," A. D. Hurt et al,
Proceedings of the 1982 Real-Time Systems Symposium, pp. 113-123. A
centralized synchronization/scheduling facility, targeted at arbitrary
parallelism patterns and at dynamic allocation and scheduling, was argued
for in the paper by D. Gajski supra, but no specific architecture was
proposed.
Therefore, it would be desirable to provide a global
synchronization/scheduling unit which is capable of dynamic allocation and
scheduling in a multiprocessor system.
SUMMARY OF THE INVENTION
Accordingly, it is a principal object of the present invention to overcome
the above-mentioned disadvantages of the prior art, and provide apparatus
for accomplishing a high synchronization/scheduling rate, adequate for
massively parallel multiprocessors.
It is another object of the invention to provided the
synchronization/scheduling apparatus with the capability of fast detection
of events where dormant computational tasks become allowed for execution.
It is still a further object of the invention to provide a global
synchronization/scheduling subsystem which offloads
synchronization/scheduling-related overhead activity from the processors.
In accordance with a preferred embodiment of the present invention, there
is provided a hardware synchronization/scheduling apparatus for performing
synchronization/scheduling in a multiprocessor system by controlling,
during program run-time, a process of monitoring and detecting which
computational tasks are allowed for execution and allocating computational
tasks to processors, the tasks being represented by instructions and data
accessible to the processors via instruction and data storage hardware,
said synchronization/scheduling apparatus comprising:
means for monitoring and detecting which computational tasks are allowed
for execution, said monitoring/detecting means containing a task map
describing the precedence relations among the computational tasks of the
program; and
communication/distribution means for distributing, to the processors,
information on computational tasks detected by said monitoring/detecting
means to be allowed for execution and for forwarding, to said
monitoring/detecting means, information on termination of execution of
computational tasks at the processors,
Said communication/distribution means comprising a network of nodes
processing both the capability of decomposing information on a pack of
allocated computational tasks into messages of finer sub-packs of
allocated computational tasks to be sent toward the processors, and
possessing the capability of unifying packs of information on termination
of computational tasks into a more comprehensive pack, to be sent to said
means for monitoring/detecting of allowed computational tasks.
In the preferred embodiment, the proposed architecture adds a
synchronization/scheduling subsystem to the multiprocessor subsystem. This
consists of a central unit and an active distribution network controlling
the multiple processors. The synchronization/scheduling subsystem is
programmed with the complete synchronization and scheduling information
for the executed parallel algorithm.
The synchronization/scheduling subsystem comprises a task map which
contains dependencies between tasks to be performed by the processors. The
code of the tasks is loaded in the shared memory space, whereas the
topology of the task map is held by the synchronizer/scheduler. The latter
uses the task map for allocating tasks to the processors. While doing so,
it continuously monitors the employment state of the processors, and makes
allocations dynamically and on the basis of processor availability. A task
is allocated by signalling its identification information across the link
between the synchronizer/scheduler and the designated processor. Further
parameters, or data to be processed by the task, may be acquired from the
shared memory.
When allocated a task, a processor is caused to restart, with the first
instruction fetch address determined by the task's identity. The processor
then proceeds in executing instructions fetched from memory, until
encountering an instruction signifying the end of the task. The processor
then enters a halted state, exporting an indication of its new state to
the synchronizer/scheduler. This indication, when received by the
synchronizer/scheduler, serves a twofold function: First, the processor is
marked free and ready for another allocation. Secondly, the event of the
current task's termination is internally marked, and may cause the
enabling of other tasks which depend on the current one. A task is said to
be enabled when it is recognized by the system to be allowed for
execution. The enabling of a dependent task takes place if all its other
input dependencies have already been activated in a similar manner. (OR
relations between input dependencies are also possible, and discussed
further herein with regard to the architecture's underlying programming
model).
In addition to a task map, the synchronizer/scheduler is supplied with the
system configuration data. This includes such details as the number of
processors, the capabilities of each processor (if processors are not
a-priori identical), etc.
Given a set of enabled tasks, as well as processor availability data, the
synchronizer/scheduler then performs scheduling of those tasks. Any
non-random scheduling policy must rely upon some heuristics: Even when
task execution times are known in advance, finding an optimal schedule for
a program represented as a dependency graph is an NP-complete problem.
Most scheduling heuristics are bases on the critical path method, and
thereby belong to the class of list scheduling policies; i.e., policies
that rely on a list of fixed task priorities. List scheduling can be
supported by the inventive scheme described herein, by embedding task
priorities in the task map load-module submitted to the
synchronizer/scheduler. Whenever an allocation takes place, the allocated
tasks are those which have highest priorities amongst the current
selection of enabled tasks.
The general architectural concepts described so far may be implemented in
multiple alternative ways. The processors may range from minute processing
elements to large scientific processors. They are not limited to any
specific type, and are not confined to the von-Neumann model. They can
also be compound processing units. The architecture may also be applied to
non-homogeneous systems. The shared memory may consist of physically
shared storage, possibly accessed through an interconnection network, or
be distributed over the processors, as long as common memory space is
preserved, at least in part.
Other features and advantages of the invention will become apparent from
the following drawings and description.
BRIEF DESCRIPTION OF THE DRAWINGS
For a better understanding of the invention with regard to the embodiments
thereof, reference is made to the accompanying drawings in which like
numerals designate corresponding elements or sections throughout, and in
which:
FIG. 1 shows a multiprocessor system architecture featuring a
synchronizer/scheduler apparatus constructed and operated in accordance
with the principles of the present invention;
FIG. 2 is a graph illustrating bounds on the overall slowdown in program
execution as a function of the available synchronization/scheduling rate
capability;
FIGS. 3a-f show graphical notations for programming features associated
with a programming model useful in constructing a task map for
multiprocessor control;
FIG. 4 is a schematic diagram of the synchronizer/scheduler apparatus
architecture, featuring a central synchronization/scheduling unit and a
distribution network;
FIG. 5 is a schematic diagram of an interface through which the central
synchronization/scheduling unit communicates with the distribution network
of FIG. 4, and through which nodes of the distribution network communicate
with each other;
FIG. 6 is a schematic diagram of an example standard input dependency
structure; and
FIG. 7 is a schematic diagram architecture of a preferred embodiment of the
central synchronization/scheduling unit of the invention.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
Reference is now made to FIGS. 1-7, showing various aspects of the proposed
architecture of a synchronization/scheduling subsystem for a
multiprocessor. The proposed synchronizer/scheduler subsystem consists of
a central scheduling unit (CSU) and an active distribution network
controlling the multiple processors. The proposed subsystem is programmed
with the synchronization and scheduling information for the executed
parallel algorithm. The next section contains a discussion of the general
system architecture, its performance and its goals. The architecture's
underlying programming model is then described. Following this, the
general architecture of the synchronizer/scheduler subsystem is discussed,
and finally a detailed discussion of the architecture of the central unit
of the subsystem is presented.
(I)
SYSTEM ARCHITECTURE AND EXPECTED PERFORMANCE
A program intended for execution on a multiprocessor which incorporates the
synchronization/scheduling scheme described herein, must be represented by
a dependency graph. The dependency graph is called the program's task map;
Its nodes represent tasks, and its (directed) edges represent task
interdependencies. Tasks are granules of computation, of any desired size
(e.g. they may embrace any number of machine instructions). The graph may
contain cycles. The task map is submitted to the hardware, and used during
run-time. This dependency-graph-driven mode of computation, attributed by
non-elementary granularity, is referred to in the paper to Gajski et al
supra as macro dataflow. Yet according to a terminology introduced by
Treleaven et al, in the paper entitled "Combining Data Flow and Control
Flow Computing," Computer Journal, Vol. 25 No. 2, 1982, pp. 207-217, it
may rather be referred to as multi-threaded control flow. That is because
the data communication mechanism (namely, the shared memory) is distinct
here from the synchronization/scheduling mechanism, and dependency arcs do
not necessarily "carry data" but denote control flow.
There is a distribution between a task, which is a quantum of program code
and a task map object, and a task instantiation, which is an execution
process derived from a task. The reason for this seemingly subtle
distinction will be made clear in a later section. Until then, for the
sake of simplicity, this distinction is ignored.
The multiprocessor architecture is illustrated in FIG. 1. As can be seen,
the parallel operation coordination subsystem (synchronizer/scheduler 10)
forms an appendage to a conventional configuration of a shared-memory 12
and processors 14.
The synchronization/scheduling subsystem comprises a task map which
contains dependencies between tasks to be performed by the processors 14.
The code of the tasks is loaded in memory, whereas the topology of the
task map is held by the synchronizer/scheduler 10. The latter uses the
task map for allocating tasks to processors 14. While doing so, it
continuously monitors the processors 14 employment state, and makes
allocations dynamically and on the basis of processors availability. A
task is allocated by signalling its identification information across the
link 16 between the synchronizer/scheduler and the designated processor.
Further parameters, or data to be processed by the task, are acquired from
the shared memory 12 via link 18.
When allocated a task, a processor 14 is caused to restart, with the first
instruction fetch address determined by the task's identity. The processor
14 then proceeds in executing instructions fetched from main memory, until
encountering an instruction signifying the end of the task. The processor
14 then enters a halted state, exporting an indication of its new state to
synchronizer/scheduler 10. This indication, when received by
synchronizer/scheduler 10, serves a twofold function: First, the processor
14 is marked free and ready for another allocation. Secondly, the event of
the current task's termination is internally marked, and may cause the
enabling of other tasks which depend on the current one. The enabling of a
dependent task takes place if all its other input dependencies have
already been activated in a similar manner. (OR relations between input
dependencies are also possible, and discussed further herein with regard
to the architecture's underlying programming model).
In addition to a task map, synchronizer/scheduler 10 is supplied with the
system configuration data. This includes such details as the number of
processors, the capabilities of each processor (if the processors are not
a-priori identical), etc.
Given a set of enabled tasks, as well as processor availability data,
synchronizer/scheduler 10 then performs scheduling of those tasks. Any
non-random scheduling policy must rely upon some heuristics: Even when
task execution times are known in advance, finding an optimal schedule for
a program represented as a dependency graph is an NP-complete problem, as
described in the paper entitled "NP-Complete Scheduling Problems," J. D.
Ullman, J. Comput. Syst. Sci., Vol. 10, June 1975, pp. 384-393. Most
scheduling heuristics are bases on the critical path method, and thereby
belong to the class of list scheduling policies; i.e., policies that rely
on a list of fixed task priorities as described in the paper by Gransky et
al supra, and the text entitled "Computer and Job-Shop Scheduling and
Theory," E. Coffman, Wiley Publishers, New York, 1976. List scheduling can
be supported by the inventive scheme described herein, by embedding task
priorities in the task map load-module submitted to the
synchronizer/scheduler. Whenever an allocation takes place, the allocated
tasks are those which have highest priorities amongst the current
selection of enabled tasks.
The general architectural concepts described so far may be implemented in
multiple alternative ways. The processors may range from minute processing
elements to large scientific processors. They are not limited to any
specific type, and are not confined to the von-Neumann model. They can
also be compound processing units. The architecture may also be applied to
non-homogeneous systems. The shared memory may consist of physically
shared storage, possibly accessed through an interconnection network, or
be distributed over the processors, as long as the common memory space is
preserved, at least in part.
Performance Bounds
An immediate merit of the herein described scheme is that any parallel
operation coordination overhead is offloaded from the processors. This
activity is shifted to the special purpose hardware, and is performed in
parallel with productive computation. This carries the potential for a
significant shrink in overhead per synchronization point. Another merit is
the optimal load balancing, attained due to the fact that allocations are
performed dynamically, on a global basis, and are driven by processors
availability. Optimal load balancing means that no situation can occur
where enabled computational work becomes committed to a specific portion
of the system which cannot accomodate it at that moment, while a processor
not belonging to that portion is idling. It is clear that the conditions
specified above ensure optimal load balancing by definition.
Synchronization rate is measured here as the total flow-rate of task
initiation (namely, synchronization flow-rate) across the
synchronizer/scheduler's ports. If the synchronizer/scheduler provides too
low a synchronization flow-rate, a synchronization bottleneck may result.
In considering the question of whether the synchronizer/scheduler's
flow-rate capability constitutes a bottleneck in comparison to the
requirements of the parallel program, the graph of FIG. 2 may be used.
The horizontal axis depicts the given maximal flow-rate of
synchronizer/scheduler apparatus 10, scaled in terms of a measure which is
called the canonical flow-rate, or fc, which is a parameter of the program
being executed, and is the only such parameter involved in the analysis.
It is defined as the average flow-rate developed when the program is run
on a system containing an infinite number of processors and an ideal
synchronizer/scheduler 10 apparatus, one having infinite flow-rae
capability. An equivalent definition would be the ratio between the total
number of task executions that must occur, and the length of the critical
path on the program's task map. In the ideal execution process, the
momentary flow-rate may sharply deviate from the average fc.
The vertical axis of the graph of FIG. 2 depicts the overall slowdown in
program execution, incurred by the given limitation on the
synchronizer/scheduler's flow-rate, still under the assumption that the
number of processors is unlimited. This assumption established a
"worst-case" condition; its relaxation implies a potential decrease in the
demand for flow-rate. A lower bound and an upper bound on slowdown are
depicted. The lower bound reflects the fact that the time needed to
complete the execution for a program can be not shorter than the minimal
time needed for allocating all of its tasks. The lower bound is valid for
any synchronization/scheduling mechanism whatsoever, whether based upon
hardware or upon software. The upper bound is valid only under the
assumption that the processors are relieved from any
synchronization/scheduling-related overhead activity, as happens in this
invention. The upper bound can be proven mathematically, based on the
assumption that the flow-rate is only semi-finite, in the sense of having
only one direction of limitation. The mathematical proof and rationale for
this assumption are included in the material by N. Bayer entitled, "A
Hardware-Synchronized/Scheduled Multiprocessor Model," submitted as a M.
Sc. Thesis, EE Department, Technion, Israel Institute of Technology,
January 1989 (as yet unpublished).
In order to sustain high flow-rate, it is also important to attain low
enabling latency. This parameter reflects the time which elapses from the
termination of the last task which prohibits the enabling of another task,
until the latter can be allocated. Low enabling latency is desirable in
order to allow efficient parallelism even in the more difficult and
challenging cases, when the program's degree of parallelism is roughly the
same as the number of processors, i.e. when there is no large reservoir of
enabled tasks.
(II)
THE UNDERLYING PROGRAMMING MODEL
The programming model is the collection of rules and options serving for
the construction of task maps, which is directly supported by the
hardware. A task map coded according to this programming model will
closely correspond to the load-module submitted to the
synchronizer/scheduler. Preparation of the ultimate load-module will not
include any essential transformation of the program.
This layer may serve as the basis for the definition of higher layers.
Tools such as compilers and macro-expanders can be developed, which accept
more powerful and abstract software constructs and translate them into
task maps of the kind directly supported by the hardware.
Consolidation of the programming model includes software aspect related
considerations, associated with an assessment of its computational power
vs. the hardware investment needed. In the following description, the
details of the programming model for the high flow-rate
synchronizer/scheduler architecture are presented by review of the
programming features which are illustrated by graphic notations shown in
FIG. 3.
FIG. 3a shows the AND/OR relations between task input dependencies, with
the standard task input dependency mechanism implementing a
product-of-sums logic. Arrows entering a task symbol denote AND related
dependencies, in accordance with the common notation convention for
dependency graphs. Arrows approaching a task symbol via a circle sign
denote OR related dependencies.
FIG. 3b shows pre-enabled task notation, with each program task being
initialized as enable or non-enabled. Those initialized as enable are
called pre-enable, and must be specifically declared so.
FIG. 3c shows dummy (or degenerated) tasks, noted as D-tasks, which when
enabled, are not subjected to allocation; instead they are immediately
declared as terminated, internally to the synchronizer/scheduler. D-tasks
serve to express non-standard input dependency schemes and to manipulate
dependency structures.
FIG. 3d shows reset tasks, noted as R-tasks which, similar to D-tasks, are
also treated internally within the synchronizer/scheduler. However, an
R-task does have an execution body: It rests all input dependencies of the
tasks governed by it to a non-active state. It is useful for purging of
"control tokens."
FIG. 3e shows conditioning tasks, which is the mechanism underlying global
conditioning (task-local conditioning is implemented using the processor's
branching instructions). The global conditioning mechanism is based upon a
scalar boolean value, named termination condition (t.sub.-- cond), which
is returned to the synchronizer/scheduler upon the termination of each
task. When a task begins, its t.sub.-- cond is automatically initialized
to a "1" value. The task is allowed access to the t.sub.-- cond as a
variable, and rests it to "0".
If a task is denoted t.sub.o, each output dependency of t.sub.o may be of
type ".phi.", "0" or "1". Dependencies of types "0" and "1" are activated
upon termination of t.sub.o only in conjunction with the appropriate
t.sub.-- cond value. A task having at least one non-".phi." output
dependency is termed a conditioning task, and must be explicitly declared
so. The ".phi." signs are omitted in the graphic notation from output
dependencies not belonging to conditioning tasks.
FIG. 3f shows duplicable tasks, which constitute a mechanism for supporting
a particular form of dynamic process generation. Let <task.sub.-- id> be a
duplicate task. The enabling of <task.sub.-- id> generates <inst.sub.--
quota> instantiations pending for allocation. Execution of these
instantiations is in SPMD style, as described in the paper "Programming
for Parallelism", A. H. Karp, IEEE Computer, Vol. 20 No. 5, May 1987, pp.
43-57. All processors receiving an instantiation of <task.sub.-- id>
execute the same code, by under the modification of the instance number
transmitted by the synchronizer/scheduler. The event of <task.sub.-- id>'s
termination is identified with the termination of its last instantiation.
The number of instantiations <inst.sub.-- quota> is initialized at
compile-time, but may be updated at run-time by the processors. For this
purpose, the internal register within the synchronizer/scheduler dedicated
to <task.sub.-- id> is subject to external access (write only), as if it
were a global memory cell. A duplicate task cannot be a conditioning one.
The introduction of duplicable tasks necessitates the following refinements
in terminology: the term "computational task" refers to pieces of
computational work in general. However, in the context of this embodiment,
computational tasks performed by the processor are referred to a "task
instantiations" or briefly, instantiations, while the term "task" is
reserved for the task-met objects themselves.
(III)
GENERAL ARCHITECTURE OF THE SYNCHRONIZER/SCHEDULER
FIG. 4 illustrates the synchronizer/scheduler apparatus architecture. It is
divided into two modules: A central synchronization/scheduling unit (CSU)
and a distribution network. The distribution network mediates between the
CSU, which constitutes the heart of the synchronizer/scheduler, and the
processors. Its function is not the mere passive data transfer, but as
further described herein, it creates an effect of amplifying the
synchronizer/scheduler apparatus 10 flow-rate, in comparison with the
flow-rate of the CSU alone. As this distribution network shares some
common features with combining networks for shared memory access as
described in the paper to Gottlieb et al supra, they will compared at the
end of this section.
While the internal implementation of the CSU constitutes the theme of the
next section, this section discusses the architecture and operation of the
synchronizer/scheduler as a whole.
The proposed structure is founded upon the inclusion of duplicable tasks in
the programming model. The existence of duplicable tasks in a program
helps make the total number of task enablings lower than the total number
of task-instantiations allocated to processors (the enabling of a
duplicable task is considered a single enabling). Thus, the average rate
of task enablings, denoted f.sub.e, is liable to be smaller than the
flow-rate of allocating task-instantiations to processors, denoted
f.sub.a. The ratio f.sub.r =f.sub.a /f.sub.e is equal to the average
number of instantiations per task (the average calculation includes also
the regular tasks, which release exactly one instantiation per enabling,
but does not include D-tasks and R-tasks). The factor f.sub.r which is a
property of the program in combination with its input, is likely to reach
orders of magnitude of tens, hundreds, or even more, as indicated by
benchmark examples. Examples: instantiation quotas of duplicate tasks
correspond to the sizes of blocks within the complete matrix; in
particle-system simulation programs, the instantiation quotas of
duplicable tasks may correspond to the number of interaction/maintaining
particle pairs.
The interface between the distribution network and the processors carries
individual allocation and termination messages, whereas the interface
between the CSU and the distribution network carries allocation packs and
termination packs. A pack contains one instantiation or more out of the
collection of instantiations released by a single task-enabling. If the
pack contains all the instantiations which were released, it is called a
complete pack; otherwise it is called a partial pack. The instantiation
indices belonging to an allocation pack must form a continuous sequence.
The coding of packs employ a certain form of compression, such that the
coding format employs a fixed number of bits: The task's identity is
always coded; in an allocation pack the sequence of indices is also coded,
e.g. as a pair, incorporating the first index and the sequence length. In
a termination pack, the instantiation indices have no importance, and only
their quantity is coded. For the purpose of discussing communications
flow-rates, and due to this manner of coding, packs and individual
messages will be counted according to the same measuring-rod.
The task map is concentrated in the CSU, which monitors the enabling of
tasks. Allocation packs sent by the CSU are decomposed during their
passage through the distribution network, and delivered to the processors
as individual instantiations. The opposite operation, termed herein merge,
is performed on termination messages. In this way, the communications
flow-rate between the processors and distribution network may be amplified
in comparison to the flow-rate of communications between the distribution
network and the CSU. Namely, f.sub.csu <f.sub.a where f.sub.csu denotes
the total communications flow-rate across the CSU interfaces.
It should be noted that f.sub.csu is not always the same as f.sub.e.
Consider operating conditions where the collection of processors functions
as a sink, i.e. it is willing to absorb any instantiation of any enabled
task immediately, and there is a large reservoir of enabled tasks which is
steadily reproduced. Under such a situation, the CSU sends and receives
co | | |