WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
High flow-rate synchronizer/scheduler apparatus and method for multiprocessors    
United States Patent5202987   
Link to this pagehttp://www.wikipatents.com/5202987.html
Inventor(s)Bayer; Nimrod (7 Gordon Street, Givataim, IL); Ginosar; Ran (Nofit 104 (near Tivon), IL)
AbstractA high flow-rate synchronizer/scheduler apparatus for a mutiprocessor system during program run-time, comprises a connection matrix for monitoring and detecting computational tasks which are allowed for execution containing a task map and a network of nodes for distributing to the processors information or computational tasks detected to be enabled by the connection matrix. The network of nodes possesses the capability of decomposing information on a pack of allocated computational tasks into messages of finer sub-packs to be sent toward the processors, as well as the capability of unifying packs of information on termination of computational tasks into a more comprehensive pack. A method of performing the synchronization/scheduling in a multiprocessor system of this apparatus is also described.



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5202987
High flow-rate synchronizer/scheduler apparatus and method for

     multiprocessors - US Patent 5202987 Drawing
High flow-rate synchronizer/scheduler apparatus and method for multiprocessors
Inventor     Bayer; Nimrod (7 Gordon Street, Givataim, IL); Ginosar; Ran (Nofit 104 (near Tivon) Nofit 104
Owner/Assignee    
Patent assignment
All assignments
Publication Date     April 13, 1993
Application Number     07/641,250
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     January 15, 1991
US Classification     718/102 713/375 718/107
Int'l Classification     G06F 015/16
Examiner     Heckler; Thomas M.
Assistant Examiner    
Attorney/Law Firm     Morgan & Finnegan
Address
Parent Case    
Priority Data     Feb 01, 1990[IL]093239
USPTO Field of Search     364/DIG. 1 395/650
Patent Tags     high flow-rate synchronizer/scheduler for multiprocessors
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
4985831
Dulong
718/106
Jan,1991

[0 after 0 votes]
4967326
May
712/21
Oct,1990

[0 after 0 votes]
4954945
Inoue
718/105
Sep,1990

[0 after 0 votes]
4779194
Jennings
718/106
Oct,1988

[0 after 0 votes]
4590555
Bourrez
718/103
May,1986

[0 after 0 votes]
4333144
Whiteside
718/102
Jun,1982

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


We claim:

1. A hardware synchronization/scheduling apparatus for performing synchronization/scheduling in a multiprocessor system by controlling, during program run-time, a process of monitoring and detecting which computational tasks are allowed for execution and allocating computational tasks to processors, the tasks being represented by instructions and data accessible to the processors via instruction and data storage hardware, said synchronization/scheduling apparatus comprising:

means for monitoring and detecting which computational tasks are allowed for execution, said monitoring/detecting means containing a task map describing the precedence relations among the computational tasks of the program, and

communication/distribution means connected to said monitoring/detecting means, and to said processors for distributing, to the processors, information on computational tasks detected by said monitoring/detecting means to be allowed for execution in a processor and for forwarding, to said monitoring/detecting means, information on termination of execution of computational tasks at the processors, said communication/distribution means comprising a network of nodes possessing both the capability of decomposing information on a pack of allocated computational tasks into messages of finer partial packs of allocated computational tasks to be sent toward the processors and possessing the capability of unifying packs of information on termination of computational tasks into a more comprehensive pack, to be sent to said means for monitoring/detecting of allowed computational tasks.

2. The apparatus of claim 1 wherein said monitoring/detecting means comprises a connection matrix having a set of connections between rows and columns thereof, said set of connections representing said task map and being programmable, an enabling cell attached to said connection matrix detecting a specific computational task allowed for execution in a processor.

3. The apparatus of claim 1 wherein said communication/distribution means comprises a modular distribution network configured in modular fashion from a set of distribution units according to a desired configuration.

4. A method of performing synchronization/scheduling in a multiprocessor system by controlling, during run-time, a process of monitoring and detecting which computational tasks are allowed for execution and allocating computational tasks to processors, the tasks being represented by instructions and data accessible to the processors via instruction and data storage hardware, said method comprising the steps of:

monitoring and detecting which computational tasks are allowed for execution in accordance with a task map describing the precedence relations among the computational tasks of the program; and

distributing to the processors information on computational tasks detected in said monitoring and detecting step to be allowed for execution in a processor and forwarding information on termination of execution of computational tasks at the processors, said distributing step being performed in a network comprising nodes possessing both the capability of decomposing information on a pack of allocated computational tasks into messages of finer partial packs of allocated computational tasks to be sent toward the processors and possessing the capability of unifying packs of information on termination of computational tasks into a more comprehensive pack, provided in said monitoring and detecting step with respect to allowed computational tasks.

5. The method of claim 4 wherein said monitoring and detecting step is performed by a monitoring and detecting apparatus separate from said processors.

6. The method of claim 4 wherein global conditioning is performed based on termination conditions produced by the processors and transmitted via said network as part of said information on termination of computational tasks, without requiring conditioning computations or accessing of said data storage hardware during said monitoring and detecting step.

7. The method of claim 4, wherein a task of said task map embraces a multiplicity of instantiations, including terminated instantiations, the number of instantiations being controllable by the processors via direct access to registers maintained by said monitoring and detection apparatus.

8. The method of claim 4 wherein as part of said distributing step, forwarding information on termination of execution of computational tasks at the processors comprises separate forwarding of termination packs containing a quantity of terminated instantiations and forwarding of messages expressing quantities of processors which have entered a halted state.

9. The method of claim 4 wherein configuration data of said network is maintained in distributed fashion and processor employment data are also distributed.

10. The method of claim 4 wherein as part of said distributing step, decomposing information on a pack of allocated computational tasks into messages of finer partial packs of allocated computational tasks to be sent toward the processors is performed in an adaptive fashion, involving decisions local to a specific node of said network, based on processor availability, and using allocation advances through storage of allocation packs in said node.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

The present invention relates to an architectural solution to the problem of accomplishing efficient synchronization, scheduling and work allocation in multiprocessors.

BACKGROUND OF THE INVENTION

The coordination of multiple operations in shared memory multiprocessors often constitutes a substantial performance bottleneck. Process synchronization and scheduling are generally performed by software, and managed via shared memory. Execution of parallel programs on a shared-memory, speedup-oriented multiprocessor necessitates a means for synchronizing the activities of the individual processors. This necessity arises due to precedence constraints within algorithms: When one computation is dependent upon the result of other computations, it must not commence before they finish. In the general case, such constraints are projected onto an algorithm's parallel decomposition, and reflected as precedence relations among its execution threads.

Synchronization is only one aspect of a broad activity, which may be termed parallel operation coordination, whose other aspects are scheduling and work allocation. Scheduling is selecting an execution order for the operations of a program, out of a space of execution orders which are feasible under the given architecture and precedence constraints, as described in the paper entitled "The Effect of Operation Scheduling on the Performance of a Data Flow Computer," M. Gransky et al, IEEE Trans. on Computers, Vol. C-36 No. 9, September 1987, pp. 1019-1029. While scheduling deals with the point of view of the tasks to be computed, work allocation deals with the point of view of the processors which carry out the tasks. Thus, the distinction between scheduling an allocation is not clear-cut, and some researchers use these terms interchangeably. The decisive questions may be posed as follows: "which ready-to-run piece of work should be executed first ?" which is a matter of scheduling policy; questions of the sort "to which processor should a given piece of work be allocated ?" or "how much work should be allocated at once to a given processor ?", are considered to be a matter of allocation policy. Scheduling and allocation may be static, i.e. determined before program run-time.

In fully dynamic systems, all these coordination activities are not an inherent part of the actual computation, but are rather designed to support it. Since they consume computational resources, they are considered as overhead. Coordination or synchronization efficiency, refers to the efficiency of parallel operation coordination activity itself, excluding the indirect effects of scheduling policy.

The overall multiprocessor performance is influenced significantly by the efficiency of coordination, as described in the book entitled "High-Performance Computer Architecture", H. S. Stone, Addison-Wesley, 1987, and in the papers entitled "Execution of Parallel Loops on Parallel Processor Systems," C. D. Polychronopoulos et al, Proc. Int. Conf. on Parallel Processing, 1986, pp. 519-527: "A Technique for Reducing Synchronization Overhead in Large Scale Multiprocessors", Z. Li et al. Proc. of the 12th Symp. on Computer Architecture, 1985, pp. 284-291; "The Piecewise Data Flow Architecture: Architectural Concepts," J. E. Requa et al. IEEE Trans. on Computers, Vol. C-32 No. 5, May 1983, pp. 425-438; "A Case Study in the Application of a Tightly Coupled Multiprocessor to Scientific Computations," N. S. Ostlund et al, Parallel Computations, G. Rodrigue, editor, Academic Press, 1982, pp. 315-364; "Synchronized and Asynchronous Parallel Algorithms for Multiprocessors," H. T. Kung, Algorithms and Complexity, Academic Press, 1976, pp. 153-200; and "A Survey of Synchronization Methods for Prallel Computers," A Dinning, IEEE Computer, Vol. 20 No. 19, January 1987, pp. 100- 109.

Inefficiencies in these processes are manifested in overhead-activity and overhead-idling. The former is the activity which is required, once a task has been computed, to obtain a new piece of productive work, while the latter is due to contention of synchronization resources, which are system-global by nature.

Overhead-idling is principally caused by insufficient synchronization rate capability. As noted in the text by H. S. Stone supra, this capability (expressed in MSYPS, Millions of Synchronizations Per Second) constitutes an independent architectural measure; in particular, it is not necessarily proportionate to the system's overall raw processing power, as expressed MIPS and MFLOPS. Decompositing a given algorithm into ever finer granularity levels will yield an ever increasing demand for synchronization rate, and an ever bigger ratio of overhead-activity to productive computation. Thus, at some level of granularity, synchronization may become a bottleneck, thereby practically limiting the exploitable level of parallelism. Consequently, it is desirable to search for means to increase the synchronization rate capability and to reduce the coordination overhead activity of multiprocessor systems.

Synchronization methods for multiprocessors were born out of mutual exclusion methods, prevalent in multiprogrammed uniprocessors. Still, synchronization is usually implemented around special synchronization data in main memory, as described in the paper entitled "Synchronization, Coherence, and Event Ordering in Multiprocessors," M. Dubois et al, IEEE Computer, Vol. 21 No. 2, February 1988, pp. 9-22. These synchronization data are either stand-alone (e.g. locks and semaphores), or attached to regular data objects (such as presence bits). A variety of synchronization primitives, such as Test & Set or Fetch & Add. serve to establish access to synchronization variables and to manipulate them, as described in the paper entitled "The NYU Ultracomputer--Designing an MIMD shared Memory Parallel Processor," A. Gottlieb et al. IEEE Trans. on Computers, February 1983, pp. 175-89. The implementation of these primitives is based on some special hardware support, whether rudimentary or massive. Yet the essential levels of parallel operation coordination are implemented in software. Some examples of prominent commercial and research multiprocessors which are included in this framework are described in the following papers: "Cm*--A modular multi-microprocessor," R. J. Swan et al, AFIPS Conf. Proc., 1977 National Computer Conference, pp. 637-644; "Architecture and Applications of the HEP Multiprocessor Computer System," B. J. Smith, Real Time Signal Processing IV, Proceedings of SPIE, August 1981, pp. 241-248; "The IMB RP3 Introduction and Architecture," G. F. Pfister et al. Proc. Int. Conf. on Parallel Processing, August 1985, pp. 764-771; "Cedar", D. Gajski et al, Report No. UIUCDCS-R-83-1123. Department of Computer Science, University of Illinois, Urbana, February 1983, pp. 1-25; "Synchronization Scheme and its Applications for Large Multiprocessor Systems," C. Q. Zhu Proc. 4th Int. Conf. on Distributed Computing Systems, 1984, pp. 486-493; and "The Butterfly Parallel Processor," W. Crowther et al. Newsletter of the Computer Architecture Technical Committee (IEEE Computer Society), September/December 1985, pp. 18-45. Within this framework, efforts are aimed at improving synchronization efficiency were routed to the following directions: Development of enhanced hardware support for synchronization primitives (most notably - NYU Ultracomputer's combining network, as described in the paper by Gottlieb, supra.); development of more powerful synchronization primitives as described in the paper by C. Q. Zhu et al supra, and the paper by J. R. Goodman entitled "Efficient Synchronization Primitives for Large-Scale Cache-Coherent Multiprocessors," Proc. of the Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS-III, 1989, pp. 64-75; development of inherently asynchronous parallel algorithms, as described in the paper by H. T. Kung supra; and development of various techniques for synchronization minimization, as described in the paper by Z. Li et al, and in the paper entitled "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers," C. D. Polychronopoulos et al, IEEE Trans. on Computers, Vol. C-36 No. 12, December 1987, pp. 1425-1439.

A recent survey of synchronization methods contained in the paper by Dinning supra, describes in detail the synchronization mechanisms of seven machines. While giving a classification for prevalent synchronization methods, the paper by Dinning supra confirms the central and basic role of protocols for synchronized access to shared data in all these methods (except in "puristic" message passing).

Synchronization mechanisms which exceed the framework described above, while promoting the role of hardware, have been proposed by various researchers. Some of these proposals are aimed at hardware implementations of barrier synchronization or synchronized wait, as described in the papers entitled "A Controllable MIMD Architecture," S. F. Lundstrom et al, Proceedings of the 1980 International Conference on Parallel Processing, pp. 19-27 and "The Fuzzy Barrier: A Mechanism for High speed Synchronization of Processors," R. Gupta, Proc. of the Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS-III, 1989, pp. 54-63. A more general hardware mechanism, which is aimed at arbitrary parallelism patterns, is based on routing of control tokens, but is oriented towards essentially static work allocation, is proposed in the paper entitled "A Hardware Task Scheduling Mechanism for Real-Time Multi-Microprocessor Architecture," A. D. Hurt et al, Proceedings of the 1982 Real-Time Systems Symposium, pp. 113-123. A centralized synchronization/scheduling facility, targeted at arbitrary parallelism patterns and at dynamic allocation and scheduling, was argued for in the paper by D. Gajski supra, but no specific architecture was proposed.

Therefore, it would be desirable to provide a global synchronization/scheduling unit which is capable of dynamic allocation and scheduling in a multiprocessor system.

SUMMARY OF THE INVENTION

Accordingly, it is a principal object of the present invention to overcome the above-mentioned disadvantages of the prior art, and provide apparatus for accomplishing a high synchronization/scheduling rate, adequate for massively parallel multiprocessors.

It is another object of the invention to provided the synchronization/scheduling apparatus with the capability of fast detection of events where dormant computational tasks become allowed for execution.

It is still a further object of the invention to provide a global synchronization/scheduling subsystem which offloads synchronization/scheduling-related overhead activity from the processors.

In accordance with a preferred embodiment of the present invention, there is provided a hardware synchronization/scheduling apparatus for performing synchronization/scheduling in a multiprocessor system by controlling, during program run-time, a process of monitoring and detecting which computational tasks are allowed for execution and allocating computational tasks to processors, the tasks being represented by instructions and data accessible to the processors via instruction and data storage hardware, said synchronization/scheduling apparatus comprising:

means for monitoring and detecting which computational tasks are allowed for execution, said monitoring/detecting means containing a task map describing the precedence relations among the computational tasks of the program; and

communication/distribution means for distributing, to the processors, information on computational tasks detected by said monitoring/detecting means to be allowed for execution and for forwarding, to said monitoring/detecting means, information on termination of execution of computational tasks at the processors,

Said communication/distribution means comprising a network of nodes processing both the capability of decomposing information on a pack of allocated computational tasks into messages of finer sub-packs of allocated computational tasks to be sent toward the processors, and possessing the capability of unifying packs of information on termination of computational tasks into a more comprehensive pack, to be sent to said means for monitoring/detecting of allowed computational tasks.

In the preferred embodiment, the proposed architecture adds a synchronization/scheduling subsystem to the multiprocessor subsystem. This consists of a central unit and an active distribution network controlling the multiple processors. The synchronization/scheduling subsystem is programmed with the complete synchronization and scheduling information for the executed parallel algorithm.

The synchronization/scheduling subsystem comprises a task map which contains dependencies between tasks to be performed by the processors. The code of the tasks is loaded in the shared memory space, whereas the topology of the task map is held by the synchronizer/scheduler. The latter uses the task map for allocating tasks to the processors. While doing so, it continuously monitors the employment state of the processors, and makes allocations dynamically and on the basis of processor availability. A task is allocated by signalling its identification information across the link between the synchronizer/scheduler and the designated processor. Further parameters, or data to be processed by the task, may be acquired from the shared memory.

When allocated a task, a processor is caused to restart, with the first instruction fetch address determined by the task's identity. The processor then proceeds in executing instructions fetched from memory, until encountering an instruction signifying the end of the task. The processor then enters a halted state, exporting an indication of its new state to the synchronizer/scheduler. This indication, when received by the synchronizer/scheduler, serves a twofold function: First, the processor is marked free and ready for another allocation. Secondly, the event of the current task's termination is internally marked, and may cause the enabling of other tasks which depend on the current one. A task is said to be enabled when it is recognized by the system to be allowed for execution. The enabling of a dependent task takes place if all its other input dependencies have already been activated in a similar manner. (OR relations between input dependencies are also possible, and discussed further herein with regard to the architecture's underlying programming model).

In addition to a task map, the synchronizer/scheduler is supplied with the system configuration data. This includes such details as the number of processors, the capabilities of each processor (if processors are not a-priori identical), etc.

Given a set of enabled tasks, as well as processor availability data, the synchronizer/scheduler then performs scheduling of those tasks. Any non-random scheduling policy must rely upon some heuristics: Even when task execution times are known in advance, finding an optimal schedule for a program represented as a dependency graph is an NP-complete problem. Most scheduling heuristics are bases on the critical path method, and thereby belong to the class of list scheduling policies; i.e., policies that rely on a list of fixed task priorities. List scheduling can be supported by the inventive scheme described herein, by embedding task priorities in the task map load-module submitted to the synchronizer/scheduler. Whenever an allocation takes place, the allocated tasks are those which have highest priorities amongst the current selection of enabled tasks.

The general architectural concepts described so far may be implemented in multiple alternative ways. The processors may range from minute processing elements to large scientific processors. They are not limited to any specific type, and are not confined to the von-Neumann model. They can also be compound processing units. The architecture may also be applied to non-homogeneous systems. The shared memory may consist of physically shared storage, possibly accessed through an interconnection network, or be distributed over the processors, as long as common memory space is preserved, at least in part.

Other features and advantages of the invention will become apparent from the following drawings and description.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the invention with regard to the embodiments thereof, reference is made to the accompanying drawings in which like numerals designate corresponding elements or sections throughout, and in which:

FIG. 1 shows a multiprocessor system architecture featuring a synchronizer/scheduler apparatus constructed and operated in accordance with the principles of the present invention;

FIG. 2 is a graph illustrating bounds on the overall slowdown in program execution as a function of the available synchronization/scheduling rate capability;

FIGS. 3a-f show graphical notations for programming features associated with a programming model useful in constructing a task map for multiprocessor control;

FIG. 4 is a schematic diagram of the synchronizer/scheduler apparatus architecture, featuring a central synchronization/scheduling unit and a distribution network;

FIG. 5 is a schematic diagram of an interface through which the central synchronization/scheduling unit communicates with the distribution network of FIG. 4, and through which nodes of the distribution network communicate with each other;

FIG. 6 is a schematic diagram of an example standard input dependency structure; and

FIG. 7 is a schematic diagram architecture of a preferred embodiment of the central synchronization/scheduling unit of the invention.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT

Reference is now made to FIGS. 1-7, showing various aspects of the proposed architecture of a synchronization/scheduling subsystem for a multiprocessor. The proposed synchronizer/scheduler subsystem consists of a central scheduling unit (CSU) and an active distribution network controlling the multiple processors. The proposed subsystem is programmed with the synchronization and scheduling information for the executed parallel algorithm. The next section contains a discussion of the general system architecture, its performance and its goals. The architecture's underlying programming model is then described. Following this, the general architecture of the synchronizer/scheduler subsystem is discussed, and finally a detailed discussion of the architecture of the central unit of the subsystem is presented.

(I)

SYSTEM ARCHITECTURE AND EXPECTED PERFORMANCE

A program intended for execution on a multiprocessor which incorporates the synchronization/scheduling scheme described herein, must be represented by a dependency graph. The dependency graph is called the program's task map; Its nodes represent tasks, and its (directed) edges represent task interdependencies. Tasks are granules of computation, of any desired size (e.g. they may embrace any number of machine instructions). The graph may contain cycles. The task map is submitted to the hardware, and used during run-time. This dependency-graph-driven mode of computation, attributed by non-elementary granularity, is referred to in the paper to Gajski et al supra as macro dataflow. Yet according to a terminology introduced by Treleaven et al, in the paper entitled "Combining Data Flow and Control Flow Computing," Computer Journal, Vol. 25 No. 2, 1982, pp. 207-217, it may rather be referred to as multi-threaded control flow. That is because the data communication mechanism (namely, the shared memory) is distinct here from the synchronization/scheduling mechanism, and dependency arcs do not necessarily "carry data" but denote control flow.

There is a distribution between a task, which is a quantum of program code and a task map object, and a task instantiation, which is an execution process derived from a task. The reason for this seemingly subtle distinction will be made clear in a later section. Until then, for the sake of simplicity, this distinction is ignored.

The multiprocessor architecture is illustrated in FIG. 1. As can be seen, the parallel operation coordination subsystem (synchronizer/scheduler 10) forms an appendage to a conventional configuration of a shared-memory 12 and processors 14.

The synchronization/scheduling subsystem comprises a task map which contains dependencies between tasks to be performed by the processors 14. The code of the tasks is loaded in memory, whereas the topology of the task map is held by the synchronizer/scheduler 10. The latter uses the task map for allocating tasks to processors 14. While doing so, it continuously monitors the processors 14 employment state, and makes allocations dynamically and on the basis of processors availability. A task is allocated by signalling its identification information across the link 16 between the synchronizer/scheduler and the designated processor. Further parameters, or data to be processed by the task, are acquired from the shared memory 12 via link 18.

When allocated a task, a processor 14 is caused to restart, with the first instruction fetch address determined by the task's identity. The processor 14 then proceeds in executing instructions fetched from main memory, until encountering an instruction signifying the end of the task. The processor 14 then enters a halted state, exporting an indication of its new state to synchronizer/scheduler 10. This indication, when received by synchronizer/scheduler 10, serves a twofold function: First, the processor 14 is marked free and ready for another allocation. Secondly, the event of the current task's termination is internally marked, and may cause the enabling of other tasks which depend on the current one. The enabling of a dependent task takes place if all its other input dependencies have already been activated in a similar manner. (OR relations between input dependencies are also possible, and discussed further herein with regard to the architecture's underlying programming model).

In addition to a task map, synchronizer/scheduler 10 is supplied with the system configuration data. This includes such details as the number of processors, the capabilities of each processor (if the processors are not a-priori identical), etc.

Given a set of enabled tasks, as well as processor availability data, synchronizer/scheduler 10 then performs scheduling of those tasks. Any non-random scheduling policy must rely upon some heuristics: Even when task execution times are known in advance, finding an optimal schedule for a program represented as a dependency graph is an NP-complete problem, as described in the paper entitled "NP-Complete Scheduling Problems," J. D. Ullman, J. Comput. Syst. Sci., Vol. 10, June 1975, pp. 384-393. Most scheduling heuristics are bases on the critical path method, and thereby belong to the class of list scheduling policies; i.e., policies that rely on a list of fixed task priorities as described in the paper by Gransky et al supra, and the text entitled "Computer and Job-Shop Scheduling and Theory," E. Coffman, Wiley Publishers, New York, 1976. List scheduling can be supported by the inventive scheme described herein, by embedding task priorities in the task map load-module submitted to the synchronizer/scheduler. Whenever an allocation takes place, the allocated tasks are those which have highest priorities amongst the current selection of enabled tasks.

The general architectural concepts described so far may be implemented in multiple alternative ways. The processors may range from minute processing elements to large scientific processors. They are not limited to any specific type, and are not confined to the von-Neumann model. They can also be compound processing units. The architecture may also be applied to non-homogeneous systems. The shared memory may consist of physically shared storage, possibly accessed through an interconnection network, or be distributed over the processors, as long as the common memory space is preserved, at least in part.

Performance Bounds

An immediate merit of the herein described scheme is that any parallel operation coordination overhead is offloaded from the processors. This activity is shifted to the special purpose hardware, and is performed in parallel with productive computation. This carries the potential for a significant shrink in overhead per synchronization point. Another merit is the optimal load balancing, attained due to the fact that allocations are performed dynamically, on a global basis, and are driven by processors availability. Optimal load balancing means that no situation can occur where enabled computational work becomes committed to a specific portion of the system which cannot accomodate it at that moment, while a processor not belonging to that portion is idling. It is clear that the conditions specified above ensure optimal load balancing by definition.

Synchronization rate is measured here as the total flow-rate of task initiation (namely, synchronization flow-rate) across the synchronizer/scheduler's ports. If the synchronizer/scheduler provides too low a synchronization flow-rate, a synchronization bottleneck may result. In considering the question of whether the synchronizer/scheduler's flow-rate capability constitutes a bottleneck in comparison to the requirements of the parallel program, the graph of FIG. 2 may be used.

The horizontal axis depicts the given maximal flow-rate of synchronizer/scheduler apparatus 10, scaled in terms of a measure which is called the canonical flow-rate, or fc, which is a parameter of the program being executed, and is the only such parameter involved in the analysis. It is defined as the average flow-rate developed when the program is run on a system containing an infinite number of processors and an ideal synchronizer/scheduler 10 apparatus, one having infinite flow-rae capability. An equivalent definition would be the ratio between the total number of task executions that must occur, and the length of the critical path on the program's task map. In the ideal execution process, the momentary flow-rate may sharply deviate from the average fc.

The vertical axis of the graph of FIG. 2 depicts the overall slowdown in program execution, incurred by the given limitation on the synchronizer/scheduler's flow-rate, still under the assumption that the number of processors is unlimited. This assumption established a "worst-case" condition; its relaxation implies a potential decrease in the demand for flow-rate. A lower bound and an upper bound on slowdown are depicted. The lower bound reflects the fact that the time needed to complete the execution for a program can be not shorter than the minimal time needed for allocating all of its tasks. The lower bound is valid for any synchronization/scheduling mechanism whatsoever, whether based upon hardware or upon software. The upper bound is valid only under the assumption that the processors are relieved from any synchronization/scheduling-related overhead activity, as happens in this invention. The upper bound can be proven mathematically, based on the assumption that the flow-rate is only semi-finite, in the sense of having only one direction of limitation. The mathematical proof and rationale for this assumption are included in the material by N. Bayer entitled, "A Hardware-Synchronized/Scheduled Multiprocessor Model," submitted as a M. Sc. Thesis, EE Department, Technion, Israel Institute of Technology, January 1989 (as yet unpublished).

In order to sustain high flow-rate, it is also important to attain low enabling latency. This parameter reflects the time which elapses from the termination of the last task which prohibits the enabling of another task, until the latter can be allocated. Low enabling latency is desirable in order to allow efficient parallelism even in the more difficult and challenging cases, when the program's degree of parallelism is roughly the same as the number of processors, i.e. when there is no large reservoir of enabled tasks.

(II)

THE UNDERLYING PROGRAMMING MODEL

The programming model is the collection of rules and options serving for the construction of task maps, which is directly supported by the hardware. A task map coded according to this programming model will closely correspond to the load-module submitted to the synchronizer/scheduler. Preparation of the ultimate load-module will not include any essential transformation of the program.

This layer may serve as the basis for the definition of higher layers. Tools such as compilers and macro-expanders can be developed, which accept more powerful and abstract software constructs and translate them into task maps of the kind directly supported by the hardware.

Consolidation of the programming model includes software aspect related considerations, associated with an assessment of its computational power vs. the hardware investment needed. In the following description, the details of the programming model for the high flow-rate synchronizer/scheduler architecture are presented by review of the programming features which are illustrated by graphic notations shown in FIG. 3.

FIG. 3a shows the AND/OR relations between task input dependencies, with the standard task input dependency mechanism implementing a product-of-sums logic. Arrows entering a task symbol denote AND related dependencies, in accordance with the common notation convention for dependency graphs. Arrows approaching a task symbol via a circle sign denote OR related dependencies.

FIG. 3b shows pre-enabled task notation, with each program task being initialized as enable or non-enabled. Those initialized as enable are called pre-enable, and must be specifically declared so.

FIG. 3c shows dummy (or degenerated) tasks, noted as D-tasks, which when enabled, are not subjected to allocation; instead they are immediately declared as terminated, internally to the synchronizer/scheduler. D-tasks serve to express non-standard input dependency schemes and to manipulate dependency structures.

FIG. 3d shows reset tasks, noted as R-tasks which, similar to D-tasks, are also treated internally within the synchronizer/scheduler. However, an R-task does have an execution body: It rests all input dependencies of the tasks governed by it to a non-active state. It is useful for purging of "control tokens."

FIG. 3e shows conditioning tasks, which is the mechanism underlying global conditioning (task-local conditioning is implemented using the processor's branching instructions). The global conditioning mechanism is based upon a scalar boolean value, named termination condition (t.sub.-- cond), which is returned to the synchronizer/scheduler upon the termination of each task. When a task begins, its t.sub.-- cond is automatically initialized to a "1" value. The task is allowed access to the t.sub.-- cond as a variable, and rests it to "0".

If a task is denoted t.sub.o, each output dependency of t.sub.o may be of type ".phi.", "0" or "1". Dependencies of types "0" and "1" are activated upon termination of t.sub.o only in conjunction with the appropriate t.sub.-- cond value. A task having at least one non-".phi." output dependency is termed a conditioning task, and must be explicitly declared so. The ".phi." signs are omitted in the graphic notation from output dependencies not belonging to conditioning tasks.

FIG. 3f shows duplicable tasks, which constitute a mechanism for supporting a particular form of dynamic process generation. Let <task.sub.-- id> be a duplicate task. The enabling of <task.sub.-- id> generates <inst.sub.-- quota> instantiations pending for allocation. Execution of these instantiations is in SPMD style, as described in the paper "Programming for Parallelism", A. H. Karp, IEEE Computer, Vol. 20 No. 5, May 1987, pp. 43-57. All processors receiving an instantiation of <task.sub.-- id> execute the same code, by under the modification of the instance number transmitted by the synchronizer/scheduler. The event of <task.sub.-- id>'s termination is identified with the termination of its last instantiation. The number of instantiations <inst.sub.-- quota> is initialized at compile-time, but may be updated at run-time by the processors. For this purpose, the internal register within the synchronizer/scheduler dedicated to <task.sub.-- id> is subject to external access (write only), as if it were a global memory cell. A duplicate task cannot be a conditioning one.

The introduction of duplicable tasks necessitates the following refinements in terminology: the term "computational task" refers to pieces of computational work in general. However, in the context of this embodiment, computational tasks performed by the processor are referred to a "task instantiations" or briefly, instantiations, while the term "task" is reserved for the task-met objects themselves.

(III)

GENERAL ARCHITECTURE OF THE SYNCHRONIZER/SCHEDULER

FIG. 4 illustrates the synchronizer/scheduler apparatus architecture. It is divided into two modules: A central synchronization/scheduling unit (CSU) and a distribution network. The distribution network mediates between the CSU, which constitutes the heart of the synchronizer/scheduler, and the processors. Its function is not the mere passive data transfer, but as further described herein, it creates an effect of amplifying the synchronizer/scheduler apparatus 10 flow-rate, in comparison with the flow-rate of the CSU alone. As this distribution network shares some common features with combining networks for shared memory access as described in the paper to Gottlieb et al supra, they will compared at the end of this section.

While the internal implementation of the CSU constitutes the theme of the next section, this section discusses the architecture and operation of the synchronizer/scheduler as a whole.

The proposed structure is founded upon the inclusion of duplicable tasks in the programming model. The existence of duplicable tasks in a program helps make the total number of task enablings lower than the total number of task-instantiations allocated to processors (the enabling of a duplicable task is considered a single enabling). Thus, the average rate of task enablings, denoted f.sub.e, is liable to be smaller than the flow-rate of allocating task-instantiations to processors, denoted f.sub.a. The ratio f.sub.r =f.sub.a /f.sub.e is equal to the average number of instantiations per task (the average calculation includes also the regular tasks, which release exactly one instantiation per enabling, but does not include D-tasks and R-tasks). The factor f.sub.r which is a property of the program in combination with its input, is likely to reach orders of magnitude of tens, hundreds, or even more, as indicated by benchmark examples. Examples: instantiation quotas of duplicate tasks correspond to the sizes of blocks within the complete matrix; in particle-system simulation programs, the instantiation quotas of duplicable tasks may correspond to the number of interaction/maintaining particle pairs.

The interface between the distribution network and the processors carries individual allocation and termination messages, whereas the interface between the CSU and the distribution network carries allocation packs and termination packs. A pack contains one instantiation or more out of the collection of instantiations released by a single task-enabling. If the pack contains all the instantiations which were released, it is called a complete pack; otherwise it is called a partial pack. The instantiation indices belonging to an allocation pack must form a continuous sequence. The coding of packs employ a certain form of compression, such that the coding format employs a fixed number of bits: The task's identity is always coded; in an allocation pack the sequence of indices is also coded, e.g. as a pair, incorporating the first index and the sequence length. In a termination pack, the instantiation indices have no importance, and only their quantity is coded. For the purpose of discussing communications flow-rates, and due to this manner of coding, packs and individual messages will be counted according to the same measuring-rod.

The task map is concentrated in the CSU, which monitors the enabling of tasks. Allocation packs sent by the CSU are decomposed during their passage through the distribution network, and delivered to the processors as individual instantiations. The opposite operation, termed herein merge, is performed on termination messages. In this way, the communications flow-rate between the processors and distribution network may be amplified in comparison to the flow-rate of communications between the distribution network and the CSU. Namely, f.sub.csu <f.sub.a where f.sub.csu denotes the total communications flow-rate across the CSU interfaces.

It should be noted that f.sub.csu is not always the same as f.sub.e. Consider operating conditions where the collection of processors functions as a sink, i.e. it is willing to absorb any instantiation of any enabled task immediately, and there is a large reservoir of enabled tasks which is steadily reproduced. Under such a situation, the CSU sends and receives co