|
Description  |
|
|
BACKGROUND OF THE INVENTION
In the traditional von Neumann model of computing, instructions are
executed one at a time in sequence. A central feature of this model is the
instruction pointer. It points to the instruction that is currently
executing. Each instruction has a unique successor which is usually at the
following address that immediately follows the current instruction pointer
address. To execute the next instruction in a sequence, the instruction
pointer is usually incremented. Execution of that next instruction is then
imperative. The sequence of instructions executed in a program is often
referred to as a thread of computation.
Parallel processing machines are not limited to performing one thread at a
time. They perform a number of threads simultaneously. They achieve this
because they comprise multiple sequential von Neumann machines, each of
which performs a single thread of computation.
SUMMARY OF THE INVENTION
In accordance with one aspect of the present invention, at least one data
processing element is included in a data processing system. Multiple
processing elements are preferred to provide added parallel processing
capability. Each data processing element is itself capable of processing
individual imperative control flow instructions in performing plural
threads of computation. Successive instructions executed by a data
processing element may be from different threads of computation. Each data
processing element is preferably pipelined for high processing speed.
These processing elements are preferably reduced instruction set
processing elements so as to provide for easy implementation and high
processing speed.
Each processing element includes a thread descriptor storage such as a FIFO
queue for storing thread descriptors. Each thread descriptor (also
referred to as a token or a continuation) identifies the next instruction
to be processed in a particular thread by noting an instruction pointer
value. It also identifies a frame of storage locations on which the next
instruction acts. This frame is indicated by a frame pointer value that
points to the beginning of the frame. Multiple thread descriptors may
refer to the same frame.
The data processing system responds to load instructions to retrieve data
from global memory by generating a request message. The load instructions
are executed regardless of the current state of the global memory
location. This request message includes a thread descriptor that need not
be retained by the processing element. The request message is sent to a
memory controller that generates a response message. The response message
includes the data requested and a thread descriptor. The thread descriptor
is stored in the thread descriptor storage for further processing. The
response messages for multiple load requests may be returned to the
processor in an order different from the order in which the corresponding
requests are issued, and multiple requests may be issued before a response
is received.
One of the instructions of the at least one processing element is
preferably the fork instruction. The fork instruction produces two thread
descriptors from a single thread of computation.
A joining instruction is provided to generate a single thread descriptor
from two threads so as to join two threads into a single thread. The
joining instruction specifies a memory location. The first of the two
threads that reaches the join instruction marks the specified location
full and stops computing. The other thread marks that location empty when
it encounters a join instruction and continues computation.
In accordance with another aspect of the invention, a working memory is
provided. This working memory is preferably a cache. It is comprised of
storage locations upon which the at least one processing element operates.
Frames of storage held in local memory are transferred to the working
memory so as to be processed. The size of the frames of storage is not
fixed a priori.
When the processing elements are pipelined, a clocking means is preferably
included to clock thread descriptors out of the thread descriptor storage
into the pipeline. The clock may also be used to clock successive stages
of the pipeline such that each stage of the pipeline may operate on a
separate thread of computation.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates the general structure of the data processing system of
the present invention,
FIG. 2 illustrates the organization of a processing element.
FIG. 3 illustrates the frames held in local memory and the cache for said
frames.
FIG. 4 illustrates how token descriptors are used in instruction execution.
FIG. 5 illustrates the operation of a load instruction.
FIG. 6 illustrates the operation of a load and continue instruction.
FIG. 7 illustrates the operation of a store instruction.
FIG. 8 illustrates the effect of both a fork instruction and a join
instruction.
FIG. 9 illustrates a flow chart of the steps performed in a join
instruction.
FIG. 10 illustrates a sample program using a fork instruction and a join
instruction.
FIG. 11 illustrates a sample data flow diagram.
FIG. 12 illustrates the operation of a start instruction.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
The preferred embodiment of the present invention concerns a multithreaded
data processing system. This data processing system can operate on a
plurality of threads independently. The basic structure of this system is
shown in FIG. 1. As can be seen in that figure, the system employs a
plurality of processing elements 10. As will be described in more detail
below, each processing element acts in parallel with the other processing
elements 10. Global heap memory elements 20 are provided as a global
memory space. The numerous processing elements and heap memory elements
are connected via an innerconnection network 14. The interconnection
network 14 routes messages amongst processing elements 10 and heap memory
elements 20. A memory controller 17 is provided for controlling access to
each heap memory element.
The processing elements 10 are preferably reduced instruction set (RISC)
processors. The use of RISC processors allows for quick operation, as well
as for a simplified hardware architecture. The reduced instruction set of
the present invention, however, is not that which is typically found in
RISC processors, rather it is an extended set of instructions comprised of
the traditional set of instructions plus additional more powerful
instructions. The additional instructions comprise the fork, join and
start instructions discussed below. Further, the load instruction is
different from traditional reduced instruction sets.
FIG. 2 shows the basic structure of a processing element 10. Each
processing element 10 is capable of processing a plurality of threads of
computation independently. It is able to achieve this independence by
using a pipelining structure 36. The pipeline 36 is comprised of four
stages. The first stage is an instruction fetch stage 32. As its name
implies, the basic task of this stage is to retrieve instructions from
code section 26 held in local processing element memory.
The operand fetch stage 30 is the second stage of this pipeline 36. It
fetches operands that are necessary to perform the fetched instruction. To
fetch these operands, it is necessary for the operand fetch stage 30 to
access frames of memory locations 24 where the operands are found. If the
fetched instruction is a load, store or start instruction, the fetched
instruction is processed specially at 38. Load, store and start
instructions will be discussed in more detail below.
Frames 24 serve the role formerly served by register sets in early
architectures. Specifically, operands are stored in the frames of storage
24 for ready access. The frames 24, however, are superior to registers
because they are not limited to a fixed size. The size of a frame can be
adapted to fit the needs of the threads that access it. Further, employing
a frame pointer to specify the frame location provides the potential for
allocation of a seemingly limitless number of frames having varying sizes.
Hence, the number and sizes of frames available is not limited by the
number of register sets.
The total frame memory 24 is likely too large to provide for sufficiently
rapid memory access; thus, it is desirable to employ a cache 3 such as
shown in FIG. 3. The size of the cache 3 may vary but should hold at least
a couple of frames. Frames can be loaded into the cache 3 when needed. The
cache 3 should be situated as shown in FIG. 3 between local frame storage
24 and the pipeline 36.
If the instruction is neither a load, store nor start instruction, then the
processing of the particular instruction continues in the third stage of
the pipeline, the function units stage 28. This stage represents the
traditional arithmetic and logical unit (ALU) of a conventional processor.
Once the desired functions have been performed in the functions unit stage
28, the processing of the instruction passes to the fourth stage of the
pipeline 36, the operand store stage 34. This stage stores in the frames
24 the results of computations performed by the function units 28 when
instructed by the fetched instruction.
Each processing element 10 also includes a token queue 22. The token queue
22 stores thread descriptors (also known as tokens) for threads of
computation to be performed by the processing element 10. When thread
descriptors held in the token queue 22 are placed into the processing
pipeline 36, the next instruction in the thread of computation is
performed. Hence, the processing elements 10 of the preferred embodiment
of the present invention operate by feeding thread descriptors into the
pipeline 36 every clock cycle. Clock cycles are generated by a processing
element clock 101. Independence in processing of threads is achieved by
having different threads of computation being placed into the pipeline 36
on successive clock cycles.
The thread descriptors held in the token queue 22 and processed by the
pipeline 36 uniquely identify each thread of computation. They are
referred to as tokens or continuations. Each token is comprised of two
pointers: an instruction pointer (42B in FIG. 4) and a frame pointer 42A.
The instruction pointer 42B points to a particular instruction held in the
code section 26 of the local memory of the processing element. The frame
pointer 42A, on the other hand, points to the beginning of a particular
frame of memory locations in the frames section 24 of a processor element
local memory.
FIG. 4 shows how the tokens are used by the processing pipeline 36. When
the token 42 enters the pipeline 36, its instruction pointer 42B is used
to locate a particular instruction 44 held in the code section 26 of local
memory. This is performed in the previously described instruction fetch
stage 32 of the processing pipeline 36. The fetched instruction is
subsequently executed by the pipeline 36 as will be described below. As
such, the data processing system of the present invention operates on an
imperative set of instructions. In other words, whether an instruction
executes is determined solely by whether the instruction pointer 42B
points to it. There is no state of an operand which controls execution,
only a value which is operated upon.
Another way of characterizing this aspect of the present invention is to
classify the present invention as a control flow system as opposed to a
data flow system. The control mechanism, not the data, dicates whether
execution of an instruction takes place. The present system, thus, may be
contrasted with data flow systems wherein the state of the data and the
validity of the data is typically taken into account. The execution of
instructions in the present system is dictated by the value of the
instruction pointer. Once an instruction is fetched, execution is
imperative.
Having already located the instruction, the processing pipeline 36 next
seeks to obtain the appropriate operands for that instruction. To locate
the operands, the frame pointer 42A of the token 42 is used to locate a
particular frame 46 held in the frames portion 24 of local memory. As
noted above, the frame pointer 42A points to the beginning of the
specified frame 46. The addresses held in the operand field of the fetched
instruction 44 are used as offsets to locate the memory locations where
the desired operands are held.
For example, for the addition instruction 44 shown in FIG. 4, the a and b
fields of that instruction 44 are used as offsets to locate the two
operands that are to be added. The first operand is held at an address
equal to the address at the beginning of the specified frame 46 plus a.
The second operand to be added is, likewise, located in the same manner.
It is located at the beginning address of the specified frame 46 plus an
offset of b.
The processing pipeline 36 next performs the desired function specified by
the instruction 44. In the example case, the two operands are added. The
addition specified by the instruction of the example case takes place in
the third stage of the pipeline 36, the function units stage 28. Thus, in
the example illustrated in FIG. 4 the function unit stage 28 adds 20 and
30. The resulting sum (i.e. 50) then, must be stored in the appropriate
frame memory location. The operand store stage 34 of the pipeline 36
performs this storage. It locates the appropriate memory location by first
using the frame pointer 42A to point to the beginning of the specified
frame 46 and then adding the offset indicated by the instruction. In the
example of FIG. 4, the value of "50" is stored at location c from the
beginning of the specified frame 46. After the operand is stored, a new
continuation for the same thread is placed in the token queue 22. This
continuation is comprised of the current instruction pointer plus 1 (IP+1)
and the current frame pointer (FP).
As already discussed, it is preferable for the processing elements 10 to
execute a reduced instruction set. This reduced instruction set includes
typical arithmetic and logic operations, such as, addition, subtraction,
multiplication, etc. In this preferred implementation, all arithmetic
logical operations are three address operations. In other words, the
operands specify three addresses. The plus instruction, used as an example
in FIG. 4, falls within this class of instructions. The three addresses
specified by that instruction are the addresses of the two operands to be
added and the address where the sum of these operands is to be stored.
Similarly, an instruction such as "compare s1 s2 d" compares the values at
the addresses specified by the frame pointer plus s1 and the frame pointer
plus s2. It then stores a condition code at the address specified by the
frame pointer plus d.
Conventional control instructions such as jump and conditional jump
instructions are also included within the reduced instruction set of the
processing elements. Control instructions control the order and flow of
program execution. For a jump instruction, the continuation emerging from
the processing pipeline 36 is the instruction pointer value specified as
an operand in the jump instruction and the current frame pointer. For a
conditional jump instruction, a location specified as an operand in the
instruction is tested for some condition, and, depending on the outcome,
one of two possible continuations emerges out of the processing pipeline
36. It contains either the current instruction pointer plus 1 or a new
instruction pointer specified as an operand in the instruction. In either
case, contains the current frame pointer.
The frames are typically stored, as previously noted, in a frames portion
24 of local processing element memory. A difficulty arises, however, when
data values held in a frame need to be shared by more than one processing
element. To overcome this difficulty, the preferred embodiment of the
present invention utilizes the global memory available in the heap memory
elements 20 (FIG. 1). In particular, shared data values are stored in the
global heap memory elements 20. To operate on these shared data values,
the processing elements 10 must first bring the data values into local
frame storage 24. This can be done using a load instruction. Results of
these operations may be returned to the global heap memory elements 20
using a store instruction.
The load and store instructions are the only instructions for moving data
in and out of local memory of processing elements 10. They do not perform
any arithmetic operations. They move data between frames 24 of local
memory and heap memory elements 20.
FIG. 5 shows the load instruction in more detail. For illustrative
purposes, suppose that the processing element 10 is executing a given
thread of computation 48. Included within this thread 48 is a local
instruction to which the instruction pointer (IP) points. When the load
instruction is placed into the processing pipeline 36, the processing
element 10 attempts to execute it. The instruction specified in FIG. 5
tells the processing element to load a value held in a heap memory element
20 location to a memory location held within a local frame 24. In the
operand fetch stage 30 of the processing pipeline 36, the processing
element 10 looks for the heap memory location address by looking at the
memory location specified by the frame pointer plus a. That location
contains a pointer to a specific heap memory element 20 location.
Processing then continues outside the pipeline 36 at the load or store
component 38 of the processing element 10. This component generates a
message 50 that is sent through the network 14 to the heap memory elements
20.
The message 50 is comprised of several fields. Specifically, a first field
tells the memory controller 17 that a read is desired of the heap memory
element 20 location specified in the second field of the message 50. This
second field includes the address taken from the frame as previously
discussed. In the example case of FIG. 5, the pointer was held at frame
pointer plus a. The third field of the message specifies a continuation
which is a thread to be initiated after the read value has been loaded.
The fourth and final field denoted as "d" specifies the offset of the
memory location within the frame 24 where the read value is to be loaded.
The memory controller 17 of the heap memory element responds to this
message by sending a response message to the processing element 10. The
response message 52 is also comprised of several fields. In particular,
the first field specifies that this is a start message. The second field
is the value taken from address [a] of the heap memory to be loaded into
the memory location at frame pointer plus d in the processing element 10.
The third field is the continuation that was passed in the request message
50. The fourth field is the offset field which also was sent in the
request message 50.
When the processing element 10 receives this message 52, it does two
things. First, it loads the value read from heap memory element 20 to the
memory location specified by the frame pointer plus the offset In FIG. 5
the value V is loaded into the memory location pointed to by the frame
pointer plus d. Second, it forwards the thread descriptor specified in the
third field of the response message 52 to the token queue 22.
An interesting feature of the load instruction is that it terminates the
current thread in which it is contained (i.e. no continuation is issued).
A special load and continue instruction is included to allow the thread to
continue operating after the load instruction is performed. The operation
of the load and continue instruction (denoted as loadc) is illustrated in
FIG. 6. It generates the same request message 50 as the load instruction,
and the memory controller 17 responds with the same response message 52.
The loadc instruction differs from the load instruction in two ways.
First, the instruction pointer in the request that is sent to memory is
not the current instruction pointer plus 1 but a different instruction
pointer (designated as IP.sub.v in FIG. 6) that is an operand of the loadc
instruction. Second, instead of issuing no continuation, it issues a
continuation designated as the current instruction pointer plus 1 and the
current frame pointer. As such, the loadc instruction allows a
continuation of the thread such as 48A shown in FIG. 6.
The load and loadc instructions defer the task of retrieving the heap
memory element data to the memory controller 17. By doing this, they
overcome the delay that usually occurs waiting for access to memory. In
typical data processing systems (both sequential and parallel), the
processor must wait during the time period in which memory is accessed.
The waiting time period is usually quite large relative to the time
required to perform other processor operations. This memory latency is
overcome with the load and loadc instructions by not requiring a processor
to idle while memory is being accessed. After deferring the task of
accessing memory, the processor takes another continuation from the token
queue 22 to process so that it does not idle. The load instruction thereby
allows the processor to switch to another thread of computation if it
would idle otherwise. The token queue 22 assures that another thread will
generally be available and thus, facilitates maximum use of the processor
despite memory latency.
There is no requirement of a temporal correlation between the request for a
read from the heap memory elements 20 and a response generated from the
request. Rather, loads can be considered as split-phase transactions
comprised of a request phase and a response phase. Performing the loads in
this manner enables the preferred embodiment of the present invention to
issue multiple load requests before receiving any response to a particular
one of these requests. Moreover, responses can be generated in an order
different from the order in which requests are made. A continuation for
the thread to which the response is targeted is specified in the response
message 52.
A store instruction is similar to a load instruction except that the data
held in the local frame of memory 24 is written into the heap memory
element location 20. FIG. 7 illustrates a sample operation of a store
instruction. The store instruction has two operands. The first operand "a"
specifies an address in local memory where a value to be stored is held,
and the second operand "x" specifies an address in local memory that
contains the heap memory element address where the value is to be stored.
In the example of FIG. 7, the value is stored at the local memory address
specified by the frame pointer plus a. The heap memory location address is
stored at a local memory address specified by the frame pointer plus x.
Once the specified value and heap memory element address are found, the
processing element issues a request message 54 to the memory controller
17.
The request message 54 specifies that a write is requested. It notes the
heap memory element address where write will take place and the value to
be stored there. In response to this message, the memory controller 17
writes the value into the heap memory element address. The processing
element 10, after issuing this message 54, issues a continuation
designated by the instruction pointer plus one and the frame pointer. No
response message is generated.
New control instructions in the present invention are the fork, join and
start instructions which control operation of the processing system. A
fork instruction produces two continuations from a current operating
thread. One continuation is the next instruction in the thread in which
the fork instruction is executed. The other continuation represents an
entirely different thread that operates on the same frame. Hence, if the
boxes shown in FIG. 8 represent instructions associated with certain
threads of computation, when a fork 60 occurs, the continuation is not a
straight path as would be found absent the fork. Instead, two concurrent
thread descriptors are passed to the token queue 22. The first thread
descriptor is associated with A[i] 62 which contains the next instruction
to be executed in the current thread. The other thread descriptor is
associated with B[i] 64 which contains an instruction of an entirely
different thread of computation.
The fork instruction thus provides an ability to initiate a new thread that
can execute concurrently with the current thread. Any synchronization that
needs to be achieved between the threads can be instituted using a join
instruction.
The join instruction (66 in FIG. 8) performs the opposite operation of the
fork instruction. It causes two independently running threads of
computation to produce only a single continuation. In other words, it
takes two threads and produces a single thread from it.
FIG. 9 shows a flow chart of how the join operation joins together two
threads. The syntax of the join operation is "join w", where w specifies a
memory location. The single join instruction must be executed by both
threads in order for a join to occur. The join instruction acts on the
specified memory location w which is marked as either all ones or all
zeroes. When the join instruction is executed, the value at location w is
toggled to the other state of ones or zeroes that is opposite the current
state (Step 68). Thus, for instance, if w were all ones, it would be
switched to all zeroes and vice versa. If a thread is the first thread to
execute the join instruction for memory location w, the memory location w
is marked as all ones. The system then looks to see if the memory location
is marked all ones (Step 70). If the memory location is already all ones,
the thread that caused it to be marked all ones issues no continuation
(Step 74).
On the other hand, if the thread is not the first thread to execute the
join instruction but is instead the second thread to execute the join
instruction, then, the memory location w is marked all zeroes, and a
continuation issues so that the processing of a new thread is initiated
beginning at the next instruction (Step 76). Hence, the rule of the join
operation is that both threads terminate and a new thread is initiated
after the arrival of the second one.
FIG. 10 shows an example of a program where fork instructions and join
instructions are used. In particular, presume the instruction pointer
starts off with a value of IP. At instruction 90, a fork to IP.sub.t is
performed. Thus, expressions e1 and e2 will subsequently be performed
independently in two separate threads that share the same frame pointer.
The thread having an instruction pointer value of IP.sub.t next executes a
jump to the instruction specified by IP.sub.join. The second thread
accordingly next executes the instruction specified by IP.sub.join.
The other thread which executed e1 92 proceeds to perform the instruction
at IP.sub.join. This instruction 98 is a join instruction specifying
memory location r.sub.j. It does not matter which thread executes the join
first. The one that executes it first will issue no continuation. The
other will, in contrast, issue a continuation. The thread that issues a
continuation specifies e3 as the next instruction 100 to be executed. The
join assures that both e1 and e2 have been performed by the two threads
respectively prior to e3 being performed.
The start instruction is used to terminate a thread of computation in the
current frame and to initiate a new thread of computation in a different
frame. A start instruction has three operands v, c and d where v, c and d
specify memory locations in the current frame. The location designated by
c contains a new continuation, that is, a new frame pointer and a new
instruction pointer. The location designated by d specifies an offset in
the frame designated by the new frame pointer. The new continuation may be
on a different data processing element or it may be on the same one.
FIG. 12 shows the execution of the start instruction in more detail. For
illustrative purposes, suppose that a start instruction "start v c d" is
placed in the processing pipeline 36 of data processing element A 200. In
the operand fetch stage 38, the three operands [v], [c] and [d] are read
from the current frame. Processing then continues in the load/store
component 38. Here, a message 202 is generated and sent to data processing
element B 201. The format of the message 202 is "<start, [v], [c], [d]>".
The format of the message is identical to the format of the previously
described response messages 52 from a memory controller to a data
processing element. Data processing element A 200 may continue to perform
instructions from other threads taken from its thread descriptor storage
22.
When the message arrives at the store component 40 of data processing
element B 201, it is treated in exactly the same way as the previously
described response messages 52 from a memory controller. In data
processing element B 201, the value [v] is stored at offset [d] in the
frame specified in the continuation [c], and the continuation [c] is
placed in the token queue 22. As usual, data processing element B 201 may
then execute instructions from the thread designated by the continuation
[c].
The start instruction provides a way to simultaneously do two things:
deliver a data value from one data processing element (A) to another (B),
and initiate a thread of computation in B.
The present invention can also be used in a producer/consumer environment.
In that case it may be desirable to implement the heap memory elements as
I-structures. The I-structure organization of the heap memory elements 20
prevents potential conflicts for memory locations that could occur
otherwise. When the heap memory elements 20 are organized as such, two new
types of messages are desirable for t | | |