|
|
|
| United States Patent | 6339819 |
| Link to this page | http://www.wikipatents.com/6339819.html |
| Inventor(s) | Huppenthal; Jon M. (Colorado Springs, CO);
Leskar; Paul A. (Colorado Springs, CO) |
| Abstract | An enhanced memory algorithmic processor ("MAP") architecture for
multiprocessor computer systems comprises an assembly that may comprise,
for example, field programmable gate arrays ("FPGAs") functioning as the
memory algorithmic processors. The MAP elements may further include an
operand storage, intelligent address generation, on board function
libraries, result storage and multiple input/output ("I/O") ports. The MAP
elements are intended to augment, not necessarily replace, the high
performance microprocessors in the system and, in a particular embodiment
of the present invention, they may be connected through the memory
subsystem of the computer system resulting in it being very tightly
coupled to the system as well as being globally accessible from any
processor in a multiprocessor computer system. |
|
|
|
Title Information  |
|
|
|
|
|
Drawing from US Patent 6339819 |
|
|
Multiprocessor with each processor element accessing operands in loaded
input buffer and forwarding results to FIFO output buffer |
|
|
|
|
|
| Publication Date |
January 15, 2002 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| Parent Case |
CROSS REFERENCE TO RELATED PATENT APPLICATIONS
The present invention is a continuation-in-part application of U.S. patent
application Ser. No. 09/481,902 filed Jan. 12, 2000, now U.S. Pat. No.
6,247,110, which is a continuation of U.S. patent application Ser. No.
08/992,763 filed Dec. 17, 1997, now U.S. Pat. No. 6,076,152, for:
"Multiprocessor Computer Architecture Incorporating a Plurality of Memory
Algorithm Processors in the Memory Subsystem", assigned to SRC Computers,
Inc., Colorado Springs, Colo. assignee of the present invention, the
disclosures of which are herein specifically incorporated by this
reference. |
|
|
|
|
|
|
|
|
|
|
|
|
|
Title Information  |
|
|
Claims  |
|
|
What is claimed is:
1. A computer system including at least one data processor for operating on
user data in accordance with program instructions, said computer further
including at least one memory array coupled to said at least one data
processor by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array, wherein
said operand is transferred to said at least one memory algorithmic
processor element after loading of said input data buffer; and
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus.
2. The computer system of claim 1 further comprising:
a chain port coupled to said user array for receiving input from an
additional one of said at least one memory algorithmic processors in said
computer system.
3. The computer system of claim 2 further comprising:
said chain port coupled to said output FIFO for providing output to another
one of said memory algorithmic processors in said computer system.
4. The computer system of claim 1 wherein said user array comprises at
least one FPGA.
5. The computer system of claim 4 wherein said user array comprises four
FPGAs.
6. The computer system of claim 1 further comprising:
a control block coupled to said input data buffer, said user array and said
output FIFO for operationally controlling a function of said at least one
memory algorithmic processor element.
7. The computer system of claim 1, wherein during said loading operands are
received out-of-order.
8. The computer system of claim 7, wherein said loading comprises cache
line transfers of said received operands.
9. A computer system including at least one data processor for operating on
user data in accordance with program instructions, said computer further
including at least one memory array coupled to said at least one data
processor by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array;
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus; and
address generator operative for supporting a data gather function from said
input data buffer or said at least one memory array, wherein said address
generator is configured to receive address bits from said at least one
memory algorithmic processor element and combine said address bits with
additional control signals prior to issuance to said input data buffer.
10. The computer system of claim 7, further comprising:
a chain port coupled to said user array for receiving input from an
additional one of said at least one memory algorithmic processors in said
computer system.
11. The computer system of claim 10 further comprising:
said chain port coupled to said output FIFO for providing output to another
one of said memory algorithmic processors in said computer system.
12. The computer system of claim 9 wherein said user array comprises at
least one FPGA.
13. The computer system of claim 12 wherein said user array comprises four
FPGAs.
14. The computer system of claim 9 further comprising:
a control block coupled to said input data buffer, said user array and said
output FIFO for operationally controlling a function of said at least one
memory algorithmic processor element.
15. The computer system of claim 9, wherein said address generator is
configured to receive a start command including a start address, a stop
address, and a stride and based on said start command, to access said
input data buffer beginning at said start address and continuing at
increments of said stride until said stop address is reached.
16. A computer system including at least one data processor for operating
on user data in accordance with program instructions, said computer
further including at least one memory array coupled to said at least one
data processor by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array;
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus; and
a memory device associated with said at least one memory algorithmic
processor element for at least storing said algorithm.
17. The computer system of claim 15 further comprising:
a chain port coupled to said user array for receiving input from an
additional one of said at least one memory algorithmic processors in said
computer system.
18. The computer system of claim 17 further comprising:
said chain port coupled to said output FIFO for providing output to another
one of said memory algorithmic processors in said computer system.
19. The computer system of claim 15 wherein said user array comprises at
least one FPGA.
20. The computer system of claim 19 wherein said user array comprises four
FPGAs.
21. The computer system of claim 15 further comprising:
a control block coupled to said input data buffer, said user array and said
output FIFO for operationally controlling a function of said at least one
memory algorithmic processor element.
22. The computer system of claim 15 wherein said memory device is
selectively reprogrammable by said at least one memory algorithmic
processor element for storing another algorithm.
23. A computer system including at least one data processor for operating
on user data in accordance with program instructions, said computer
further including at least one memory array coupled to said at least one
data processor by read and write buses, said computer system comprising:
a plurality of memory algorithmic processor elements, each including a user
array for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array;
an input data buffer associated with each of said plurality of memory
algorithmic processor elements for coupling said write bus to said user
array;
an output FIFO associated with each of said plurality of memory algorithmic
processor elements for coupling said user array to said read bus; and
a chain port coupled to each user array for receiving input from an
adjacent one of said memory algorithmic processor elements and coupled to
said output FIFO for passing output to another adjacent one of said memory
algorithmic processor elements, wherein said output consists of operands.
24. The computer system of claim 23 wherein said user array comprises at
least one FPGA.
25. The computer system of claim 24 wherein said user array comprises four
FPGAs.
26. The computer system of claim 23 further comprising:
a control block coupled to said input data buffer, said user array and said
output FIFO for operationally controlling a function of each of said
plurality of memory algorithmic processor elements.
27. The computer system of claim 23 wherein said chain port allows at least
a subset of said plurality of memory algorithmic processor elements to
perform said at least one algorithm.
28. The computer system of claim 23 wherein said chain port allows for
dynamic configuration of said plurality of memory algorithmic processor
elements into at least two subsets thereof to separately perform differing
ones of said at least one algorithms.
29. A computer system including at least one data processor for operating
on user data in accordance with program instructions, said computer
further including at least one memory array coupled to said at least one
data processor by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array; and
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus, said
output FIFO operative to store output data to allow said memory
algorithmic processor element to continue operation if one of said at
least one data processors intended to read said output data is currently
unavailable to perform said read operation.
30. The computer system of claim 29 further comprising:
a chain port coupled to said user array for receiving input from an
additional one of said at least one memory algorithmic processors in said
computer system.
31. The computer system of claim 30 further comprising:
said chain port coupled to said output FIFO for providing said output data
to another one of said memory algorithmic processors in said computer
system.
32. A computer system including at least one data processor for operating
on user data in accordance with program instructions, said computer
further including at least one memory array coupled to said at least one
data processor by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array, said
input data buffer operative to store one or more of said operands for
access by an additional one of said at least one memory algorithmic
processor elements; and
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus.
33. The computer system of claim 32 further comprising:
a chain port coupled to said user array for receiving input from said
additional one of said at least one memory algorithmic processors in said
computer system.
34. The computer system of claim 33 further comprising:
said chain port coupled to said output FIFO for providing output to another
one of said memory algorithmic processors in said computer system, wherein
said input and said output passed through said chain port consists of
operands.
35. A computer system including at least one data processor for operating
on user data in accordance with program instructions, said computer
further including at least one memory array coupled to said at least one
data processor by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array;
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus;
a chain port coupled to said user array for receiving input from an
additional one of said at least one memory algorithmic processors in said
computer system, said chain port being also coupled to said FIFO for
providing output to another one of said memory algorithmic processors in
said computer system; and
a dedicated port coupled between said user array and said input data buffer
to enable said user array to receive said operand from said chain port or
said input data buffer, wherein said dedicated port enables said user
array to receive said operand through said chain port while substantially
concurrently accessing data stored in said input data buffer.
36. A computer system including a plurality of data processors for
operating on user data in accordance with program instructions, said
computer further including at least one memory array being selectively
coupled through a switching element to said plurality of data processors
by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array, said at least one
memory algorithmic processor element being globally accessible by each of
said plurality of data processors in said computer system;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array; and
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus.
37. The computer system of claim 36 further comprising:
a chain port coupled to said user array for receiving input from an
additional one of said at least one memory algorithmic processors in said
computer system.
38. The computer system of claim 37 further comprising:
said chain port coupled to said output FIFO for providing output to another
one of said memory algorithmic processors in said computer system.
39. A computer system including at least one data processor for operating
on user data in accordance with program instructions, said computer
further including at least one memory array coupled to said at least one
data processor by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array, said memory
algorithmic processor element further comprising a configuration memory
for storing information controlling an operation of said memory
algorithmic processor element, wherein said configuration memory is
capable of being updated during operation of said computer system to
change said operation of said memory algorithmic processor element;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array; and
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus.
40. The computer system of claim 39 further comprising:
a chain port coupled to said user array for receiving input from an
additional one of said at least one memory algorithmic processors in said
computer system.
41. The computer system of claim 40 further comprising:
said chain port coupled to said output FIFO for providing output to another
one of said memory algorithmic processors in said computer system.
42. A computer system including at least one data processor for operating
on user data in accordance with program instructions, said computer
further including at least one memory array coupled to said at least one
data processor by read and write buses, said computer system comprising:
at least one memory algorithmic processor element including a user array
for performing at least one algorithm on an operand and an address
generator for accessing said at least one memory array, said at least one
memory algorithmic processor being coupled to said at least one memory
array to provide direct memory access to commands issued by said at least
one data processor;
an input data buffer associated with said at least one memory algorithmic
processor element for coupling said write bus to said user array; and
an output FIFO associated with said at least one memory algorithmic
processor element for coupling said user array to said read bus.
43. The computer system of claim 42 further comprising:
a chain port coupled to said user array for receiving input from an
additional one of said at least one memory algorithmic processors in said
computer system.
44. The computer system of claim 43 further comprising:
said chain port coupled to said output FIFO for providing output to another
one of said memory algorithmic processors in said computer system.
45. The computer system of claim 42 wherein said user array comprises at
least one FPGA.
46. The computer system of claim 45 wherein said user array comprises four
FPGAs.
47. The computer system of claim 42 further comprising:
a control block coupled to said input data buffer, said user array and said
FIFO for operationally controlling a function of said at least one memory
algorithmic processor element. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
The present invention relates, in general, to the field of computer
architectures incorporating multiple processing elements. More
particularly, the present invention relates to a multiprocessor computer
architecture incorporating a number of memory algorithmic processors
("MAP") in the memory subsystem or closely coupled to the processing
elements to significantly enhance overall system processing speed.
As commodity microprocessors increase in capability there is an ever
increasing push to use them in high performance multiprocessor systems
capable of performing trillions of calculations per second at
significantly lower cost than those made from custom counterparts.
However, many of these processors lack specific features common to systems
in this category that employ much more expensive custom processors. One
such feature is the ability to perform vector processing.
In this form of processing, a data register or buffer is filled with
operands forming what is called a vector. All of these operands are then
passed one after the other through a functional unit capable of performing
operations such as multiplication. This functional unit will output one
result every clock cycle. This type of processing does require that the
same operation be performed on all operands in the input vector and it is,
therefore, widely used in that it exhibits much higher processing rates
than the traditional scalar method of computation used in most
microprocessors.
Nevertheless, neither vector nor scalar processors perform very well when
required to perform bit manipulation as is required, for example, in
matrix arithmetic. One such function is a bit matrix multiply operation in
which two matrices of different sizes are multiplied together to form a
third matrix. Another shortfall of both vector and scalar processing is
their inability to quickly perform pattern searches such as those used in
a variety of pattern recognition programs.
A solution to all of these deficiencies can be found by building a high
performance computer which contains numbers of commodity microprocessors
to reduce the system cost together with MAP elements developed by SRC
Computers, Inc., assignee of the present invention, to provide the
deficient functions at very low cost. The MAP architecture and specific
features thereof is disclosed in the aforementioned patent applications,
the disclosures of which are herein specifically incorporated by this
reference.
SUMMARY OF THE INVENTION
The enhanced memory algorithmic processor architecture for multiprocessor
computer systems of the present invention is an assembly that not only
contains, for example, field programmable gate arrays functioning as the
memory algorithmic processors, but also an operand storage, intelligent
address generation, on board function libraries, result storage and
multiple I/O ports. Like the original MAP architecture disclosed in the
aforementioned patent applications, this architecture differs from other
so called "reconfigurable" computers in many ways.
First, its function is intended to be altered every few seconds
distinguishing itself from other systems with very long reconfiguration
times primarily intended for a single function. Secondly, it contains
dedicated hardware to provide for large data set operand storage (on the
order of 16 Mbytes or more) allowing the MAP element to function
autonomously from its host system once operands are loaded. Thirdly, it
contains dedicated data ports to allow, but not require, multiple MAP
elements to be chained together to perform very large operations. As
currently contemplated, it is intended that typically 32 to 512 or more
MAP sections can be connected in a single system.
Further, the MAP element is intended to augment, not replace, the high
performance microprocessors in the system. As such, in a particular
embodiment of the present invention, it may be connected through the
memory subsystem of the computer system resulting in it being very tightly
coupled to the system as well as being globally accessible from any
processor in the system. This technique was developed by SRC Computers,
Inc. and distinguishes the MAP architecture from all other so called
"attached array processor" systems that may exist today. While such
"attached array processor" systems may bear some superficial similarities
to MAP based systems, they are entirely separate units connected to the
host computer through relatively slow interconnects resulting in lost
system performance.
The MAP architecture developed by SRC Computers, Inc. as defined in the
aforementioned patent applications overcomes many of the limitations of
such "attached array processor" systems. Because of the particular
limitations in the exemplary architecture disclosed therein surrounding
the attachment of input storage and chaining capabilities, certain vector
processing functions may not have been optimally implemented unlike
relatively smaller algorithms.
Through the addition of these and other features to the MAP architecture, a
much more powerful multiprocessor computer system is provided. Moreover,
while, as originally disclosed, another feature of the MAP architecture
was its ability to perform direct memory access ("DMA") into the common
the memory of the system, enhancements disclosed herein have expanded the
potential utilization of this feature.
Particularly disclosed herein is a Memory Algorithmic Processor ("MAP")
assembly (or element) comprising reconfigurable field programmable gate
array ("FPGA") circuitry, an intelligent address generator, input data
buffers, output first-in, first-out ("FIFO") devices and ports to allow
connection to a memory array and chaining of multiple MAP assemblies for
the purpose of augmenting the capability of a microprocessor in a high
performance computer.
Further disclosed herein is a MAP assembly comprising an intelligent
address generator capable of supporting a data gather function from its
associated input buffer or common memory. The MAP assembly may also
comprise circuitry to allow the reconfigurable elements to reprogram their
on-board configuration read only memory ("ROM") devices to cause
alterations in the functionality of the reconfigurable circuitry.
Still further disclosed herein is a MAP assembly comprising dedicated input
and output ports for the purpose of allowing an infinite number of MAP
elements to be chained together to accomplish a single function. The MAP
assembly may also incorporate provisions to create a single MAP chain or
multiple independent MAP chains automatically based on the contents of the
reconfigurable circuitry.
Further disclosed herein is a MAP assembly comprising output FIFOs for the
purpose of holding output data and allowing the MAP element to not stall
in the event the processor reading these results is delayed due to outside
factors such as workload or crossbar switch conflicts. The MAP assembly
may further comprise relatively large dedicated input storage buffers to
allow for optimization of operand transfer as well as allow multiple
accesses to an operand without requiring external processor intervention.
Still further disclosed herein is a MAP assembly comprising a dedicated
port for connection to an input buffer so that the MAP element can
simultaneously receive operands via the chained input (chain) port and the
input buffer. This allows the MAP element to perform mathematical
processing at the maximum possible rate while also allowing the MAP
element to accept operands via the chain port while accessing reference
data in the input buffer (such as reciprocal look up tables) to allow the
MAP element to perform operations such as division at the fastest possible
rate.
Also further disclosed herein is a MAP assembly which may comprise
connections to the memory subsystem of a high performance computer for the
purpose of providing global access to it from all processors in a
multiprocessor high performance computer system. The MAP assembly
incorporates the capability to update multiple on board function ROMs
under program control while in the system and may also include connections
to the memory subsystem of a high performance computer utilizing DMA to
accept commands from a microprocessor.
BRIEF DESCRIPTION OF THE DRAWINGS
The aforementioned and other features and objects of the present invention
and the manner of attaining them will become more apparent and the
invention itself will be best understood by reference to the following
description of a preferred embodiment taken in conjunction with the
accompanying drawings, wherein:
FIG. 1 is a simplified, high level, functional block diagram of a
multiprocessor computer architecture employing memory algorithmic
processors ("MAP") in accordance with the disclosure of the aforementioned
patent applications in an alternative embodiment wherein direct memory
access ("DMA") techniques may be utilized to send commands to the MAP
elements in addition to data;
FIG. 2 is a simplified logical block diagram of a possible computer
application program decomposition sequence for use in conjunction with a
multiprocessor computer architecture utilizing a number of MAP elements
located, for example, in the computer system memory space, in accordance
with a particular embodiment of the present invention;
FIG. 3 is a more detailed functional block diagram of an exemplary
individual one of the MAP elements of the preceding figures and
illustrating the bank control logic, memory array and MAP assembly
thereof;
FIG. 4 is a more detailed functional block diagram of the control block of
the MAP assembly of the preceding illustration illustrating its
interconnection to the user FPGA thereof in a particular embodiment;
FIG. 5 is a functional block diagram of an alternative embodiment of the
present invention wherein individual MAP elements are closely associated
with individual processor boards and each of the MAP elements comprises
independent chain ports for coupling the MAP elements directly to each
other;
FIG. 6 is a functional block diagram of an individual MAP element wherein
each comprises on board memory and a control block providing common memory
DMA capabilities;
FIG. 7 is an additional functional block diagram of an individual MAP
element illustrating the on board memory function as an input buffer and
output FIFO portions thereof;
FIG. 8 is a more detailed functional block diagram of an individual MAP
element as illustrated in FIGS. 6 and 7;
FIG. 9 is a user array interconnect diagram illustrating, for example, four
user FPGAs interconnected through horizontal, vertical and diagonal buses
to allow for expansion in designs that exceed the capacity of a single
FPGA;
FIG. 10 is a functional block diagram of another alternative embodiment of
the present invention wherein individual MAP elements are closely
associated with individual memory arrays and each of the MAP elements
comprises independent chain ports for coupling the MAP elements directly
to each other; and
FIGS. 11A and 11B are timing diagrams respectively input and output timing
in relationship to the system clock ("Sysclk") signal.
DESCRIPTION OF A PREFERRED EMBODIMENT
With reference now to FIG. 1, a multiprocessor computer 10 architecture in
accordance with one embodiment of the present invention is shown. The
multiprocessor computer 10 incorporates N processors 120.sub.0 through
12.sub.N which are bi-directionally coupled to a memory interconnect
fabric 14. The memory interconnect fabric 14 is then also coupled to M
memory banks comprising memory bank subsystems 16.sub.0 (Bank 0) through
16M (Bank M). N number of memory algorithmic processors ("MAP") 112.sub.0
through 112.sub.N are also coupled to the memory interconnect fabric 14 as
will be more fully described hereinafter.
With reference now to FIG. 2, a representative application program
decomposition for a multiprocessor computer architecture 100 incorporating
a plurality of memory algorithm processors in accordance with the present
invention is shown. The computer architecture 100 is operative in response
to user instructions and data which, in a coarse grained portion of the
decomposition, are selectively directed to one of (for purposes of example
only) four parallel regions 102.sub.1 through 102.sub.4 inclusive. The
instructions and data output from each of the parallel regions 102.sub.1
through 102.sub.4 are respectively input to parallel regions segregated
into data areas 104.sub.1 through 104.sub.4 and instruction areas
106.sub.1 through 106.sub.4. Data maintained in the data areas 104.sub.1
through 104.sub.4 and instructions maintained in the instruction areas
106.sub.1 through 106.sub.4 are then supplied to, for example,
corresponding pairs of processors 108.sub.1, 108.sub.2 (P1 and P2);
108.sub.3, 108.sub.4 (P3 and P4); 108.sub.5, 108.sub.6 (P5 and P6); and
108.sub.7, 108.sub.8 (P7 and P8) as shown. At this point, the medium
grained decomposition of the instructions and data has been accomplished.
A fine grained decomposition, or parallelism, is effectuated by a further
algorithmic decomposition wherein the output of each of the processors
108.sub.1 through 108.sub.8, is broken up, for example, into a number of
fundamental algorithms 1101.sub.1A, 110.sub.1B, 110.sub.2A, 110.sub.2B
through 110.sub.8B as shown. Each of the algorithms is then supplied to a
corresponding one of the MAP elements 112.sub.1A, 112.sub.1B, 112.sub.2A,
112.sub.2B, through 112.sub.8B which may be located in the memory space of
the computer architecture 100 for execution therein as will be more fully
described hereinafter.
With reference additionally now to FIG. 3, an exemplary implementation of a
memory bank 120 in a MAP system computer architecture 100 of the present
invention is shown for a representative one of the MAP elements 112
illustrated in the preceding figure. Each memory bank 120 includes a bank
control logic block 122 bi-directionally coupled to the computer system
trunk lines, for example, a 72 line bus 124. The bank control logic block
122 is coupled to a bi-directional data bus 126 (for example 256 lines)
and supplies addresses on an address bus 128 (for example 17 lines) for
accessing data at specified locations within a memory array 130.
The data bus 126 and address bus 128 are also coupled to a MAP element 112.
The MAP element 112 comprises a control block 132 coupled to the address
bus 128. The control block 132 is also bi-directionally coupled to a user
field programmable gate array ("FPGA") 134 by means of a number of signal
lines 136. The user FPGA 134 is coupled directly to the data bus 126. In a
particular embodiment, the FPGA 134 may be provided as a Lucent
Technologies OR3T80 device.
The computer architecture 100 comprises a multiprocessor system employing
uniform memory access across common shared memory with one or more MAP
elements 112 which may be located in the memory subsystem, or memory
space. As previously described, each MAP element 112 contains at least one
relatively large FPGA 134 that is used as a reconfigurable functional
unit. In addition, a control block 132 and a preprogrammed or dynamically
programmable configuration ROM (as will be more fully described
hereinafter) contains the information needed by the reconfigurable MAP
element 112 to enable it to perform a specific algorithm. It is also
possible for the user to directly download a new configuration into the
FPGA 134 under program control, although in some instances this may
consume a number of memory accesses and might result in an overall
decrease in system performance if the algorithm was short-lived.
FPGAs have particular advantages in the application shown for several
reasons. First, commercially available FPGAs now contain sufficient
internal logic cells to perform meaningful computational functions.
Secondly, they can operate at speeds comparable to microprocessors, which
eliminates the need for speed matching buffers. Still further, the
internal programmable routing resources of FPGAs are now extensive enough
that meaningful algorithms can now be programmed without the need to
reassign the locations of the input/output ("1/0") pins.
By, for example, placing the MAP element 112 in the memory subsystem or
memory space, it can be readily accessed through the use of memory read
and write commands, which allows the use of a variety of standard
operating systems. In contrast, other conventional implementations may
propose placement of any reconfigurable logic in or near the processor,
however these conventional implementations are generally much less
effective in a multiprocessor environment because, unlike the system and
method of the present invention, only one processor has rapid access to
it. Consequently, reconfigurable logic must be placed by every processor
in a multiprocessor system, which increases the overall system cost. In
addition, MAP element 112 can access the memory array 130 itself, referred
to as Direct Memory Access ("DMA"), allowing it to execute tasks
independently and asynchronously of the processor. In comparison, were it
placed near the processor, it would have to compete with the processors
for system routing resources in order to access memory, which
deleteriously impacts processor performance. Because MAP element 112 has
DMA capability, (allowing it to write to memory), and because it receives
its operands via writes to memory, it is possible to allow a MAP element
112 to feed results to another MAP element 112. This is a very powerful
feature that allows for very extensive pipelining and parallelizing of
large tasks, which permits them to complete faster.
Many of the algorithms that may be implemented will receive an operand and
require many clock cycles to produce a result. One such example may be a
multiplication that takes 64 clock cycles. This same multiplication may
also need to be performed on thousands of operands. In this situation, the
incoming operands would be presented sequentially so that while the first
operand requires 64 clock cycles to produce results at the output, the
second operand, arriving one clock cycle later at the input, will show
results one clock cycle later at the output. Thus, after an initial delay
of 64 clock cycles, new output data will appear on every consecutive clock
cycle until the results of the last operand appears. This is called
"pipelining".
In a multiprocessor system, it is quite common for the operating system to
stop a processor in the middle of a task, reassign it to a higher priority
task, and then return it, or another, to complete the initial task. When
this is combined with a pipelined algorithm, a problem arises (if the
processor stops issuing operands in the middle of a list and stops
accepting results) with respect to operands already issued but not yet
through the pipeline. To handle this issue, a solution involving the
combination of software and hardware is disclosed herein.
To make use of any type of conventional reconfigurable hardware, the
programmer could embed the necessary commands in his application program
code. The drawback to this approach is that a program would then have to
be tailored to be specific to the MAP hardware. The system of the present
invention eliminates this problem. Multiprocessor computers often use
software called parallelizers. The purpose of this software is to analyze
the user's application code and determine how best to split it up among
the processors. The present invention provides significant advantages over
a conventional parallelizer and enables it to recognize portions of the
user code that represent algorithms that exist in MAP elements 112 for
that system and to then treat the MAP element 112 as another computing
element. The parallelizer then automatically generates the necessary code
to utilize the MAP element 112. This allows the user to write the
algorithm directly in his code, allowing it to be more portable and
reducing the knowledge of the system hardware that he has to have to
utilize the MAP element 112.
With reference additionally now to FIG. 4, a block diagram of the MAP
control block 132 is shown in greater detail. The control block 132 is
coupled to receive a number of command bits (for example, 17) from the
address bus 128 at a command decoder 150. The command decoder 150 then
supplies a number of register control bits to a group of status registers
iS2 on an eight bit bus 154. The command decoder 150 also supplies a
single bit last operand flag on line 156 to a pipeline counter 158. The
pipeline counter 158 supplies an eight bit output to an equality
comparitor 160 on bus 162. The equality comparitor 160 also receives an
eight bit signal from the FPGA 134 on bus 136 indicative of the pipeline
depth. When the equality comparitor 160 determines that the pipeline is
empty, it provides a single bit pipeline empty flag on line 164 for input
to the status registers 152. The status registers 152 are also coupled to
receive an eight bit status signal from the FPGA 134 on bus 136 and it
produces a sixty four bit status word output on bus 166 in response to the
signals on bus 136, 154 and line 164.
The command decoder 150 also supplies a five bit control signal on line 168
to a configuration multiplexer ("MUX") 170 as shown. The configuration MUX
170 receives a single bit output of a 256 bit parallel-serial converter
172 on line 176. The inputs of the 256 bit parallel-to-serial converter
172 are coupled to a 256 bit user configuration pattern bus 174. The
configuration MUX 170 also receives sixteen single bit inputs from the
configuration ROMs (illustrated as ROM 182) on bus 178 and provides a
single bit configuration file signal on line 180 to the user FPGA 134 as
selected by the control signals from the command decoder 150 on the bus
168.
In operation, when a processor 108 is halted by the operating system, the
operating system will issue a last operand command to the MAP element 112
through the use of command bits embedded in the address field on bus 128.
This command is recognized by the command decoder 150 of the control block
132 and it initiates a hardware pipeline counter 158. When the algorithm
was initially loaded into the FPGA 134, several output bits connected to
the control block 132 were configured to display a binary representation
of the number of clock cycles required to get through its pipeline (i.e.
pipeline "depth") on bus 136 input to the equality comparitor 160. After
receiving the last operand command, the pipeline counter 158 in the
control block 132 counts clock cycles until its count equals the pipeline
depth for that particular, algorithm. At that point, the equality
comparitor 160 in the control block 132 de-asserts a busy bit on line 164
in an internal group of status registers 152. After issuing the last
operand signal, the processor 108 will repeatedly read the status
registers 152 and accept any output data on bus 166. When the busy flag is
de-asserted, the task can be stopped and the MAP element 112 utilized for
a different task. It should be noted that it is also possible to leave the
MAP element 112 configured, transfer the program to a different processor
108 and restart the task where it left off.
In order to evaluate the effectiveness of the use of the MAP element 112 in
a given application, some form of feedback to the use is required.
Therefore, the MAP element 112 may be equipped with internal registers in
the control block 132 that allow it to monitor efficiency related factors
such as the number of input operands versus output data, the number of
idle cycles over time and the number of system monitor interrupts received
over time. One of the advantages that the MAP element 112 has is that
because of its reconfigurable nature, the actual function and type of
function that are monitored can also change as the algorithm changes. This
provides the user with an almost infinite number of possible monitored
factors without having to monitor all factors all of the time.
With reference additionally now to FIG. 5, a functional block diagram of a
portion of an alternative embodiment of a computer system 20 in accordance
with the of the present invention is shown. In the computer system 20
illustrated, individual MAP elements 112.sub.A, 112.sub.B etc. are each
closely associated with individual processor boards 22.sub.A, 22.sub.B
respectively. As depicted, each of the MAP elements 112 comprises
independent chain ports 24 for coupling the MAP elements 112 directly to
each other.
Individual ones of the MAP elements 112 are coupled between the processor
board 22 write trunk 26 and read trunk 28 of each processor board 22 in
addition to their coupling to each other by means of the chain ports 24. A
switch couples the write trunk 26 and read trunk 28 of any given processor
board to any other memory subsystem bank 16.sub.A, 16.sub.B etc. As
generally illustrated, each of the memory subsystem banks 16 includes a
control block 122 and one or more memory arrays 130.
With reference additionally now to FIG. 6, a functional block diagram of an
individual MAP element 112 is shown wherein each MAP element 112 comprises
an on board memory 40 and a control block 46 providing common memory DMA
capabilities. Briefly, the write trunk 26 and read trunk 28 are coupled to
the control block 46 from the common memory switch which provides
addresses to the memory 40 and receives addresses from the user array 42
on address lines 48. Data supplied on the write trunk | | |