WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Multiprocessor with each processor element accessing operands in loaded input buffer and forwarding results to FIFO output buffer    
United States Patent6339819   
Link to this pagehttp://www.wikipatents.com/6339819.html
Inventor(s)Huppenthal; Jon M. (Colorado Springs, CO); Leskar; Paul A. (Colorado Springs, CO)
AbstractAn enhanced memory algorithmic processor ("MAP") architecture for multiprocessor computer systems comprises an assembly that may comprise, for example, field programmable gate arrays ("FPGAs") functioning as the memory algorithmic processors. The MAP elements may further include an operand storage, intelligent address generation, on board function libraries, result storage and multiple input/output ("I/O") ports. The MAP elements are intended to augment, not necessarily replace, the high performance microprocessors in the system and, in a particular embodiment of the present invention, they may be connected through the memory subsystem of the computer system resulting in it being very tightly coupled to the system as well as being globally accessible from any processor in a multiprocessor computer system.



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 6339819
Multiprocessor with each processor element accessing operands in loaded

     input buffer and forwarding results to FIFO output buffer - US Patent 6339819 Drawing
Multiprocessor with each processor element accessing operands in loaded input buffer and forwarding results to FIFO output buffer
Inventor     Huppenthal; Jon M. (Colorado Springs, CO); Leskar; Paul A. (Colorado Springs, CO)
Owner/Assignee     SRC Computers, Inc. (Colorado Springs, CO)
Patent assignment
All assignments
Publication Date     January 15, 2002
Application Number     09/563,561
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     May 3, 2000
US Classification     712/16 326/39 326/41 712/37
Int'l Classification     G06F 015/16
Examiner     Kim; Kenneth S.
Assistant Examiner    
Attorney/Law Firm     Kubida; William J. Lembke; Kent A. , Hogan & Hartson LLP
Address
Parent Case     CROSS REFERENCE TO RELATED PATENT APPLICATIONS The present invention is a continuation-in-part application of U.S. patent application Ser. No. 09/481,902 filed Jan. 12, 2000, now U.S. Pat. No. 6,247,110, which is a continuation of U.S. patent application Ser. No. 08/992,763 filed Dec. 17, 1997, now U.S. Pat. No. 6,076,152, for: "Multiprocessor Computer Architecture Incorporating a Plurality of Memory Algorithm Processors in the Memory Subsystem", assigned to SRC Computers, Inc., Colorado Springs, Colo. assignee of the present invention, the disclosures of which are herein specifically incorporated by this reference.
Priority Data    
USPTO Field of Search     326/39 326/41 712/16 712/37
Patent Tags     multiprocessor each processor element accessing operands loaded input buffer forwarding results fifo output buffer
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
6023755
Casselman
712/37
Feb,2000

[0 after 0 votes]
5892962
Cloutier

Apr,1999

[0 after 0 votes]
5737766
Tan

Apr,1998

[0 after 0 votes]
5570040
Lytle
326/41
Oct,1996

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array, wherein said operand is transferred to said at least one memory algorithmic processor element after loading of said input data buffer; and

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus.

2. The computer system of claim 1 further comprising:

a chain port coupled to said user array for receiving input from an additional one of said at least one memory algorithmic processors in said computer system.

3. The computer system of claim 2 further comprising:

said chain port coupled to said output FIFO for providing output to another one of said memory algorithmic processors in said computer system.

4. The computer system of claim 1 wherein said user array comprises at least one FPGA.

5. The computer system of claim 4 wherein said user array comprises four FPGAs.

6. The computer system of claim 1 further comprising:

a control block coupled to said input data buffer, said user array and said output FIFO for operationally controlling a function of said at least one memory algorithmic processor element.

7. The computer system of claim 1, wherein during said loading operands are received out-of-order.

8. The computer system of claim 7, wherein said loading comprises cache line transfers of said received operands.

9. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array;

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus; and

address generator operative for supporting a data gather function from said input data buffer or said at least one memory array, wherein said address generator is configured to receive address bits from said at least one memory algorithmic processor element and combine said address bits with additional control signals prior to issuance to said input data buffer.

10. The computer system of claim 7, further comprising:

a chain port coupled to said user array for receiving input from an additional one of said at least one memory algorithmic processors in said computer system.

11. The computer system of claim 10 further comprising:

said chain port coupled to said output FIFO for providing output to another one of said memory algorithmic processors in said computer system.

12. The computer system of claim 9 wherein said user array comprises at least one FPGA.

13. The computer system of claim 12 wherein said user array comprises four FPGAs.

14. The computer system of claim 9 further comprising:

a control block coupled to said input data buffer, said user array and said output FIFO for operationally controlling a function of said at least one memory algorithmic processor element.

15. The computer system of claim 9, wherein said address generator is configured to receive a start command including a start address, a stop address, and a stride and based on said start command, to access said input data buffer beginning at said start address and continuing at increments of said stride until said stop address is reached.

16. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array;

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus; and

a memory device associated with said at least one memory algorithmic processor element for at least storing said algorithm.

17. The computer system of claim 15 further comprising:

a chain port coupled to said user array for receiving input from an additional one of said at least one memory algorithmic processors in said computer system.

18. The computer system of claim 17 further comprising:

said chain port coupled to said output FIFO for providing output to another one of said memory algorithmic processors in said computer system.

19. The computer system of claim 15 wherein said user array comprises at least one FPGA.

20. The computer system of claim 19 wherein said user array comprises four FPGAs.

21. The computer system of claim 15 further comprising:

a control block coupled to said input data buffer, said user array and said output FIFO for operationally controlling a function of said at least one memory algorithmic processor element.

22. The computer system of claim 15 wherein said memory device is selectively reprogrammable by said at least one memory algorithmic processor element for storing another algorithm.

23. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

a plurality of memory algorithmic processor elements, each including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array;

an input data buffer associated with each of said plurality of memory algorithmic processor elements for coupling said write bus to said user array;

an output FIFO associated with each of said plurality of memory algorithmic processor elements for coupling said user array to said read bus; and

a chain port coupled to each user array for receiving input from an adjacent one of said memory algorithmic processor elements and coupled to said output FIFO for passing output to another adjacent one of said memory algorithmic processor elements, wherein said output consists of operands.

24. The computer system of claim 23 wherein said user array comprises at least one FPGA.

25. The computer system of claim 24 wherein said user array comprises four FPGAs.

26. The computer system of claim 23 further comprising:

a control block coupled to said input data buffer, said user array and said output FIFO for operationally controlling a function of each of said plurality of memory algorithmic processor elements.

27. The computer system of claim 23 wherein said chain port allows at least a subset of said plurality of memory algorithmic processor elements to perform said at least one algorithm.

28. The computer system of claim 23 wherein said chain port allows for dynamic configuration of said plurality of memory algorithmic processor elements into at least two subsets thereof to separately perform differing ones of said at least one algorithms.

29. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array; and

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus, said output FIFO operative to store output data to allow said memory algorithmic processor element to continue operation if one of said at least one data processors intended to read said output data is currently unavailable to perform said read operation.

30. The computer system of claim 29 further comprising:

a chain port coupled to said user array for receiving input from an additional one of said at least one memory algorithmic processors in said computer system.

31. The computer system of claim 30 further comprising:

said chain port coupled to said output FIFO for providing said output data to another one of said memory algorithmic processors in said computer system.

32. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array, said input data buffer operative to store one or more of said operands for access by an additional one of said at least one memory algorithmic processor elements; and

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus.

33. The computer system of claim 32 further comprising:

a chain port coupled to said user array for receiving input from said additional one of said at least one memory algorithmic processors in said computer system.

34. The computer system of claim 33 further comprising:

said chain port coupled to said output FIFO for providing output to another one of said memory algorithmic processors in said computer system, wherein said input and said output passed through said chain port consists of operands.

35. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array;

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus;

a chain port coupled to said user array for receiving input from an additional one of said at least one memory algorithmic processors in said computer system, said chain port being also coupled to said FIFO for providing output to another one of said memory algorithmic processors in said computer system; and

a dedicated port coupled between said user array and said input data buffer to enable said user array to receive said operand from said chain port or said input data buffer, wherein said dedicated port enables said user array to receive said operand through said chain port while substantially concurrently accessing data stored in said input data buffer.

36. A computer system including a plurality of data processors for operating on user data in accordance with program instructions, said computer further including at least one memory array being selectively coupled through a switching element to said plurality of data processors by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array, said at least one memory algorithmic processor element being globally accessible by each of said plurality of data processors in said computer system;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array; and

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus.

37. The computer system of claim 36 further comprising:

a chain port coupled to said user array for receiving input from an additional one of said at least one memory algorithmic processors in said computer system.

38. The computer system of claim 37 further comprising:

said chain port coupled to said output FIFO for providing output to another one of said memory algorithmic processors in said computer system.

39. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array, said memory algorithmic processor element further comprising a configuration memory for storing information controlling an operation of said memory algorithmic processor element, wherein said configuration memory is capable of being updated during operation of said computer system to change said operation of said memory algorithmic processor element;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array; and

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus.

40. The computer system of claim 39 further comprising:

a chain port coupled to said user array for receiving input from an additional one of said at least one memory algorithmic processors in said computer system.

41. The computer system of claim 40 further comprising:

said chain port coupled to said output FIFO for providing output to another one of said memory algorithmic processors in said computer system.

42. A computer system including at least one data processor for operating on user data in accordance with program instructions, said computer further including at least one memory array coupled to said at least one data processor by read and write buses, said computer system comprising:

at least one memory algorithmic processor element including a user array for performing at least one algorithm on an operand and an address generator for accessing said at least one memory array, said at least one memory algorithmic processor being coupled to said at least one memory array to provide direct memory access to commands issued by said at least one data processor;

an input data buffer associated with said at least one memory algorithmic processor element for coupling said write bus to said user array; and

an output FIFO associated with said at least one memory algorithmic processor element for coupling said user array to said read bus.

43. The computer system of claim 42 further comprising:

a chain port coupled to said user array for receiving input from an additional one of said at least one memory algorithmic processors in said computer system.

44. The computer system of claim 43 further comprising:

said chain port coupled to said output FIFO for providing output to another one of said memory algorithmic processors in said computer system.

45. The computer system of claim 42 wherein said user array comprises at least one FPGA.

46. The computer system of claim 45 wherein said user array comprises four FPGAs.

47. The computer system of claim 42 further comprising:

a control block coupled to said input data buffer, said user array and said FIFO for operationally controlling a function of said at least one memory algorithmic processor element.
 Description Submit all comments and votes
 


BACKGROUND OF THE INVENTION

The present invention relates, in general, to the field of computer architectures incorporating multiple processing elements. More particularly, the present invention relates to a multiprocessor computer architecture incorporating a number of memory algorithmic processors ("MAP") in the memory subsystem or closely coupled to the processing elements to significantly enhance overall system processing speed.

As commodity microprocessors increase in capability there is an ever increasing push to use them in high performance multiprocessor systems capable of performing trillions of calculations per second at significantly lower cost than those made from custom counterparts. However, many of these processors lack specific features common to systems in this category that employ much more expensive custom processors. One such feature is the ability to perform vector processing.

In this form of processing, a data register or buffer is filled with operands forming what is called a vector. All of these operands are then passed one after the other through a functional unit capable of performing operations such as multiplication. This functional unit will output one result every clock cycle. This type of processing does require that the same operation be performed on all operands in the input vector and it is, therefore, widely used in that it exhibits much higher processing rates than the traditional scalar method of computation used in most microprocessors.

Nevertheless, neither vector nor scalar processors perform very well when required to perform bit manipulation as is required, for example, in matrix arithmetic. One such function is a bit matrix multiply operation in which two matrices of different sizes are multiplied together to form a third matrix. Another shortfall of both vector and scalar processing is their inability to quickly perform pattern searches such as those used in a variety of pattern recognition programs.

A solution to all of these deficiencies can be found by building a high performance computer which contains numbers of commodity microprocessors to reduce the system cost together with MAP elements developed by SRC Computers, Inc., assignee of the present invention, to provide the deficient functions at very low cost. The MAP architecture and specific features thereof is disclosed in the aforementioned patent applications, the disclosures of which are herein specifically incorporated by this reference.

SUMMARY OF THE INVENTION

The enhanced memory algorithmic processor architecture for multiprocessor computer systems of the present invention is an assembly that not only contains, for example, field programmable gate arrays functioning as the memory algorithmic processors, but also an operand storage, intelligent address generation, on board function libraries, result storage and multiple I/O ports. Like the original MAP architecture disclosed in the aforementioned patent applications, this architecture differs from other so called "reconfigurable" computers in many ways.

First, its function is intended to be altered every few seconds distinguishing itself from other systems with very long reconfiguration times primarily intended for a single function. Secondly, it contains dedicated hardware to provide for large data set operand storage (on the order of 16 Mbytes or more) allowing the MAP element to function autonomously from its host system once operands are loaded. Thirdly, it contains dedicated data ports to allow, but not require, multiple MAP elements to be chained together to perform very large operations. As currently contemplated, it is intended that typically 32 to 512 or more MAP sections can be connected in a single system.

Further, the MAP element is intended to augment, not replace, the high performance microprocessors in the system. As such, in a particular embodiment of the present invention, it may be connected through the memory subsystem of the computer system resulting in it being very tightly coupled to the system as well as being globally accessible from any processor in the system. This technique was developed by SRC Computers, Inc. and distinguishes the MAP architecture from all other so called "attached array processor" systems that may exist today. While such "attached array processor" systems may bear some superficial similarities to MAP based systems, they are entirely separate units connected to the host computer through relatively slow interconnects resulting in lost system performance.

The MAP architecture developed by SRC Computers, Inc. as defined in the aforementioned patent applications overcomes many of the limitations of such "attached array processor" systems. Because of the particular limitations in the exemplary architecture disclosed therein surrounding the attachment of input storage and chaining capabilities, certain vector processing functions may not have been optimally implemented unlike relatively smaller algorithms.

Through the addition of these and other features to the MAP architecture, a much more powerful multiprocessor computer system is provided. Moreover, while, as originally disclosed, another feature of the MAP architecture was its ability to perform direct memory access ("DMA") into the common the memory of the system, enhancements disclosed herein have expanded the potential utilization of this feature.

Particularly disclosed herein is a Memory Algorithmic Processor ("MAP") assembly (or element) comprising reconfigurable field programmable gate array ("FPGA") circuitry, an intelligent address generator, input data buffers, output first-in, first-out ("FIFO") devices and ports to allow connection to a memory array and chaining of multiple MAP assemblies for the purpose of augmenting the capability of a microprocessor in a high performance computer.

Further disclosed herein is a MAP assembly comprising an intelligent address generator capable of supporting a data gather function from its associated input buffer or common memory. The MAP assembly may also comprise circuitry to allow the reconfigurable elements to reprogram their on-board configuration read only memory ("ROM") devices to cause alterations in the functionality of the reconfigurable circuitry.

Still further disclosed herein is a MAP assembly comprising dedicated input and output ports for the purpose of allowing an infinite number of MAP elements to be chained together to accomplish a single function. The MAP assembly may also incorporate provisions to create a single MAP chain or multiple independent MAP chains automatically based on the contents of the reconfigurable circuitry.

Further disclosed herein is a MAP assembly comprising output FIFOs for the purpose of holding output data and allowing the MAP element to not stall in the event the processor reading these results is delayed due to outside factors such as workload or crossbar switch conflicts. The MAP assembly may further comprise relatively large dedicated input storage buffers to allow for optimization of operand transfer as well as allow multiple accesses to an operand without requiring external processor intervention.

Still further disclosed herein is a MAP assembly comprising a dedicated port for connection to an input buffer so that the MAP element can simultaneously receive operands via the chained input (chain) port and the input buffer. This allows the MAP element to perform mathematical processing at the maximum possible rate while also allowing the MAP element to accept operands via the chain port while accessing reference data in the input buffer (such as reciprocal look up tables) to allow the MAP element to perform operations such as division at the fastest possible rate.

Also further disclosed herein is a MAP assembly which may comprise connections to the memory subsystem of a high performance computer for the purpose of providing global access to it from all processors in a multiprocessor high performance computer system. The MAP assembly incorporates the capability to update multiple on board function ROMs under program control while in the system and may also include connections to the memory subsystem of a high performance computer utilizing DMA to accept commands from a microprocessor.

BRIEF DESCRIPTION OF THE DRAWINGS

The aforementioned and other features and objects of the present invention and the manner of attaining them will become more apparent and the invention itself will be best understood by reference to the following description of a preferred embodiment taken in conjunction with the accompanying drawings, wherein:

FIG. 1 is a simplified, high level, functional block diagram of a multiprocessor computer architecture employing memory algorithmic processors ("MAP") in accordance with the disclosure of the aforementioned patent applications in an alternative embodiment wherein direct memory access ("DMA") techniques may be utilized to send commands to the MAP elements in addition to data;

FIG. 2 is a simplified logical block diagram of a possible computer application program decomposition sequence for use in conjunction with a multiprocessor computer architecture utilizing a number of MAP elements located, for example, in the computer system memory space, in accordance with a particular embodiment of the present invention;

FIG. 3 is a more detailed functional block diagram of an exemplary individual one of the MAP elements of the preceding figures and illustrating the bank control logic, memory array and MAP assembly thereof;

FIG. 4 is a more detailed functional block diagram of the control block of the MAP assembly of the preceding illustration illustrating its interconnection to the user FPGA thereof in a particular embodiment;

FIG. 5 is a functional block diagram of an alternative embodiment of the present invention wherein individual MAP elements are closely associated with individual processor boards and each of the MAP elements comprises independent chain ports for coupling the MAP elements directly to each other;

FIG. 6 is a functional block diagram of an individual MAP element wherein each comprises on board memory and a control block providing common memory DMA capabilities;

FIG. 7 is an additional functional block diagram of an individual MAP element illustrating the on board memory function as an input buffer and output FIFO portions thereof;

FIG. 8 is a more detailed functional block diagram of an individual MAP element as illustrated in FIGS. 6 and 7;

FIG. 9 is a user array interconnect diagram illustrating, for example, four user FPGAs interconnected through horizontal, vertical and diagonal buses to allow for expansion in designs that exceed the capacity of a single FPGA;

FIG. 10 is a functional block diagram of another alternative embodiment of the present invention wherein individual MAP elements are closely associated with individual memory arrays and each of the MAP elements comprises independent chain ports for coupling the MAP elements directly to each other; and

FIGS. 11A and 11B are timing diagrams respectively input and output timing in relationship to the system clock ("Sysclk") signal.

DESCRIPTION OF A PREFERRED EMBODIMENT

With reference now to FIG. 1, a multiprocessor computer 10 architecture in accordance with one embodiment of the present invention is shown. The multiprocessor computer 10 incorporates N processors 120.sub.0 through 12.sub.N which are bi-directionally coupled to a memory interconnect fabric 14. The memory interconnect fabric 14 is then also coupled to M memory banks comprising memory bank subsystems 16.sub.0 (Bank 0) through 16M (Bank M). N number of memory algorithmic processors ("MAP") 112.sub.0 through 112.sub.N are also coupled to the memory interconnect fabric 14 as will be more fully described hereinafter.

With reference now to FIG. 2, a representative application program decomposition for a multiprocessor computer architecture 100 incorporating a plurality of memory algorithm processors in accordance with the present invention is shown. The computer architecture 100 is operative in response to user instructions and data which, in a coarse grained portion of the decomposition, are selectively directed to one of (for purposes of example only) four parallel regions 102.sub.1 through 102.sub.4 inclusive. The instructions and data output from each of the parallel regions 102.sub.1 through 102.sub.4 are respectively input to parallel regions segregated into data areas 104.sub.1 through 104.sub.4 and instruction areas 106.sub.1 through 106.sub.4. Data maintained in the data areas 104.sub.1 through 104.sub.4 and instructions maintained in the instruction areas 106.sub.1 through 106.sub.4 are then supplied to, for example, corresponding pairs of processors 108.sub.1, 108.sub.2 (P1 and P2); 108.sub.3, 108.sub.4 (P3 and P4); 108.sub.5, 108.sub.6 (P5 and P6); and 108.sub.7, 108.sub.8 (P7 and P8) as shown. At this point, the medium grained decomposition of the instructions and data has been accomplished.

A fine grained decomposition, or parallelism, is effectuated by a further algorithmic decomposition wherein the output of each of the processors 108.sub.1 through 108.sub.8, is broken up, for example, into a number of fundamental algorithms 1101.sub.1A, 110.sub.1B, 110.sub.2A, 110.sub.2B through 110.sub.8B as shown. Each of the algorithms is then supplied to a corresponding one of the MAP elements 112.sub.1A, 112.sub.1B, 112.sub.2A, 112.sub.2B, through 112.sub.8B which may be located in the memory space of the computer architecture 100 for execution therein as will be more fully described hereinafter.

With reference additionally now to FIG. 3, an exemplary implementation of a memory bank 120 in a MAP system computer architecture 100 of the present invention is shown for a representative one of the MAP elements 112 illustrated in the preceding figure. Each memory bank 120 includes a bank control logic block 122 bi-directionally coupled to the computer system trunk lines, for example, a 72 line bus 124. The bank control logic block 122 is coupled to a bi-directional data bus 126 (for example 256 lines) and supplies addresses on an address bus 128 (for example 17 lines) for accessing data at specified locations within a memory array 130.

The data bus 126 and address bus 128 are also coupled to a MAP element 112. The MAP element 112 comprises a control block 132 coupled to the address bus 128. The control block 132 is also bi-directionally coupled to a user field programmable gate array ("FPGA") 134 by means of a number of signal lines 136. The user FPGA 134 is coupled directly to the data bus 126. In a particular embodiment, the FPGA 134 may be provided as a Lucent Technologies OR3T80 device.

The computer architecture 100 comprises a multiprocessor system employing uniform memory access across common shared memory with one or more MAP elements 112 which may be located in the memory subsystem, or memory space. As previously described, each MAP element 112 contains at least one relatively large FPGA 134 that is used as a reconfigurable functional unit. In addition, a control block 132 and a preprogrammed or dynamically programmable configuration ROM (as will be more fully described hereinafter) contains the information needed by the reconfigurable MAP element 112 to enable it to perform a specific algorithm. It is also possible for the user to directly download a new configuration into the FPGA 134 under program control, although in some instances this may consume a number of memory accesses and might result in an overall decrease in system performance if the algorithm was short-lived.

FPGAs have particular advantages in the application shown for several reasons. First, commercially available FPGAs now contain sufficient internal logic cells to perform meaningful computational functions. Secondly, they can operate at speeds comparable to microprocessors, which eliminates the need for speed matching buffers. Still further, the internal programmable routing resources of FPGAs are now extensive enough that meaningful algorithms can now be programmed without the need to reassign the locations of the input/output ("1/0") pins.

By, for example, placing the MAP element 112 in the memory subsystem or memory space, it can be readily accessed through the use of memory read and write commands, which allows the use of a variety of standard operating systems. In contrast, other conventional implementations may propose placement of any reconfigurable logic in or near the processor, however these conventional implementations are generally much less effective in a multiprocessor environment because, unlike the system and method of the present invention, only one processor has rapid access to it. Consequently, reconfigurable logic must be placed by every processor in a multiprocessor system, which increases the overall system cost. In addition, MAP element 112 can access the memory array 130 itself, referred to as Direct Memory Access ("DMA"), allowing it to execute tasks independently and asynchronously of the processor. In comparison, were it placed near the processor, it would have to compete with the processors for system routing resources in order to access memory, which deleteriously impacts processor performance. Because MAP element 112 has DMA capability, (allowing it to write to memory), and because it receives its operands via writes to memory, it is possible to allow a MAP element 112 to feed results to another MAP element 112. This is a very powerful feature that allows for very extensive pipelining and parallelizing of large tasks, which permits them to complete faster.

Many of the algorithms that may be implemented will receive an operand and require many clock cycles to produce a result. One such example may be a multiplication that takes 64 clock cycles. This same multiplication may also need to be performed on thousands of operands. In this situation, the incoming operands would be presented sequentially so that while the first operand requires 64 clock cycles to produce results at the output, the second operand, arriving one clock cycle later at the input, will show results one clock cycle later at the output. Thus, after an initial delay of 64 clock cycles, new output data will appear on every consecutive clock cycle until the results of the last operand appears. This is called "pipelining".

In a multiprocessor system, it is quite common for the operating system to stop a processor in the middle of a task, reassign it to a higher priority task, and then return it, or another, to complete the initial task. When this is combined with a pipelined algorithm, a problem arises (if the processor stops issuing operands in the middle of a list and stops accepting results) with respect to operands already issued but not yet through the pipeline. To handle this issue, a solution involving the combination of software and hardware is disclosed herein.

To make use of any type of conventional reconfigurable hardware, the programmer could embed the necessary commands in his application program code. The drawback to this approach is that a program would then have to be tailored to be specific to the MAP hardware. The system of the present invention eliminates this problem. Multiprocessor computers often use software called parallelizers. The purpose of this software is to analyze the user's application code and determine how best to split it up among the processors. The present invention provides significant advantages over a conventional parallelizer and enables it to recognize portions of the user code that represent algorithms that exist in MAP elements 112 for that system and to then treat the MAP element 112 as another computing element. The parallelizer then automatically generates the necessary code to utilize the MAP element 112. This allows the user to write the algorithm directly in his code, allowing it to be more portable and reducing the knowledge of the system hardware that he has to have to utilize the MAP element 112.

With reference additionally now to FIG. 4, a block diagram of the MAP control block 132 is shown in greater detail. The control block 132 is coupled to receive a number of command bits (for example, 17) from the address bus 128 at a command decoder 150. The command decoder 150 then supplies a number of register control bits to a group of status registers iS2 on an eight bit bus 154. The command decoder 150 also supplies a single bit last operand flag on line 156 to a pipeline counter 158. The pipeline counter 158 supplies an eight bit output to an equality comparitor 160 on bus 162. The equality comparitor 160 also receives an eight bit signal from the FPGA 134 on bus 136 indicative of the pipeline depth. When the equality comparitor 160 determines that the pipeline is empty, it provides a single bit pipeline empty flag on line 164 for input to the status registers 152. The status registers 152 are also coupled to receive an eight bit status signal from the FPGA 134 on bus 136 and it produces a sixty four bit status word output on bus 166 in response to the signals on bus 136, 154 and line 164.

The command decoder 150 also supplies a five bit control signal on line 168 to a configuration multiplexer ("MUX") 170 as shown. The configuration MUX 170 receives a single bit output of a 256 bit parallel-serial converter 172 on line 176. The inputs of the 256 bit parallel-to-serial converter 172 are coupled to a 256 bit user configuration pattern bus 174. The configuration MUX 170 also receives sixteen single bit inputs from the configuration ROMs (illustrated as ROM 182) on bus 178 and provides a single bit configuration file signal on line 180 to the user FPGA 134 as selected by the control signals from the command decoder 150 on the bus 168.

In operation, when a processor 108 is halted by the operating system, the operating system will issue a last operand command to the MAP element 112 through the use of command bits embedded in the address field on bus 128. This command is recognized by the command decoder 150 of the control block 132 and it initiates a hardware pipeline counter 158. When the algorithm was initially loaded into the FPGA 134, several output bits connected to the control block 132 were configured to display a binary representation of the number of clock cycles required to get through its pipeline (i.e. pipeline "depth") on bus 136 input to the equality comparitor 160. After receiving the last operand command, the pipeline counter 158 in the control block 132 counts clock cycles until its count equals the pipeline depth for that particular, algorithm. At that point, the equality comparitor 160 in the control block 132 de-asserts a busy bit on line 164 in an internal group of status registers 152. After issuing the last operand signal, the processor 108 will repeatedly read the status registers 152 and accept any output data on bus 166. When the busy flag is de-asserted, the task can be stopped and the MAP element 112 utilized for a different task. It should be noted that it is also possible to leave the MAP element 112 configured, transfer the program to a different processor 108 and restart the task where it left off.

In order to evaluate the effectiveness of the use of the MAP element 112 in a given application, some form of feedback to the use is required. Therefore, the MAP element 112 may be equipped with internal registers in the control block 132 that allow it to monitor efficiency related factors such as the number of input operands versus output data, the number of idle cycles over time and the number of system monitor interrupts received over time. One of the advantages that the MAP element 112 has is that because of its reconfigurable nature, the actual function and type of function that are monitored can also change as the algorithm changes. This provides the user with an almost infinite number of possible monitored factors without having to monitor all factors all of the time.

With reference additionally now to FIG. 5, a functional block diagram of a portion of an alternative embodiment of a computer system 20 in accordance with the of the present invention is shown. In the computer system 20 illustrated, individual MAP elements 112.sub.A, 112.sub.B etc. are each closely associated with individual processor boards 22.sub.A, 22.sub.B respectively. As depicted, each of the MAP elements 112 comprises independent chain ports 24 for coupling the MAP elements 112 directly to each other.

Individual ones of the MAP elements 112 are coupled between the processor board 22 write trunk 26 and read trunk 28 of each processor board 22 in addition to their coupling to each other by means of the chain ports 24. A switch couples the write trunk 26 and read trunk 28 of any given processor board to any other memory subsystem bank 16.sub.A, 16.sub.B etc. As generally illustrated, each of the memory subsystem banks 16 includes a control block 122 and one or more memory arrays 130.

With reference additionally now to FIG. 6, a functional block diagram of an individual MAP element 112 is shown wherein each MAP element 112 comprises an on board memory 40 and a control block 46 providing common memory DMA capabilities. Briefly, the write trunk 26 and read trunk 28 are coupled to the control block 46 from the common memory switch which provides addresses to the memory 40 and receives addresses from the user array 42 on address lines 48. Data supplied on the write trunk