Claims
We claim:
1. A programmable integrated circuit comprising:
logic units which perform operations on data in response to instructions of
a defined set of instructions;
memories which store and retrieve data in response to received addresses;
a configurable interconnect which provides signal transmission between the
logic units and memories, the interconnect being configurable from
configuration control data to define data paths, originating at logic
units and/or memories, through the interconnect as address inputs to
memories, data inputs to memories and logic units, and instruction inputs
to logic units such that the interconnect is configurable to define an
interdependent functionality of the memories and logic units; and
programmable configuration storage for storing the configuration control
data.
2. An integrated circuit as claimed in claim 1 wherein at least one memory
provides at least part of the instructions to a logic unit.
3. An integrated circuit as claimed in claim 1 wherein the logic units are
configurable to receive data directly from associated memories.
4. An integrated circuit as claimed in claim 1 wherein the configuration
storage stores multiple contexts of configuration control data for
reconfiguration of the programmable interconnect.
5. An integrated circuit as claimed in claim 4 wherein a context selection
signal that selects among the multiple contexts is routed by the
configurable interconnect.
6. An integrated circuit as claimed in claim 5 wherein a global context
selection signal that selects among the multiple contexts is globally
broadcast to the programmable configuration storage of the device.
7. An integrated circuit as claimed in claim 1 wherein the interconnect is
configurable to provide a static value to a logic unit or to provide a
variable value from a static source.
8. An integrated circuit as claimed in claim 1 wherein the memories are
deployable as data memory, register files, program counters, and
instruction stores for other logic units.
9. An integrated circuit as claimed in claim 1 wherein the interconnect
comprises network drivers which transmit received signals between the
memories and logic units.
10. A programmable integrated circuit comprising:
logic units which perform operations on data in response to instructions of
a defined set of instructions;
memories which store and retrieve data in response to received addresses;
a configurable interconnect which provides signal transmission between the
logic units and memories, the interconnect being configurable from
configuration control data to define data paths through the interconnect
as address inputs to memories, data inputs to memories and logic units,
and instruction inputs to logic units such that the interconnect is
configurable to define an interdependent functionality of the memories and
logic units; and
programmable configuration storage for storing the configuration control
data;
wherein a global context selection signal that selects among multiple
contexts is globally broadcast to the programmable configuration storage
of the device.
11. A programmable integrated circuit comprising:
logic units which perform operations on data in response to instructions of
a defined set of instructions and which are configurable to be chained
together to form wider data paths than provided by a single logic unit;
memories which store and retrieve data in response to received addresses;
a configurable interconnect which provides signal transmission between the
logic units and memories, the interconnect being configurable from
configuration control data to define data paths through the interconnect
as address inputs to memories, data inputs to memories and logic units,
and instruction inputs to logic units such that the interconnect is
configurable to define an interdependent functionality of the memories and
logic units; and
programmable configuration storage for storing the configuration control
data.
12. A programmable integrated circuit comprising:
logic units which perform operations on data in response to instructions of
a defined set of instructions;
memories which store and retrieve data in response to received addresses;
a configurable interconnect which provides signal transmission between the
logic units and memories, the interconnect being configurable from
configuration control data to define data paths through the interconnect
as address inputs to memories, data inputs to memories and logic units,
and instruction inputs to logic units such that the interconnect is
configurable to define an interdependent functionality of the memories and
logic units, the interconnect being configurable to provide a value to a
logic unit or memory from a source, which is determined by a value from
another logic unit or memory; and
programmable configuration storage for storing the configuration control
data.
13. A programmable integrated circuit comprising:
logic units which perform operations on data in response to instructions of
a defined set of instructions;
memories which store and retrieve data in response to received addresses,
the logic units being each grouped with memories to form repeating
functional units;
a configurable interconnect which provides signal transmission between the
logic units and memories, the interconnect being configurable from
configuration control data to define data paths through the interconnect
as address inputs to memories, data inputs to memories and logic units,
and instruction inputs to logic units such that the interconnect is
configurable to define an interdependent functionality of the memories and
logic units; and
programmable configuration storage for storing the configuration control
data.
14. An integrated circuit as claimed in claim 13 further comprising
programmable logic arrays on data paths between functional units which
perform bit level logic operations.
15. An integrated circuit as claimed in claim 13 further comprising
reduction logic which performs logic operations on the output of the logic
units and passes a result to other functional units as control
information.
16. An integrated reconfigurable computing device, comprising:
an array of functional units comprising:
multibit arithmetic logic units which perform operations on data in
response to instructions;
memories which store and retrieve data in response to received addresses;
function switches which determine the source of the instructions to the
logic units; and
address/data switches which are configurable by the other functional units
and determine the source of addresses to the memories and the source of
data to the logic units and memories.
17. An integrated circuit as claimed in claim 16 wherein the logic units
are configurable to both operate on data from the associated memories and
operate on data received from outside the functional unit via the
address/data switches.
18. An integrated circuit as claimed in claim 16 wherein the function
switches also determine the configuration of the memories.
19. An integrated circuit as claimed in claim 16 wherein the address/data
switches are configurable to provide static values, values from other
functional units, and values from sources, which are determined by other
functional units.
20. An integrated circuit as claimed in claim 16 wherein the outputs from
the arithmetic logic units are distributed over a local network to
near-neighbor functional units.
21. An integrated circuit as claimed in claim 16 wherein the functional
units comprise network drivers which transmit received signals to other
functional units.
22. An integrated circuit as claimed in claim 21 wherein sources of the
received signals to the network drivers are programmable by other
functional units.
23. A method for organizing signal transmission within an array of logic
units which perform operations on data in response to instructions and
memories which store and retrieve data in response to received addresses,
the method comprising:
transmitting data read from the memories as instructions or data to the
logic units or addresses or data to other memories; and
transmitting data generated by the logic units as instructions or data to
other logic units or addresses or data to the memories.
24. A method as claimed in claim 23 further comprising transmitting data
from logic units or memories as control to other logic units or memories.
25. A method as claimed in claim 23 further comprising reorganizing the
paths of the data and instructions in response to control from the array.
26. A method as claimed in claim 23 further comprising transmitting static
values, values from other memories or logic units, and values from sources
determined by the memories or logic units.
27. A method as claimed in claim 23 further comprising performing bit level
logic operations to control data paths between memories or logic units.
28. A method as claimed in claim 23 further comprising selecting among
multiple contexts using a globally broadcast context selection signal to
the programmable configuration storage of the device.
29. A method as claimed in claim 23 further comprising configuring the
logic units to be chained together to form wider data paths than provided
by a single logic unit.
30. A method as claimed in claim 23 further comprising providing a value to
a logic unit or memory from a source, which is determined by a value from
another logic unit or memory.
31. A method as claimed in claim 23 further comprising grouping each of the
logic units with memories to form repeating functional units.
Description
BACKGROUND OF THE INVENTION
Continuing advances in semiconductor technology have greatly increased the
amount of processing that can be performed by single-chip, general-purpose
computing devices. The relatively slow increase in inter-chip
communication bandwidth requires that modern high performance devices use
as much of the potential on-chip processing power as possible. This
results in large, dense integrated circuit devices and a large design
space of processing architectures.
One way of viewing this design space is in terms of granularity. Designers
have the option of building very large processing units, or many smaller
ones, in the same space. Traditional architectures are either very coarse
grain, such as microprocessors, or very fine grain, such as field
programmable gate arrays (FPGAs). Both architectures have advantages and
disadvantages.
Microprocessors incorporate very few large processing units that operate on
wide data-words, and each unit is hardwired to perform defined
instructions on these data-words. Usually each unit is optimized for a
different set of instructions, such as integer and floating point, and the
units are generally hardwired to operate in parallel. The hardwired nature
of these units allows very rapid instruction execution. In fact, a great deal of
area on modern microprocessor chips is dedicated to cache memories in
order to support a very high rate of instruction issue. Thus, the devices
efficiently handle very dynamic instruction streams.
Very fine grain devices, such as FPGAs, incorporate a large number of very
small processing elements. These elements are arranged in a configurable
interconnect network. The configuration data used to define the
functionality of the processing units and network can be thought of as a
very large, semantically powerful, instruction word. Nearly any operation
can be described and mapped to hardware.
SUMMARY OF THE INVENTION
Unfortunately, because microprocessors are highly optimized for simple,
wide-word, dynamic instructions, they are relatively inefficient when
performing other kinds of operations. For example, many cycles are
required to build up complex operations that are not part of the
processor's pre-selected instruction set. Also, when performing short-word
operations, much of the processing unit is not being used, and when the
instructions being issued are very regular, the large instruction caches
are unnecessary. Thus, very coarse-grain microprocessors are not equipped
to take the maximum advantage of these cases.
The size of the "instruction word" creates a number of problems with
fine-grain FPGA devices, however. Reloading new instructions takes a
relatively long time, making dynamic instruction streams very difficult
for these devices. Moreover, if the operation being performed is, in fact,
a wide word operation, a great deal of this "instruction word" must be
dedicated to re-describing the operation for each of the small processing
elements. Thus, fine grain processing elements are not well equipped to
take advantage of a large number of common computing operations.
The present invention utilizes a large number of intermediate-grain
processing elements which are arranged in a configurable mesh. Thus, the
regularity and rapid instruction issue features of coarse-grain units are
exploited, but a reconfigurable or programmable interconnect allows these
units to be connected in an application-specific manner. This means that
coarse-grain resources, such as memory and processing, can be deployed in
a way that takes advantage of the opportunities for optimization present
in any given problem. In addition, configuration memories may be deployed
to take advantage of application specific redundancy.
In general according to one aspect, the invention features a programmable
integrated circuit that comprises logic units that perform operations on
data in response to instructions and memories that store and retrieve
addressed data. A configurable or programmable interconnect provides a
mode of signal transmission between the logic units and memories.
Configuration control data defines data paths through the interconnect,
which can be address inputs to memories, data inputs to memories and logic
units, and instruction inputs to logic units. Thus, the interconnect is
configurable to define an interdependent functionality of the functional
units. A programmable configuration storage stores the configuration
control data.
Thus the present invention may be configured to operate according to a
number of traditionally distinct computing architectures. For example, a
centrally located functional unit may be assigned the role of arithmetic
logic unit (ALU) with memories of surrounding functional units being
configured to act as instruction caches, register files, and program
counters. Wider data paths are accommodated by tying near-neighbor ALUs to
each other. Wider instructions are achieved by configuring instruction
memories of separate functional units as if they were a single memory. For
a different problem, the same integrated circuit may be reconfigured to
emulate a single instruction multiple data (SIMD) architecture. The logic
units of rows of functional units are tied together to create wider data
paths, and the rows perform separate serial tasks.
In specific embodiments, functional units may provide at least part of the
instructions to logic units of other functional units. Also, the
configuration storage may hold multiple contexts of configuration control
data for reconfiguration of the programmable interconnect.
In other embodiments, the interconnect may support three different modes of
operation: a static value mode, in which a value set by the configuration
data is provided to a functional unit; a static source mode, in which
another functional unit serves as the value source; and a dynamic source
mode, in which the source is determined by the value from another
functional unit.
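The three modes can be illustrated with a minimal sketch. The names `Mode` and `resolve_port`, and the use of a dictionary of unit outputs, are hypothetical and chosen only for illustration; the configuration encoding is an assumption, not taken from this description.

```python
from enum import Enum


class Mode(Enum):
    STATIC_VALUE = 1    # value comes straight from configuration data
    STATIC_SOURCE = 2   # configuration names one fixed source unit
    DYNAMIC_SOURCE = 3  # another unit's output picks the source


def resolve_port(mode, config, outputs, selector=None):
    """Return the value delivered to a port (illustrative model only).

    outputs maps a unit name to its current output value.
    config is a constant, a unit name, or a list of candidate unit names,
    depending on the mode (an assumed encoding).
    """
    if mode is Mode.STATIC_VALUE:
        return config
    if mode is Mode.STATIC_SOURCE:
        return outputs[config]
    # Dynamic source: the selector unit's output chooses among candidates.
    index = outputs[selector] % len(config)
    return outputs[config[index]]
```

For example, with `outputs = {"u0": 10, "u1": 20, "sel": 1}`, the dynamic mode over candidates `["u0", "u1"]` with selector `"sel"` delivers the output of `u1`.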
In still other embodiments, each logic unit can also have programmable
logic arrays on data paths between functional units which perform bit
level logic operations. Additionally, reduction logic can be added that
performs logic operations on the output of the logic units and passes a
result to other functional units as control information. Network drivers
are assigned to each unit to transmit received signals to other functional
units. The sources of the signals received by the drivers may also be
dynamic so that the sources are programmable by other functional units.
In general according to another aspect, the invention features an
integrated reconfigurable computing device, which has functional units of
multi-bit arithmetic logic units and memories. A configurable interconnect
that connects the units includes function ports which determine the source
of the instructions to the logic units. Network ports of the units are
configurable by the functional units and determine the source of addresses
to the memories and the source of data to the logic units and memories.
In general according to still another aspect, the invention can also be
characterized in the context of a method for organizing signal
transmission within an array of functional units. Data read from the
memories of functional units may be transmitted as instructions to the
logic units of other functional units. Also, data read from logic units
may be transmitted as addresses for the memories of other functional
units. Finally, the data read from functional units can also be used as
data inputs for the logic units of other functional units.
In specific embodiments, the paths of the data and instructions are dynamic
in response to control from the functional units. More specifically,
static values, values from other functional units, and values from sources
determined by other functional units may be transmitted between functional
units.
The above and other features of the invention including various novel
details of construction and combinations of parts, and other advantages,
will now be more particularly described with reference to the accompanying
drawings and pointed out in the claims. It will be understood that the
particular method and device embodying the invention are shown by way of
illustration and not as a limitation of the invention. The principles and
features of this invention may be employed in various and numerous
embodiments without departing from the scope of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
In the accompanying drawings, reference characters refer to the same parts
throughout the different views. The drawings are not necessarily to scale;
emphasis has instead been placed upon illustrating the principles of the
invention. Of the drawings:
FIG. 1 shows a programmable integrated processing device of the present
invention, which has been configured as an 8-bit microprocessor;
FIG. 2 shows a SIMD processor configuration for the processing device
according to the invention;
FIG. 3 shows a 32-bit processor configuration for the processing device
according to the invention;
FIG. 4 shows a very long instruction word (VLIW) processor configuration
for the processing device according to the invention;
FIG. 5 shows a multiple instruction multiple data (MIMD) processor
configuration for the processing device according to the invention;
FIG. 6 is a block diagram showing the architecture of a basic functional
unit (BFU) core of the present invention;
FIG. 7 is a block diagram showing the inter-BFU connectivity provided by
the level-1 network connections;
FIG. 8 is a block diagram showing the BFU interconnection provided by the
level-2 network connections;
FIG. 9 is a block diagram showing the network switch architecture for a BFU
of the present invention;
FIG. 10 is a block diagram illustrating the function switch architecture of
the present invention;
FIG. 11 is a block diagram showing the address/data and network switch
architecture of the present invention;
FIG. 12 is a block diagram illustrating the floating port architecture of
the present invention;
FIG. 13 is a block diagram showing the level-1 network drivers of the
present invention;
FIG. 14 shows the level-2 drivers of the present invention;
FIG. 15 shows the level-3 drivers of the present invention;
FIG. 16 shows BFU input registers of the present invention;
FIG. 17 shows the reduction logic in the BFU control architecture of the
present invention;
FIG. 18 is an example of multi-BFU reduction performed by the reduction
logic of the present invention;
FIG. 19 is a block diagram illustrating the operation of the distributed
programmable logic array (PLA) associated with each BFU according to the
invention;
FIG. 20 is a block diagram showing the control logic for a single BFU;
FIG. 21 shows an alternative embodiment of the configuration memory
supporting multiple contexts;
FIG. 22 is a block diagram of the configurable logic device of the present
invention in the form of an integrated chip;
FIG. 23 is a block diagram showing the input/output port architecture for
the chip of the present invention;
FIG. 24 is a block diagram showing the structure of an I/O register
according to the invention;
FIG. 25 is a block diagram of a programmable logic array for customizing
the chip's interface;
FIG. 26 is a block diagram showing the movement of data from the BFU core
off-chip according to the invention;
FIG. 27 is a block diagram of a selector switch that chooses the core
outputs to be driven on an output wire according to the invention;
FIG. 28 is a block diagram showing a tri-state buffer used in the selector
switch of FIG. 27;
FIG. 29 is a block diagram illustrating how data enters the BFU core from
off-chip;
FIG. 30 is a block diagram showing the selector switch that selects among
incoming data bytes from I/O ports and PLAs according to the invention;
FIG. 31 is a block diagram of a C/R input architecture according to the
invention;
FIG. 32 is a block diagram showing the construction of the controller
switches of the level-3 network lines according to the invention;
FIG. 33 is a block diagram illustrating the dynamic control of the
controller switches, which is shared between pairs of controllers at each
column, according to the invention;
FIG. 34 shows the architecture of one of the dynamic control switches
according to the invention;
FIG. 35 is a block diagram showing the connectivity of BFUs in a
systolic-type configuration according to the invention;
FIG. 36 shows the configuration of the BFUs for a microcoded-type
implementation for the convolution problem according to the invention;
FIG. 37 shows the organization of the BFUs for a VLIW, horizontal
microcode-type implementation according to the invention; and
FIG. 38 shows the organization of the BFUs for a VLIW/MSIMD-type
implementation according to the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
FIG. 1 shows a multi-bit microprocessor configuration of a reconfigurable
processing device, which has been constructed and programmed according to
the principles of the present invention. A two-dimensional array of basic
functional units (BFUs) 100 is located in a programmable interconnect 101. Five
of the BFUs 100 and the portion of the reconfigurable interconnect
connecting the BFUs have been configured to operate as a microprocessor
102.
Each of the BFUs 100 preferably has addressable memory resources and logic
resources, such as an 8-bit arithmetic logic unit (ALU). One of the BFUs
100, denoted ALU, utilizes its logic resources to perform the logic
operations of the microprocessor 102 and utilizes its memory resources as
a data store and/or extended register file. Another BFU operates as a
function store F that controls the successive logic operations performed
by the logic resources of the ALU. Two additional BFUs, A and B, operate
as further instruction stores that control the addressing of the memory
resources of the ALU. A final BFU, PC, operates as a program counter for
the various instruction BFUs F, A, B.
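The division of roles among the five BFUs can be sketched as a small simulation. This is a rough model under stated assumptions: the operation names in `OPS`, the function `run_microprocessor`, the wrapping program counter, and the write-back of each result to the A address are all illustrative choices, not the device's actual instruction encoding.

```python
# Hypothetical operation table; the real ALU instruction encoding is not
# specified in this passage.
OPS = {
    "add": lambda a, b: (a + b) & 0xFF,
    "xor": lambda a, b: a ^ b,
    "nand": lambda a, b: ~(a & b) & 0xFF,
}


def run_microprocessor(f_store, a_store, b_store, alu_memory, steps):
    """Step the five-BFU configuration of FIG. 1 for a number of cycles."""
    pc = 0  # PC BFU: one counter addresses all three instruction stores
    for _ in range(steps):
        op = OPS[f_store[pc]]   # F store selects the ALU operation
        a_adr = a_store[pc]     # A store supplies the first memory address
        b_adr = b_store[pc]     # B store supplies the second memory address
        # The ALU BFU reads both operands from its own memory and, in this
        # sketch, writes the result back to the A address.
        alu_memory[a_adr] = op(alu_memory[a_adr], alu_memory[b_adr])
        pc = (pc + 1) % len(f_store)  # assumed: simple wrapping counter
    return alu_memory
```

Running two steps of the program `f_store=["add", "xor"]`, `a_store=[0, 2]`, `b_store=[1, 0]` on the memory `[3, 4, 0]` adds word 1 into word 0 and then XORs word 0 into word 2.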
As shown in FIG. 2, the same reconfigurable processing array, however, may
be reprogrammed to function as a SIMD system, and as described below, this
reconfiguration can occur on a cycle-by-cycle basis. The functions of the
program counter PC and instruction stores A, B and F have been again
assigned to different BFUs 100, but the ALU function has been replicated
into 12 BFUs. Each of the ALUs is connected via the reconfigurable
interconnect 101 to operate on globally broadcast instructions from the
instruction stores A, B, F. These same operations are performed by each of
these ALUs, or common instructions may be broadcast on a row-by-row basis.
FIG. 3 shows how wider data paths can be constructed in the programmable
device. This 32-bit microprocessor configured device has the same
instruction stores A, B, F and program counter as described in connection
with FIG. 1. Four BFUs, however, have been assigned an ALU operation, and
the ALUs are chained together to act as a single 32-bit wide
microprocessor in which the interconnect 101 supports carry-in and
carry-out operations between the ALUs.
FIG. 4 shows how the device can be configured to operate as a very long
instruction word (VLIW) system. The various instruction stores A, B, F are
defined to encompass multiple BFUs 100 to accommodate the desired
instruction word width.
FIG. 5 shows the configuration of the present system to operate as a
multiple instruction multiple data (MIMD) system. The 8-bit microprocessor
configuration 102 of FIG. 1 is replicated into an adjacent set of BFUs to
accommodate multiple, independent processing units within the same device.
Of course, wider data paths could also be accommodated by chaining ALUs of
each processor 102 to each other.
1. Basic Functional Unit Architecture
FIG. 6 shows the moderately coarse grain, preferably 8-bit, BFU core.
Primarily, the BFU core has memory block 110, basic ALU core 120, and
configuration memory 105.
The main memory block 110 is a 256 word × 8 bit wide memory, which is
arranged to be used in either single or dual port modes. In dual port
mode, the memory size is reduced to 128 words in order to be able to
perform the two simultaneous read operations without increasing the read
latency of the memory. The memory mode is controlled by control logic 114
accessed through a Memory/Mux function port 112, and the write enable can
be controlled either through the memory/mux function port 112 or by the
control logic 134 accessed through ALU function port 132. Control logic is
hardwired and also controls the ALU functions.
In single port mode, the memory 110 uses the A_ADR port for an
address and outputs the selected value to both A_PORT and B_PORT.
In dual port mode, the A_ADR port selects a value for A_PORT only,
and the B_ADR port selects a value for B_PORT.
In either mode the read operation takes place during the first half of the
clock cycle, and the values are latched for the rest of the cycle. Write
operations take place on the second half of the cycle via the DATA memory
port. Writes are always done to the current A_ADR address.
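The port behavior and read-then-write timing described above can be modeled in a few lines. This is a minimal sketch: the class name `BfuMemory` and its methods are hypothetical, and wrapping out-of-range addresses modulo the memory depth is an assumption made only to keep the model total.

```python
class BfuMemory:
    """Toy model of the 256 x 8 BFU memory block (names are illustrative)."""

    def __init__(self, dual_port=False):
        self.dual_port = dual_port
        # Dual-port mode exposes only 128 words, per the description.
        self.depth = 128 if dual_port else 256
        self.words = [0] * self.depth

    def read(self, a_adr, b_adr=0):
        """First half-cycle: return the latched (A_PORT, B_PORT) values."""
        a_port = self.words[a_adr % self.depth]  # wrapping is an assumption
        if self.dual_port:
            b_port = self.words[b_adr % self.depth]
        else:
            # Single-port mode: A_ADR drives both output ports.
            b_port = a_port
        return a_port, b_port

    def write(self, a_adr, data):
        """Second half-cycle: writes always target the current A_ADR."""
        self.words[a_adr % self.depth] = data & 0xFF

    def cycle(self, a_adr, b_adr=0, data=None, write_enable=False):
        """One clock cycle: the read happens before the optional write."""
        ports = self.read(a_adr, b_adr)
        if write_enable and data is not None:
            self.write(a_adr, data)
        return ports
```

Because the read is latched before the write occurs, a cycle that both reads and writes the same address returns the old value; the new value is visible on the next cycle.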
A feedback path 118 shown as a dashed line may be used. The BFU core
performs "A op B → A" in one cycle. Two cycles are needed to perform
"A op B → C" operations. In this case, the feedback is performed by
the normal Level-1 network described in more detail later.
The configuration memory 105 stores configuration words that control the
configuration of the interconnect. It also stores configuration
information for a control architecture. Optionally, it can also be a
multi-context memory that receives a globally broadcast 2-bit global
context selecting signal. The memory is addressed via network port A 122
and receives data from port B 124. The write enable WE is issued by the
control logic 114.
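A multi-context configuration memory of this kind can be pictured as several banks of configuration words, with the 2-bit global signal choosing the active bank. The sketch below is illustrative only: the class `ContextConfigMemory`, its method names, and the bank organization are assumptions, and the count of four contexts is simply what a 2-bit signal can distinguish.

```python
class ContextConfigMemory:
    """Toy multi-context configuration store (hypothetical organization)."""

    N_CONTEXTS = 4  # a 2-bit select signal distinguishes at most four contexts

    def __init__(self, words_per_context):
        self.contexts = [
            [0] * words_per_context for _ in range(self.N_CONTEXTS)
        ]

    def load(self, context, address, word):
        """Store one configuration word into the given context."""
        self.contexts[context][address] = word

    def active_word(self, context_select, address):
        """Return the word seen by the hardware for the broadcast select."""
        return self.contexts[context_select & 0b11][address]
```

Switching the broadcast select value changes which stored word drives the interconnect without reloading any configuration data.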
The ALU 120 is a basic 8-bit arithmetic logic processing unit. The
following operations are supported:
Input Invert--Prior to performing any of the following operations, either
or both of the ALU inputs, A_in or B_in, can be inverted.
Pass--Passes either A_in or B_in to Out. With the input
inversion, this operation can be a NOT.
NAND--Performs bitwise operation: (A NAND B). With input inversions this
can be an OR.
NOR--Performs bitwise operation: (A NOR B). With input inversions this can
be an AND.
XOR--Performs bitwise operation: (A XOR B). With input inversions this can
be an XNOR.
Shift--Shifts A or B either left or right one bit.
Add--Performs (A+B+C_in). C_in can be selected from 0, 1, or
C_out of an adjacent cell. Combined with the input inversion a
subtract can be made: (A-B)=(A+NOT(B)+1).
Multiply--Performs (A*B). Can also perform (A*B+X) and (A*B+X+Y), where X
and Y are special inputs. These operations are needed to create pipelined
multiply structures. Multiply operations require two cycles to fully
complete. The low byte is available on the first cycle and the high byte
is available on the second.
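The way input inversion extends the basic operations can be checked with a short 8-bit model. The function names below are illustrative and not part of the device; the multiply helper only splits the result into the low and high bytes mentioned above and does not model the two-cycle timing.

```python
MASK = 0xFF  # 8-bit datapath


def invert(x):
    return ~x & MASK


def nand(a, b):
    return ~(a & b) & MASK


def nor(a, b):
    return ~(a | b) & MASK


def add(a, b, c_in=0):
    """Return (sum, carry_out) for A + B + C_in on 8 bits."""
    total = a + b + c_in
    return total & MASK, total >> 8


def or_via_nand(a, b):
    # NAND with both inputs inverted is OR (De Morgan).
    return nand(invert(a), invert(b))


def and_via_nor(a, b):
    # NOR with both inputs inverted is AND (De Morgan).
    return nor(invert(a), invert(b))


def subtract(a, b):
    # A - B = A + NOT(B) + 1 in two's complement; carry-out is dropped here.
    result, _ = add(a, invert(b), 1)
    return result


def multiply(a, b, x=0, y=0):
    """Return (low_byte, high_byte) of A*B + X + Y."""
    product = a * b + x + y
    return product & MASK, (product >> 8) & MASK
```

Note that with 8-bit operands the largest value of A*B+X+Y is 255*255+255+255 = 65535, so the result always fits in the two bytes.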
The two network ports 122,124 feed addresses to memory ports A_ADR
and B_ADR. Data is fed to the memory 110 from Network port B via
the memory DATA port. A data multiplexor 126 selects either the feedback
path 118 or the network port B output. The outputs of network ports A and B
122, 124 can feed directly to the ALU 120 by configuring ALU input
multiplexors 128,130. The memory function port 112 controls the operation
of the data and ALU input multiplexors 126,128,130 via the control logic
114.
The BFU core is designed to be smoothly chained to other BFUs to form
wider-word ALU structures by properly configuring the control logic 134
via the ALU function port Fa. In order to accomplish this, the user must
specify the carry-chain of each datapath element as it travels through
multiple BFUs by setting the following bits in each of the BFUs:
LSB--Set to "1" marks the least-significant-byte position.
MSB--Set to "1" marks the most-significant-byte position.
Rightsource--Specifies the direction to the next least-significant-byte,
which can also be set to receive a carry from another source.
Leftsource--Specifies the direction to the next most-significant-byte,
which can also be set to receive a carry from another source.
The source selection can be one of the following:
North--North BFU.
East--East BFU.
South--South BFU.
West--West BFU.
Local--The local BFU's carry from the previous cycle.
Control Bit--The local Control Bit.
Zero--Constant Zero.
One--Constant one.
In addition, pipeline stages can be inserted into the carry chain by
specifying CarryPipeline to be "1". This will register the incoming carry
prior to its being used. This is important for addition operations,
because the carry-chain is limited by the clock period and the speed of
the adder.
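The effect of chaining byte-wide ALUs through their carry signals can be sketched as follows. The function `chained_add` is a hypothetical helper: it assumes the least-significant slice takes a constant-zero carry-in and that each slice takes its carry from the next less significant neighbor, and it does not model the CarryPipeline registers.

```python
def chained_add(a, b, width_bytes=4):
    """Add two integers using byte-wide ALU slices linked by a carry chain."""
    carry = 0  # the LSB slice is given a constant-zero carry-in
    result = 0
    for i in range(width_bytes):      # slice 0 is the least significant byte
        a_byte = (a >> (8 * i)) & 0xFF
        b_byte = (b >> (8 * i)) & 0xFF
        total = a_byte + b_byte + carry
        result |= (total & 0xFF) << (8 * i)
        carry = total >> 8            # carry-out feeds the next slice
    return result, carry              # final carry leaves the MSB slice
```

With four slices this behaves like a single 32-bit adder, for example carrying across three byte boundaries when 1 is added to 0x00FFFFFF.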
Based on this local information, the actual Shift and Add operations of the
ALU 120 have different effects. There are two main shift functions: Left
and Right. Left shifts move the bits towards the MSB, and right shifts
move the bits towards the LSB. Normally, the carry-in value is used to
fill the newly-created opening, but if the cell is an LSB or MSB, the new
bit is determined by additional information contained in the chosen shift
instruction. For left shifts, the LSB position will be different, while
for right shifts it will be the MSB position. The options are:
Force Carry--This option will override the LSB/MSB setting and force the
shift to use the carry-in from its designated source (Left/Rightsource).
This is useful for shift-rotations.
Skip Bit--This option will keep the same LSBit/MS