Claims
What is claimed is:
1. A monolithic, autonomous processor chip for selected data processing,
said chip including:
an array of at least 16 bit-serial multipliers;
a plurality of input/output ports for supplying selected complex data and
complex coefficients to said multipliers for multiplication;
means for receiving and latching complex data and coefficients supplied to
said multipliers and for parity checking of the data and coefficients;
a bit-serial adder/subtractor matrix connected to said multiplier array
for combining the multiplier output data results;
shift register means receiving and temporarily storing signals from said
adder/subtractor matrix and for serial to parallel conversion of stored
signals for delivery of said stored signals to input/output ports;
scaling control means for scaling said signals from said adder/subtractor
matrix before storage in said shift register means;
sequencer control means having control outputs connected to said multiplier
array, to said adder/subtractor matrix, to said shift register means, and
to said scaling control means for controlling internal step-by-step
operations of said chip;
mode controller means connected to said sequencer control means for
controlling and selecting the operational modes of said chip and for
activating said sequencer control means to provide repetitive sequencing
of selected operations, said mode controller being connectable for
communication with a host processor; and
address generator means on said chip and connected to said sequencer
control means and to said mode control means for selecting complex data
and complex coefficients for multiplication in said multipliers.
2. The processor chip of claim 1, wherein said plurality of input/output
ports are bit-parallel ports.
3. The processor chip of claim 2, wherein said bit-serial multiplier array
and said bit-serial adder/subtractor matrix cooperate to compute a radix-4
butterfly.
4. The processor chip of claim 1, further including external memory means
connected to said input/output ports for supplying data and coefficients
to said processor chip and for receiving processed signals.
5. The processor chip of claim 1, wherein each of said bit serial
multipliers includes a plurality of bit cells, each bit cell including at
least one full adder unit and at least one master/slave flip-flop shift
register.
6. An array of processor chips interconnected to provide independent
simultaneous operation for multiprocessing, comprising:
at least first and second processor chips each including a plurality of
input/output ports, each processor chip including an arithmetic section
for processing data in accordance with a predetermined algorithm;
memory means connected to said array for storage of data to be processed,
said input/output ports being connected in parallel with said memory
means;
control means including address generator means on each of said processor
chips for selecting data from said memory means for supply to the
arithmetic section of the corresponding processor chip by way of said
input/output ports, for processing said selected data, and for subsequent
return of processed data to said memory means; and
at least one of said chips being a redundant chip operating in parallel
with another chip to provide a check on the processed data produced by at
least one other of said chips.
7. A high performance, monolithic, autonomous signal processor chip for
computing digital signal processing algorithms based on the Fast Fourier
Transform, comprising:
a multiplier array of bit-serial complex multipliers for carrying out four
simultaneous bit-serial complex multiplications of four complex data words
with four complex coefficient words, each multiplier including 20 bit
slices to accommodate words up to 20 bits in length;
an adder/subtractor matrix connected to said multiplier array to receive
and combine output data from said bit serial complex multiplier array to
produce high-precision serial result signals;
result shift register means connected to said adder/subtractor matrix for
temporarily storing said result signals;
input/output means connected to said multiplier array for supplying said
selected data and coefficient words to said chip and connected to said
result shift register means for transferring said result signals out of
said chip;
an address generation logic circuit for selecting from an external memory
said data and coefficient words for multiplication by said multiplier
array;
a control sequencer connected to said multiplier array and to said address
generation control logic for operating said multiplier array to perform
said complex operations and to control all internal step-by-step
operations needed for said chip to process data; and
a mode controller connected to said control sequencer and to said address
generation control logic circuit for selective operation of said chip,
said mode controller being connectable to a chip external host computer,
said mode controller activating said control sequencer to initiate a
selected sequence of operations.
8. The signal processor chip of claim 7, wherein said input/output means
are connected to external RAM and ROM memories for storing said data words
and said coefficient words, respectively, said external RAM memory further
storing result signals transferred out of said processor chip.
9. The signal processor chip of claim 8, wherein said input/output means
include parallel input/output ports on said processor chip, said
input/output ports being connected to said external RAM and ROM memories
through address, data, and control buses.
10. The signal processor chip of claim 9, wherein said input/output means
on said processor chip are further connected to said address generation
control logic circuit, whereby data and coefficient words are selected
from said RAM and ROM memories.
11. The signal processor chip of claim 10, wherein said input/output means
on said processor chip are further connected to said mode controller on
said processor chip, said mode controller being selectively programmed for
a desired mode of operation, said mode controller further including a
control sequencer initializing circuit and a ring counter, said control
sequencer initializing circuit activating said ring counter.
12. The signal processor chip of claim 11, further including rounding
circuitry connected to said adder/subtractor matrix, for bit-serial
rounding of said result signals in said result shift register means.
13. The signal processor chip of claim 12, further including parity check
circuitry connected to said multiplier array.
14. The signal processor chip of claim 13, further including scaling
circuitry connected between said adder/subtractor matrix and said result
shift register means for scaling and parity generation on said result
signals.
15. The signal processor chip of claim 14, wherein said multiplier array,
address generation circuit, mode controller, control sequencer
initializer, control sequencer, adder/subtractor matrix, result shift
register means, scaling, rounding and parity means, and input/output means
are positioned on a single chip and closely spaced to permit extremely
short interconnections to produce a high-speed processor chip of extremely
small dimensions.
16. The signal processor chip of claim 10, wherein said result shift
register means includes a first set of result shift registers for
temporary storage of real result signals and a second set of result shift
registers for temporary storage of imaginary results, said real and
imaginary result signals being produced by the bit-serial
addition/subtraction in said adder/subtractor matrix of the bit-serial
complex multiplication from said multiplier array.
17. The signal processor chip of claim 7, wherein said multiplier array
comprises four, bit-serial complex multipliers, each having 20 bit slices,
each bit slice including four bit slice core segments and each core
segment incorporating a data latch for receiving a corresponding data bit,
a coefficient latch for receiving a corresponding coefficient bit, a
multiplier stage having a full adder, a sum-save static register, and a
carry-save static register, data word bits from said data latch and
complex coefficient bits from said coefficient latch being connected to
said multiplier stage for multiplication, the output signal from said
multiplier stage being connected to said carry-save register and through
said sum-save register to a core segment output line.
18. The signal processor chip of claim 17, wherein said data latch includes
a master/slave flip-flop circuit.
19. The signal processor chip of claim 7, wherein said adder/subtractor
matrix comprises a plurality of sum and difference networks connected to
said complex multiplier circuits.
20. The signal processor chip of claim 7, wherein said control sequencer
includes a counter means for controlling the sequence of operation of said
multiplier array.
21. The signal processor chip of claim 7, wherein said input/output means
are connected to external data, address, and control buses for connecting
said chip to external RAM and ROM memories for storage of data words and
coefficient words to be multiplied in said multiplier array, and for
storage of said result signals, said external buses being adapted for
connection to a host computer for supplying data words in said RAM memory.
22. The signal processor chip of claim 21, wherein said input/output means
is further connected to said mode controller for communication between
said mode controller and a host computer by way of said buses.
23. A processor chip array element, comprising:
first and second high-performance, monolithic, autonomous signal processor
chips for computing digital signal processing algorithms based on the Fast
Fourier Transform, each of said processor chips including a multiplier
array of bit-serial complex multipliers, an address generation control
logic circuit for selecting data and coefficient words for multiplication
by said multiplier array, a control sequencer for operating said
multiplier array to perform complex multiplication, an adder/subtractor
matrix connected to said multiplier to receive and combine output data
from said bit-serial complex multiplier array to produce serial result
signals, result shift register means for temporarily storing said result
signals, input/output ports connected to said multiplier array and to said
result shift register means, and a mode controller for selective operation
and control interfacing with a chip external host computer via said
input/output ports;
clock generator means connected to synchronously drive said first and
second processor chips;
first and second memory means for said first and second processor chips,
respectively;
first address and data bus means connecting the input/output ports of said
first processor chip to said first memory means for storing first result
signals from said first processor chip;
second address and data bus means connecting the input/output ports of said
second processor chip to said second memory means for storing second
result signals from said second processor chip; and
means connected to said first and second processors for comparing said
first and second processor result signals, whereby one of said first and
second processor chips serves as an active processor, and the other serves
to check the accuracy of the active processor.
24. The processor chip array element of claim 23, further including
interface means for connecting said first and said second address and data
bus means, whereby said chip array element can be connected in parallel
with additional, similar chip array elements, whereby multiple signal
processing can be carried on simultaneously in corresponding multiple chip
array elements.
Description
BACKGROUND OF THE INVENTION
The present invention relates, in general, to a high performance signal
processor, and more particularly to a signal processor for efficiently
computing digital signal processing algorithms based on the Fast Fourier
Transform.
The Fast Fourier Transform (FFT) is one of the most frequently used
algorithms in digital signal processing. It finds applications in digital
audio systems, radar and sonar signal processing, seismic systems, and
speech processing. These applications require numerical precision ranging
from 8 bits to 20 bits and, in some cases, require a floating point number
representation. The sampling rates for such processing vary from hertz to
megahertz. These broad and varied requirements have been difficult to
meet, and accordingly, prior devices for carrying out the FFT have been
large arrays, using multiple integrated circuit chips on a printed circuit
board. Such large arrays have operated under the control of a host
computer, burdening the computer and limiting the speed at which such
devices could operate. Since in many applications, such as radar signal
processing, speed is the primary objective of a processor, prior processor
systems and devices have been unduly limiting.
In addition to the limits on device speeds, many prior devices have
encountered difficulties in producing a high degree of numerical accuracy,
caused in part by the need to round off the intermediate processing
results during the processing operations. Such rounding, or truncation, of
intermediate results occurs after the parallel multipliers and/or adders
used in prior systems, where the least significant bits are eliminated,
thereby limiting the accuracy of the results. A further problem with such
prior devices is that they require a large amount of space (multiple chip
sets) to accommodate the arrays, which results in relatively high power
consumption. Finally, prior implementations require complicated and
sophisticated programming to enable those systems to work, whereas the
present invention has a simple three-line handshake protocol with a host
computer.
SUMMARY OF THE INVENTION
The present invention is directed to an autonomous, monolithic signal
processor chip incorporating network components and having a system
architecture which enables it to perform signal processing an order of
magnitude faster than was available with prior systems, with greater
numerical precision, and with greater reliability than was available with
prior systems.
The present signal processor, in one embodiment, is fabricated in a 2 μm CMOS
process and designed in VLSI (Very Large Scale Integration) technology to efficiently
compute digital signal processing algorithms based upon the Cooley-Tukey
Decimation-In-Time Radix-4 Fast Fourier Transform. One chip can be used as
a stand-alone peripheral in a microprocessor system, or a number of chips
can be combined into arrays in order to process signals with sampling
rates of several MHz. Using this VLSI monolithic chip invention, there are
substantial reductions in power needs and reductions in real estate needs
(i.e., circuit board space) over comparable previous implementations of the
FFT algorithm.
Although most monolithic processing systems designed to perform the Fast
Fourier Transform have been designed around the use of a single 16
bit × 16 parallel multiplier (see "A 2μ CMOS/LSI 32-point Fast
Fourier Transform Processor" B. L. Troutmans et al, Proc. 1982 IEEE ISCC,
pgs. 26-27 and 282-283; and Digital Signal/Array Processing Products, New
Product Information, Advanced Micro Devices, 901 Thompson Place,
Sunnyvale, Calif. 94086; and Electronic Design News, Nov. 10, 1983, page
256), such devices are complex and require numerous chips, which have the
disadvantages noted above. The present system takes advantage of a
bit-serial approach to overcome many of the prior device problems. Thus,
the present device uses sixteen 20 bit+20 bit bit-serial multipliers and
24 serial adders to compute more accurate results much faster than
comparable bit-parallel processors.
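To make the bit-serial idea concrete, the following is a minimal behavioural sketch in Python. It is not a model of the chip's carry-save circuit. It assumes, hypothetically, that the coefficient arrives one bit per clock, least significant bit first, in two's complement, while the data word is held in parallel. The function name and the `bits` parameter are illustrative only.

```python
def bit_serial_multiply(data, coeff, bits=20):
    """Multiply by consuming one coefficient bit per clock (LSB first).

    Assumption: coeff fits in `bits` bits of two's complement, so the
    most significant bit carries a negative weight.
    """
    mask = (1 << bits) - 1
    c = coeff & mask              # two's-complement bit pattern of coeff
    acc = 0
    for clock in range(bits):
        bit = (c >> clock) & 1    # the bit that arrives on this clock
        weight = 1 << clock
        if clock == bits - 1:     # sign bit of a two's-complement word
            weight = -weight
        acc += bit * data * weight
    return acc


print(bit_serial_multiply(3, -5))
```

One partial product is accumulated per clock, so a 20-bit coefficient takes 20 clocks; the hardware trades this latency for a much smaller multiplier cell, which is what allows sixteen of them on one chip.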
The present device operates with a very simple asynchronous control
interface with a host computer, so that the user can select a system
architecture which best suits a wide range of applications of the device.
Thus, the present signal processor can be a simple peripheral to a
microprocessor system which performs filtering on audio inputs or other
complex signals, or can be a part of a complex fault-tolerant array of
chips computing FFTs for radar or sonar applications. This is done while
maintaining a high level of numerical accuracy (signal-to-noise ratio). In
any iterative computation, there is always a loss of accuracy at each
iterative step when finite register lengths are used. The present
invention minimizes this loss of accuracy when contrasted with prior
comparable implementations. For example, in a representative 1024 point
FFT followed by an Inverse FFT (IFFT), this invention would yield in the
range of 88 dB of signal-to-noise ratio which is significantly better than
was available in prior FFT devices. Furthermore, the present device
maximizes the speed of operation, providing in the range of 15.4 million
multiplications per second in a very small, low power device.
The FFT algorithm is an efficient means of computing the Discrete Fourier
Transform (DFT) on a block of data. The block of data represents a finite
duration sequence (or signal) in time. Computing the DFT uniquely maps
this sequence into a frequency domain representation. The N words in the
DFT results are the values of the Z-transform at N equally spaced points
around the unit circle in the Z-plane. The DFT can be expressed as
follows:
##EQU1##
where the W_N terms are complex roots of unity, and
##EQU2##
The x(i) are the N words of the time domain signal and the X(k) are the N
words of the frequency domain transform. The DFT can also be written as a
matrix-vector multiplication which requires O(N^2) operations, where
"O()" is the notation for "Order of Magnitude". However, using the FFT
algorithm, the computational complexity can be reduced to O(N log N).
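The equation images referenced above did not survive in this text. For orientation, the textbook form of the DFT that matches the surrounding definitions is sketched below. The sign of the exponent is an assumption, since the original figures are not available here.

```latex
X(k) = \sum_{i=0}^{N-1} x(i)\, W_N^{\,ik}, \qquad k = 0, 1, \ldots, N-1

W_N = e^{-j 2\pi / N}
```

Written as a matrix-vector product, the matrix has entries W_N^{ik}, so evaluating it directly costs N complex multiplications for each of the N outputs, which is the O(N^2) count mentioned above.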
The FFT algorithm derives its regularity and flexibility from repeatedly
applying the same primitive arithmetic operation, called a "butterfly," to
the block of data. The two most common butterfly operations are the
radix-2 and the radix-4 butterflies, which can be described in matrix
equations.
The decimation-in-time radix-2 butterfly operates on two data inputs as
follows:
##EQU3##
where
##EQU4##
To form an N point FFT using a radix-2 butterfly processor, log_2 N
sequential scans are made through the data memory. Each scan requires N/2
butterfly operations.
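The scan and butterfly counts can be checked with a short Python sketch of a textbook decimation-in-time radix-2 FFT. This is a generic software formulation, not the chip's data path; the function names and the twiddle sign convention are assumptions.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_radix2(x):
    """Return (spectrum, scans, butterflies) for a power-of-two length."""
    n = len(x)
    bits = n.bit_length() - 1
    assert 1 << bits == n, "length must be a power of two"
    a = [x[bit_reverse(i, bits)] for i in range(n)]   # bit-reversed input order
    scans = butterflies = 0
    span = 1
    while span < n:                                   # one scan through memory
        scans += 1
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                top = a[start + k]
                bot = a[start + k + span] * w
                a[start + k] = top + bot              # radix-2 butterfly
                a[start + k + span] = top - bot
                butterflies += 1
        span *= 2
    return a, scans, butterflies


y, scans, butterflies = fft_radix2([complex(i, -i) for i in range(8)])
print(scans, butterflies)
```

For N = 8 this gives log_2 8 = 3 scans of N/2 = 4 butterflies each.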
A radix-4 butterfly replaces four radix-2 butterflies. The radix-4
operations can be expressed as:
##EQU5##
It should be noted that outputs X(1) and X(2) have been interchanged in
(3) from the "normal" formulation of the radix-4 butterfly. This allows
the FFT results to be arranged in "bit-reversed" order instead of radix-4
"digit-reversed" order.
The architecture of the present device is based on the efficient execution
of the radix-4 butterfly operation using bit-serial arithmetic hardware.
The radix-4 butterfly is preferred because it reduces the necessary
communication between the device and external memory by 50% compared with
radix-2, it allows four times the arithmetic parallelism to be carried out
in the processor, and it allows improvement in the numerical precision by
reducing errors caused by the rounding and scaling of data. In addition,
the invention implements the radix-4 butterfly in a unique architecture
which does not restrict the possible FFT sizes to only powers of 4.
Rather, the present invention can calculate any FFTs with sizes which are
any power of 2 between 4 and 16,384 (16K). This is a novel and unique
trait of the architecture, since it allows the outputs of the radix-4
butterfly to be produced in a "bit-reversed" ordered FFT result instead of
radix-4 "digit reversed" ordered result. The "digit-reversed" ordered FFT
results would normally be the result of a system based on the standard
radix-4 butterfly operation.
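The interchange of outputs X(1) and X(2), and the difference between the two output orderings, can be illustrated with the Python sketch below. The butterfly is the textbook radix-4 formulation with the two middle outputs swapped, written from the description rather than from the original equation image, so the exact coefficient convention is an assumption. All function names are hypothetical.

```python
def radix4_butterfly(x, w):
    """Multiply four data words by four coefficients, then combine.

    Output slots 1 and 2 are interchanged relative to the usual
    formulation, which is the 2-bit bit reversal of the output index.
    """
    t0, t1, t2, t3 = (xi * wi for xi, wi in zip(x, w))
    y0 = t0 + t1 + t2 + t3
    y1 = t0 - 1j * t1 - t2 + 1j * t3
    y2 = t0 - t1 + t2 - t3
    y3 = t0 + 1j * t1 - t2 - 1j * t3
    return [y0, y2, y1, y3]


def bit_reversed(i, n):
    """Reverse the binary digits of index i for a length-n transform."""
    bits = n.bit_length() - 1
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def digit_reversed_base4(i, n):
    """Reverse the base-4 digits of i; only defined when n is a power of 4."""
    digits, m = 0, n
    while m > 1:
        if m % 4:
            raise ValueError("n is not a power of 4")
        m //= 4
        digits += 1
    r = 0
    for _ in range(digits):
        r = r * 4 + i % 4
        i //= 4
    return r


print(radix4_butterfly([1, 2, 3, 4], [1, 1, 1, 1]))
print([bit_reversed(i, 8) for i in range(8)])
print(bit_reversed(1, 16), digit_reversed_base4(1, 16))
```

Bit reversal is defined for every power-of-two length, including N = 8, whereas base-4 digit reversal only makes sense when N is a power of 4. That is consistent with the text's point that bit-reversed output ordering goes together with support for any power-of-two size.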
The complement of the DFT algorithm, known as the Inverse DFT (IDFT), can
be used to transform a sequence in the frequency domain back to the time
domain. The formula for the IDFT is:
##EQU6##
where X(i) represents the N words of the frequency domain representation
to be transformed back into the N words of the time domain signal x(n).
This has the same form as the DFT equation (1), except that the W matrix
entries have been complex conjugated, and the result divided by N. This
allows an efficient inverse FFT algorithm to implement the IDFT using
hardware which is nearly identical to the FFT hardware, giving the chip of
the present invention a second, easily attainable mode of operation.
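The relationship between the two transforms can be sketched in a few lines of Python. This uses a direct O(N^2) evaluation for brevity rather than an FFT, and the function name and the `inverse` flag are illustrative assumptions; the point is only that the inverse differs by conjugated coefficients and a division by N.

```python
import cmath


def transform(x, inverse=False):
    """Direct DFT, or IDFT when inverse=True.

    The inverse uses the complex conjugate of each coefficient and
    divides the result by N; everything else is shared.
    """
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[i] * cmath.exp(sign * 2j * cmath.pi * i * k / n)
               for i in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


signal = [1 + 2j, -3j, 4.5, -1 - 1j]
roundtrip = transform(transform(signal), inverse=True)
print(max(abs(a - b) for a, b in zip(signal, roundtrip)))
```

In floating point the round trip recovers the input up to rounding error; on fixed-length hardware the residual error is what the signal-to-noise figures quoted earlier describe.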
The processor chip of the present invention interfaces directly with a host
computer and with suitable RAM and ROM for the storage of data and
coefficients external to the chip. The processor chip contains all the
control logic required to autonomously execute FFTs or other computations
without intervention by the host computer. Therefore, only "OPERATE",
"DONE", and "LOAD ASSIGNMENT" interface lines between the host and the
processor chip are required. After the host computer has loaded input data
into the memory, it activates the processor via the "OPERATE" line, and
the processor operates asynchronously under the control of a local
external clock which can have the processor operate at 50 MHz, for
example. Upon completion of its assignment, the device activates the
"DONE" line to the host computer, which responds by deactivating the
"OPERATE" line, and resetting the processor chip. This avoids burdening
the host computer with complex control functions, as was necessary in
prior FFT systems.
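The handshake described above can be pictured with a toy Python model. The class, its methods, and the use of a callable as the "assignment" are all hypothetical stand-ins; in particular, the exact role of the "LOAD ASSIGNMENT" line is assumed here to be the programming of the task before "OPERATE" is raised.

```python
class ChipModel:
    """Toy model of the three interface lines, not the actual circuit."""

    def __init__(self):
        self.assignment = None
        self.done = False                     # state of the "DONE" line

    def load_assignment(self, assignment):    # "LOAD ASSIGNMENT" (assumed role)
        self.assignment = assignment

    def operate(self, active, memory):        # "OPERATE" line driven by the host
        if active:
            # The chip works through memory on its own clock; the host is
            # not involved until DONE is raised.
            memory["result"] = self.assignment(memory["data"])
            self.done = True
        else:
            self.done = False                 # deactivating OPERATE resets the chip


def host_run(chip, memory, assignment):
    chip.load_assignment(assignment)
    chip.operate(True, memory)                # host raises OPERATE
    while not chip.done:                      # host only watches DONE
        pass
    chip.operate(False, memory)               # host drops OPERATE, chip resets
    return memory["result"]


chip = ChipModel()
print(host_run(chip, {"data": [1, 2, 3]}, lambda d: [2 * v for v in d]))
```

The host's side of the exchange is limited to loading memory, raising one line, and waiting for another, which is the reduction in host burden that the text emphasizes.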
The processor chip includes mode control logic which enables it to operate
on blocks of data containing up to 16K complex points in a programmed
sequence of up to five modes, including FFT, Inverse FFT, windowing,
multiplication, and scaling. These modes allow the operator to assign a
variety of signal processing tasks, including FFTs, IFFTs, Finite Impulse
Response (FIR) filtering, convolution and correlation. The device can be
quickly changed from one task to another, with differing data sizes, if
desired.
To accomplish the foregoing, the processor chip of the present invention
consists of arithmetic hardware, including an array of 16 bit-serial
multipliers and a bit-serial adder/subtractor matrix. The chip further
includes input latches for receiving data from an external memory through
input-output ports, and control logic, comprising an address generator
which provides the addresses necessary to extract data and coefficients
from memory and to provide the addresses for storing the results of the
computation. In addition, there is an internal control sequencer PLA
(Programmable Logic Array) which provides all the necessary signals to
operate the multipliers and the adder/subtractor matrix. The control
sequencer also generates the signals for controlling the data scaling,
rounding and parity generation and the shifting of the arithmetic results
into temporary shift register arrays before the results of the arithmetic
processing are returned to memory.
The processor chip of the invention can further be used as an element in an
array of processor chips operated by a host computer for providing
significantly increased processing speed. Such an array also provides
redundancy for the chips, to permit selected chips to perform a watchdog
function to detect errors in the data or address outputs of an active
chip, thereby providing a fault tolerant system.
More particularly, the device of the present invention consists of a
monolithic, autonomous signal processor chip having a multiplier array of
4 bit-serial complex multipliers, each including 20 bit slices, for
carrying out four simultaneous bit-serial complex multiplications of four
complex data words and four complex coefficient words, each complex word
being up to 20 bits in length. A control logic circuit including an
address generator on the processor chip selects the data and coefficient
words to be multiplied from suitable memory devices which may be RAM or
ROM devices external to the chip. A control sequencer on the chip drives
the multiplier array to perform the complex multiplication. Novel adder
circuitry is used in the multiplier array.
An adder/subtractor matrix including a plurality of sum and difference
networks is connected to the output of the multiplier array to receive and
combine the outputs from the array to produce high-precision real and
imaginary serial result signals which are temporarily stored in
corresponding real and imaginary result shift registers. Prior to such
storage, the result signals are bit-serially rounded by rounding circuitry
connected to the adder/subtractor matrix.
Input/output circuits on the processor chip are connected to the multiplier
array to supply data and coefficient words to be multiplied, and are
connected to the result shift registers for supplying result signals
temporarily stored in the shift register to external memory under the
control of the control logic circuit on the processor chip.
Scaler circuits may be provided for the adder/subtractor matrix for scaling
the result signals, and suitable parity check circuitry may be provided.
The multiplier array of four bit-serial complex multipliers includes 20 bit
slices, each of which consists of one corresponding bit slice core segment
from each multiplier. Each of the four core segments in a slice
incorporates a data latch and a coefficient latch for receiving
corresponding bits of the input data and coefficient words. Each core
segment also includes a multiplier stage having a full adder connected to
a sum-save static register and a carry-save static register, the data
latch and the coefficient latch being connected to the multiplier stage
through a partial bit generator. The data latches each include a
master/slave flip-flop circuit.
All of the elements of the processor are positioned on a single chip, and
are closely spaced to permit extremely short interconnections to provide a
low-noise, high-speed processor chip of extremely small dimensions. The
chip is connected through its input/output ports to external data,
address, and control buses to external read only memory (ROM) and external
random access memory (RAM), which store coefficients and data words,
respectively. The chip is also connected through the buses to a host
computer, which supplies the data words for use in the processor chip, and
which receives the results of the processing. The processor chip is driven
by a local clock which is independent of the host computer, so that the
processor operates asynchronously. This allows the processor to function
without burdening the computer.
BRIEF DESCRIPTION OF DRAWINGS
The foregoing and additional objects, features and advantages of the
present invention will be more clearly understood from the following
detailed description of a preferred embodiment thereof, taken in
conjunction with the accompanying drawings, in which:
FIG. 1 is a diagrammatic illustration of a system utilizing a single
processor chip in accordance with the present invention, the chip being
used as a peripheral processor for a host computer;
FIG. 2 is a diagrammatic illustration of the floorplan of the architecture
of the processor chip of the present invention;
FIG. 3 is a diagrammatic illustration of the data flow in the processor
chip of the present invention;
FIG. 4 is a diagrammatic illustration of the circuitry for the complex
multiplier and adder/subtractor arrays and associated circuitry for the
processor shown in FIG. 2;
FIG. 5 is a schematic diagram of a master/slave flip-flop used in the
multiplier array of FIG. 4;
FIG. 5A is a logic diagram of the flip-flop circuit of FIG. 5;
FIG. 6 is a schematic diagram of a full adder used in the multiplier array
of FIG. 4;
FIG. 7 is a diagram of a bit-slice core segment, four of which are used in each bit-slice
for each of the complex multipliers shown in FIG. 4;
FIG. 8 is a sum/difference cell used in the adder/subtractor matrix of FIGS.
2 and 4;
FIG. 9 is a diagram of the hierarchy of the functional controls carried out
in the control circuits of the device of FIG. 2;
FIG. 10 is a more detailed block diagram of the control sequencer
illustrated in FIG. 2;
FIG. 11 is a block diagram of a programmed logic array, used in the device
of FIG. 2;
FIG. 12 is a two phase clocking diagram for the PLA of FIGS. 10 and 11;
FIG. 13 is an example of one sum-of-products logic implementation using the
AND/OR planes of the PLA of FIG. 12;
FIG. 14 is an example of part of the schematic for a programmable logic
array;
FIG. 15 is a diagrammatic illustration of the generalized pipelined timing
diagram for the input/output of data, coefficients, and calculations into
and out of the processor chip of the present invention;
FIG. 16 is a diagrammatic illustration of a two-processor array element
which includes an "active" processor and its fault-detecting "watchdog"
processor; and
FIG. 17 is a diagrammatic illustration of an array system using a
multiplicity of the array elements of FIG. 16.
DESCRIPTION OF PREFERRED EMBODIMENT
Turning now to a more detailed consideration of the present invention,
there is shown in diagrammatic form in FIG. 1 a system 8 utilizing the
device of the present invention for signal analysis. A host computer 10
receives on line 11 samples of a signal to be analyzed, and supplies to a
Random Access Memory (RAM) 14, data obtained from the signal samples in
the form of complex words of up to 20 bits length. The coefficients
(W_N) required to compute an FFT are permanently stored in a Read-Only
Memory (ROM) 12; these coefficients are multiplied by the signal data in
performing the radix-4 FFT butterfly arithmetic. The data (in RAM 14) and
the coefficients (in ROM 12) may require from 4 to 16,384 memory points
each, depending upon the user's option, and accordingly each memory can be
as large as a 16K (16,384) word memory. The host computer 10 is
connected to the memories 12 and 14 and to the processor chip of the
present invention, illustrated at 16, by means of a data bus 18, over
which data is supplied to the memories by the host computer 10, over which
data and coefficients are transferred to the processor 16 for processing,
and over which results are returned to RAM memory 14 from processor chip
16. An address bus 20 is connected from the host computer 10 to memories
12 and 14 and to the processor 16 to place data in selected locations of
the RAM memory 14, in addition to allowing the host computer 10 to send
encoded programming information to the processor chip 16. It also allows
the processor chip 16 to select data sequentially from the memories 12 and
14 for processing by means of addresses generated in the processor chip
16, and to return the processed results to RAM memory 14. A control bus 22
is connected between the host computer 10, the processing chip 16, and the
memories 12 and 14 to permit the host computer 10 to enable and disable
the chip 16 and the memories 12 and 14.
Processor chip 16 is illustrated in diagrammatic form in FIG. 2 and its
data flow is illustrated in FIG. 3. The processor chip 16 is a monolithic
VLSI chip which may be constructed using a conventional 2 μm bulk CMOS
process with two layers of metallization, providing on the order of 62,000
transistor devices on a chip 7.5 mm × 7.5 mm in size. The processor
uses a 20 bit block floating point internal data representation and can
accept FFT inputs up to that degree of precision (20 bits) using a fixed
point representation. The processor chip shown in FIG. 2 includes
conventional parallel input/output (I/O) data ports indicated generally at
24 and 26 by which the chip is connected via data bus 18 (FIG. 1) to the
memories 12 and 14. Parallel data ports are used instead of bit-serial
input/output ports to increase the I/O bandwidth and to simplify the
interfacing requirements to the ROM 12 and RAM 14. This allows
simultaneous access of the real and imaginary components of complex data
words and coefficients. As shown in FIG. 3, which represents data flow in
the processor chip 16, I/O port 24 provides access to real components, and
I/O port 26 provides access to imaginary components. Twenty I/O terminals
are provided for each of the ports 24 and 26, for a total of forty
terminals. The combined use of dual I/O ports with the radix-4 algorithm
cuts the input/output time by a factor of four, when compared to
conventional radix-2 computations.
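The factor of four can be illustrated with a simple transfer count. The
following Python sketch is illustrative only and is not part of the
disclosed circuitry. It assumes that every pass of an in-place FFT reads
and writes each complex data word once, that coefficient fetches are
ignored, and that a single port moves one real or imaginary component per
cycle. The function names passes and io_cycles are hypothetical.

```python
def passes(n_points, radix):
    """Number of butterfly passes for an n_points FFT of the given radix."""
    count, size = 0, 1
    while size < n_points:
        size *= radix
        count += 1
    if size != n_points:
        raise ValueError("n_points must be a power of the radix")
    return count


def io_cycles(n_points, radix, ports):
    """Transfer cycles under the stated assumptions (a counting model only)."""
    components = 2                         # real and imaginary parts
    cycles_per_word = -(-components // ports)  # ceiling division
    words_per_pass = 2 * n_points          # one read and one write per word
    return passes(n_points, radix) * words_per_pass * cycles_per_word


radix2_single_port = io_cycles(1024, 2, 1)
radix4_dual_port = io_cycles(1024, 4, 2)
print(radix2_single_port, radix4_dual_port)
```

Under this model, half of the saving comes from moving the real and
imaginary components at the same time and half from the radix-4 algorithm
needing half as many passes over the data as a radix-2 algorithm.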
As shown in FIG. 2, processor 16 includes a multiplier array 28, having 16
bit-serial multipliers, which receives four complex data words and four
complex coefficients selected from RAM 14 and ROM 12, respectively, by way
of the I/O ports 24 and 26, where each complex word consists of 20 bits
real and 20 bits imaginary. The data words and coefficients are selected
by means of an address generation control logic circuit 30 on processor
chip 16 and are supplied to multiplier array 28 by bus 31, functionally
shown in FIG. 3, but not in FIG. 2, for simplicity of illustration. Four
complex multiplications are executed in the multiplier array 28 using the
16 bit-serial multipliers, under the control of a control sequencer 32,
driven by a local external processor clock 34 (FIG. 1). All four of the
complex multiplications are carried out simultaneously in array 28, and
the products are supplied by bus 35 (FIG. 3) for combination in a
bit-serial adder/subtractor matrix 36. The result signals from matrix 36
are temporarily stored in result shift registers 38 and 40, which store
the real and imaginary components of the results, respectively. A
rounding/parity generation circuit 42 rounds the adder/subtractor output
before temporary storage in the shift registers 38 and 40. Parity
generation is performed on the results in circuit 42 if the user programs the
processor 16 to use parity.
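The arithmetic performed by the multiplier array 28 and the
adder/subtractor matrix 36 can be sketched in software as follows. This
Python model is illustrative only. It works on whole words rather than
bit-serially, it omits rounding and scaling, and it assumes the forward
transform sign convention (multiplication by 1, -j, -1 and j). The names
complex_mul_4real and radix4_butterfly are hypothetical.

```python
REAL_MULTS_PER_COMPLEX = 4
MULTIPLIER_COUNT = 4 * REAL_MULTS_PER_COMPLEX  # 16, as in multiplier array 28


def complex_mul_4real(a_re, a_im, b_re, b_im):
    """One complex product formed from four real multiplications."""
    return (a_re * b_re - a_im * b_im, a_re * b_im + a_im * b_re)


def radix4_butterfly(data, coeffs):
    """data and coeffs are lists of four (real, imaginary) pairs."""
    # Stage 1: four complex multiplications, i.e. sixteen real products.
    products = [complex_mul_4real(d[0], d[1], c[0], c[1])
                for d, c in zip(data, coeffs)]
    (r0, i0), (r1, i1), (r2, i2), (r3, i3) = products
    # Stage 2: additions and subtractions only. Multiplying by -j or j
    # just swaps the real and imaginary parts with a sign change.
    y0 = (r0 + r1 + r2 + r3, i0 + i1 + i2 + i3)
    y1 = (r0 + i1 - r2 - i3, i0 - r1 - i2 + r3)
    y2 = (r0 - r1 + r2 - r3, i0 - i1 + i2 - i3)
    y3 = (r0 - i1 - r2 + i3, i0 + r1 - i2 - r3)
    return [y0, y1, y2, y3]


data = [(1, 2), (3, -1), (0, 4), (-2, 5)]
coeffs = [(1, 0), (0, 1), (2, -1), (1, 1)]
print(radix4_butterfly(data, coeffs))
```

The separation into a multiply stage followed by an add/subtract stage
mirrors the data flow from array 28 over bus 35 to matrix 36. The real
parts of the four outputs correspond to what result shift register 38
holds, and the imaginary parts to what result shift register 40 holds.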
A mode controller circuit 44 performs all the high level interfacing
control between the processor chip 16 and the host computer 10. It
receives external encoded programming information from host computer 10
via address bus 20. Internally, it exchanges control signals with
the control sequencer 32, the address generation control 30, and other
cells of the processor chip 16 as required.
The mode controller 44 controls the high level operation of the chip 16,
and controls the interfacing with the host computer. It includes five
major functional areas. First, it incorporates a small programmable logic
array (PLA) to provide high level control signals. Second, it provides a
bank of latches to store programming information received from the host
computer. Third, it includes a scan counter register, which counts the
number of iterations for the FFT implementation. Fourth, it includes a
column of logic circuitry to select the mode of chip operation and to
determine when the chip has finished its assigned tasks. Finally, it
includes a control sequencer initializer, which is a column of logic
circuitry to activate the control sequencer and to deactivate the control
sequencer when the chip has accomplished all of its programmed tasks. In
summary, the mode controller 44 represents a combination of conventional
logic circuitry with selected functions required to control the internal
operation of the chip and to interface with the external host computer.
Such controllers are generally known.
The processor chip 16 receives data and coefficient inputs in a 16-bit
fixed-point 2's-complement format, and since the signal processing is
executed in place, no unrecoverable overflow from the arithmetic process
is permitted. However, after the addition/subtraction of the matrix of
complex numbers in the radix-4 computation, a growth of three bits (binary
digits) per computation could occur in a few cases, expanding the result
words to more than 20 bits. For this reason some scaling is required and
is handled by the scaling control circuitry 43. In the present invention,
overflow is prevented by scaling the intermediate results as a set of
block floating-point numbers. Four extra guard bits are provided to allow
for growth so that the intermediate results are stored with at most 20
bits of precision. On average, at least one result will be stored with 18
bits of precision. That is, each bit-serial multiplication of a 20-bit
signed 2's-complement data word and coefficient produces a 39-bit
double-precision bit-serial data result and the sign bit. The bit-serial
results obtained in the multipliers 28 are summed in the bit-serial
adder/subtractor matrix 36 to form the result of the butterfly