|
Claims  |
|
|
What is claimed is:
1. A 2-D DCT circuit arrangement comprising:
an input permutation processor configured and arranged to reorder an input
sequence of data samples, wherein reordered data samples are represented
as a matrix;
a plurality of DCT processors coupled to the input permutation processor,
each DCT processor arranged to apply a 1-D DCT along an associated
extended diagonal of the matrix;
a polynomial transform processor coupled to the DCT processors, the
polynomial transform processor arranged to apply a polynomial transform to
output from the 1-D DCTs; and
an output permutation processor coupled to the polynomial transform
processor, the output permutation processor arranged to re-order output
data from the polynomial transform.
2. The circuit arrangement of claim 1, wherein the polynomial transform
processor includes:
a plurality of columns of storage elements, each storage element arranged
for storage of a polynomial, wherein storage elements in a first column
are arranged to receive respective output polynomials from the DCT
processors;
a plurality of radix butterflies, each radix butterfly arranged to receive
input polynomials from a pair of storage elements in one column of storage
elements and provide output polynomial to a pair of storage elements in
another column of storage elements.
3. The circuit arrangement of claim 2, wherein selected ones of the storage
elements are arranged as FIFO storage.
4. The circuit arrangement of claim 2, wherein selected ones of the storage
elements are RAM storage.
5. The circuit arrangement of claim 2, wherein selected ones of the storage
elements are FIFO storage, and selected others of the storage elements are
RAM storage.
6. The circuit arrangement of claim 1, the input permutation processor
including:
n sets of parallel-to-serial shift registers, each set including a first
parallel-to-serial shift register arranged to receive an input sample, and
other parallel-to-serial shift registers in the set coupled in cascade to
the first register; and
n sets of multiplexers, wherein input terminals of each multiplexer in a
set are coupled to serial output terminals of the parallel-to-serial shift
registers in a corresponding set.
7. The circuit arrangement of claim 6, wherein the polynomial transform
processor includes:
a plurality of columns of storage elements, each storage element arranged
for storage of a polynomial, wherein storage elements in a first column
are arranged to receive respective output polynomials from the DCT
processors;
a plurality of radix butterflies, each radix butterfly arranged to receive
input polynomials from a pair of storage elements in one column of storage
elements and provide output polynomials to a pair of storage elements in
another column of storage elements.
8. The circuit arrangement of claim 7, wherein selected ones of the storage
elements are FIFO storage.
9. The circuit arrangement of claim 7, wherein selected ones of the storage
elements are RAM storage.
10. The circuit arrangement of claim 7, wherein selected ones of the
storage elements are FIFO storage, and selected others of the storage
elements are RAM storage.
11. 2-D DCT circuit arrangement implemented on an FPGA, comprising:
a plurality of function generators arranged as an input permutation
processor to reorder an input sequence of data samples, wherein reordered
data samples are represented as a matrix;
a plurality of function generators arranged as a plurality of DCT
processors, each DCT processor coupled to the input permutation processor
and arranged to apply a 1-D DCT along an associated extended diagonal of
the matrix;
a plurality of function generators arranged as a polynomial transform
processor, the polynomial transform processor coupled to the DCT
processors and arranged to apply a polynomial transform to output
polynomials from the 1-D DCTs; and
a plurality of function generators configured to reorder output data from
the polynomial transform.
12. The circuit arrangement of claim 11, wherein the polynomial transform
processor includes:
a plurality of function generators arranged as columns of storage elements,
each storage element arranged for storage of a polynomial, wherein a first
column is arranged to receive respective sets of output polynomials from
the DCT processors;
a plurality of function generators arranged as radix butterflies, each
radix butterfly coupled to receive input polynomials from a pair of
storage elements in one column of storage elements and provide output
polynomials to a pair of storage elements in another column of storage
elements.
13. The circuit arrangement of claim 12, wherein selected ones of the
storage elements are function generators configured as FIFO storage.
14. The circuit arrangement of claim 12, wherein selected ones of the
storage elements are function generators configured as RAM storage.
15. The circuit arrangement of claim 12, wherein selected ones of the
storage elements are function generators configured as FIFO storage, and
selected others of the storage elements are function generators configured
d as RAM storage.
16. The circuit arrangement of claim 11, the input permutation processor
including:
a plurality of function generators configured as n sets of
parallel-to-serial shift registers, each set including a first
parallel-to-serial shift register arranged to receive an input sample, and
other parallel-to-serial shift registers in the set coupled in cascade to
the first register; and
a plurality of function generators configured as n sets of multiplexers,
wherein input terminals of each multiplexer in a set are coupled to serial
output terminals of the parallel-to-serial shift registers in a
corresponding set.
17. The circuit arrangement of claim 16, wherein the polynomial transform
processor includes:
a plurality of function generators arranged as columns of storage elements,
each storage element arranged for storage of a polynomial, wherein a first
column is arranged to receive respective output polynomials from the DCT
processors;
a plurality of function generators configured and arranged as radix
butterflies, each radix butterfly coupled to receive input polynomials
from a pair of storage elements in one column of storage elements and
provide output polynomials to a pair of storage elements in another column
of storage elements.
18. The circuit arrangement of claim 17, wherein selected ones of the
storage elements are function generators arranged as FIFO storage.
19. The circuit arrangement of claim 17, wherein selected ones of the
storage elements are function generators arranged as RAM storage.
20. The circuit arrangement of claim 17, wherein selected ones of the
storage elements are function generators arranged as FIFO storage, and
selected others of the storage elements are function generators arranged
as RAM storage.
21. A method for constructing a 2-D DCT circuit arrangement, comprising:
configuring a plurality of function generators as an input permutation
processor, wherein the input permutation processor reorders input
sequences of data samples, wherein reordered data samples are represented
as a matrix;
configuring a plurality of function generators as a plurality of DCT
processors, wherein each DCT processor is coupled to the input permutation
processor and applies a 1-D DCT along associated extended diagonal of the
matrix;
configuring a plurality of function generators as a polynomial transform
processor, wherein the polynomial transform processor is coupled to the
DCT processors and applies a polynomial transform to output from the 1-D
DCTs;
configuring a plurality of function generators as an output processor,
wherein the output processor re-orders output from the polynomial
transform.
22. The method of claim 21, further comprising:
configuring a plurality of function generators as columns of storage
elements, wherein each storage element is arranged for storage of a
polynomial, and a first column is arranged to receive respective output
polynomials from the DCT processors;
configuring a plurality of function generators as radix butterflies,
wherein each radix butterfly is arranged to receive input polynomials from
a pair of storage elements in one column of storage elements and provide
output polynomials to a pair of storage elements in another column of
storage elements.
23. The method of claim 22, further comprising configuring selected ones of
the storage elements as FIFO storage.
24. The me hod of claim 22, further comprising configuring selected on s of
the storage elements as RAM storage.
25. The method of claim 22, further comprising configuring selected ones of
the storage elements as FIFO storage, and configuring selected others of
the storage elements as RAM storage.
26. The method of claim 21, further comprising:
configuring a plurality of function generators as n sets of
parallel-to-serial shift registers, wherein each set includes a first
parallel-to-serial shift register coupled to receive an input sample, and
other parallel-to-serial shift registers in the set are coupled in cascade
to the first register; and
configuring a plurality of function generators as n sets of multiplexers,
wherein input terminals of each multiplexer in a set are coupled to serial
output terminals of the parallel-to-serial shift registers in a
corresponding set.
27. The method of claim 26, further comprising:
configuring a plurality of function generators as columns of storage
elements, wherein each storage element is arranged for storage of a
polynomial, and a first column is arranged to receive respective sets of
outputs from the DCT processors
configuring a plurality of function generators as radix butterflies,
wherein each radix butterfly is coupled to receive inputs from a pair of
storage elements in one column of storage elements and provide outputs to
a pair of storage elements in another column of storage elements.
28. The method of claim 27, further comprising configuring selected ones of
the storage elements as FIFO storage.
29. The method of claim 27, further comprising configuring selected ones of
the storage elements as RAM storage.
30. The method of claim 27, further comprising configuring selected ones of
the storage elements as FIFO storage, and configuring selected others of
the storage elements as RAM storage. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
FIELD OF THE INVENTION
The present invention generally relates to the 2-dimensional (2-D) discrete
cosine transform (DCT), and more particularly to implementing the 2-D
forward and inverse DCT on an FPGA using a polynomial transform.
BACKGROUND
The 2-D DCT is at the heart of many low-rate codecs for video compression.
For example, the DCT is an integral part of the MPEG and H.261 standards.
The DCT's time efficient computation is also of great interest in various
communications and multi-media applications.
There are several strategies available to the digital signal processing
(DSP) system engineer for realizing a DCT-based codec. One option is to
use a software programmable DSP processor such as the TMS320C5x from Texas
Instruments. This brings high flexibility to a design at the expense of
performance. At the other end of the implementation spectrum is an ASIC
solution, which provides high performance with little or no flexibility. A
third option includes field programmable gate arrays (FPGAs).
FPGAs offer high-performance without sacrificing design flexibility The
conventional technique for realizing a 2-D DCT is to exploit the transform
separability and decompose the problem into a sequence of 1-D
sub-problems. That is, first a 1-D DCT is performed on the rows, followed
by a 1-D DCT on the, columns. For high-resolution N.times.N-pixel
(N>=1024) color images, a parallel architecture that incorporates row
and column, processors as well as a matrix transposition engine must be
used to accommodate real-time data rates. Using distributed arithmetic (as
described in U.S. Pat. No. 3,77,130 entitled, "Digital Filter for PCM
Encoded Signals" to Croisier et al.) to implement a 1-D DCT on an FPGA can
greatly reduce the number of configurable logic blocks (CLB s) used for
the DCT.
While distributed arithmetic reduces the number of CLBs of an FPGA that are
used to implement the 2-D DCT, it is desirable for economic reasons to
further reduce the number of CLB used to implement the 2-D DCT.
SUMMARY OF THE INVENTION
The present invention includes a circuit arrangement that implements a 2-D
forward and inverse DCT using a polynomial transform. In one embodiment,
an input permutation processor reorders input sample data, wherein the
reordered data samples logically form a matrix. A plurality of 1-D DCT
processors are coupled to the input permutation processor. Extended
diagonals passing through the matrix reference data that are provided as
input to respective 1-D DCT processors, which operate in parallel. Output
data from the 1-D DCT processors are provided as input data to a
polynomial transform processor. The polynomial transform processor applies
a polynomial transform to data from the 1-D DCTs, and output data from the
polynomial transform processor are re-ordered in a prescribed order for
2-D DCT outputs.
In another embodiment the various processors and storage elements are
implemented on FPGA function generators. Since the 1-D DCT processors and
polynomial transform are multiplier free, usage of FPGA resources is
minimized.
Various embodiments are set forth in the Detailed Description and Claims
which follow.
BRIEF DESCRIPTION OF THE DRAWINGS
Various aspects and advantages of the invention will become apparent upon
review of the following detailed description and upon reference to the
drawings in which:
FIG. 1 is a functional block diagram of a system for performing a 2-D DCT
using a polynomial transform, according to one embodiment of the
invention;
FIG. 2. shows an example reordering of input data;
FIGS. 3A-H illustrate 8 extended diagonals of a matrix, where the data on
the diagonals are used to compute 1-D DCTs;
FIG. 4 shows the matrix-vector form of the 1-D DCT for the case N=8;
FIG. 5 is a functional block diagram of one embodiment of an input
permutation processor in combination with a 1-D DCT processor;
FIG. 6: is a functional block diagram of one embodiment of a polynomial
transform processor; and
FIG. 7 is a functional block diagram of a shift logic element.
DETAILED DESCRIPTION
The present invention uses a polynomial transform to compute the i2-D DCT
on an FPGA. The paper by J. Prado and P. Duhamel entitled "A Polynomial
Transform Based Computation of the 2-D DCT with Minimum Multiplicative
Complexity", IEEE International Conference on Acoustics, Speech and Signal
Processing, pp. 1347-1350, 1996 (hereinafter, "Prado") further describes
the mathematical basis of the polynomial transform of the 2-D DCT.
FIG. 1 is a functional block diagram of a system for performing a 2-D DCT
using a polynomial transform, according to one embodiment of the
invention. The system includes input-permutation processor 102, 1-D DCT
processor 104, polynomial transform (PT) processor 106, and output
permutation processor 108.
Input data is provided to input-permutation processor 102. For example, in
a video processing application the input data may be represented as an
8.times.8 matrix of pixel data. Input-permutation processor 102 reorders
the pixel data in accordance with the following known definition of a
permuted sequence:
Y.sub.n.sub..sub.1 , n.sub.2 =X.sub.2n.sub..sub.1 , 2n.sub.2
Y.sub.N-n.sub..sub.1 -1,n.sub.2 =X.sub.2n.sub..sub.1 +1, 2n.sub.2
Y.sub.n.sub..sub.1 ,N-n.sub.2 -1=X.sub.2n.sub..sub.1 , 2n.sub.2 +1
Y.sub.N-n.sub..sub.1 , N-n.sub.2 -1=X.sub.2n.sub..sub.1 +1,2n.sub.2 +1
where
n.sub.1, n.sub.2 =0, . . . N/2-1
The reordered data can be represented as a matrix in which the original
input data in the original matrix has been moved in accordance with the
above definition. Input-permutation processor 102 assembles the reordered
data into 8 independent memories (riot shown), each containing data
associated with selected diagonals of the matrix (see FIGS. 2 and 3A-H).
1-D DCT processor 104 is comprised of a set of units that perform the 1-D
DCTs in parallel using distributed arithmetic. In general, each unit
requires b+1 clock cycles to complete a 1-D N-point DCT, where b is the
number of bits in a data sample (e.g., 8). Since the polynomial transform
(PT) processing overlaps the first stage DCT calculation, the data
re-ordering must be completed in b+1 clock cycles so that processor 106 is
not left waiting for data. In the available b +1 cycles, N.sup.2 numbers
must be accessed by DCT processor 104. This is achieved by presenting the
data as polynomials of degree (N-1) to the input stage of DCT processor
104. Thus, for the common example case of N=8 with 8-bit data samples, a
64-bit data bus can be used to provide eight 8-bit samples at each clock
cycle.
Polynomial transform processor 106 implements the mathematically defined
polynomial. The following equation is one expression of a polynomial
transform:
##EQU1##
One embodiment for computing the polynomial transform is via a direct
evaluation of the equation above. However, this approach is not optimal in
that sub-expressions are not shared between the output terms, which
results in additional arithmetic, and therefore, additional hardware
resource requirements.
Analysis of the polynomial transform equation reveals that the output terms
share sub-expressions in much the same way that the output terms of a
discrete Fourier transform share common sub-terms. To produce an optimal
solution, from a hardware requirement reference, it is desirable to
compute each sub-term only once, and combine these components to produce
the final polynomial transform result. This can be achieved using a
radix-2 partitioning of the problem. The partitioning is very similar to
the standard radix-2 partitioning used for the common radix-2 Cooley-Tukey
(CT) fast Fourier transform. The CT algorithm employs operations on
elements in the complex field, while the fast polynomial transform, that
is the radix-2 based approach, operates on elements that are taken from
the polynomial field.
Output permutation processor 108 reorders the output to produce the final
result. As will be seen from FIG. 6, the output terms from the polynomial
transform processor 106 are output in a permuted fashion. That is, the top
output arm generates P.sub.0 (z), with successive output ports generating,
P.sub.4 (Z), P.sub.2 (Z), P.sub.6 (Z), P.sub.1 (Z), P.sub.5 (z), P.sub.3
(z) and P.sub.7 (z) respectively. The natural and preferred. ordering of
the results would be P.sub.0 (z), P.sub.1 (z), P.sub.2 (z), P.sub.3 (Z),
P.sub.4 (Z), P.sub.5 (z), P.sub.6 (z) and P.sub.7 (Z). Output permutation
processor 108 rearranges the permuted result to generate the naturally
ordered transform as the final result.
Processing blocks 102, 104, 106, and 108 operate in parallel. Thus, while
DCT processors 104 are operating on data set i, polynomial transform
processors 106 are operating on data set i-1, output permutation processor
is operating on data set i-2, and input permutation processor 102 is
operating on data set i+1.
FIG. 2 shows an example reordering of input data in accordance with the
permuted sequence defined above. Matrix 150 represents the input data, and
matrix 152 represents the reordered input data. The entries in the cells
in matrix 150 denote the row and column coordinates of the original input
data, and the entries in the cells in matrix 152 show how data referenced
by the original coordinates is reordered in matrix 152. For example, the
data of cell 0,1 in matrix 150 is moved to cell 0,7 in matrix 152, as can
be seen from "0,1" occupying cell 0,7 in matrix 152. In other words, cell
0,7 of matrix 152 has the data from cell 0,1 of matrix 150.
It will be appreciated that the address sequencing required to effect the
data reordering can be accomplished with an FPGA. An address generator can
be customized to satisfy the data access requirements without impacting
the processing time.
FIGS. 3A-H illustrate the 8 extended diagonals of matrix 152 that are used
to compute the 1-D DCTs. In one embodiment, all the 1-D DCTs are computed
in parallel. It will be appreciated that the 8 diagonals cover all the
cells in the matrix.
In FIG. 3A, the diagonal begins at cell 0,0 and progresses, by moving down
one row and over one column until column 7 is reached at cell 7,7. In FIG.
3B, the diagonal begins at cell 1,0 (row 1, column 0) and progresses by
moving down 5 rows and over 1 column. Note that after the last row is
reached, row counting circles back to row 0. FIGS. 3C-3H show the
remaining extended diagonals that comprise the other sets of sample inputs
to the 1-D DCTs.
FIG. 4 shows the matrix-vector form of the 1-D DCT for the case N=8. The
matrix-vector form is shown to illustrate application of the 1-D DCT using
the extended diagonals for FIGS. 3A-H and to illustrate operation of
polynomial, transform processor 106. It will be appreciated that each DCT
output sample Y.sub.k (k=0, 1, . . . 7), can be computed as a vector
dot-product of the basis function C.sub.k (k=0, 1, . . . 7), and input
samples X.sub.i (i=0, 1, . . . 7), where C.sub.k =cos(2.pi.k/32). That is,
Y.sub.0 =C.sub.0 X.sub.0 +C.sub.0 X.sub.1 +C.sub.0 X.sub.2 +C.sub.0
X.sub.3 +C.sub.0 X.sub.4 +C.sub.0 X.sub.5 +C.sub.0 X.sub.6 +C.sub.0
X.sub.7 ; Y.sub.1 =C.sub.1 X.sub.0 +C.sub.3 X.sub.1 +C.sub.5 X.sub.2
+C.sub.7 X.sub.3 -C.sub.7 X.sub.4 -C.sub.5 X.sub.5 -C.sub.3 X.sub.6
-C.sub.1 X.sub.7 ; etc. Note that:
##EQU2##
Recall that DCT processor 104 computes the 1-D DCTs along the extended
diagonals of the reordered data (example extended diagonals of a reordered
8.times.8 data set are illustrated in FIGS. 3A-H). DCT processor 104
includes a plurality of 1-D DCT processing units, each arranged to compute
the 1-D DCT of a respective one of the extended diagonals. For example,
relative to the data set of FIG. 3A, one processing unit computes Y.sub.0
=C.sub.0 X.sub.0,0 +C.sub.0 X.sub.2,2 +C.sub.0 X.sub.4,4 +C.sub.0
X.sub.6,6 +C.sub.0 X.sub.7,7 +C.sub.0 X.sub.5,5 +C.sub.0 X.sub.3,3
+C.sub.0 X.sub.1,1 ; Y.sub.1 =C.sub.1 X.sub.0,0 +C.sub.3 X.sub.2,2
+C.sub.5 X.sub.4,4 +C.sub.7 X.sub.6,6 -C.sub.7 X.sub.7,7 -C.sub.5
X.sub.5,5 -C.sub.3 X.sub.3,3 -C.sub.1 X.sub.1,1 ; . . . ; Y.sub.7 =C.sub.7
X.sub.0,0 -C.sub.5 X.sub.2,2 +C.sub.3 X.sub.4,4 -C.sub.1 X.sub.6,6
+C.sub.1 X.sub.7,7 -C.sub.3 X.sub.5,5 +C.sub.5 X.sub.3,3 -C.sub.7
X.sub.1,1 ; relative to FIG. 3B another unit computes Y.sub.0 =C.sub.0
X.sub.0,0 +C.sub.0 X.sub.2,2 +C.sub.0 X.sub.4,4 +C.sub.0 X.sub.6,6
+C.sub.0 X.sub.7,7 +C.sub.0 X.sub.5,5 +C.sub.0 X.sub.3,3 +C.sub.0
X.sub.1,1 ; Y.sub.1 =C.sub.1 X.sub.0,0 +C.sub.3 X.sub.2,2 +C.sub.5
X.sub.4,4 +C.sub.7 X.sub.6,6 -C.sub.7 X.sub.7,7 -C.sub.5 X.sub.5,5
-C.sub.3 X.sub.3,3 -C.sub.1 X.sub.1,1 ; . . . ; Y.sub.7 =C.sub.7
X.sub.0,0-C.sub.5 X.sub.2,2 +C.sub.3 X.sub.4,4 -C.sub.1 X.sub.6,6 +C.sub.1
X.sub.7,7-C.sub.3 X.sub.5,5 +C.sub.5 X.sub.3,3 -C.sub.7 X.sub.1,1 ; and so
on for FIGS. 3C-3H. It will be appreciated that even though the same
output sample names (e.g.,Y.sub.0) are used above for explaining
computation of the 1-D DCTs for the different extended diagonals, there
are separate data paths for the respective l1-D DCT output samples so that
Y.sub.0 of the extended diagonal of FIG. 3A is not confused with Y.sub.0
of the extended diagonal o FIG. 3B. In one embodiment, distributed
arithmetic is used to compute the 8 DCTs along the extended diagonals.
Unlike the row-column based DCT method, the present method does not require
a second set of 8 1-D DCTs. Instead, a DCT polynomial transform is applied
to the 1-D DCT outputs from DCT processor 104. The polynomial transform is
comprised adders and subtractors, and no multipliers, which reduces the
number of FPGA resources consumed.
FIG. 5 is a functional block diagram of one embodiment of one input
permutation processor in combination with a 1-D DCT processor. It will be
appreciated that for each row in the input data a separate combination
including permutation processor 102a and 1-D DCT processor 104a is
required. For example, when processing 8.times.8 data blocks, 8 pairs of
permutation processors and 1-D DCT processor are required.
Two columns of parallel-to-serial shift registers (PSRs) are provided for
double buffering of the input samples, x(n). While new data x(n) is being
loaded into one column of PSRs, e.g., 202, the data from the other column,
e.g., 204, is clocked out in a bit serial fashion to multiplexers M0, . .
. ,M7, and then on to DCT processor 104a. The PSRs can be implemented
using an array of flip-flops (FFs). It will be appreciated that each CLB
in a Xilinx FPGA, for example, provides 2 FFs. Thus, for N=8, 4 CLBs are
required to construct a single PSR.
Multiplexers M0, . . . , M7 can be implemented using CLB function
generators, wherein each function generator implements 2:1 multiplexer.
The 2:1 multiplexers can then be cascaded to implement the larger 8:1
multiplexers M0, . . . , M7. It will be appreciated that the multiplexers
do not occupy much FPGA real-estate since they are configured to operate
on 1-bit wide data. The select signals, for example, the select signal
online 206 to multiplexers M0, . . . , M7 are generated, state machine
208. In one embodiment, state machine 208 includes a 3-bit counter and
8-bit wide look-up table (LUT) for example. The counter provides the
addresses for the LUT and the LUT contents reflect the data access pattern
defined in FIGS. 3A through 3H.
FIG. 6 is a functional block diagram of one embodiment of polynomial
transform processor 106. The inputs are designated as T.sub.i (z) (for
i=0, 1, . . . 7), and the transform results in polynomial bit-reversed
order are designated as P.sub.k (z) (for k=0, 1, . . . 7). Polynomial
transform processor 106 include a plurality of columns 302, 304, 306, 308,
310, 312, and 31 of polynomial storage elements. The polynomial storage
elements of column 302 store the respective outputs from the 1-DCT
processors. Thus, each storage element includes storage for up to 8 1-D
DCT output samples using the 8.times.8 matrix example. For example,
T.sub.0 (z) defines the DCT output samples generated from the extended
diagonal shown in FIG. 3A (Y.sub.0, Y.sub.1, Y.sub.2, Y.sub.3, Y.sub.4,
Y.sub.5, Y.sub.6, Y.sub.7). Polynomial storage element 316 is the storage
element associated with T.sub.0 (z)
The PT processing primarily includes polynomial butterflies, along with
suitable multiplexing at various points in the structure to avoid the
introduction of any complex quantities and to implicitly handle
"multiplication" by the polynomial transform kernel. In this context,
"multiplication" does not refer to an arithmetic product, but rather
refers to appropriate indexing of the data and any required sign-change.
The sign changes are associated with the polynomial residue reductions
that are required in computing the polynomial transform, as known in the
art.
A parallel pipeline structure is used to match the latency of the first
stage 1-D DCT processing with the throughput of the PT processor. If, for
example, the 1-D DCT processing requires 9 clock cycles to execute, the PT
must be performed within 9 clock cycles to avoid idling.
At this juncture, a change in notation is made to transition from the 1-D
DCT output samples to the polynomial expression. Specifically, relative to
1-D DCT outputs generated from the extended diagonal of FIG. 3A, T.sub.0
(z)=a.sub.0 z.sup.0 +a.sub.1 z.sup.1 +a.sub.2 z.sup.2 +a.sub.3 z.sup.3
+a.sub.4 z.sup.4 +a.sub.5 z.sup.5 +a.sub.6 z.sup.6 +a.sub.7 z.sup.7, where
a.sub.0 =Y.sub.0, a.sub.1 =Y.sub.1, a.sub.2 =Y.sub.2, a.sub.3 =Y.sub.3,
a.sub.4 =Y.sub.4, a.sub.5 =Y.sub.5, a.sub.6 =Y.sub.6, and a.sub.7
=Y.sub.7. Similarly, relative to 1-D DCT outputs generated from the
extended diagonal of FIG. 3B, T.sub.1 (z)=b.sub.0 z.sup.0 +b.sub.1 z.sup.1
+b.sub.2 z.sup.2 +b.sub.3 z.sup.3 +b.sub.4 z.sup.4 +b.sub.5 z.sup.5
+b.sub.6 z.sup.6 +b.sub.7 z.sup.7 ; relative to FIG. 3C, T.sub.2
(z)=c.sub.0 z.sup.0 +c.sub.1 z.sup.1 +c.sub.2 z.sup.2 +c.sub.3 z.sup.3
+c.sub.4 z.sup.4 +c.sub.5 z.sup.5 +c.sub.6 z.sup.6 +C.sub.7 Z.sup.7 ; . .
. ; and relative to FIG. 3H, T.sub.7 (z)=h.sub.0 z.sup.0 +h.sub.1 z.sup.1
+h.sub.2 z.sup.2 +h.sub.3 z.sup.3 +h.sub.4 z.sup.4 +h.sub.5 z.sup.5
+h.sub.6 z.sup.6 +h.sub.7 z.sup.7.
Each polynomial storage element stores a sequence of N 1-D DCT output
samples, where N is the number of 1-D DCT output samples produced from an
extended diagonal. In one embodiment, selected ones of the polynomial
storage elements are implemented using FIFOs to minimize control overhead.
See, for example, Alfke, U.S. Pat. No. 5,758,192. Others of the polynomial
storage elements are implemented using dualport RAM. See, for example,
Freidin et al., U.S. Pat. No. 5,566,123. Thus, known structures for both
implementations are available for CLBs of an FPGA.
The polynomial storage elements for T.sub.0 (z)-T.sub.7 (z) are paired and
input to column 332 of polynomial butterflies. For example T.sub.0 (z) is
paired with T.sub.4 (z), and polynomial storage elements 316 and 334
provide input to butterfly 336, which in turn provides input to polynomial
storage elements 338 and 340 of column 304.
The PT processing stage consists of several columns of polynomial
butterflies. Unlike the butterflies that are used in many FFT algorithms,
the PT transform butterflies operate on polynomials as compared to complex
tuples. In addition, the butterflies do not include any multipliers, only
adders and subtractors. Adders and subtractors can be readily implemented
using FPGA function generators.
The first output from each butterfly is generated by adding the input
polynomials, and the second output is generated by subtracting the input
polynomials. For example, butterfly 36 generates a first output to storage
element 338 as a function of T.sub.0 (z)+T.sub.4 (z) and a second output
to storage element 340 as a function of T.sub.0 (z)-T.sub.4 (z). Thus,
storage element 338 receives output that is a function of (a.sub.0
+e.sub.0)z.sup.0 +(a.sub.1 +e.sub.1)z.sup.1 +(a.sub.2 +e.sub.2) z.sup.2
+(a.sub.3 +e.sub.3) z.sup.3 +(a.sub.4 +e.sub.4) z.sup.4 +(a.sub.5
+e.sub.5)z.sup.5 +(a.sub.6 +e.sub.6)z.sup.6 +a.sub.7 +e.sub.7)z.sup.7 ;
and storage element 340 receives output that is a function of (a.sub.0
-e.sub.0)z.sup.0 +(a.sub.1 -e.sub.1)z.sup.1 +(a.sub.2 -e.sub.2) z.sup.2
+(a.sub.3 -e.sub.3)z.sup.3 +(a.sub.4 +e.sub.4)z.sup.4 +(a.sub.5
-e.sub.5)z.sup.5 +(a.sub.6 -e.sub.6)z.sup.6 +(a.sub.7 -e.sub.7)z.sup.7.
Outputs P.sub.0 (z), P.sub.2 (z), P.sub.4 (Z), P.sub.6 (z) are generated
using the implementation described above as applied to columns 304, 306,
and 308 of polynomial storage elements and columns 342 and 344 of
butterflies. Outputs P.sub.0 (z), P.sub.2 (z), P.sub.4 (z) P.sub.6 (z),
however, are generated using butterflies in combination with additional
multiplexing and shift logic.
The additional multiplexing logic is illustrated with reference to the rows
of polynomial storage elements aligned with T.sub.4 (z) and T.sub.6 (z).
The additional multiplexing logic is used to produce new sequences for the
rows one iteration later in time. For example, in column 342 of
butterflies, rows 4 and 6 are combined using butterfly circuit 356 to
produce new sequences one iteration later. One example of a butterfly
computation is defined in the equations below:
Y'(4, i)=1/2(Y(4, i)-Y(6,8-i))
Y(6,i)=1/2(y(4,i)+y(6,8-i)) i=1,2,3
y'(4,8-i)=1/2(y(4,8-i)-y(6,i))
y(6,8-i)=1/2(y(4,8-i)+y(6,i)) i=1,2,3
The primed variables indicate the new outputs and the notation y.(i,j)
references the j-th element in the i-th row of a 2-D data set. It will be
appreciated that column elements 0 and 4 are not variables in these
equations, and therefore, must pass through the processing phase unaltered
in order to be available to later processing stages. The selection signal
applied to multiplexer 352 allows the coefficient for terms z.sup.0 and
z.sup.4 from polynomial storage elements 340 and 354 to bypass butterfly
356 and progress unaltered to the next stage (column 306). For all other
terms, the polynomial sum and difference are presented at the output of
multiplexer 352. The previous example has demonstrate the nature of the
calculations required to compute a sub-expression in the polynomial
transform. The function of the multiplexers was also explained. Similar
functionality, that is, to a pass a sum/difference calculation, or a
direct result, is needed at other nodes in the architecture shown in FIG.
6., this is the purpose of all of the multiplexers in the architecture.
The exact details are embodied in the mathematical description of the
polynomial transform given by Prado. The remaining multiplexer of PT
processor 106 perform comparable functions for other terms of the
polynomials in later stages of the transform. The basis for the bypasses
is found in the math described by Prado.
The mathematical definition of the polynomial transform requires that
certain intermediate results must be scaled by 1/2. It is well know that
this can be achieved in binary arithmetic by right shifting a binary value
by 1 digit position. Care must be taken to preserve the correct sign of
the scaled result. This can be achieved by simply copying the most
significant digit of the original word to the most significant digit
position in the scaled output result. Block 362 represents a 9-bit
register for storage of a scaled intermediate result.
PT processor 106 also includes shift logic elements 372 and 374. FIG. 7
illustrates a functional example of these shift logic elements. Shift
logic element 402 includes 4 data input ports: two 1-bit shift right
inputs and two 2-bit shift right inputs. Data applied to the 1-bit shift
right input ports is shifted right by one bit, and data applied to the
2-bit shift right input ports is shifted right by two bits. A selection
signal input is provided to select one of the four in puts as the output.
The data selections are made in accordance with the mathematical
description of the polynomial transform given by Prado. Relative to shift
logic element 372 of FIG. 6, the same data from polynomial storage element
376 is provided as input on lines 404 and 406 to shift logic element 402.
Similarly, the same data from polynomial storage element 378 is provided
as input on lines 408 and 410 to shift logic element 402. Thus, shift
logic element 372 provides as output one of data from storage element 376
shifted right by one bit, data from storage element 376 shifted right by
two bits, data from storage element 378 shifted right by one bit, or data
from storage element 378 shifted right by two bits.
The present invention is believed to be applicable to a variety of 2-D DCT
applications, including video and audio encoding. While the present
invention is not so limited, an appreciation of the present invention has
been provided by way of specific examples involving 8.times.8 matrices of
data. Other aspects and embodiments of the present invention will be
apparent to those skilled in the art from consideration of the
specification and practice of the invention disclosed herein. It is
intended that the specification and illustrated embodiments be considered
as examples only, with a true scope and spirit of the invention being
indicated by the following claims.
* * * * *
|
|
|
|
|
Description  |
|