|
Description  |
|
|
BACKGROUND
1. Field of the Invention
The present invention relates generally to motion estimation, and more
specifically to a programmable architecture and methods for motion vector
and/or prediction error determination.
2. Description of Related Art
Applications such as video telephone, digital television, and interactive
multimedia using such digital storage technology as CD-ROM, digital audio
tape, and magnetic disk require digital video coding, or video
compression, to achieve the necessary high data transfer rates over
relatively low bandwidth channels. Various standards have been proposed
for video coding. A standard for the storage and transmission of still
images has been adopted by the International Standards Organization
("ISO"), Joint Photographic Expert Group ("JPEG"); see "JPEG Technical
Specification, Revision 5," JPEG-8-R5, January 1980. A standard for
digital television broadcast coding at 30/45 Mb/s is under consideration;
see CCIR-CMTT/2, "Digital Transmission of Component-Coded Television
Signals at 30-34 Mb/s and 45 Mb/s Using the Discrete Cosine Transform,"
Document CMTT/2-55. A standard for video telephony and video conferencing
at 64 to 1920 kb/s has been adopted by the International Consultative
Committee for Telephone and Telegraph ("CCITT"); see "Draft Revision of
Recommendation H.261," Document 572, CCITT SG XV, Working Party XV/1,
Spec. Grp. on Coding for Visual Telephony. A standard for storage
applications below 1.5 Mb/s, which are similar to the applications
targeted by the CCITT standard, is under consideration by the Moving
Picture Experts Group ("MPEG") of the ISO. Video coding algorithms have
been proposed as contributions to the standardization activity of
ISO/MPEG; see Wong et al., "MCPIC: A Video Coding Algorithm for
Transmission and Storage Applications," IEEE Communications Magazine,
November 1990, pp. 24-32.
Many video coding techniques include a predictive mode that realizes data
compression between two different video frames by identifying how a frame
is unlike a preceding frame. In predictive mode, the frame is represented
in terms of a set of vectors of the displacement of respective groups of
pixels in the frame relative to their position in the preceding frame,
known as motion vectors; and difference information representing the
degree of difference between the displaced pixels and the corresponding
pixels in the preceding frame. Because the amount of data in the set of
motion vectors and difference information tends to be considerably less
than the amount of data in the frame itself, the two frames are adequately
represented by the considerably less data present in the preceding frame
plus the motion vectors and difference information. When the frame is
required in uncompressed form, it is reconstructed by applying the motion
vectors and difference information to the preceding frame.
Because effective video coding requires the intimate integration of digital
video compression technology, integrated circuit technology, and digital
storage media, and as various standards for digital video compression
exist and are proposed, a need has arisen for a flexible, high
performance, low implementation cost programmable architecture for motion
estimation.
SUMMARY OF THE INVENTION
The present invention is advantageous in many respects. For example, the
programmability aspect of the present invention enables support of future
algorithms, and allows the addition of customer-proprietary optimizations
and algorithms.
These and other advantages are achieved in the present invention, which in
one embodiment is an apparatus for performing an arithmetic operation on
groups of pixels under program control having two memories and an
arithmetic unit. One of the memories has a plurality of addressable
locations N pixels in width and a read port, and N pixels from any one of
the addressable locations are accessible in parallel on the read port
during an address cycle. The other memory has a plurality of addressable
locations greater than N pixels in width and a read port, and any N
contiguous pixels from any one of the addressable locations are accessible
in parallel on the read port during an address cycle. The arithmetic unit
is connected to the two memory ports.
In another embodiment, a memory is included that has a plurality of
addressable locations greater than N pixels in width and two read ports,
and any N contiguous pixels from any one of the addressable locations are
accessible in parallel on each of the read ports during an address cycle.
The arithmetic unit is connected to the two ports.
In an embodiment of a pixel-group random access memory, a memory array has
a plurality of addressable locations N pixels in width and a read port,
and N pixels from any one of the addressable locations and N pixels from
an adjacent addressable location are accessible in parallel on the read
port during an address cycle. A shifter has its input coupled to the read
port, and provides N pixels on its output.
In several method embodiments, groups of pixels are read from two memory
ports, at least one of which is pixel-group random addressable, and used
to determine sums of absolute differences, pixel differences, and pixel
averages.
BRIEF DESCRIPTION OF THE DRAWINGS
In the Figures, where like reference numerals indicate like parts,
FIG. 1 is a schematic diagram showing the relationship between an image
block and search blocks within a search window;
FIG. 2 is a block diagram showing a motion estimation architecture in
accordance with the present invention;
FIG. 3A is a block diagram representing a memory arrangement for full pixel
motion estimation;
FIG. 3B is a block diagram representing a memory arrangement for one-half
and one-quarter pixel interpolation;
FIGS. 4 and 5 are pictorial representations of full pixel positions of a
search matrix and their relationship to interpolated pixels at one-half
and one-quarter pixel displacements;
FIGS. 6 and 7 are schematic representations of the configuration of various
memories in the architecture of FIG. 2;
FIG. 8 is a block diagram of an address generator;
FIG. 9 is a block schematic diagram of a pixel-group random access memory
useful in the datapath of the architecture of FIG. 2;
FIGS. 10A and 10B are a schematic representation of a portion of the
pixel-group random access memory of FIG. 9;
FIGS. 11A-11D are a block schematic diagram of a funnel shifter and
transpose network useful in the datapath of the architecture of FIG. 2;
FIG. 12 is a schematic representation of the pixel-group random access
memory of FIG. 9 useful in explaining the operation of the funnel shifter
of FIG. 11;
FIG. 13 is a block schematic diagram of another memory useful in the
datapath of the architecture of FIG. 2; and
FIG. 14 is a block schematic diagram of an arithmetic logic unit useful in
the datapath of the architecture of FIG. 2.
DETAILED DESCRIPTION OF THE PREFERRED AND OTHER EMBODIMENTS
Motion vector searching typically involves comparing an input or image
block with search blocks within a search window centered on the frame
location of the image block. The image block 12 shown in FIG. 1 is
obtained, for example, from a video input device 10, which may be a video
camera, video transmission, preframe video memory, or the like. The image
block 12 may be any convenient size; 16.times.16 pixels is exemplary. The
search window 24 is obtained typically from a frame memory 20, in which a
previously processed frame is stored. The search window 24 is
approximately centered on the location of the image block 12. In FIG. 1,
the search block 22 (shown in solid line) represents the zero displacement
search block. The search blocks of the search window 24 are generally of
the same size as the image block 12. The search window 24 is defined by an
illustrative displacement of the search block 22 eight pixels to the left
(block 26.1, outlined in a fine phantom line) and seven pixels to the
right (block 26.2, outlined in a coarse phantom line), seven pixels up,
and eight pixels down. In this embodiment, the size of the search window
24 for a full pixel search is 31.times.31. A larger search window 24 may
be used if more memory is available.
The image block 12 is successively compared in comparator 30 with the
search blocks in the search window 24, and is represented for storage or
transmission by displacement data, or motion vectors, and by difference
information, or prediction error data, based on the closest matching
search block in the search window 24.
Typically, although not necessarily, luminance information is used for
motion vector searching. The size of the basic luminance information unit
is somewhat discretionary, and generally depends on the application and
design choice. For example, in the embodiment of a vision processor
described in detail in the above-referenced patent document of Fandrianto
et al. entitled "Vision Processor," which is incorporated herein by
reference, the basic video information processing unit, or macroblock, is
a 16.times.16 pixel luminance matrix.
An illustrative programmable architecture 100 for implementing motion
vector searching is illustrated in FIG. 2. Rapid and efficient motion
vector searching is accommodated by two high-speed, multi-ported register
files in the datapath of the architecture 100: an image block, best match
block memory conveniently referred to as DP memory 124, and a search
memory conveniently referred to as DPCM memory 130. The memories 124 and
130 are configured in an advantageous manner based on the desired video
information block size and on the critical operations required of the
architecture 100 in executing certain widely accepted current standards
and possible future standards. Other important data path elements for
motion vector estimation include two funnel shifters 140 and 144, an
arithmetic logic unit ("ALU") 154, and a tree adder 156. Shifter 140 is
connected to port A of the DPCM memory 130, and shifter 144 is connected
to port B of the DPCM memory 130. ALU 154 receives pixel data items from
shifters 140 and 144 and from the DP memory 124.
The video information stored in the DP memory 124 and the DPCM memory 130
are stored as blocks. A basic configuration for an motion vector searching
on an integral full pixel basis is shown in FIG. 3A. Two blocks are stored
in the DP memory 124, a "P" or preframe block 80 and a "D" block 82.
Illustratively, each block is 16.times.16 pixels, so that the minimum
memory size is 16.times.32.times.8 bits, for example. The entire search
window 24 is copied into the DPCM memory 130 from frame memory 20.
Illustratively, to contain the search window 24, the size of DPCM memory
130 need be 31.times.31.times.8 bits.
For motion vector searching on a full integral pixel basis, the best block
match is determined among all search blocks in the search window 24 by a
minimum sum of absolute block differences criteria. For each search block,
a minimum sum of absolute block differences is determined in accordance
with the expression
##EQU1##
in which P.sub.ij (i,j=1, . . . , 16) are the pixels of the image block 12
stored in P block 80 of DP memory 124, and W.sub.Xo+i,Yo+j (i,j=1 , . . .
, 16) are the pixels of a given search block having an origin
X.sub.o,Y.sub.o in the search window 24, where X.sub.o,Y.sub.o are in the
illustrative range of -8 through +7. Note that the center search block 22
(FIG. 1) is at X.sub.o,Y.sub.o =0,0.
The minimum sum of absolute block differences of expression (1) is
implemented in the architecture of FIG. 2 as follows, although other
implementations may be realized. The image block 12 is read into the P
block 80, while the entire search window 24 is read from an external
memory (not shown) into the DPCM memory 130, overwriting any previously
stored search window. For each search block in the search window 24, the
differences between the pixels stored in P block 80 and the current search
block is determined in ALU 154, and summed by tree adder 156. Two hundred
fifty-six differences are computed per each sum. The sum for the current
search block is compare with the minimum sum of absolute block differences
stored in a register (not shown) in controller 102, and substituted
therefor if less, along with the search block identity. These steps are
repeated until all search blocks of the search window 24 stored in DPCM
memory 130 have been compared with the image block 12 stored in the P
block 80 of the DP memory 124, at which point the minimum sum of absolute
block differences and the search block to which it corresponds, known as
the best match block, have been identified.
If no half-pixel or finer interpolation is to be done, the motion vector is
known simply from the spatial identity of the best match block, while the
prediction error is determined as follows. The prediction error is the
difference between the best match block stored relative to location
W.sub.Xb,Yb and the image block stored in P block 80, or
PE.sub.ij .DELTA. W.sub.Xb+i,Yb+j -P.sub.ij (2)
for i,j=1, . . . , 16. This calculation is performed in the ALU 154, and
the results, the prediction error, are written into the D block 82 of the
DP memory 124. Note that the calculation for expression (2) was previously
performed in the implementation of expression (1), and could have been
written into the D block 82 of the DP memory 124 or into other memory
rather than recalculated. In the architecture of FIG. 2, however,
performing the calculation for expression (2) is generally faster than
multiply writing into the D block 82 during the calculation of expression
(1) while avoiding the need for additional memory.
In the case of integral pixel motion estimation, only part of the DPCM
memory 130 needs to be updated, as the search area for the next preframe
block typically overlaps with the search area of the current preframe
block. For example, where the preframe block size is 16.times.16,
typically only 16 new columns of the DPCM memory 130 need to be brought in
from the external memory. This compares favorably with the alternative of
bringing in 31 new columns for every new search area. As the search area
of the DPCM memory 130 increases, this technique becomes increasingly
useful for reducing the external memory bandwidth requirements and overall
system cost. The PRAM (Pixel-group Random Access Memory) addressing mode
allows addressing the DPCM memory 130 in such a way that any random N
contiguous pixels can be accessed in parallel from a memory array of size
greater than N pixels in width. It will be appreciated that a minor
additional complexity introduced by bringing in only part of the new
search area is that the starting address of the search area shifts by a
fixed amount in the horizontal direction.
For improved prediction accuracy, half-pixel estimation and quarter-pixel
estimation are performed after the integral pixel estimation. In motion
vector searching with one-half or one-quarter pixel accuracy, the DPCM
memory 130 is used to store a search window 24 that is in large part an
interpolated search matrix generated from a best match search block from a
less accurate estimation operation.
A basic configuration of the DP memory 124 and the DPCM memory 130 for
half-pixel estimation is shown in FIG. 3B. As in full pixel estimation,
two blocks are stored in the DP memory 124, the P block 80 and the D block
82, and each block is 16.times.16 pixels. Somewhat more than four blocks
are stored in the DPCM memory 130, however. An "X" block 70 receives the
best match search block and surrounding pixels loaded from the frame
memory 20 or from the block stored relative to the location W.sub.Xb,Yb.
"A" block 72, "B" block 74, and "C" block 76 are interpolated from the X
block 70 and used in half pixel and, later, quarter pixel estimation, as
described below. Illustratively, the X block 70 is 18.times.18 pixels, the
A block 72 is 18.times.17 pixels, the B block 74 is 17.times.18 pixels,
and the C block 76 is 17.times.17 pixels, so that the preferable minimum
memory size of the DPCM memory 130 for half-pixel estimation is
35.times.35.times.8 bits, for example.
A conceptual representation of an illustrative interpolated search matrix
400 is shown in FIG. 4. The matrix 400 comprises four completely
interleaved matrices corresponding to blocks 70, 72, 74 and 76,
respectively containing pixels X.sub.r,c, A.sub.r,c, B.sub.r,c, and
C.sub.r,c, wherein "r" is the row number and "c" is the column number. The
number of rows and columns in each of the interleaved X, A, B and C
matrices is dependent on the application and to some extent design choice.
In an illustrative arrangement useful in the vision processor disclosed in
the aforementioned patent document of Fandrianto et al. entitled "Vision
Processor," the 16.times.16 pixels of a best match search block 71 from a
motion vector search on an integral full pixel basis are shown as matrix
elements X.sub.1,1 through X.sub.16,16, bounded within the region 402 in
FIG. 4 by a double line. Note that the full X block 70 includes X.sub.r,c
(r=0, . . . , 18; c=0, . . . , 18) pixels. Pixels X.sub.0,0 through
X.sub.0,17, X.sub.0,0 through X.sub.17,0, X.sub.17,0 through X.sub.17,17,
and X.sub.0,17 through X.sub.17,17 are adjacent the best match search
block 71 and are copied into X block 70 of the DPCM memory 130 to allow
interpolation of fractional pixel positions about all of the pixels of the
best match search block 71. Pixels A.sub.r,c (r=0, . . . ,17; c=0, . . . ,
16) of the A block 72 are horizontally interpolated at half-pixel
locations, from the X block 70. Pixels B.sub.r,c (r=0, . . . , 16; c=0, .
. . , 17) of the B block 74 are vertically interpolated at half-pixel
locations, from the X block 70. Pixels C.sub.r,c (r=0, . . . , 16; c=0, .
. . , 16) are pixels interpolated at half-pixel locations, preferably
vertically from the A matrix 72 or horizontally from the B matrix 74, but
may also be interpolated diagonally from the X matrix 70. The D block 82
and the P block 80 stored in the DP memory 124 are each 16.times.16
pixels. Block 70 was present during the motion vector search on an
integral full pixel basis, and is merely relocated in the DPCM memory 130,
to reduce external memory bandwidth.
The A block 72, the B block 74, and the C block 74 are interpolated as
follows. The A block 72 is formed by progressive horizontal interpolations
of the X block 70. A pixel group from a row of the X block 70 is addressed
on both ports A and B of the DPCM memory 130. Accordingly, the same pixel
group is loaded into both shifters 140 and 144. One of the pixel groups is
shifted one pixel; for example, the pixel group in funnel shifter 144 is
shifted one pixel, or eight bits, to the right. The unshifted output from
funnel shifter 140 and the one-pixel right shifted output from the funnel
shifter 144 are presented to respectively the A and B inputs of the ALU
154, which performs a divide by two and a rounding off. The result is
routed from the ALU 154 into appropriate address locations of the A block
74 in the DPCM memory 130. This process is continued until the entire
horizontal interpolation of the X block 70 is complete and the entire A
block 72 created.
The B block 74 is formed by progressive vertical interpolations of the X
block 70. A pixel group from a row of the X block is addressed on port A
of the DPCM memory 130, and a pixel group from an immediately adjacent row
of the X block 70 having the same column locations is addressed on port B
of the DPCM memory 130. The pixel groups on ports A and B pass through
funnel shifters 140 and 144 without being shifted, and are presented to
respectively the A and B ports of the ALU 154. The ALU 154 performs a
divide by two and a rounding off, and the result is routed into
appropriate address locations of the B block of the DPCM memory 130. This
process is continued until the entire vertical interpolation of the X
block 70 is complete and the entire B block 74 created.
The C block 76 is formed by progressive interpolation of preferably either
the A block 72 vertically, or the B block 74 horizontally. Alternatively,
progressive interpolation of the X block diagonally may be done.
Horizontal and vertical interpolation are described above in the context
of the A block 72 and the B block 74. In diagonal interpolation of the X
block 70, one pixel group from the X block 70 is addressed on port A of
the DPCM memory 130, and a pixel group from an immediately adjacent row of
the X block 70 having the same column locations is addressed on port B of
the DPCM memory 130. One of the pixel groups is shifted one pixel; for
example, the pixel group in funnel shifter 144 is shifted one pixel, or
eight bits, to the right. The unshifted output from funnel shifter 140 and
the one-pixel right shifted output from the funnel shifter 144 are
presented to respectively the A and B inputs of the ALU 154, which
performs a divide by two and a rounding off. The result is routed from the
ALU 154 into appropriate address locations of the C block 76 in the DPCM
memory 130. This process is continued until the entire horizontal
interpolation of the X block 70 is complete and the entire C block 76
created.
Once the search matrix 400 is generated, motion vector searching on a
half-pixel basis is similar to motion vector searching on a full-pixel
basis, as described in association with expression (1) above. Note,
however, that because the X block 70 is 18.times.18 rather than
16.times.16, the interpolated A block 72 is effectively two interpolated
16.times.16 blocks, the interpolated B block 74 is effectively two
interpolated 16.times.16 blocks, and the interpolated C block is
effectively four interpolated 16.times.16 blocks. The DPCM memory 130 must
be carefully addressed to properly read these eight interpolated
16.times.16 blocks. Once all search blocks of the search matrix 400 stored
in DPCM memory 130 have been compared with the image block 12 stored in
the P block 80 of the DP memory 124, the minimum sum of absolute block
differences resides in controller 102, along with the identity of the
search block to which it corresponds, known as the best match block.
If no quarter-pixel interpolation is to be done, the motion vector is known
simply from the spatial identity of the best match block, while the
prediction error is determined as described above in association with
expression (2). The calculation is performed in the ALU 154, and the
results, the prediction error, are written into the D block 82 of the DP
memory 124.
Motion vector searching on a quarter-pixel basis is similar to motion
vector searching on a full-pixel basis, except that an absolute block
difference is determined from a comparison of the P block 80 with a memory
block that contains pixels displaced one-quarter pixel from the best match
block. Various interpolation techniques may be used in the derivation of a
given quarter pixel estimation block, depending on various factors such as
the amount of memory available and the size of the memory ports and data
buses. The following technique is suitable for the architecture of FIG. 2,
although other techniques may be used if desired.
Generally, the best matched block (which is either a full pixel block or a
half pixel interpolated block) is copied from the DPCM memory 130 into a
free block of the DP memory 124, which at this point in the process may be
the D block 82 or any additional memory block such as 84 (shown in phantom
in FIG. 3B) as might be furnished for scratchpad or other purposes. The
block of DPCM memory 130 previously containing the best match block is now
free to receive the current quarter pixel interpolated block. When
interpolation is restricted to only horizontal and vertical interpolation,
some of the quarter pixel estimation blocks are interpolated from one or
more of the full and half pixel estimation search blocks (X block 70, A
block 72, B block 74, and C block 76), while other quarter pixel
estimation search blocks are interpolated from quarter pixel estimation
search blocks. Alternatively, when diagonal interpolation is also used,
all quarter pixel estimation search blocks are interpolated from the full
and half pixel estimation search blocks.
The current quarter pixel interpolated block is compared with the image
block stored in the P block 80 of the DP memory 124. The comparison yields
a current sum of absolute block differences, which is compared with the
minimum sum of absolute block differences stored in the controller 102. If
the current sum of absolute block differences is less than the minimum sum
of absolute block differences, the new value replaces the old value stored
in the controller 102, and the identity of the current quarter pixel
search block is substituted for the identity of the former best match
block stored in controller 102. If the current sum of absolute block
differences is equal to or greater than the minimum sum of absolute block
differences, no change is made.
At the end of the quarter pixel estimation, the identity of the best match
block resides in a register of the controller 102. This may be a full
pixel estimation block, a half pixel estimation block, or a quarter pixel
estimation block. The motion vector is known simply from the spatial
identity of the best match block, while the prediction error between the
image block stored as P block 80 in the DP memory 124 and the best match
search block stored in the DPCM memory 130 is determined as described
above in association with expression (2). The calculation is performed in
the ALU 154, and the results, the prediction error, are written into the D
block 82 of the DP memory 124.
In the limited memory embodiment of FIG. 2, the order in which the quarter
pixel estimation search blocks are generated and compared, and the
selection of obsolete full and half pixel estimation search blocks to be
overwritten are somewhat discretionary. The order shown in FIG. 5 about
the pixel element A.sub.10,10 is therefore illustrative. For purposes of
illustration, assume that after half pixel estimation, the half pixel
interpolated A block 72 is found to be the best match block. FIG. 5 shows
in the highlighted areas about the elements A.sub.r,c in the interleaved
search matrix 400 (see, for example, the numbered highlighted areas about
element A.sub.10,10) that eight blocks must be generated and compared with
the image block 12.
First, the best match block is moved from A block 72 of the DPCM memory 130
into the D block 82 of the DP memory 124. In conformance with FIG. 5,
however, the pixels of the best match block now stored in the D block 82
are referred to as A.sub.r,c. This frees up the A block 72 to hold the
current quarter pixel interpolated block, the pixels of which for
convenience are referred to as Q.sub.r,c.
The first quarter pixel estimation search block of Q1 pixels is generated
from a horizontal interpolation of the data in the X block 70 and the D
block 82, and stored in the A block 72 for the absolute displaced block
difference calculation, in accordance with the following expression.
Q1.sub.r,c =(X.sub.r,c +A.sub.r,c)/2 (3)
Consider, for example, the interpolation of the Q1 pixel to the left of
pixel A.sub.10,10. The pixel group A.sub.10,8 -A.sub.10,15 in row ten of
the D block 82 is addressed on, say, port A of the DP memory 124 and
presented through the multiplexer 152 to the A port of the ALU 154. At
about the same time, a collection of pixels containing the pixel group
X.sub.10,8 -X.sub.10,15 in row ten of the X block 70 is addressed on, say,
port A of the DPCM memory 130 and the pixel group X.sub.10,8 -X.sub.10,15
is selected by shifter 140 and presented through the multiplexer 152 to
the B port of the ALU 154. The ALU 154 sums the pixel groups, divides by
two, and rounds the result. An eight pixel result at the output of the ALU
154 is routed back to the DPCM memory 130, where it is stored as pixels
Q1.sub.10,8 -Q1.sub.10.15 in the tenth row of the A block 72.
The second quarter pixel estimation search bloc | | |