|
Claims  |
|
|
What is claimed:
1. A system for Q.times.R kernel convolution of an M.times.N array of
words, where Q.times.R is a subset of M.times.N comprising:
input means for receiving, in sequence, words P from said M.times.N array;
coefficient means for storing coefficients C.sub.1 through C.sub.Q.times.R
;
Q.times.R matrix of multiplying means, each multiplying means having a
first input receiving Pi and a second input receiving a dedicated Cj for
producing a product PiCj where i varies from 1 to M.times.N and j is one
of 1 to Q.times.R;
said first input of each of said multiplying means is connected to a first
input of a preceding multiplying means in its respective row of a one
stage buffer means;
said first input of a first multiplying means in each row is connected to a
first input of a last multiplying means of a previous row by a row buffer
means having M-R+1 stages; and
adder means connected to said Q.times.R matrix of multiplying means for
adding said products PiCj to providing an output convolution Pc as a first
output.
2. A system according to claim 1 wherein said row buffer means is
programmable for different values of M.
3. A system according to claim 1 including:
second input means for receiving R-1 words P; and
a plurality of multiplexer means each having a first input from said second
input means and a second input from a row buffer means and an output
connected to a first input of a first multiplying means of a respective
row.
4. A system according to claim 3 including means for selectively connecting
said second input means to said adder means.
5. A system for Q.times.R kernel convolution of an M.times.N array of
words, where Q.times.R is a subset of M.times.N, comprising:
input means for receiving, in sequence, words P from said M.times.N array;
coefficient means for storing coefficient C.sub.1 through C.sub.Q.times.R ;
Q.times.R matrix of multiplying means, each multiplying means having a
first input receiving Pi and a second input receiving a dedicated Cj for
producing a product PiCj where i varies from 1 to M.times.N and j is one
of 1 to Q.times.R;
a plurality of buffer means each connecting said input means to a
respective multiplying means first input for storing one or more words P
and delaying inputting of Pi to a respective multiplying means as a
function of its position in said Q.times.R matrix and the row length M of
said M.times.N array;
an ALU means connected between said input means and said plurality of
buffer means for performing arithmetical and logical operations on said
words P before being inputted to said plurality of buffer means; and
adder means connected to said Q.times.R matrix of multiplying means for
adding said products PiCj to providing an output convolution Pc as a first
output.
6. A system according to claim 5 wherein said plurality of buffer means
includes:
row buffer means connected to a first multiplying means of a row and having
a plurality of stages as a function of row length M; and
other buffer means connected to said multiplying means other than said
first multiplying means of a row and having a number of stages as a
function of its position in its row.
7. A system according to claim 6, wherein said row buffer means is
programmable for different values of M and the other buffer means are a
fixed number of stages.
8. A system according to claim 7, wherein said system is convertible to a
(Q.times.R).times.1 convolution by programming said row buffers to Q
stages.
9. A system according to claim 6 including:
second input means for receiving R-1 words P; and
a plurality of multiplexer means each having a first input from said input
means and a second input from said row buffer means and an output
connected to a first input of a first multiplying means of a respective
row.
10. A system according to claim 9 including means for selectively
connecting said second input means to said adder means.
11. A system according to claim 5 including second input means for
receiving and storing a second word; and
wherein said ALU means includes a first input connected to said input means
and a second input connected to said second input means, and said ALU
means performs arithmetical and logical operations on both said words P
and said second word.
12. A system according to claim 5 wherein said coefficient means includes a
first and second coefficient means for storing first and second groups of
coefficients and a multiplexer means for connecting one of said
coefficient means to second inputs of said multiplying means.
13. A system according to claim 5 including:
second input means connected to said adder means; and
second output means connected to an output buffer means whose output Pd
represents a given number of stages of delay.
14. A system according to claim 13 wherein:
said output buffer means is programmable to adjust said given number of
stages of delay; and
said first mentioned input means includes an input buffer means
programmable to different numbers of stages of delay.
15. A system according to claim 14 wherein said plurality of buffer means
includes:
row buffer means connected to a first multiplying means of a row and having
a plurality of stages as a function of row length M;
other buffer means connected to said multiplying means other than said
first multiplying means of a row and having a number of stages as a
function of its position in its row; and
said output buffer means is a row buffer means of a last row Q.
16. A system according to claim 5 including
programming input means for receiving programming data;
address means for receiving addresses and determining the destination of
said programming data;
said address means directs coefficients from said programming input means
to said coefficient means by an appropriate address.
17. A system according to claim 16 wherein said coefficient means stores
two sets of coefficients and said address means selects one or the other
sets of coefficient to be provided to said multiplying means by an
appropriate address.
18. A system according to claim 16 an ALU means connected between said
input means and said multiplying means for performing arithmetical and
logical operations on said words P before being inputted to said
multiplying means; and
wherein said address means directs microcode from said programming input
means to said ALU means.
19. A system according to claim 18,
including an ALUR means for storing a second word from said programming
input means;
wherein said ALU means includes a first input connected to said input means
and a second input connected to said ALUR means; and
wherein said address means directs said second word from said programming
input means to said ALUR means.
20. A system according to claim 16 including instruction means for storing
an instruction word which provides control of said system; and
wherein said address means directs instruction words from said programming
input means to said instruction means.
21. A system according to claim 20 wherein said instruction word includes
bits for output rounding and number formats.
22. A system according to claim 20 including second data input means for
selectively providing second input data to said multiplying means and said
adder means; and
wherein said instruction word includes a bit for controlling said second
data input means.
23. A system according to claim 20 including input buffer means
programmable to different number of stages of delay; and
wherein said instruction word includes bits for programming said input
buffer means.
24. A system according to claim 16 wherein
said plurality of buffer means includes row buffers means programmable to
different values of M; and
said address means directs values of m from said programming input means to
said row buffer means for programming.
25. A system for Q.times.R kernel convolution of an M.times.N array of
words, where Q.times.R is a subset of M.times.N, comprising:
first input means for receiving, in sequence, words P from said M.times.N
array;
coefficient means for storing coefficients C.sub.1 through C.sub.Q.times.R
;
Q.times.R matrix of multiplying means, each multiplying means having a
first input receiving Pi and a second input receiving a dedicated Cj for
producing a product PiCj where i varies from 1 to M.times.N and j is one
of 1 to Q.times.R;
R-1 row buffer means each connecting said first input means to a respective
row of multiplying means first input for storing one or more words P and
delaying inputting of Pi to a respective multiplying means as a function
of its position in said Q.times.R matrix and the row length M of said
M.times.N array;
second input means for receiving R-1 words P;
a plurality of multiplexer means each having a first input from said second
input means and a second input from a row buffer means and an output
connected to a first input of a respective row of said multiplying means;
adder means connected to said Q.times.R matrix of multiplying means for
adding said products PiCj to proving an output convolution Pc as a first
output; and
selection means for selectively connecting said second input means to
either said multiplexing means or said adder means.
26. A system according to claim 25 including R-1 external buffers connected
to said second input means, each of said external buffers having a greater
capacity than said row buffer means.
27. A system according to claim 25 including a summing means, for each row
R, for providing a sum of products PiCj of its respective row to said
adder means; and
wherein said adder means adds said sums from said summing means to provide
said output convolution Pc.
28. A system according to claim 27 including a second input means for
receiving additional sums and said adder means is connected to said second
input means.
29. A system for Q.times.R kernel convolution of an M.times.N array of
words where Q.times.R is a subset of M.times.N comprising:
input means for receiving, in sequence, words P from said M.times.N array;
coefficient means for storing coefficients C.sub.1 through C.sub.Q.times.R;
Q.times.R matrix of multiplying means, each multiplying means having a
first input receiving Pi and a second input receiving a dedicated Cj for
producing a product PiCj where i varies from 1 to M.times.N and j is one
of 1 to Q.times.R;
a plurality of buffer means each connecting said input means to a
respective multiplying means first input for storing one or more words P
and delaying inputting of Pi to a respective multiplying means as a
function of its position in said Q.times.R matrix and the row length M of
said M.times.N array;
adder means connected to said Q.times.R matrix of multiplying means for
adding said products PiCj to proving an output convolution Pc as a first
output; and
frame reset means for resetting said matrix of multiplying means, said
buffer means and said adder means and not said coefficient means.
30. A system according to claim 29 including reset means for resetting said
coefficient means, said multiplying means, said buffer means and said
adder means.
31. A system according to claim 29 wherein said buffer means have
programmable length and said frame reset means resets said buffer means
content without resetting its programmable length.
32. A system according to claim 29 including clock input means for
receiving a clock pulse which determines the processing rate and a hold
input means for enabling or disabling said clock input means.
33. A system for S.times.T kernel convolution of an M.times.N array of
words where S.times.T is a subset of M.times.N comprising:
a plurality of Q.times.R capacity convolvers, where Q.times.R is a subset
of S.times.T, each having a data input, coefficient input, a cascade input
connected to an output adder of said convolver, a product output from said
output adder and a cascade output providing delayed input data;
said convolvers being arranged in rows and columns;
a cascade input of each convolver being connected to a product output of a
previous convolver in its respective row, and the cascade input of a first
convolver of each row being connected to a product output of a last
convolver of a previous row;
a data input of each convolver being connected to cascade output of a
previous convolver in its respective column, and the data output of
convolvers of a first row are connected to the data input of a first
convolver of said first row by delay means.
34. A system according to claim 33 wherein each convolver includes a said
delay means programmable between zero and Q stages of delay.
35. A system according to claim 33 wherein each convolver includes row
buffer means programmable for different values M. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND AND SUMMARY OF THE INVENTION
The present invention relates generally to matrix processors and more
specifically to matrix processors of images.
Modern image processing frequently requires that a massive number of
repetitive computations be performed on a single array of picture elements
(pixels) before they are displayed. The purpose of these computations is
to enhance the contents of a single picture frame, or modify them to suit
a specific image analysis objective. The two most common classes of pixel
operations are the pixel-point operations and pixel-group operations.
Pixel-point operations are considerably less computation-intensive than
the pixel-group operations; they usually call for simple arithmetic or
logic operations to be performed on each pixel of the frame. On the other
hand, pixel-group operations usually involve a square or rectangular
window (convolution kernel) of neighboring pixels upon which a set of
arithmetic operations is to be performed. This set of operations is
repeated for each pixel of the frame.
The value of the pixel usually indicates the intensity of a specific RGB
video component for chromatic applications, or a gray scale value of
monochromatic applications (such as radar signal processing or image
recognition).
It is the pixel-group operations that have, to date, established the limits
of real-time image processing speeds. The most common pixel-group
operation in image processing is a so-called spatial convolution, i.e. a
process of multiplying selected neighboring pixels by a set of values
called a convolution coefficient kernel followed by the summation of the
results. For instance a typical application, a set of, say, 9 pixels is
arranged as a 3.times.3 matrix:
##EQU1##
where x and y indicate pixel's coordinates in the frame array. Each
pixel's K-bit value (where K usually lies in the 6-12 bit range) of such a
3.times.3 array is then multiplied respectively by an M-bit coefficient
value (where M can be typically 6-16 bit value) from the 3.times.3
coefficient kernel:
##EQU2##
The matrix above is called the convolution kernel, and each of its
elements is referred to as convolution coefficient. Finally, after all 9
multiplications are completed, their results are then summed to yield the
new value of the pixel in the location x,y:
##EQU3##
The new value of the pixel NEWP(x,y) usually corresponds to the modified
value of its amplitude (gray scale). The entire process described above is
called "2-D spatial filtration" or "2-D spacial convolution" and
corresponds to a discrete two-dimensional filtration of the image in the
time domain. Depending on the set of values A . . . I picked for the
convolution kernel, a number of image processing functions can be
accomplished. In particular, operations such as image smoothing, edge
detection or extraction, or contrast enhancement can be accomplished. If
all kernel coefficients except for the middle one (E) are picked equal to
each other, or all nine of them are equal, the kernel is referred to as
symmetrical. This is a common form of the kernels used in modern image
processing. If three or more coefficients in the kernel differ from each
other the kernel is called asymmetrical.
Although the use of larger kernels, such as 5.times.5, 7.times.7 or even
15.times.15, is even more desirable (since it increases the bandwidth of
convolution), the amount of computations involved in convolution with such
large kernels is often prohibitive for most applications. Since industry
standard frame array sizes vary from 256.times.256 pixels to
4096.times.4096, the number of multiplications and additions which have to
be performed for a single frame convolution varies from approximately
600,000 to almost 160 million per frame. Consequently, if the frames are
to be convolved in real time (i.e. processed at the same rate, or faster,
than they are acquired and digitized), the total frame convolution time
puts an obvious restriction on the image acquisition time. Thus, for
example, if the industry standard medium resolution image of 512.times.512
is to be acquired and convolved using 3.times.3 kernel, almost 2.5 million
multiplications and additions will have to be performed. If a single
eight-bit multiply/accumulate operation is assumed to require 50
nanoseconds using off-the-shelf multiplier-accumulater (MAC), the total
amount of time required to complete a single frame convolution will be
0.125 seconds. This, in turn, would imply that the maximum frame
repetition rate would be limited to 8 Hz, a rate too slow for most
industrial and commercial applications typically requiring at least a 30
Hz frame repetition rate.
Consequently, most modern processors do not offer real-time 3.times.3
convolution capabilities. On the other hand, in most industrial and
military applications, the frame repetition rates vary from 30 Hz
(interlaced NTSC standard) to as fast as 400 Hz. This implies that for
512.times.512 pixel arrays total frame convolution times in the
millisecond range are needed. In practice, such convolution speeds have
rarely been accomplished and only in board-level designs. ECl-based
designs can meet such requirements.
Thus, it is an object of the present invention to provide a matrix
processor capable of real-time processing.
Another object of the present invention is to provide a matrix processor
which can do real-time convolution of 3.times.3, 5.times.5 and other
kernels.
Still a further object of the present invention is to provide an image
processor capable of doing kernel convolutions of matrix from N.times.M to
multiples of that array.
These and other objects are obtained by a system for Q.times.R kernel
convolution of a M.times.N array of words. The system includes an input
receiving the words P from the M.times.N array and coefficient storage for
storing the coefficient C.sub.1 through C.sub.Q.times.R. A matrix of
Q.times.R multipliers each have a first input for receiving a word Pi and
a second input for receiving a dedicated coefficient Cj for producing a
product PiCj. An adder is connected to the output of the plurality of
multipliers for adding the products to provide an output convolution at a
first or product output.
A plurality of buffers connect the word inputs to respective multipliers
for storing one or more words and delaying the input of the word to a
respective multiplier as a function of its position in the Q.times.R
matrix and the row length M of the M.times.N array. The input structure is
connected to the first multiplier of the row without a buffer. The
multipliers in the matrix have a row input with the first row being
connected to the input structure and the other rows' inputs are connected
to the row input of a preceding row by a row buffer having M stages. Each
of the multipliers within a specific row is connected to the row input by
a buffer having a stage for each position it is spaced from the first
position. Alternatively, the row buffer would have M-R+1 stages and each
of the multipliers in a row would be connected to the preceding multiplier
by a single stage buffer.
In either embodiment, the row buffer is programmable for different values
of M. This allows the system to handle different values of word arrays
without modification of multiplier structure. A cascade output as well as
the product output may be provided representing the total delay or number
of stages through the Q.times.R matrix. Depending upon the embodiment, it
is either out of the last multiplier or out of the last row buffer.
The input may be through an arithmetic logic unit and through programmable
stages of delay. A second input or cascade input is provided and is
connected to the rows other than the first row through a multiplexer. The
other input to the multiplexer is from the row buffers. The multiplexer
would provide that the input word be connected to the first row with the
other rows receiving either their inputs from the row buffers or from the
cascade inputs. This allows versatility of the base kernel Q.times.R to be
used in a structure for larger kernels or larger word arrays. A summer for
each row R provides a sum of the products of its respective row to an
output adder. The output adder provides the output convolution Pcout. The
second or cascade input may also provide sums from other kernel stages to
the output adder. A pair of coefficient stores are provided and a
multiplexer chooses which store is being provided as an input to the
multiplier matrix.
The system is converted to a (Q.times.R).times.1 convolution by programming
the row buffers to Q stages. Arrays larger than M.times.N may be handled
by a single system for Q.times.R kernel convolution by providing
additional external row buffers connected to the second inputs.
Alternatively a plurality of Q.times.R kernel convoluting systems may be
cascaded. For different size kernels other than Q.times.R, the basic
Q.times.R convoluting system maybe cascaded and selective coefficients set
to zero.
Where the convolution kernel is larger than the convolver cell, the
convolver cells may be arranged in a group of rows and columns. The
cascade input of each convolver is connected to a product output of a
previous convolver in its respective row. The cascade input of the first
convolver of each row is connected to a product output of the last
convolver of a previous row. The data input of each convolver is connected
to a cascade output of a previous convolver in its respective column. The
input of the convolvers of the first row are connected to the data input
of the first convolver by appropriate delays. The delays may be external
or part of the internal programmable input delay stages. The appropriate
convolution kernel as well as variation in the matrix of the input can be
controlled using appropriate coefficient to the individual convolvers.
Other objects, advantages and novel features of the present invention will
become apparent from the following detailed description of the invention
when considered in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a convolver incorporating the principles of
the present invention and illustrated as a 3.times.3 kernel;
FIG. 2A is a map for a 3.times.3 convolver;
FIG. 2B is a functional diagram of one embodiment of a 3.times.3 convolver
according to the principles of the present invention;
FIG. 2C is a functional diagram of another embodiment of a 3.times.3
convolver of the present invention;
FIG. 3A is a block diagram of the convolver of FIG. 1 used in combination
with external row buffers;
FIG. 3B is a functional diagram of a 3.times.3 convolver of the present
invention used in its cascade configuration;
FIG. 4 is a block diagram of the convolver of FIG. 1 arranged in an array
of rows and columns to provide a kernel larger than that of the individual
convolver as well as handling larger input arrays;
FIG. 5A is a coefficient mask for a 6.times.6 kernel convolution;
FIG. 5B is a block diagram of convolvers of FIG. 1 for a 6.times.6 kernel
convolution;
FIG. 6A is a map for a 3.times.6 kernel convolution;
FIG. 6B is a block diagram of convolvers of FIG. 1 connected in a 3.times.6
kernel convolution;
FIG. 7 is a block diagram of the use of the convolver of FIG. 1 with a
controller.
DETAILED DESCRIPTION OF THE DRAWINGS
One of the most frequently encountered operations in image processing
systems is 2-D convolution, or kernel convolution. This is a pixel group
operation which is used to perform image filtering functions such as: edge
extraction, feature enhancement, smoothing, or any combination of these.
Until recently, only board level systems were available to handle the
extraordinary computing requirements encountered in 2-D convolution. Today
there are just a handful of IC's which address this challenge, and only a
few offer a single chip solution. The present convolver accelerator, is a
single chip alternative to 3.times.3 kernel convolution, while being
versatile enough to handle larger convolution tasks through cascadability,
and 1-D convolution through programmability.
Although the description will address mainly high speed 1-D or 2-D digital
convolution, the present convolver can be used for many other real time
image processing and enhancement, robotics and computer vision, and radar
signal processing applications.
The convolver is a cascadable IC that is capable of performing real time
kernel convolution. As a single IC, it can perform 3.times.3 kernel
convolution on up to a 1k.times.N frame of pixels. It will accept the
pixels in non-interlaced raster scan order at a clock rate up to 40 MHz.
The convolved image will be available at the outputs in the same raster
scan order it was received beginning after the 3rd pixel of the 3rd row
has been accepted. Processing is complete only one cycle after the last
pixel has been accepted, or after 1k.times.N cycles, where N is the number
of rows in the frame. Thus, at a 40 MHz pixel clock rate, an entire
512.times.512 image is processed in under 7 milliseconds, with the first
new pixel available after only 26 microseconds, while a 1k.times.1k image
is convolved in 27 milliseconds, with the first pixel available after only
51 microseconds.
The convolver includes on-chip row buffers which can be programmed to any
desired depth, up to 1024 pixels. This allows for convolution on
non-standard, user defined row sizes, and also provides 9-tap
one-dimensional (1-D) filtering capabilities. An ALU and a Shifter are
also included in the pixel data path to allow pixel point operations to be
performed on the incoming data.
Cascadability allows multiple convolvers to be used for frames larger than
1024 pixels per row, and also for kernels larger than 3.times.3. There is
also an option allowing it to interface with external row buffers, giving
the user a low cost method for convolving larger frames.
Although the convolver will be described as a 3.times.3 kernel, this by way
of example and the basic convolver unit may be considered a Q.times.R
kernel convolver. Similarly, the pixel array first example has been a
512.times.512 array but may also be considered a generic M.times.N array.
Wherein the basic convolver unit is used in array to increase the kernel
beyond its Q.times.R capacity, this kernel array will have the generic
notation of S.times.T.
The convolver 10, as illustrated in FIG. 1, includes input port 12 for the
pixel data Pin and input port 14 for the coefficient Cin. An additional
input CASI is generally used for cascading at port 16. The outputs of the
convolver 10 include the convoluted kernel Pcout at port 18 and a
cascading output CASO at port 20. A control logic 22 then receives a
plurality of inputs to be described in more detail below. The pixel data
input port 12 is connected to a plurality of cascaded buffers 24, 26 and
28 whose outputs are connected through a 4 to 1 multiplexer 30 as a first
input Din to the arithmetic logic unit ALU 32. The control logic unit 22
selects the amount of delay of the pixel input Pin to the ALU 32. Although
multiplexer 30 is shown, the buffers 24, 26, and 28 may be programmably
selected.
The coefficient port 14 in addition to providing coefficients to be used in
the multipliers, may also provide the second input to the ALU 32 through a
register ALUR 34. By storing the second input to the ALU in ALUR 34, not
only can the coefficient input 14 be used for other functions, but a fixed
value can be maintained in the ALUR 34 for many cycles or frames. The
coefficient input 14 may also be used to provide ten bits of microcode to
AMC register 36 and nine bits of instruction code to INT register 38. A
typical example of the microcodes used in the ALU and stored in AMC
register 36 is shown in Table 1.
TABLE 1
______________________________________
Pixel Point (ALU) OPERATIONS
AMC REG AMC REG
7-LSB's Operation 3-MSB's Operation
______________________________________
0000000 Logical (0) 000 No Shift
0001000 Logical (A AND B)
001 Shift Right 1
0010000 Logical (A AND B)
010 Shift Right 2
0011000 Logical (A) 011 Shift Right 3
0100000 Logical (A AND B)
100 Shift Left 1
0101000 Logical (B) 101 Shift Left 2
0110000 Logical (A XOR B)
110 Shift Left 3
0111000 Logical (A OR B)
1000000 Logical (A NOR B)
1001000 Logical (A XNOR B)
1010000 Logical (B)
1011000 Logical (A OR B)
1100000 Logical (A)
1101000 Logical (A OR B)
1110000 Logical (A NAND B)
1111000 Logical (1)
0110001 SUM (A plus B)
1001010 DIFF (A minus B)
1001100 DIFF (B minus A)
______________________________________
By using different bits for the ALU logic operation from the shifting
operation, both types of operations may be performed in combination. A
default condition of the AMC register 36 is bypass which is Logical (A)
with no shift.
A typical example of the nine bits of instructions in the INT register 38
are shown in Table 2.
TABLE 2
______________________________________
BIT INSTRUCTION REGISTER
______________________________________
Cin <0> cascade input mode; i.e. internal or external
row delays
Cin <1:2> number of input delays at the Din input bus
Cin <3> number format, for Pin unsigned or 2's
complement
Cin <4> number format, for Cin unsigned or 2's
complement
Cin <5:6> type of rounding performed on Pcout, i.e. no
rounding, 16 bit rounding and 8 bit rounding
Cin <7:8> number of shift for CASI, i.e. None, 2, 4, 8
______________________________________
The first bit Cin <0> selects either internal row delays using programmable
row delays 42 and 44 or external row delays as received on the cascade
input 16. This selection is executed by multiplexers 48 and 50. The second
and third bits Cin <1:2> select the number of input delays by controlling
multiplexer 30 or direct control of input delay stages 24, 26 and 28. The
fourth bit Cin <3> signifies the input and output number format of pixels
input Pin and output Pcout. The fifth bit Cin <4> signifies the input
number format of the coefficients Cin.
The sixth and seventh bits Cin <5:6> determine the type of rounding
performed at the output Pcout at port 18. The rounding control of the
instruction register will vary depending upon the available length of the
convolved output Pcout 18. To increase the resolution of the output 20
bits are shown. Using a sixteen bit round, a one is added internally to
the bit position Pcout <3>. When an eight bit round is selected, a one is
added to the bit position Pcout <11>. The eight and ninth bits Cin <7:8>
define the amount of shifting of the cascade input CASI at port 16. This
may be no shift, 2 bit shift, 4 bit shift or an eight bit shift, for
example. The default state of the INT register 38 after a reset is all
zeros.
The output from the ALU 32 is provided directly to the first row of the
multiplier matrix 40 as well as to a first programmable delay stage or
shift register 42. The output of programmable shift register 42 is
provided to the second row of the multiplier matrix 40 through the
multiplexer 48 as well as providing an input to a second programmable
shift register 44. The output of the programmable delay stage or shift
register 44 is provided to the third row of the multiplier matrix 40
through a multiplexer 50. The second input to the multiplexers 48 and 50
is from the cascade input port CASI 16 As noted in Table 2, the cascade
input mode is selected by the first bit of the instruction register 38 to
control the multiplexers 48 and 50. For an internal delay, the
programmable row buffers 42 and 44 are selected as the input to the
multiplier matrix 40, by multiplexers 48 and 50. For external row delays,
the input port 16 for the CASI is selected as the input to the multiplier
matrix 40, by multiplexers 48 and 50.
The programmable shift registers 42 and 44 can each store up to for example
1024 8-bit pixel values and thus delay the pixels from 0 up to 1024 clock
cycles. The delay is represented by a small m which represents the row
length of the frame. The desired value m for the programmable shift
registers 42 and 44 is stored in end of row (EOR) register 46 in the
control logic 22 and is inputted through the coefficient input Cin 14. The
binary value stored in the end of row register 46 is interpreted by the
control logic 22 and the signal is sent to the programmable row buffers 42
and 44, to provide an offset between the real and write pointers. The
default position for the end of row register 46 is a maximum length.
The coefficients are provided at the coefficient input Cin 14 to a first
coefficient register CREGO 52 and a second coefficient register CREGI 54.
The use of two coefficient registers 52, 54 allows the user to load a new
set of coefficients without disturbing the current convolution. A
multiplexer 56 selects which coefficient registers are used as the
coefficients in the multiplier matrix 40.
The multiplier matrix 40 consists of nine 8.times.8 multipliers. They are
arranged in three vertical rows with each row representing a three-tap
1-dimensional FIR filter. The first input of each of the multipliers is a
pixel value or a delayed pixel value while the second input is a
coefficient from the coefficient registers. The rows are separated by the
row buffers 42 and 44 resulting in a 2-D sum of products. Each row
provides its own sum of product outputs SOP1, SOP2 and SOP3. The sum of
products are added in a final adder 57 which provides the convolved output
Pcout at port 18.
The cascaded input CASI port 16 may also be provided to the adder 57
through a multiplexer 58. The multiplexer 58 includes a register to store
the cascade input CASI and buffer the adder 57 from changes on port 16
during a cycle. When the first bit of the instruction register 38 selects
internal buffering, multiplexer 58 connects CASI at port 16 to the final
adder 57. When the first bit of the instruction register 38 selects
cascade or external buffering, multiplexer 58 provides no input. Thus the
convolver 10 may receive a cascaded input to be added to its internal
generated sum of products SOP1, SOP2 and SOP3. The cascade input CASI on
port 14 provided to the final adder 57 may be used to add offset values.
This could be used for increasing the brightness of the convolved image as
well as other desired effects.
By selecting the internal row buffers 42 and 44 by bit one in the
instructions register 38, the output Pcout at port 18 is represented by
Equation 1 and the convolver 10 has a configuration illustrated in FIG.
2B.
##EQU4##
If the cascade input CASI 16 is used as the input to the second and third
rows of the multiplier 40 by bit one in the instruction register 38, the
output Pcout at port 18 will be represented by Equation 2 and the
convolver 10 has a configuration illustrated in FIG. 3B.
##EQU5##
The inputs to the control logic 22 include the clock input CLK which can
run at a maximum speed of 40 MHz. The new FRAME input, also known as the
vertical sink input, is used to reset all the internal circuitry except
for the coefficient registers 52, 54, ALU 32, AMC register 36, EOR
register 46 and INT registers 38. Thus after a FRAME reset has occurred, a
new frame of pixels maybe convolved without reloading these registers. The
reset RSB input resets all the internal circuitry. Thus after a reset
occurs all data registers must be initialized or remain at the default
value. The input HOLD is used to gate the clock input CLK for all internal
circuitry. Thus when HOLD is enabled (high), the CLK has no effect on the
processor and the internally stored data registers. This pin is useful for
stopping the clock internally during blank periods. The output enable
input OEB is used to tristate the convolved output Pcout at port 18. The
input EALU enables the ALU 32's B input path. Thus the second pixel input
on the coefficient input port 14 is stored in the ALU register 34 and is
provided as a second input to the ALU 32.
The loading of the various registers and the selection of the appropriate
coefficient using multiplexer 56 is controlled by the three bit input A.
The load strobe input LDB enables the function determined by the address
A. The chip select input CSB enables or disables the load strobe LDB pin
port. Thus, it is not required to externally gate the LDB strobe signal.
The various functions controlled by the address A are described in Table
3.
TABLE 3
______________________________________
A<2:0> Function
______________________________________
000 Load EOR Register 46
001 Load ALU Microcode is AMC Reg. 36
010 Load CREGO 52
011 Load CREG1 54
100 Load INT Register 38
101 Select CREGO (coef. mask) MUX56
110 Select CREG1 (coef. mask) MUX56
______________________________________
The default position for multiplexer 56 of the coefficient registers is
selection of the first register CREGO 52.
The specific structure of the multiplier 40 will be described with respect
to FIGS. 2 and 3 for its two modes of operation, namely internal row
buffering and external row buffering. The 3.times.3 mask illustrated in
FIG. 2A is applicable to FIGS. 2 and 3. For the first bit in the
instruction register 38 being a 0, the internal row buffers 42 and 44 are
used. The configuration of the multiplier matrix is illustrated in FIG.
2B. The output of the ALU 32 is provided directly to the first row which
includes multipliers 60, 61 and 62 receiving coefficients I, H and G
respectively. Buffers 70 and 71 connect the output of the ALU 32 to
respective multipliers. The amount of the delay represents the position in
the row from the first multiplier 60. The outputs of the multipliers 60,
61 and 62 are combined in a summer 80 to provide the sum of their partial
products SOP1. The output SOP1 of summer 80 is provided to a final adder
57.
The first programmable row buffer delay element 42 connects the output of
the ALU 32 directly to its row of multipliers. The first multiplier 63
receives a signal from 42 directly while the multipliers 64 and 65 have
appropriate delay stages 72 and 73 depending upon their position from the
first multiplier 63. Multipliers 63, 64 and 65 multiply the delayed
signals by the coefficients F, E and D respectively and the partial
products to summer 81. The output SOP2 of summer 81 is provided to the
final adder 57.
The output of the first programmable buffer 42 is connected through row
buffer 44 to the next row of multipliers. The output of row buffer 44 is
connected directly to multiplier 66 and through appropriate delays 74 and
75, depending upon their position in the row, to multipliers 67 and 68.
The multipliers 66, 67 and 68 multiply their respective coefficients C,B
and A times the delayed input and provide these partial products to summer
82. The output SOP3 of summer 82 is provided to the final adder 57. The
output Pcout of final adder 57 at port 18 is equal to Equation 1.
Each of the rows of equation 1 represents a sum of Partial products SOP1,
SOP2 and SOP3. The last term CASI(n) in Equation 1 represents the cascade
input pr | | |