Description
The present invention relates in general to an apparatus for parallelly
processing local image data, e.g., for parallelly effecting spatial
convolution of local image data and to apparatus having an architecture
suited for implementation in an LSI configuration.
Many conventional image data processors are so designed as to process image
data in parallel with a view to increasing the processing speed. See, for
example, Japanese Patent Application Laid-Open No. 141536/76 (laid open on
Dec. 6, 1976). However, since the image data is of two-dimensional nature,
it will be difficult to parallelly process all of the data. Since the
image data processors are mostly intended to effect parallel processing of
image data for picture elements positioned close to one another, such as a
spatial convolution for noise elimination or image contour emphasis, the
parallel processing is mostly effected for local image data for m
rows.times.n columns of elements. No apparatus for parallel processing of
local image data implemented in the form of an LSI has yet been known.
In reality, a great difficulty is encountered in realizing such
processing apparatus of the hitherto known architecture in an LSI form in
view of the extremely high integration density and the great number of
connector pins required.
An object of the present invention is to provide an apparatus for
parallelly processing local image data which apparatus may be implemented
in a large scale integrated circuit or LSI.
According to one aspect of the present invention, for parallel processing
of local image data of m rows.times.n columns, n processor units are
juxtaposed each including a data memory having m processing parameter data
stored therein and serving to store m image data and each including a
processor stage so that m image data for each of the n columns are
sequentially or serially supplied to the juxtaposed processor units and
the so supplied image data is sequentially or serially shifted for each
row between the juxtaposed processor units whereby n image data for one
row are simultaneously (parallelly) processed each time a shift of the
image data is effected for the processor units, and such simultaneous
processing is repeated m times. With this structure, the number of
processor elements required for the local image data processing is reduced
to 1/m of that in the prior art apparatus and hence the number of image
data input pins is also reduced to 1/m of that in the prior art device,
which provides an architecture facilitating implementation of the image
data processing apparatus in an LSI form.
According to another aspect of the present invention, for parallel
processing of local image data of m rows and n columns, m image data
processing modules are connected in cascade each including n juxtaposed
processor stages each coupled to memory means having n processing
parameter data stored therein so that image data of a different one of the
m rows is sequentially or serially supplied to each of the processing
modules and the so supplied image data is sequentially or serially shifted
for each row between the juxtaposed processor stages whereby n image data
for one row is simultaneously (parallelly) processed each time a shift of
the image data is effected for the juxtaposed processor stages in each of
the processing modules. With this structure, each of the image data
processing modules need be provided with only one input port for
receiving image data to be processed and one output port for delivering
processed image data, which provides an architecture facilitating
implementation of the image data processing apparatus in an LSI form.
The present invention will now be described by way of an exemplified
embodiment with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of an example of the general structure of an
image data processing system;
FIG. 2 is a diagram useful for explaining parallel processing of local
image data;
FIGS. 3 to 5 are a block diagram of one embodiment of the present invention
and diagrams illustrating the operating steps of this embodiment;
FIGS. 6A and 6B are diagrams illustrating an example of the structure of
and the operation of a data memory which may be used in the present
invention;
FIGS. 7A, 7B, 8A and 8B are diagrams illustrating two other examples of and
the operation of a data memory which may be used in the present invention;
FIGS. 9A and 9B are a block diagram and an operational time chart of
another embodiment of the present invention;
FIGS. 10A, 10B, 11A and 11B are block diagrams and operational time charts
of modified forms of the embodiment shown in FIG. 9A;
FIGS. 12A and 12B are a block diagram and an operational time chart of
another embodiment of the present invention; and
FIGS. 13A and 13B are a block diagram and an operational time chart of
another embodiment of the present invention.
Referring to FIG. 1 which shows schematically a general structure of a
typical image data processing system, the system comprises an ITV camera 5
serving as an image data pickup unit, an image memory 3 serving as a
storage for image data supplied from the camera 5 and a CRT monitor 4
which serves to display the content of the image memory 3. The image
information or data stored in the image memory 3 is processed by an image
data processing apparatus 2, the result of the processing is again stored
in the image memory 3 and/or supplied to a supervising processor 1 for
controlling the whole system.
As a typical image data processing function, there is a known spatial
convolution function or a so-called spatial filter function according to
which, as shown in FIG. 2, local image data d.sub.11, d.sub.12, . . ,
d.sub.44 for picture elements in a 4 (rows).times.4 (columns) matrix, for
example, are multiplied by predetermined weight data w.sub.11, w.sub.12, .
. . , w.sub.44 (i.e. weighting factors), respectively, and the total sum
of the resultant products is arithmetically determined to obtain a
processing result g. By selecting appropriate magnitudes or values of the
weights or weight data w.sub.ij, the image data processing, such as
elimination of noise, emphasis of image contours and the like can be
accomplished.
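Purely by way of illustration, the sum-of-products operation just described can be sketched in a few lines of Python. The function name, the sample picture-element values and the uniform averaging weights are assumptions chosen for the example and do not appear in the drawings.

```python
def local_convolution(d, w):
    """Return g, the sum of d[i][j] * w[i][j] over an m x n local window."""
    if len(d) != len(w) or any(len(dr) != len(wr) for dr, wr in zip(d, w)):
        raise ValueError("image window and weight matrix must have the same shape")
    return sum(dij * wij for dr, wr in zip(d, w) for dij, wij in zip(dr, wr))


# 4 x 4 window of picture-element values (illustrative numbers only).
window = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
# Uniform averaging weights, one simple choice for noise smoothing.
weights = [[1 / 16] * 4 for _ in range(4)]

g = local_convolution(window, weights)  # mean of the sixteen values
```

Other weight matrices, for example ones with a large positive centre and negative surround, would emphasize contours instead of smoothing.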
An embodiment of the present invention will now be described with reference
to FIGS. 3 to 5.
In this embodiment, it is assumed that local image data for m rows.times.n
columns of elements (hereinafter, simply referred to as "m.times.n image
data") is intended to be processed. The image data parallel processing
apparatus 2 is in a large scale integrated (LSI) circuit configuration
including four juxtaposed processor units (the same number of processor
units as the column number) 20-1 to 20-4 and summing means 23. Each of the
processor units includes a processor stage 22 and, for a predetermined
processing such as the spatial convolution, a data memory 21 for storing
therein m (four, in this embodiment) local image data and processing
parameter data, such as weight data.
The data memory 21 of each processor unit is supplied with image data
either from outside the LSI image data processing apparatus 2, as indicated
by an arrow 31, or from the data memory 21 of the adjacent processor unit,
as indicated by an arrow 32, and the image data is sequentially shifted
between the processor units. Further, each of the processor stages 22 is
supplied with two kinds of data, for example, image data and weight data,
from the associated data memory 21, as indicated by arrows 33 and 34. The
summing means 23 serves to sum the outputs 35-1 to 35-4 of the four
processor units 20-1 to 20-4 representative of the results of the
arithmetic operation and to cumulatively add its own successive sums,
whereby a processing result 36 is delivered from the summing means 23.
Next, referring to FIGS. 4 and 5, description will be made as to the manner
in which local image data is supplied to the LSI image processing
apparatus 2 and the procedures for executing the spatial convolution. It
is assumed that predetermined weight data w.sub.11, w.sub.12, . . . ,
w.sub.44 has been stored in the memory areas of the data memory 21 as
shown in FIG. 5. In the first place, 4.times.4 local image data d.sub.11
to d.sub.44 is written or stored in the data memory 21 through repetition
of individual column scanning in the horizontal or row direction from the
left hand side as indicated by an arrow at (a) in FIG. 4. That is, the
local image data is supplied on a single picture element basis to the data
memory 21 in the image data processing apparatus 2 sequentially starting
from the leftmost vertical column image data d.sub.11, d.sub.21, d.sub.31
and d.sub.41 for the four picture elements up to the rightmost vertical
column image data d.sub.14, d.sub.24, d.sub.34 and d.sub.44 for the four
picture elements as viewed on FIG. 4 so that image data d.sub.14,
d.sub.24, d.sub.34 and d.sub.44 is stored in the data memory 21-1 of the
processor unit 20-1, image data d.sub.13, d.sub.23, d.sub.33 and d.sub.43
is stored in the data memory 21-2 of the processor unit 20-2, image data
d.sub.12, d.sub.22, d.sub.32 and d.sub.42 is stored in the data memory
21-3 of the processor unit 20-3, and image data d.sub.11, d.sub.21,
d.sub.31 and d.sub.41 is stored in the data memory 21-4 of the processor
unit 20-4 as indicated at (a) in FIG. 4 and in FIG. 5. It is further
assumed that the arithmetic operation for the spatial convolution of the
4.times.4 local image data d.sub.ij -1 as indicated at (a) by a broken
line block in FIG. 4 has been completed, whereupon the operation for the
spatial convolution for the horizontally next local image data d.sub.ij -2
displaced by one column in the row direction as shown at (b) by a dot-dash
line block in FIG. 4 is next to be effected.
When image data d.sub.15 for the first picture element belonging to the
succeeding column is inputted, as shown at (b) of FIG. 4, the image data
d.sub.14, d.sub.13, d.sub.12 and d.sub.11 stored in the corresponding
areas in the first row of the data memories 21-1 to 21-4 is shifted to the
right as viewed in FIG. 4, whereby the image data d.sub.15, d.sub.14,
d.sub.13 and d.sub.12 is now stored in the above-mentioned first row areas
of the data memories 21-1 to 21-4 of the processor units 20-1 to 20-4, as
also shown in FIG. 5. The picture elements represented by the image data
d.sub.15, d.sub.14, d.sub.13 and d.sub.12 simultaneously undergo an
arithmetic operation for determining the products with the corresponding
weight data w.sub.14, w.sub.13, w.sub.12 and w.sub.11, respectively, at
the associated processor stages 22.
When the second image data d.sub.25 (i.e., the image data for the picture
element at the second row of the same column) is inputted, the image data
d.sub.24, d.sub.23, d.sub.22 and d.sub.21 stored in the corresponding
areas in the second row of the data memories 21-1 to 21-4 is shifted
horizontally one by one, whereby the image data d.sub.25, d.sub.24,
d.sub.23 and d.sub.22 is now stored in the above-mentioned second row
areas of the data memories 21-1 to 21-4 of the processor units 20-1 to
20-4. This image data d.sub.25, d.sub.24, d.sub.23 and d.sub.22
simultaneously undergoes an arithmetic operation to determine the products
with the corresponding weight data w.sub.24, w.sub.23, w.sub.22 and
w.sub.21 at the associated processor stages. A partial sum of the products
thus derived is also arithmetically determined by the summing means 23.
In a similar manner, image data d.sub.35 and d.sub.45 at the third and
fourth rows is inputted, whereby the products of the image data d.sub.35,
d.sub.34, d.sub.33 and d.sub.32 and d.sub.45, d.sub.44, d.sub.43 and
d.sub.42 with the respective weight data w.sub.34, w.sub.33, w.sub.32 and
w.sub.31 and the weight data w.sub.44, w.sub.43, w.sub.42 and w.sub.41 are
determined at the associated processor stages 22, which are followed by
the summation of these products, as described above.
In this manner, the products of the image data d.sub.15, d.sub.25, d.sub.35
and d.sub.45 and the corresponding weight data w.sub.14, w.sub.24,
w.sub.34 and w.sub.44 are arithmetically determined by the processor unit
20-1. In a similar manner, the products are, respectively, determined by
the processor units 20-2 to 20-4. These outputs of the processor units
20-1 to 20-4 are cumulatively added to determine the total sum by the
summing means 23, whereby the output data 36 (or g) representative of the
result of the spatial convolution executed for the local image data
d.sub.ij -2 is obtained.
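The shifting and accumulating procedure of FIGS. 4 and 5 may be easier to follow as a small software model. The sketch below is only a behavioural approximation under stated assumptions: the class and function names are hypothetical, list index 0 stands for the processor unit 20-1 on the input side, and the n multiplications that the hardware performs simultaneously are written as an ordinary Python sum.

```python
class ProcessorUnit:
    """One processor unit 20-k: a data memory (m image data, m weights) plus a multiplier."""

    def __init__(self, column_weights):
        self.weights = list(column_weights)    # m weight data held by this unit
        self.image = [0] * len(self.weights)   # m image data, one per row


def build_units(w):
    """Unit 0 models 20-1 and holds the last weight column; the last unit holds the first."""
    m, n = len(w), len(w[0])
    return [ProcessorUnit([w[i][n - 1 - k] for i in range(m)]) for k in range(n)]


def feed_column(units, column):
    """Supply one column of m picture elements serially; return the accumulated sum g."""
    g = 0
    for i, value in enumerate(column):
        carry = value
        for unit in units:
            # Shift row i by one unit: each unit hands its old datum to the next unit.
            carry, unit.image[i] = unit.image[i], carry
        # The n products for row i are formed together, then summed and accumulated.
        g += sum(unit.image[i] * unit.weights[i] for unit in units)
    return g


def scan_strip(image, w):
    """Slide the window along an m-row strip of image data, one column per step."""
    units = build_units(w)
    n = len(w[0])
    results = []
    for c in range(len(image[0])):
        g = feed_column(units, [row[c] for row in image])
        if c >= n - 1:  # the window is full once n columns have been loaded
            results.append(g)
    return results
```

Each call to `feed_column` corresponds to the m shift-and-multiply steps performed for one newly supplied column, so one result g is produced per column once the window is full.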
FIGS. 6A and 6B show an exemplary embodiment of the data memory 21
incorporated in each processor unit. The illustrated data memory 21 is
constituted by a two-port random access memory or RAM 40 having a
sufficient capacity for storing therein image data and processing
parameter data such as weight data. As will be seen from the time chart
shown in FIG. 6B, the image data 32 to be shifted is read out through the
second port and delivered to the data memory of the succeeding adjacent
processor unit, while the input data to be processed is stored at the same
address or area as that of the former image data read out through the
second port and at the same time the image data 32 is fed to the processor
stage 22 as the data 33 for the arithmetic operation described
hereinbefore. On the other hand, the weight data corresponding to the
above data 33 is also read out from the two-port RAM 40 through the first
port to be supplied to the processor stage 22 as the second data 34 for
the arithmetic operation.
As a modification of the two-port RAM 40, it is possible to provide two
RAM's separately, wherein one of the RAM's is used for storing the image
data with the other RAM being used for storing the processing parameter
data such as the weight data for the convolution operation and others. In
that case, the RAM for the image data storage may be constituted by a
shift register.
FIGS. 7A and 7B show another exemplary embodiment of the data memory 21
which is constituted by a shift register 50 and a one port RAM 51 of the
conventional type. As will be seen from the time chart shown in FIG. 7B,
the image data 53 to be shifted is at first read out (address 52) and
loaded in the shift register 50 to be supplied to the data memory of the
succeeding adjacent processor unit. On the other hand, the image data 31
determined to be inputted for processing is shifted and loaded into the
shift register 50 to be outputted as one of the data 33 for the arithmetic
operation and written simultaneously in the RAM 51. Subsequently,
corresponding weight data is read out (address 52) to be supplied to the
associated processor stage 22 as the other data 34 for the aforementioned
arithmetic operation. With the arrangement of the data memory 21 shown in
FIG. 7A, the inner circuit configuration of the RAM can be simplified
because there is no necessity to employ the two-port RAM, although the
time required for the processing is increased to some degree. Further, the
data memory configuration shown in FIG. 7A is advantageous in that data
transfer between remote processor units can be effected at a high speed
merely through the interposition of shift registers, provided that the
latter are bidirectional shift registers.
FIGS. 8A and 8B show still another version of the data memory 21. The image
data 31 to be inputted is loaded in a shift register 60 adapted to delay
the input image data by a predetermined number of the picture elements and
is at the same time supplied to the associated processor stage 22 as the
operation data 33. More particularly, in the case where the local image
data is 4 (rows).times.4 (columns) as mentioned above, the input image
data 31 is delayed by the same number of stages as the number of picture
elements in one column (the row number), i.e., four shift stages, and
thereafter supplied to the succeeding data
memory as the image data 32. On the other hand, the corresponding weight
data is read out from a RAM 61 as the other operation data 34. This
configuration of the data memory 21 is advantageous in that the inputting
and the outputting of the data 31 and 32 can be accomplished at an
increased speed because there is no necessity of resorting to cooperation
of the RAM 61.
The shift register 60 may alternatively be constituted by a RAM or by a
first-in first-out (FIFO) buffer. In any case, it is desirable that the
delay brought about by the shift register 60 be variable, depending upon
the number m of rows in the local image data.
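The delay-line form of the data memory (FIG. 8A) can be illustrated with a short Python sketch. It assumes that the delay equals the number m of picture elements supplied per column, and it uses `collections.deque` as a stand-in for the shift register 60; the class and function names are hypothetical.

```python
from collections import deque


class DelayLineMemory:
    """Image-data side of one data memory: a fixed delay of `stages` picture elements."""

    def __init__(self, stages):
        self.line = deque([0] * stages, maxlen=stages)

    def step(self, value):
        """Accept one input datum; return (operand for the processor stage, delayed datum)."""
        delayed = self.line[0]    # datum that entered `stages` steps earlier
        self.line.append(value)   # maxlen discards the oldest entry automatically
        return value, delayed


def step_chain(memories, value):
    """Pass one datum through the juxtaposed memories; collect each unit's operand."""
    operands = []
    for memory in memories:
        operand, value = memory.step(value)
        operands.append(operand)
    return operands


# Three units, four picture elements per column, data numbered in arrival order.
memories = [DelayLineMemory(4) for _ in range(3)]
for t in range(12):
    operands = step_chain(memories, t)
```

After the twelfth datum, the three units hold data 11, 7 and 3, that is, the same row position in three successive columns, which is the alignment the processor stages require.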
Another embodiment of the present invention will be described with
reference to FIGS. 9A and 9B. In the present embodiment, it is also
assumed that 4 rows.times.4 columns of local image data are intended to be
processed. The image data parallel processing apparatus 2-I is in a large
scale integrated circuit configuration including four image data
processing modules (the same number of processing modules as the row
number) 70A to 70D in cascade connection. The apparatus 2-I is supplied
with local image data (image data d.sub.14, d.sub.24, d.sub.34 and
d.sub.44 in FIG. 9A, for example) column by column simultaneously or
parallelly from the image memory 3 and the processing results may be
stored again in the image memory 3.
Each of the cascade-connected image data processing modules 70A to 70D has
one input port 71 through which image data to be processed is to be
introduced and one output port 74 through which processing results are to
be delivered. Each of the processing modules 70A to 70D includes a weight
data memory 79 constituting means having four weight data (the same number
of weight data as the column number) stored therein, four juxtaposed
processor stages (the same number of juxtaposed processor stages as the
column number) 76 each having a multiplier coupled to the weight data
memory 79, a shift register 75 having a stage provided for each of the
processor stages 76, a summing circuit 77 constituting means for summing
the outputs of the processor stages 76, and a cumulative adding circuit 78
constituting means for cumulatively adding the output of the summing
circuit 77 and the output of the summing circuit of the preceding image
data processing module. The shift register 75 for the first processor
stage 76 is coupled to the input port 71 for sequentially introducing
image data. The shift registers 75 for the other processor stages 76
interconnect adjacent processor stages so that a different one of the
image data sequentially introduced through the first processor stage shift
register is shifted and supplied simultaneously to each of the juxtaposed
processor stages 76.
When image data d.sub.14 is inputted into the processor stage 76 #4 of the
processing module 70A, adjacent image data d.sub.13, d.sub.12 and d.sub.11
is also inputted to the corresponding processor stages 76 #3 to #1 through
the shift registers 75, respectively. The image data d.sub.11 may be
outputted through an image data output port 72 for the case where the area
for the spatial convolution is expanded to more than 4.times.4 image data.
Each of the processor stages 76 is thus supplied with image data d.sub.ij
to be processed and weight data w.sub.ij from the weight data memory 79
for executing the multiplication of these data. The result of the
multiplication, i.e. the product data from the four processor stages 76 is
supplied to the summing circuit 77 for summing the product data to produce
a partial sum, which is then cumulatively added to the partial sum coming
from the summing circuit of the preceding processing module through the
partial sum input port 73 by the cumulative adding circuit 78. The result
of the cumulative addition is supplied to the cumulative adding circuit of the
succeeding processing module through the operation result output port 74.
Accordingly, when the processing modules are employed in four stages as
shown in FIG. 9A, there is outputted from the processing module 70D of the
final stage a total sum g which is given by the following expression:
$$g = \sum_{i=1}^{4} \sum_{j=1}^{4} d_{ij} \times w_{ij}$$
The processing mentioned above is illustrated in a time chart shown in FIG.
9B. When a processing result g.sub.1 of the operations executed during a
basic clock period .DELTA.t.sub.1 is outputted, then a similar operation
is performed on the adjacent 4.times.4 image data displaced by one column
in the row direction or by one row in the column direction, i.e., by one
image data corresponding to one picture element in the row or column
direction during a subsequent period .DELTA.t.sub.1, whereby the
corresponding result g.sub.2 is available as the output data. In this way,
the results of the spatial convolutions effected for every 4.times.4 image
data inputted successively are sequentially outputted.
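As a hedged illustration of the cascade of FIG. 9A, the following Python sketch models each processing module as a shift register, a set of multipliers, a summing circuit and a cumulative adder. The names `RowModule` and `run_cascade` are hypothetical, and the model ignores timing within a clock period; it only shows which data meet which weights.

```python
class RowModule:
    """One processing module (70A to 70D): handles one row of the local image data."""

    def __init__(self, row_weights):
        self.weights = list(row_weights)        # weight data memory 79: w_i1 .. w_in
        self.shift = [0] * len(self.weights)    # shift register 75, oldest datum first

    def clock(self, pixel, partial_in):
        """Shift in one picture element and return the cumulative partial sum."""
        self.shift = self.shift[1:] + [pixel]
        # Processor stages 76 multiply; summing circuit 77 adds the n products.
        partial = sum(d * w for d, w in zip(self.shift, self.weights))
        # Cumulative adding circuit 78 adds the partial sum from the preceding module.
        return partial_in + partial


def run_cascade(image, w):
    """Feed an m-row strip column by column; return one result g per full window."""
    modules = [RowModule(row) for row in w]
    n = len(w[0])
    results = []
    for c in range(len(image[0])):
        g = 0  # the first module has no preceding partial sum
        for module, row in zip(modules, image):
            g = module.clock(row[c], g)
        if c >= n - 1:
            results.append(g)
    return results
```

Because each module receives only its own row and a single partial sum, the model needs just one image input and one result output per module, which mirrors the pin-count argument made for this embodiment.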
FIG. 10A shows another embodiment of the invention according to which the
processing apparatus 2-I shown in FIG. 9A is so modified that the basic
clock time period .DELTA.t.sub.1 thereof is reduced by adopting a pipeline
processing. More particularly, in the former embodiment, the basic clock
time period .DELTA.t.sub.1 is necessarily not smaller than the sum of the
times required for the below mentioned processing steps:
(1) Inputting of an image data d.sub.ij to the shift register 75;
(2) Multiplication, at the processor stage 76, of an image data d.sub.ij by
a corresponding weight data w.sub.ij ;
(3) Summing for determining a partial sum by the summing circuit 77; and
(4) Cumulative addition of partial sums by the cumulative adding circuit
78.
In contrast, in the processing apparatus 2-IP shown in FIG. 10A, there are
interposed pipeline registers 80 between the processing steps (1) and (2),
(2) and (3) and between the processing steps (3) and (4), respectively.
With this arrangement, the basic clock time period .DELTA.t.sub.2 is
reduced to the longest one of the time periods required for the
above-mentioned processing steps (1) to (4).
A pair of first pipeline registers 80 are provided for each of the
processor stages 76 so that each processor stage receives input image data
and delivers processed image data via one and the other of the pair of the
pipeline registers, and a second pipeline register 81 is provided for the
summing circuit 77 so that the summing circuit 77 delivers its output via
the pipeline register 81.
The operation of the processing apparatus 2-IP is illustrated in FIG. 10B.
It will be seen that the processing steps similar to those steps (1) to
(4) as mentioned above for each row image data of one matrix of 4.times.4
image data are executed during periods .DELTA.t.sub.2 -1 to .DELTA.t.sub.2
-4, respectively. At the same time, processing steps similar to those (1)
to (4) for each row image data of other four successive matrices of
4.times.4 image data each displaced by one image data from one another are
executed during the corresponding ones of the periods .DELTA.t.sub.2 -2 to
.DELTA.t.sub.2 -5. In this manner, the processing speed can be
significantly increased by operating successively the individual
components in the pipeline manner.
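The effect of the pipeline registers on the basic clock period can be shown with simple arithmetic. The step times below are invented figures used only for the example, and the sketch neglects the set-up time of the pipeline registers themselves.

```python
# Hypothetical times, in nanoseconds, for processing steps (1) to (4).
step_times_ns = {
    "shift_in": 10,        # (1) input to the shift register 75
    "multiply": 40,        # (2) multiplication at the processor stage 76
    "row_sum": 25,         # (3) summing by the summing circuit 77
    "cumulative_add": 30,  # (4) cumulative addition by the circuit 78
}


def unpipelined_period(times):
    """Without pipeline registers, one clock period must cover every step in turn."""
    return sum(times.values())


def pipelined_period(times):
    """With a register between steps, the period is set by the slowest single step."""
    return max(times.values())


dt1 = unpipelined_period(step_times_ns)   # lower bound on the period of FIG. 9A
dt2 = pipelined_period(step_times_ns)     # period of the pipelined form of FIG. 10A
latency = len(step_times_ns) * dt2        # time for one datum to traverse all steps
```

With these assumed figures the period falls from 105 ns to 40 ns, while each individual result takes four of the shorter periods to emerge; throughput improves even though latency does not.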
FIG. 11A shows still another embodiment of the invention according to which
the processing apparatus 2-IP shown in FIG. 10A is so modified that the
basic clock time period .DELTA.t.sub.2 can further be reduced. In the
processing apparatus 2-IP of FIG. 10A, the basic clock time period
.DELTA.t.sub.2 tends to be restricted by the time required for the
execution of the processing step similar to the above-mentioned step (4)
for cumulatively adding partial sums. This is because, in the case where
the processing modules are realized in n stages, the basic clock time
period .DELTA.t.sub.2 will amount to n times the sum of the time required
for the processing by the cumulative adding circuit 78 and the time
required for the inputting and the outputting of the operation results
through the ports 73 and 74. The delay time involved in the
inputting/outputting operations is not negligible, particularly when
the processing modules are implemented in an LSI configuration. In the
light of the above, in the image data parallel processing apparatus 2-IPS
shown in FIG. 11A, a third pipeline register 82 is inserted in the
cumulative addition input path for the adding circuit 78 so that the
circuit 78 receives the output of the adding circuit of the preceding
adjacent processing module, whereby the cumulative addition of the partial
sums of adjacent processing modules 70A to 70D is executed through the
pipeline operation. With this arrangement, the basic clock time period
.DELTA.t.sub.3 can be reduced to 1/n of .DELTA.t.sub.2. However, as
illustrated in FIG. 11B, it is necessary to compensate for misalignment
among the individual processing modules 70A to 70D with respect to the
timing with which the partial sums are arithmetically determined and
cumulatively added in each processing module. For compensating the
misalignment of the timing, a skew correcting shift register or a delay
register 93 which may be of variable number of stages is provided for each
of the processing modules 70B to 70D except the first stage processing
module 70A so that delivery of the output of the summing circuit is
delayed. The delay register 93 in each of the modules 70B to 70D is
disposed immediately following the input port 71. Since the pipeline
register 82 inserted in the cumulative addition input path in each of the
processing modules 70A to 70D has a single stage, the number of stages
of the skew correcting or delay shift register 93 provided for each of the
processing modules 70B to 70D is selected as follows:
One stage for the processing module 70B, two stages for the processing
module 70C, and three stages for the processing module 70D.
The misalignment can thus be corrected or compensated, so that the pipeline
operations are allowed to be effected sequentially during a succession of
the periods each corresponding to the basic time period .DELTA.t.sub.3 as
is illustrated in the time chart of FIG. 11B.
As will be readily appreciated, the skew correcting or delay shift register
93 may be disposed immediately succeeding the summing circuit 77 or
immediately preceding or succeeding the processor stage 76 for
compensating the offset or misalignment in the timing.
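The need for the skew correction can be checked with a behavioural sketch. The Python model below assumes a single-stage register on each partial-sum input and a delay of k stages on the image input of the k-th module after the first; all names are hypothetical, and trailing zero columns are fed in only to flush the pipeline.

```python
from collections import deque


class PipelinedRowModule:
    """Row module with a skew-correcting input delay and a registered partial-sum input."""

    def __init__(self, row_weights, delay_stages):
        n = len(row_weights)
        self.weights = list(row_weights)
        self.shift = deque([0] * n, maxlen=n)    # shift register 75
        self.delay = deque([0] * delay_stages)   # delay register 93 (empty for 70A)
        self.partial_in_reg = 0                  # pipeline register 82

    def compute(self, pixel):
        self.delay.append(pixel)
        pixel = self.delay.popleft()             # datum delayed by `delay_stages` clocks
        self.shift.append(pixel)
        partial = sum(d * w for d, w in zip(self.shift, self.weights))
        return partial + self.partial_in_reg


def run_pipelined(image, w):
    m, n, width = len(w), len(w[0]), len(image[0])
    modules = [PipelinedRowModule(row, delay_stages=r) for r, row in enumerate(w)]
    results = []
    for t in range(width + m - 1):               # extra clocks flush the pipeline
        column = [row[t] if t < width else 0 for row in image]
        outs = [mod.compute(px) for mod, px in zip(modules, column)]
        # Clock edge: each register 82 latches the preceding module's output.
        for r in range(1, m):
            modules[r].partial_in_reg = outs[r - 1]
        if t >= m + n - 2:                       # first complete window reaches the output
            results.append(outs[-1])
    return results
```

In this model the results match a direct evaluation of the convolution, delayed by m-1 clock periods; setting every `delay_stages` to zero would instead add together partial sums belonging to different windows.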
FIG. 12A shows another embodiment of the image data processing apparatus
according to the invention. In the embodiments shown in FIGS. 9A, 10A and
11A, the input image data is supplied to the processor units by way of
shift registers. In contrast, in the image data parallel processing
apparatus 2-II shown in FIG. 12A, input image data is supplied in common
to individual processor units and the results of multiplications executed
by the processor units are cumulatively added to one another by
adder-register sets each including an adder 84 and a register 85 to
thereby obtain a partial sum.
In the present embodiment, it is also assumed that 4 rows.times.4 columns
local image data are intended to be processed. The processing apparatus
2-II is in a large scale integrated circuit configuration including four
image data processing modules (the same number of processing modules as
the row number) 90A to 90D in cascade connection. The apparatus 2-II is
supplied with local image data column by column simultaneously or
parallelly from the image memory 3 and the processing results may be
stored again in the image memory 3.
Each of the image data processing modules 90A to 90D has one input port 71
through which image data to be processed is to be introduced and one
output port 74 through which processing results are to be delivered. Each
of the processing modules 90A to 90D includes a weight data memory 79
constituting means having four weight data (the same number of weight data
as the column number) stored therein, four juxtaposed processor stages
(the same number of processor stages as the column number) 83 each
including a multiplier coupled to the weight data memory 79 and each being
coupled in common to the input port 71, an adder 84 provided for each of
the processor stages 83 and arranged to receive the output of the
associated processor stage as one input thereto, three shift registers 85
each interposed between two adjacent adders and each storing the output of
the adder on the input side of the shift register in question and
supplying its output as another input to the adder on the output side of
the shift register in question, and a cumulative adding circuit 86
constituting means for cumulatively adding the output of the last stage
adder 84 and the output of the last stage adder of the preceding
processing module.
The operation of the processing apparatus 2-II will be described with
reference to FIG. 12B showing a time chart therefor. Referring to the
processing module 90A, during a time period .DELTA.t.sub.4 -1, image data
d.sub.11 is inputted through the input port 71 and a product d.sub.11
.times.w.sub.11 of the input image data d.sub.11 and the corresponding
weight data w.sub.11 read out from the weight data memory 79 is
arithmetically determined by the processor stage 83 #1 and loaded into the
register 85 #1 during the succeeding period or clock time .DELTA.t.sub.4
-2.
Image data d.sub.12 inputted in the time period .DELTA.t.sub.4 -2 is
multiplied by the corresponding weight data w.sub.12 by the processor
stage 83 #2 to produce a product d.sub.12 .times.w.sub.12. During the
succeeding time period .DELTA.t.sub.4 -3, the product d.sub.12
.times.w.sub.12 is added to the product d.sub.11 .times.w.sub.11 stored in
the register 85 #1 by the adder 84 to thereby produce a sum (d.sub.11
.times.w.sub.11 +d.sub.12 .times.w.sub.12) which is stored in the register
85 #2.
Image data d.sub.13 inputted in the time period .DELTA.t.sub.4 -3 is
multiplied by the corresponding weight data w.sub.13 in the processor
stage 83 #3 to produce a product d.sub.13 .times.w.sub.13 which is added
to the content (d.sub.11 .times.w.sub.11 +d.sub.12 .times.w.sub.12) of the
register 85 #2 by the adder 84 to thereby produce a sum (d.sub.11
.times.w.sub.11 +d.sub.12 .times.w.sub.12 +d.sub.13 .times.w.sub.13) which
in turn is stored in the register 85 #3 during the time period
.DELTA.t.sub.4 -4.
Image data d.sub.14 inputted in the time period .DELTA.t.sub.4 -4 is
multiplied by the corresponding weight data w.sub.14 in the processor
stage 83 #4 to produce a product d.sub.14 .times.w.sub.14 which is added
to the content (d.sub.11 .times.w.sub.11 +d.sub.12 .times.w.sub.12
+d.sub.13 .times.w.sub.13) of the register 85 #3 by the final stage adder
84 to thereby produce a sum .SIGMA.11=d.sub.11 .times.w.sub.11 + . . . +
d.sub.14 .times.w.sub.14. The partial sum thus obtained is cumulatively
added during the succeeding time period .DELTA.t.sub.4 -5 by the
cumulative adding circuit 86 provided in each of the processing modules
90A to 90D. As a result, there is produced from the cumulative adding
circuit 86 of the final stage processing module 90D a total sum g given by
the following expression:
$$g = \sum_{i=1}^{4} \sum_{j=1}^{4} d_{ij} \times w_{ij}$$
Processing results g.sub.1, g.sub.2, . . . by the spatial convolutions are
successively outputted at the basic clock time interval .DELTA.t.sub.4.
According to this embodiment, since image data is supplied to the
respective processor stages in common in each processing module, it is
possible to expand the area of the local image data without additionally
providing an output port for delivering the output of the final processor
stage (such as the output port 72 shown in FIG. 9A), which leads to
effective reduction of the number of pins required for an LSI image
processing apparatus.
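For comparison with the shift-register form, the common-input arrangement of FIG. 12A can be sketched as follows. This is a simplified behavioural model with hypothetical names: the cumulative addition across modules is performed in the same step as the final adder, whereas the time chart of FIG. 12B places it in the succeeding period.

```python
class BroadcastRowModule:
    """Row module in which every processor stage receives the same input datum."""

    def __init__(self, row_weights):
        self.weights = list(row_weights)              # weight data memory 79
        self.regs = [0] * (len(self.weights) - 1)     # registers 85 between adjacent adders

    def clock(self, pixel, partial_in):
        # All processor stages 83 multiply the common input by their own weight.
        products = [pixel * w for w in self.weights]
        # Each adder 84 adds its product to the sum held by the preceding register.
        chain = [products[0]] + [p + r for p, r in zip(products[1:], self.regs)]
        self.regs = chain[:-1]                        # registers latch for the next period
        # Cumulative adding circuit 86 adds the preceding module's result.
        return partial_in + chain[-1]


def run_broadcast(image, w):
    """Feed an m-row strip column by column; return one result g per full window."""
    modules = [BroadcastRowModule(row) for row in w]
    n = len(w[0])
    results = []
    for c in range(len(image[0])):
        g = 0
        for module, row in zip(modules, image):
            g = module.clock(row[c], g)
        if c >= n - 1:
            results.append(g)
    return results
```

Here the partial sums, not the image data, move from stage to stage, so no image datum has to leave the module when the window is widened.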
It should be noted that, in order to reduce the time period .DELTA.t.sub.4,
modifications of this embodiment by employing pipeline registers and/or
delay registers are possible similar to the modified versions of the
embodiment of FIG. 9A, as shown in FIGS. 10A and 11A. A description of such
modified versions is omitted because they will be readily understood by
those skilled in the art from the foregoing description.
FIG. 13A shows another embodiment of the present invention. In the
embodiments described in the foregoing, each of the processor stages or
processor units is provided with weight data independent of one another.
In contrast thereto, the image data processing apparatus 2-III shown in
FIG. 13A is in a large scale integrated circuit configuration including
four image data processing modules (the same number of processing modules
as the row number) 100A to 100D in cascade connection.
Each of the image data processing modules 100A to 100D has one input port
71 through which image data to be processed is to be introduced and one
output port 74 through which processing results are to be delivered. Each
of the processing modules 100A to 100D includes a weight data memory 87
constituting means having four weight data (the same number of weight data
as the column number) stored therein, four juxtaposed processor stages
(the same number of processor stages as the column number) 88 each
including a multiplier coupled in common to the weight data memory 87, a
shift register 75 provided for each of the