|
Description  |
|
|
BACKGROUND OF THE INVENTION
This invention relates generally to a processing system for document data,
and more particularly to a document image processing system suitable as an
input unit to an electronic document image file.
Conventional electronic document files merely store each page of a document
as an image, and secondary information for information retrieval must
separately be given from outside using code input means (e.g., a
keyboard). In order to automate a file input operation, however, it is
preferred that secondary information is generated by automatically reading
titles, author names and the like described in the documents. In order to
further improve information retrieval, it becomes necessary to realize
automatic input of the captions of tables and chapter captions, or
automatic keyword extraction by recognition of the text itself.
Segmentation of the image of the object document into portions such as
captions, authors, abstract, text, figures, pictures, and the like, has
also been required to reduce the memory space and to increase facets for
retrieval.
A system which understands the content of a document and processes the
document on the basis of the result of understanding to cope with the
problems described above has so far been investigated, and an example of
such a system is disclosed in "Basic Studies on System for Cuttings of
Newspaper Articles" by Yoji Noguchi and Junichi Toyota (Resume 6C-1 of the
23rd National Convention of Information Processing Society of Japan;
1981). However, since this document understanding system is directed to
the cuttings of newspapers, it is not clear whether or not the technique
can be applied to documents having arbitrary formats. In addition, the
portions of characters are merely segmented, but a method of combining
segmentation with recognition is not disclosed.
SUMMARY OF THE INVENTION
The present invention is directed to provision of an image understanding
system which deals with ordinary document images, segments them in
accordance with their structures, and makes it possible to recognize the
character portions, whenever necessary.
In order to accomplish the object described above, the present invention
employs grammar describing the structure of a document image, and parses
the statements (the structures of the document) expressed by the grammar
to recognize the structure of an unknown input image. The grammar
describes the image as substructures and the relative relation between
them. In the parsing process, after the substructures and their relative
relation are identified, a search is made as to whether or not the
substructures and the relative relation exist in the unknown input image,
and if they do, the inside of the substructures is further resolved to
continue the analysis. If they do not, other possibilities are searched.
The structure of the unknown input image is understood from the result of
such a search.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows an example of documents;
FIG. 2 shows an embodiment of the present invention;
FIGS. 3, 4, 5 and 6 are flowcharts useful for explaining the processing at
a control unit shown in FIG. 2;
FIG. 7 is a referential view showing an example of documents;
FIGS. 8, 9, 10, 11, 12 and 13 are explanatory views useful for explaining
the principle of a fourth embodiment of the present invention;
FIGS. 14 and 15 are flowcharts useful for explaining the processing at the
control unit 102 in the fourth embodiment of the present invention; and
FIGS. 16 and 17 are explanatory views useful for explaining the content of
processing shown in FIG. 15.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
First of all, a parsing method in the embodiments of the invention will be
explained before the description of the embodiments. Though the following
description will deal with a technical paper as an example of documents,
the present invention can also be applied to other documents by changing
some parts of the grammar because the grammar formats are somewhat
different. Therefore, the present invention is not particularly limited to
the example of the technical paper.
FIG. 1 shows an example of one page of a technical paper having a
predetermined format. The following illustrates an example of the grammar
(hereinafter referred to as "document grammar") expressing the structures
of the documents.
__________________________________________________________________________
(textline)
__________________________________________________________________________
1 <document> ::=.vertline.
<technical>paper>.vertline.<paperback
novel>.vertline..about..vertline.<patent>
2 <technical paper>
::=
<title page>
3 <technical paper>
::=
<technical paper>[+<continued page>:]
4 <title page> ::=.vertline.
<UDC> .eta.<title content> .eta.<author abstract>
.eta.<text>.eta.<title page separator>
5 <continued page>
::=
<heading>.eta.<text>.eta.<page separator>
6 <UDC> ::=
<<UDC>> .xi.<period numeral>[.xi.<<CL>>.xi.<period
numeral>]
7 <heading> ::=
<Japanese title>.xi.<volume number>.xi.<numeral>
8 <volume number>
::=
<<VOL>>.xi.<numeral> .xi.<<NO>>.xi. <numeral>
9 <title content>
::=
<Japanese title>.eta. <English title>
10
<Japanese title>
::=
< Japanese textline region>
11
<English title>
::=
<English textline region>
12
<author abstract>
::=
<abstract>.xi.<author group>
13
<abstract> ::=
<English textline region>
14
<author group> ::=
<author>
15
<author group> ::=
<author group>[.eta.<author>]
16
<author> ::=
<Japanese textline> .xi.<English textline>
17
<page number> ::=
<numeral>
18
<text> ::=
<column> .xi. <column>
19
<column> ::=
<section>[.eta. <column>]
20
<section> ::=
<chapter caption>.eta.<section caption>.eta.
<section text>
21
<section> ::=
<section caption>.eta.<section text>
22
<section> ::=
<section text>
23
<section> ::=
<<reference>> .eta. <reference list>
24
<chapter caption>
::=
<<numeral>>.xi. <Japanese textline>
25
<section caption>
::=
<period numeral>.xi. <Japanese textline>
26
<section text >
::=
<paragraph> [ .eta.<section text>]
27
<paragraph> ::=
<Japanese textline region>
28
<paragraph> ::=
<figure table>
29
<Japanese textline
::=
<Japanese textline> .eta.[<Japanese textline
region>]
region>
30
<Japanese textline>
::=
<<Japanese character>> .xi.[<Japanese textline>]
31
<Japanese textline>
::=
<<Japanese character>> .alpha.[<Japanese textline>]
32
<Japanese textline>
::=
<<Japanese character>> .beta.[<Japanese textline>]
33
<English textline
::=
<English textline> .eta.[<English textline region>]
region>
34
<English textline>
::=
<word>.xi.<<DLM>>.xi.[<English textline region>]
35
<word> ::=
<<alphabet>>[.xi.<word>]
36
<word> ::=
<<alphabet>>[.alpha.<word>]
37
< word> ::=
<<alphabet>>[.beta.<word>]
38
<word> ::=
.vertline.{English person name}.vertline. {English
organization name}.vertline.
{English location name}.vertline.
.vertline.{general English word}.vertline.
39
<numeral> ::=
<<numeral>>[.xi. <numeral>]
40
<period numeral>
::=
<numeral>
41
<period numeral>
::=
<period numeral> .xi.<numeral>
42
<period numeral>
::=
<period numeral>.xi.<<PR>>
43
<<numeral>> ::=
.vertline.0.vertline.1.vertline..about..vertline.9.ver
tline.
44
<<alphabet>> ::=
.vertline.a.vertline.b.vertline.c.vertline.d.vertline.
.about..vertline.A.vertline.B.vertline..about..vertlin
e.0.vertline.1.vertline..about..vertline.
45
<<Japanese character>>
::=
.vertline. .vertline. .vertline..about..vertline.
.vertline. .vertline..about..vertline. .vertline.
.vertline..about..vertline.a.vertline.b.vertline.
.about..vertline.A.vertline.B.vertline..about..vertlin
e.0.vertline.1.vertline..about.1
46
<<DLM>> ::=
.vertline. .vertline.,.vertline...vertline..about..ver
tline.
46
<<CL>> ::=
.vertline.:.vertline.
47
<<PR>> ::=
.vertline...vertline.,.vertline.].vertline.
48
<figure table> ::=
.vertline.-figure-.vertline..eta. <Japanese
explanation> .eta.
<English explanation>
49
<figure table> ::=
<Japanese explanation> .eta.
<English explanation>.eta. <table>
50
<figure table> ::=
<box>
51
<box> ::=
.vertline.-field-.vertline. .circleincircle. <section>
N
52
<Japanese explanation>
::=
<<word-figure>>.xi.<numeral>.xi.
<Japanese textline>
53
<Japanese explanation>
::=
<<word-table>>.xi. <numeral>.xi.
<Japanese textline>
54
<Japanese explanation>
::=
<Japanese explanation> .eta.
<Japanese textline region>
55
<English explanation>
::=
<<FIG>> .xi.<numeral> .xi.
<English textline>
56
<English explanation>
::=
<<TAB>> .xi.<numeral> .xi. <English textline>
57
<English explanation>
::=
<English explanation>.eta.
<English textline region>
58
<<FIG>> ::=
.vertline.FIG..vertline.
59
<<TAB>> ::=
.vertline.Table.vertline.
60
<<word-figure>>
::=
.vertline.figure.vertline.
61
<word-table> ::=
.vertline.table.vertline.
63
<<VOL>> ::=
.vertline.VOL.vertline.
64
<<NO>> ::=
.vertline.No..vertline.
73
<<UDC>> ::=
.vertline.U. D. C..vertline.
66
<table> ::=
<box> .gamma.<table>[.delta. <table>]
67
<table> ::=
<box> .delta.<table>[.gamma. <table>]
68
<table> ::=
<box>
69
<<reference>> ::=
.vertline.reference.vertline.
70
<reference list>
::=
<Japanese reference>[.eta. <reference list>]
71
<reference list>
::=
<English reference> [.eta. <reference list>]
72
<Japanese reference>
::=
<numeral> .xi.<<PR>> .xi.<Japanese textline>
73
<Japanese reference>
::=
<Japanese reference>[.eta.
<Japanese textline group>]
74
<English reference>
::=
<numeral> .xi. <<PR>>.xi.<English textline>
75
<Japanese reference>
::=
<English reference>[.eta. <English
textline group>]
__________________________________________________________________________
The document grammar described above expresses the structure of an ordinary
document, but particularly extracts the portions relating to the technical
paper. The grammar will now be explained with reference to the example
shown in FIG. 1. First of all, the symbols used will be explained.
______________________________________
< > nonterminal symbol
(abstract concept)
<<>> terminal symbol
(character string)
{ } terminal symbol
(character string in dictionary)
.vertline.- -.vertline.
terminal symbol
(substructure in image)
::= rewriting rule
.vertline.
OR (or)
[ ] omissible
______________________________________
+, .xi., .eta., .alpha., .beta., .circleincircle. , .gamma., .delta. are
operators between substructures.
The operators are explained as follows: The operator+represents that a
paper of some document continues to other page(s) of the document. The
operator .eta. represents that a subregion in an image region is
vertically neighboring with another subregion in the region. The operator
.xi. represents that a subregion in an image region is horizontally
neighboring with another subregion in the region. The operator .alpha.
represents that a subregion, especially a character, in an image region is
neighboring horizontally with another subregion, especially a character,
in the region. The operator .alpha. is different from .xi. in that the two
subregions are touching each other. The operator .beta. represents that a
subregion, especially a character, in an image region is neighboring
horizontally with another subregion, especially a character in the region
horizontally. This operator .beta. is different from the operators of
.delta. and .xi. as stated above in that the two subregions are placed in
vertical kerning positions. The operator .circleincircle. represents that
a subregion is surrounded by other subregion, such as a rectangle, in an
image region. The operator .gamma. represents that a square subregion in
an image region is neighboring horizontally with another subregion, where
the two subregions are touching each other. The operator .delta.
represents that a square subregion in an image region is neighboring with
another subregion vertically, where the two subregions are touching each
other.
The first rule of the grammar described above expresses that various kinds
of documents are available and the technical paper is one of the kinds.
The second rule expresses that a technical paper consisting only of a
title page (FIG. 1, 1) exists, and the third rule represents that an
arbitrary number (inclusive of 0) of pages may be added to the last of a
certain paper. The fourth rule represents that on the title page, a title
content (FIG. 1, 3) lies below a UDC symbol, that is, universal decimal
classification (FIG. 1, 2), "author abstract" (FIG. 1, 4) lies below the
former, followed then by the text (FIG. 1, 7) and finally "page number"
(FIG. 1, 9). Here, the "author abstract" represents that the "author
group" (FIG. 1, 6) exists on the right side of the abstract (FIG. 1, 5) as
shown in the 12th rule. Furthermore, the abstract is "English textline
region" as shown in the 13th rule. The author group may consist of one
author as shown in the 14th rule, or may consist of a plurality of authors
by adding other authors (in an arbitrary number) below the author group as
shown in the 15th rule. The author consists of a horizontal combination of
Japanese textline (person name) with English textline (person name) as
shown in the 16th rule. Since the text (FIG. 1, 7) is provided on vertical
halves one page in this embodiment, the concept of "column" (FIG. 1, 8) is
introduced so that the text consists of a horizontal combination of the
columns, as shown in the 18th rule. Each column consists of a continuation
of sections as shown in the 19th rule. Section text consists of paragraphs
as shown in the 26th rule, and the paragraphs are either Japanese textline
groups or figure-tables as shown in the 27th and 28th rules. The Japanese
textline consists of a horizontal continuation of Japanese characters via
.xi., .alpha. and .beta. as shown in the 30th to 32nd rules. Here, .xi.
represents a simple horizontal continuation, .alpha. does horizontal touch
and .beta. does horizontal over-up, and any of them will occur. The
Japanese character includes hiragana, katakana, kanji, alphabet, numeral,
and the like, as shown in the 45th rule.
To understand a document, an input document is first assumed to be the
first document in the rule described in the document grammar, i.e., a
technical paper, and it is tested to determine whether the assumption can
be confirmed. To confirm the assumption that the input document is a
technical paper, the input document must be one of the plurality of a
title page (rule 2) or a continued page (rule 3). Thus, the subsidiary
assumptions must be tested, one of which is that the input document is a
title page, and the other is that the input document is a continued paper.
If neither of these two assumptions are confirmed, then the first
assumption, that the input document is a technical paper, is judged to be
false and the next assumption, that the input document is a paperback
novel, is tested. Continuing this process until some assumption is
confirmed, the input document will be "understood", i.e., it is identified
as one of the document types defined in the document grammar. If no
assumptions are confirmed, then understanding the input document fails and
the document will be rejected. It must be noted that to confirm one of the
subsidiary assumptions another subsidiary assumption is generated, and so
on. However, ultimate assumptions which can not be resolved anymore will
be reached at the last of the sequences of assumptions and will be easily
tested because they are related to basic constituents of document images,
such as characters, line drawings or photographs, and they can be tested
using character recognition, line drawing recognition or photograph
separation techniques. At each assumption, different image processing is
applied to each operator. For example, since the operator .eta. represents
that the substructures continue vertically, processing for detecting the
continuation of the vertical substructure corresponds to this operator
.eta.. As an example of such processings, there is a processing which
detects the continuation of horizontal white pixels. Similarly, a
processing which detects the continuation of the vertical white pixels and
segments a character corresponds to .xi., and a processing which detects
the inclined continuation of the white pixels and then segments the
character corresponds to .beta..
As stated above, different rules are selected automatically to confirm a
more global assumption and the image processing modules, each of which
corresponds to each operator, are involved to test each assumption at
various levels.
As can be understood from the description given above, the document grammar
proposed by the present invention describes the structure of a complicated
document hierarchically and recursively. Therefore, this grammar can
describe those objects which have not conventionally been easy to
describe, such as those having an indefinite number of textlines and those
having substructures whose appearance is indefinite. Understanding of a
wide variety of documents can be made by describing the physical relation
of the substructures by means of the operators and then verifying the
relation expressed by the operators by image processing.
Hereinafter, preferred embodiments of the invention will be described in
detail with reference to the drawings.
FIG. 2 is a block diagram showing the construction of an apparatus which
employs a document processing system in accordance with one embodiment of
the present invention. Each constitutent portion of the apparatus is
connected by a bus 101, and the overall operation of the apparatus is
controlled by a control unit 102. The information (document image) on the
document 103 is scanned by a photo-electric conversion device 104, is
digitized and is then stored in a memory 1051 through the bus 101. The
memory 1051 constitutes a part of a memory 105 in cooperation with
later-appearing memories 1052, 1053 and 1054. Heretofore known efficient
coding may be effected when digitizing the document information, and the
memory capacity of the memory for storing the document image can be saved
by so doing.
In the description to follow, digitizing is effected for one pixel per bit,
but one pixel may be expressed by a multi-value, and may further be
provided with color information by effecting photo-electric conversion
using a color scanner. The normalized image which is obtained by applying
heretofore known correction of position and correction of rotation to the
document image by the control unit 102 is stored in the memory 1052.
Document understanding is effected to this normalized image by the program
control of the control unit 102 in the following manner, and the result of
understanding is applied to a file device 106.
FIG. 3 is a flowchart showing the flow of processing of document
understanding in a PAD (Problem Analysis Diagram) style.
Before explanation of the figure, it will be necessary to explain the PAD
style to represent flows of processing. In a PAD, units of processes are
represented as square boxes and placed incident to vertical lines
according with the sequence of processing. Two special boxes are used to
control the processing flow. One of them is a control of repetition and
represented as a box with a pair of vertical lines at the right side of
the box, where a horizontal line runs in a right direction from the center
of the right side of the box. The processes included in the repetition are
placed at the right hand side of the control box and connected with
horizontal and vertical lines. There are three types of repetition which
are a finite loop of so-called "DO" type, and two infinite loops of
"INTIL" and "WHILE" types. These three types can be distinguished by the
sentences written in the box. Another special box is a control box used
for branching and is represented as a box whose right side has a saw-tooth
shaped line. Each corner point of the saw-tooth corresponds to a branch
selected by the condition(s) stated in the box. The equation(s) evaluated
in the test of branching is written in the box and the condition(s) of the
branching is written near the horizontal lines drawn from the corner
points of the saw-tooth shaped side.
In FIG. 3, at step 301 statements describing the formats of documents
written according to the document grammar stated above are read in from an
outside file device (not shown).
Step 302 is initialization of the whole. Step 303 is an iteration loop
which iterates the following processing until the end of the document. The
image of one page is applied at step 304. Step 305 is loop control which
interprets this page in accordance with the document grammar. At step 306,
one statement is extracted, and parsing is effected at steps 307 and so on
and whether or not this one statement is to be accepted or rejected is
decided. Initialization of the stack used for the subsequent parsing
operation is effected at 307. The stack is placed in the memory 1054. Step
308 controls the flow of processing from step 309 to step 313. Step 309
detects the existence of operators, and is a group of branches to the
processings that correspond to the operators 3091-3093, respectively.
3091, 3092 and 3093 are image processing corresponding to the operators
.xi., .eta. and .alpha., respectively. These image processings will be
described in detail elsewhere. Step 310 detects whether or not the
operator(s) exists, and if not, the processing exists at step 313 from the
loop of 308 and so on and shifts to the processing (307) of the next
textline. If the operator exists, the stack is pushed down and the
operator is placed on the top at step 311, and step 312 detects the
existence of a substructure. Detection of the substructure consists of a
portion 3121 which identifies the terminal symbol and a portion 3122 which
identifies the nonterminal symbol.
The processing of 3122 is made by recursively effecting the processing of
step 307 and so on for part of the statement. Identification of the
terminal symbol is a processing which effects character recognition in the
case of numerals, for example, and decides whether or not the recognized
result belongs to the group of numerals.
When the interpretation of all the substructures and operators is completed
in the manner described above, understanding of this page in the document
is completely finished. The result of document understanding includes the
substructures in the stack (memory 1054) and its content (character
string), and the operators between the substructures. After being
converted to prescribed codes at step 314, these results are outputted to
the file device 106. If interpretation is not possible in any statement of
the grammar, this document can not be understood. This is the case where
the procedure exits from the loop at step 313 for all the textlines, and
this state is decided at step 316. If the document can not be understood,
a reject procedure is effected at step 317. For instance, the final result
of document understanding is displayed on a display 107, and is corrected
by man-machine interaction using a keyboard 108.
FIG. 4 is a flowchart expressing image processing for the operator .eta.
described at 3092 in FIG. 3, that is, a processing which detects the
horizontal continuation of the white pixels, in the PAD style. In FIG. 4,
step 401 is an entry to the main processing, and a normalized image Q
stored in the memory 1052 is given. At step 402, processings of steps
403-409 are iterated for the ordinance of scan line j to obtain the sum of
black pixels in a long run A (j). Step 403 is an initialization step. Step
404 decides whether the pixel Q (i, j) in the scan line is 1 or 0, and if
it is 1, the run length B of black pixels is counted at step 406. If Q (i,
j) is 1, summation processing is effected at step 408 when the run length
B till the previous pixel is found greater than a threshold .epsilon. by
the decision at step 407, and the sum length B is reset at step 409. After
completion of the loop, B is added to the sum A (j) at step 410, because
in the loop from the step 404 and so on, summation at the rightmost pixel
(i=I-1) is not effected. Since the decision of step 407 is added,
summation is effected for A (j) only when a relatively long run of black
pixels exists, so that the influence of noise is not so much great.
The procedure from steps 411 to 420 is a processing which detects that a
region smaller than the threshold .delta.1 in the A (j) is interposed by a
region greater than the threshold .delta.2. Step 411 is initialization of
flags F1, F2. Step 412 iterates the procedures of 413-419 for the ordinant
of scan line j. Step 413 detects that A (j) at a first time goes over the
threshold .delta.2, and the flag F1 is set at step 414. Step 415 detects
that A (j) at a first time goes under the threshold .delta.1 under the
state of F1=1, and the flag F2 is set at step 416 and at the same time, j
at this time is stored as j1. Step 417 detects the point at which A (j)
goes over the threshold .delta.2 under the state F2=1, and the previous
value of J1 is stored as j2 at step 418 and procedure exits from the loop
of 412 and so on. Step 420 is a branch and selects a step 421 on success
of detection of the operator .eta. and a step 422 on failure. Whether the
detection has succeeded or fails can be decided by seeing the flat F2,
because F2=1 represents that both the beginning point j1 and the ending
point j2 of a white area which is sufficiently wide horizontally and
separates two black regions have been found and, on the contrary, F2=0
represents that such points have not been found. Step 421 is an exit when
the detection of the operation has succeeded and parameters F2, J1 and j2
are passed to the routine outside to show the existence of the operator
.eta. and the positions where the operator exists. Step 422 is an exit
when the detection of the operator .eta. failed and parameters F2, j1 and
j2 are also passed to the outside, but in this case only F2 has a meaning
and j1 and j2 have no sense.
Next, the second embodiment of the present invention will be described.
Though this embodiment is realized by the same block diagram as that of
the first embodiment, the document grammar to be used is somewhat
different. In other words, the operators representing the relation between
the substructures such as .xi., .eta., .alpha., .beta., .circleincircle. ,
.gamma., .delta. are connected by parameters representing the physical
quantities, and are expressed for example, in the following way.
.xi.(1, 5), .eta.(3, 10), . . .
In this case, .eta. (3, 10) represents that clearance of at least 3 mm and
up to 10 mm exists in the vertical direction. The flowchart of the first
embodiment for detecting the operator .eta. (FIG. 4) is changed to FIG. 5.
In FIG. 5, the procedures from the steps 501 to 519 are the same as those
of steps 401 to 419 of FIG. 4. Step 520 decides that the run of white
pixels detected at steps 512-519 is from 3 to 10. Step 521 is the same as
the step 420. The statement of the document grammar used in the second
embodiment is somewhat more complicated than the statement of the document
grammar of the first embodiment, but it has the advantage that the
erroneous judgement in document understanding can be more easily avoided.
This grammar is suitable for processing of documents having relatively
less fluctuation of formats.
Next, the third embodiment of the present invention will be described.
Though this embodiment can be realized by the same block diagram as that
of the first embodiment (FIG. 3), the flow of control is different from
that of the first embodiment (FIG. 3), and is such as shown in FIG. 6.
FIG. 6 is a flowchart representing the flow of processing of document
understanding in the third embodiment in the PAD style as explained with
reference to FIG. 3. First of all, statements written with document
grammar are read into the memory 53 from the file device (not shown) at
step 601, and initialization of the whole system is effected at step 602.
Step 603 is an iteration loop which iterates the following procedures till
the end. The image of one page of the document is inputted at step 604,
and an image processing routine is invoked at step 605. At this time,
which area in the image is to be processed is designated. The image
processing routine operates in parallel with the interpretation of
statements, which will be described below, using multiprogramming or a
multiprocessor, directly extracts figures, tables, characters and other
terminal symbols from the image to be processed, and presents the data,
which represent extraction, into a specific address in the memory.
Step 606 is a control loop which interprets this page in accordance with
the document grammar. At step 607, the result of image processing is
examined, and the statement describing its substructure is searched at
step 608 in accordance with the extracted result. Since this processing is
carried out in parallel with image processing, completion of the image
processing must be awaited.
Step 609 effects initialization of the stack used for the subsequent
processing. Step 610 processes for the textline, and controls the flow of
processing from steps 611 to 615. Step 610 is the same as step 309 in FIG.
3. Step 612 detects the existence of the operators and if they do not
exist, the procedure exits from the loop of steps 609 and so on. If they
exist, the stack is pushed down and the operator(s) is put on the top. At
step 613, one of the operators, say .eta., is detected and the parameters
showing the detection of the operator are put on the stack pushing down
the stack before doing so. At step 614, the parameters of substructures,
such as j1 and j2 for .eta. (in FIG. 4), are extracted by known methods
and the image region is divided into substructures using these parameters.
At step 615, if the detection of no operator has failed, exit from the
loop immediately. In this case, the stack has not been changed and the
failure of detection is represented implicitly by no existence of the
operator at the top of the stack. Detection of the substructure is
effected recursively for the rest of images other than the portion
detected by the image processing routine, but since it is fundamentally
the same as in FIG. 3, it is hereby omitted. Steps 616 (output of the
final result of document understanding) and 617 are the same as those in
FIG. 3. Although the third embodiment is more complicated than the first
embodiment, the processing is faster in this embodiment because the
results of image processing excitates the parsing of the grammar so that
the irrelevant portion of grammar may not be parsed.
Next, the parsing method will be explained before the description of the
fourth embodiment of the present invention. FIG. 7 shows an example of one
page of a technical paper having a predetermined format. Though the
following description is directed to technical paper, the present
invention can be applied also to other documents by changing a part of the
document grammar because the form of grammar is somewhat different.
Therefore, the present invention is not particularly limited to this
example of the technical paper.
The following is an example of a grammar describing the structure of the
document (hereinafter referred to as the "document grammar").
______________________________________
(defform F
(form F1 (10 90 10 40))
(form F2 --)
(form F3 --))
(defform F1
(form F11 (10 90 10 50))
(form F12 (10 90 60 90)))
(defmac LINE-1 (% 1)
(point ? Y1 (mode IN Y LESS)
(point ? Y2 (mode OUT Y LESS)
(form % 1 (0 ? W ? Y1 ? Y2)))
______________________________________
The grammar described above will be explained with reference to the example
of FIG. 7.
The first symbol "deform F . . . " represents that the format F consists of
a format F1 and a horizontal continuation of formats F2 and F3 below the
format F1 and shown in FIG. 8. In FIG. 7, the portions of F, F1, F2 and F3
corresponding to FIG. 8 are encompassed by dashed line. The four numeric
values in the parentheses 10, 90, 10, 40 next to the format F1 represent
the position of the region of the format F1 when the full region
corresponding to the format F is expressed as 100.times.100. Here, the
coordinate system has its origin at the upper left. The numeric values
representing the region are a minimum value of X-ordinate, a maximum value
of X-ordinate, a minimum value of Y-ordinate and a maximum value of the
Y-ordinate. When the parameter values are already known as in this
embodiment, the values may be directly written. Similarly, the formats F2
and F3 are described by rectangular regions.
The next symbol "deform F1 . . . " represents that the format F1 consists
of formats F11 and F12 that are located vertically. In other words, the
region of the format F11 in the Y direction is from 10 to 50, and that of
the format F12, from 60 to 90. The positions of the regions of the formats
F11 and F12 are described in the coordinate system using the origin at the
upper left of the format F1. Therefore, when viewed from the format F, it
is a relative coordinate system.
In the manner described above, when the format is described by the
rectangular region and is described hierarchically as a group of the
regions one after another, the image can be described in a general form.
It is of course possible to describe by the absolute coordinate system
with the format F being the reference without using the hierarchical
expression, as shown in FIG. 9. In such a case, the rectangular regions
can be designated in the following way in the same way as in FIG. 8.
______________________________________
(deform F
(form F11 (18, 82, 13, 25))
(form F12 (18, 82, 28, 38))
(form F2)
(form F3)
______________________________________
The subsequent symbols "defmac LINE-1 (%1)" and so on are definition of
macro-statement. The following description of the three textlines as the
main body of the defination of macro-statement expresses that the first
line from above the rectangular region is format %1.
______________________________________
(point ?Y1 (mode IN Y LESS))
(point ?Y2 (mode OUT Y LESS))
(form %1 (0 ?W ?Y1 ?Y2))
______________________________________
Here, symbol ?W represents the vertical size (height) of the format and
symbol ?H does the horizontal size (width) of the format. Symbols ?Y1 and
?Y2 are variables that are identified by search, as will be described
next.
Symbol "point" represents the search of a point that satisfies a certain
condition, and substitution into the variable. The search condition is
designated by "mode". "IN.multidot.OUT" represents that the search point
is a change point from a region of white pixels to a region of black
pixels, or a change point from the black pixels to the region of the white
pixels. "Y" represents the axis of search and "LESS" does the search
direction. Symbol "area" represents a region within the range of search.
The search method will be explained about the case of the statement of the
definition of macro-statement by way of example, with reference to FIG.
10.
Symbol (A) represents that the textline "Title . . . , Author . . . "
exists in the format. (B) and (C) presents the coordinate values of these
textlines in the Y direction, that is, the first and second lines. The
first line exists from ?Y1 to ?Y2, and the second line exists from ?Y3 to
?Y4. As described above, (B) is the macro-statement that defines that the
format of the first line is %1, and (C) is a macro-statement defining that
the format of the second line is %1. The usage of these macro-statement is
as follows.
__________________________________ | | |