BACKGROUND OF THE INVENTION
(1) Field of the Invention
The present invention relates generally to a document processing apparatus,
and more specifically to a document processing apparatus which can form a
logical document architecture with respect to chapters, itemized
statements, paragraphs, etc. of a document.
(2) Description of the Prior Art
In general, a document is divided into a plurality of blocks, and headings
are assigned to the respective blocks to facilitate reading of the
document. Further, each block is divided into subblocks, and subheadings
are assigned to the respective subblocks. The headings and subheadings are
composed of short sentences, and additionally heading symbols are often
added to these introductory portions, for instance, such as "Chapter 1" or
"Section 3", respectively. When documents having the hierarchical
structure as described above are processed by a computer, the following
problem arises: In the conventional document processing systems, since
document data are processed in units of frames on the display or of pages
of printing sheets, where a given chapter is required to be replaced with
another, both the start and end positions of the document data to be moved
elsewhere should be designated by the cursor. In this case, if the
document data of any given chapter is long, the display screen must be
scrolled many times from the start position to the end position to be
designated. The above screen scrolling is troublesome and tends to result
in operational errors.
When an operator drafts a document, he often wishes to see the previous
sentences, for instance, to check the contents of the previous sentences
and the kinds of the previous heading symbols. In this case, he must guess
the page and the position of the line which include the required sentence
and heading symbol to be checked, and thereafter must search for the desired
sentence and heading symbol. The above search operation is troublesome and
therefore the document drafting efficiency is greatly degraded.
To solve the above-mentioned problem, the same applicant and the same
inventors have already filed a novel document processing apparatus which
comprises document data inputting means; heading dictionary means; heading
rule dictionary means; heading deciding means; document architecture rule
dictionary means; and document architecture deciding means.
The above document processing apparatus can prepare a logical document
hierarchical architecture list by handling documents in units of items in
order that the operator can readily designate any given headings, itemized
statements, paragraphs for easy document editing.
In the above document processing apparatus, however, since document
architecture is decided in accordance with only the document architecture
dictionary, there exists a problem in that a heading may be decided
erroneously. For instance, a heading "2.2 Class Training" can be decided
as an addition of "2.2" (Heading Symbol) and "Class Training" (Heading
Word) or of "2" (Heading Symbol) and "2 class Training" (Heading Word).
Further, where itemized statements "1 . . . ", "2 . . . ", "3 . . . ",
and "4 . . . " exist under a chapter heading "4 . . . " and further "5 . .
. " follows, the "5 . . . " can be decided as a chapter heading or an
itemized heading.
Therefore, when the document architecture univocally decided by the
computer is different from that intended by the user, the user should
modify the document architecture to the intended one, thus resulting in
poor operability.
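By way of illustration only, the second ambiguity described above can be sketched as follows. The function name and its arguments are hypothetical and do not appear in the prior apparatus; the sketch merely assumes that a numbered heading may continue either the last chapter number or the last item number.

```python
def level_candidates(number, last_chapter, last_item):
    """Return every level that a heading numbered `number` could continue."""
    candidates = []
    if number == last_chapter + 1:
        candidates.append("chapter heading")
    if last_item is not None and number == last_item + 1:
        candidates.append("itemized heading")
    return candidates


# Chapter "4" holds items "1" to "4"; the next line begins with "5".
print(level_candidates(5, last_chapter=4, last_item=4))
```

For the heading "5 . . . " both labels are returned, which is the situation in which a single automatic decision may differ from the user's intention.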
SUMMARY OF THE INVENTION
With these problems in mind, therefore, it is the primary object of the
present invention to provide the document processing apparatus with an
additional function such that a plurality of document architecture
candidates can be decided and the operator can readily select any desired
one of the candidates, thereby providing better operability.
To achieve the above-mentioned object, the document processing apparatus
according to the present invention comprises:
a unit for inputting document data; heading dictionary unit for storing
words and phrases frequently used as a heading; heading candidate
extraction unit for extracting, as a heading candidate, the word and
phrase corresponding to the heading stored in said heading dictionary unit
from the document data input through said input unit; heading rule
dictionary unit for storing rules for deciding the headings; heading
deciding unit for checking the heading candidate extracted by said heading
candidate extracting unit in accordance with the heading rule stored in
said heading rule dictionary unit and for deciding whether the heading
candidate is a heading; document architecture rule dictionary unit for
storing rules associated with document logical architectures; document
architecture deciding unit for deciding the document architecture of a
heading decided by said heading deciding unit and the sentence decided as
a non-heading in accordance with document architecture rules stored in
said document architecture rule dictionary unit; and, in particular,
document architecture selecting and indicating unit for allowing an
operator to select a desired document architecture when the document
architecture deciding unit decides a plurality of document architecture
candidates in accordance with document architecture rules.
The document architecture selecting and indicating unit comprises rule
application deciding unit and candidate selecting and indicating unit. The
rule application deciding unit is accessible to the document architecture
rule dictionary to check a rule name requesting candidate selection and to
retrieve flags corresponding to the rule name from a table. The candidate
selecting and indicating unit is accessible to a candidate selecting key
provided in the document input unit to update flags so that any desired
document architecture can be selected.
Further, the document architecture rule dictionary stores rule application
record information indicative of past rule application situations, and the
document architecture deciding unit decides an architecture rule to be
applied with reference to the stored rule application record information.
The above record information can be updated.
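One possible way of using such record information is sketched below. The text does not state how the record is consulted, so the frequency-based choice, the class name RuleRecord, and the rule names "d1" and "d2" are assumptions made purely for illustration.

```python
from collections import Counter


class RuleRecord:
    """Hypothetical store of past rule applications."""

    def __init__(self):
        self.counts = Counter()

    def update(self, rule_name):
        # Record that the operator accepted this rule once more.
        self.counts[rule_name] += 1

    def preferred(self, candidates):
        # Choose the candidate applied most often so far;
        # on a tie the earlier candidate in the list is kept.
        return max(candidates, key=lambda name: self.counts[name])
```

With such a record, a rule that the operator has repeatedly selected would be proposed first the next time the same choice arises.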
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of the document processing apparatus according to
the present invention;
FIG. 2 is an example of documents;
FIG. 3 is an exemplary heading word dictionary;
FIGS. 4A to 4D are an exemplary heading rule dictionary;
FIG. 5 is a flowchart showing the operation procedure of the apparatus
shown in FIG. 1;
FIGS. 6A to 6F show an exemplary sequence of logical document architecture
lists stored in the logical architecture storage;
FIGS. 7A to 7C show a few examples of application of heading rules shown in
FIGS. 4A to 4D to the document shown in FIG. 2;
FIG. 8 shows an example of stored rule tables including flags corresponding
to rule names; and
FIG. 9 is a flowchart showing a procedure of the operation of the rule
application decision section.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
With reference to the attached drawings, the document processing apparatus
according to the present invention will be described in detail
hereinbelow:
With reference to FIG. 1, a document processor 1 is connected to an input
device 2 including a keyboard 2a for achieving centralized handling and
processing of input documents. The document processor 1 is also connected to
an original document storage 3 for storing input original documents and to
a display controller 4 for causing a display 5 to indicate the input
original document read out from the storage 3. The document processor 1 is
further connected to a heading extractor 6, a heading decision section 8,
a document architecture decision section 9, and a logical architecture
storage 10. The heading extractor 6 is connected to a heading word
dictionary 7 for storing many types of words representative of headings.
The heading decision section 8 includes a heading rule dictionary 8a. The
document architecture decision section 9 includes a document architecture
rule dictionary 9a.
The document processor 1 sequentially detects document data segmentation
codes stored in the original document storage 3, for example, such as a
line return code, and extracts sentences segmented by the segmentation
code. In this case, the document processor 1 measures each sentence
length. The extracted sentences are sequentially sent to the heading
extractor 6. The heading extractor 6 decides the heading word by
comparison of the input sentence with heading words stored in the heading
word dictionary 7, and the sentence length.
The heading word dictionary 7 stores frequently used words, phrases and
symbols, all of which are defined as heading words. The words, phrases and
symbols are classified into categories, as shown in FIG. 3, and are
registered in advance in the dictionary 7. Words such as "introduction"
and "abstract" are registered in a category of "reserved heading word". In
addition, frequently used numerals and symbols are also registered as
heading words being classified into the respective categories.
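A minimal sketch of such a categorized dictionary is given below. FIG. 3 is not reproduced here, so the entries other than "introduction" and "abstract", and the exact category labels for numerals and punctuation, are assumptions.

```python
# Hypothetical contents; only the "reserved heading word" entries are
# taken from the description.
HEADING_WORD_DICTIONARY = {
    "reserved heading word": {"introduction", "abstract"},
    "numeric portion": {str(n) for n in range(1, 10)},
    "punctuation portion": {"."},
}


def category_of(token):
    """Return the category under which a token is registered, or None."""
    for category, words in HEADING_WORD_DICTIONARY.items():
        if token.lower() in words:
            return category
    return None
```

A lookup thus yields both the fact of registration and the category, which the later heading rules operate on.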
The heading extractor 6 decides whether the number of characters of an
extracted sentence is less than a predetermined number. If so,
the extractor 6 decides whether an extracted heading word (a word and/or
phrase, and/or numeral and/or symbol represented as a code string)
corresponds to one of the words registered in the dictionary 7. If a
correspondence is detected, the extracted word is recognized as the
corresponding heading word.
The extracted words decided by the extractor 6 as being the heading word
are input, one by one, to the heading decision section 8, under the
control of the processor 1. The heading decision section 8 decides, in
accordance with the heading rules (FIGS. 4A to 4D) stored in the
dictionary 8a, whether the recognized heading word is a heading word or
another word.
The word discriminated as the heading word or any other word by the heading
decision section 8 is input to the document architecture decision section
9 under the control of the processor 1. The architecture decision section
9 decides whether the sentence or word sent from the heading decision
section 8 is a chapter heading, a section heading, or a paragraph, in
accordance with the document architecture rules (shown below) stored in
the document architecture rule dictionary 9a:
TABLE 1
______________________________________
Rules for Heading
______________________________________
Condition 1: A reserved word is not included.
Condition 1-1: A heading word is included.
Condition 1-1-1: A reserved heading word is included.
Condition 1-1-1-1: A chapter heading is not included in
the previous part.
(Result) Indicates a chapter heading.
.fwdarw.
A symbol portion, an alphanumeric
portion, a punctuation portion, or a
tail symbol is defined as a main
heading pattern.
Condition 1-1-2: A reserved heading word is not
included.
Condition 1-1-2-2: A chapter heading is present in the
previous part.
Condition 1-1-2-2-1:
Matching with a chapter heading
pattern is successful.
(Result) Indicates a chapter heading.
.fwdarw.
The order of the chapter heading
pattern is incremented by one.
Condition 1-1-2-2-2:
This heading pattern does not match
the previous chapter heading.
Condition 1-1-2-2-2-1:
An itemized pattern is not present in
the previous part.
(Result) Indicates an itemized pattern
candidate.
Condition 1-1-2-2-2-2:
An itemized pattern is present in the
previous part.
Condition 1-1-2-2-2-2-1:
This heading pattern matches the
itemized pattern candidate.
(Result) Indicates an itemized pattern.
The order of the itemized pattern is
incremented by one.
______________________________________
TABLE 2
______________________________________
Rules for Matching with Heading Patterns
______________________________________
Condition 1-1:
An alphanumeric portion is included.
Condition 2-1:
Alphanumeric portions are of the same
kind.
Condition 3-1:
The order of the alphanumeric portion
is higher by one than that of a
heading pattern.
Condition 4-1:
A symbol portion, factors excluding
the order of an alphanumeric portion,
a punctuation portion, a tail symbol,
and the presence/absence of
parentheses in the heading word are
the same as those of the heading
pattern.
(Result) Indicates successful matching.
Condition 4-2:
A symbol portion, factors excluding
the order of an alphanumeric portion,
a punctuation portion, a tail
symbol, and the presence/absence of
parentheses in the heading word are
the same within the range of the
error pattern rules.
(Result) Indicates successful matching.
Condition 4-3:
A symbol portion, factors excluding
the order of an alphanumeric portion,
a punctuation portion, a tail symbol,
and the presence/absence of
parentheses in the heading word are
not the same as those of the heading
pattern.
(Result) Indicates failed matching.
Condition 3-2:
The order of the alphanumeric portion
is equal to or incremented by two
from the order of the heading
pattern.
Condition 4-1:
A symbol portion, factors excluding
the order of an alphanumeric portion,
a punctuation portion, a tail symbol,
and the presence/absence of
parentheses in the heading word are
the same within the range of the
error pattern rules.
(Result) Indicates successful matching.
Condition 4-2:
A symbol portion, factors excluding
the order of an alphanumeric portion,
a punctuation portion, a tail symbol,
and the presence/absence of
parentheses in the heading word are
not the same as those of the heading
pattern.
(Result) Indicates failed matching.
Condition 1-2:
An alphanumeric pattern is not
included.
Condition 2-1:
A symbol portion, factors excluding
the order of an alphanumeric portion,
a punctuation portion, a tail symbol,
and the presence/absence of
parentheses in the heading word are
the same as those of the heading
pattern.
(Result) Indicates successful matching.
Condition 2-2:
A symbol portion, factors excluding
the order of an alphanumeric portion,
a punctuation portion, a tail symbol,
and the presence/absence of
parentheses in the heading word are
not the same as those of the heading
pattern.
(Result) Indicates failed matching.
______________________________________
TABLE 3
______________________________________
Paragraph Associated Format
______________________________________
Condition 1-1: A heading is not included.
(Result) Indicates a paragraph.
______________________________________
TABLE 4
______________________________________
Conjunction Associated Format
______________________________________
Condition 1-1:
A paragraph. Access rule application
decision section
Condition 2-1:
Applied flag information is X1.
(Result d.sub.1)
Set the level of the current heading
to the same as that of the previous
heading.
Condition 2-2:
Applied flag information is X2.
(Result d.sub.2)
Set the level of the current heading
to the same as that of the previous
chapter heading.
______________________________________
The logical architecture of a sentence or word, as determined by the
document architecture decision section 9 in accordance with the above
rules, is stored in the logical architecture storage 10.
The display controller 4 controls the display 5 to display the document
data according to the document logical architecture stored in the logical
architecture storage 10.
The operation of the document processing apparatus will now be described
with reference to the flow chart shown in FIG. 5. When document data is
input to the input device 2 (step a), the input document data is
sequentially stored in the original document storage 3. At the same time,
the input document data is segmented into a plurality of blocks by the
document processor 1, as shown in FIG. 2. In this segmentation processing,
line return codes and the like are treated as segmentation codes. The input
document data is segmented in units of blocks at the segmentation codes.
In this case, the segmentation sentence length is measured by counting
characters. If the measured value falls within a predetermined value
(e.g., 40 characters), the sentence is determined as having the
possibility of being a heading sentence.
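The segmentation and length check described above may be sketched as follows. The function names are hypothetical, the line return code is assumed to be the newline character, and the limit of 40 characters is the example value given in the text, taken here as inclusive.

```python
MAX_HEADING_LENGTH = 40  # example threshold quoted in the description


def segment(document):
    """Split document data at line return codes, dropping empty blocks."""
    return [line.strip() for line in document.split("\n") if line.strip()]


def may_be_heading(sentence, limit=MAX_HEADING_LENGTH):
    """A short sentence has the possibility of being a heading sentence."""
    return len(sentence) <= limit
```

Only sentences for which may_be_heading returns True would then be passed on for dictionary lookup.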
If the segmented sentence is determined as having the possibility of being
a heading sentence according to the measured number of characters, the
heading extractor 6 decides whether a character string (words, phrases, or
symbols) constituting the segmented sentence is registered in the heading
word dictionary 7 (step b). For example, when the sentence "1.
Introduction" in the input document data is extracted, it is checked as to
whether it is registered in the heading dictionary 7. In this case, "1",
"." and "Introduction" are retrieved from the heading dictionary 7, and
the sentence is determined as being a heading candidate A (step c).
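The lookup in step b can be illustrated by the following sketch. The tokenizing expression and the requirement that every token be registered are assumptions; the description only states that "1", "." and "Introduction" are retrieved for this example.

```python
import re


def tokenize(sentence):
    # Numerals, words, and single non-space symbols become separate tokens.
    return re.findall(r"\d+|[A-Za-z]+|[^\sA-Za-z\d]", sentence)


def is_heading_candidate(sentence, registered):
    """True when every token of the sentence is a registered heading word.

    `registered` is a set of lower-case dictionary entries (hypothetical).
    """
    tokens = tokenize(sentence)
    return bool(tokens) and all(t.lower() in registered for t in tokens)
```

For "1. Introduction" the tokens are "1", "." and "Introduction", matching the three entries named in the text.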
When a heading candidate decision is performed, the heading decision
section 8 accesses the heading rule dictionary 8a to determine whether the
candidate A is a heading word (step d). If the candidate A is defined by
any one of the rules shown in FIGS. 4A to 4D, the candidate A is
determined as being heading word B (step e). In this case, the type of
heading word is determined according to the applied heading rule.
If the sentence segmented by the document processor 1 does not correspond
to any heading word registered in the dictionary 7, or if the segmented
sentence does not coincide with any heading rule although it is determined
as being a heading candidate word, the segmented sentence is determined as
being a sentence not included in the heading word rules (step f).
The sentence determined as being a heading word, and the sentence
determined as not being a heading word are input to the document
architecture decision section 9 in order to determine their document
architecture. When the document architectures are determined, the decision
section 9 determines whether the sentence architectures correspond to
document architecture rules (Tables 1 to 4) stored in the rule dictionary
9a (step g). If the architecture of the input document is defined by one
of the document architecture rules, the document architecture data
corresponding to the determined rule is stored in the storage 10 (step h
and i).
With reference to the example of segmented sentences as shown in FIG. 2,
the above method of determining the document architecture will be
described in further detail. In the segmented sentences in FIG. 2, the
sentence of the first line, i.e., "document understanding system", and the
sentence of the second line, i.e., "Okawa Taro" are not stored in the
dictionary 7. These sentences are decided by the extractor 6 not to be
heading words. However, the sentence of the first line is defined by a
rule representing a noun phrase appearing at the head of the document, and
the decision section 9 decides that "document understanding system" is a
title. The sentence of the second line, "Okawa Taro" is a proper noun
representing a male name. Since the male name follows the title, the name
is determined as being an author's name.
The results obtained by the document architecture decision as described
above are stored in a form, as shown in FIG. 6A, in logical architecture
storage 10.
In the sentence of the third line, i.e., "1. Introduction", three words,
i.e., "1", ".", and "Introduction" are stored in the dictionary 7.
Therefore, this sentence is determined as being a heading candidate
sentence A1 (See FIG. 7A). At the same time, the categories constituting
this sentence are recognized as a numeric portion, a punctuation portion,
and a heading candidate word, respectively.
The heading decision section 8 accesses the heading rule dictionary 8a to
determine whether the sentence determined as being heading candidate A1 is
defined by the heading rules. In this case, the order of the categories
constituting candidate word A1 is analyzed. The decision section 8
determines whether the order satisfies any one of the conditions in FIGS.
4A to 4D. The first numeral "1" is defined by the rule d shown in FIG. 4D.
The numeral "1" and punctuation portion "." are defined by the rule b
shown in FIG. 4B. Therefore, "1." is determined as being a heading symbol
according to the rule b shown in FIG. 4B. "Introduction" is defined by the
rule c shown in FIG. 4C, and is determined as being a heading word. The
relationship between the heading symbol and the heading word is defined by
the rule a shown in FIG. 4A. The heading candidate A1 is thus decided as
heading B1. The above decision process is shown in FIG. 7A.
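The decision process just described resembles a repeated reduction of category sequences, which may be sketched as below. FIGS. 4A to 4D are not reproduced here, so the encodings of rules a and b are assumptions based only on this paragraph; rules c and d are assumed to have already labeled the individual tokens.

```python
# Hypothetical encodings: (sequence of categories) -> resulting category.
RULES = [
    (("numeric", "punctuation"), "heading symbol"),    # rule b
    (("heading symbol", "heading word"), "heading"),   # rule a
]


def reduce_categories(categories):
    """Apply the rules repeatedly until no rule matches any more."""
    seq = list(categories)
    changed = True
    while changed:
        changed = False
        for pattern, result in RULES:
            n = len(pattern)
            for i in range(len(seq) - n + 1):
                if tuple(seq[i:i + n]) == pattern:
                    seq[i:i + n] = [result]  # replace the match by its result
                    changed = True
                    break
    return seq


def is_heading(categories):
    return reduce_categories(categories) == ["heading"]
```

For "1. Introduction" the sequence numeric, punctuation, heading word reduces first to heading symbol, heading word and then to heading.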
In the above decision process, if the categories are not defined by the
rules a, b, c, d shown in FIGS. 4A to 4D, heading candidate A1 is
determined as not being a heading word.
The document architecture decision section 9 determines the document
architecture of heading B1 in accordance with the rules in Tables 1 to 4.
In this case, the logical architecture of the analyzed sentence is stored
in the storage 10, as shown in FIG. 6A. No chapter heading is indicated in
the stored logical architectures. Heading B1, i.e., "1. Introduction" is
defined by conditions (1), (1-1), (1-1-1), and (1-1-1-1) in Table 1 so
that "1. Introduction" is determined as constituting chapter heading C1 as
shown in FIG. 7A. According to this decision, the logical architecture
containing the chapter heading is stored in the logical architecture
storage 10, as shown in FIG. 6B.
Since the number of characters of the sentence of the fourth and fifth
lines shown in FIG. 2 exceeds the number for determining the possibility
of a sentence being a heading word, this sentence is therefore determined
as being other than a heading. As defined by the rule in Table 3, the
sentence of the fourth and fifth lines is determined as being a sentence
constituting a paragraph.
The sentence of the sixth line "2. Features of System" is recognized as
heading candidate A2 in the same procedures as for heading candidate A1.
In this case, the sentence of the sixth line is analyzed by the steps in
FIG. 7B and is determined as being a heading B2. The heading B2 is
compared with the rules in Table 2 to determine whether it coincides with a
specific one of the rules. The heading B2 is defined by conditions (1-1),
(2-1), (3-1), and (4-1), and is determined as having the possibility of
being of the same level as that of chapter heading C1 "1. Introduction".
In this way, it is determined whether the heading B2 is defined by the
rules in Table 1. In other words, "2. Features of System" satisfies
conditions (1), (1-1), (1-1-2), and (1-1-2-2-1), and thus, the heading
word B2 is determined as constituting chapter heading C2. The resultant
logical architecture data is stored in the storage 10 as shown in FIG. 6C.
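The matching of Table 2 used in this step may be sketched as follows. Only the exact-match branches (conditions 3-1 with 4-1 or 4-3, and 1-2) are covered; condition 3-2 and the error pattern rules are omitted because they are not specified in the text. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HeadingPattern:
    kind: Optional[str]     # kind of alphanumeric portion; None if absent
    order: Optional[int]    # order of the alphanumeric portion
    symbol: str = ""        # symbol portion
    punctuation: str = ""   # punctuation portion
    tail: str = ""          # tail symbol
    parenthesized: bool = False


def same_form(a, b):
    return (a.symbol, a.punctuation, a.tail, a.parenthesized) == \
           (b.symbol, b.punctuation, b.tail, b.parenthesized)


def matches(heading, pattern):
    if heading.kind is None:                  # condition 1-2
        return pattern.kind is None and same_form(heading, pattern)
    if heading.kind != pattern.kind:          # condition 2-1 not met
        return False
    if heading.order != pattern.order + 1:    # condition 3-1 not met
        return False
    return same_form(heading, pattern)        # condition 4-1 or 4-3
```

Under these assumptions "2." matches the pattern left by "1.", whereas a bare "1" without a period does not match the pattern left by "2.".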
The same processing as described above is performed for the sentences of
the seventh and subsequent lines, and the document architectures of these
sentences are stored in the storage 10, as shown in FIGS. 6D and 6E. More
specifically, for the sentence of the seventh line, heading candidate A3
is analyzed, as shown in FIG. 7C, and then is determined as being heading
B3 according to the rules shown in FIGS. 4A to 4D.
In the document architecture decision section 9, the heading B3 is compared
with the rules in Table 2. Since the pattern of heading B3 does not
previously appear, matching is unsuccessful. As a result, heading B3 is
determined as being a heading having a level different from those of the
previous headings. Heading B3 is checked in accordance with document
architecture rules in Table 1 and is found to coincide with conditions
(1), (1-1), (1-1-2), (1-1-2-2), and (1-1-2-2-2-1). Therefore, heading B3
is determined as being itemized heading C3.
Similarly, since the sentence of the eighth line satisfies conditions
(1-1), (2-1), (3-1), and (4-1), the level of the heading corresponding to
the sentence of the eighth line is determined as being possibly the same
as that of the itemized heading of the seventh line. The sentence of the
eighth line is determined as satisfying conditions (1), (1-1), (1-1-2),
(1-1-2-2), (1-1-2-2-2), (1-1-2-2-2-2), and (1-1-2-2-2-2-1) in Table 1 and
therefore determined as an itemized heading, being stored as shown in FIG.
6D.
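The sequence of decisions for the third, sixth, seventh and eighth lines can be traced with the following sketch of Table 1. The dictionary keys, the "form" strings and the simplified notion of matching (same form, order higher by one) are assumptions introduced for illustration.

```python
def classify(heading, state):
    """heading: dict with 'reserved' (bool), 'form' (str), 'order' (int).
    state: dict holding the latest 'chapter' and 'item' patterns, or None."""

    def continues(pattern):
        return (pattern is not None
                and pattern["form"] == heading["form"]
                and heading["order"] == pattern["order"] + 1)

    if heading["reserved"]:
        if state["chapter"] is None:               # condition 1-1-1-1
            state["chapter"] = heading
            return "chapter heading"
    elif state["chapter"] is not None:             # condition 1-1-2-2
        if continues(state["chapter"]):            # condition 1-1-2-2-1
            state["chapter"] = heading
            return "chapter heading"
        if state["item"] is None:                  # condition 1-1-2-2-2-1
            state["item"] = heading
            return "itemized pattern candidate"
        if continues(state["item"]):               # condition 1-1-2-2-2-2-1
            state["item"] = heading
            return "itemized pattern"
    return "undecided"
```

Applied in order to "1. Introduction", "2. Features of System", a heading numbered "1" without a period, and "2 High recognition rate", the sketch yields a chapter heading, a chapter heading, an itemized pattern candidate and an itemized pattern.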
With respect to the ninth line "This system is . . . ", it is possible to
consider this paragraph as having two cases or two candidates. That is,
the first case is that the ninth line is a part of the eighth line
itemized heading or "2 High recognition rate", while the second case is
that the ninth line is a paragraph having the same level as that of the sixth
line chapter heading or "2. Features of System".
Therefore, the apparatus according to the present invention is so
configured as to allow the operator to select any one of
the candidates.
To achieve the above-mentioned object, the apparatus further comprises a
rule application decision section 12 and a candidate selection indication
section 14 as depicted in FIG. 1.
The rule application decision section 12 can access the
document architecture rule dictionary 9a to check a rule name requesting
candidate selection and to retrieve flags corresponding to the rule name
from a table (not shown) whenever two or more candidates are decided. The
candidate selecting and indicating section 14 is accessible to a candidate
selection key arranged in the document input device 2 to update flags so
that any desired document architecture can be selected.
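The interplay of the flag table and the candidate selection key may be pictured as in the sketch below. FIG. 8 is not reproduced here, so the table layout, the rule name "conjunction", the initial flag X1 and the cyclic behavior of the key are all assumptions; only the flag values X1 and X2 and their results are taken from Table 4.

```python
FLAG_RESULTS = {
    "X1": "same level as previous heading",          # result d1 of Table 4
    "X2": "same level as previous chapter heading",  # result d2 of Table 4
}


class RuleApplicationTable:
    """Hypothetical flag table keyed by rule name."""

    def __init__(self):
        self.flags = {"conjunction": "X1"}

    def applied_result(self, rule_name):
        return FLAG_RESULTS[self.flags[rule_name]]

    def press_selection_key(self, rule_name):
        # Each key press advances to the next candidate, wrapping around.
        order = list(FLAG_RESULTS)
        current = order.index(self.flags[rule_name])
        self.flags[rule_name] = order[(current + 1) % len(order)]
```

In this sketch one press of the key switches the applied result from d1 to d2, and a second press returns to d1.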
In FIG. 5, when a decided candidate does not match with a single document
architecture rule or when plural candidates are created (in step g),
control allows the rule application decision section 12 to be accessible
to the document architecture rule dictionary 9a.
The above-mentioned candidate selection function is the feature of the
present invention.
As already explained, the document architecture decision section 9
determines whether the sentence architectures correspond to the document
architecture rules (Tables 1 to 4) stored in the document architecture
rule dictionary 9a. In some cases, however, the
determined heading candidate word matches a plurality of rules and
therefore it is impossible to univocally determine the document
architecture. In this case, a plurality of architecture candidates are
written in the logic