Description
BACKGROUND OF THE INVENTION
A portion of the disclosure of this patent document contains material that
is subject to copyright protection. The copyright owner has no objection
to the facsimile reproduction by anyone of the patent document or the
patent disclosure, as it appears in the U.S. Patent and Trademark Office
records, but otherwise reserves all copyright rights whatsoever.
1. Cross-References to Related Applications
The following concurrently filed and related U.S. applications are hereby
cross referenced and incorporated by reference in their entirety.
"Method for Determining Boundaries of Words in Text" to Huttenlocher et
al., U.S. patent application Ser. No. 07/794,392.
"Detecting Function Words Without Converting a Document to Character Codes"
to Bloomberg et al., U.S. patent application Ser. No. 07/794,190.
"A Method of Deriving Wordshapes for Subsequent Comparison" to Huttenlocher
et al., U.S. patent application Ser. No. 07/794,391.
"Method and Apparatus for Determining the Frequency of Words in a Document
Without Document Image Decoding" to Cass et al., U.S. patent application
Ser. No. 07/795,173.
"Optical Word Recognition by Examination of Word Shape" to Huttenlocher et
al., U.S. patent application Ser. No. 07/796,119, Published European
Application No. 0543592, published May 26, 1993.
"A Method and Apparatus for Automatic Modification of Selected Semantically
Significant Image Segments Within a Document Without Document Image
Decoding" to Huttenlocher et al., U.S. patent application Ser. No.
07/795,174.
"Method for Comparing Word Shapes" to Huttenlocher et al., U.S. patent
application Ser. No. 07/795,169.
"Method and Apparatus for Determining the Frequency of Phrase in a Document
Without Document Image Decoding" to Withgott et al., U.S. patent
application Ser. No. 07/794,555 now U.S. Pat. No. 5,369,714.
2. Field of the Invention
This invention relates to improvements in methods and apparatuses for
automatic document processing, and more particularly to improvements in
methods and apparatuses for recognizing semantically significant words,
characters, images, or image segments in a document image without first
decoding the document image and automatically creating a summary version
of the document contents.
3. Background
It has long been the goal in computer based electronic document processing
to be able, easily and reliably, to identify, access and extract
information contained in electronically encoded data representing
documents; and to summarize and characterize the information contained in
a document or corpus of documents which has been electronically stored.
For example, to facilitate review and evaluation of the information
content of a document or corpus of documents to determine the relevance of
same for a particular user's needs, it is desirable to be able to identify
the semantically most significant portions of a document, in terms of the
information they contain; and to be able to present those portions in a
manner which facilitates the user's recognition and appreciation of the
document contents. However, the problem of identifying the significant
portions within a document is particularly difficult when dealing with
images of the documents (bitmap image data), rather than with code
representations thereof (e.g., coded representations of text such as
ASCII). As opposed to ASCII text files, which permit users to perform
operations such as Boolean algebraic key word searches in order to locate
text of interest, electronic documents which have been produced by
scanning an original without decoding to produce document images are
difficult to evaluate without exhaustive viewing of each document image,
or without hand-crafting a summary of the document for search purposes. Of
course, document viewing or creation of a document summary requires
extensive human effort.
On the other hand, current image recognition methods, particularly
involving textual material, generally involve dividing an image segment to
be analyzed into individual characters which are then deciphered or
decoded and matched to characters in a character library. One general
class of such methods includes optical character recognition (OCR)
techniques. Typically, OCR techniques enable a word to be recognized only
after each of the individual characters of the word has been decoded, and
a corresponding word image retrieved from a library.
Moreover, optical character recognition decoding operations generally
require extensive computational effort, generally have a non-trivial
degree of recognition error, and often require significant amounts of time
for image processing, especially with regard to word recognition. Each
bitmap of a character must be distinguished from its neighbors, its
appearance analyzed, and identified in a decision making process as a
distinct character in a predetermined set of characters. Further, the
image quality of the original document and noise inherent in the
generation of a scanned image contribute to uncertainty regarding the
actual appearance of the bitmap for a character. Most character
identifying processes assume that a character is an independent set of
connected pixels. When this assumption fails due to the quality of the
image, identification also fails.
4. References
European patent application number 0-361-464 by Doi describes a method and
apparatus for producing an abstract of a document with correct meaning
precisely indicative of the content of the document. The method includes
listing hint words which are preselected words indicative of the presence
of significant phrases that can reflect content of the document, searching
all the hint words in the document, extracting sentences of the document
in which any one of the listed hint words is found by the search, and
producing an abstract of the document by juxtaposing the extracted
sentences. Where the number of hint words produces a lengthy excerpt, a
morphological language analysis of the abstracted sentences is performed
to delete unnecessary phrases and focus on the phrases using the hint
words as the right part of speech according to a dictionary containing the
hint words.
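The hint-word extraction attributed to Doi above can be pictured with a minimal sketch that operates on already decoded text. The function names, the naive sentence splitter, and the sample hint words are assumptions made for illustration only; the morphological pruning step mentioned above is not modeled.

```python
import re

def extract_hint_sentences(text, hint_words):
    """Return the sentences that contain at least one hint word, in order."""
    hints = {w.lower() for w in hint_words}
    # Naive sentence split after ., ! or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    selected = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if words & hints:
            selected.append(sentence)
    return selected

def make_abstract(text, hint_words):
    """Juxtapose the extracted sentences to form the abstract."""
    return " ".join(extract_hint_sentences(text, hint_words))
```

Note that this approach presupposes character-coded text, which is the dependency the present invention seeks to avoid.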
"A Business Intelligence System" by Luhn, IBM Journal, October 1958,
describes a system which, in part, auto-abstracts a document by
ascertaining the most frequently occurring words (significant words) and
analyzing all sentences in the text containing such words. A relative value
of the sentence significance is then established by a formula which
reflects the number of significant words contained in a sentence and the
proximity of these words to each other within the sentence. Several
sentences which rank highest in value of significance are then extracted
from the text to constitute the auto-abstract.
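A scoring scheme of the kind described for Luhn can be sketched as follows, again on decoded text. The particular formula used here (the square of the number of significant words divided by the span of tokens they cover) is an assumed stand-in that rewards both count and proximity; it is not quoted from the reference, and all function and parameter names are hypothetical.

```python
import re
from collections import Counter

def significant_words(text, top_n, stop_words=frozenset()):
    """Treat the top_n most frequent words as the significant words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in stop_words]
    return {w for w, _ in Counter(words).most_common(top_n)}

def sentence_score(sentence, significant):
    """Assumed formula: (number of significant words)^2 / span they cover."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    positions = [i for i, t in enumerate(tokens) if t in significant]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span

def auto_abstract(text, top_n=5, num_sentences=2, stop_words=frozenset()):
    """Return the highest scoring sentences, kept in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sig = significant_words(text, top_n, stop_words)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(sentences[i], sig),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```

Closely spaced significant words yield a small span and therefore a higher score, which is the proximity effect the reference describes.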
SUMMARY OF THE INVENTION
Accordingly, it is an object of the invention to provide a method and
apparatus for automatically excerpting and summarizing a document image
without decoding or otherwise understanding the contents thereof.
It is another object of the invention to provide a method and apparatus for
automatically generating ancillary document images reflective of the
contents of an entire primary document image.
It is another object of the invention to provide a method and apparatus of
the type described for automatically extracting summaries of material and
providing links from the summary back to the original document.
It is another object of the invention to provide a method and apparatus of
the type described for producing Braille document summaries or speech
synthesized summaries of a document.
It is another object of the invention to provide a method and apparatus of
the type described which is useful for enabling document browsing through
the development of image gists, or for document categorization through the
use of lexical gists.
It is another object of the invention to provide a method and apparatus of
the type described that does not depend upon statistical properties of
large, pre-analyzed document corpora.
The invention provides a method and apparatus for segmenting an undecoded
document image into undecoded image units, identifying semantically
significant image units based on an evaluation of predetermined image
characteristics of the image units, without decoding the document image or
reference to decoded image data, and utilizing the identified significant
image units to create an ancillary document image of abbreviated
information content which is reflective of the subject matter content of
the original document image. In accordance with one aspect of the
invention, the ancillary document image is a condensation or summarization
of the original document image which facilitates browsing. In accordance
with another aspect of the invention, the identified significant image
units are presented as an index of key words, which may be in decoded
form, to permit document categorization.
Thus, in accordance with one aspect of the invention, a method is presented
for excerpting information from a document image containing word image
units. According to the invention, the document image is segmented into
word image units (word units), and the word units are evaluated in
accordance with morphological image properties of the word units, such as
word shape. Significant word units are then identified, in accordance with
one or more predetermined or user selected significance criteria, and the
identified significant word units are outputted.
In accordance with another aspect of the invention, an apparatus is
provided for excerpting information from a document containing a word unit
text. The apparatus includes an input means for inputting the document and
producing a document image electronic representation of the document, and
a data processing system for performing data driven processing and which
comprises execution processing means for performing functions by executing
program instructions in a predetermined manner contained in a memory
means. The program instructions operate the execution processing means to
identify significant word units in accordance with predetermined
significance criteria from morphological properties of the word units, and
to output selected ones of the identified significant word units. The
output of the selected significant word units can be to an
electrostatographic reproduction machine, a speech synthesizer means, a
Braille printer, a bitmap display, or other appropriate output means.
These and other objects, features and advantages of the invention will be
apparent to those skilled in the art from the following detailed
description of the invention, when read in conjunction with the
accompanying drawings and appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
A preferred embodiment of the invention is illustrated in the accompanying
drawings, in which:
FIG. 1 is a flow chart of a method of the invention;
FIG. 2 is a block diagram of an apparatus according to the invention for
carrying out the method of FIG. 1;
FIG. 3 is a flow chart of a preferred embodiment of a method according to
the invention for detecting function words in a scanned document image
without first converting the document image to character codes;
FIGS. 4A-4F show three sets of character ascender structuring elements
where: FIGS. 4A-4B show a set of character ascender structuring elements
of height 3 and length 5, where the solid dots are ON pixels along the
bottom row and along one side column and there are one or more OFF pixels
in a remaining location preferably separated from the ON pixels; FIGS.
4C-4D show a set of character ascender structuring elements of height 4
and length 5; and FIGS. 4E-4F show a set of character ascender structuring
elements of height 5 and length 5;
FIGS. 5A-5F show three sets of character descender structuring elements
where: FIGS. 5A-5B show a set of character descender structuring elements
of height 3 and length 5; FIGS. 5C-5D show a set of character descender
structuring elements of height 4 and length 5; and FIGS. 5E-5F show a set
of character descender structuring elements of height 5 and length 5;
FIG. 6 shows a horizontal structuring element of length 5;
FIG. 7 shows a block system diagram of the arrangement of system components
forming a word shape recognition system;
FIG. 8 shows a block system diagram for identifying equivalence classes of
image units;
FIG. 9 shows a block system diagram for identifying significant image
units;
FIG. 10 shows an image sample of example text over which the inventive
process will be demonstrated;
FIG. 11 is a copy of a scanned image of the example text;
FIGS. 12A, 12B and 12C graphically illustrate the process used to determine
the angle at which the example text is oriented in the image sample prior
to further processing, while FIG. 12D shows graphs of the responses taken
from the example text, which are used to determine the angle at which the
example text is oriented in the image sample prior to further processing;
FIGS. 13A and 13B respectively show the derivation and use of a graph
examining the sample image of the example text to determine baselines of
text within the image;
FIGS. 14A and 14B are flowcharts illustrating the procedures executed to
determine the baselines shown in FIG. 13A;
FIG. 15 shows the scanned image of the example text with baselines
indicated thereon after derivation from the data shown in FIGS. 13A and
13B;
FIG. 16 is a flowchart illustrating the steps used in the application of a
median filter to the image of FIG. 10;
FIG. 17 is an enlarged pictorial representation of a portion of the image
of FIG. 10, illustrating the application of the median filter;
FIG. 18 demonstrates the resulting image after application of a median
filter, a process known herein as blobifying, to the scanned image of the
example text, which tends to render character strings as a single set of
connected pixels;
FIG. 19 shows a subsequent step in the process, in which lines of white
pixels are added to the blurred image to clearly delineate a line of
character strings from adjacent lines of character strings;
FIG. 20 is a flowchart illustrating the steps required to add the white
lines of FIG. 19;
FIGS. 21A and 21B are flowcharts representing the procedure which is
followed to segment the image data in accordance with the blurred image of
FIG. 18;
FIG. 22 shows the sample text with bounding boxes placed around each word
group in a manner which uniquely identifies a subset of image pixels
containing each character string;
FIGS. 23A and 23B illustrate derivation of a single independent value
signal, using the example word "from", which appears in the sample image
of example text;
FIG. 24 illustrates the resulting contours formed by the derivation process
illustrated in FIGS. 23A and 23B;
FIG. 25 illustrates the steps associated with deriving the word shape
signals;
FIGS. 26A, 26B, 26C and 26D illustrate derivation of a single independent
value signal, using the example word "from";
FIGS. 27A, 27B, 27C and 27D illustrate derivation of a single independent
value signal, using the example word "red", which does not appear in the
sample image of example text;
FIG. 28 shows a simple comparison of the signals derived for the words
"red" and "from" using a signal normalization method;
FIGS. 29A, 29B, and 29C illustrate the details of the discrepancy in font
height, and the method for normalization of such discrepancies;
FIG. 30 is a flowchart detailing the steps used for one method of
determining the relative difference between word shape contours;
FIG. 31 is a flowchart detailing the steps of a second method for
determining the relative difference between word shape contours;
FIGS. 32A and 32B are respective illustrations of the relationship between
the relative difference values calculated and stored in an array, for both
a non-slope-constrained and a slope-constrained comparison; and
FIG. 33 is a block diagram of a preferred embodiment of an apparatus
according to the invention for detecting function words in a scanned
document image without first converting the document image to character
codes.
The Appendix contains source code listings for a series of image
manipulation and signal processing routines which have been implemented to
demonstrate the functionality of the present invention. Included in the
Appendix are four sections which are organized as follows:
Section A, beginning at page 1, comprises the declarative or "include"
files which are commonly shared among the functional code modules;
Section B, beginning at page 26, includes the listings for a series of
library type functions used for management of the images, error reporting,
argument parsing, etc.;
Section C, beginning at page 42, comprises numerous variations of the word
shape comparison code, and further includes code illustrating comparison
techniques other than those specifically cited in the following
description;
Section D, beginning at page 145, comprises various functions for the word
shape extraction operations that are further described in the following
description.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
In contrast to prior techniques, such as those described above, the
invention is based upon the recognition that scanned image files and
character code files exhibit important differences for image processing,
especially in data retrieval. The method of a preferred embodiment of the
invention capitalizes on the visual properties of text contained in paper
documents, such as the presence or frequency of linguistic terms (such as
words of importance like "important", "significant", "crucial", or the
like) used by the author of the text to draw attention to a particular
phrase or a region of the text; the structural placement within the
document image of section titles and page headers, and the placement of
graphics; and so on. A preferred embodiment of the method of the invention
is illustrated in the flow chart of FIG. 1, and an apparatus for
performing the method is shown in FIG. 2. For the sake of clarity, the
invention will be described with reference to the processing of a single
document. However, it will be appreciated that the invention is applicable
to the processing of a corpus of documents containing a plurality of
documents. More particularly, the invention provides a method and
apparatus for automatically excerpting semantically significant
information from the data or text of a document based on certain
morphological (structural) image characteristics of image units
corresponding to units of understanding contained within the document
image. The excerpted information can be used, among other things, to
automatically create a document index or summary. The selection of image
units for summarization can be based on frequency of occurrence, or
predetermined or user selected selection criteria, depending upon the
particular application in which the method and apparatus of the invention
are employed.
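Selection by frequency of occurrence can be illustrated with a short sketch. It assumes, hypothetically, that each image unit has already been assigned an integer equivalence class label by comparing image shapes, without decoding; the function name and the parameters `min_count` and `max_classes` are illustrative and do not come from the disclosure.

```python
from collections import Counter

def select_frequent_units(unit_classes, min_count=2, max_classes=None):
    """Return indices of image units whose shape class occurs often enough.

    unit_classes: one equivalence class label per image unit, in reading order.
    """
    counts = Counter(unit_classes)
    # Rank classes from most to least frequent, dropping rare ones.
    ranked = [(cls, n) for cls, n in counts.most_common() if n >= min_count]
    if max_classes is not None:
        ranked = ranked[:max_classes]
    selected = {cls for cls, _ in ranked}
    return [i for i, cls in enumerate(unit_classes) if cls in selected]
```

For example, with labels `[3, 1, 3, 2, 1, 3]` and `min_count=2`, the units at indices 0, 1, 2, 4 and 5 are retained; other user selected criteria could be substituted for the count threshold.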
The invention is not limited to systems utilizing document scanning.
Rather, other systems such as a bitmap workstation (i.e., a workstation
with a bitmap display) or a system using both bitmapping and scanning
would work equally well for the implementation of the methods and
apparatus described herein.
With reference first to FIG. 2, the method is performed on an electronic
image of an original document 5, which may include lines of text 7,
titles, drawings, figures 8, or the like, contained in one or more sheets
or pages of paper 10 or other tangible form. The electronic document image
to be processed is created in any conventional manner, for example, by a
conventional scanning means such as those incorporated within a document
copier or facsimile machine, a Braille reading machine, or by an
electronic beam scanner or the like. Such scanning means are well known in
the art, and thus are not described in detail herein. An output derived
from the scanning is digitized to produce undecoded bit mapped image data
representing the document image for each page of the document, which data
is stored, for example, in a memory 15 of a special or general purpose
digital computer data processing system 13. The data processing system 13
can be a data driven processing system which comprises sequential
execution processing means 16 for performing functions by executing
program instructions in a predetermined sequence contained in a memory,
such as the memory 15. The output from the data processing system 13 is
delivered to an output device 17, such as, for example, a memory or other
form of storage unit; an output display 17A as shown, which may be, for
instance, a CRT display; a printer device 17B as shown, which may be
incorporated in a document copier machine or a Braille or standard form
printer; a facsimile machine, speech synthesizer or the like.
Through use of equipment such as illustrated in FIG. 2, the identified word
units are detected based on significant morphological image
characteristics inherent in the image units, without first converting the
scanned document image to character codes.
The method by which such image unit identification may be performed is
described with reference now to FIG. 1. The first phase of the image
processing technique of the invention involves a low level document image
analysis in which the document image for each page is segmented into
undecoded, information-containing image units (step 20) using conventional
image analysis techniques; or, in the case of text documents, preferably
using the bounding box method described in copending U.S. patent
application Ser. No. 07/794,392 filed concurrently herewith by
Huttenlocher and Hopcroft, and entitled "Method for Determining Boundaries
of Words in Text." The locations of and spatial relationships between the
image units on a page are then determined (step 25). For example, an
English language document image can be segmented into word image units
based on the relative difference in spacing between characters within a
word and the spacing between words. Sentence and paragraph boundaries can
be similarly ascertained. Additional region segmentation image analysis
can be performed to generate a physical document structure description
that divides page images into labelled regions corresponding to auxiliary
document elements like figures, tables, footnotes and the like. Figure
regions can be distinguished from text regions based on the relative lack
of image units arranged in a line within the region, for example. Using
this segmentation, knowledge of how the documents being processed are
arranged (e.g., left-to-right, top-to-bottom), and, optionally, other
inputted information such as document style, a "reading order" sequence
for word images can also be generated. The term "image unit" is thus used
herein to denote an identifiable segment of an image such as a number,
character, glyph, symbol, word, phrase or other unit that can be reliably
extracted. Advantageously, for purposes of document review and evaluation,
the document image is segmented into sets of signs, symbols or other
elements, such as words, which together form a single unit of
understanding. Such single units of understanding are generally
characterized in an image as being separated by a spacing greater than
that which separates the elements forming a unit, or by some predetermined
graphical emphasis, such as, for example, a surrounding box image or other
graphical separator, which distinguishes one or more image units from
other image units in the scanned document image. Such image units
representing single units of understanding will be referred to hereinafter
as "word units."
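The spacing-based segmentation described above can be sketched for a single text line as follows. This is a simplified illustration, not the bounding box method of the referenced application: the line is assumed to be a binary array (1 = ON pixel), and a gap is treated as a word boundary when it is at least `ratio` times the median gap width, where `ratio` is a hypothetical tuning parameter. Lines containing only inter-word gaps would defeat this simple threshold.

```python
from statistics import median

def column_has_ink(line_image):
    """line_image: list of pixel rows (1 = ON). True where a column has ink."""
    return [any(col) for col in zip(*line_image)]

def find_gaps(ink):
    """Return (start, length) of each blank run lying between inked columns."""
    gaps, start, seen_ink = [], None, False
    for x, on in enumerate(ink):
        if on:
            if start is not None and seen_ink:
                gaps.append((start, x - start))
            start, seen_ink = None, True
        elif start is None:
            start = x
    return gaps

def segment_words(line_image, ratio=2.0):
    """Return half-open column ranges (start, end), one per word unit."""
    ink = column_has_ink(line_image)
    if not any(ink):
        return []
    first = ink.index(True)
    last = len(ink) - 1 - ink[::-1].index(True)
    gaps = find_gaps(ink)
    threshold = ratio * median(length for _, length in gaps) if gaps else 0
    boxes, word_start = [], first
    for start, length in gaps:
        if length >= threshold:  # wide gap: treat as an inter-word space
            boxes.append((word_start, start))
            word_start = start + length
    boxes.append((word_start, last + 1))
    return boxes
```

Narrow gaps are absorbed into the surrounding word unit, so characters within a word stay grouped while the wider inter-word spaces produce separate column ranges.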
Advantageously, a discrimination step 30 is next performed to identify the
image units which have insufficient information content to be useful in