|
Claims  |
|
|
What is claimed is:
1. A method of transforming a document represented as a bitmap image into
an editable coded data stream defined using a machine readable document
description language that records coded information resulting from the
document transformation process and information regarding uncertainties in
the document transformation process, comprising:
performing a first transformation operation on at least a text portion of
said bitmap image using a character recognition apparatus, to transform at
least said text portion of said bitmap image into coded information
recognized with a level of confidence;
outputting and recording said coded information into one or more elements
that are defined using said document description language, each element
having a machine readable element type identifier that indicates the type
of said coded information regarding the recognized bitmap image recorded
in said element so that the type of coded information contained in each
element can be known without examining the coded information contained in
each element, the element type identifier for each element having been
defined based on the type of coded information recorded in the element and
the level of confidence with which said bitmap image represented by said
coded information was recognized so that each of said elements is
selectively identified, each element having coded information of a single
type recorded therein; and
when said character recognition apparatus determines that the recognized
bitmap image contained in an element has not been recognized with at least
a predetermined level of confidence, recording in said element uncertainty
information determined by said first recognition apparatus regarding said
recognized bitmap image contained in said element;
wherein said element type identifier is a character-string-element or a
questionable-character-element, each character-string-element containing a
string of consecutive characters recognized by said character recognition
apparatus with at least said predetermined level of confidence, each
questionable-character-element containing said uncertainty information
determined by said character recognition apparatus for a character which
was not recognized with at least said predetermined level of confidence by
said character recognition apparatus.
2. The method of claim 1, wherein said uncertainty information includes a
degree of uncertainty with which said character recognition apparatus
transformed said bitmap image.
3. The method of claim 1, further comprising:
for each questionable-character-element, using a word recognizer to
transform said questionable-character-element and adjacent confidently
recognized characters in a same word as said
questionable-character-element into one or more elements having an element
type identifier defined as a verified-word-element, by substituting
alternate characters for said questionable-character-element when one or
more words created by said substituting are recognized by said word
recognition apparatus; when more than one verified-word-element is
transformed for each questionable-character-element, said more than one
verified-word-elements being placed in an element having an element type
identifier defined as an alternate-word-element; said
question-able-character-element remaining when no verified words are
recognized by said word recognition apparatus.
4. The method of claim 3, further comprising:
for each alternate-word-element, using a semantics analyzer to transform
verified words of the verified-word-elements contained in each alternate
word element into an element identified as a character-string-element
corresponding to one of the verified words contained in said
alternate-word-element when said semantics analyzer determines that said
one of said verified words is a correct word, said alternate-word-element
remaining when none of said verified words is determined to be a correct
word by said semantics analyzer.
5. The method of claim 4 wherein for each verified-word-element contained
in each alternate word element, said semantics analyzer determines and
identifies a confidence level of each verified word contained in each
verified-word-element, and indicates the confidence level of each
verified-word-element when none of the verified words in an
alternate-word-element are determined to be a correct word by said
semantics analyzer.
6. The method of claim 1, wherein for each questionable-character-element,
said uncertainty information pertaining to a character not recognized with
at least said predetermined level of confidence includes a most likely
uncertain character identified by said character recognition apparatus.
7. The method of claim 6, wherein for each questionable-character-element,
said uncertainty information pertaining to a character not recognized with
at least said predetermined level of confidence also includes a degree of
confidence determined by said character recognition apparatus for said
most likely uncertain character.
8. The method of claim 1, wherein for each questionable-character-element,
said uncertainty information pertaining to a character not recognized with
at least said predetermined level of confidence includes alternate
possible uncertain characters identified by said character recognition
apparatus.
9. The method of claim 8, wherein for each questionable-character-element,
said uncertainty information pertaining to a character not recognized with
at least said predetermined level of confidence also includes a degree of
confidence determined by said character recognition apparatus for each
said alternate possible uncertain character.
10. A method of transforming a document represented as a bitmap image into
an editable coded data stream defined using a machine readable document
description language that records coded information resulting from the
document transformation process and information regarding uncertainties in
the document transformation process, comprising the steps of:
a) analyzing text portions of said bitmap image using a character
recognizer to transform said text portions of said bitmap image into said
coded information recognized with a level of confidence;
b) outputting and recording said coded information into a series of
elements that are defined using said document description language, each
element having a machine readable element type identifier that indicates
the type of said coded information regarding the recognized bitmap image
recorded in said element so that the type of coded information contained
in each element can be known without examining the coded information
contained in each element;
c) the element type identifier for each element having been defined using
said document description language based on the type of coded information
recorded in the elements by defining the type identifier of each of said
elements as character-string-elements or questionable-character-elements,
each element defined as a character-string-element containing a string of
consecutive characters recognized by said character recognizer with at
least a predetermined level of confidence, each element defined as a
questionable-character-element containing information pertaining to a
character not recognized with at least said predetermined level of
confidence, said character-string-elements and said
questionable-character-elements being arranged as a continuous stream of
elements based on an order of the characters in said bitmap image, each
element having coded information of a single type recorded therein;
d) for each element defined as a questionable-character-element, analyzing
the coded information pertaining to the character contained therein and
adjacent confidently recognized characters contained in a same word as
said questionable-character-element using a word recognizer, to transform
and record said coded information and adjacent confidently recognized
characters into one or more elements having type identifiers defined as
verified-word-elements by substituting alternate characters for said
questionable-character-element and determining whether one or more
verified words created by said substituting are recognized by said word
recognizer; and
e) when more than one verified-word-elements are transformed for each
questionable-character-element, placing said more than one elements
defined as verified-word-elements in an element having an element type
identifier defined as an alternate-word-element, said
questionable-character-element remaining when no verified words are
recognized by said word recognizer.
11. The method of claim 10, further comprising:
f) for each alternate-word-element, analyzing the verified words of the
verified-word-elements contained therein and surrounding words using a
semantics analyzer, to transform said alternate-word-element into an
element identified as a character-string-element corresponding to one of
verified words contained in said alternate-word-element when said
semantics analyzer determines that one of said verified words is a correct
word, said alternate-word-element remaining when none of said verified
words is determined to be a correct word by said semantics analyzer.
12. The method of claim 11, wherein for each verified-word-element
contained in each alternate-word-element, said semantics analyzer
determines and identifies a confidence level of each verified word
contained in each verified-word-element, and indicates the confidence
level for each verified-word-element when none of the verified words in an
alternate-word-element are determined to be a correct word by said
semantics analyzer.
13. The method of claim 10, wherein for each questionable character
element, said information pertaining to a character not recognized with at
least said predetermined level of confidence includes a most likely
uncertain character identified by said character recognizer.
14. The method of claim 13, wherein for each
questionable-character-element, said information pertaining to a character
not recognized with at least said predetermined level of confidence also
includes a degree of confidence determined by said character recognizer
for said most likely uncertain character.
15. The method of claim 10, wherein for each
questionable-character-element, said information pertaining to a character
not recognized with at least said predetermined level of confidence
includes alternate possible uncertain characters identified by said
character recognizer.
16. The method of claim 15, wherein for each
questionable-character-element, said information pertaining to a character
not recognized with at least said predetermined level of confidence also
includes a degree of confidence determined by said character recognizer
for each said alternate possible uncertain character.
17. An automatic document recognition apparatus for transforming text
documents represented as bitmap image data into an editable coded data
stream defined using a machine readable document description language that
records coded data resulting from the document transformation process and
information regarding uncertainties in the document transformation
process, said apparatus comprising:
a character recognizer having:
a) first transformation means for performing a character recognition
operation on said bitmap image representation of said document to
transform said document into coded character data recognized with a level
of confidence; and
b) first identification means for outputting and recording said coded
character data into one or more elements that are defined using said
document description language, said first identification means further
defining a machine readable element type identifier for each element, said
element type identifier indicating the type of said coded character data
regarding the recognized bitmap image recorded in said element so that the
type of coded character data contained in each element can be known
without examining the coded character data contained in each element, said
identification means selectively identifying said one or more elements by
defining the element type identifier for each element based on the type of
coded data recorded in the element and the level of confidence with which
said bitmap image was recognized, each element having coded character data
of a single type recorded therein, and, when said first transformation
means determines that the coded character data contained in the element
has not been transformed with a predetermined level of confidence, said
identification means also recording in said element uncertainty
information determined by said first transformation means regarding said
coded character data contained in said element;
wherein said element type identifier is a character-string-element or a
questionable-character-element, each character-string-element containing a
string of consecutive characters recognized by said character recognition
apparatus with at least said predetermined level of confidence, each
questionable-character-element containing said uncertainty information
determined by said character recognition apparatus for a character which
was not recognized with at least said predetermined level of confidence
by..said character recognition apparatus.
18. The apparatus of claim 17, wherein said uncertainty information
includes a confidence level with which said first transformation means
determined said coded data.
19. The apparatus of claim 17, further comprising:
a word recognizer having:
i) second transformation means for transforming each
questionable-character-element and adjacent confidently recognized
characters in a same word as the question-able-character-element into one
or more verified words by substituting alternate characters for the
questionable-character-element and verifying that a word resulting from
said substituting exists in a lexicon; and
ii) second identification means using said document description language
for placing each verified word in an element having an element type
identifier defined as a verified-word-element, and when more than one
verified-word-elements are created for a questionable-character-element,
placing said more than one verified-word-elements in an element having an
element type identifier defined as an alternate-word-element, said
questionable-character-element remaining when no verified words are
determined to exist.
20. The apparatus of claim 19, further comprising:
a semantics analyzer including:
1) means for determining which verified word within an
alternate-word-element is a correct verified word based on words
surrounding the alternate-word-element; and
2) third identification means for identifying said correct verified word
and for replacing said alternate-word-element with an element identified
as a character-string-element containing said correct verified word. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to document recognition, and in particular to
methods and apparatus for recognizing textual and graphics structures in
documents originally represented as bitmap images, and for recording the
results of the recognition process.
2. Description of Related Art
Document recognition is the automatic transformation of paper documents
into editable electronic documents. It entails the gradual transformation
of bitmaps into structured components, through successive and recursive
interventions of various processes. These processes include: page
segmentation, character recognition, graphics recognition, logical
structure reconstruction, spelling correction, semantic analysis, etc. All
these processes are prone to misinterpretation. Not all processes keep a
record of the misinterpretations they are aware of, and the ones that do
keep a record have no standard way of doing so. As a consequence,
downstream processes are generally not prepared to handle the record of
ambiguities handed to them by upstream processes, and simply discard them.
Valuable information is lost instead of being exploited for automatic
improvement of the document recognition function. If, on the other hand,
the ambiguity record is passed in its raw state to the user, the chore of
making manual corrections can quickly outweigh the advantages of automatic
recognition over a manual reconstruction of the entire document.
U.S. Pat. Nos. 4,914,709 and 4,974,260 to Rudak disclose an apparatus and
method for identifying and correcting characters which cannot be machine
read. A bitmap video image of the unrecognized character(s) is inserted in
an ASCII data line of neighboring characters, thereby allowing an operator
to view the character(s) in question in context to aid in proper
identification of the character(s). Subsequently, with the aid of the
video image, the operator enters the correct character(s) via a keyboard
or other means. This apparatus and method require operator interaction to
clarify any ambiguities resulting from an automatic document recognition
process. The results of these ambiguities are not recorded in a notation
that can be used by other downstream automatic devices.
U.S. Pat. No. 4,907,285 to Nakano et al discloses an image recognition
system which uses a grammar for describing a document image, and parses
statements expressed by the grammar to recognize the structure of an
unknown input image. The grammar describes the image as substructures and
the relative relation between them. In the parsing process, after the
substructures and their relative relation are identified, a search is made
as to whether the substructures and their relative relation exist in the
unknown input image, and if they do, the inside of the substructures are
further resolved to continue the analysis. If the substructures do not
exist, other possibilities are searched and the structure of the unknown
input image is thus represented from the result of the search. For
example, the location of a rectangular region of the document which
contains a statement defined by the document grammar (for example "TITLE"
and "AUTHOR") is initially represented by variables. See FIG. 10 of U.S.
Pat. No. 4,907,285. After locating this region in the document, the
appropriate numeric values are substituted for the variables.
U.S. Pat. No. 4,949,188 to Sato discloses an image processing apparatus for
synthesizing a character or graphic pattern represented by a page
description language and an original image. The image processing apparatus
generates a page description language including code data which represents
characters, graphics patterns, and the like, and command data which causes
a printer to print the original image. Ambiguities from previous document
recognition processes are not recorded in the page description language.
See, for example, the table in column 4, lines 5-10. Accordingly, any
downstream device receiving the page description language cannot determine
whether any ambiguities occurred in the previously performed document
recognition processes.
U.S. Pat. No. 4,654,875 to Srihari et al discloses a method of automatic
language recognition for optical character readers. Language in the form
of input strings or structures is analyzed on the basis of: channel
characteristics in the form of probabilities that a letter in the input is
a corruption of another letter; the probabilities of the letter occurring
serially with other recognized letters that precede the letter being
analyzed or particular strings of letters occurring serially; and lexical
information in the form of acceptable words represented as a graph
structure. Ambiguities from upstream recognition processes are not
recorded.
"Word Association Norms, Mutual Information, and Lexicography", by Kenneth
W. Church and Patrick Hanks, Computational Linguistics, Vol. 16, No. 1
(March 1990) discloses a measure, referred to as an "association ratio"
based on the information theoretic notion of mutual information, for
estimating word association norms from computer readable corpora. This
association ratio can be used by a semantics analyzer to determine a most
likely word from a choice of two or more words that have been identified
as possible words.
"On the Recognition of Printed Characters of Any Font and Size", by Simon
Kahan, Theo Pavlidis and Henry S. Baird, IEEE Transactions on Pattern
Analysis and Machine Intelligence, Vol. PAM1-9, No. 2 (March 1987),
discloses a system that recognizes printed text of various fonts and sizes
for the Roman alphabet. Thinning and shape extraction are performed
directly on a graph of the run-length encoding of the binary image. The
resulting strokes and other shapes are mapped, using a shape-clustering
approach, into binary features which are then fed into a statistical
Bayesian classifier. This system identifies multiple possible characters
or words, and scores them. However, the uncertainty in the recognition
processes is not recorded using the standard notation of the present
invention.
In summary, a number of systems exist which can recognize graphics
structures, text (characters, words, semantics, fonts) and logical
structures (pages, paragraphs, footnotes), and which can determine the
uncertainty with which the recognized feature was recognized. Accordingly,
the above-identified patents and papers are incorporated herein by
reference. However, none of these systems record the results of the
recognition process (including uncertainties)in a manner which can be used
by other devices. This results in much information (particularly regarding
uncertainty) being lost, especially when different recognition systems
(e.g., character recognizers, word recognizers, semantics analyzers) are
used at different times (as opposed to being integrated into one system).
OBJECTS AND SUMMARY OF THE INVENTION
It is an object of the present invention to provide methods and apparatus
for recording ambiguities in document recognition processes in a standard
format which can be used by a variety of document recognizers.
It is another object of the present invention to provide methods and
apparatus for converting bitmap images into editable coded data, wherein
information regarding ambiguities in the transformation processes
performed by upstream recognizers can be recorded and thus used by
downstream, higher level recognizers which attempt to resolve these
ambiguities.
To achieve the foregoing and other objects, and to overcome the
shortcomings discussed above, methods and apparatus are provided for
converting documents represented as bitmap image data into editable coded
data, wherein a standard notation in a document description language is
utilized for recording document recognition ambiguities by each document
recognizer. When the results of document recognition processes are
recorded using this standard notation, any ambiguities are identified in a
uniform manner so that downstream, higher level document recognition
processes can attempt to resolve these ambiguities by using all
information about the ambiguities obtained by upstream document
recognition processes.
In particular, when using the standard notation of the present invention,
each document recognizer can record the results of its recognition process
in one or more elements, selectively identified using the document
description language. Each element includes a type-identifier indicating a
type of coded data (information) regarding the recognized (transformed)
bitmap image contained therein. Each element also includes editable coded
data therein of the type identified by the type-identifier, and can also
include uncertainty information identifying any coded data which was not
transformed with a predetermined level of confidence. This uncertainty
information is determined by the document recognizer, and is recorded in a
format that is readable by higher level, downstream document recognizers.
This uncertainty information can include the level of confidence with
which the uncertain coded data was recognized by the document recognizer,
in order to further assist the higher level document recognizers in
resolving ambiguities. The uncertainty information can also include
alternative coded data for each uncertain recognition.
When the document recognizer is a character recognizer, any characters
which are not recognized with a predetermined level of confidence are
identified and recorded by placing them in
questionable-character-elements. The degree of certainty as well as
alternative possible characters and their degree of certainty can also be
recorded for each questionable character. Characters which were recognized
with at least the predetermined level of confidence are placed in
character-string-elements.
When the document recognizer includes a word recognizer (such as, for
example, a spelling checker), the word recognizer attempts to resolve any
existing questionable characters by determining whether any words exist in
a lexicon based upon each questionable character and the certain
characters in the word containing each questionable character. If a word
is identified in the lexicon for the word containing a questionable
character, that word is identified as a verified word, and is recorded in
a verified-word-element. If more than one verified words are found, they
are placed in individual verified-word-elements which are collectively
grouped together in an alternate-word-element. If no verified words are
found for the word containing a questionable character, the
questionable-character-element remains.
When the document recognizer includes a semantics analyzer, any identified
alternate verified words are resolved by analyzing the words surrounding
the alternate verified words. If one of the alternate verified words can
be confirmed with a predetermined level of confidence based on the
semantics analysis, it is returned and merged with the surrounding
character-string-elements. If the semantics analyzer cannot determine
which of the alternate verified words is correct, it returns the
alternate-word-element (and included verified-word-elements) as such, and
can include data indicative of the probability that each verified word
therein is the correct word.
When the document recognizer includes a graphics structure image
recognizer, it outputs graphics-elements containing coded data
representative of graphics structures recognized in the graphics image.
These structures can include: lines defined between endpoints; circles;
arcs; etc. Additionally, line thickness information can also be returned
and recorded. Ambiguities in the recognition process such as x and y
direction offsets and line thickness variations can also be recorded. This
data can be used by downstream higher level graphics recognition processes
to resolve any ambiguities, or to recognize more complex graphics
structures. For example, four lines recognized by a low level graphics
recognizer could be determined to be a box by a higher level graphics
recognizer if, for example, the endpoints can be determined with a high
degree of certainty to be coincident.
Additional image recognition elements are produced for recording
information relating to larger portions (or subimages) of the document
image. For example, data related to font text blocks, frames, pages,
documents, and large and small pieces of unresolved bitmap images can also
be recorded.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will be described in detail with reference to the following
drawings in which like reference numerals refer to like elements, and
wherein:
FIG. 1 is a sample page image used to illustrate the present invention;
FIG. 2 illustrates a character-string-element for collecting streams of
characters recognized with or above a predetermined confidence level;
FIG. 3 illustrates a questionable-character-element for collecting
questionable characters recognized with a low confidence level;
FIG. 4 illustrates a questionable-word-element for collecting a
questionable word which contains characters recognized with high
confidence, but which was not found in a lexicon;
FIG. 5 illustrates verified-word-elements for collecting verified words
found in a lexicon by resolving a word containing one or more questionable
characters, and an alternate-word-element for collecting alternate words
when two or more verified words are found for one word containing
questionable characters;
FIG. 6 illustrates a text-element for collecting text elements having the
same font;
FIG. 7 illustrates a fontDef-element for collecting data relating to a font
type;
FIG. 8 illustrates one type of graphics-element which is a segment-element
for collecting data relating to a line segment;
FIG. 9 illustrates another type of graphics-element which is an arc-element
for collecting data relating to an arc;
FIG. 10 illustrates another type of graphics-element which is an
image-element for collecting data relating to a large unresolved bitmap
image;
FIG. 11 illustrates another type of graphics-element which is a
spot-element for collecting information relating to a small unresolved
bitmap image referred to as a spot, and for storing this information as a
hexidecimal value;
FIG. 12 illustrates examples of elements referring to other elements;
FIG. 13 illustrates a tBlock-element for collecting information relating to
blocks of text;
FIG. 14 illustrates a frame-element for collecting information relating to
frames which can include text blocks, images, spots, arcs and segments, as
well as other frames;
FIG. 15 illustrates a page-element for collecting data relating to a page;
FIG. 16 illustrates a group-element for collecting information relating to
a group of elements which extend across page boundaries;
FIG. 17 illustrates a drStream-element for collecting data relating to an
entire document;
FIGS. 18A-C are a collection of all syntax necessary for describing a
document;
FIG. 19 is a block diagram of a system for inputting and converting a
bitmap image into coded data streams using the present invention;
FIG. 20 is a flowchart illustrating a procedure performed by the system of
FIG. 19 when using the present invention; and
FIG. 21 is a flowchart illustrating a procedure performed by the word
recognizer of FIG. 19 when using the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention utilizes a straight forward procedure for recording
ambiguities through the successive stages of the document recognition
process. These ambiguities are in the context of:
characters processed by character recognizers;
words processed by character recognizers, spelling checkers, and semantics
analyzers;
text flow processed by logical structure reconstructers; and
geometry of line segments and arcs processed by graphics recognizers.
Each of these processes produce and/or consume a byte-oriented data stream
(hereinafter referred to as a document recognition stream or DRstream),
and bitmap streams (hereinafter referred to as image files), referenced by
the DRstream. The DRstream carries information about one or several pages
of a digitized document. The information describes text with font, certain
graphics primitives, and half tone images, as well as their relationships,
and the ambiguities about them.
The present invention does not provide any new document recognition
processes (or document recognizers) in the sense that it can be used with
existing recognizers which recognize, for example, characters or graphics
structures, or determine words (by comparing sequences of characters
against a lexicon of known words), or determine which word from a choice
of possible words is correct. However, the present invention improves the
efficiency and compatibility with which these different types of
recognizers function by providing a standard notation for recording the
results obtained by the recognizers in a document description language.
FIGS. 2-18C illustrate this document recognition notation in ISO 8879
Standard Generalized Mark-up Language (SGML), according to the Document
Type Definition discussed below. According to the present invention, each
recognizer records coded data, corresponding to the results of the
recognition process which it performs, as coded information, referred to
in SGML as elements. Each element contains coded data which has been
recognized as being similar in some way (for example: text, graphics, same
page, all certain characters, etc.). Each element includes: a) a
type-identifier which indicates the type of coded data contained in that
element; b) an optional identification number, unique amongst all similar
type elements of a document, which distinguishes that element from other
similar type elements so that an element can be referenced by other
elements (most elements will have an identification number); c) coded data
obtained by the document recognition process (this could be strings of
characters or parameters defining graphics structures); and d) optional
contents (referred to as attributes) for providing additional information
(for example, uncertainty information) about the coded data included in
that element. Although the attributes of an element can be used to record
uncertainty information about coded data in an element (information such
as, for example, levels of confidence with which the coded data was
recognized, or possible offsets for parameters (e.g. endpoints defining a
line segment) of a graphics structure), the type-identification in some
cases also serves to convey uncertainty information by indicating that the
contents of that element was determined with a level of confidence below a
predetermined level of confidence. In the illustrated examples, the coded
data is recorded as human readable ASCII, however other codes could also
be used.
One familiar with SGML will understand the generic contents of the elements
to be described below. Thus, only a brief discussion of a generic element
will be provided with reference to FIGS. 18A-C. Then, each type of element
will be specifically described with reference to FIGS. 2-17. FIGS. 18A-C
illustrate a complete syntax of elements which can be used to describe a
document according to the present invention. This list of elements would
be located at the start of each DRstream, and would be used by
conventional parsers, programmed to parse streams written in SGML, to
parse the DRstream contained therebelow. That is, after the syntax list of
elements, a continuous stream of elements describing a specific document
would be provided. As used herein, the terminology continuous stream of
elements' refers to a group of elements which are identified as belonging
together. Thus, in a markup language such as SGML, where white-space is
permitted (and, in fact, encouraged for readability), tabs, breakage into
separate lines constitute white-space that the parser will ignore. In this
sense, white-space is part of the continuous stream of elements. Other
systems may have a limit on the size of character streams. In these
systems, long DRstreams would be split across several files which would be
identified as belonging together. Such a DRstream, where several files are
identified as belonging together, is also intended to be covered by the
terminology "continuous stream of elements". (Some of the elements in
FIGS. 18A-C include attributes (to be described below) which also would be
listed at the start of the DRstream.) Of course, all of the elements
listed in FIGS. 18A-C are not required to record the results of a d | | |