|
Claims  |
|
|
What is claimed:
1. A method of operating a machine to align image words in a text image to
transcription words in a transcription associated with the text image; the
machine including a processor and a memory device for storing data; the
data stored in the memory device including instruction data the processor
executes to operate the machine; the processor being connected to the
memory device for accessing the data stored therein; the method
comprising:
operating the processor to obtain an image definition data structure
defining a text image including a plurality of glyphs representing
characters in an image character set;
operating the processor to obtain a transcription data structure associated
with the text image and including a plurality of transcription labels
indicating character codes representing characters in the image character
set;
operating the processor to produce an ordered image word sequence of image
words occurring in the text image; each image word being an image region
in the text image including at least one glyph;
operating the processor to produce an ordered transcription word sequence
of transcription words occurring in the transcription data structure; each
transcription word being a sequence of at least one transcription label
indicating a character code in the image character set; and
operating the processor to perform an alignment operation to align image
words in the ordered image word sequence with transcription words in the
ordered transcription word sequence subject to a constraint of maintaining
word order in each of the ordered transcription word sequence and the
ordered image word sequence during alignment; the alignment operation
producing an image-transcription alignment data structure indicating each
image word in the ordered image word sequence paired with either no (a
null) transcription word or with at most one transcription word in the
ordered transcription word sequence.
2. The method of operating a machine of claim 1 wherein
operating the processor to produce the ordered image word sequence of image
words occurring in the text image includes determining an image word
length value for each image word in the ordered image word sequence;
operating the processor to produce the ordered transcription word sequence
of transcription words occurring in the transcription data structure
includes producing a transcription word length value for each
transcription word in the ordered transcription word sequence; and
the alignment operation aligns transcription words in the ordered
transcription word sequence with image words in the ordered image word
sequence by matching image word lengths to transcription word lengths.
3. The method of operating a machine of claim 2 further including operating
the processor to determine an image character width value indicating an
estimated character width of a glyph in the text image; and wherein the
transcription word length value produced for each transcription word in
the ordered transcription word sequence is produced using the image
character width value.
4. The method of operating a machine of claim 1 wherein, when the text
image includes a first image word and the transcription data structure,
when visually inspected and compared with the text image, does not include
transcription labels visually paired with the first image word thereby
indicating a missing transcription word, the alignment operation detects
the missing transcription word; the alignment operation indicating in the
image-transcription alignment data structure that the first image word is
paired with no (a null) transcription word.
5. The method of operating a machine of claim 1
wherein operating the processor to produce the ordered image word sequence
of image words occurring in the text image includes determining an image
word parameter for each image word in the ordered image word sequence;
wherein operating the processor to produce the ordered transcription word
sequence of transcription words occurring in the transcription data
structure includes producing a transcription word parameter for each
transcription word in the ordered transcription word sequence; and
wherein the alignment operation uses similarity measurements between the
image word parameters and the transcription word parameters to perform the
alignment.
6. The method of operating a machine of claim 1 wherein
the transcription data structure includes a transcription label error
causing the processor to identify a sequence of transcription labels as a
transcription word, referred to as an errorful transcription word, in the
ordered transcription word sequence that cannot be identically matched to
an image word in the text image by visual inspection of the text image;
the errorful transcription word including at least one of a transcription
label deletion, a transcription label substitution, or a transcription
label insertion; and
the alignment operation, when pairing image words in the ordered image word
sequence to transcription words in the ordered transcription word sequence
including the errorful transcription word, determines whether the errorful
transcription word can be paired with an image word in the ordered image
word sequence and if so, determines with which image word to pair the
errorful transcription word.
7. The method of operating a machine of claim 1 wherein
the alignment operation is implemented as a dynamic programming operation
that represents the ordered transcription word sequence and the ordered
image word sequence as a two-dimensional lattice data structure, hereafter
referred to as a lattice, including a plurality of nodes and transitions
between nodes; a transition between two nodes representing a pair of
sequences of zero or more image words in the ordered image word sequence
and zero or more transcription words in the ordered transcription word
sequence; the dynamic programming operation producing a path through the
nodes in the lattice from a first node to a final node using scores
computed for nodes and computed for transitions between nodes in the
lattice; a score for a transition indicating a similarity measurement
between the zero or more transcription words and the zero or more image
words associated with the transition; the dynamic programming operation
storing node identification information for each node and transition in
the path, as the path through the lattice is produced; and
the alignment operation produces the image-transcription alignment data
structure indicating transcription words paired to image words by
backtracing through the nodes included in the path from the final node to
the first node using the node identification information.
8. The method of operating a machine of claim 7 wherein the similarity
measurement indicated by the score for a transition is a function of image
word lengths of the zero or more image words and of the zero or more
transcription words associated with the transition.
9. The method of operating a machine of claim 1 wherein
each image word in the ordered image word sequence is represented as an
image word location defined according to an image coordinate system
describing the text image;
each transcription word in the ordered transcription word sequence is
represented as a transcription word location in the ordered label sequence
of the transcription labels included in the transcription data structure;
and
transcription words aligned with image words are represented in the
image-transcription alignment data structure as transcription word
locations in the transcription and as image word locations in the text
image, respectively.
10. A method of operating a machine to label image locations of image
regions in a text image with label data available in a transcription data
structure associated with the text image; the machine including a
processor and for memory device for storing data; the data stored in the
memory device including instruction data the processor executes to operate
the machine; the processor being connected to the memory device for
accessing the data stored therein; the method comprising:
operating the processor to store, in the memory device of the machine, an
image definition data structure defining a text image including a
plurality of glyphs;
operating the processor to store, in the memory device of the machine, a
transcription data structure associated with the text image and including
an ordered sequence of transcription labels indicating character code
information about image regions occurring in the text image;
operating the processor to produce an ordered image region sequence of
image regions occurring in the text image; each image region being defined
with respect to its location in the text image relative to at least one
glyph occurring in the text image;
operating the processor to produce an ordered transcription region sequence
of transcription regions occurring in the transcription data structure;
each transcription region being a sequence of at least one transcription
label;
operating the processor to perform an alignment operation to align image
regions in the ordered image region sequence with transcription regions in
the ordered transcription region sequence; the alignment operation
computing similarity measurements measuring a similarity between image
regions and transcription regions; the alignment operation determining a
best pairing between the image regions in the ordered image region
sequence and transcription regions in the ordered transcription region
sequence that optimizes the similarity measurements such that each one of
the image regions in the ordered image region sequence is paired with
either no (a null) transcription region or with at most one transcription
region in the ordered transcription region sequence; and
operating the processor to produce an image transcription alignment data
structure indicating transcription regions paired with image region
locations of paired image regions.
11. The method of operating a machine of claim 10 further including
operating the processor to determine an image region parameter value for
each image region; the image region parameter value indicating a
measurement of a geometric characteristic of the image region;
operating the processor to determine an image glyph parameter value
indicating a measurement of a geometric characteristic of a glyph
occurring in the text image; and
operating the processor to determine a transcription region parameter for
each transcription region in the ordered transcription region sequence
using the image glyph parameter value; and
wherein the alignment operation is implemented as a dynamic programing
operation that aligns image regions in the ordered image region sequence
to transcription regions in the ordered transcription region sequence
using similarity measurements between image region parameters and
transcription region parameters, subject to a constraint of maintaining
region order in each of the ordered transcription region sequence and the
ordered image region sequence during the dynamic programming operation.
12. The method of operating a machine of claim 10 wherein, when the ordered
transcription region sequence includes fewer transcription regions than
image regions included in the ordered image region sequence, the
transcription data structure indicates a missing transcription region; and
wherein the alignment operation uses the similarity measurements to detect
the missing transcription region so that the best pairing between the
image regions in the ordered image region sequence and transcription
regions in the ordered transcription region sequence indicates the missing
transcription region.
13. A method of operating a machine to pair image words in a
two-dimensional (2D) text image with transcription words in a
transcription associated with the text image; the machine including a
processor and a memory device for storing data; the data stored in the
memory device including instruction data the processor executes to operate
the machine; the processor being connected to the memory device for
accessing the data stored therein; the method comprising:
operating the processor to receive and store, in the memory device of the
machine, an image definition data structure defining a two-dimensional
(2D) text image of pixels and including a plurality of glyphs; the 2D text
image having a vertical dimension size larger than a single line; the
plurality of glyphs representing characters in an image character set;
operating the processor to receive and store, in the memory device of the
machine, a transcription data structure associated with the 2D text image
and including a plurality of transcription labels indicating character
codes representing characters in the image character set;
operating the processor to determine an image word location in the 2D text
image for each of a plurality of image words occurring therein; each image
word being an image region including at least one glyph positioned with
respect to an image baseline; the plurality of image word locations
identifying an ordered image word sequence of the image words occurring in
the 2D text image;
operating the processor to determine an image word length value for each
image word in the 2D text image indicating a length of the image word;
operating the processor to determine an image character width value
indicating an estimated character width of a glyph in the 2D text image;
operating the processor to determine a transcription word location for each
of a plurality of transcription words occurring in the transcription data
structure; each transcription word being a sequence of at least one
transcription label indicating a character code in the image character set
and separated from other transcription labels indicating character codes
in the image character set by a transcription word boundary character
code; the plurality of transcription word locations identifying an ordered
transcription word sequence of the transcription words occurring in the
transcription data structure;
operating the processor to determine a transcription word length value for
each transcription word using the image character width value and a count
of the transcription labels included in the transcription word;
operating the processor to perform an alignment operation to align
transcription word lengths in the ordered transcription word sequence with
image word lengths in the ordered image word sequence; the alignment
operation maintaining the ordered image word sequence and the ordered
transcription word sequence when aligning the image word lengths and the
transcription word lengths; the alignment operation producing a list of
word pairs indicating transcription words in the ordered transcription
word sequence paired to matching image words in the ordered image word
sequence; and
operating the processor to produce an image-transcription alignment data
structure using the list of word pairs, the image word locations and the
transcription word locations; the image-transcription alignment data
structure indicating, for each word pair, the transcription word location
of a transcription word and the image word location of a paired image
word.
14. The method of operating a machine of claim 13 wherein, when the 2D text
image includes a first image word and the transcription dam structure,
when visually inspected and compared with the 2D text image, does not
include transcription labels visually paired with the first image word
thereby indicating a missing transcription word, the alignment operation
detects the missing transcription word; the alignment operation further
producing missing word data indicating that the first image word is not
paired with a transcription word; the image-transcription alignment data
structure using the missing word data to indicate that an image word
location of the first image word is paired with no transcription word
location of a transcription word.
15. The method of operating a machine of claim 13 wherein
the alignment operation is implemented as a dynamic programming operation
that represents the ordered transcription word sequence and the ordered
image word sequence as a two-dimensional lattice data structure, hereafter
referred to as a lattice, including a plurality of nodes and transitions
between nodes; a transition between two nodes representing a pair of
sequences of zero or more image words in the ordered image word sequence
and zero or more transcription words in the ordered transcription word
sequence; the dynamic programming operation producing a path through the
nodes in the lattice from a first node to a final node using scores
computed for nodes and computed for transitions between nodes in the
lattice; a score for a transition indicating a similarity measurement
between the zero or more transcription words and the zero or more image
words associated with the transition; the dynamic programming operation
storing node identification information for each node and transition in
the path, as the path through the lattice is produced; and
the alignment operation produces the image-transcription alignment data
structure indicating the list of word pairs by backtracing through the
nodes included in the path from the final node to the first node using the
node identification information. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
FIELD OF THE INVENTION
The present invention relates generally to text image layout analysis, and
more particularly to a method for aligning image region locations in a
text image to strings in a text transcription of the image.
BACKGROUND
Document, or page, layout analysis is a document processing technique used
to determine the physical and logical structures of a document in terms of
the geometric, spatial and functional relationships of its document
components, and in particular, its text document components. Its typical
purpose is to automatically transform document structures from a
nonelectronic medium, like paper, into an electronic medium. Document
layout analysis is usually characterized as having two aspects: a
structural segmentation process that analyzes the document to find the
physical locations of document components, and a functional labeling
process that identifies the document components in some manner with
respect to their logical function in the document. Document components
typically include single connected components and groups of connected
components. Single connected components are regions of foreground (e.g.,
black) image pixel values such as noise specks, dots, dashes, lines and
parts of a character that are touching. Groups of connected components,
often called blocks, include a complete character consisting of one or
more connected components, a word, text line, or group of text lines. The
structural segmentation process typically produces the image locations of
blocks. Blocks may be further analyzed in a manner that assigns functional
labels to them according to their physical features or their "meaning," as
imposed by the functional requirements of a particular type of document.
For example, text blocks in a technical journal document may be identified
and labeled as the title block, abstract, section titles, paragraphs and
footnotes, while the blocks in a business letter document may be labeled
as the return address, the inside address, the salutation, the body or
message, and the closing. Functional layout analysis is sometimes referred
to as syntactic or logical layout analysis. The structural segmentation of
the text image and the functional labeling of the blocks may occur
sequentially or simultaneously, depending on the particular implementation
of a document layout analysis method.
Many document or page layout analysis techniques employ various kinds of a
priori knowledge about the document domain in order to analyze a document
layout from its image in order to make inferences about its logical
structure and content. In many cases this a priori knowledge needs to be
constructed manually by a person of relative skill in document layout
description or understanding. For example, some of these techniques use,
as an input to layout analysis, a form of a document model or document
specification that describes some aspect of the document image structure
of the class of document images that is expected as the input text image.
An example of a document layout analysis system that uses a formal model
as a document description is disclosed in G. Nagy, et al in "A prototype
document image analysis system for technical journals", IEEE Computer,
July, 1992, pp. 10-22. Nagy et al. disclose a document layout analysis
system that uses a collection of publication-specific document grammars
that are formal descriptions of all valid page formats that articles in a
given publication can assume. The system uses the document grammar to
guide a segmentation and labeling process that subdivides the page image
into a collection of nested rectangular blocks. The document grammar for a
specific journal consists of a set of block grammars, each of which
subdivides a block horizontally or vertically into a set of subblocks.
Nagy, et al acknowledge at pg. 16 that constructing the block grammars by
hand is very time-consuming and is considered a weakness of their
approach.
A. Dengel, et al. in "From Paper to Office Document Standard
Representation" Computer, Vol 25, No. 7, July 1992, pp. 63-67, discloses
another example of a document layout analysis system that uses a document
model of a priori knowledge about a class of documents as input to the
layout analysis process. In particular, this layout analysis system has as
its purpose the transformation of printed documents, in this case business
letters, into an ODA-conforming format. ODA, or Office Document
Architecture, is an international electronic document representation
standard designed for generating and exchanging electronic documents. The
layout analysis process disclosed in Dengel et al involves three steps:
layout analysis, logical labeling of the layout structures, and text
recognition. The system includes two generic tree structures that provide
different but complementary views of the structure of a class of
documents.
Conventional document layout analysis systems typically include a text
recognition operation as a component that produces the encoded content
included in the image regions identified by the structural and logical
layout components. Typically, image regions in an image are identified
first, and then a conventional recognition operation is invoked to produce
the text string content of the image regions and to link that content to
the identified image regions. Linking image locations to specific encoded
content is important when it is necessary or desirable to be able to
reconstruct the original formatting of the document image from the output
recognition file or to have the capability to edit the original document
in a conventional word processing operation. It is also important for
purposes of browsing an image to locate a particular region or to search
for specific information.
Use of a priori knowledge of some kind about the document is likely to
improve the results of the document layout analysis process simply because
some possibilities about the identity of a particular document block can
be excluded from consideration during the process, and accuracy and
efficiency are likely to improve as a result. However, when the a priori
information about the document takes the form of a complex document model,
the benefit of having this knowledge in advance is often outweighed by the
amount of time and effort and the level of skill involved in creating the
model, as well as by the significant and additional computational
complexity associated with handling the model.
The encoded content data of a document, in the form of a text string of the
characters in a specific text image produced by a conventional text
recognition system, has conventionally been part of the document layout
analysis process, and not heretofore been considered to be a suitable form
of a priori knowledge about the image because it has been thought to
contain insufficient information to be a useful aid in the process.
However, a transcription is relatively easy to produce in comparison to
the more formal document models used by the examples of document layout
analysis systems discussed previously.
SUMMARY OF THE INVENTION
The present invention is a type of document layout analysis technique that
makes use of a very prevalent and available source of a priori knowledge
about a text image: namely, a transcription of the image. The present
invention identifies image regions in a text image and associates, or
labels, them with string components of a transcription of the text image,
a process that will be referred to as "aligning" the image to the
transcription.
The image-transcription alignment technique of the present invention is
premised on the discovery that a transcription available for a text image
provides important information about how glyphs occurring in the text
image are arranged. In particular, portions of an image, called image
regions, that are identified by image locations may be matched to strings
in the transcription, thereby associating, in an output data structure,
the strings, or labels, in the transcription with the identified image
regions. This association may then be used by a subsequent operation on
the text image to facilitate the location of identified glyphs in the
image region, to improve the efficiency of the subsequent operation, or to
accomplish any other operation that benefits from prior labeling of image
regions with encoded text strings. Examples of image regions include
regions that bound various geometric, syntactic or semantic image units
such as words, text lines, sentences or paragraphs, formatted or itemized
lists, headings, and footnotes. The ability to link the content of a
transcription to various ones of these types of image regions may depend,
to a certain extent, on the content and interpretation of the information
in the transcription.
The present invention, then, provides a method of operating a machine to
label image locations of image regions in a text image with label data
available in a transcription associated with the text image. The machine
the method operates includes a processor and a memory device for storing
data. The data stored in the memory device includes the instruction data
the processor executes to operate the machine, and the processor is
connected to the memory device so that it can access the stored data. The
image-transcription alignment method includes operating the processor to
store, in the memory device of the machine, an image definition data
structure defining an image, referred to as a text image, including a
plurality of glyphs, and a transcription data structure, referred to as a
transcription, associated with the text image. The transcription includes
an ordered sequence of transcription label data items, referred to as
transcription labels, indicating information about the glyphs occurring in
the text image and arranged in an ordered label sequence indicating an
expected positional glyph sequence of the plurality of glyphs occurring in
the text image. The processor is then operated to produce an ordered image
region sequence of image regions occurring in the text image. Each image
region is defined with respect to its location in the text image relative
to at least one glyph occurring in the text image, and indicates an image
region location defined according to an image coordinate system describing
the text image. An ordered transcription region sequence of transcription
regions occurring in the transcription is also produced. Each
transcription region is a sequence of at least one transcription label. A
matching operation is then performed to match image regions in the ordered
image region sequence to transcription regions in the ordered
transcription region sequence. The matching operation computes similarity
measurements measuring the similarity between image regions and
transcription regions, and finds a best match between the image regions in
the ordered image region sequence and transcription regions in the ordered
transcription region sequence that optimizes the similarity measurements.
An image-transcription alignment data structure is produced from the
results of the matching operation that indicates transcription regions
paired to image region locations of matching image regions. In one
embodiment, the matching operation is implemented as a dynamic programming
operation that aligns image regions in the ordered image region sequence
to transcription regions in the ordered transcription region sequence
using similarity measurements, called scores, between image region
parameters and transcription region parameters, subject to the constraint
that region order in each of the ordered transcription region sequence and
the ordered image region sequence is maintained during the dynamic
programming operation.
A particularly common and useful implementation of the image-transcription
alignment technique of the present invention is as a supervised method for
finding image word locations and labeling each word location with a string
of character codes in the transcription. The strings of character codes in
the transcription that represent words provide a priori information about
regions in the text image that are likely to contain the glyphs indicated
by the strings of character codes in the transcription.
In this aspect of the method, the text image and transcription data
structures are read and stored in memory as previously described. The
glyphs in the text image represent characters in an image character set.
The processor is operated to produce an ordered image word sequence of
image words occurring in the text image, with each image word being an
image region in the text image that includes at least one glyph. The
processor is also operated to produce an ordered transcription word
sequence of transcription words occurring in the transcription, with each
transcription word being a sequence of at least one transcription label
indicating a character code in the image character set. Then, a matching
operation is performed that matches image words in the ordered image word
sequence to transcription words in the ordered transcription word
sequence. The matching operation produces an image-transcription alignment
data structure indicating image words paired to matching transcription
words.
As previously described, a particular embodiment of the matching operation
is implemented as a dynamic programming operation that aligns image words
in the ordered image word sequence to transcription words in the ordered
transcription word sequence by finding a path from an initial node to a
final node in a lattice structure that represents the two word sequences.
Image and transcription parameters are used for the matching, or
alignment, process; in particular, word lengths are used, where image word
lengths are pixel distances estimated from the image and transcription
word lengths are estimated from the number of characters in the
transcription word multiplied by an estimated character image width of the
characters in the input text image. Scores in the path through the lattice
structure are computed using a function of the word lengths, subject to
the constraint that word order in each of the ordered transcription word
sequence and the ordered image word sequence is maintained during the
dynamic programming operation.
One use for this word alignment technique specifically involves an
invention for the automatic training of character templates disclosed in
copending U.S. patent application, Ser. No. 08/431,223, assigned to the
same assignee as the present invention. That application discloses a
method for automatically training font-specific binary character templates
using character image samples identified in an input image, and using a
transcription associated with the input image to label the image samples.
An implementation of the template training technique models the input text
image and the transcription, which may be errorful, as Markov sources in
the form of stochastic finite state transition networks that are merged
into a single network. A dynamic programming technique is then used to
decode the input image to produce image locations of character image
samples labeled according to the transcription labels of the associated
transcription. The decoding process disclosed functionally compares an
ideal image of character templates positioned in an image plane to the
observed input image by determining a best path through the merged network
from an initial node to a final node; the path is determined as a result
of maximizing a log likelihood score that is the sum of individual log
likelihood scores computed for individual transitions taken in the path.
Decoding in this manner is essentially done on a pixel by pixel basis, is
accomplished through successive iterations, and can be computationally
demanding and complex. The present invention can reduce the computational
complexity and the number of iterations required of this pixel by pixel
decoding to decoding on a word by word basis, since the present invention
produces an association between image regions that have been determined to
correspond to words and strings in the transcription that, as a result of
the alignment process, indicate character codes that should properly
identify the glyphs in the image words.
In addition, with reference to the copending template training technique, a
particularly errorful input transcription may have a significant effect on
the number of iterations required to achieve accurate image sample
identification and labeling; for example, deletions of entire words or
lines in the transcription are very likely to adversely affect training
efficiency. Thus, the present invention can also provide, in a
preprocessing step to the decoding process in the template training
invention, a measure of confidence in the success of decoding by word
based on the quality of the transcription; the total score produced by the
matching operation as a result of finding the best path through the
lattice structure in the present invention can provide an indication of
how well portions of the transcription have been aligned with portions of
the image at the word level. If this information indicates that word level
alignment has been largely successful and that there is a good correlation
between portions of the transcription and portions of the input image, the
training technique can either proceed with merging the image and
transcription networks, and perform template-level, pixel by pixel
decoding of the image to produce the training data, or the technique can
perform decoding on a word by word basis. For example, the path | | |