WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Method for aligning a text image to a transcription of the image    
United States Patent5689585   
Link to this pagehttp://www.wikipatents.com/5689585.html
Inventor(s)Bloomberg; Dan S. (Palo Alto, CA); Niles; Leslie T. (Palo Alto, CA); Kopec; Gary E. (Belmont, CA); Chou; Philip Andrew (Menlo Park, CA)
AbstractA method for establishing a relationship between a text image and a transcription associated with the text image uses conventional image processing techniques to identify one or more geometric attributes, or image parameters, of each of a sequence of regions of the text image. The transcription labels in the transcription are analyzed to determine a comparable set of parameters in transcription label sequence. A matching operation then matches the respective parameters of the two sequences to identify image regions that match with transcription regions. The result is an output data structure that minimally identifies image locations of interest to a subsequent operation that processes the text image. The output data structure may also pair each of the image locations of interest to a transcription location, in effect producing a set of labeled image locations. In one embodiment, the sequence of locations of words and their observed lengths in the text image are determined. The transcription is analyzed to identify words, and transcription word lengths are computed using an estimated image character width of glyphs in the text image. The sequence of observed image word lengths is then matched to the sequence of computed transcription word lengths using a dynamic programming algorithm that finds a best path through a two-dimensional lattice of nodes and transitions between nodes, where the transitions represent pairs of sequences of zero or more word lengths. An output data structure contains entries, each of which pairs a transcription word with a matching image word location.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5689585
Method for aligning a text image to a transcription of the image - US Patent 5689585 Drawing
Method for aligning a text image to a transcription of the image
Inventor     Bloomberg; Dan S. (Palo Alto, CA); Niles; Leslie T. (Palo Alto, CA); Kopec; Gary E. (Belmont, CA); Chou; Philip Andrew (Menlo Park, CA)
Owner/Assignee     Xerox Corporation (Stamford, CT)
Patent assignment
All assignments
Publication Date     November 18, 1997
Application Number     08/431,004
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     April 28, 1995
US Classification     382/229 715/500
Int'l Classification     G06K 009/72
Examiner     Mancuso; Joseph
Assistant Examiner     Kahng; Anthony H.
Attorney/Law Firm     Bares; Judith C.
Address
Parent Case    
Priority Data    
USPTO Field of Search     382/229 382/224 364/419.13 364/419.1
Patent Tags     aligning text image transcription image
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5544050
Abe
715/532
Aug,1996

[0 after 0 votes]
5526444
Kopec
382/233
Jun,1996

[0 after 0 votes]
5524066
Kaplan
382/229
Jun,1996

[0 after 0 votes]
5513304
Spitz
715/500
Apr,1996

[0 after 0 votes]
5473705
Abe
382/100
Dec,1995

[0 after 0 votes]
5455871
Bloomberg

Oct,1995

[0 after 0 votes]
5438630
Chen
382/159
Aug,1995

[0 after 0 votes]
5438628
Spitz
382/181
Aug,1995

[0 after 0 votes]
5438512
Mantha
715/517
Aug,1995

[0 after 0 votes]
5333275
Wheatley
704/243
Jul,1994

[0 after 0 votes]
5321770
Huttenlocher
382/174
Jun,1994

[0 after 0 votes]
4905287
Segawa
704/254
Feb,1990

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed:

1. A method of operating a machine to align image words in a text image to transcription words in a transcription associated with the text image; the machine including a processor and a memory device for storing data; the data stored in the memory device including instruction data the processor executes to operate the machine; the processor being connected to the memory device for accessing the data stored therein; the method comprising:

operating the processor to obtain an image definition data structure defining a text image including a plurality of glyphs representing characters in an image character set;

operating the processor to obtain a transcription data structure associated with the text image and including a plurality of transcription labels indicating character codes representing characters in the image character set;

operating the processor to produce an ordered image word sequence of image words occurring in the text image; each image word being an image region in the text image including at least one glyph;

operating the processor to produce an ordered transcription word sequence of transcription words occurring in the transcription data structure; each transcription word being a sequence of at least one transcription label indicating a character code in the image character set; and

operating the processor to perform an alignment operation to align image words in the ordered image word sequence with transcription words in the ordered transcription word sequence subject to a constraint of maintaining word order in each of the ordered transcription word sequence and the ordered image word sequence during alignment; the alignment operation producing an image-transcription alignment data structure indicating each image word in the ordered image word sequence paired with either no (a null) transcription word or with at most one transcription word in the ordered transcription word sequence.

2. The method of operating a machine of claim 1 wherein

operating the processor to produce the ordered image word sequence of image words occurring in the text image includes determining an image word length value for each image word in the ordered image word sequence;

operating the processor to produce the ordered transcription word sequence of transcription words occurring in the transcription data structure includes producing a transcription word length value for each transcription word in the ordered transcription word sequence; and

the alignment operation aligns transcription words in the ordered transcription word sequence with image words in the ordered image word sequence by matching image word lengths to transcription word lengths.

3. The method of operating a machine of claim 2 further including operating the processor to determine an image character width value indicating an estimated character width of a glyph in the text image; and wherein the transcription word length value produced for each transcription word in the ordered transcription word sequence is produced using the image character width value.

4. The method of operating a machine of claim 1 wherein, when the text image includes a first image word and the transcription data structure, when visually inspected and compared with the text image, does not include transcription labels visually paired with the first image word thereby indicating a missing transcription word, the alignment operation detects the missing transcription word; the alignment operation indicating in the image-transcription alignment data structure that the first image word is paired with no (a null) transcription word.

5. The method of operating a machine of claim 1

wherein operating the processor to produce the ordered image word sequence of image words occurring in the text image includes determining an image word parameter for each image word in the ordered image word sequence;

wherein operating the processor to produce the ordered transcription word sequence of transcription words occurring in the transcription data structure includes producing a transcription word parameter for each transcription word in the ordered transcription word sequence; and

wherein the alignment operation uses similarity measurements between the image word parameters and the transcription word parameters to perform the alignment.

6. The method of operating a machine of claim 1 wherein

the transcription data structure includes a transcription label error causing the processor to identify a sequence of transcription labels as a transcription word, referred to as an errorful transcription word, in the ordered transcription word sequence that cannot be identically matched to an image word in the text image by visual inspection of the text image; the errorful transcription word including at least one of a transcription label deletion, a transcription label substitution, or a transcription label insertion; and

the alignment operation, when pairing image words in the ordered image word sequence to transcription words in the ordered transcription word sequence including the errorful transcription word, determines whether the errorful transcription word can be paired with an image word in the ordered image word sequence and if so, determines with which image word to pair the errorful transcription word.

7. The method of operating a machine of claim 1 wherein

the alignment operation is implemented as a dynamic programming operation that represents the ordered transcription word sequence and the ordered image word sequence as a two-dimensional lattice data structure, hereafter referred to as a lattice, including a plurality of nodes and transitions between nodes; a transition between two nodes representing a pair of sequences of zero or more image words in the ordered image word sequence and zero or more transcription words in the ordered transcription word sequence; the dynamic programming operation producing a path through the nodes in the lattice from a first node to a final node using scores computed for nodes and computed for transitions between nodes in the lattice; a score for a transition indicating a similarity measurement between the zero or more transcription words and the zero or more image words associated with the transition; the dynamic programming operation storing node identification information for each node and transition in the path, as the path through the lattice is produced; and

the alignment operation produces the image-transcription alignment data structure indicating transcription words paired to image words by backtracing through the nodes included in the path from the final node to the first node using the node identification information.

8. The method of operating a machine of claim 7 wherein the similarity measurement indicated by the score for a transition is a function of image word lengths of the zero or more image words and of the zero or more transcription words associated with the transition.

9. The method of operating a machine of claim 1 wherein

each image word in the ordered image word sequence is represented as an image word location defined according to an image coordinate system describing the text image;

each transcription word in the ordered transcription word sequence is represented as a transcription word location in the ordered label sequence of the transcription labels included in the transcription data structure; and

transcription words aligned with image words are represented in the image-transcription alignment data structure as transcription word locations in the transcription and as image word locations in the text image, respectively.

10. A method of operating a machine to label image locations of image regions in a text image with label data available in a transcription data structure associated with the text image; the machine including a processor and for memory device for storing data; the data stored in the memory device including instruction data the processor executes to operate the machine; the processor being connected to the memory device for accessing the data stored therein; the method comprising:

operating the processor to store, in the memory device of the machine, an image definition data structure defining a text image including a plurality of glyphs;

operating the processor to store, in the memory device of the machine, a transcription data structure associated with the text image and including an ordered sequence of transcription labels indicating character code information about image regions occurring in the text image;

operating the processor to produce an ordered image region sequence of image regions occurring in the text image; each image region being defined with respect to its location in the text image relative to at least one glyph occurring in the text image;

operating the processor to produce an ordered transcription region sequence of transcription regions occurring in the transcription data structure; each transcription region being a sequence of at least one transcription label;

operating the processor to perform an alignment operation to align image regions in the ordered image region sequence with transcription regions in the ordered transcription region sequence; the alignment operation computing similarity measurements measuring a similarity between image regions and transcription regions; the alignment operation determining a best pairing between the image regions in the ordered image region sequence and transcription regions in the ordered transcription region sequence that optimizes the similarity measurements such that each one of the image regions in the ordered image region sequence is paired with either no (a null) transcription region or with at most one transcription region in the ordered transcription region sequence; and

operating the processor to produce an image transcription alignment data structure indicating transcription regions paired with image region locations of paired image regions.

11. The method of operating a machine of claim 10 further including

operating the processor to determine an image region parameter value for each image region; the image region parameter value indicating a measurement of a geometric characteristic of the image region;

operating the processor to determine an image glyph parameter value indicating a measurement of a geometric characteristic of a glyph occurring in the text image; and

operating the processor to determine a transcription region parameter for each transcription region in the ordered transcription region sequence using the image glyph parameter value; and

wherein the alignment operation is implemented as a dynamic programing operation that aligns image regions in the ordered image region sequence to transcription regions in the ordered transcription region sequence using similarity measurements between image region parameters and transcription region parameters, subject to a constraint of maintaining region order in each of the ordered transcription region sequence and the ordered image region sequence during the dynamic programming operation.

12. The method of operating a machine of claim 10 wherein, when the ordered transcription region sequence includes fewer transcription regions than image regions included in the ordered image region sequence, the transcription data structure indicates a missing transcription region; and wherein the alignment operation uses the similarity measurements to detect the missing transcription region so that the best pairing between the image regions in the ordered image region sequence and transcription regions in the ordered transcription region sequence indicates the missing transcription region.

13. A method of operating a machine to pair image words in a two-dimensional (2D) text image with transcription words in a transcription associated with the text image; the machine including a processor and a memory device for storing data; the data stored in the memory device including instruction data the processor executes to operate the machine; the processor being connected to the memory device for accessing the data stored therein; the method comprising:

operating the processor to receive and store, in the memory device of the machine, an image definition data structure defining a two-dimensional (2D) text image of pixels and including a plurality of glyphs; the 2D text image having a vertical dimension size larger than a single line; the plurality of glyphs representing characters in an image character set;

operating the processor to receive and store, in the memory device of the machine, a transcription data structure associated with the 2D text image and including a plurality of transcription labels indicating character codes representing characters in the image character set;

operating the processor to determine an image word location in the 2D text image for each of a plurality of image words occurring therein; each image word being an image region including at least one glyph positioned with respect to an image baseline; the plurality of image word locations identifying an ordered image word sequence of the image words occurring in the 2D text image;

operating the processor to determine an image word length value for each image word in the 2D text image indicating a length of the image word;

operating the processor to determine an image character width value indicating an estimated character width of a glyph in the 2D text image;

operating the processor to determine a transcription word location for each of a plurality of transcription words occurring in the transcription data structure; each transcription word being a sequence of at least one transcription label indicating a character code in the image character set and separated from other transcription labels indicating character codes in the image character set by a transcription word boundary character code; the plurality of transcription word locations identifying an ordered transcription word sequence of the transcription words occurring in the transcription data structure;

operating the processor to determine a transcription word length value for each transcription word using the image character width value and a count of the transcription labels included in the transcription word;

operating the processor to perform an alignment operation to align transcription word lengths in the ordered transcription word sequence with image word lengths in the ordered image word sequence; the alignment operation maintaining the ordered image word sequence and the ordered transcription word sequence when aligning the image word lengths and the transcription word lengths; the alignment operation producing a list of word pairs indicating transcription words in the ordered transcription word sequence paired to matching image words in the ordered image word sequence; and

operating the processor to produce an image-transcription alignment data structure using the list of word pairs, the image word locations and the transcription word locations; the image-transcription alignment data structure indicating, for each word pair, the transcription word location of a transcription word and the image word location of a paired image word.

14. The method of operating a machine of claim 13 wherein, when the 2D text image includes a first image word and the transcription dam structure, when visually inspected and compared with the 2D text image, does not include transcription labels visually paired with the first image word thereby indicating a missing transcription word, the alignment operation detects the missing transcription word; the alignment operation further producing missing word data indicating that the first image word is not paired with a transcription word; the image-transcription alignment data structure using the missing word data to indicate that an image word location of the first image word is paired with no transcription word location of a transcription word.

15. The method of operating a machine of claim 13 wherein

the alignment operation is implemented as a dynamic programming operation that represents the ordered transcription word sequence and the ordered image word sequence as a two-dimensional lattice data structure, hereafter referred to as a lattice, including a plurality of nodes and transitions between nodes; a transition between two nodes representing a pair of sequences of zero or more image words in the ordered image word sequence and zero or more transcription words in the ordered transcription word sequence; the dynamic programming operation producing a path through the nodes in the lattice from a first node to a final node using scores computed for nodes and computed for transitions between nodes in the lattice; a score for a transition indicating a similarity measurement between the zero or more transcription words and the zero or more image words associated with the transition; the dynamic programming operation storing node identification information for each node and transition in the path, as the path through the lattice is produced; and

the alignment operation produces the image-transcription alignment data structure indicating the list of word pairs by backtracing through the nodes included in the path from the final node to the first node using the node identification information.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

The present invention relates generally to text image layout analysis, and more particularly to a method for aligning image region locations in a text image to strings in a text transcription of the image.

BACKGROUND

Document, or page, layout analysis is a document processing technique used to determine the physical and logical structures of a document in terms of the geometric, spatial and functional relationships of its document components, and in particular, its text document components. Its typical purpose is to automatically transform document structures from a nonelectronic medium, like paper, into an electronic medium. Document layout analysis is usually characterized as having two aspects: a structural segmentation process that analyzes the document to find the physical locations of document components, and a functional labeling process that identifies the document components in some manner with respect to their logical function in the document. Document components typically include single connected components and groups of connected components. Single connected components are regions of foreground (e.g., black) image pixel values such as noise specks, dots, dashes, lines and parts of a character that are touching. Groups of connected components, often called blocks, include a complete character consisting of one or more connected components, a word, text line, or group of text lines. The structural segmentation process typically produces the image locations of blocks. Blocks may be further analyzed in a manner that assigns functional labels to them according to their physical features or their "meaning," as imposed by the functional requirements of a particular type of document. For example, text blocks in a technical journal document may be identified and labeled as the title block, abstract, section titles, paragraphs and footnotes, while the blocks in a business letter document may be labeled as the return address, the inside address, the salutation, the body or message, and the closing. Functional layout analysis is sometimes referred to as syntactic or logical layout analysis. The structural segmentation of the text image and the functional labeling of the blocks may occur sequentially or simultaneously, depending on the particular implementation of a document layout analysis method.

Many document or page layout analysis techniques employ various kinds of a priori knowledge about the document domain in order to analyze a document layout from its image in order to make inferences about its logical structure and content. In many cases this a priori knowledge needs to be constructed manually by a person of relative skill in document layout description or understanding. For example, some of these techniques use, as an input to layout analysis, a form of a document model or document specification that describes some aspect of the document image structure of the class of document images that is expected as the input text image. An example of a document layout analysis system that uses a formal model as a document description is disclosed in G. Nagy, et al in "A prototype document image analysis system for technical journals", IEEE Computer, July, 1992, pp. 10-22. Nagy et al. disclose a document layout analysis system that uses a collection of publication-specific document grammars that are formal descriptions of all valid page formats that articles in a given publication can assume. The system uses the document grammar to guide a segmentation and labeling process that subdivides the page image into a collection of nested rectangular blocks. The document grammar for a specific journal consists of a set of block grammars, each of which subdivides a block horizontally or vertically into a set of subblocks. Nagy, et al acknowledge at pg. 16 that constructing the block grammars by hand is very time-consuming and is considered a weakness of their approach.

A. Dengel, et al. in "From Paper to Office Document Standard Representation" Computer, Vol 25, No. 7, July 1992, pp. 63-67, discloses another example of a document layout analysis system that uses a document model of a priori knowledge about a class of documents as input to the layout analysis process. In particular, this layout analysis system has as its purpose the transformation of printed documents, in this case business letters, into an ODA-conforming format. ODA, or Office Document Architecture, is an international electronic document representation standard designed for generating and exchanging electronic documents. The layout analysis process disclosed in Dengel et al involves three steps: layout analysis, logical labeling of the layout structures, and text recognition. The system includes two generic tree structures that provide different but complementary views of the structure of a class of documents.

Conventional document layout analysis systems typically include a text recognition operation as a component that produces the encoded content included in the image regions identified by the structural and logical layout components. Typically, image regions in an image are identified first, and then a conventional recognition operation is invoked to produce the text string content of the image regions and to link that content to the identified image regions. Linking image locations to specific encoded content is important when it is necessary or desirable to be able to reconstruct the original formatting of the document image from the output recognition file or to have the capability to edit the original document in a conventional word processing operation. It is also important for purposes of browsing an image to locate a particular region or to search for specific information.

Use of a priori knowledge of some kind about the document is likely to improve the results of the document layout analysis process simply because some possibilities about the identity of a particular document block can be excluded from consideration during the process, and accuracy and efficiency are likely to improve as a result. However, when the a priori information about the document takes the form of a complex document model, the benefit of having this knowledge in advance is often outweighed by the amount of time and effort and the level of skill involved in creating the model, as well as by the significant and additional computational complexity associated with handling the model.

The encoded content data of a document, in the form of a text string of the characters in a specific text image produced by a conventional text recognition system, has conventionally been part of the document layout analysis process, and not heretofore been considered to be a suitable form of a priori knowledge about the image because it has been thought to contain insufficient information to be a useful aid in the process. However, a transcription is relatively easy to produce in comparison to the more formal document models used by the examples of document layout analysis systems discussed previously.

SUMMARY OF THE INVENTION

The present invention is a type of document layout analysis technique that makes use of a very prevalent and available source of a priori knowledge about a text image: namely, a transcription of the image. The present invention identifies image regions in a text image and associates, or labels, them with string components of a transcription of the text image, a process that will be referred to as "aligning" the image to the transcription.

The image-transcription alignment technique of the present invention is premised on the discovery that a transcription available for a text image provides important information about how glyphs occurring in the text image are arranged. In particular, portions of an image, called image regions, that are identified by image locations may be matched to strings in the transcription, thereby associating, in an output data structure, the strings, or labels, in the transcription with the identified image regions. This association may then be used by a subsequent operation on the text image to facilitate the location of identified glyphs in the image region, to improve the efficiency of the subsequent operation, or to accomplish any other operation that benefits from prior labeling of image regions with encoded text strings. Examples of image regions include regions that bound various geometric, syntactic or semantic image units such as words, text lines, sentences or paragraphs, formatted or itemized lists, headings, and footnotes. The ability to link the content of a transcription to various ones of these types of image regions may depend, to a certain extent, on the content and interpretation of the information in the transcription.

The present invention, then, provides a method of operating a machine to label image locations of image regions in a text image with label data available in a transcription associated with the text image. The machine the method operates includes a processor and a memory device for storing data. The data stored in the memory device includes the instruction data the processor executes to operate the machine, and the processor is connected to the memory device so that it can access the stored data. The image-transcription alignment method includes operating the processor to store, in the memory device of the machine, an image definition data structure defining an image, referred to as a text image, including a plurality of glyphs, and a transcription data structure, referred to as a transcription, associated with the text image. The transcription includes an ordered sequence of transcription label data items, referred to as transcription labels, indicating information about the glyphs occurring in the text image and arranged in an ordered label sequence indicating an expected positional glyph sequence of the plurality of glyphs occurring in the text image. The processor is then operated to produce an ordered image region sequence of image regions occurring in the text image. Each image region is defined with respect to its location in the text image relative to at least one glyph occurring in the text image, and indicates an image region location defined according to an image coordinate system describing the text image. An ordered transcription region sequence of transcription regions occurring in the transcription is also produced. Each transcription region is a sequence of at least one transcription label. A matching operation is then performed to match image regions in the ordered image region sequence to transcription regions in the ordered transcription region sequence. The matching operation computes similarity measurements measuring the similarity between image regions and transcription regions, and finds a best match between the image regions in the ordered image region sequence and transcription regions in the ordered transcription region sequence that optimizes the similarity measurements. An image-transcription alignment data structure is produced from the results of the matching operation that indicates transcription regions paired to image region locations of matching image regions. In one embodiment, the matching operation is implemented as a dynamic programming operation that aligns image regions in the ordered image region sequence to transcription regions in the ordered transcription region sequence using similarity measurements, called scores, between image region parameters and transcription region parameters, subject to the constraint that region order in each of the ordered transcription region sequence and the ordered image region sequence is maintained during the dynamic programming operation.

A particularly common and useful implementation of the image-transcription alignment technique of the present invention is as a supervised method for finding image word locations and labeling each word location with a string of character codes in the transcription. The strings of character codes in the transcription that represent words provide a priori information about regions in the text image that are likely to contain the glyphs indicated by the strings of character codes in the transcription.

In this aspect of the method, the text image and transcription data structures are read and stored in memory as previously described. The glyphs in the text image represent characters in an image character set. The processor is operated to produce an ordered image word sequence of image words occurring in the text image, with each image word being an image region in the text image that includes at least one glyph. The processor is also operated to produce an ordered transcription word sequence of transcription words occurring in the transcription, with each transcription word being a sequence of at least one transcription label indicating a character code in the image character set. Then, a matching operation is performed that matches image words in the ordered image word sequence to transcription words in the ordered transcription word sequence. The matching operation produces an image-transcription alignment data structure indicating image words paired to matching transcription words.

As previously described, a particular embodiment of the matching operation is implemented as a dynamic programming operation that aligns image words in the ordered image word sequence to transcription words in the ordered transcription word sequence by finding a path from an initial node to a final node in a lattice structure that represents the two word sequences. Image and transcription parameters are used for the matching, or alignment, process; in particular, word lengths are used, where image word lengths are pixel distances estimated from the image and transcription word lengths are estimated from the number of characters in the transcription word multiplied by an estimated character image width of the characters in the input text image. Scores in the path through the lattice structure are computed using a function of the word lengths, subject to the constraint that word order in each of the ordered transcription word sequence and the ordered image word sequence is maintained during the dynamic programming operation.

One use for this word alignment technique specifically involves an invention for the automatic training of character templates disclosed in copending U.S. patent application, Ser. No. 08/431,223, assigned to the same assignee as the present invention. That application discloses a method for automatically training font-specific binary character templates using character image samples identified in an input image, and using a transcription associated with the input image to label the image samples. An implementation of the template training technique models the input text image and the transcription, which may be errorful, as Markov sources in the form of stochastic finite state transition networks that are merged into a single network. A dynamic programming technique is then used to decode the input image to produce image locations of character image samples labeled according to the transcription labels of the associated transcription. The decoding process disclosed functionally compares an ideal image of character templates positioned in an image plane to the observed input image by determining a best path through the merged network from an initial node to a final node; the path is determined as a result of maximizing a log likelihood score that is the sum of individual log likelihood scores computed for individual transitions taken in the path. Decoding in this manner is essentially done on a pixel by pixel basis, is accomplished through successive iterations, and can be computationally demanding and complex. The present invention can reduce the computational complexity and the number of iterations required of this pixel by pixel decoding to decoding on a word by word basis, since the present invention produces an association between image regions that have been determined to correspond to words and strings in the transcription that, as a result of the alignment process, indicate character codes that should properly identify the glyphs in the image words.

In addition, with reference to the copending template training technique, a particularly errorful input transcription may have a significant effect on the number of iterations required to achieve accurate image sample identification and labeling; for example, deletions of entire words or lines in the transcription are very likely to adversely affect training efficiency. Thus, the present invention can also provide, in a preprocessing step to the decoding process in the template training invention, a measure of confidence in the success of decoding by word based on the quality of the transcription; the total score produced by the matching operation as a result of finding the best path through the lattice structure in the present invention can provide an indication of how well portions of the transcription have been aligned with portions of the image at the word level. If this information indicates that word level alignment has been largely successful and that there is a good correlation between portions of the transcription and portions of the input image, the training technique can either proceed with merging the image and transcription networks, and perform template-level, pixel by pixel decoding of the image to produce the training data, or the technique can perform decoding on a word by word basis. For example, the path