WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Method and apparatus for converting bitmap image documents to editable coded data using a standard notation to record document recognition ambiguities    
United States Patent5359673   
Link to this pagehttp://www.wikipatents.com/5359673.html
Inventor(s)de La Beaujardiere; Jean-Marie R. (Palo Alto, CA)
AbstractDocuments represented as bitmap images are transformed into coded textual data and coded graphics data by graphics and textual recognizers, which use a standard notation for recording the results of the document recognition processes, including any ambiguities, in a document description language. Recognized portions of the document, represented as editable coded data, such as for example ASCII, are placed in elements, defined in the document description language, with all contents of an element sharing some common characteristic. Elements can include, for example: character-string-elements, questionable-character-elements, questionable-word-elements, verified-word-elements, alternate-word-elements, segment-elements, and arc-elements.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Inventor     de La Beaujardiere; Jean-Marie R. (Palo Alto, CA)
Owner/Assignee     Xerox Corporation (Stamford, CT)
Patent assignment
All assignments
Publication Date     October 25, 1994
Application Number     07/814,347
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     December 27, 1991
US Classification     382/229 382/310 715/513
Int'l Classification     G06K 009/72 G06K 009/03 G06F 015/38
Examiner     Razavi; Michael T.
Assistant Examiner     Fox; David
Attorney/Law Firm     Oliff & Berridge
Address
Parent Case    
Priority Data    
USPTO Field of Search     382/40 382/57 382/36 382/39 382/61 364/419 364/419.08 364/419.14 395/145 395/146 395/147
Patent Tags     converting bitmap image documents editable coded data standard notation record document recognition ambiguities
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5257323
Melen
382/310
Oct,1993

[0 after 0 votes]
5251273
Betts
382/311
Oct,1993

[0 after 0 votes]
4974260
Rudak
382/311
Nov,1990

[0 after 0 votes]
4949188
Sato
358/448
Aug,1990

[0 after 0 votes]
4914709
Rudak
382/311
Apr,1990

[0 after 0 votes]
4907285
Nakano
382/176
Mar,1990

[0 after 0 votes]
4760604
Cooper
382/155
Jul,1988

[0 after 0 votes]
4754489
Bokser
382/230
Jun,1988

[0 after 0 votes]
4674065
Lange
382/311
Jun,1987

[0 after 0 votes]
4654875
Srihari
382/229
Mar,1987

[0 after 0 votes]
4136395
Kolpek
715/533
Jan,1979

[0 after 0 votes]
4058795
Balm
382/230
Nov,1977

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A method of transforming a document represented as a bitmap image into an editable coded data stream defined using a machine readable document description language that records coded information resulting from the document transformation process and information regarding uncertainties in the document transformation process, comprising:

performing a first transformation operation on at least a text portion of said bitmap image using a character recognition apparatus, to transform at least said text portion of said bitmap image into coded information recognized with a level of confidence;

outputting and recording said coded information into one or more elements that are defined using said document description language, each element having a machine readable element type identifier that indicates the type of said coded information regarding the recognized bitmap image recorded in said element so that the type of coded information contained in each element can be known without examining the coded information contained in each element, the element type identifier for each element having been defined based on the type of coded information recorded in the element and the level of confidence with which said bitmap image represented by said coded information was recognized so that each of said elements is selectively identified, each element having coded information of a single type recorded therein; and

when said character recognition apparatus determines that the recognized bitmap image contained in an element has not been recognized with at least a predetermined level of confidence, recording in said element uncertainty information determined by said first recognition apparatus regarding said recognized bitmap image contained in said element;

wherein said element type identifier is a character-string-element or a questionable-character-element, each character-string-element containing a string of consecutive characters recognized by said character recognition apparatus with at least said predetermined level of confidence, each questionable-character-element containing said uncertainty information determined by said character recognition apparatus for a character which was not recognized with at least said predetermined level of confidence by said character recognition apparatus.

2. The method of claim 1, wherein said uncertainty information includes a degree of uncertainty with which said character recognition apparatus transformed said bitmap image.

3. The method of claim 1, further comprising:

for each questionable-character-element, using a word recognizer to transform said questionable-character-element and adjacent confidently recognized characters in a same word as said questionable-character-element into one or more elements having an element type identifier defined as a verified-word-element, by substituting alternate characters for said questionable-character-element when one or more words created by said substituting are recognized by said word recognition apparatus; when more than one verified-word-element is transformed for each questionable-character-element, said more than one verified-word-elements being placed in an element having an element type identifier defined as an alternate-word-element; said question-able-character-element remaining when no verified words are recognized by said word recognition apparatus.

4. The method of claim 3, further comprising:

for each alternate-word-element, using a semantics analyzer to transform verified words of the verified-word-elements contained in each alternate word element into an element identified as a character-string-element corresponding to one of the verified words contained in said alternate-word-element when said semantics analyzer determines that said one of said verified words is a correct word, said alternate-word-element remaining when none of said verified words is determined to be a correct word by said semantics analyzer.

5. The method of claim 4 wherein for each verified-word-element contained in each alternate word element, said semantics analyzer determines and identifies a confidence level of each verified word contained in each verified-word-element, and indicates the confidence level of each verified-word-element when none of the verified words in an alternate-word-element are determined to be a correct word by said semantics analyzer.

6. The method of claim 1, wherein for each questionable-character-element, said uncertainty information pertaining to a character not recognized with at least said predetermined level of confidence includes a most likely uncertain character identified by said character recognition apparatus.

7. The method of claim 6, wherein for each questionable-character-element, said uncertainty information pertaining to a character not recognized with at least said predetermined level of confidence also includes a degree of confidence determined by said character recognition apparatus for said most likely uncertain character.

8. The method of claim 1, wherein for each questionable-character-element, said uncertainty information pertaining to a character not recognized with at least said predetermined level of confidence includes alternate possible uncertain characters identified by said character recognition apparatus.

9. The method of claim 8, wherein for each questionable-character-element, said uncertainty information pertaining to a character not recognized with at least said predetermined level of confidence also includes a degree of confidence determined by said character recognition apparatus for each said alternate possible uncertain character.

10. A method of transforming a document represented as a bitmap image into an editable coded data stream defined using a machine readable document description language that records coded information resulting from the document transformation process and information regarding uncertainties in the document transformation process, comprising the steps of:

a) analyzing text portions of said bitmap image using a character recognizer to transform said text portions of said bitmap image into said coded information recognized with a level of confidence;

b) outputting and recording said coded information into a series of elements that are defined using said document description language, each element having a machine readable element type identifier that indicates the type of said coded information regarding the recognized bitmap image recorded in said element so that the type of coded information contained in each element can be known without examining the coded information contained in each element;

c) the element type identifier for each element having been defined using said document description language based on the type of coded information recorded in the elements by defining the type identifier of each of said elements as character-string-elements or questionable-character-elements, each element defined as a character-string-element containing a string of consecutive characters recognized by said character recognizer with at least a predetermined level of confidence, each element defined as a questionable-character-element containing information pertaining to a character not recognized with at least said predetermined level of confidence, said character-string-elements and said questionable-character-elements being arranged as a continuous stream of elements based on an order of the characters in said bitmap image, each element having coded information of a single type recorded therein;

d) for each element defined as a questionable-character-element, analyzing the coded information pertaining to the character contained therein and adjacent confidently recognized characters contained in a same word as said questionable-character-element using a word recognizer, to transform and record said coded information and adjacent confidently recognized characters into one or more elements having type identifiers defined as verified-word-elements by substituting alternate characters for said questionable-character-element and determining whether one or more verified words created by said substituting are recognized by said word recognizer; and

e) when more than one verified-word-elements are transformed for each questionable-character-element, placing said more than one elements defined as verified-word-elements in an element having an element type identifier defined as an alternate-word-element, said questionable-character-element remaining when no verified words are recognized by said word recognizer.

11. The method of claim 10, further comprising:

f) for each alternate-word-element, analyzing the verified words of the verified-word-elements contained therein and surrounding words using a semantics analyzer, to transform said alternate-word-element into an element identified as a character-string-element corresponding to one of verified words contained in said alternate-word-element when said semantics analyzer determines that one of said verified words is a correct word, said alternate-word-element remaining when none of said verified words is determined to be a correct word by said semantics analyzer.

12. The method of claim 11, wherein for each verified-word-element contained in each alternate-word-element, said semantics analyzer determines and identifies a confidence level of each verified word contained in each verified-word-element, and indicates the confidence level for each verified-word-element when none of the verified words in an alternate-word-element are determined to be a correct word by said semantics analyzer.

13. The method of claim 10, wherein for each questionable character element, said information pertaining to a character not recognized with at least said predetermined level of confidence includes a most likely uncertain character identified by said character recognizer.

14. The method of claim 13, wherein for each questionable-character-element, said information pertaining to a character not recognized with at least said predetermined level of confidence also includes a degree of confidence determined by said character recognizer for said most likely uncertain character.

15. The method of claim 10, wherein for each questionable-character-element, said information pertaining to a character not recognized with at least said predetermined level of confidence includes alternate possible uncertain characters identified by said character recognizer.

16. The method of claim 15, wherein for each questionable-character-element, said information pertaining to a character not recognized with at least said predetermined level of confidence also includes a degree of confidence determined by said character recognizer for each said alternate possible uncertain character.

17. An automatic document recognition apparatus for transforming text documents represented as bitmap image data into an editable coded data stream defined using a machine readable document description language that records coded data resulting from the document transformation process and information regarding uncertainties in the document transformation process, said apparatus comprising:

a character recognizer having:

a) first transformation means for performing a character recognition operation on said bitmap image representation of said document to transform said document into coded character data recognized with a level of confidence; and

b) first identification means for outputting and recording said coded character data into one or more elements that are defined using said document description language, said first identification means further defining a machine readable element type identifier for each element, said element type identifier indicating the type of said coded character data regarding the recognized bitmap image recorded in said element so that the type of coded character data contained in each element can be known without examining the coded character data contained in each element, said identification means selectively identifying said one or more elements by defining the element type identifier for each element based on the type of coded data recorded in the element and the level of confidence with which said bitmap image was recognized, each element having coded character data of a single type recorded therein, and, when said first transformation means determines that the coded character data contained in the element has not been transformed with a predetermined level of confidence, said identification means also recording in said element uncertainty information determined by said first transformation means regarding said coded character data contained in said element;

wherein said element type identifier is a character-string-element or a questionable-character-element, each character-string-element containing a string of consecutive characters recognized by said character recognition apparatus with at least said predetermined level of confidence, each questionable-character-element containing said uncertainty information determined by said character recognition apparatus for a character which was not recognized with at least said predetermined level of confidence by..said character recognition apparatus.

18. The apparatus of claim 17, wherein said uncertainty information includes a confidence level with which said first transformation means determined said coded data.

19. The apparatus of claim 17, further comprising:

a word recognizer having:

i) second transformation means for transforming each questionable-character-element and adjacent confidently recognized characters in a same word as the question-able-character-element into one or more verified words by substituting alternate characters for the questionable-character-element and verifying that a word resulting from said substituting exists in a lexicon; and

ii) second identification means using said document description language for placing each verified word in an element having an element type identifier defined as a verified-word-element, and when more than one verified-word-elements are created for a questionable-character-element, placing said more than one verified-word-elements in an element having an element type identifier defined as an alternate-word-element, said questionable-character-element remaining when no verified words are determined to exist.

20. The apparatus of claim 19, further comprising:

a semantics analyzer including:

1) means for determining which verified word within an alternate-word-element is a correct verified word based on words surrounding the alternate-word-element; and

2) third identification means for identifying said correct verified word and for replacing said alternate-word-element with an element identified as a character-string-element containing said correct verified word.
 Description Submit all comments and votes
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to document recognition, and in particular to methods and apparatus for recognizing textual and graphics structures in documents originally represented as bitmap images, and for recording the results of the recognition process.

2. Description of Related Art

Document recognition is the automatic transformation of paper documents into editable electronic documents. It entails the gradual transformation of bitmaps into structured components, through successive and recursive interventions of various processes. These processes include: page segmentation, character recognition, graphics recognition, logical structure reconstruction, spelling correction, semantic analysis, etc. All these processes are prone to misinterpretation. Not all processes keep a record of the misinterpretations they are aware of, and the ones that do keep a record have no standard way of doing so. As a consequence, downstream processes are generally not prepared to handle the record of ambiguities handed to them by upstream processes, and simply discard them. Valuable information is lost instead of being exploited for automatic improvement of the document recognition function. If, on the other hand, the ambiguity record is passed in its raw state to the user, the chore of making manual corrections can quickly outweigh the advantages of automatic recognition over a manual reconstruction of the entire document.

U.S. Pat. Nos. 4,914,709 and 4,974,260 to Rudak disclose an apparatus and method for identifying and correcting characters which cannot be machine read. A bitmap video image of the unrecognized character(s) is inserted in an ASCII data line of neighboring characters, thereby allowing an operator to view the character(s) in question in context to aid in proper identification of the character(s). Subsequently, with the aid of the video image, the operator enters the correct character(s) via a keyboard or other means. This apparatus and method require operator interaction to clarify any ambiguities resulting from an automatic document recognition process. The results of these ambiguities are not recorded in a notation that can be used by other downstream automatic devices.

U.S. Pat. No. 4,907,285 to Nakano et al discloses an image recognition system which uses a grammar for describing a document image, and parses statements expressed by the grammar to recognize the structure of an unknown input image. The grammar describes the image as substructures and the relative relation between them. In the parsing process, after the substructures and their relative relation are identified, a search is made as to whether the substructures and their relative relation exist in the unknown input image, and if they do, the inside of the substructures are further resolved to continue the analysis. If the substructures do not exist, other possibilities are searched and the structure of the unknown input image is thus represented from the result of the search. For example, the location of a rectangular region of the document which contains a statement defined by the document grammar (for example "TITLE" and "AUTHOR") is initially represented by variables. See FIG. 10 of U.S. Pat. No. 4,907,285. After locating this region in the document, the appropriate numeric values are substituted for the variables.

U.S. Pat. No. 4,949,188 to Sato discloses an image processing apparatus for synthesizing a character or graphic pattern represented by a page description language and an original image. The image processing apparatus generates a page description language including code data which represents characters, graphics patterns, and the like, and command data which causes a printer to print the original image. Ambiguities from previous document recognition processes are not recorded in the page description language. See, for example, the table in column 4, lines 5-10. Accordingly, any downstream device receiving the page description language cannot determine whether any ambiguities occurred in the previously performed document recognition processes.

U.S. Pat. No. 4,654,875 to Srihari et al discloses a method of automatic language recognition for optical character readers. Language in the form of input strings or structures is analyzed on the basis of: channel characteristics in the form of probabilities that a letter in the input is a corruption of another letter; the probabilities of the letter occurring serially with other recognized letters that precede the letter being analyzed or particular strings of letters occurring serially; and lexical information in the form of acceptable words represented as a graph structure. Ambiguities from upstream recognition processes are not recorded.

"Word Association Norms, Mutual Information, and Lexicography", by Kenneth W. Church and Patrick Hanks, Computational Linguistics, Vol. 16, No. 1 (March 1990) discloses a measure, referred to as an "association ratio" based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. This association ratio can be used by a semantics analyzer to determine a most likely word from a choice of two or more words that have been identified as possible words.

"On the Recognition of Printed Characters of Any Font and Size", by Simon Kahan, Theo Pavlidis and Henry S. Baird, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAM1-9, No. 2 (March 1987), discloses a system that recognizes printed text of various fonts and sizes for the Roman alphabet. Thinning and shape extraction are performed directly on a graph of the run-length encoding of the binary image. The resulting strokes and other shapes are mapped, using a shape-clustering approach, into binary features which are then fed into a statistical Bayesian classifier. This system identifies multiple possible characters or words, and scores them. However, the uncertainty in the recognition processes is not recorded using the standard notation of the present invention.

In summary, a number of systems exist which can recognize graphics structures, text (characters, words, semantics, fonts) and logical structures (pages, paragraphs, footnotes), and which can determine the uncertainty with which the recognized feature was recognized. Accordingly, the above-identified patents and papers are incorporated herein by reference. However, none of these systems record the results of the recognition process (including uncertainties)in a manner which can be used by other devices. This results in much information (particularly regarding uncertainty) being lost, especially when different recognition systems (e.g., character recognizers, word recognizers, semantics analyzers) are used at different times (as opposed to being integrated into one system).

OBJECTS AND SUMMARY OF THE INVENTION

It is an object of the present invention to provide methods and apparatus for recording ambiguities in document recognition processes in a standard format which can be used by a variety of document recognizers.

It is another object of the present invention to provide methods and apparatus for converting bitmap images into editable coded data, wherein information regarding ambiguities in the transformation processes performed by upstream recognizers can be recorded and thus used by downstream, higher level recognizers which attempt to resolve these ambiguities.

To achieve the foregoing and other objects, and to overcome the shortcomings discussed above, methods and apparatus are provided for converting documents represented as bitmap image data into editable coded data, wherein a standard notation in a document description language is utilized for recording document recognition ambiguities by each document recognizer. When the results of document recognition processes are recorded using this standard notation, any ambiguities are identified in a uniform manner so that downstream, higher level document recognition processes can attempt to resolve these ambiguities by using all information about the ambiguities obtained by upstream document recognition processes.

In particular, when using the standard notation of the present invention, each document recognizer can record the results of its recognition process in one or more elements, selectively identified using the document description language. Each element includes a type-identifier indicating a type of coded data (information) regarding the recognized (transformed) bitmap image contained therein. Each element also includes editable coded data therein of the type identified by the type-identifier, and can also include uncertainty information identifying any coded data which was not transformed with a predetermined level of confidence. This uncertainty information is determined by the document recognizer, and is recorded in a format that is readable by higher level, downstream document recognizers. This uncertainty information can include the level of confidence with which the uncertain coded data was recognized by the document recognizer, in order to further assist the higher level document recognizers in resolving ambiguities. The uncertainty information can also include alternative coded data for each uncertain recognition.

When the document recognizer is a character recognizer, any characters which are not recognized with a predetermined level of confidence are identified and recorded by placing them in questionable-character-elements. The degree of certainty as well as alternative possible characters and their degree of certainty can also be recorded for each questionable character. Characters which were recognized with at least the predetermined level of confidence are placed in character-string-elements.

When the document recognizer includes a word recognizer (such as, for example, a spelling checker), the word recognizer attempts to resolve any existing questionable characters by determining whether any words exist in a lexicon based upon each questionable character and the certain characters in the word containing each questionable character. If a word is identified in the lexicon for the word containing a questionable character, that word is identified as a verified word, and is recorded in a verified-word-element. If more than one verified words are found, they are placed in individual verified-word-elements which are collectively grouped together in an alternate-word-element. If no verified words are found for the word containing a questionable character, the questionable-character-element remains.

When the document recognizer includes a semantics analyzer, any identified alternate verified words are resolved by analyzing the words surrounding the alternate verified words. If one of the alternate verified words can be confirmed with a predetermined level of confidence based on the semantics analysis, it is returned and merged with the surrounding character-string-elements. If the semantics analyzer cannot determine which of the alternate verified words is correct, it returns the alternate-word-element (and included verified-word-elements) as such, and can include data indicative of the probability that each verified word therein is the correct word.

When the document recognizer includes a graphics structure image recognizer, it outputs graphics-elements containing coded data representative of graphics structures recognized in the graphics image. These structures can include: lines defined between endpoints; circles; arcs; etc. Additionally, line thickness information can also be returned and recorded. Ambiguities in the recognition process such as x and y direction offsets and line thickness variations can also be recorded. This data can be used by downstream higher level graphics recognition processes to resolve any ambiguities, or to recognize more complex graphics structures. For example, four lines recognized by a low level graphics recognizer could be determined to be a box by a higher level graphics recognizer if, for example, the endpoints can be determined with a high degree of certainty to be coincident.

Additional image recognition elements are produced for recording information relating to larger portions (or subimages) of the document image. For example, data related to font text blocks, frames, pages, documents, and large and small pieces of unresolved bitmap images can also be recorded.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be described in detail with reference to the following drawings in which like reference numerals refer to like elements, and wherein:

FIG. 1 is a sample page image used to illustrate the present invention;

FIG. 2 illustrates a character-string-element for collecting streams of characters recognized with or above a predetermined confidence level;

FIG. 3 illustrates a questionable-character-element for collecting questionable characters recognized with a low confidence level;

FIG. 4 illustrates a questionable-word-element for collecting a questionable word which contains characters recognized with high confidence, but which was not found in a lexicon;

FIG. 5 illustrates verified-word-elements for collecting verified words found in a lexicon by resolving a word containing one or more questionable characters, and an alternate-word-element for collecting alternate words when two or more verified words are found for one word containing questionable characters;

FIG. 6 illustrates a text-element for collecting text elements having the same font;

FIG. 7 illustrates a fontDef-element for collecting data relating to a font type;

FIG. 8 illustrates one type of graphics-element which is a segment-element for collecting data relating to a line segment;

FIG. 9 illustrates another type of graphics-element which is an arc-element for collecting data relating to an arc;

FIG. 10 illustrates another type of graphics-element which is an image-element for collecting data relating to a large unresolved bitmap image;

FIG. 11 illustrates another type of graphics-element which is a spot-element for collecting information relating to a small unresolved bitmap image referred to as a spot, and for storing this information as a hexidecimal value;

FIG. 12 illustrates examples of elements referring to other elements;

FIG. 13 illustrates a tBlock-element for collecting information relating to blocks of text;

FIG. 14 illustrates a frame-element for collecting information relating to frames which can include text blocks, images, spots, arcs and segments, as well as other frames;

FIG. 15 illustrates a page-element for collecting data relating to a page;

FIG. 16 illustrates a group-element for collecting information relating to a group of elements which extend across page boundaries;

FIG. 17 illustrates a drStream-element for collecting data relating to an entire document;

FIGS. 18A-C are a collection of all syntax necessary for describing a document;

FIG. 19 is a block diagram of a system for inputting and converting a bitmap image into coded data streams using the present invention;

FIG. 20 is a flowchart illustrating a procedure performed by the system of FIG. 19 when using the present invention; and

FIG. 21 is a flowchart illustrating a procedure performed by the word recognizer of FIG. 19 when using the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention utilizes a straight forward procedure for recording ambiguities through the successive stages of the document recognition process. These ambiguities are in the context of:

characters processed by character recognizers;

words processed by character recognizers, spelling checkers, and semantics analyzers;

text flow processed by logical structure reconstructers; and

geometry of line segments and arcs processed by graphics recognizers.

Each of these processes produce and/or consume a byte-oriented data stream (hereinafter referred to as a document recognition stream or DRstream), and bitmap streams (hereinafter referred to as image files), referenced by the DRstream. The DRstream carries information about one or several pages of a digitized document. The information describes text with font, certain graphics primitives, and half tone images, as well as their relationships, and the ambiguities about them.

The present invention does not provide any new document recognition processes (or document recognizers) in the sense that it can be used with existing recognizers which recognize, for example, characters or graphics structures, or determine words (by comparing sequences of characters against a lexicon of known words), or determine which word from a choice of possible words is correct. However, the present invention improves the efficiency and compatibility with which these different types of recognizers function by providing a standard notation for recording the results obtained by the recognizers in a document description language.

FIGS. 2-18C illustrate this document recognition notation in ISO 8879 Standard Generalized Mark-up Language (SGML), according to the Document Type Definition discussed below. According to the present invention, each recognizer records coded data, corresponding to the results of the recognition process which it performs, as coded information, referred to in SGML as elements. Each element contains coded data which has been recognized as being similar in some way (for example: text, graphics, same page, all certain characters, etc.). Each element includes: a) a type-identifier which indicates the type of coded data contained in that element; b) an optional identification number, unique amongst all similar type elements of a document, which distinguishes that element from other similar type elements so that an element can be referenced by other elements (most elements will have an identification number); c) coded data obtained by the document recognition process (this could be strings of characters or parameters defining graphics structures); and d) optional contents (referred to as attributes) for providing additional information (for example, uncertainty information) about the coded data included in that element. Although the attributes of an element can be used to record uncertainty information about coded data in an element (information such as, for example, levels of confidence with which the coded data was recognized, or possible offsets for parameters (e.g. endpoints defining a line segment) of a graphics structure), the type-identification in some cases also serves to convey uncertainty information by indicating that the contents of that element was determined with a level of confidence below a predetermined level of confidence. In the illustrated examples, the coded data is recorded as human readable ASCII, however other codes could also be used.

One familiar with SGML will understand the generic contents of the elements to be described below. Thus, only a brief discussion of a generic element will be provided with reference to FIGS. 18A-C. Then, each type of element will be specifically described with reference to FIGS. 2-17. FIGS. 18A-C illustrate a complete syntax of elements which can be used to describe a document according to the present invention. This list of elements would be located at the start of each DRstream, and would be used by conventional parsers, programmed to parse streams written in SGML, to parse the DRstream contained therebelow. That is, after the syntax list of elements, a continuous stream of elements describing a specific document would be provided. As used herein, the terminology continuous stream of elements' refers to a group of elements which are identified as belonging together. Thus, in a markup language such as SGML, where white-space is permitted (and, in fact, encouraged for readability), tabs, breakage into separate lines constitute white-space that the parser will ignore. In this sense, white-space is part of the continuous stream of elements. Other systems may have a limit on the size of character streams. In these systems, long DRstreams would be split across several files which would be identified as belonging together. Such a DRstream, where several files are identified as belonging together, is also intended to be covered by the terminology "continuous stream of elements". (Some of the elements in FIGS. 18A-C include attributes (to be described below) which also would be listed at the start of the DRstream.) Of course, all of the elements listed in FIGS. 18A-C are not required to record the results of a d