WikiPatents - Community Patent Review
Method and apparatus for summarizing a document without document image decoding    
United States Patent 5491760
Link to this page: http://www.wikipatents.com/5491760.html
Inventor(s): Withgott; M. Margaret (Los Altos, CA); Bagley; Steven C. (Palo Alto, CA); Bloomberg; Dan S. (Palo Alto, CA); Halvorsen; Per-Kristian (Los Altos, CA); Huttenlocher; Daniel P. (Ithaca, NY); Cass; Todd A. (Cambridge, MA); Kaplan; Ronald M. (Palo Alto, CA); Rao; Ramana R. (San Francisco, CA)
Abstract: A method and apparatus for excerpting and summarizing an undecoded document image, without first converting the document image to optical character codes such as ASCII text, identifies significant words, phrases and graphics in the document image using automatic or interactive morphological image recognition techniques. Document summaries or indices are produced based on the identified significant portions of the document image. The disclosed method is particularly suited to improving reading machines for the blind.
   














Owner/Assignee: Xerox Corporation (Stamford, CT)
Publication Date: February 13, 1996
Application Number: 08/240,284
Filing Date: May 9, 1994
US Classification: 382/203 382/177 382/229
Int'l Classification: G06K 009/46
Examiner: Boudreau; Leo
Assistant Examiner: Tran; Phuoc
Attorney/Law Firm: Oliff & Berridge
Parent Case: This is a continuation of application Ser. No. 07/794,543 filed Nov. 19, 1991, now abandoned.
USPTO Field of Search: 382/9 382/55 382/1 382/28 382/30 382/25 382/40 382/177 382/190 382/114 382/198 382/199 382/200 382/203 382/209 382/206 382/229 382/257 382/308 364/419.03 364/419.19
Patent Tags: summarizing document without document image decoding
   

U.S. References

Patent No.   Inventor       US Class       Date
5384863      Huttenlocher   382/173        Jan 1995
5325444      Cass           382/177        Jun 1994
5216725      McCubbrey      382/102        Jun 1993
5202933      Bloomberg      382/176        Apr 1993
5181255      Bloomberg      382/176        Jan 1993
5131049      Bloomberg      382/257        Jul 1992
5077668      Doi            (not listed)   Dec 1991
5048109      Bloomberg      382/164        Sep 1991
4994987      Baldwin        434/305        Feb 1991
4972349      Kleinberger    707/1          Nov 1990
4752772      Litt           345/160        Jun 1988
4685135      Lin            704/260        Aug 1987
4654873      Fujisawa       382/178        Mar 1987
3659354      Sutherland     434/113        May 1972
Claims
 


We claim:

1. A method for electronically processing an electronic document image without first decoding the electronic document image, comprising:

segmenting the document image into word image units without decoding the document image;

deriving a word shape representation for each of a plurality of said word image units without decoding any characters making up the plurality of word image units, thereby deriving a plurality of said word shape representations;

comparing said word shape representations to at least one other word shape representation to identify significant word image units from amongst said plurality of word image units; and

creating an abbreviated document image that is smaller than the electronic document image based on said identified significant word image units, said abbreviated document image including a plurality of said identified significant word image units.

2. The method of claim 1 wherein said step of comparing includes classifying said word image units according to frequency of occurrence based on comparing said word shape representations with each other.

3. The method of claim 1 wherein said step of comparing includes classifying said word image units according to location within the document image.

4. The method of claim 1 wherein said step of deriving a word shape representation includes utilization of at least one of an image unit shape dimension, font, typeface, number of ascender elements, number of descender elements, pixel density, pixel cross-sectional characteristic, the location of word image units with respect to neighboring word image units, vertical position, horizontal interimage unit spacing, and contour characteristic of said word image units.

5. The method of claim 1, wherein said comparing step includes comparing said word shape representations with each other.

6. The method of claim 1, wherein said comparing step includes comparing said word shape representations with at least one predetermined word shape representation.

7. The method of claim 1, wherein said comparing step includes comparing said word shape representations with at least one user-selected word shape representation.

8. A method of excerpting significant information from an undecoded document image without decoding the document image, comprising:

segmenting the document image into word image units without decoding the document image;

deriving a word shape representation for each of a plurality of said word image units without decoding any characters making up said plurality of word image units, thereby deriving a plurality of said word shape representations;

comparing said word shape representations to at least one other word shape representation to identify significant word image units from amongst said word image units; and

outputting a plurality of said identified significant word image units for further processing.

9. The method of claim 8 wherein said step of outputting a plurality of identified significant image units comprises generating a document index based on said significant identified word image units.

10. The method of claim 8 wherein said step of outputting a plurality of identified significant image units comprises producing a speech synthesized output corresponding to said identified significant word image units.

11. The method of claim 8 wherein said step of outputting a plurality of identified significant word image units comprises producing said identified significant word image units in printed Braille format.

12. The method of claim 8 wherein said step of outputting a plurality of identified significant word image units comprises generating a document summary from said identified significant word image units.

13. A method for electronically processing an undecoded document image containing word text, comprising:

segmenting the document image into word image units without decoding the document image;

deriving a word shape representation for each of a plurality of said word image units without decoding any characters making up said plurality of word image units, thereby deriving a plurality of said word shape representations;

comparing said word shape representations to at least one other word shape representation to identify significant word image units from amongst said plurality of word image units;

forming phrase image units based on a plurality of said identified significant word image units, said phrase image units each incorporating one of said identified significant word image units and adjacent word image units linked in reading order sequence; and

outputting said phrase image units.

14. An apparatus for automatically summarizing the information content of an undecoded document image without decoding the document image, comprising:

means for segmenting the document image into word image units without decoding the document image;

means for deriving a word shape representation for each of a plurality of said word image units without decoding any characters making up said plurality of word image units, thereby deriving a plurality of said word shape representations;

means for comparing said word shape representations to at least one other word shape representation to identify significant word image units from amongst said plurality of word image units; and

means for creating a supplemental document image based on said identified significant word image units.

15. The apparatus of claim 14 wherein said means for segmenting the document image, said means for deriving a word shape representation, said means for comparing, and said means for creating a supplemental document image comprise a programmed digital computer.

16. The apparatus of claim 15 further comprising scanning means for scanning an original document to produce said document image, said scanning means being incorporated in a document copier machine which produces printed document copies; and means for controlling said document copier machine to produce a printed document copy of said supplemental document image.

17. The apparatus of claim 15 further comprising scanning means for scanning an original document to produce said document image, said scanning means being incorporated in a reading machine for the blind having means for communicating data to the user; and means for controlling said reading machine communication means to communicate the contents of said supplemental document image.

18. The apparatus of claim 17 wherein said communicating means comprises a printer for producing document copies in Braille format.

19. The apparatus of claim 17 wherein said communicating means comprises a speech synthesizer for producing synthesized speech output corresponding to said supplemental document image.

20. The apparatus of claim 17 wherein said reading machine includes operator responsive means for accessing the scanned document or a selected portion thereof corresponding to a supplemental document image following communication of the supplemental document image to the user.
Description
 


BACKGROUND OF THE INVENTION

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office records, but otherwise reserves all copyright rights whatsoever.

1. Cross-References to Related Applications

The following concurrently filed and related U.S. applications are hereby cross referenced and incorporated by reference in their entirety.

"Method for Determining Boundaries of Words in Text" to Huttenlocher et al., U.S. patent application Ser. No. 07/794,392.

"Detecting Function Words Without Converting a Document to Character Codes" to Bloomberg et al., U.S. patent application Ser. No. 07/794,190.

"A Method of Deriving Wordshapes for Subsequent Comparison" to Huttenlocher et al., U.S. patent application Ser. No. 07/794,391.

"Method and Apparatus for Determining the Frequency of Words in a Document Without Document Image Decoding" to Cass et al., U.S. patent application Ser. No. 07/795,173.

"Optical Word Recognition by Examination of Word Shape" to Huttenlocher et al., U.S. patent application Ser. No. 07/796,119, Published European Application No. 0543592, published May 26, 1993.

"A Method and Apparatus for Automatic Modification of Selected Semantically Significant Image Segments Within a Document Without Document Image Decoding" to Huttenlocher et al., U.S. patent application Ser. No. 07/795,174.

"Method for Comparing Word Shapes" to Huttenlocher et al., U.S. patent application Ser. No. 07/795,169.

"Method and Apparatus for Determining the Frequency of Phrase in a Document Without Document Image Decoding" to Withgott et al., U.S. patent application Ser. No. 07/794,555 now U.S. Pat. No. 5,369,714.

2. Field of the Invention

This invention relates to improvements in methods and apparatuses for automatic document processing, and more particularly to improvements in methods and apparatuses for recognizing semantically significant words, characters, images, or image segments in a document image without first decoding the document image and automatically creating a summary version of the document contents.

3. Background

It has long been the goal in computer based electronic document processing to be able, easily and reliably, to identify, access and extract information contained in electronically encoded data representing documents; and to summarize and characterize the information contained in a document or corpus of documents which has been electronically stored. For example, to facilitate review and evaluation of the information content of a document or corpus of documents to determine the relevance of same for a particular user's needs, it is desirable to be able to identify the semantically most significant portions of a document, in terms of the information they contain; and to be able to present those portions in a manner which facilitates the user's recognition and appreciation of the document contents. However, the problem of identifying the significant portions within a document is particularly difficult when dealing with images of the documents (bitmap image data), rather than with code representations thereof (e.g., coded representations of text such as ASCII). As opposed to ASCII text files, which permit users to perform operations such as Boolean algebraic key word searches in order to locate text of interest, electronic documents which have been produced by scanning an original without decoding to produce document images are difficult to evaluate without exhaustive viewing of each document image, or without hand-crafting a summary of the document for search purposes. Of course, document viewing or creation of a document summary requires extensive human effort.

On the other hand, current image recognition methods, particularly involving textual material, generally involve dividing an image segment to be analyzed into individual characters which are then deciphered or decoded and matched to characters in a character library. One general class of such methods includes optical character recognition (OCR) techniques. Typically, OCR techniques enable a word to be recognized only after each of the individual characters of the word has been decoded, and a corresponding word image retrieved from a library.

Moreover, optical character recognition decoding operations generally require extensive computational effort, generally have a non-trivial degree of recognition error, and often require significant amounts of time for image processing, especially with regard to word recognition. Each bitmap of a character must be distinguished from its neighbors, its appearance analyzed, and identified in a decision making process as a distinct character in a predetermined set of characters. Further, the image quality of the original document and noise inherent in the generation of a scanned image contribute to uncertainty regarding the actual appearance of the bitmap for a character. Most character identifying processes assume that a character is an independent set of connected pixels. When this assumption fails due to the quality of the image, identification also fails.

4. References

European patent application number 0-361-464 by Doi describes a method and apparatus for producing an abstract of a document with correct meaning precisely indicative of the content of the document. The method includes listing hint words which are preselected words indicative of the presence of significant phrases that can reflect content of the document, searching all the hint words in the document, extracting sentences of the document in which any one of the listed hint words is found by the search, and producing an abstract of the document by juxtaposing the extracted sentences. Where the number of hint words produces a lengthy excerpt, a morphological language analysis of the abstracted sentences is performed to delete unnecessary phrases and focus on the phrases using the hint words as the right part of speech according to a dictionary containing the hint words.

"A Business Intelligence System" by Luhn, IBM Journal, October 1958, describes a system which, in part, auto-abstracts a document by ascertaining the most frequently occurring words (significant words) and analyzing all sentences in the text containing such words. A relative value of the sentence significance is then established by a formula which reflects the number of significant words contained in a sentence and the proximity of these words to each other within the sentence. Several sentences which rank highest in value of significance are then extracted from the text to constitute the auto-abstract.
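The sentence scoring described for the Luhn system can be illustrated with a minimal Python sketch. The text above does not give the formula, so the squared-count-over-span score used here (the form commonly attributed to Luhn's auto-abstracting work) is an assumption, as are the function names `luhn_sentence_score` and `auto_abstract` and the `max_gap` parameter. Note that this prior-art approach operates on decoded text, unlike the invention.

```python
from typing import List, Set


def luhn_sentence_score(words: List[str], significant: Set[str],
                        max_gap: int = 4) -> float:
    """Score a sentence by its best cluster of significant words.

    Assumed formula: (significant words in cluster)^2 / cluster length,
    where a cluster is a run of significant words separated by at most
    max_gap non-significant words.
    """
    positions = [i for i, w in enumerate(words) if w.lower() in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1  # close enough: extend the current cluster
        else:
            best = max(best, count * count / (prev - start + 1))
            start, count = pos, 1  # too far apart: start a new cluster
        prev = pos
    return max(best, count * count / (prev - start + 1))


def auto_abstract(sentences: List[str], significant: Set[str],
                  top_n: int = 2) -> List[str]:
    """Return the top_n highest-scoring sentences in document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: luhn_sentence_score(sentences[i].split(), significant),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:top_n])]
```

Squaring the count rewards sentences that pack several significant words close together, which is one way to reflect both the number and the proximity of significant words.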

SUMMARY OF THE INVENTION

Accordingly, it is an object of the invention to provide a method and apparatus for automatically excerpting and summarizing a document image without decoding or otherwise understanding the contents thereof.

It is another object of the invention to provide a method and apparatus for automatically generating ancillary document images reflective of the contents of an entire primary document image.

It is another object of the invention to provide a method and apparatus of the type described for automatically extracting summaries of material and providing links from the summary back to the original document.

It is another object of the invention to provide a method and apparatus of the type described for producing Braille document summaries or speech synthesized summaries of a document.

It is another object of the invention to provide a method and apparatus of the type described which is useful for enabling document browsing through the development of image gists, or for document categorization through the use of lexical gists.

It is another object of the invention to provide a method and apparatus of the type described that does not depend upon statistical properties of large, pre-analyzed document corpora.

The invention provides a method and apparatus for segmenting an undecoded document image into undecoded image units, identifying semantically significant image units based on an evaluation of predetermined image characteristics of the image units, without decoding the document image or reference to decoded image data, and utilizing the identified significant image units to create an ancillary document image of abbreviated information content which is reflective of the subject matter content of the original document image. In accordance with one aspect of the invention, the ancillary document image is a condensation or summarization of the original document image which facilitates browsing. In accordance with another aspect of the invention, the identified significant image units are presented as an index of key words, which may be in decoded form, to permit document categorization.

Thus, in accordance with one aspect of the invention, a method is presented for excerpting information from a document image containing word image units. According to the invention, the document image is segmented into word image units (word units), and the word units are evaluated in accordance with morphological image properties of the word units, such as word shape. Significant word units are then identified, in accordance with one or more predetermined or user selected significance criteria, and the identified significant word units are outputted.
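As a rough illustration of identifying significant word units from shape alone, the following Python sketch groups word units with similar shape descriptors and keeps those whose group recurs. Everything here is an assumption for illustration: the feature tuple (for example width, ascender count, descender count), the per-component tolerance `tol`, the greedy grouping, and the frequency threshold `min_count` are hypothetical stand-ins for the word shape representations and significance criteria of the invention.

```python
from typing import List, Sequence, Tuple

# Hypothetical shape descriptor, e.g. (width, ascenders, descenders).
Shape = Tuple[float, ...]


def group_by_shape(shapes: Sequence[Shape], tol: float = 1.0) -> List[List[int]]:
    """Greedily group word units whose descriptors match within tol.

    A unit joins the first class whose representative (its first member)
    differs by at most tol in every component; otherwise it starts a class.
    """
    classes: List[List[int]] = []
    for idx, shape in enumerate(shapes):
        for members in classes:
            rep = shapes[members[0]]
            if len(rep) == len(shape) and all(
                abs(a - b) <= tol for a, b in zip(rep, shape)
            ):
                members.append(idx)
                break
        else:
            classes.append([idx])
    return classes


def significant_units(shapes: Sequence[Shape], min_count: int = 2,
                      tol: float = 1.0) -> List[int]:
    """Return indices of word units whose shape class occurs >= min_count times."""
    picked: List[int] = []
    for members in group_by_shape(shapes, tol):
        if len(members) >= min_count:
            picked.extend(members)
    return sorted(picked)
```

No character is decoded at any point: the units are compared only through their descriptors. Frequency alone would also select common function words, so a practical criterion would likely combine it with other predetermined or user selected tests.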

In accordance with another aspect of the invention, an apparatus is provided for excerpting information from a document containing a word unit text. The apparatus includes an input means for inputting the document and producing a document image electronic representation of the document, and a data processing system for performing data driven processing and which comprises execution processing means for performing functions by executing program instructions in a predetermined manner contained in a memory means. The program instructions operate the execution processing means to identify significant word units in accordance with predetermined significance criteria from morphological properties of the word units, and to output selected ones of the identified significant word units. The output of the selected significant word units can be to an electrostatographic reproduction machine, a speech synthesizer means, a Braille printer, a bitmap display, or other appropriate output means.

These and other objects, features and advantages of the invention will be apparent to those skilled in the art from the following detailed description of the invention, when read in conjunction with the accompanying drawings and appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

A preferred embodiment of the invention is illustrated in the accompanying drawings, in which:

FIG. 1 is a flow chart of a method of the invention;

FIG. 2 is a block diagram of an apparatus according to the invention for carrying out the method of FIG. 1;

FIG. 3 is a flow chart of a preferred embodiment of a method according to the invention for detecting function words in a scanned document image without first converting the document image to character codes;

FIGS. 4A-4F show three sets of character ascender structuring elements where: FIGS. 4A-4B show a set of character ascender structuring elements of height 3 and length 5, where the solid dots are ON pixels along the bottom row and along one side column and there are one or more OFF pixels in a remaining location preferably separated from the ON pixels; FIGS. 4C-4D show a set of character ascender structuring elements of height 4 and length 5; and FIGS. 4E-4F show a set of character ascender structuring elements of height 5 and length 5.

FIGS. 5A-5F show three sets of character descender structuring elements where: FIGS. 5A-5B show a set of character descender structuring elements of height 3 and length 5; FIGS. 5C-5D show a set of character descender structuring elements of height 4 and length 5; and FIGS. 5E-5F show a set of character descender structuring elements of height 5 and length 5;

FIG. 6 shows a horizontal structuring element of length 5;

FIG. 7 shows a block system diagram of the arrangement of system components forming a word shape recognition system;

FIG. 8 shows a block system diagram for identifying equivalence classes of image units;

FIG. 9 shows a block system diagram for identifying significant image units;

FIG. 10 shows an image sample of example text over which the inventive process will be demonstrated;

FIG. 11 is a copy of a scanned image of the example text;

FIGS. 12A, 12B and 12C graphically illustrate the process used to determine the angle at which the example text is oriented in the image sample prior to further processing, while FIG. 12D shows graphs of the responses taken from the example text, which are used to determine the angle at which the example text is oriented in the image sample prior to further processing;

FIGS. 13A and 13B respectively show the derivation and use of a graph examining the sample image of the example text to determine baselines of text within the image;

FIGS. 14A and 14B are flowcharts illustrating the procedures executed to determine the baselines shown in FIG. 13A;

FIG. 15 shows the scanned image of the example text with baselines indicated thereon after derivation from the data shown in FIGS. 13A and 13B;

FIG. 16 is a flowchart illustrating the steps used in the application of a median filter to the image of FIG. 10;

FIG. 17 is an enlarged pictorial representation of a portion of the image of FIG. 10, illustrating the application of the median filter;

FIG. 18 demonstrates the resulting image after application of a median filter, a process known herein as blobifying, to the scanned image of the example text, which tends to render character strings as a single set of connected pixels;

FIG. 19 shows a subsequent step in the process, in which lines of white pixels are added to the blurred image to clearly delineate a line of character strings from adjacent lines of character strings;

FIG. 20 is a flowchart illustrating the steps required to add the white lines of FIG. 19;

FIGS. 21A and 21B are flowcharts representing the procedure which is followed to segment the image data in accordance with the blurred image of FIG. 18;

FIG. 22 shows the sample text with bounding boxes placed around each word group in a manner which uniquely identifies a subset of image pixels containing each character string;

FIGS. 23A and 23B illustrate derivation of a single independent value signal, using the example word "from", which appears in the sample image of example text;

FIG. 24 illustrates the resulting contours formed by the derivation process illustrated in FIGS. 23A and 23B;

FIG. 25 illustrates the steps associated with deriving the word shape signals;

FIGS. 26A, 26B, 26C and 26D illustrate derivation of a single independent value signal, using the example word "from";

FIGS. 27A, 27B, 27C and 27D illustrate derivation of a single independent value signal, using the example word "red", which does not appear in the sample image of example text;

FIG. 28 shows a simple comparison of the signals derived for the words "red" and "from" using a signal normalization method;

FIGS. 29A, 29B, and 29C illustrate the details of the discrepancy in font height, and the method for normalization of such discrepancies;

FIG. 30 is a flowchart detailing the steps used for one method of determining the relative difference between word shape contours;

FIG. 31 is a flowchart detailing the steps of a second method for determining the relative difference between word shape contours;

FIGS. 32A and 32B are respective illustrations of the relationship between the relative difference values calculated and stored in an array, for both a non-slope-constrained and a slope-constrained comparison; and

FIG. 33 is a block diagram of a preferred embodiment of an apparatus according to the invention for detecting function words in a scanned document image without first converting the document image to character codes.

The Appendix contains source code listings for a series of image manipulation and signal processing routines which have been implemented to demonstrate the functionality of the present invention. Included in the Appendix are four sections which are organized as follows:

Section A, beginning at page 1, comprises the declarative or "include" files which are commonly shared among the functional code modules;

Section B, beginning at page 26, includes the listings for a series of library type functions used for management of the images, error reporting, argument parsing, etc.;

Section C, beginning at page 42, comprises numerous variations of the word shape comparison code, and further includes code illustrating comparison techniques other than those specifically cited in the following description;

Section D, beginning at page 145, comprises various functions for the word shape extraction operations that are further described in the following description.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

In contrast to prior techniques, such as those described above, the invention is based upon the recognition that scanned image files and character code files exhibit important differences for image processing, especially in data retrieval. The method of a preferred embodiment of the invention capitalizes on the visual properties of text contained in paper documents, such as the presence or frequency of linguistic terms (such as words of importance like "important", "significant", "crucial", or the like) used by the author of the text to draw attention to a particular phrase or a region of the text; the structural placement within the document image of section titles and page headers, and the placement of graphics; and so on. A preferred embodiment of the method of the invention is illustrated in the flow chart of FIG. 1, and an apparatus for performing the method is shown in FIG. 2. For the sake of clarity, the invention will be described with reference to the processing of a single document. However, it will be appreciated that the invention is applicable to the processing of a corpus of documents containing a plurality of documents.

More particularly, the invention provides a method and apparatus for automatically excerpting semantically significant information from the data or text of a document based on certain morphological (structural) image characteristics of image units corresponding to units of understanding contained within the document image. The excerpted information can be used, among other things, to automatically create a document index or summary. The selection of image units for summarization can be based on frequency of occurrence, or predetermined or user selected selection criteria, depending upon the particular application in which the method and apparatus of the invention is employed.

The invention is not limited to systems utilizing document scanning. Rather, other systems such as a bitmap workstation (i.e., a workstation with a bitmap display) or a system using both bitmapping and scanning would work equally well for the implementation of the methods and apparatus described herein.

With reference first to FIG. 2, the method is performed on an electronic image of an original document 5, which may include lines of text 7, titles, drawings, figures 8, or the like, contained in one or more sheets or pages of paper 10 or other tangible form. The electronic document image to be processed is created in any conventional manner, for example, by a conventional scanning means such as those incorporated within a document copier or facsimile machine, a Braille reading machine, or by an electronic beam scanner or the like. Such scanning means are well known in the art, and thus are not described in detail herein. An output derived from the scanning is digitized to produce undecoded bit mapped image data representing the document image for each page of the document, which data is stored, for example, in a memory 15 of a special or general purpose digital computer data processing system 13. The data processing system 13 can be a data driven processing system which comprises sequential execution processing means 16 for performing functions by executing program instructions in a predetermined sequence contained in a memory, such as the memory 15. The output from the data processing system 13 is delivered to an output device 17, such as, for example, a memory or other form of storage unit; an output display 17A as shown, which may be, for instance, a CRT display; a printer device 17B as shown, which may be incorporated in a document copier machine or a Braille or standard form printer; a facsimile machine, speech synthesizer or the like.

Through use of equipment such as illustrated in FIG. 2, the identified word units are detected based on significant morphological image characteristics inherent in the image units, without first converting the scanned document image to character codes.

The method by which such image unit identification may be performed is described with reference now to FIG. 1. The first phase of the image processing technique of the invention involves a low level document image analysis in which the document image for each page is segmented into undecoded information containing image units (step 20) using conventional image analysis techniques; or, in the case of text documents, preferably using the bounding box method described in copending U.S. patent application Ser. No. 07/794,392 filed concurrently herewith by Huttenlocher and Hopcroft, and entitled "Method for Determining Boundaries of Words in Text." The locations of and spatial relationships between the image units on a page are then determined (step 25). For example, an English language document image can be segmented into word image units based on the relative difference in spacing between characters within a word and the spacing between words. Sentence and paragraph boundaries can be similarly ascertained. Additional region segmentation image analysis can be performed to generate a physical document structure description that divides page images into labelled regions corresponding to auxiliary document elements like figures, tables, footnotes and the like. Figure regions can be distinguished from text regions based on the relative lack of image units arranged in a line within the region, for example. Using this segmentation, knowledge of how the documents being processed are arranged (e.g., left-to-right, top-to-bottom), and, optionally, other inputted information such as document style, a "reading order" sequence for word images can also be generated. The term "image unit" is thus used herein to denote an identifiable segment of an image such as a number, character, glyph, symbol, word, phrase or other unit that can be reliably extracted. 
Advantageously, for purposes of document review and evaluation, the document image is segmented into sets of signs, symbols or other elements, such as words, which together form a single unit of understanding. Such single units of understanding are generally characterized in an image as being separated by a spacing greater than that which separates the elements forming a unit, or by some predetermined graphical emphasis, such as, for example, a surrounding box image or other graphical separator, which distinguishes one or more image units from other image units in the scanned document image. Such image units representing single units of understanding will be referred to hereinafter as "word units."
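
The spacing-based segmentation described above can be illustrated with a minimal sketch. It is not the bounding box method of the copending application referenced above, but a simple column-projection approach that assumes a single text line is already available as a binary bitmap (1 for ink, 0 for background). The function names and the gap_threshold parameter are hypothetical and chosen only for illustration. Gaps narrower than the threshold are treated as spacing between characters within a word, and gaps at least as wide as the threshold are treated as spacing between words.

```python
from typing import List, Tuple

Bitmap = List[List[int]]          # rows of 0/1 pixels for one text line
Span = Tuple[int, int]            # (start_column, end_column), end exclusive


def column_profile(line_image: Bitmap) -> List[int]:
    """Count ink pixels in each column of a binary text-line image."""
    return [sum(column) for column in zip(*line_image)]


def find_ink_runs(profile: List[int]) -> List[Span]:
    """Return the column ranges that contain at least one ink pixel."""
    runs: List[Span] = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a run of ink begins here
        elif count == 0 and start is not None:
            runs.append((start, x))        # the run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_words(line_image: Bitmap, gap_threshold: int) -> List[Span]:
    """Merge ink runs separated by gaps narrower than gap_threshold.

    gap_threshold is an assumed, caller-supplied value; in practice it
    would be derived from the spacing statistics of the page.
    """
    words: List[Span] = []
    for start, end in find_ink_runs(column_profile(line_image)):
        if words and start - words[-1][1] < gap_threshold:
            words[-1] = (words[-1][0], end)   # small gap: same word unit
        else:
            words.append((start, end))        # large gap: new word unit
    return words


if __name__ == "__main__":
    # One-row toy image: two "characters", a wide gap, two more "characters".
    line = [[1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]]
    print(segment_words(line, gap_threshold=3))
```

Each returned span gives the horizontal extent of one candidate word unit. No character codes are produced at any point, which is consistent with operating on undecoded image data.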

Advantageously, a discrimination step 30 is next performed to identify the image units which have insufficient information content to be useful in