WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
System for searching a corpus of document images by user specified document layout components    
United States Patent5999664   
Link to this pagehttp://www.wikipatents.com/5999664.html
Inventor(s)Mahoney; James V. (Los Angeles, CA); Blomberg; Jeanette L. (Portola Valley, CA); Trigg; Randall H. (Palo Alto, CA); Shin; Christian K. (Fairport, NY)
AbstractA document search system provides a user with a programming interface for dynamically specifying features of documents recorded in a corpus of documents. The programming interface operates at a high-level that is suitable for interactive user specification of layout components and structures of documents. In operation, a bitmap image of a document is analyzed by the document search system to identify layout objects such as text blocks or graphics. Subsequently, the document search system computes a set of attributes for each of the identified layout objects. The set of attributes which are identified are used to describe the layout structure of a page image of a document in terms of the spatial relations that layout objects have to frames of reference that are defined by other layout objects. After computing attributes for each layout object, a user can operate the programming interface to define unique document features. Each document feature is a routine defined by a sequence of selections operations which consume a first set of layout objects and produce a second set of layout objects. The second set of layout objects constitutes the feature in a page image of a document. Using the programming interface, a user flexibly defines a genre of document using the user-specified document features.



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Inventor     Mahoney; James V. (Los Angeles, CA); Blomberg; Jeanette L. (Portola Valley, CA); Trigg; Randall H. (Palo Alto, CA); Shin; Christian K. (Fairport, NY)
Owner/Assignee     Xerox Corporation (Stamford, CT)
Patent assignment
All assignments
Publication Date     December 7, 1999
Application Number     08/971,022
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     November 14, 1997
US Classification     382/305 382/180
Int'l Classification     G06K 009/54
Examiner     Couso; Yon J.
Assistant Examiner    
Attorney/Law Firm    
Address
Parent Case     CROSS-REFERENCE TO RELATED APPLICATIONS Cross-reference is made to U.S. patent application Ser. Nos. 08/971,210, entitled "System For Summarizing A Corpus Of Documents By Assembling User Specified Layout Components" (Attorney Docket No. D/97493), 08/970,507, entitled "System For Sorting Document Images By Shape Comparisons Among Corresponding Layout Components" (Attorney Docket No. D/97494), and 08/971,020, entitled "System For Progressively Transmitting And Displaying Layout Components Of Document Images" (Attorney Docket No. D/97495), which are assigned to the same assignee as the present invention.
Priority Data    
USPTO Field of Search     382/180 382/305 382/293 382/295 382/173 382/176 707/517 707/520 707/523
Patent Tags     searching corpus document images user specified document layout components
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5848186
Wang
382/176
Dec,1998

[0 after 0 votes]
5848184
Taylor
382/173
Dec,1998

[0 after 0 votes]
5841900
Rahgozar
382/176
Nov,1998

[0 after 0 votes]
5832118
Kim

Nov,1998

[0 after 0 votes]
5701500
Ikeo
715/517
Dec,1997

[0 after 0 votes]
5598507
Kimber
704/246
Jan,1997

[0 after 0 votes]
5539841
Huttenlocher

Jul,1996

[0 after 0 votes]
5537491
Mahoney
382/218
Jul,1996

[0 after 0 votes]
5524066
Kaplan
382/229
Jun,1996

[0 after 0 votes]
5491760
Withgott
382/203
Feb,1996

[0 after 0 votes]
5442778
Pedersen
707/5
Aug,1995

[0 after 0 votes]
5434953
Bloomberg

Jul,1995

[0 after 0 votes]
5390259
Withgott
382/173
Feb,1995

[0 after 0 votes]
5384863
Huttenlocher
382/173
Jan,1995

[0 after 0 votes]
5369714
Withgott
382/177
Nov,1994

[0 after 0 votes]
5335088
Fan
382/237
Aug,1994

[0 after 0 votes]
5325444
Cass
382/177
Jun,1994

[0 after 0 votes]
5321770
Huttenlocher
382/174
Jun,1994

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


We claim:

1. A method for searching a corpus of document images stored in a memory, comprising the steps of:

segmenting each document image in the corpus of document images into a first set of layout objects; each layout object in the first set of layout objects being one of a plurality of layout object types; each of the plurality of layout object types identifying a structural element of a document;

for each segmented document image, computing attributes for each layout object in the first set of layout objects; the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;

providing a program interface for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images; the certain layout objects defining a feature of the example document image;

executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images; for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects; said executing step identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects as having the feature of the example document image; and

displaying at the program interface the ones of the document images in the corpus of document images identified by said executing step.

2. The method according to claim 1, further comprising the step of highlighting in the example document image the feature identified by the routine composed at a program interface.

3. The method according to claim 1, wherein one of the selection operations executed by said executing step is a filtering operation that produces a second set of layout objects having attribute values between a minimum threshold value and a maximum threshold value.

4. The method according to claim 1, wherein one of the selection operations executed by said executing step is a gating operation that produces a second set of layout objects equal to the first set of layout objects when each of the layout objects in a document image have attribute values between a minimum threshold value and a maximum threshold value.

5. The method according to claim 1, wherein one of the selection operations executed by said executing step is an accumulation operation that identifies an attribute value for the layout objects in the first set of layout objects.

6. The method according to claim 1, further comprising the step of forming the corpus document images in the memory by scanning hardcopy documents.

7. The method according to claim 1, wherein said segmenting step defines one of the plurality of layout objects to be a text block of.

8. The method according to claim 1, further comprising the step of defining a structural model for identifying a genre of document; the structural model defining a set of features absent and a set of features present in document images to specify a class of document images which express a common communicative purpose that is independent of document content.

9. A program storage device readable by a machine, embodying a program of instructions executable by the machine to perform method steps for searching a corpus of document images stored in a memory of a document management system, said method steps comprising:

segmenting each document image in the corpus of document images into a first set of layout objects; each layout object in the first set of layout objects being one of a plurality of layout object types; each of the plurality of layout object types identifying a structural element of a document;

for each segmented document image, computing attributes for each layout object in the first set of layout objects; the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;

providing a program interface for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images; the certain layout objects defining a feature of the example document image;

executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images; for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects; said executing step identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects as having the feature of the example document image; and

displaying at the program interface the ones of the document images in the corpus of document images identified by said executing step.

10. The program storage device as recited in claim 9, wherein said method steps further comprise the step of defining a structural model for identifying a genre of document; the structural model defining a set of features absent and a set of features present in document images to specify a class of document images which express a common communicative purpose that is independent of document content.

11. A document management system for searching a corpus of document images, comprising:

a memory for storing the corpus of document images and image processing instructions of the document management system;

a display; and

a processor coupled to the memory and the display for executing the document image processing instructions of the document management system; the processor in executing the document image processing instructions:

segmenting each document image in the corpus of document images into a first set of layout objects; each layout object in the first set of layout objects being one of a plurality of layout object types; each of the plurality of layout object types identifying a structural element of a document;

for each segmented document image, computing attributes for each layout object in the first set of layout objects; the computed attributes of each layout object having values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects;

providing a program interface on the display for composing a routine that includes a sequence of selection operations for identifying certain of the layout objects in the first set of layout objects of an example document image selected from the corpus of document images; the certain layout objects defining a feature of the example document image;

executing the sequence of selection operations of the routine for identifying the feature of the example document image in ones of the document images in the corpus of document images; for each segmented document image, the sequence of selection operations receiving as input the first set of layout objects and the computed attributes to produce as output a second set of layout objects;

identifying the ones of the document images in the corpus of document images that include at least one layout object in the second set of layout objects to have the feature of the example document image; and

displaying at the program interface on the display the ones of the document images in the corpus of document images identified as having the feature of the example document image.

12. The document management system according to claim 11, further comprising means for displaying at the program interface on the display an example document with the feature highlighted.

13. The document management system according to claim 11, wherein the program interface on the display comprises:

a first display area for specifying a name for the feature;

a second display area for specifying a set of images which exemplify the feature;

a third display area for specifying a set of input layout objects; the set of input layout objects limiting the first set of layout objects of the document image consumed by the sequence of selection operations of the routine; and

a fourth display area for defining the sequence of selections operations of the feature.

14. The document management system according to claim 11, further comprising a program interface on the display for defining a model of a genre of document.

15. The document management system according to claim 14, wherein the program interface on the display comprises:

means for specifying that a feature is present in the genre of document; and

means for specifying that a feature is absent from the genre of document.

16. The document management system according to claim 14, wherein the genre of document specified is a letter.

17. The document management system according to claim 14, wherein the genre of document specified is a memo.

18. The document management system according to claim 11, wherein the document image is a bitmap image.

19. The document management system according to claim 11, wherein one of the sequence of selection operations is a filtering operation given by:

F(L,A,u,v,N)={l.epsilon.L:uN.ltoreq.A(l)<vN}, where:

L is a set of layout objects to which the filtering operation is applied;

A is an attribute;

u and v are threshold arguments; and

N is a normalization argument.

20. The document management system according to claim 11, wherein one of the sequence of selection operations is a gate operation given by: ##EQU3## L is a set of layout objects to which the gate operation is applied; A is an attribute;

u and v are threshold arguments; and

N is a normalization argument.
 Description Submit all comments and votes
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to a system for managing and searching a large corpus of documents, and more particularly, to a system for a user to dynamically specify layout components of documents recorded in the large corpus of documents.

2. Description of Related Art

Searching for a document in a large heterogeneous corpus of documents stored in an electronic database is often difficult because of the sheer size of the corpus (e.g., 750,000 documents). Many of the documents that make up the corpus are documents that cannot be identified by simply performing text based searches. In some instances, some documents in the corpus may, for example, be scanned images of hardcopy documents, or images derived using PDF (Portable Documents Formats), or PostScript.RTM.. In other instances, simply searching the text of documents may not narrow a search sufficiently to locate a particular document in the corpus.

Techniques for searching the text of a document in a large corpus of documents exist. U.S. Pat. No. 5,442,778 discloses a scatter-gather browsing method which is a cluster-based method for browsing a large corpus of documents. This system addresses the extreme case in which there is no specific query, but rather a need to get an idea of what exists in a large corpus of documents. Scatter-gather relies on document clustering to present to a user descriptions of large document groups. Document clustering is based on the general assumption that mutually similar documents tend to be relevant to the same queries. Based on the descriptions of the documents groups, the user selects one or more of the document groups for further study. These selected groups are gathered together to form a sub-collection. This process repeats and bottoms out when individual documents are viewed.

Also, techniques exist that analyze the machine readable text of a document for identifying the genre of documents. The genre of text relates to a type of text or type of document. An example of a method for identifying the genre of machine readable text is disclosed in U.S. Provisional Application Ser. No. 60/051,558, entitled "Article And Method Of Automatically Determining Text Genre Using Surface Features Of Untagged Texts," (Attorney Docket No. D/95465P). Initially, machine readable text is analyzed to formulate a cue vector. The cue vector represents occurrences in the text of a set of non-structural, surface cues, which are easily computable. A genre of the text is then determined by weighing the elements making up the cue vector.

Besides text found in a document, often the layout of a particular document contains a significant amount of information that can be used to identify a document stored in a large corpus of documents. Using the layout structure of documents to search a large corpus of documents is particularly advantageous when documents in the corpus have not been tagged with a high level definition. Hardcopy documents which are scanned are recorded as bitmap images that have no structural definition that is immediately perceivable by a computer. A bitmap image generally consists of a sequence of image data or pixels. To become searchable, the structure of a bitmap image is analyzed to identify its layout structure.

By examining different work practices, it has been found that a work process (i.e., manner of working) can be supported with a system that is capable of searching and retrieving documents in a corpus by their type or genre (i.e., functional category). Where some genres of documents are general in the sense that they recur across different organizations and work processes, other genre of documents are idiosyncratic to a particular organization, task, or even user. For example, a business letter and a memo are examples of a general genre. A set of documents with an individual's private stamp in the upper right corner of each document is an example of a genre that is idiosyncratic to a particular user. It has also been found that many different genre of documents have a predefined form or a standard set of components that depict a unique spatial arrangement. For example, business letters are divided into a main body, author and recipient addresses, and signature. Unlike specific text based identifiers, which are used to identify the genre of a document, the layout structure of documents can apply across different classes of documents.

A number of different techniques have been developed for analyzing the layout structure of a bitmap image. Generally, page layout analysis has been divided into two broad categories: geometric layout analysis and logical structure analysis. Geometric layout analysis extracts whatever structure can be inferred without reference to models of particular kinds of pages--e.g., letter, memo, title page, table, etc. Logical structure analysis classifies a given page within a repertoire of known layouts, and assigns functional interpretations to components of the page based on this classification. Geometric analysis is generally preliminary to logical structure analysis. (For further background on image layout analysis see U.S. patent application Ser. No. 08/565,181, entitled "Method For Classifying Non-Running Text In An Image" and its references).

The present invention concerns a method and apparatus for defining user-specified layout structures of documents (i.e., the visual appearance) to facilitate the search and retrieval of a document stored in a multi-genre database of documents. This method of searching documents focuses a search according to the manner in which the layout structure of a document is defined. Unlike many techniques for searching the text within a document, searching documents according to their layout structure is based on the appearance and not the textual content found in a document. The general premise for searching documents based on their layout structure is that the layout structure of text documents often reflect its genre. For example, business letters are in many ways more visually similar to one another than they are to magazine articles. Thus, a user searching for a particular document while knowing the class of documents is able to more effectively narrow the group of documents being searched.

One problem addressed by this invention is how to best manage a large corpus of scanned documents. Many document search and retrieval systems rely entirely on the results of applying OCR (Optical Character Recognition) to every scanned document image. Generally, OCR techniques involve segmenting an image into individual characters which are then decoded and matched to characters in a library. Typically, such OCR techniques require extensive computational effort, generally have a non-trivial degree of recognition error, and often require significant amounts of time for image processing. In operation, OCR techniques distinguish each bitmap of a character from its neighbor, analyze its appearance, and distinguish it from other characters in a predetermined set of characters.

A disadvantage of OCR techniques is that they are often an insufficient means for capturing information in scanned documents because the quality of OCR results may be unacceptably poor. For example, the OCR results for a scanned document may be poor in quality because the original document was a heavily used original, a facsimile of an original, or a copy of an original. In each of these examples, the scanned results of an original document may provide insufficient information for an OCR program to accurately identify the text within the scanned image. In some instances, some scanned documents may be handwritten in whole or in part, thereby making those portions of the original document unintelligible to an OCR program.

Another disadvantage of OCR techniques is that the layout or formatting of the document is typically not preserved by an OCR program. As recognized by Blomberg et al. in "Reflections on a Work-Oriented Design Project" (published in PDC'94: Proceedings of the Participatory Design Conference, p. 99-109, on Oct. 27-28, 1994), users searching for a particular document in a large corpus of documents tend to rely on clues about the form and structure of the documents. Such clues, which could be gained from either the original bitmap image or reduced scale images (i.e., thumbnails), tend to be lost in ASCII text renderings of images. Thus, the layout or formatting of a document, which is usually not captured or preserved when a scanned image is reduced to text using an OCR program, is crucial information that can be used for identifying that document in a large corpus of documents. Improved OCR programs such as TextBridge.RTM., which is produced by Xerox ScanSoft, Inc., are capable of converting scanned images into formatted documents (e.g. HTML (hypertext markup language)) with tables and pictures as opposed to a simple ASCII text document (more information can be found on the Internet at http://www.xerox.com/xis/textbridge/).

An alternative technique for identifying information contained in electronic documents without having to decode a document using OCR techniques is disclosed in U.S. Pat. No. 5,491,760 and its references. This alternative technique segments an undecoded document image into word image units without decoding the document image or referencing decoded image data. Once segmented, word image units are evaluated in accordance with morphological image properties of the word image units, such as word shape. (These morphological image properties do not take into account the structure of a document. That is, the word image units do not take into account where the shape appeared in a document.) Those word image units which are identified as semantically significant are used to create an ancillary document image of content which is reflective of the subject matter in the original document. Besides image summarization, segmenting a document into word image units has many other applications which are disclosed in related U.S. Pat. Nos. 5,539,841; 5,321,770; 5,325,444; 5,390,259; 5,384,863; and 5,369,714. For instance, U.S. Patent No. discloses a method for identifying when similar tokens (e.g., character, symbol, glyph, string of components) are present in an image section; U.S. Pat. No. 5,324,444 discloses a method for determining the frequency of words in a document, and U.S. Pat. No. 5,369,714 discloses a method for determining the frequency of phrases found in a document.

Another alternative to performing OCR analysis on bitmap images are systems that perform content-based searches on bitmap images. An example of such a system is IBM's Query by Image Content (QBIC) system. The QBIC system is disclosed in articles by Niblack et al., entitled "The QBIC project: querying images by content using color, texture and shape," in SPIE Proc. Storage and Retrieval for Image and Video Databases, 1993, and by Ashley et al., entitled "Automatic and semiautomatic methods for image annotation and retrieval in QBIC," in SPIE Proc. Storage and Retrieval for Image and Video Databases, pages 24-35, 1995. A demo of a QBIC search engine is available on the internet at "http://wwwqbic.almaden.ibm.com/.about.qbic/qbic.html". Using the QBIC.TM. system, bitmap images in a large database of images can be queried by image properties such as color percentages, color layouts, and textures. The image-based queries offered by the QBIC system are combined with text or keyword for more focused searching.

Another system for performing content-based queries is being developed as part of the UC Berkeley Digital Library Project. Unlike the QBIC system which relies on low-level image properties to perform searches, the Berkeley system groups properties and relationships of low level regions to define high-level objects. The premise of the Berkeley system is that high-level objects can be defined by meaningful arrangements of color and texture. Aspects of the Berkeley system are disclosed in the following articles and their references: Chad Carson et al., "Region-Based Image Querying," CVPR '97 Workshop on Content-Based Access of Image and Video Libraries; Serge Belongie et al., "Recognition of Images in Large Databases Using a Learning Framework," UC Berkeley CS Tech Report 97-939; and Chad Carson et al., "Storage and Retrieval of Feature Data for a Very Large Online Image Collection," IEEE Computer Society Bulletin of the Technical Committee on Data Engineering, December 1996, Vol. 19 No. 4.

In addition to using OCR programs or the like to decipher the content of scanned documents, it is also common to record document metadata (i.e., document information) at the time a hardcopy document is scanned. This document metadata, which is searchable as text, may include the subject of the document, the author of the document, keywords found in the document, the title of the document, and the genre or type of document. A disadvantage of using document metadata to identify documents is that the genre specified for a particular corpus of documents is not static. Instead, the number of different genre of documents in a corpus can vary as the corpus grows. A further disadvantage of document metadata is that it is time consuming for a user to input into a system. As a result, a system for managing and searching scanned documents should be robust enough to provide a mechanism for defining categories and sub-categories of document formats as new documents are added to the corpus.

Another method for locating documents in a large corpus of documents is by searching and reviewing human-supplied summaries. In the absence of human-supplied summaries, systems can be used that automatically generate documents summaries. One advantage for using summaries in document search and retrieval systems is that they reduce the amount of visual information that a user must examine in the course of searching for a particular document. By being presented on a display or the like with summaries of documents instead of the entire document, a user is better able to evaluate a larger number of documents in a given amount of time.

Most systems that automatically summarize the contents of documents create summaries by analyzing the ASCII text that makes up the documents. One approach locates a subset of sentences that are indicative of document content. For example, U.S. Pat. No. 5,778,397, assigned to the same assignee as the present invention, discloses a method for generating feature probabilities that allow later generation of document extracts. Alternatively, U.S. Pat. No. 5,491,760 discloses a method for summarizing a document without decoding the textual contents of a bitmap image. The summarization technique disclosed in the '760 Patent uses automatic or interactive morphological image recognition techniques to produce documents summaries.

Accordingly, it would be desirable to provide a system for managing and searching a large corpus of scanned documents in which not only are text identified using an OCR program and inputted document metadata searchable but also the visual representations of scanned documents can be identified. Such a system would advantageously search, summarize, sort, and transmit documents using information that defines the structure and format of a document. It would also be desirable in such a system to provide an interface for a user to flexibly specify the genre of document by the particular layout format of documents. One reason this is desirable is that genre of documents tend to change and emerge over the course of using and adding document to a corpus. Consequently, an ideal system would give users the flexibility to specify either a new genre or a specific class of genre that is of interest to a single user or group of users.

SUMMARY OF THE INVENTION

In accordance with the invention there is provided a system, and method and article of manufacture therefor, for identifying a portion of a document stored in a memory of a document management system. In accordance with one aspect of the invention, the document image is segmented into a first set of layout objects. Each layout object in the first set of layout objects is one of a plurality of layout object types, and each of the plurality of layout object types identifies a structural element of a document. Attributes for each layout object in the first set of layout objects are computed. The computed attributes of each layout object are assigned values that quantify properties of a structural element and identify spatial relationships with other segmented layout objects in the document image. A routine is executed by the system for identifying a feature of the document image. The routine is defined by a sequence of selection operations that consumes the first set of layout objects and uses the computed attributes to produce a second set of layout objects. Once the routine is executed, the feature of the document image is identified by the second set of layout objects.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other aspects of the invention will become apparent from the following description read in conjunction with the accompanying drawings wherein the same reference numerals have been applied to like parts and in which:

FIG. 1 is a block diagram of the general components used to practice the present invention;

FIG. 2 illustrates a detailed block diagram of the document corpus management and search system shown in FIG. 1;

FIG. 3 illustrates the manner in which document image data is arranged in the file system;

FIG. 4 is a flow diagram of an interaction cycle for defining a feature using sequences of primitive operations;

FIG. 5 is a flow diagram which sets forth the steps for specifying one or more selection operations or accumulation operations for the set of layout objects defined at step 408 in FIG. 4;

FIG. 6 illustrates an example of a feature programmed using the interaction cycle set forth in FIGS. 4-5; and

FIG. 7 illustrates in greater detail the genre model program interface 219 shown in FIG. 2;

FIG. 8 illustrates examples of three different high level configurations of documents which can be defined by specifying either the absence or presence of attributes and features using the genre model program interface shown in FIG. 7;

FIG. 9 illustrates an example of a search engine interface for searching the corpus of documents stored in file system;

FIG. 10 illustrates a summarization display profile, which can be used to define the output format of a composite summary image of user-crafted summaries;

FIG. 11 is a flow diagram which sets forth the steps in for generating user-crafted summaries of searches;

FIGS. 12, 13, and 14 illustrate three different examples of summary images created using the steps outlined in FIG. 10;

FIG. 15 is a flow diagram which sets forth the steps for sorting document images according to the similarities between layout objects segmented from document images;

FIG. 16 is a flow diagram which sets forth one embodiment for sorting the set of image segments at step 1508 shown in FIG. 15;

FIG. 17 illustrates a grouping of image segments that is formed using the method set forth in FIGS. 15 and 16;

FIG. 18 is a flow diagram which sets forth an embodiment for sorting layout objects segmented from document images by their similarity to a specified layout object;

FIG. 19 illustrates an example in which features of document images are sorted according to the similarity of a feature in a specified document image;

FIG. 20 is a flow diagram setting forth the steps for performing progressive transmission of document images from the perspective of a server workstation running the document search and retrieval system;

FIG. 21 illustrates a progressive display profile for defining the order in which features and attributes of a document image are to be transmitted and/or displayed;

FIG. 22 illustrates an example page i