|
Claims  |
|
|
We claim:
1. A method for searching a corpus of document images stored in a memory,
comprising the steps of:
segmenting each document image in the corpus of document images into a
first set of layout objects; each layout object in the first set of layout
objects being one of a plurality of layout object types; each of the
plurality of layout object types identifying a structural element of a
document;
for each segmented document image, computing attributes for each layout
object in the first set of layout objects; the computed attributes of each
layout object having values that quantify properties of a structural
element and identify spatial relationships with other segmented layout
objects;
providing a program interface for composing a routine that includes a
sequence of selection operations for identifying certain of the layout
objects in the first set of layout objects of an example document image
selected from the corpus of document images; the certain layout objects
defining a feature of the example document image;
executing the sequence of selection operations of the routine for
identifying the feature of the example document image in ones of the
document images in the corpus of document images; for each segmented
document image, the sequence of selection operations receiving as input
the first set of layout objects and the computed attributes to produce as
output a second set of layout objects; said executing step identifying the
ones of the document images in the corpus of document images that include
at least one layout object in the second set of layout objects as having
the feature of the example document image; and
displaying at the program interface the ones of the document images in the
corpus of document images identified by said executing step.
2. The method according to claim 1, further comprising the step of
highlighting in the example document image the feature identified by the
routine composed at a program interface.
3. The method according to claim 1, wherein one of the selection operations
executed by said executing step is a filtering operation that produces a
second set of layout objects having attribute values between a minimum
threshold value and a maximum threshold value.
4. The method according to claim 1, wherein one of the selection operations
executed by said executing step is a gating operation that produces a
second set of layout objects equal to the first set of layout objects when
each of the layout objects in a document image have attribute values
between a minimum threshold value and a maximum threshold value.
5. The method according to claim 1, wherein one of the selection operations
executed by said executing step is an accumulation operation that
identifies an attribute value for the layout objects in the first set of
layout objects.
6. The method according to claim 1, further comprising the step of forming
the corpus document images in the memory by scanning hardcopy documents.
7. The method according to claim 1, wherein said segmenting step defines
one of the plurality of layout objects to be a text block of.
8. The method according to claim 1, further comprising the step of defining
a structural model for identifying a genre of document; the structural
model defining a set of features absent and a set of features present in
document images to specify a class of document images which express a
common communicative purpose that is independent of document content.
9. A program storage device readable by a machine, embodying a program of
instructions executable by the machine to perform method steps for
searching a corpus of document images stored in a memory of a document
management system, said method steps comprising:
segmenting each document image in the corpus of document images into a
first set of layout objects; each layout object in the first set of layout
objects being one of a plurality of layout object types; each of the
plurality of layout object types identifying a structural element of a
document;
for each segmented document image, computing attributes for each layout
object in the first set of layout objects; the computed attributes of each
layout object having values that quantify properties of a structural
element and identify spatial relationships with other segmented layout
objects;
providing a program interface for composing a routine that includes a
sequence of selection operations for identifying certain of the layout
objects in the first set of layout objects of an example document image
selected from the corpus of document images; the certain layout objects
defining a feature of the example document image;
executing the sequence of selection operations of the routine for
identifying the feature of the example document image in ones of the
document images in the corpus of document images; for each segmented
document image, the sequence of selection operations receiving as input
the first set of layout objects and the computed attributes to produce as
output a second set of layout objects; said executing step identifying the
ones of the document images in the corpus of document images that include
at least one layout object in the second set of layout objects as having
the feature of the example document image; and
displaying at the program interface the ones of the document images in the
corpus of document images identified by said executing step.
10. The program storage device as recited in claim 9, wherein said method
steps further comprise the step of defining a structural model for
identifying a genre of document; the structural model defining a set of
features absent and a set of features present in document images to
specify a class of document images which express a common communicative
purpose that is independent of document content.
11. A document management system for searching a corpus of document images,
comprising:
a memory for storing the corpus of document images and image processing
instructions of the document management system;
a display; and
a processor coupled to the memory and the display for executing the
document image processing instructions of the document management system;
the processor in executing the document image processing instructions:
segmenting each document image in the corpus of document images into a
first set of layout objects; each layout object in the first set of layout
objects being one of a plurality of layout object types; each of the
plurality of layout object types identifying a structural element of a
document;
for each segmented document image, computing attributes for each layout
object in the first set of layout objects; the computed attributes of each
layout object having values that quantify properties of a structural
element and identify spatial relationships with other segmented layout
objects;
providing a program interface on the display for composing a routine that
includes a sequence of selection operations for identifying certain of the
layout objects in the first set of layout objects of an example document
image selected from the corpus of document images; the certain layout
objects defining a feature of the example document image;
executing the sequence of selection operations of the routine for
identifying the feature of the example document image in ones of the
document images in the corpus of document images; for each segmented
document image, the sequence of selection operations receiving as input
the first set of layout objects and the computed attributes to produce as
output a second set of layout objects;
identifying the ones of the document images in the corpus of document
images that include at least one layout object in the second set of layout
objects to have the feature of the example document image; and
displaying at the program interface on the display the ones of the document
images in the corpus of document images identified as having the feature
of the example document image.
12. The document management system according to claim 11, further
comprising means for displaying at the program interface on the display an
example document with the feature highlighted.
13. The document management system according to claim 11, wherein the
program interface on the display comprises:
a first display area for specifying a name for the feature;
a second display area for specifying a set of images which exemplify the
feature;
a third display area for specifying a set of input layout objects; the set
of input layout objects limiting the first set of layout objects of the
document image consumed by the sequence of selection operations of the
routine; and
a fourth display area for defining the sequence of selections operations of
the feature.
14. The document management system according to claim 11, further
comprising a program interface on the display for defining a model of a
genre of document.
15. The document management system according to claim 14, wherein the
program interface on the display comprises:
means for specifying that a feature is present in the genre of document;
and
means for specifying that a feature is absent from the genre of document.
16. The document management system according to claim 14, wherein the genre
of document specified is a letter.
17. The document management system according to claim 14, wherein the genre
of document specified is a memo.
18. The document management system according to claim 11, wherein the
document image is a bitmap image.
19. The document management system according to claim 11, wherein one of
the sequence of selection operations is a filtering operation given by:
F(L,A,u,v,N)={l.epsilon.L:uN.ltoreq.A(l)<vN}, where:
L is a set of layout objects to which the filtering operation is applied;
A is an attribute;
u and v are threshold arguments; and
N is a normalization argument.
20. The document management system according to claim 11, wherein one of
the sequence of selection operations is a gate operation given by:
##EQU3##
L is a set of layout objects to which the gate operation is applied; A is
an attribute;
u and v are threshold arguments; and
N is a normalization argument. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to a system for managing and
searching a large corpus of documents, and more particularly, to a system
for a user to dynamically specify layout components of documents recorded
in the large corpus of documents.
2. Description of Related Art
Searching for a document in a large heterogeneous corpus of documents
stored in an electronic database is often difficult because of the sheer
size of the corpus (e.g., 750,000 documents). Many of the documents that
make up the corpus are documents that cannot be identified by simply
performing text based searches. In some instances, some documents in the
corpus may, for example, be scanned images of hardcopy documents, or
images derived using PDF (Portable Documents Formats), or PostScript.RTM..
In other instances, simply searching the text of documents may not narrow
a search sufficiently to locate a particular document in the corpus.
Techniques for searching the text of a document in a large corpus of
documents exist. U.S. Pat. No. 5,442,778 discloses a scatter-gather
browsing method which is a cluster-based method for browsing a large
corpus of documents. This system addresses the extreme case in which there
is no specific query, but rather a need to get an idea of what exists in a
large corpus of documents. Scatter-gather relies on document clustering to
present to a user descriptions of large document groups. Document
clustering is based on the general assumption that mutually similar
documents tend to be relevant to the same queries. Based on the
descriptions of the documents groups, the user selects one or more of the
document groups for further study. These selected groups are gathered
together to form a sub-collection. This process repeats and bottoms out
when individual documents are viewed.
Also, techniques exist that analyze the machine readable text of a document
for identifying the genre of documents. The genre of text relates to a
type of text or type of document. An example of a method for identifying
the genre of machine readable text is disclosed in U.S. Provisional
Application Ser. No. 60/051,558, entitled "Article And Method Of
Automatically Determining Text Genre Using Surface Features Of Untagged
Texts," (Attorney Docket No. D/95465P). Initially, machine readable text
is analyzed to formulate a cue vector. The cue vector represents
occurrences in the text of a set of non-structural, surface cues, which
are easily computable. A genre of the text is then determined by weighing
the elements making up the cue vector.
Besides text found in a document, often the layout of a particular document
contains a significant amount of information that can be used to identify
a document stored in a large corpus of documents. Using the layout
structure of documents to search a large corpus of documents is
particularly advantageous when documents in the corpus have not been
tagged with a high level definition. Hardcopy documents which are scanned
are recorded as bitmap images that have no structural definition that is
immediately perceivable by a computer. A bitmap image generally consists
of a sequence of image data or pixels. To become searchable, the structure
of a bitmap image is analyzed to identify its layout structure.
By examining different work practices, it has been found that a work
process (i.e., manner of working) can be supported with a system that is
capable of searching and retrieving documents in a corpus by their type or
genre (i.e., functional category). Where some genres of documents are
general in the sense that they recur across different organizations and
work processes, other genre of documents are idiosyncratic to a particular
organization, task, or even user. For example, a business letter and a
memo are examples of a general genre. A set of documents with an
individual's private stamp in the upper right corner of each document is
an example of a genre that is idiosyncratic to a particular user. It has
also been found that many different genre of documents have a predefined
form or a standard set of components that depict a unique spatial
arrangement. For example, business letters are divided into a main body,
author and recipient addresses, and signature. Unlike specific text based
identifiers, which are used to identify the genre of a document, the
layout structure of documents can apply across different classes of
documents.
A number of different techniques have been developed for analyzing the
layout structure of a bitmap image. Generally, page layout analysis has
been divided into two broad categories: geometric layout analysis and
logical structure analysis. Geometric layout analysis extracts whatever
structure can be inferred without reference to models of particular kinds
of pages--e.g., letter, memo, title page, table, etc. Logical structure
analysis classifies a given page within a repertoire of known layouts, and
assigns functional interpretations to components of the page based on this
classification. Geometric analysis is generally preliminary to logical
structure analysis. (For further background on image layout analysis see
U.S. patent application Ser. No. 08/565,181, entitled "Method For
Classifying Non-Running Text In An Image" and its references).
The present invention concerns a method and apparatus for defining
user-specified layout structures of documents (i.e., the visual
appearance) to facilitate the search and retrieval of a document stored in
a multi-genre database of documents. This method of searching documents
focuses a search according to the manner in which the layout structure of
a document is defined. Unlike many techniques for searching the text
within a document, searching documents according to their layout structure
is based on the appearance and not the textual content found in a
document. The general premise for searching documents based on their
layout structure is that the layout structure of text documents often
reflect its genre. For example, business letters are in many ways more
visually similar to one another than they are to magazine articles. Thus,
a user searching for a particular document while knowing the class of
documents is able to more effectively narrow the group of documents being
searched.
One problem addressed by this invention is how to best manage a large
corpus of scanned documents. Many document search and retrieval systems
rely entirely on the results of applying OCR (Optical Character
Recognition) to every scanned document image. Generally, OCR techniques
involve segmenting an image into individual characters which are then
decoded and matched to characters in a library. Typically, such OCR
techniques require extensive computational effort, generally have a
non-trivial degree of recognition error, and often require significant
amounts of time for image processing. In operation, OCR techniques
distinguish each bitmap of a character from its neighbor, analyze its
appearance, and distinguish it from other characters in a predetermined
set of characters.
A disadvantage of OCR techniques is that they are often an insufficient
means for capturing information in scanned documents because the quality
of OCR results may be unacceptably poor. For example, the OCR results for
a scanned document may be poor in quality because the original document
was a heavily used original, a facsimile of an original, or a copy of an
original. In each of these examples, the scanned results of an original
document may provide insufficient information for an OCR program to
accurately identify the text within the scanned image. In some instances,
some scanned documents may be handwritten in whole or in part, thereby
making those portions of the original document unintelligible to an OCR
program.
Another disadvantage of OCR techniques is that the layout or formatting of
the document is typically not preserved by an OCR program. As recognized
by Blomberg et al. in "Reflections on a Work-Oriented Design Project"
(published in PDC'94: Proceedings of the Participatory Design Conference,
p. 99-109, on Oct. 27-28, 1994), users searching for a particular document
in a large corpus of documents tend to rely on clues about the form and
structure of the documents. Such clues, which could be gained from either
the original bitmap image or reduced scale images (i.e., thumbnails), tend
to be lost in ASCII text renderings of images. Thus, the layout or
formatting of a document, which is usually not captured or preserved when
a scanned image is reduced to text using an OCR program, is crucial
information that can be used for identifying that document in a large
corpus of documents. Improved OCR programs such as TextBridge.RTM., which
is produced by Xerox ScanSoft, Inc., are capable of converting scanned
images into formatted documents (e.g. HTML (hypertext markup language))
with tables and pictures as opposed to a simple ASCII text document (more
information can be found on the Internet at
http://www.xerox.com/xis/textbridge/).
An alternative technique for identifying information contained in
electronic documents without having to decode a document using OCR
techniques is disclosed in U.S. Pat. No. 5,491,760 and its references.
This alternative technique segments an undecoded document image into word
image units without decoding the document image or referencing decoded
image data. Once segmented, word image units are evaluated in accordance
with morphological image properties of the word image units, such as word
shape. (These morphological image properties do not take into account the
structure of a document. That is, the word image units do not take into
account where the shape appeared in a document.) Those word image units
which are identified as semantically significant are used to create an
ancillary document image of content which is reflective of the subject
matter in the original document. Besides image summarization, segmenting a
document into word image units has many other applications which are
disclosed in related U.S. Pat. Nos. 5,539,841; 5,321,770; 5,325,444;
5,390,259; 5,384,863; and 5,369,714. For instance, U.S. Patent No.
discloses a method for identifying when similar tokens (e.g., character,
symbol, glyph, string of components) are present in an image section; U.S.
Pat. No. 5,324,444 discloses a method for determining the frequency of
words in a document, and U.S. Pat. No. 5,369,714 discloses a method for
determining the frequency of phrases found in a document.
Another alternative to performing OCR analysis on bitmap images are systems
that perform content-based searches on bitmap images. An example of such a
system is IBM's Query by Image Content (QBIC) system. The QBIC system is
disclosed in articles by Niblack et al., entitled "The QBIC project:
querying images by content using color, texture and shape," in SPIE Proc.
Storage and Retrieval for Image and Video Databases, 1993, and by Ashley
et al., entitled "Automatic and semiautomatic methods for image annotation
and retrieval in QBIC," in SPIE Proc. Storage and Retrieval for Image and
Video Databases, pages 24-35, 1995. A demo of a QBIC search engine is
available on the internet at
"http://wwwqbic.almaden.ibm.com/.about.qbic/qbic.html". Using the QBIC.TM.
system, bitmap images in a large database of images can be queried by
image properties such as color percentages, color layouts, and textures.
The image-based queries offered by the QBIC system are combined with text
or keyword for more focused searching.
Another system for performing content-based queries is being developed as
part of the UC Berkeley Digital Library Project. Unlike the QBIC system
which relies on low-level image properties to perform searches, the
Berkeley system groups properties and relationships of low level regions
to define high-level objects. The premise of the Berkeley system is that
high-level objects can be defined by meaningful arrangements of color and
texture. Aspects of the Berkeley system are disclosed in the following
articles and their references: Chad Carson et al., "Region-Based Image
Querying," CVPR '97 Workshop on Content-Based Access of Image and Video
Libraries; Serge Belongie et al., "Recognition of Images in Large
Databases Using a Learning Framework," UC Berkeley CS Tech Report 97-939;
and Chad Carson et al., "Storage and Retrieval of Feature Data for a Very
Large Online Image Collection," IEEE Computer Society Bulletin of the
Technical Committee on Data Engineering, December 1996, Vol. 19 No. 4.
In addition to using OCR programs or the like to decipher the content of
scanned documents, it is also common to record document metadata (i.e.,
document information) at the time a hardcopy document is scanned. This
document metadata, which is searchable as text, may include the subject of
the document, the author of the document, keywords found in the document,
the title of the document, and the genre or type of document. A
disadvantage of using document metadata to identify documents is that the
genre specified for a particular corpus of documents is not static.
Instead, the number of different genre of documents in a corpus can vary
as the corpus grows. A further disadvantage of document metadata is that
it is time consuming for a user to input into a system. As a result, a
system for managing and searching scanned documents should be robust
enough to provide a mechanism for defining categories and sub-categories
of document formats as new documents are added to the corpus.
Another method for locating documents in a large corpus of documents is by
searching and reviewing human-supplied summaries. In the absence of
human-supplied summaries, systems can be used that automatically generate
documents summaries. One advantage for using summaries in document search
and retrieval systems is that they reduce the amount of visual information
that a user must examine in the course of searching for a particular
document. By being presented on a display or the like with summaries of
documents instead of the entire document, a user is better able to
evaluate a larger number of documents in a given amount of time.
Most systems that automatically summarize the contents of documents create
summaries by analyzing the ASCII text that makes up the documents. One
approach locates a subset of sentences that are indicative of document
content. For example, U.S. Pat. No. 5,778,397, assigned to the same
assignee as the present invention, discloses a method for generating
feature probabilities that allow later generation of document extracts.
Alternatively, U.S. Pat. No. 5,491,760 discloses a method for summarizing
a document without decoding the textual contents of a bitmap image. The
summarization technique disclosed in the '760 Patent uses automatic or
interactive morphological image recognition techniques to produce
documents summaries.
Accordingly, it would be desirable to provide a system for managing and
searching a large corpus of scanned documents in which not only are text
identified using an OCR program and inputted document metadata searchable
but also the visual representations of scanned documents can be
identified. Such a system would advantageously search, summarize, sort,
and transmit documents using information that defines the structure and
format of a document. It would also be desirable in such a system to
provide an interface for a user to flexibly specify the genre of document
by the particular layout format of documents. One reason this is desirable
is that genre of documents tend to change and emerge over the course of
using and adding document to a corpus. Consequently, an ideal system would
give users the flexibility to specify either a new genre or a specific
class of genre that is of interest to a single user or group of users.
SUMMARY OF THE INVENTION
In accordance with the invention there is provided a system, and method and
article of manufacture therefor, for identifying a portion of a document
stored in a memory of a document management system. In accordance with one
aspect of the invention, the document image is segmented into a first set
of layout objects. Each layout object in the first set of layout objects
is one of a plurality of layout object types, and each of the plurality of
layout object types identifies a structural element of a document.
Attributes for each layout object in the first set of layout objects are
computed. The computed attributes of each layout object are assigned
values that quantify properties of a structural element and identify
spatial relationships with other segmented layout objects in the document
image. A routine is executed by the system for identifying a feature of
the document image. The routine is defined by a sequence of selection
operations that consumes the first set of layout objects and uses the
computed attributes to produce a second set of layout objects. Once the
routine is executed, the feature of the document image is identified by
the second set of layout objects.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other aspects of the invention will become apparent from the
following description read in conjunction with the accompanying drawings
wherein the same reference numerals have been applied to like parts and in
which:
FIG. 1 is a block diagram of the general components used to practice the
present invention;
FIG. 2 illustrates a detailed block diagram of the document corpus
management and search system shown in FIG. 1;
FIG. 3 illustrates the manner in which document image data is arranged in
the file system;
FIG. 4 is a flow diagram of an interaction cycle for defining a feature
using sequences of primitive operations;
FIG. 5 is a flow diagram which sets forth the steps for specifying one or
more selection operations or accumulation operations for the set of layout
objects defined at step 408 in FIG. 4;
FIG. 6 illustrates an example of a feature programmed using the interaction
cycle set forth in FIGS. 4-5; and
FIG. 7 illustrates in greater detail the genre model program interface 219
shown in FIG. 2;
FIG. 8 illustrates examples of three different high level configurations of
documents which can be defined by specifying either the absence or
presence of attributes and features using the genre model program
interface shown in FIG. 7;
FIG. 9 illustrates an example of a search engine interface for searching
the corpus of documents stored in file system;
FIG. 10 illustrates a summarization display profile, which can be used to
define the output format of a composite summary image of user-crafted
summaries;
FIG. 11 is a flow diagram which sets forth the steps in for generating
user-crafted summaries of searches;
FIGS. 12, 13, and 14 illustrate three different examples of summary images
created using the steps outlined in FIG. 10;
FIG. 15 is a flow diagram which sets forth the steps for sorting document
images according to the similarities between layout objects segmented from
document images;
FIG. 16 is a flow diagram which sets forth one embodiment for sorting the
set of image segments at step 1508 shown in FIG. 15;
FIG. 17 illustrates a grouping of image segments that is formed using the
method set forth in FIGS. 15 and 16;
FIG. 18 is a flow diagram which sets forth an embodiment for sorting layout
objects segmented from document images by their similarity to a specified
layout object;
FIG. 19 illustrates an example in which features of document images are
sorted according to the similarity of a feature in a specified document
image;
FIG. 20 is a flow diagram setting forth the steps for performing
progressive transmission of document images from the perspective of a
server workstation running the document search and retrieval system;
FIG. 21 illustrates a progressive display profile for defining the order in
which features and attributes of a document image are to be transmitted
and/or displayed;
FIG. 22 illustrates an example page i | | |