WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Document identification by characteristics matching    
United States Patent5159667   
Link to this pagehttp://www.wikipatents.com/5159667.html
Inventor(s)Borrey; Roland G. (19251 Canyon Dr., Villa Park, CA 92667); Borrey; Daniel G. (19251 Canyon Dr., Villa Park, CA 92667)
AbstractThis invention relates to an automatic identification method for scanned documents in an electronic document capture and storage system. The invention uses the technique of recognition of global document features compared to a knowledge base of known document types. The system first segments the digitized image of a document into physical and logical areas of significance and attempts to label these areas by determining the type of information they contain, without using OCR techniques. The system then attempts to match the areas segmented to objects described in the knowledge base. The system labels the areas successfully matched then selects the most probable document type based on the areas found within the document. Using computer learning methods, the system is capable of improving its knowledge of the documents it is supposed to recognize, by dynamically modifying the characteristics of its knowledge base thus sharpening its decision making capability.



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5159667
Document identification by characteristics matching - US Patent 5159667 Drawing
Document identification by characteristics matching
Inventor     Borrey; Roland G. (19251 Canyon Dr., Villa Park, CA 92667); Borrey; Daniel G. (19251 Canyon Dr., Villa Park, CA 92667)
Owner/Assignee    
Patent assignment
All assignments
Publication Date     October 27, 1992
Application Number     07/359,839
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     May 31, 1989
US Classification     715/500 382/171 382/180 382/305 706/900
Int'l Classification     G09C 001/00 G01L 001/06
Examiner     Hecker; Stuart N.
Assistant Examiner     Rudolph; Rebecca L.
Attorney/Law Firm     Christie, Parker & Hale
Address
Parent Case    
Priority Data    
USPTO Field of Search     364/200 MS File 364/900 MS File 364/513 364/518 382/46
Patent Tags     document identification characteristics matching
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
4965763
Zamora
704/1
Oct,1990

[0 after 0 votes]
4881178
Holland
706/12
Nov,1989

[0 after 0 votes]
4866784
Barski
382/289
Sep,1989

[0 after 0 votes]
4835690
Gangarosa
600/410
May,1989

[0 after 0 votes]
4831447
Lake, Jr.
348/597
May,1989

[0 after 0 votes]
4633507
Cannistra
382/175
Dec,1986

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A computer-implemented process for classifying documents comprising the steps of:

preliminarily creating a knowledge base of documents each characterized by a hierarchy of objects that are defined by parameters indicating physical and relational characteristics, the hierarchy being organized from a lowest object level to one or more successively higher object levels and storing said knowledge base in a computer;

scanning a document to form binary light and dark pixels and inputting into said computer data representing the pixels;

performing, in said computer, the following steps;

segmenting the document into primary areas of significance based on the pixels;

calculating parameters that define the segmented primary areas;

comparing the parameters of each segmented primary area with the parameters of the lowest level objects in the hierarchy of objects that characterize each document in the knowledge base;

assigning to each segmented primary area weights of evidence relative to the lowest level objects based on the comparison;

generating a weighted hypothesis of a label for each of the segmented areas based on the weights of evidence relative to the lowest level objects;

grouping the segmented primary areas into areas of significance more relevant than the primary areas;

calculating parameters that define the more relevant areas;

comparing the parameters of each more relevant area with the parameters of the second lowest level objects in the hierarchy;

assigning to each more relevant area weights of evidence relative to the second lowest level objects based on the comparison and reevaluating the weights of evidence assigned to the segmented primary areas;

generating a weighted hypothesis of a label for each of the more relevant areas and revising the weighted hypothesis of the label for each of the segmented primary areas based on the weights of evidence of the second lowest level objects and the lowest level objects; and

classifying the document based on the labels and the weights of evidence developed by the preceding step.

2. The process of claim 1, additionally comprising the steps of:

grouping the more relevant areas into still more relevant areas of significance;

calculating parameters that define the still more relevant areas;

comparing the parameters of each still more relevant area with the parameters of the third lowest level objects in the hierarchy;

assigning to each still more relevant area weights of evidence relative to the third lowest level objects based on the comparison and reevaluating the weights of evidence assigned to the more relevant areas and the segmented primary areas;

generating a weighted hypothesis of a label for each of the still more relevant areas and revising the weighted hypothesis of the label for each of the more relevant areas and the segmented primary areas based on the weights of evidence of the third lowest level objects, the second lowest level objects and the lowest level objects; and

classifying the document based on the labels and the weights of evidence developed by the preceding step.

3. The process of claim 2, in which the recited steps are performed one or more additional times with respect to successively more relevant areas of significance and higher level objects, thereby increasing the evidence that supports the document classification.

4. The process of claim 1, additionally comprising the step of forming from the pixels darkness density histograms, the segmenting step segmenting the document based on the histograms.

5. The process of claim 4, additionally comprising the steps of:

comparing the density histograms of one side of the individual objects with the density histograms of the other side of the individual objects;

determining the vertical shift between the histograms of the two sides of the individual objects;

averaging the determined vertical shift of the individual objects; and

correcting the skew of an entire edge of the document based on the averaged vertical shift.
 Description Submit all comments and votes
 


BACKGROUND OF THE INVENTION

The present invention relates to a method for the automatic identification of scanned documents in an electronic document capture and storage system.

In electronic document scanning systems, a page of a document is illuminated then scanned using one of several existing techniques such as a charged coupled device (CCD) to create a digital image representing a matrix of the black and white points on the page by a matrix of 0's and 1's. This matrix is then transmitted to a digital computer where it can be processed, displayed, identified and stored. In some applications the complete digital image of the document is stored. In other applications the results of an additional process such as optical character recognition (OCR) permits the text contained on the document to be stored instead of the full image of the document.

In all cases, however, the document must be identified for later retrieval and this step currently requires manual intervention. A data input operator normally identifies and provides indexing information for each page of the document as it is entered into the system. Because it is a manual step, identifying and indexing documents comprise the largest operating expense of most electronic document capture and storage systems. This expense limits the benefits of such systems in many applications because of its high cost.

In a typical application, a data entry operator will enter the identification of each page of a document into the data base of a system after the page is scanned. This identification may be composed of as few as five or six numbers or characters, but often is of significantly greater length to properly identify the page and allow an efficient retrieval search of the document at a later time.

Document identification and indexing information usually consists of the document class (e.g., letter, invoice, credit application, etc.) and a combination of numeric and alphanumeric characters which uniquely identify this document within its class (e.g., name of the party sending the letter, vendor invoice number, etc.). Additionally, certain "key words" may be added to allow a group of documents within the class to be simultaneously retrieved (e.g., subject of the letter, type of merchandise ordered, etc).

To achieve such data entry in a reasonably efficient manner, documents are often prepared in advance for data entry operators by manual sorting, pre-encoding, highlighting key fields or other techniques to guide the flow of the data entry operation.

In applications with standardized documents such as processing Internal Revenue Service forms, automated processing for sorting and data entry is possible. In such cases OCR processing has been successfully used to identify documents created with well known fonts. However, the vast majority of documents processed by document capture and storage systems do not have such standardized formats or well known OCR readable fonts.

There is a need, therefore, for a method of automating the document identification and indexing steps of document capture and storage systems. The method must have sufficient flexibility to handle non-standardized document formats as well as varying types of printed fonts, images or handwritten characters.

SUMMARY OF THE INVENTION

The present invention is a method for automatic identification of scanned documents in an electronic document capture and storage system. Document identification is accomplished by a series of programming techniques which recognize as similar, patterns found in a given document. These patterns include overall document structure as well as probabilities of the presence or absence of certain data.

Most documents used in government, the professions or in commerce can be divided into a limited number of classifications. Within a class, such as business letters, the documents will contain some similar types of information (referred to herein as objects). However, this information is not presented in a standard format. For example, within the class of business letters most will have objects which can be labeled as a letterhead, a date, a destination, text, a signature or the originator's name. Likewise, within the class of invoices most will have a buyer, a seller, an invoice number, a date and an amount to be paid. But each document within a class will normally have its objects placed in differing locations.

In order to identify a document, the system first segments the digitized image of the document into physical and logical areas of significance and attempts to label these areas by determining the type of information they contain. The present invention accomplishes this without using the OCR techniques of the prior art.

Using a multiple-pass approach to the characterization of the document, the system attempts to match the segmented areas of the document to predetermined objects described in its document knowledge base.

The document knowledge base contains structured information relating to various classes of documents and the objects they each contain. This knowledge base provides the system with the expected structure and content of the documents which may be identified by the system. For each segmented area on the document an object label is assigned to the area by matching its structural characteristic, i.e., the physical location of the area on the document, and its relational characteristic, i.e., its location with respect all the other areas on the document, against the expected physical and relational characteristics provided by the knowledge base. As a result of this matching process, a table of evidence and disevidence is built for each area as containing a particular object. Based on the evidence and disevidence table, proposed object labels are assigned to the segment areas and a proposed classification designation is assigned to the document. The program then begins a verification pass through the document. Using the proposed object labels and document classification, the structural and relational characteristics of the areas are again compared. If the results of the second-pass is the same as the first-pass, the object labels and document classification are considered verified. If the results are different, new labels are proposed and the verification step is repeated. After successful verification, the system then begins another pass to attempt to match further document characteristics in order to identify a possible sub-class of the document within the selected document class. Once the document class and possibly its sub-class as well as the objects it contains are adequately classified, this information is stored for later retrieval.

The system has the capability to retrieve, display and print the contents of entire documents or just identified objects within a document. The system's capability to retrieve and extract specific objects from a document rather than displaying only the entire document makes it practical to use significantly lower resolution CRT screens than required with prior art document retrieval systems.

An important feature of the system is the ability to associate a labelled object with its exact position in an image when in either uncompressed or compressed format. This feature significantly reduces the amount of data that must be transferred from the storage device to the display device, enhancing the system performance and making it practical to use software decompression algorithms at the display station.

Another significant feature of the system is its use of a skew correction technique on scanned document images or specific objects to increase its accuracy in properly classifying documents and identifying their objects.

Another feature of the system is its capability to improve its document knowledge base based on the results of the operation of the system. New rules are automatically added to this knowledge base regarding novel classes and subclasses of documents. Decision making is improved by making use of fuzzy knowledge techniques. This feature makes the system easily adaptable to new documents as well as new imaging technologies.

Another important feature of the system is the ability to process document based on known information about the environment or application to improve or accelerate the recognition process using an execution knowledge base containing strategy, policy and execution files.

In another embodiment, the system may be used as a pre-processing step to permit Mark-reading or OCR operations on selected areas of non-formatted documents containing unstructured information.

The foregoing and other advantageous and distinguishing features of the invention are described in detail below and are recited in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a computer system operational flow chart depicting the presently preferred embodiment of the invention;

FIG. 2 is a detailed flow chart representing a segmentation program, functional block 6 of the flow chart shown in FIG. 1.

FIG. 2A is a depiction of an segmented area within a document as defined by the segmentation program represented in FIG. 2.

FIG. 2B is a depiction of the segmentation of a page created by the operation of the segmentation program represented in FIG. 2.

FIG. 2C is a depiction of an exemplary horizontal and histogram created by the operation of the segmentation program represented in FIG. 2.

FIG. 2D is a depiction of an exemplary vertical and histogram created by the operation of the segmentation program represented in FIG. 2.

FIG. 2E are two tables of exemplary data, of the type created by the operation of the segmentation program represented in FIG. 2, representing compressed and decompressed file addresses of objects in documents.

FIGS. 3, 4 & 5 are flow charts depicting a matching and labelling program, functional block 9 of the flow chart shown in FIG. 1.

FIG. 6 is a flow chart depicting a retrieval and execution program, functional block 11 of the flow chart shown in FIG. 1.

FIGS. 7A & 7B is a flow chart depicting a learning and correction program, functional block 12 of the flow chart shown in FIG. 1.

FIG. 8 is a flow chart depicting the procedure used to select an area for labelling at the line, block or page level, functional blocks 123, 128, 137 of the flow chart shown in FIGS. 7A & 7B.

FIGS. 9A & 9B is a flow chart depicting a skew correction program, functional block 26 in FIG. 3.

FIGS. 9C, 9D, 9E & 9F are depictions of segmented areas having skew which may be corrected by the operation of the skew correction program, functional block 26 in FIG. 3.

FIG. 10 is an exemplary letter used in describing the operation of the preferred embodiment of the invention shown in FIG. 1.

DETAILED DESCRIPTION

A. Overview of System Operation

The present invention comprises a computerized system capable of identifying certain documents as well as certain types of information, referred to as objects, contained in these documents. The system performs this identification by matching characteristic portions of the documents with predefined document characteristics contained in a document knowledge base. The identification process is carried out without using OCR, any special identification markings on the document, or any human intervention.

As the current invention addresses a developing field of art, a glossary has been attached to the present application as Appendix A. The glossary defines certain terms which may not, at this time, be in common usage by all those skilled in the art.

Referring to FIG. 1, a system implementing a presently preferred embodiment of the invention identifies a document and the document's objects using three knowledge bases and four primary programs. The programs are a segmentation program 6 (represented in greater detail in FIG. 2); a matching and labelling program 9 (represented in greater detail in FIGS. 3,4 and 5); a retrieval and execution program 11 (represented in greater detail in FIG. 6); and, 4) a learning and correction program 12 (represented in greater detail in FIG. 7 A and B. The three knowledge bases are a document knowledge base 8, which the system utilizes as a source of document and object identification criteria (described in Appendix D), an execution knowledge base 15, which is used to direct the operation of the system (described in Appendix E), and a graphics knowledge base 14, used to identify handwriting, logos and fonts (described in Appendix F).

Generally, the system begins with the application of a digital scanning device 1 to the document. The output of the scanning device is a data stream containing a digital image of the document in a machine readable form. The data stream may either be compressed 2 and stored 3 for later use after decompression 4 then loaded into a computer system memory 5 or alternatively loaded directly into a computer system memory 5 for processing. The segmentation program 6 divides the document image into segments and identifies their characteristics. The matching and labeling program 9 identifies the document by class and sub-class and labels the objects in the document. This information is then stored in a file in a data base 13. The retrieval and execution program 11 displays the result of the matching and labelling program 9 on a computer screen or printer (not shown). The learning and correction program 12 creates and updates the graphic knowledge base 14 and the document knowledge base 8 which are then used by the segmentation program 6 and the matching and labeling program 9 during the next iteration of the system.

When first starting the system and after each main program is finished, control is returned to a menu control program 52. The operator may direct the system to automatically go to the next program, or he can request to rerun the current batch of documents with different threshold parameters or he can run another batch of documents through the same program.

The system identifies a document by dividing its relevant areas into segments. The structural characteristics, i.e., the segment's physical location of the area on the document, and the relational characteristic, i.e., the segment's location with respect all the other segments in the document are matched against the expected physical and relational characteristics provided by the document knowledge base. The prestored logical representations of these areas of the document within the document knowledge base are called objects.

A knowledge base may be described as a computer file containing information about a problem represented in a structured fashion. In the case of a document knowledge base, the information relates to the structure and characteristics used to identify documents. Knowledge based systems have been used in expert and vision systems as a method of making programs independent of the problem processed and to permit them to be data driven.

In the presently preferred embodiment of the invention, knowledge about the use of the system, as opposed to knowledge about the documents themselves is stored in the execution knowledge base 15. The execution knowledge base is used as a pilot to the programs discussed above, selecting the appropriate program and guiding the system in the best direction for efficient operation. Guidance is provided by focusing the attention of the system on the relevant segments and objects, monitoring the depth of recognition applied to each document and determining the necessary threshold levels needed to make and verify classification decisions. The execution knowledge base is also a feedback mechanism allowing the selection of actions on certain area or documents to be taken automatically by triggering a program upon encountering a prespecified condition.

Each of the above-described programs first reads the execution knowledge base 15 to load its initial parameters or options. These parameters will reduce the search space or the set of all possible solutions, according to the stored knowledge on the specific documents being processed. At the end of each pass through each program, each program consults the execution knowledge base 15 to verify if the current status of the program has triggered a new situation that will modify the search path of the program (i.e the way the program will search to identify documents or document areas by calling the various matching routines), or the conditions of the process (i.e., the level of threshold used in a program to accept a match or selection as successful).

The document knowledge base 8, described in greater detail below and in Appendix D, is used to create a long term memory table (LTM) of known documents and their respective objects. This information is matched with the document image segments in the segmentation program 6 resulting in the application of labels to the segmented areas beginning with the smallest simplest areas up to the area representing the entire document. The table of these labels is stored in a segmentation parameters file 7 which is used to create a short term memory table (STM) for the current document.

The graphic knowledge base 14 contains a description of all the known graphic types that can be found on all the known documents such as typewritten, handwritten, printed, graphic logos, etc., as well as the subclasses for each graphic type, such as particular fonts for typewritten and printed graphics.

A hypothesis for the classification of a document and the labeling of its objects is built using global blackboard representation within the matching and labeling program 9. This program then uses fuzzy logic methods to evaluate proposed matches between objects and areas.

Global blackboard representation, well known in the art, is a way of representing and controlling knowledge in a computer model when search paths are unpredictable and where knowledge is presented in several layers representing levels of specificity of the description. For further information regarding this technique see: A. R. Hanson and E. M. Riseman, "Segmentation of Natural Scenes," in Computer Vision Systems, (A. Hanson and E. Riseman, Eds.), Academic Press, N.Y., 1978, which is incorporated herein by reference.

Fuzzy logic methods, are also well known and are used when objects are not represented by a finite number but rather a distribution of variations as used in pattern recognition techniques. The rules of inference are approximate rather than exact in order to manipulate information that is incomplete or imprecise. In the case of the present system, objects can have a whole range of possible size, location physical relationships, and are represented by a distribution of possible numbers. For further information regarding fuzzy logic methods see Abraham Kandel, Fuzzy Mathematical Techniques, Addison-Wesley Publishing Co., Reading, Mass., 1986, which is incorporated herein by reference.

Having introduced the operation and interrelationship of the three knowledge bases and the four major programs of the system, a description of the operation of these knowledge bases and programs follows below.

B. Knowledge Base and Program Operation

Referring again to FIG. 1, a document is scanned using a digital scanning device 1. The scanning device converts the visual image of the document into a digitized image comprised of matrix of bits which are 0N or OFF, representing, respectively, black and white points of the documents visual image. This matrix of bits is sent scanline by scanline through an appropriate interface to a general purpose computer memory 5. The resolution of the scanned image is significant only to the extent that higher resolutions will slow the process. The digital image can be optionally compressed 2 and stored as data 3 for use at a later time. When stored, the data is decompressed 4 before being placed in memory 5. The decompressed data digital image may additionally be used by the retrieval and execution program 11 to provide a visual display of the document on a CRT screen.

Images from digital scanning devices which use several bits to represent levels of gray or color for each point on the visual image require filtering through a thresholding routine in the segmentation program 11 to reduced the digital image to a black and white binary image. As this is well known in the art it is not described here. The segmentation program 6, which is described in greater detail below, processes the binary image scanline per scanline to determine areas of significance. Each area having significant black pixel content will be isolated and its structural and relational characteristics will be extracted. The program operates in a bottom up fashion beginning by isolating individual words or symbols, then groups words in text lines.

The segmentation parameters of each segmented area, such as width and height, X and Y coordinates of its location, black pixel density, texture , the number of sub-areas and the address of the area in the compressed image file are calculated. The type of graphics in the segmented area are matched with the known graphic types in the graphic knowledge base 14 and the results are stored in a segmentation parameters file 7.

In an alternate embodiment of the invention, the above segmentation process is achieved by vectorization of the binary image, producing, when expanded in binary form, the same results for storage in the segmentation parameters file 7 as provided by the segmentation program 6.

The segmentation parameters file 7 contains the short term memory table (STM) information for the document. The initial results stored will become the lowest layer of the document structure.

The segmentation program also initializes a file for the document under consideration by creating a new record with an "undefined" label in the data base 13.

The document knowledge base 8, contains the long term memory table (LTM) information of the system as a structure of three or more layers for each known document. Each layer represents a logical level in the structure of a document, the lowest layer describing the most elementary object and the highest layer describing the most complete object. The objects are described by their physical representation, their complexity and how they appear to the operator of the system. For use of the current invention for the identification of documents the layers may be defined as follows:

The lowest layer is a word containing several characters;

The next higher layer is a line containing several words;

The next higher layer is a block containing several lines;

The next higher layer is a page containing several blocks;

The next higher layer is a document containing several pages; and

The highest layer is a folder contains several documents.

The number of layers used is not fixed but it has been found that a minimum of three layers is needed for the system to function satisfactorily. In the presently preferred embodiment of the invention, three layers are used comprising a page layer, a block layer and a line layer. Because the present system embodiment limits the size of a document to one page, the words "page" and "document" are used interchangeably in this description.

The document knowledge base 8 contains information relative to pages, blocks and lines of known documents together with their structural and relational characteristics, each located in their respective layers.

The matching and labeling program 9 attempts to match the characteristics of each segmented area of the document stored in the segmentation parameters file 7, with those of the objects described in the document knowledge base 8, layer by layer, from the bottom up. In order to avoid the combinational explosion which would result from attempting to match all characteristics of all segmented areas with all characteristics of all objects for all layers, a focus of attention method is applied in the matching and labeling program in a top down manner. Matching attempts are made only with objects possessing sufficient structural evidence at both the highest layer and the current processing layer. This method is well known in the art and is used in artificial intelligence programs as a control mechanism when exploring complex problems with numerous possible paths. It limits path selection by determining and selecting the path that has the best chance of being successful. For further information on focus of attention methods see The Handbook on Artificial Intelligence, Vol III (P.R. Cohen & E. A. Feigenbaum, Eds.), HeurisTech Press, Stanford, 1982, which is incorporated herein by reference.

A focus of attention control routine is used by the matching and labeling program 9 whenever a decision is needed of what to do next. When a layer of the document knowledge base has been processed, the focus of attention routine will consult the execution knowledge base 15 to verify if the result of the current process has created a condition that requires a change to one of the other strategies described in the execution knowledge base.

A successful match is found when the characteristics of a segmented area from the segmentation parameters file 7 are sufficiently similar to the characteristics of an object in the document knowledge base 8, provided the match is not contradicted by another relation in another layer of the document knowledge base. Stated differently, a match is found when characteristics matching by the matching and labeling program 9 has accumulated enough evidence that the area under analysis is the object to which it is matched, using fuzzy logic mathematics.

Fuzzy logic is an approach to approximate reasoning in which truth values, in this instance regarding the matching of area and object characteristics, are defined as possibility distributions that carry linguistic labels such as very true, true, not very true, not true, etc. A range of numbers are used to quantify these imprecise evidences. A value above a determined threshold is considered as an acceptable match. Threshold values are determined by the policy file of the execution knowledge base or set by default.

Once an acceptable match is found between an area and an object, a preliminary label is assigned to the area 10 and temporarily stored in the labeled segmentation parameter file 10. This preliminary label is, subject to verification at an upper layer of the Document knowledge base 8 by matching the areas's structure with the structure of the class and sub-class selected for the document.

Data base 13 is updated by the matching and labeling program 9 with all the finally determined labels of the areas, the access parameters of the areas, and the class and sub-class of the document.

The retrieval and execution program 11 retrieves and displays all or selected areas of documents. Retrieval is accomplished by searching through the data base 13 for a match on a requested document class or sub-class, or for a match on a document or area characteristic. Images of documents can be displayed completely or objects representing areas of the document can be selected and displayed. The retrieval and execution program 11 accesses the execution knowledge base 15 for selective display information, the labelled segmentation parameters file 10 for specific coordinates of the areas to display, and the data base 13 for classification parameters and digitized document images.

Since all possible documents and all possible objects in these documents cannot be known in advance, knowledge bases 8 and 14 require periodic updating. This is accomplished by the learning and correction program 12. This program relieves the operator of the system from having to manually create and update the objects in the document knowledge base 8. All structural and relational characteristics of new selected object are automatically calculated and written in document knowledge base 8. All new or modified graphic types found are described and saved in the graphic knowledge base 14.

When a system operator creates a new object which may have one or more sub-objects in various positions, the learning and correction program 12 will automatically calculate the relations and create the appropriate entry in the knowledge base for this object 8.

The learning and correction program 12 also will automatically create unidentified sub-classes of objects corresponding to common criteria. The program requests the assignment of a label to that sub-class from the operator, then labels all objects classified with that criteria.

Having concluded a general description of the operation of the four main programs, the three knowledge bases and the other components of the currently preferred embodiment of the invention, a description of greater detail of the programs and knowledge bases follows below.

C. The Segmentation Program

FIG. 2 shows a detailed flow of the operation of the segmentation program shown as functional block 6 in the General Flow Diagram of FIG. 1.

Generally, the segmentation program processes the digitized image of a page into separate segmented areas containing sufficient black pixels to be significant and then extracts their structural characteristics. The separation is achieved by building the X and Y histograms of the scanlines and determining the beginning and end of areas of significance by comparing the histogram values with minimum threshold values. The histograms are first-order statistics of the frequency distribution of black pixels over a section of the selected area of the page in both vertical and horizontal directions. The structural characteristics of each segmented area consists of its height, width, density, number of sub-areas, calculated textural features and the resulting graphic type of the area. Dimensions are represented in absolute real measurement, density is the ratio of the number of black pixels found in the area over the maximum number of pixels. The concept of textural features, well known in the art, is used to discriminate segmented areas. Textural features are extracted by calculating the ratio of pixels histograms in various sections of the area. For a further discussion of textural features see The Handbook of Artificial Intelligence, Vol. III, Ibid. Any number of textural features can be used, but four histograms ratios representing the four quadrants of each segment have been found acceptable. The ratios are, left and right top to left and right bottom, top and bottom left to top and bottom right, top left to bottom right and top right to bottom left.

The graphic type of an area depends on its structural characteristics. Graphic types are assigned descriptions in the graphic type knowledge base 14. Several standard types have been used, but additional types can be added if a finer knowledge of the type characteristics of an area is required. Currently used descriptions are: printed text, handwriting, typewritten, signature, graphic, photographic, black, horizontal, vertical, small and large.

A typical area segment is depicted in FIG. 2A. The segment has a width direction represented by 200, a height direction represented by 202, and three sub-areas, 204, 206 and 208. The segment is divided into four quadrants by dashed lines. The quadrants are top left 210, top right 212 bottom left 214 and bottom right 216.

The histogram ratios for the entire area segment are compiled by the application in the above described ratios of the histogram totals for each of the sub-areas. The histogram totals are calculated for that portion of the height and width of each sub-area which is located in each quadrant. For example, the height and width measurements for sub-area 204 are taken for that portion of the sub-area which extends into each of the quadrants 210, 212, 214 and 216. It can be seen that nonzero measurements would exist only in quadrants 210 and 212. The histogram totals are, nevertheless, calculated for all quadrants using the height and width portions of the sub-area measured in each quadrant. In this example the ratios would be calculated substituting the total for each quadrant in the following ratio formulas: (210+212)/(214+216), (210+214)/(212+216), 210/216 and 212/214. Six ratios, three based on the height of the sub-area in each quadrant and three based on the width of the sub-area are generated. This process is repeated for each sub-area in the area segment. The compilation of all these ratios provides the parameters of the textural features of the area segment.

For the lowest layer of the document, each area segmented corresponds to a word or a symbol. Horizontally compatible sub-areas are grouped together on the assumption that, such that they are words contained in lines of text.

A count of characters contained in the words is done by statistical computation, without actually identifying the characters' exact location. In another embodiment of this invention, a lower layer segmentation is performed for individual characters to permit post system processing of the document by an OCR system.

A further description of the operation of the segmentation program follows with reference to FIGS. 2 and 2B. The description corresponds to the segmentation of a single scanned page. A digital image has been stored in the computer memory 5.

In order to improve the quality of the segmentation process, the image is processed to reduce the skew by one of several possible methods. The method used in the presently preferred embodiment of the invention is set forth below and in further detail in Appendix B. Generally it consists of correcting the skew on each line scanned with a value calculated by the program.

Referring to FIG. 2B, a page is segmented in a top down and left to right fashion. The program first searches in vertical direction until it detects the end of an area, then it segments that area in the horizontal direction. The segmentation program 6 then fetches the document scanline per scanline and determines if the scanline contains relevant information by checking the number of dark pixels in the scanline against a minimum threshold number. If that threshold is reached, an open area trigger is set 220. Further scanlines are fetched by the program until the number of dark pixels in a scanline falls below a minimum value. The program then closes the vertical segmentation for the area 221. The process is then repeated for the horizontal axis. When the number of dark pixels on the scanline drops below a minimum threshold, it indicates the end of a segmentation area 222.

If the entire line has not been segmented, the program continues its search on the horizontal axis for another area to segment 223. Referring now to FIG. 2, the program begins the segmentation by initializing all housekeeping parameters including resetting histograms values to zero, repositioning a position pointer to the first line, resetting open and close threshold parameters, resetting scanline width to scan full document width, loading from the execution knowledge base 15, needed execution parameters such as levels of threshold to be used by the program, number of lines to segment, depth of the segmentation, resolution of the images to segment, orientation of the segmentation process, the known structural characteristics of the documents to be processed such as the minimum size and shape of area to segment and scanner generated or default skew value 16.

Execution parameters restrict the program to the parameters provided in order to accelerate the segmentation process and simplify the matching and labelling process later.

The segmentation program scans the page in memory 5 scan line by scan line and adjusts the line with the skew parameter from the execution knowledge base 15 or the calculated skew 17. The program calculates two tables of information referred to as the horizontal and vertical histograms. The tables are referred to as histograms after the way in which they may be displayed.

Exemplary horizontal and vertical histograms are illustrated in FIGS. 2C and 2D for the line of text "This information is for illustration". In FIG. 2C the horizontal axis 234 represents the horizontal extent of the scanline. The vertical axis 230 represents the sum of all the dark pixels found at each location represented on the horizontal axis. The sentence "This information is for illustration" translates to a set of data points representing the number of dark pixels located in small vertical slices of the characters making up the sentence. A connected line 232 has been drawn approximating these data points. The words of the sentence 236 have been placed above the line 232 to show the correspondence between the data points and the sentence.

An illustration of a vertical histogram may be seen in FIG. 2D. The vertical axis 238 represents the vertical extent of the scanline. The horizontal axis 240 represents the sum of the dark pixels found at each location represented on the vertical axis. The data points for the same sentence as above represent the number of dark pixels located in small horizontal slices through the sentence. A line 242 has been drawn approximating these data points.

Referring again to FIG. 2, the program test 30 determines whether it has reached the last line of the page. If it has, all opened areas are closed 31. If not, the program begins computing the number of black pixels of the scanned line for the entire width of the page and sets this value in the vertical histogram for that scan line. It then adds each pixel of the scan line to the horizontal histogram for the area under consideration 18.

The program tests 19 whether an area is already currently open for segmentation. If an "open area" flag is off, a threshold test 20 on the vertical histogram value is performed to determine whether the current potential area has enough black pixels to be relevant. An "open area" flag is set 33 when the vertical histogram for that scanline reaches a minimum value set by the rules from the execution knowledge base 15. The rule in the presently preferred embodiment of the invention is equivalent by default to a 2.5 mm section of black line or less depending on the width of the area processed. This threshold value is reduced dynamically depending on the width of the area under consideration.

After an "open area" is set the next scanline