|
Description  |
|
|
BACKGROUND OF THE INVENTION
The present invention relates to a method for the automatic identification
of scanned documents in an electronic document capture and storage system.
In electronic document scanning systems, a page of a document is
illuminated then scanned using one of several existing techniques such as
a charged coupled device (CCD) to create a digital image representing a
matrix of the black and white points on the page by a matrix of 0's and
1's. This matrix is then transmitted to a digital computer where it can be
processed, displayed, identified and stored. In some applications the
complete digital image of the document is stored. In other applications
the results of an additional process such as optical character recognition
(OCR) permits the text contained on the document to be stored instead of
the full image of the document.
In all cases, however, the document must be identified for later retrieval
and this step currently requires manual intervention. A data input
operator normally identifies and provides indexing information for each
page of the document as it is entered into the system. Because it is a
manual step, identifying and indexing documents comprise the largest
operating expense of most electronic document capture and storage systems.
This expense limits the benefits of such systems in many applications
because of its high cost.
In a typical application, a data entry operator will enter the
identification of each page of a document into the data base of a system
after the page is scanned. This identification may be composed of as few
as five or six numbers or characters, but often is of significantly
greater length to properly identify the page and allow an efficient
retrieval search of the document at a later time.
Document identification and indexing information usually consists of the
document class (e.g., letter, invoice, credit application, etc.) and a
combination of numeric and alphanumeric characters which uniquely identify
this document within its class (e.g., name of the party sending the
letter, vendor invoice number, etc.). Additionally, certain "key words"
may be added to allow a group of documents within the class to be
simultaneously retrieved (e.g., subject of the letter, type of merchandise
ordered, etc).
To achieve such data entry in a reasonably efficient manner, documents are
often prepared in advance for data entry operators by manual sorting,
pre-encoding, highlighting key fields or other techniques to guide the
flow of the data entry operation.
In applications with standardized documents such as processing Internal
Revenue Service forms, automated processing for sorting and data entry is
possible. In such cases OCR processing has been successfully used to
identify documents created with well known fonts. However, the vast
majority of documents processed by document capture and storage systems do
not have such standardized formats or well known OCR readable fonts.
There is a need, therefore, for a method of automating the document
identification and indexing steps of document capture and storage systems.
The method must have sufficient flexibility to handle non-standardized
document formats as well as varying types of printed fonts, images or
handwritten characters.
SUMMARY OF THE INVENTION
The present invention is a method for automatic identification of scanned
documents in an electronic document capture and storage system. Document
identification is accomplished by a series of programming techniques which
recognize as similar, patterns found in a given document. These patterns
include overall document structure as well as probabilities of the
presence or absence of certain data.
Most documents used in government, the professions or in commerce can be
divided into a limited number of classifications. Within a class, such as
business letters, the documents will contain some similar types of
information (referred to herein as objects). However, this information is
not presented in a standard format. For example, within the class of
business letters most will have objects which can be labeled as a
letterhead, a date, a destination, text, a signature or the originator's
name. Likewise, within the class of invoices most will have a buyer, a
seller, an invoice number, a date and an amount to be paid. But each
document within a class will normally have its objects placed in differing
locations.
In order to identify a document, the system first segments the digitized
image of the document into physical and logical areas of significance and
attempts to label these areas by determining the type of information they
contain. The present invention accomplishes this without using the OCR
techniques of the prior art.
Using a multiple-pass approach to the characterization of the document, the
system attempts to match the segmented areas of the document to
predetermined objects described in its document knowledge base.
The document knowledge base contains structured information relating to
various classes of documents and the objects they each contain. This
knowledge base provides the system with the expected structure and content
of the documents which may be identified by the system. For each segmented
area on the document an object label is assigned to the area by matching
its structural characteristic, i.e., the physical location of the area on
the document, and its relational characteristic, i.e., its location with
respect all the other areas on the document, against the expected physical
and relational characteristics provided by the knowledge base. As a result
of this matching process, a table of evidence and disevidence is built for
each area as containing a particular object. Based on the evidence and
disevidence table, proposed object labels are assigned to the segment
areas and a proposed classification designation is assigned to the
document. The program then begins a verification pass through the
document. Using the proposed object labels and document classification,
the structural and relational characteristics of the areas are again
compared. If the results of the second-pass is the same as the first-pass,
the object labels and document classification are considered verified. If
the results are different, new labels are proposed and the verification
step is repeated. After successful verification, the system then begins
another pass to attempt to match further document characteristics in order
to identify a possible sub-class of the document within the selected
document class. Once the document class and possibly its sub-class as well
as the objects it contains are adequately classified, this information is
stored for later retrieval.
The system has the capability to retrieve, display and print the contents
of entire documents or just identified objects within a document. The
system's capability to retrieve and extract specific objects from a
document rather than displaying only the entire document makes it
practical to use significantly lower resolution CRT screens than required
with prior art document retrieval systems.
An important feature of the system is the ability to associate a labelled
object with its exact position in an image when in either uncompressed or
compressed format. This feature significantly reduces the amount of data
that must be transferred from the storage device to the display device,
enhancing the system performance and making it practical to use software
decompression algorithms at the display station.
Another significant feature of the system is its use of a skew correction
technique on scanned document images or specific objects to increase its
accuracy in properly classifying documents and identifying their objects.
Another feature of the system is its capability to improve its document
knowledge base based on the results of the operation of the system. New
rules are automatically added to this knowledge base regarding novel
classes and subclasses of documents. Decision making is improved by making
use of fuzzy knowledge techniques. This feature makes the system easily
adaptable to new documents as well as new imaging technologies.
Another important feature of the system is the ability to process document
based on known information about the environment or application to improve
or accelerate the recognition process using an execution knowledge base
containing strategy, policy and execution files.
In another embodiment, the system may be used as a pre-processing step to
permit Mark-reading or OCR operations on selected areas of non-formatted
documents containing unstructured information.
The foregoing and other advantageous and distinguishing features of the
invention are described in detail below and are recited in the appended
claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a computer system operational flow chart depicting the presently
preferred embodiment of the invention;
FIG. 2 is a detailed flow chart representing a segmentation program,
functional block 6 of the flow chart shown in FIG. 1.
FIG. 2A is a depiction of an segmented area within a document as defined by
the segmentation program represented in FIG. 2.
FIG. 2B is a depiction of the segmentation of a page created by the
operation of the segmentation program represented in FIG. 2.
FIG. 2C is a depiction of an exemplary horizontal and histogram created by
the operation of the segmentation program represented in FIG. 2.
FIG. 2D is a depiction of an exemplary vertical and histogram created by
the operation of the segmentation program represented in FIG. 2.
FIG. 2E are two tables of exemplary data, of the type created by the
operation of the segmentation program represented in FIG. 2, representing
compressed and decompressed file addresses of objects in documents.
FIGS. 3, 4 & 5 are flow charts depicting a matching and labelling program,
functional block 9 of the flow chart shown in FIG. 1.
FIG. 6 is a flow chart depicting a retrieval and execution program,
functional block 11 of the flow chart shown in FIG. 1.
FIGS. 7A & 7B is a flow chart depicting a learning and correction program,
functional block 12 of the flow chart shown in FIG. 1.
FIG. 8 is a flow chart depicting the procedure used to select an area for
labelling at the line, block or page level, functional blocks 123, 128,
137 of the flow chart shown in FIGS. 7A & 7B.
FIGS. 9A & 9B is a flow chart depicting a skew correction program,
functional block 26 in FIG. 3.
FIGS. 9C, 9D, 9E & 9F are depictions of segmented areas having skew which
may be corrected by the operation of the skew correction program,
functional block 26 in FIG. 3.
FIG. 10 is an exemplary letter used in describing the operation of the
preferred embodiment of the invention shown in FIG. 1.
DETAILED DESCRIPTION
A. Overview of System Operation
The present invention comprises a computerized system capable of
identifying certain documents as well as certain types of information,
referred to as objects, contained in these documents. The system performs
this identification by matching characteristic portions of the documents
with predefined document characteristics contained in a document knowledge
base. The identification process is carried out without using OCR, any
special identification markings on the document, or any human
intervention.
As the current invention addresses a developing field of art, a glossary
has been attached to the present application as Appendix A. The glossary
defines certain terms which may not, at this time, be in common usage by
all those skilled in the art.
Referring to FIG. 1, a system implementing a presently preferred embodiment
of the invention identifies a document and the document's objects using
three knowledge bases and four primary programs. The programs are a
segmentation program 6 (represented in greater detail in FIG. 2); a
matching and labelling program 9 (represented in greater detail in FIGS.
3,4 and 5); a retrieval and execution program 11 (represented in greater
detail in FIG. 6); and, 4) a learning and correction program 12
(represented in greater detail in FIG. 7 A and B. The three knowledge
bases are a document knowledge base 8, which the system utilizes as a
source of document and object identification criteria (described in
Appendix D), an execution knowledge base 15, which is used to direct the
operation of the system (described in Appendix E), and a graphics
knowledge base 14, used to identify handwriting, logos and fonts
(described in Appendix F).
Generally, the system begins with the application of a digital scanning
device 1 to the document. The output of the scanning device is a data
stream containing a digital image of the document in a machine readable
form. The data stream may either be compressed 2 and stored 3 for later
use after decompression 4 then loaded into a computer system memory 5 or
alternatively loaded directly into a computer system memory 5 for
processing. The segmentation program 6 divides the document image into
segments and identifies their characteristics. The matching and labeling
program 9 identifies the document by class and sub-class and labels the
objects in the document. This information is then stored in a file in a
data base 13. The retrieval and execution program 11 displays the result
of the matching and labelling program 9 on a computer screen or printer
(not shown). The learning and correction program 12 creates and updates
the graphic knowledge base 14 and the document knowledge base 8 which are
then used by the segmentation program 6 and the matching and labeling
program 9 during the next iteration of the system.
When first starting the system and after each main program is finished,
control is returned to a menu control program 52. The operator may direct
the system to automatically go to the next program, or he can request to
rerun the current batch of documents with different threshold parameters
or he can run another batch of documents through the same program.
The system identifies a document by dividing its relevant areas into
segments. The structural characteristics, i.e., the segment's physical
location of the area on the document, and the relational characteristic,
i.e., the segment's location with respect all the other segments in the
document are matched against the expected physical and relational
characteristics provided by the document knowledge base. The prestored
logical representations of these areas of the document within the document
knowledge base are called objects.
A knowledge base may be described as a computer file containing information
about a problem represented in a structured fashion. In the case of a
document knowledge base, the information relates to the structure and
characteristics used to identify documents. Knowledge based systems have
been used in expert and vision systems as a method of making programs
independent of the problem processed and to permit them to be data driven.
In the presently preferred embodiment of the invention, knowledge about the
use of the system, as opposed to knowledge about the documents themselves
is stored in the execution knowledge base 15. The execution knowledge base
is used as a pilot to the programs discussed above, selecting the
appropriate program and guiding the system in the best direction for
efficient operation. Guidance is provided by focusing the attention of the
system on the relevant segments and objects, monitoring the depth of
recognition applied to each document and determining the necessary
threshold levels needed to make and verify classification decisions. The
execution knowledge base is also a feedback mechanism allowing the
selection of actions on certain area or documents to be taken
automatically by triggering a program upon encountering a prespecified
condition.
Each of the above-described programs first reads the execution knowledge
base 15 to load its initial parameters or options. These parameters will
reduce the search space or the set of all possible solutions, according to
the stored knowledge on the specific documents being processed. At the end
of each pass through each program, each program consults the execution
knowledge base 15 to verify if the current status of the program has
triggered a new situation that will modify the search path of the program
(i.e the way the program will search to identify documents or document
areas by calling the various matching routines), or the conditions of the
process (i.e., the level of threshold used in a program to accept a match
or selection as successful).
The document knowledge base 8, described in greater detail below and in
Appendix D, is used to create a long term memory table (LTM) of known
documents and their respective objects. This information is matched with
the document image segments in the segmentation program 6 resulting in the
application of labels to the segmented areas beginning with the smallest
simplest areas up to the area representing the entire document. The table
of these labels is stored in a segmentation parameters file 7 which is
used to create a short term memory table (STM) for the current document.
The graphic knowledge base 14 contains a description of all the known
graphic types that can be found on all the known documents such as
typewritten, handwritten, printed, graphic logos, etc., as well as the
subclasses for each graphic type, such as particular fonts for typewritten
and printed graphics.
A hypothesis for the classification of a document and the labeling of its
objects is built using global blackboard representation within the
matching and labeling program 9. This program then uses fuzzy logic
methods to evaluate proposed matches between objects and areas.
Global blackboard representation, well known in the art, is a way of
representing and controlling knowledge in a computer model when search
paths are unpredictable and where knowledge is presented in several layers
representing levels of specificity of the description. For further
information regarding this technique see: A. R. Hanson and E. M. Riseman,
"Segmentation of Natural Scenes," in Computer Vision Systems, (A. Hanson
and E. Riseman, Eds.), Academic Press, N.Y., 1978, which is incorporated
herein by reference.
Fuzzy logic methods, are also well known and are used when objects are not
represented by a finite number but rather a distribution of variations as
used in pattern recognition techniques. The rules of inference are
approximate rather than exact in order to manipulate information that is
incomplete or imprecise. In the case of the present system, objects can
have a whole range of possible size, location physical relationships, and
are represented by a distribution of possible numbers. For further
information regarding fuzzy logic methods see Abraham Kandel, Fuzzy
Mathematical Techniques, Addison-Wesley Publishing Co., Reading, Mass.,
1986, which is incorporated herein by reference.
Having introduced the operation and interrelationship of the three
knowledge bases and the four major programs of the system, a description
of the operation of these knowledge bases and programs follows below.
B. Knowledge Base and Program Operation
Referring again to FIG. 1, a document is scanned using a digital scanning
device 1. The scanning device converts the visual image of the document
into a digitized image comprised of matrix of bits which are 0N or OFF,
representing, respectively, black and white points of the documents visual
image. This matrix of bits is sent scanline by scanline through an
appropriate interface to a general purpose computer memory 5. The
resolution of the scanned image is significant only to the extent that
higher resolutions will slow the process. The digital image can be
optionally compressed 2 and stored as data 3 for use at a later time. When
stored, the data is decompressed 4 before being placed in memory 5. The
decompressed data digital image may additionally be used by the retrieval
and execution program 11 to provide a visual display of the document on a
CRT screen.
Images from digital scanning devices which use several bits to represent
levels of gray or color for each point on the visual image require
filtering through a thresholding routine in the segmentation program 11 to
reduced the digital image to a black and white binary image. As this is
well known in the art it is not described here. The segmentation program
6, which is described in greater detail below, processes the binary image
scanline per scanline to determine areas of significance. Each area having
significant black pixel content will be isolated and its structural and
relational characteristics will be extracted. The program operates in a
bottom up fashion beginning by isolating individual words or symbols, then
groups words in text lines.
The segmentation parameters of each segmented area, such as width and
height, X and Y coordinates of its location, black pixel density, texture
, the number of sub-areas and the address of the area in the compressed
image file are calculated. The type of graphics in the segmented area are
matched with the known graphic types in the graphic knowledge base 14 and
the results are stored in a segmentation parameters file 7.
In an alternate embodiment of the invention, the above segmentation process
is achieved by vectorization of the binary image, producing, when expanded
in binary form, the same results for storage in the segmentation
parameters file 7 as provided by the segmentation program 6.
The segmentation parameters file 7 contains the short term memory table
(STM) information for the document. The initial results stored will become
the lowest layer of the document structure.
The segmentation program also initializes a file for the document under
consideration by creating a new record with an "undefined" label in the
data base 13.
The document knowledge base 8, contains the long term memory table (LTM)
information of the system as a structure of three or more layers for each
known document. Each layer represents a logical level in the structure of
a document, the lowest layer describing the most elementary object and the
highest layer describing the most complete object. The objects are
described by their physical representation, their complexity and how they
appear to the operator of the system. For use of the current invention for
the identification of documents the layers may be defined as follows:
The lowest layer is a word containing several characters;
The next higher layer is a line containing several words;
The next higher layer is a block containing several lines;
The next higher layer is a page containing several blocks;
The next higher layer is a document containing several pages; and
The highest layer is a folder contains several documents.
The number of layers used is not fixed but it has been found that a minimum
of three layers is needed for the system to function satisfactorily. In
the presently preferred embodiment of the invention, three layers are used
comprising a page layer, a block layer and a line layer. Because the
present system embodiment limits the size of a document to one page, the
words "page" and "document" are used interchangeably in this description.
The document knowledge base 8 contains information relative to pages,
blocks and lines of known documents together with their structural and
relational characteristics, each located in their respective layers.
The matching and labeling program 9 attempts to match the characteristics
of each segmented area of the document stored in the segmentation
parameters file 7, with those of the objects described in the document
knowledge base 8, layer by layer, from the bottom up. In order to avoid
the combinational explosion which would result from attempting to match
all characteristics of all segmented areas with all characteristics of all
objects for all layers, a focus of attention method is applied in the
matching and labeling program in a top down manner. Matching attempts are
made only with objects possessing sufficient structural evidence at both
the highest layer and the current processing layer. This method is well
known in the art and is used in artificial intelligence programs as a
control mechanism when exploring complex problems with numerous possible
paths. It limits path selection by determining and selecting the path that
has the best chance of being successful. For further information on focus
of attention methods see The Handbook on Artificial Intelligence, Vol III
(P.R. Cohen & E. A. Feigenbaum, Eds.), HeurisTech Press, Stanford, 1982,
which is incorporated herein by reference.
A focus of attention control routine is used by the matching and labeling
program 9 whenever a decision is needed of what to do next. When a layer
of the document knowledge base has been processed, the focus of attention
routine will consult the execution knowledge base 15 to verify if the
result of the current process has created a condition that requires a
change to one of the other strategies described in the execution knowledge
base.
A successful match is found when the characteristics of a segmented area
from the segmentation parameters file 7 are sufficiently similar to the
characteristics of an object in the document knowledge base 8, provided
the match is not contradicted by another relation in another layer of the
document knowledge base. Stated differently, a match is found when
characteristics matching by the matching and labeling program 9 has
accumulated enough evidence that the area under analysis is the object to
which it is matched, using fuzzy logic mathematics.
Fuzzy logic is an approach to approximate reasoning in which truth values,
in this instance regarding the matching of area and object
characteristics, are defined as possibility distributions that carry
linguistic labels such as very true, true, not very true, not true, etc. A
range of numbers are used to quantify these imprecise evidences. A value
above a determined threshold is considered as an acceptable match.
Threshold values are determined by the policy file of the execution
knowledge base or set by default.
Once an acceptable match is found between an area and an object, a
preliminary label is assigned to the area 10 and temporarily stored in the
labeled segmentation parameter file 10. This preliminary label is, subject
to verification at an upper layer of the Document knowledge base 8 by
matching the areas's structure with the structure of the class and
sub-class selected for the document.
Data base 13 is updated by the matching and labeling program 9 with all the
finally determined labels of the areas, the access parameters of the
areas, and the class and sub-class of the document.
The retrieval and execution program 11 retrieves and displays all or
selected areas of documents. Retrieval is accomplished by searching
through the data base 13 for a match on a requested document class or
sub-class, or for a match on a document or area characteristic. Images of
documents can be displayed completely or objects representing areas of the
document can be selected and displayed. The retrieval and execution
program 11 accesses the execution knowledge base 15 for selective display
information, the labelled segmentation parameters file 10 for specific
coordinates of the areas to display, and the data base 13 for
classification parameters and digitized document images.
Since all possible documents and all possible objects in these documents
cannot be known in advance, knowledge bases 8 and 14 require periodic
updating. This is accomplished by the learning and correction program 12.
This program relieves the operator of the system from having to manually
create and update the objects in the document knowledge base 8. All
structural and relational characteristics of new selected object are
automatically calculated and written in document knowledge base 8. All new
or modified graphic types found are described and saved in the graphic
knowledge base 14.
When a system operator creates a new object which may have one or more
sub-objects in various positions, the learning and correction program 12
will automatically calculate the relations and create the appropriate
entry in the knowledge base for this object 8.
The learning and correction program 12 also will automatically create
unidentified sub-classes of objects corresponding to common criteria. The
program requests the assignment of a label to that sub-class from the
operator, then labels all objects classified with that criteria.
Having concluded a general description of the operation of the four main
programs, the three knowledge bases and the other components of the
currently preferred embodiment of the invention, a description of greater
detail of the programs and knowledge bases follows below.
C. The Segmentation Program
FIG. 2 shows a detailed flow of the operation of the segmentation program
shown as functional block 6 in the General Flow Diagram of FIG. 1.
Generally, the segmentation program processes the digitized image of a page
into separate segmented areas containing sufficient black pixels to be
significant and then extracts their structural characteristics. The
separation is achieved by building the X and Y histograms of the scanlines
and determining the beginning and end of areas of significance by
comparing the histogram values with minimum threshold values. The
histograms are first-order statistics of the frequency distribution of
black pixels over a section of the selected area of the page in both
vertical and horizontal directions. The structural characteristics of each
segmented area consists of its height, width, density, number of
sub-areas, calculated textural features and the resulting graphic type of
the area. Dimensions are represented in absolute real measurement, density
is the ratio of the number of black pixels found in the area over the
maximum number of pixels. The concept of textural features, well known in
the art, is used to discriminate segmented areas. Textural features are
extracted by calculating the ratio of pixels histograms in various
sections of the area. For a further discussion of textural features see
The Handbook of Artificial Intelligence, Vol. III, Ibid. Any number of
textural features can be used, but four histograms ratios representing the
four quadrants of each segment have been found acceptable. The ratios are,
left and right top to left and right bottom, top and bottom left to top
and bottom right, top left to bottom right and top right to bottom left.
The graphic type of an area depends on its structural characteristics.
Graphic types are assigned descriptions in the graphic type knowledge base
14. Several standard types have been used, but additional types can be
added if a finer knowledge of the type characteristics of an area is
required. Currently used descriptions are: printed text, handwriting,
typewritten, signature, graphic, photographic, black, horizontal,
vertical, small and large.
A typical area segment is depicted in FIG. 2A. The segment has a width
direction represented by 200, a height direction represented by 202, and
three sub-areas, 204, 206 and 208. The segment is divided into four
quadrants by dashed lines. The quadrants are top left 210, top right 212
bottom left 214 and bottom right 216.
The histogram ratios for the entire area segment are compiled by the
application in the above described ratios of the histogram totals for each
of the sub-areas. The histogram totals are calculated for that portion of
the height and width of each sub-area which is located in each quadrant.
For example, the height and width measurements for sub-area 204 are taken
for that portion of the sub-area which extends into each of the quadrants
210, 212, 214 and 216. It can be seen that nonzero measurements would
exist only in quadrants 210 and 212. The histogram totals are,
nevertheless, calculated for all quadrants using the height and width
portions of the sub-area measured in each quadrant. In this example the
ratios would be calculated substituting the total for each quadrant in the
following ratio formulas: (210+212)/(214+216), (210+214)/(212+216),
210/216 and 212/214. Six ratios, three based on the height of the sub-area
in each quadrant and three based on the width of the sub-area are
generated. This process is repeated for each sub-area in the area segment.
The compilation of all these ratios provides the parameters of the
textural features of the area segment.
For the lowest layer of the document, each area segmented corresponds to a
word or a symbol. Horizontally compatible sub-areas are grouped together
on the assumption that, such that they are words contained in lines of
text.
A count of characters contained in the words is done by statistical
computation, without actually identifying the characters' exact location.
In another embodiment of this invention, a lower layer segmentation is
performed for individual characters to permit post system processing of
the document by an OCR system.
A further description of the operation of the segmentation program follows
with reference to FIGS. 2 and 2B. The description corresponds to the
segmentation of a single scanned page. A digital image has been stored in
the computer memory 5.
In order to improve the quality of the segmentation process, the image is
processed to reduce the skew by one of several possible methods. The
method used in the presently preferred embodiment of the invention is set
forth below and in further detail in Appendix B. Generally it consists of
correcting the skew on each line scanned with a value calculated by the
program.
Referring to FIG. 2B, a page is segmented in a top down and left to right
fashion. The program first searches in vertical direction until it detects
the end of an area, then it segments that area in the horizontal
direction. The segmentation program 6 then fetches the document scanline
per scanline and determines if the scanline contains relevant information
by checking the number of dark pixels in the scanline against a minimum
threshold number. If that threshold is reached, an open area trigger is
set 220. Further scanlines are fetched by the program until the number of
dark pixels in a scanline falls below a minimum value. The program then
closes the vertical segmentation for the area 221. The process is then
repeated for the horizontal axis. When the number of dark pixels on the
scanline drops below a minimum threshold, it indicates the end of a
segmentation area 222.
If the entire line has not been segmented, the program continues its search
on the horizontal axis for another area to segment 223. Referring now to
FIG. 2, the program begins the segmentation by initializing all
housekeeping parameters including resetting histograms values to zero,
repositioning a position pointer to the first line, resetting open and
close threshold parameters, resetting scanline width to scan full document
width, loading from the execution knowledge base 15, needed execution
parameters such as levels of threshold to be used by the program, number
of lines to segment, depth of the segmentation, resolution of the images
to segment, orientation of the segmentation process, the known structural
characteristics of the documents to be processed such as the minimum size
and shape of area to segment and scanner generated or default skew value
16.
Execution parameters restrict the program to the parameters provided in
order to accelerate the segmentation process and simplify the matching and
labelling process later.
The segmentation program scans the page in memory 5 scan line by scan line
and adjusts the line with the skew parameter from the execution knowledge
base 15 or the calculated skew 17. The program calculates two tables of
information referred to as the horizontal and vertical histograms. The
tables are referred to as histograms after the way in which they may be
displayed.
Exemplary horizontal and vertical histograms are illustrated in FIGS. 2C
and 2D for the line of text "This information is for illustration". In
FIG. 2C the horizontal axis 234 represents the horizontal extent of the
scanline. The vertical axis 230 represents the sum of all the dark pixels
found at each location represented on the horizontal axis. The sentence
"This information is for illustration" translates to a set of data points
representing the number of dark pixels located in small vertical slices of
the characters making up the sentence. A connected line 232 has been drawn
approximating these data points. The words of the sentence 236 have been
placed above the line 232 to show the correspondence between the data
points and the sentence.
An illustration of a vertical histogram may be seen in FIG. 2D. The
vertical axis 238 represents the vertical extent of the scanline. The
horizontal axis 240 represents the sum of the dark pixels found at each
location represented on the vertical axis. The data points for the same
sentence as above represent the number of dark pixels located in small
horizontal slices through the sentence. A line 242 has been drawn
approximating these data points.
Referring again to FIG. 2, the program test 30 determines whether it has
reached the last line of the page. If it has, all opened areas are closed
31. If not, the program begins computing the number of black pixels of the
scanned line for the entire width of the page and sets this value in the
vertical histogram for that scan line. It then adds each pixel of the scan
line to the horizontal histogram for the area under consideration 18.
The program tests 19 whether an area is already currently open for
segmentation. If an "open area" flag is off, a threshold test 20 on the
vertical histogram value is performed to determine whether the current
potential area has enough black pixels to be relevant. An "open area" flag
is set 33 when the vertical histogram for that scanline reaches a minimum
value set by the rules from the execution knowledge base 15. The rule in
the presently preferred embodiment of the invention is equivalent by
default to a 2.5 mm section of black line or less depending on the width
of the area processed. This threshold value is reduced dynamically
depending on the width of the area under consideration.
After an "open area" is set the next scanline | | |