|
Description  |
|
|
TECHNICAL FIELD
This invention relates to optical character recognition (OCR) devices and
processes, and more particularly to a computer-implemented method for
automatic extraction of data from printed forms.
BACKGROUND OF THE INVENTION
Automatic computer-implemented reading of data from printed forms is
typically done in a sequence of three steps. First a form is optically
scanned to create an electronic image, which is then written in digital
storage as a rectangular array of 0's and 1's representing white and black
subareas or pixels. Then the image is processed to extract regions or
fields containing the data to be read. Finally, the black and white
subimage in each extracted region is interpreted and expressed as an
alphanumeric code, such as ASCII or EBCDIC.
The data present in printed forms may be defined as having two aspects: a
value and a significance. For example, the word "Yes" is a value that
becomes data only when its significance, i.e., the question it answers, is
made clear. Printed forms provide a conventional means for recording data
in which significance is predefined as a background of text and graphics,
such as boxed areas. Since forms are printed mechanically, the background
is identical over different instances of the same form. Thus the position
of data values on the form is in correspondence with the data
significance. Optical character recognition (OCR) devices take advantage
of this fact to read data from credit card receipts, billing statements,
etc. Such "OCR forms" are designed with data values entered in spaces well
separated from background printing to assure that the latter are not
erroneously interpreted as data values. Data significance does not appear
explicitly, but is stored in the computer and associated with the data
values on the basis of position in the image. In some cases, forms are
printed in a color invisible to the scanner to avoid a possibility of
confusion. Data values are carefully positioned during printing, and the
form precisely registered during scanning. All these steps serve to
guarantee that the data values are exactly where the reading or scanning
equipment performs its extraction process.
In recent years, demand has grown for a capability to capture data from
printed forms that do not meet OCR constraints. Forms routinely used in
government and commercial operations, such as birth and marriage
certificates, are designed to be intelligible to the human eye and brain.
While people are sophisticated processors of visual images, they also
require that both attributes of a data element, the significance and the
value, be present on the document. Thus background printing is provided to
supply the meaning of each data field, and lines and boxes are imposed to
make clear the association of data value and data significance. The
crowded appearance of these "people forms", compared with OCR forms, is a
necessary outcome of a requirement to pack a great deal of information
into a limited space.
It is likewise difficult to enforce controls in the preparation of people
forms. A birth or marriage certificate filled out with a typewriter is
registered by eye, often with errors in translation and skew compared to
the ideal orientation. Data values may superimpose on the form background
as a result. The printing process itself is subject to mechanical slippage
that may give the same effect. Finally, mechanical slippage and electronic
noise occurring during the optical scanning process present a further
source of registration error. This is particularly true if economical
general-purpose scanners are used. The net result of all these factors is
that printing of a given data value on people forms may be skewed, may
overlap boundary lines separating data regions, and even when ideally
positioned does not consistently appear in a fixed, predictable region in
scanned images of different instances of the form. These difficulties pose
severe problems for automatic computer-implemented data extraction,
rendering inapplicable the sort of processing used for OCR forms.
SUMMARY OF THE INVENTION
The invention is a computer-implemented method operable with conventional
OCR scanning equipment and software for the automatic extraction of data
from printed forms.
A blank master form is scanned and its digital image stored. Clusters of ON
bits of the master form image are first recognized as part of a line and
then connected to form lines. All of the lines in the master form image
are then identified by row and column start position and column end
position, thereby creating a master-form-description. The resulting image,
which consists only of lines in the master form, can then be displayed.
Regions or masks in the displayed image of master form lines are then
created, each mask corresponding to a field where data would be located in
a filled-in form. Each data mask is spaced from nearby lines by a
predetermined data margin, referred to as D.
A filled-in or data form is then scanned and lines are also recognized and
identified in a similar manner to create a data-form-description. The
data-form-description is compared with the master-form-description by
computing the horizontal and vertical offsets and skew of the two forms
relative to one another. The created data masks, whose orientation with
respect to the master form has been previously determined, are then
transposed into the data form image using the computed values of
horizontal and vertical offsets and skew. In this manner, the data masks
are correctly located on the data form so that the actual data values in
the data form reside within the corresponding data masks. Routines are
then implemented for detecting extraneous data intruding into the data
masks and for growing the masks, i.e. enlarging the masks to capture data
which may extend beyond the perimeter of the masks. Thus, the data masks
are adaptive in that they are grown if data does not lie entirely within
the perimeter of the masks. During the mask growth routine, lines which
are part of the background form are detected and removed by line removal
algorithms.
Following the removal of extraneous data from the masks, the growth of the
masks to capture data, and any subsequent line removal, the remaining data
from the masks is extracted and transferred to a new file. The new file
then contains only data comprising characters of the data values in the
desired regions, which can then be operated on by conventional OCR
software to identify the specific character values.
For a fuller understanding of the nature and advantages of the present
invention reference should be made to the following detailed description
taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic illustrating the relationship of the invention to
conventional hardware and software for scanning an image and converting
the image to stored characters through the use of OCR software;
FIG. 2 is a schematic illustrating the two portions of the data extraction
software designated Phase I and Phase II ;
FIG. 3 is a schematic of the functions present in the Phase I portion of
the data extraction software;
FIG. 4 is a schematic of the functions present in the Phase II portion of
the data extraction software;
FIG. 5 is a schematic of the Copy Masks function of the Phase II portion of
the data extraction software;
FIG. 6 is an illustration of the manner in which horizontal and vertical
adjustments between the master form and the data form are used to register
the data form following scanning;
FIG. 7 is a depiction of an alternative solution to the problem of skew in
the location of a data mask;
FIG. 8 is a graphical illustration of the Grow Bottom function for growing
a data mask to capture data which crosses a mask perimeter;
FIG. 9 is a graphical depiction of a line removal process to capture
character data which is intersected by a line, the line having been
detected during the Grow Bottom function;
FIG. 10 is a representation of ON pixels for data intersected by a line;
FIG. 11 is a representation of the ON pixels depicted as dashes following
detection of the line pixels through the use of the algorithm in Appendix
C;
FIG. 12 is a typical bland birth certificate master form; and
FIG. 13 is a typical filled-in birth certificate data form.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring first to FIG. 1, the conventional hardware and software for
converting characters from a printed form into stored digital data
comprises a scanner, such as any 300 pixel bi-level scanner, and a central
processing unit (CPU). The CPU is connected to conventional peripheral
devices such as a display, a data storage (e.g. a hard disk file) and a
printer. The scanner converts the printed image from the form into
bi-level digital data which is transferred to the CPU and converted into
an array of 0's and 1's, representing white and black pixels on the
scanned form. Conventional OCR software is then utilized to recognize
characters from the array of digital data. The CPU then stores and/or
displays the results of the OCR software operation to any of its
peripheral devices. Thus, the output from operation of the OCR software is
character data, represented for example in ASCII format, which can be
easily manipulated by the CPU and stored or transmitted to peripheral
devices.
As illustrated in FIG. 1, in the present invention a blank master form is
first scanned, converted to bi-level data, received by the CPU and stored.
An example of a master form is shown in FIG. 12. A completed or filled in
form, is then scanned, converted to bi-level data and stored by the CPU in
an array referred to as an "old-map". An example of a filled-in or data
form is shown in FIG. 13. The present invention is the method of data
extraction performed by the data extraction software which allows just the
pixels in the data regions or fields to be extracted from the old-map so
that this data can then be received and operated on by the OCR software in
the conventional manner. The functions of the data extraction software
will be described with reference to the schematics of FIGS. 2-5 and to the
"pseudo-code" included in Appendix A, which is incorporated herein by
reference. A list of definitions to assist in understanding the
pseudo-code is listed in Appendix B and incorporated herein by reference.
Referring now to FIG. 2, the data extraction software includes two phases,
referred to as Phase I and Phase II. The operation of Phase I is identical
for both the master form and the data form, while Phase II operates only
on the data form. Thus, the first step in the data extraction process is
the scanning of the blank master form and, through the use of the
functions in Phase I to be described below, the creation of a
master-form-description. The master-form-description includes the
identification of all lines by location and length and the skew of the
form, i.e. its angular orientation relative to a fixed coordinate system
of the scanning device.
Referring to FIG. 3, the description of the operation of Phase I will be
described relative to its operation on a scanned data form, i.e. the
old-map, although as just indicated, the operation of Phase I is identical
for the blank master form. This is because the function of Phase I is to
locate all horizontal lines on the form, to generate a line-list and to
identify the location and length of all of the lines on the form, be it
the master form or the data form.
The old-map is a two-dimensional array of digital data corresponding to the
pixels of the scanned data form image. Each pixel is either 1 or 0, or
"ON" or "OFF", depending upon the presence or absence of either background
lines or character data at the corresponding X-Y location. This digital
data is analyzed with a Build Table routine (Appendix A, 1.1), which is a
search for vertical runs of ON bits having a height less than an
predetermined number, e.g. 3 in the preferred embodiment. The Build Table
routine identifies all such vertical runs and creates a cluster-list for
identifying the location of these vertical runs. The Build Table routine
outputs a list of all vertical clusters located in the old-map and a
cluster-index which points to the start of each row for each 3-pixel-high
vertical cluster. This information is then passed to the Build Lines
routine (Appendix A, 1.2) whose function is to connect horizontally
related clusters to detect horizontal lines. Thus, the Build Lines routine
searches through all clusters in the cluster-list, and with knowledge of
their location from the cluster-index, identifies horizontal lines. If
there is an additional cluster within a predetermined minimum length away
from the portion of the line just built, then that cluster is added to the
existing line. The output of the Build Lines routine is a line-list of all
of the lines in the data form. Each line in the line-list is identified by
its row-start, row-end, column-start, and column-end.
The lines are then normalized or de-skewed by the Normalize Lines (Appendix
A, 1.3) routine. Because the data form may not have been perfectly aligned
with the scanning device, or the background lines on the form not
perfectly oriented on the form, the lines which have been identified from
the Build Lines routine may be skewed. This means that the pixels
comprising a line may not necessarily begin and end on the same row. The
Normalize Lines routine calculates the average skew of the lines in the
line-list. The skew is defined as the ratio of the difference between the
start and stop row of a line to the difference between the start and stop
column of the line. After the skew has been calculated in this manner, the
lines are normalized or de-skewed so that each line in the line-list can
now be identified by a row-start, column-start and column-end. As a
necessary output of the Normalize Lines routine, the skew of the data form
has been calculated and is available for later use. The Connect Lines
routine (Appendix A, 1.4) then takes the list of Normalize Lines and
determines if any of these lines are within a predetermined line tolerance
of one another. If so, the lines are connected to make a single line,
thereby reducing the number of lines in the line-list. As a result of the
Connect Lines routine, the output is a data-form-description which is a
listing of all horizontal lines in the data form, where each of those
lines is identified by its row-start, its column-start and its column-end.
A similar procedure may be performed to detect and identify vertical
lines. Thus, at the completion of Phase I the data extraction software has
generated in digital form an exact replica of all background lines present
on the filled-in data form.
As previously described, the Phase I process is first performed on a master
form, i.e. a form identical to a filled-in data form, but without any
character data present in the data fields. This is the first step in the
process in the present invention and results in a master-form-description
which is a list of all horizontal lines on the master form identified by
row-start, column-start, and column-end. Vertical lines would be
identified by row-start, column-start, and row-end.
Referring now to Paragraph 1.5 of Appendix A, the next step is the creation
of the data-field-masks from the master form, utilizing the
master-form-description output by Phase I when the master form was
scanned. The function of the Create Masks step is to define a series of
rectangular masks separated by portions of the lines identified in the
line-list of the master form, wherein each mask corresponds to a desired
region where a data value is located in the data form. While there are
numerous techniques for creating the masks, the preferred technique is to
generate the output of Phase I from the master form on a visual display
(See FIG. 1). A user, through the use of a light pen, then points to those
regions bounded by lines where it is desired to extract data from the data
forms. For example, the light pen could identify the upper left and the
lower right corners of each desired data mask, which would result in
identifying the location and dimensions of each rectangular data region.
Each data mask has a perimeter which is preferably spaced from the lines
by a predetermined data margin, D. As a result of the Create Mask step,
the CPU will have stored the location and dimensions of the data regions
by use of a mask-list, which identifies each mask by row-start,
column-start and row-end, column-end. Since the normalized lines in the
master form are presented on the visual display during this step it is not
necessary to compensate for skew when the data masks are created on the
master form.
At this point in the description of the invention, a master form has been
scanned, a master-form-description has been defined through the use of
Phase I (FIG. 3 and Appendix A, 1.1-1.4), the data masks have been
created, a data form has been scanned and a data-form-description has been
generated through the use of Phase I.
Phase II is now performed on the data form. Referring now to FIG. 4 and
Paragraph 2 of Appendix A, it is noted that Phase II has several
functions. The first function is the Prepare Form routine (Appendix A,
2.1) which compares the data-form-description with the
master-form-description in order to calculate the differences in terms of
field locations between the two descriptions. The function of the Prepare
Form routine can be better understood by reference to FIG. 6. Because the
data form cannot always be perfectly aligned with the scanner, there will
usually be both vertical and horizontal offsets of the data form lines
from the master form lines form each time a new data form is scanned.
Thus, the Prepare Form routine compares each line in the line-list in the
master-form-description with its corresponding line in the line-list in
the data-form-description, and using the differences between the
row-start, column-start and column-end values, determines the offset of
the data form from the master form. This offset is essentially the
horizontal adjustment (horiz-adj) and the vertical adjustment (vert-adj)
necessary to transform the master form into the data form. The output of
the Prepare Form routine are the values or horiz-adj, vert-adj, and skew,
the latter term being the rotation of the data form relative to the master
form (See FIG. 6).
In the Get Data Masks function (FIG. 4 and Appendix A, 2.2), the data masks
created from the master form and which are identified by location and
dimension, are retrieved and placed in corresponding locations on the data
form so as to create data masks located in the data form. As part of the
Get Data Masks function, the width of the output array is calculated as
the width of the longest data mask created from the master form plus a
constant (such as two bytes). This number becomes the width of the output
image containing the extracted data.
The next function in Phase II is the Read Group function (Appendix A, 2.3),
which transfers the old-map image from data storage into main memory,
creating an array of records, referred to as "current-group", where each
record has a width corresponding to the width of the scanned image. Thus,
the Read Group function essentially converts the bi-level raw data into an
array having a width and height corresponding to that of the scanned
image. The output of the Read Group function is the current-group.
The Copy Masks function of Phase II comprises the five routines shown in
FIG. 5. The Copy Masks function locates the data masks in a current-group
to be extracted, grows or enlarges the masks, if necessary, to capture
information which lies outside of the masks, removes any lines which
intrude into the data masks, and finally writes the extracted data to the
extracted image. As part of the Copy Masks function, the mask parameters
are calculated in the Calculate Mask Parameters routine. This process
essentially adjusts the data masks to compensate for skew, horizontal
offset and vertical offset relative to the master form. The operation of
the Calculate Mask Parameters can best illustrated by reference to FIG. 6.
The input to the routine are the data mask locations and dimension, and
vert-adj, horiz-adj and skew which have been previously calculated in the
Prepare Form function. FIG. 6 illustrates how each of these parameters is
used to transpose the location of the data masks, whose relationships
relative to the master form are known, to their correct locations on the
data form. As shown in FIG. 6, the skew rotates the masks by the degree of
the skew. However, while this may be an optimal solution in that the size
of the masks remain the same as in the master form, the CPU cost in
performing the rotation step and in growing the masks may be inefficient
if the skew remains within a reasonable range. In an alternative
technique, as shown in FIG. 7, the skew is used to alter the height of the
mask in the data form, so that the resulting mask in the data form retains
its rectangular shape. As the skew increases, the chance of the predefined
rectangle not fitting inside the field increases. To compensate for this,
as the skew becomes greater the heights of the masks are reduced. The
output of the Calculate Mask Parameters routine is the location and
dimension of the data masks within the data form, each data mask being
defined by its start or left column, its end or right column, and its
top-row and bottom-row.
After the Calculate Masks Parameters routine has been performed, if the
data is perfectly located within the masks, then no additional data
extraction routines are needed and the data can be extracted directly by
the Write Row routine (Appendix A, 2.4.5). Perfect data is considered to
be data that resides entirely within a predefined data mask. If there are
no ON pixels along the parameter of the mask, the data is considered to be
wholly inside the mask and completely perfect. In the Write Row function,
all of the data in the current-group which is contained in the rectangles
defined by left, right, top-row and bottom-row, i.e. all of the data in
the masks, is copied to a new file. The result is that the new file
contains only extracted data without any extraneous lines of the
background form. The data located in the new file can then be operated on
by conventional OCR software to identify the data as specific character
data. In addition, because each data mask is predefined as corresponding
to a certain significance, i.e. name, age, etc., the data which has been
extracted from the form and converted to character data by the OCR
software is immediately identifiable with a predetermined significance.
If, on the other hand, the data present in the data masks is not perfectly
located within the perimeter of the data masks, then it is necessary to
perform one of the additional functions of the Copy Masks function (FIG.
5).
The Remove Vert Bars (Appendix A, 2.4.2) removes vertically oriented bars
which may be intruding into a data field, such as the portion of the
signature in the "Middle" data field shown in FIG. 13. The Remove Vert
Bars routine checks the two right-most and left-most bytes for the entire
height of each field to determine if there is a run of connected ON bits
that occurs that forms a vertical line. If such a line is not included
since it assumed that data occurring beyond the vertical bar belongs to
another field.
Referring now to FIG. 8, the Grow Bottom routine (Appendix A, 2.4.4)
enlarges the bottom of the mask until all of the data in the field has
been captured or until the growth of the mask reaches a predetermined
limit, defined as 2D, the spacing between neighboring data masks. The mask
is grown and during the growth lines are detected in the manner similar to
the line detection technique described in Phase I. The mask is grown by
searching a perimeter row for ON bits. If a violation (the presence of ON
bits) is encountered, the column to the left and right of the violation on
the next row (one row up if growing the top of the mask, one row down if
growing the bottom of the mask) is checked for ON bits. If either column
has ON bits, then the process is repeated. This process terminates if
either both bytes become OFF, or the maximum growth limit of 2D is
reached. Once any portion of a line is detected during mask growth the
complete line is removed. A similar technique is used in the Grow Top
(Appendix A, 2.4.3) to enlarge the bottom of the mask to capture data. For
example, the word "Twin" in FIG. 13 would be located entirely within the
grown data mask and the line removed. The line removal process is
activated whenever the Grow Top or Grow Bottom routine detects a line in
the ambiguous 2D area (FIG. 8). Line removal begins by following the path
of the line and erasing the line as it is followed. The erasing of the
line also erases anything that lies either one pixel above or one pixel
below the line. This is to account for edge noise that is generated during
printing and scanning. As the line is being erased, a check is made to
determine whether there is any data that may be either touching or going
through the line. If such data is found, the areas where the pixels may be
from the line or from the data are not erased (see FIG. 9). The line
detection and erasing algorithm has also been implemented in APL and is
listed in Appendix C together with descriptions of the respective lines
and routines. FIGS. 10 and 11 illustrate a pattern processed by the
algorithm of Appendix C wherein FIG. 10 is the input pattern where all
black or ON pixels are denoted with an asterisk, and FIG. 11 is the result
of the line detection algorithm wherein the lines have been detected and
the asterisks replaced with dashes.
While the preferred embodiments of the present invention have been
illustrated in detail, it should be apparent that modifications and
adaptations to those embodiments may occur to one skilled in the art
without departing from the scope of the present invention as set forth in
the following claims.
##SPC1##
##SPC2##
##SPC3##
* * * * *
|
|
|
|
|
Description  |
|