|
Description  |
|
|
FIELD OF THE INVENTION
The invention disclosed herein relates to data processing devices and more
particularly relates to post processing devices for character recognition
machines such as optical character readers and speech analyzers. The
invention can also be applied to the analysis of typographical errors
resultsing from the use of a standard keyboard.
BACKGROUND OF THE INVENTION
From its technical debut, the optical character recognition machine (OCR)
has had unique potential for purposes of text processing applications. Its
input processing rate far exceeds that of keypunch or typewriter inputs
and its output is in machine readable form. Despite these very important
attributes, optical character recognition machines have made only minor
inroads to overall text processing applications. This may be principally
due to the tendency of state of the art character recognition machines to
generate a substantial percentage of erroneous misreads when a variety of
fonts and formats are scanned.
When multifont nonformatted optical character recognition is attempted,
problems arise which are not as prevalent in unifont applications. They
stem from the highly error prone character recognition environment which
is created when the character recognition machine operation is performed
over many different alphabetic and numeric fonts with minimum control
exercised over text conventions and typographical print quality. When
scanning such a text, discrimination between confusable character
geometries causes a nominal 5 percent character misrecognition rate.
In the prior art, apparatus for selecting the correct form of a garbled
input word misread by an OCR has been limited to correcting errors in the
substitution misrecognition mode. For improving the performance of an
optical character reader, prior art discloses the use of conditional
probabilities for simple substitution of one character for another or of
character rejection, for calculating a total conditional probability that
is input OCR word was misread, given that a predetermined dictionary word
was actually scanned by the OCR. But the prior art deals only with the
simple substitution of confusion pairs occupying the same corresponding
location in the OCR word and in the dictionary word. The OCR word and the
dictionary word must be of the same length.
A significant advance in the art of post processing error correction
apparatus has been contributed by W. S. Rosenbaum, et al., in the
copending patent application Ser. No. 600,743, filed July 30, 1975, as a
continuation-in-part of application Ser. No. 459,820, filed Apr. 10, 1974,
now abandoned, both applications being assigned to the same instant
assignee. A regional context error correction apparatus is disclosed
therein which corrects for segmentation errors as well as substitution
errors in the characters read by the OCR. Segmentation misrecognition
differs from that of simple substitution in that its independent events
correspond to groupings of at least two characters. Nominally there are
three types of segmentation errors. They are: horizontal splitting
segmentation, concatenation segmentation and crowding segmentation. The
underlying mechanical factor which these segmentation types have in common
is that they are generated by the improper delination of the character
beginning and ending points. Segmentation errors occur quite frequently in
OCR output streams and constitute a substantial impediment to accuracy in
text processing applications. The regional context error correction
apparatus disclosed in patent application Ser. No. 600,743 contains a
dictionary storage 28 shown in FIG. 3 containing words which are expected
to be read by the OCR. It is disclosed that for general English Text
Processing applications the words appearing in a conventional dictionary
may be stored in the dictionary storage 28. It is seen however that the
dictionary storage 28 would remain a substantial storage capacity to
accommodate a conventional English dictionary and would require very fast
accessing time in order to compare each word in the dictionary with the
garbled word input from the OCR. The application also discloses that the
dictionary store 28 may optionally have a bulk storage input 3 which could
for example supply selected categories of reference words which are most
likely to match with the particular type of misrecognized word received
from the OCR.
Storage techniques of the associative memory type have been disclosed in
the prior art for accessing the correct form of a misspelled word. For
example, J. J. Giangardella, "Spelling Correction by Vector Representation
Using a Digital Computer" IFEE Transactions on Engineering Writing and
Speech, Volume EWS-10, Number 2, December 1967, page 57, discloses the use
of vector representation of alpha words by assigning the numbers 1 through
26 to the letters A through Z respectively and calculating the vector
magnitude and angle for accessing the word from a memory in a general
purpose computer. Problems associated with this approach, which are
typical of those confronting the prior art, relate to the associative
memory accessing and over-inclusive or an under-inclusive class of words
to correspond with the input word of interest.
OBJECTS OF THE INVENTION
It is an object of the invention to associatively access the class of valid
alpha words as potential candidates for the correct form of a garbled
alpha word, in an improved manner.
It is another object of the invention to associatively access a group of
alpha words as potential candidates for the correct form of a garbled
alpha word, the group accessed being less over-inclusive or
under-inclusive than was possible in the prior art.
It is still another object of the invention to associatively access a group
of valid alpha words as potential candidates for the correct form of a
garbled alpha word misrecognized by an OCR machine, in an improved manner.
It is a further object of the invention to associatlvely access a group of
spoken words represented by a sequence of phoneme characters as potential
candidates for the correct form of a garbled spoken word as represented by
a sequence of phoneme characters, in an improved manner.
It is an additional object of the invention to associatively access a group
of words as potential candidates for the correct form of a word containing
typographical errors commonly committed in the use of a keyboard, in an
improved manner.
SUMMARY OF THE INVENTION
These and other objects of the invention are accomplished by the cluster
storage apparatus disclosed herein. The cluster storage apparatus outputs
groups of valid alpha words as potential candidates for the correct form
of an alpha word misrecognized by a character recognition machine, a
speech analyzer, or a standard keyboard. The cluster storage apparatus
comprises a two-dimensional array of alpha word read only storage
locations, each location having a group of alpha words arranged such that
adjacent locations contain alpha words having similar character
recognition misread propensities. A first-dimensional accessing means is
connected to the read only storage for addressing the locations based upon
the values assigned to the characters of which the input alpha word is
composed. A second-dimensional accessing means is connected to the read
only storage for accessing the location therein based upon the number of
characters in the input alpha word. The read only storage memory is
organized so as to minimize the difference in address between alpha words
which have similar OCR misread propensities and so as to cluster words of
a given character length, as well as words of other length that have a
significant probability of being malsegmented into the given length. The
propensity for misread is determined by empirical measurement of the OCR
character transfer function. The transfer function is expressed as a
series of equations representing each character's probability of being
confused into a false output character. These equations are solved for the
optimum character value set which assigns higher numeric values to highly
reliable characters and lower numeric values to less reliable characters
under a global constraint that characters that are misread related are
assigned values within a maximal distance of one another. In addition the
malsegmentation probability is determined by the OCR character transfer
function. The transfer function of the OCR is expressed as a series of
values representing the probability of a character being malsegmented.
These values are used to calculate the probability of each word being
malsegmented. The malsegmentation probability for a word is compared with
a minimum threshold so that words whose malsegmentation propensity exceeds
this threshold are stored with words of adjacent lengths. The cluster
storage organization of the read only storage memory therefore has a
structure which conforms with a global constraint such that no numeric
assignment of two characters which can be misrecognized into one another
will differ in location by more than a predetermined error interval. Thus
an input alpha word which is potentially in error can be associated with
that portion of the read only storage which contains potential candidates
for the correct form of the input alpha word without excessive
over-inclusion of extraneous alpha words or under-inclusive of significant
alpha words.
DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, features and advantages of the invention
will be apparent from the following more particular description of the
preferred embodiments of the invention, as illustrated in the accompanying
drawings.
FIG. 1 is a schematic flow diagram of the vector fetch process.
FIG. 2 shows the number selection criteria for the assignment of numerical
values to alphabetical characters.
FIG. 3 is a schematic diagram showing the initial numeric assignment scheme
of numerical values to alphabetical characters.
FIG. 4 is a schematic diagram of read only storage arrangement for the
second-dimensional access.
FIG. 5 is a schematic diagram of the read only storage memory.
FIG. 6 is a detailed block diagram of the cluster storage apparatus
invention.
FIG. 7 is a general block diagram of the post processing, error correction
system containing the Bavesian Online Numeric Discriminator, Ser. No.
409,526, filed Oct. 25, 1973, now U.S. Pat. No. 3,842,402, issued Oct.
15, 1974 the Binary Reference Matrix Apparatus, Ser. No. 494,251, filed
Aug. 2, 1974, now U.S. Pat. No. 3,925,761, issued Dec. 9, 1975 the
Regional Context Maximum Likelihood Bavesian Error Correction Apparatus,
Ser. No. 600,743, filed July 30, 1975 as a continuation-in-part of
application Ser. No. 459,820, filed Apr. 10, 1974, now abandoned, and the
Cluster Storage Apparatus disclosed herein.
DISCUSSION OF THE PREFERRED EMBODIMENT
Theory of Operation
The strategy used to effect OCR error correction is to reference an error
correction dictionary and determine from all the words listed therein,
"Which of the dictionary entries is the word that was scanned by the OCR
and misrecognized into the incorrect form presently being processed?"
Clearly, a basic part of this operation is the ability to determine which
segment of the error correction dictionary should be reviewed.
Schematically this is shown in FIG. 1. The more accurately one can
delineate the portion of the dictionary which contains the correct form of
the input word, the larger the dictionary can be without compromising the
efficiency and speed of the OCR error correction operation.
When a garbled alpha word is detected in an output recognition stream and
it is desired to select a group of candidate words for its correct form,
the properties of OCR misread, makes it impossible to formulate a reliable
dictionary accessing means using the normal dictionary indexing word
attributes of character alphabetic properties and/or word length. The OCR
misread propensities can alter either or both of the prereading word
attributes in various ways. In spite of this, there is still much
potential dictionary entry information in the misrecognized data. To
utilize a garbled word as a key to the dictionary, the character string
must be analyzed in a new perspective. The vehicles for this analysis are
the Vector Fetch (VF) and Word Group file organization concepts.
The rationale which underlies the VF dictionary accessing methodology can
best be understood as a specialized application of classical statistical
confidence interval theory. As normally configured, an error interval sets
up a range of values within which the true value of the factor being
estimated can be said to lie with a predetermined error tolerance.
Within the perspective of the error interval analysis, the VF methodology
can be configured as a specialized application which uses the garbled word
data to:
a. Estimate the dictionary location of the word that was misrecognized by
the OCR.
b. Give relevance to the estimated dictionary access point (DAP) by
generating around it a range of locations wherein the required word
information lies with a predetermined certainty.
The description of the mechanics involved in the implementing of the
preceding dictionary fetch methodology is logically broken into two
portions:
1. A first-dimension accessing means, based on character content, which
requires
a. Estimation of dictionary access point within the storage apparatus
b. Determination of the fetch width constraints
2. A second-dimension accessing means which requires grouping of dictionary
words within the storage apparatus to reflect similar length
characteristics.
1.1 Estimation of Dictionary Access Point Within the Storage Apparatus
The dictionary access point (DAP) is the initial estimate of at what
location the correct form of the OCR input word lies in the dictionary
storage apparatus. The vehicle for this initial estimation process is a
specialized hashing transformation applied to the misrecognized input
alpha word. Underlying the hashing transformation is a specially developed
numeric assignment scheme in which each character in the alphabet has a
numeric value that reflects its absolute and relative OCR recognition
reliability. The particulars of the alphameric assignment scheme will be
elaborated upon shortly. It presently suffices to say that the numeric
assigned is related to the reliability of the alpha character. In its
simplest form, this implies that the more reliable an alpha character
recognition, the more weight is put upon it in the hashing calculation.
Given this alphanumeric assignment scheme, the DAP follows as the summation
of positive integers:
##EQU1##
where: L = numeric value assigned to the character in the Nth position of
the misrecognized word.
M = the number of character positions in the misrecognized word.
The key to this technique is the derivation of the appropriate alphameric
assignment scheme. Dual and seemingly conflicting constraints have to be
accommodated in the assignment scheme. Essentially, the alphameric
assignment used to compute the DAP has to:
a. Minimize the effects on the DAP of intercharacter substitutions
resulting from OCR misreads.
b. Map the dictionary words into a sufficiently uniform spread throughout
the storage apparatus.
The first constraint reflects the desire that Equation (1), the hashing
formulation, be as insensitive as possible to OCR substitution and
segmentation misread. The second constraint seeks to avoid a trivial
solution from evolving as a result of the first constraint. Such a
solution will be the collapsing of the dictionary so that all entries
occupy a single DAP or a very narrow band of DAPs within the storage
apparatus. If this were the case, nearly the entire dictionary would be
output in each fetch. For real time processing this would be an
unacceptable situation and would defeat the intent of the vector fetch
algorithm.
The optimum alphameric assignment scheme for the vector fetch can be
derived by virtue of a mathematical approach using linear programming.
This approach to vector fetch alphameric assignment scheme generation
follows by expressing the OCR intercharacter substitution propensities as
linear relations. This implies, for every non-null event in the OCR
transfer function (confusion matrix), a norm distance is set up of the
form:
.vertline.X.sub..alpha. - X .sub..beta..vertline..ltoreq. Constant (2)
where:
X.sub..alpha.,x.sub..beta. are the numeric designates of the alphabetic
characters denoted in the general case by .alpha. and .beta..
A typical OCR transfer function (confusion matrix) when reconstituted in
the above form, yields several hundred separate expressions of the form of
Equation (2). Standard linear optimization formulation, however, are not
able to directly accommodate a norm distance (i.e., and absolute value
relationship) as a base variable in the system of constraints or in its
objective function.
To allow the programming optimization of the VF alphameric assignment
scheme to reflect an analog to the OCR misread characteristics, a mixed
integer linear programming formulation was adopted. Each relationship of
the form of Equation (2), is reconstituted as a set of constraints of the
form:
##EQU2##
where: I.sub..alpha..sub..beta. represents the set of integer variables
constrained to take on the value of either one or zero.
Z.sub..alpha..sub..beta. is the variable over which the objective function
optimization of the form .SIGMA.
P.sub..alpha..sub..beta.Z.sub..alpha..sub..beta. = min is performed.
P.sub..alpha..sub..beta. is relative weight or importance value associated
with a respective constraint. In the present analysis
P.sub..alpha..sub..beta. has been set equal to the cumulative occurrence
rate of the respective .alpha., .beta. characters.
K is the fetch error tolerance in units of magnitude.
Up to this point, the system of optimization equations has only taken into
account constraints consistent with goal a above.
Goal b --the avoidance of inordinate degrees of clustering of dictionary
entries in any range of magnitude values is accomplished by appending to
the system of OCR misread equations (Equation 3) a series of constraints
which maintain a somewhat uniform distribution of entries for all segments
of the dictionary. These latter constraints are set up by randomly
selected legal entries from the dictionary word list and specifying that a
predetermined norm distance be maintained between them in the final
dictionary vector structure. For example, the entries CORNWALL and
SHERWOOD can be used to yield a vector dictionary infra-structure
constraint of the form:
(X.sub.C +X.sub.O +X.sub.R +X.sub.N +X.sub.W +X.sub.A +X.sub.L
+X.sub.L)-(X.sub.S +X.sub.H +X.sub.E +X.sub.R +X.sub.W +X.sub.O +X.sub.O
+X.sub.D).gtoreq.D.sub.1 X.sub.C +X.sub.N +X.sub.A +2X.sub.L -X.sub.S
-X.sub.H -X.sub.E -X.sub.O -X.sub.D .gtoreq.D.sub.1 (4)
the value of D.sub.1 represents the norm distance between entries SHERWOOD
and CORNWALL in a dictionary where an initial alphameric assignment scheme
has been used which yields good dictionary word lists spread
characteristics but does not necessarily meet all the OCR constraints as
given by Equation (3). The programming array of constraints is completed
by adding the additional infra-structure constraints consistent with the
simple linear format described by the SHERWOOD, CORNWALL example described
in the Equation (4).
The initial alphameric assignment scheme used to define the D values of
Equation (4), was obtained by treating Equation (1), as a vector magnitude
computation; that is,
##EQU3##
and assigning 1 through 26 (L.sub.N.sup.2 1 through 676) to the characters
in the alphabet.
FIGS. 2 and 3 indicate how the numeric assignments are made in a manner
that is consistent with that required by OCR misread magnitude distortion
minimization constraints posed by Equations (3). If the numeric scale is
to be 1 to 26, the squares of these values will range from 1 to 676. A
matrix is shown for these values without specifying the character
assignments. The vertical part of the matrix represents the input
characteristics from the scanned document; the horizontal set represents
the OCR recognition decision. All correct recognitions are indicated by
the diagonal of the matrix. All substitutions or rejects are off the
diagonal. For example, if an H and M, given values of 10 and 9
respectively, and if an H is misread as M, the difference of magnitude
will be 100 minus 81 or 19. This would be an appropriate selection since H
and M substitution is common.
If the OCR misread distortion is set at plus or minus 250 units (i.e., the
normal value of the factor K on the right hand side of the system of
equations generated from Equations (3)), then a relatively simple yet
meaningful initial assignment of alpha characters to the numeric values
indicated on the axes of the confusion matrix can be derived such that a
large number of common recognition errors are contained within these plus
or minus 250 unit error intervals. These boundaries are indicated in FIG.
2. The initial numeric assignment scheme is shown in FIG. 3, where the
shaded portion of that figure has those misreads for which the initial
scheme cannot compensate (the numbers within the matrix relate to the
relative occurrence rate of the specific misread errors). Empirical
analysis with this numbering scheme has shown that although it did not
satisfy all constraints of the form of Equations (2), it did transform a
word list into a suitable distributed dictionary which did not produce
inordinate clustering of dictionary entries. For this reason, this
numbering scheme was used to define the norm distance between the randomly
selected entries used to formulate the dictionary infra-structure
constraints as given by Equation (4). It should be noted that other
numbering schemes could have been successfully used for the basis of these
infra-structure constraints. The vector magnitude scheme was used because
of its simplicity and familiarity.
The resulting formulation of Mixed Integer Linear Programming constraints
and objective functions were solved using the facilities of IBM
Mathematical Programming System Extended (MPSX) (MPS) Program Number
5734-XM4. Similar optimization routines are available from several other
software sources. The final output of the programming solution yielded a
set of alphameric assignments which minimized hashing distortions due to
OCR misread, while maintaining a relatively uniform spread of entries over
the dictionary. The alphameric assignment scheme is shown in Table 1.
TABLE 1
__________________________________________________________________________
Final Fetch Vector Alphameric Assignment Scheme-
Values Generated Using Mixed Integer Linear Programming
__________________________________________________________________________
A=200
B=36 C=256
D=196
F=144
F=16 G=289
H=144
I=64
J=225
K=441
L=25 M=175
N=185
O=225
P=361
O=289
R=225
S=324
T=121
U=169
V=100
W=49 X=529
Y=9 Z=484
*=121
__________________________________________________________________________
1.2 Determination of Fetch Width Constraints
If the misread word were transformed into a magnitude value using the
alphameric assignment scheme shown in Table 1, then it could be assumed
that the garbled and correct forms of the same word would map into fairly
similar (close) magnitude values. If the correct form of each word had
been stored in the error correction dictionary with respect to its
magnitude, then the DAP yielded by Equation (1) would approach the
vicinity of the correct word entry required for completion of error
correction processing. However, to successfully perform the decision
process which underlies the Regional Context Likelihood Error Correction
Apparatus disclosed in copending patent application Ser. No. 600,743, it
is a prerequisite that the misread form of the word be compared in a
conditional probabilistic format with the correct version of that word.
Hence, the DAP, in itself, is not sufficient for retrieving the data
required for the latter phases of OCR error correction. However, the
proximity of the DAP to the correct dictionary entry makes it a natural
axis point for the construction of an error interval .DELTA. which will
act as the delimiter of a dictionary fetch range. If properly configured,
the fetch range .DELTA. will retrieve from locations adjacent to the DAP a
set of address entries which will contain within it, with a predetermined
error tolerance, the correct version of the misread input word. As in the
preceding example, the selection of .+-. 250 as a fetch width .DELTA.
implies an error tolerance, i.e., the possibility of the correct version
of the input word is outside the fetch range that was accessed.
The three major OCR misread errors which must be compensated for the
construction of the dictionary fetch range are reject characters,
substitution errors, and segmentation errors. The fetch is most effective
for the reject and substitution errors. Segmentation errors are
statistically less predictable and therefore not as readily overcome. A
misread word can become unretrievable using the VF if successive misreads
within the word additively reinforce one another until a delta magnitude
greater than 250 is achieved. This situation is comparatively rare in that
successive misreads will tend to randomly cancel, to some degree, the
magnitude of the deviation that each has respectively added.
1.3 Word Length Grouping Within the Storage Apparatus
Organization of dictionary structure according to word length similarities
is used to complement the accessing potential of the VF methodology.
FIG. 1 shows a schematic of the fetch process for the misrecognized input
word. The magnitude of the input word is calculated using Equation (1).
For the word shown in this is 1087. The word length is also used to reduce
the number of entries in the fetch. For OCR data, length cannot be used as
an absolute discriminant, since segmentation errors may artificially
increase or decrease the word length. A common approach to this problem is
to include not only words of the same length in the fetch as the input
word, but also all words of adjacent length and even those that differ by
as much as two positions. This is done according to the set of rules which
themselves are length-dependent. The problem with this approach is that it
leads to unacceptable large fetch sizes. (on the average, approximately 20
percent of the dictionary)
It is again possible to utilize known OCR error propensities to improve the
word length discriminant. Since word length changes are caused by some
type of segmentation problem (splitting or concatenation), only the words
that are prone to be malsegmented by virtue of their composition are
entered in more than one of the word length subdivision. This leads to the
concept of a Word Group discriminant. In a Word Group, all words of the
designated length are included as well as words of all other lengths that
have a significant probability of being malsegmented to that length.
The implementation of Word Group accessing is dependent on determination of
objective criteria by virtue of which a word's character composition may
be evaluated for assessment of the degree of missegmentation propensity
and accordingly the requirement for multiple Word Group entry. To allow
assessment of a dictionary word for inclusion in a Word Group, the
following segmentation threshold calculation is performed.
The probability of word segmentation is described functionally by Equation
(5).
P(Word.sub.seg) = 1 - P(Word.sub.seg) = 1 - P(W.sub.seq) (5)
where bar notations indicates the complement of the segmentation event,
that is, no segmentation occurrence. From empirical data averaged over all
word lengths, 80% of all segmentation will occur in words whose
P(W.sub.seq) is greater than 0.6%. It would be reasonable, therefore, to
take as a threshold for Word Group duplicative entry, any word whose
cumulative character segmentation probability surpasses this nominal value
or, in other words:
P(W.sub.seg) > T = 0.6% (6)
of course this threshold could be lowered more but this would add many more
duplicative entries while not accommodating significant additional word
segmentations. The relationship in Equation (5) can be made more
meaningful by posing it in terms of constituent character events as:
P(W.sub.seg) = 1 - P(.alpha.1.sub.seg) . P(.alpha.2.sub.seg) . . .
P(.alpha.N.sub.seg) (7)
substituting Equation (7) in Equation | | |