Claims
I claim:
1. A method of operating a computing system comprising a host computer and
a printer connected to receive an input stream of electrical signals
defining a print job from said host computer encoded in any one of a
plurality of printer control languages comprising the steps for:
causing an application program to run upon said host computer to generate a
print job encoded in a selected printer control language,
transmitting the print job to said printer,
sampling a portion of the print job,
analyzing the sampled portion of the print job using statistical techniques
using stored data sets to identify the printer control language in which
it is encoded, said stored data sets comprising statistical data setting
forth a measure of the ability of selected n-grams occurring in print jobs
to distinguish a given printer control language from all others, and
interpreting the print job in accordance with the printer control language
identified by the sampling and analyzing steps.
2. A method of operating a computing system comprising at least one host
computer and at least one printer, said host computer outputting print
jobs encoded in a plurality of printer control languages and said at least
one printer processing print jobs encoded in more than one printer control
language comprising the steps of:
a) generating samples of print jobs encoded in various printer control
languages,
b) analyzing said samples using statistical techniques to build data sets
defining distinguishing characteristics for each printer control language,
said data sets comprising statistical data setting forth a measure of the
ability of selected n-grams occurring in print jobs to distinguish a given
printer control language from all others,
c) storing said data sets in said printer,
d) capturing the initial portion of a new print job being transmitted to
said at least one printer and testing said initial portion against said
data sets to identify the printer control language in which the new print
job is encoded, and
e) printing the new print job using an interpreter or emulation suitable
for the printer control language identified in the preceding step.
3. A method of operating a computing system comprising a host computer and
a printer connected to receive an input stream of electrical signals
defining print jobs from said host computer encoded by specific computer
applications in a plurality of printer control languages and in pure text
format comprising the steps for:
sampling a portion of the print jobs created by a plurality of applications
programs for each printer control language and in pure text format using
statistical techniques to build data sets that can be used to distinguish
sampled print jobs according to the printer control language in which they
have been encoded, said data sets comprising statistical data setting
forth a measure of the ability of selected n-grams occurring in print jobs
to distinguish a given printer control language from all others,
storing the data sets in the printer,
running an applications program on said host computer to generate or
acquire a new print job,
transmitting the new print job to said printer,
sampling the initial portion of the new print job being transmitted to the
printer,
analyzing the sampled portion of the new print job using the stored data
sets to identify the printer control language or pure text format in which
it is encoded, and
interpreting the input stream in accordance with the printer control
language, if any, identified by the sampling and analyzing steps.
4. A method of operating a printer configured for processing print jobs
encoded in more than one printer control language comprising the steps of:
a) storing data sets obtained by statistical techniques in the printer,
said data sets defining distinguishing characteristics of said more than
one printer control language, said data sets comprising statistical data
setting forth a measure of the ability of selected n-grams occurring in
print jobs to distinguish a given printer control language from all
others,
b) capturing an initial portion of a new print job and testing said initial
portion against said data sets to identify the printer control language in
which the new print job is encoded, and
c) printing the new print job using an interpreter or emulation suitable
for the printer control language identified in the preceding step.
5. A method according to claims 2, 3 or 4 wherein the data sets comprise
statistical data setting forth a measure of the likelihood of combinations
of selected n-grams occurring in print jobs are indicative of print jobs
encoded in a given printer control language.
6. A method according to claim 5 wherein the data sets comprise a plurality
of vectors of real numbers corresponding to the selected n-grams for each
printer control language and a threshold value corresponding to each
vector.
7. A method according to claim 6 wherein for each printer command language
a score is computed based upon the number of occurrences of each n-gram in
the sampled portion of the print job and each vector of data sets for each
printer command language until the score computed with a given vector when
compared to the corresponding threshold indicates the print job is encoded
in the printer command language to which that vector corresponds and
directing the interpreting means to interpret the print job in accordance
with that printer control language.
8. A method according to claims 2, 3 or 4 wherein the data sets comprise
vectors of real numbers corresponding to the selected n-grams for each
printer control language.
9. A method according to claim 8 wherein for each printer command language
a score is computed based upon the number of occurrences of each n-gram in
the sample portion of the print job and the data sets for each printer
command language, said scores being indicative of the likelihood of the
print job being coded in each command language and directing the
interpreting means to interpret the print job in accordance with the
printer control language having the score indicating it is the most likely
language in which the print job is encoded.
10. A method according to claims 2, 3 or 4 wherein the data sets comprise
statistical data setting forth a measure of the likelihood of selected
n-grams occurring in print jobs are indicative of print jobs encoded in a
given printer control language by a given application.
11. A method according to claims 2, 3 or 4 wherein the data sets comprise
statistical data setting forth a measure of the likelihood of combinations
of selected n-grams occurring in print jobs are indicative of print jobs
encoded in one printer control language by a given application or another.
12. A method according to claims 2, 3 or 4 wherein the data sets comprise
statistical data setting forth a measure of the likelihood of selected
n-grams occurring in print jobs are indicative of print jobs encoded in a
given printer control language by a given application, said data weighted
by the ability of said n-grams to distinguish a given printer control
language from other printer control languages.
13. A method according to claims 2, 3 or 4 wherein the n-grams included in
the data sets avoid sequences of signals representing device dependent
characters or parameters.
14. A method according to claims 2, 3 or 4 wherein the n-grams included in
the data sets avoid sequences of signals representing which are
application dependent.
15. A method according to claims 2, 3 or 4 wherein the n-grams included in
the data sets comprise command sequences which have a correlation with a
printer control language.
16. A method according to claims 2, 3 or 4 wherein the n-grams included in
the data sets map upper and lower case characters to the same character
code.
17. In a printer for receiving an input stream of electrical signals
defining a print job from a host computer, said input stream encoded by a
computer application in any one of a plurality of printer control
languages, said printer comprising means for interpreting each of said
plurality of printer control languages to define a bit mapped image, means
for converting the bit mapped image into a visual display of said image,
the improvement comprising:
means for sampling a portion of an input stream,
means using a printer resident algorithm and a plurality of data sets
obtained by statistical techniques for analyzing the sampled portion of
the input stream to identify the printer control language in which it is
coded, there being at least one data set for each printer control
language, said data sets comprising statistical data setting forth a
measure of the ability of selected n-grams occurring in print jobs to
distinguish a given printer control language from all others, and
means for directing the interpreting means to interpret the input stream in
accordance with the printer control language identified by the sampling
and analyzing means.
18. A printer according to claim 17 wherein the data sets comprise
statistical data setting forth a measure of the likelihood of combinations
of selected n-grams occurring in print jobs encoded in one printer control
language.
19. A printer according to claim 18 wherein the data sets comprise a
plurality of vectors of real numbers corresponding to the selected n-grams
for each printer control language and a threshold value corresponding to
each vector.
20. A printer according to claim 19 wherein for each printer command
language a score is computed based upon the number of occurrences of each
n-gram in the sample portion of the print job and each vector of data sets
for each printer command language until the score computed with a given
vector when compared to the corresponding threshold indicates the print
job is encoded in the printer command language to which that vector
corresponds and directing the interpreting means to interpret the print
job in accordance with that printer control language.
21. A printer according to claim 17 wherein the data sets comprise vectors
of real numbers corresponding to the selected n-grams for each printer
control language.
22. A printer according to claim 21 wherein for each printer command
language a score is computed based upon the number of occurrences of each
n-gram in the sample portion of the print job and the data sets for each
printer command language, said scores being indicative of the likelihood
of the print job being coded in each command language and directing the
interpreting means to interpret the print job in accordance with the
printer control language having the score indicating it is the most likely
language in which the print job is encoded.
23. A printer according to claim 17 wherein the data sets comprise
statistical data setting forth a measure of the likelihood of selected
n-grams occurring in print jobs are indicative of print jobs encoded in a
given printer control language by a given application.
24. A printer according to claim 17 wherein the data sets comprise
statistical data setting forth a measure of the likelihood of combinations
of selected n-grams occurring in print jobs are indicative of print jobs
encoded in one printer control language by a given application or another.
25. A printer according to claim 17 wherein the data sets comprise
statistical data setting forth a measure of the likelihood of selected
n-grams occurring in print jobs are indicative of print jobs encoded in a
given printer control language by a given application, said data weighted
by the ability of said n-grams to distinguish a given printer control
language from other printer control languages.
26. A printer according to claims 17, 18, 20 or 22 wherein the n-grams
included in the data sets avoid sequences of signals representing device
dependent characters or parameters.
27. A printer according to claims 17, 18, 20 or 22 wherein the n-grams
included in the data sets avoid sequences of signals representing which
are application dependent.
28. A printer according to claims 17, 20, 26 or 27 wherein the n-grams
included in the data sets comprise command sequences which have a
correlation with a printer control language.
29. A printer according to claims 17, 20, 26 or 27 wherein the n-grams
included in the data sets map upper and lower case characters to the same
character code.
Description
MICROFICHE APPENDIX
Filed herewith is a microfiche appendix comprising 1 microfiche and 50
total frames.
FIELD OF THE INVENTION
This invention relates to computer printers and more specifically to
computer printers capable of printing jobs which may be encoded in one or
another printer command language.
BACKGROUND OF THE INVENTION
Printers associated with computers receive print jobs transmitted from the
computers. The print jobs comprise data (character codes and graphic
elements encoded in bit maps, etc.) and, usually, instructions encoded in
a specific printer command language. However, when only data and certain
standard instructions (tabs, line feeds, etc.) are transmitted, the print
jobs are said to be in the form of "pure" text.
A printer command language is a set of instructions understood by a
printer. It may include information about positioning of text and/or
graphics and options to control the attributes (e.g., font style, font
size, color, density) of the printed information. Examples of such
languages are Postscript, HP PCL, HP GL, and Impress. A printer command
language is considered to consist only of the defined command sequences
and the specified number of parameters associated therewith. In a print
job, the printer command sequences are interspersed with data.
In order for a printer to process a print job encoded in a particular
printer command language, it must have a combination of hardware and
software capable of understanding and processing the printer command
language. Typically, the printer has a controller which itself is a
digital computer programmed with interpreters or emulations for processing
more than one command language. For example, the printer controller may
first generate a bit map stored in its page memory from the print job.
Other apparatus, with reference to the bit map, produces a hard copy.
Examples of computer printers that are capable of converting bit maps to
hard copy are laser printers, thermal printers and dot matrix printers.
Applications programs that run on host computers are end-user programs (or
frequently used utility programs) which generate print jobs. Applications
programs generate print jobs using a printer command language or in the
form of pure text. It is likely that different applications programs used
with a given host computer use different printer command languages and/or
pure text output.
A print job must be transmitted to a printer that can interpret the
language in which the job is encoded and, if the printer can handle more
than one printer command language, the correct interpreter or emulation
must be selected.
In the past, three methods have been used to assure that a print job is
transmitted to a printer prepared to interpret the print job: 1) Users or
host software selected from among a variety of printers connected to a
computer system, each of which can handle print jobs encoded in a single
printer command language. 2) Switches of some form are set manually upon
the printer capable of handling more than one printer command language in
order to select the printer command language desired by the user. To
change the printer command language processed by a printer, the switches
must be altered and the printer reset in some fashion. 3) Additional
command sequences or job headers may be defined by the printer
manufacturer to be sent at the start of print jobs to select a desired
printer command language.
The prior methods of directing a print job to a printer prepared to receive
it have shortcomings. Multiple printers each dedicated to one printer
command language can be an expensive solution. A printer must be purchased
for each language. Moreover, some printers may be heavily used while others
sit idle. The use of configuration switches to select a printer command
language may lead to resource contention as the users of one printer
command language may inhibit the use by others. The user closest to the
printer can dominate use of the printer because, to assure that a printer
is configured to receive a print job, a trip to the printer is required.
The use of job headers involves non-standard command sequences across
printers made by different manufacturers. It also involves modification of
existing application software to generate the headers for each print job.
SUMMARY OF THE INVENTION
It is an object according to this invention to provide printers with the
capability of recognizing the printer command language in which a print
job (without special header) is encoded and to process the print job
accordingly.
It is an advantage according to this invention to improve the productivity
and throughput of printer resources, especially in a networked
environment.
It is a further advantage according to this invention to provide methods
and apparatus for identifying the printer control language of print jobs
from a sample of the print jobs.
It is still another advantage according to this invention to provide
methods and apparatus for automatically identifying the printer control
language of print jobs which do not annoyingly delay the processing of the
print job nor require the use of hardware that is prohibitively expensive.
Briefly, according to this invention, there is provided a method of
operating a computing system comprising a host computer and a printer. The
printer is arranged to receive an input stream of electrical signals
defining a print job from said host computer. Print jobs are encoded by
computer software applications being executed by the computer in any one
of the plurality of printer control languages. The method comprises
running an application program on said host computer to generate a print
job. The next step is outputting or transmitting the print job to said
printer without special headers or without first activating switches upon
the printer. The next steps comprise sampling a short portion, say from 64
to 512 bytes, of the print job (usually at the start of the print job)
received at the printer and analyzing the sampled portion of the print job
using statistical techniques to identify the printer control language in
which it is encoded. As used herein, "statistical techniques" means
techniques for selecting the distinguishing characteristics of a printer
control language based on the off-line analysis of large sample sets of
print jobs encoded in a given language. The final step is interpreting the
entire
print job in accordance with the printer control language identified by
the sampling and analyzing steps.
A related method of operating a computing system according to this
invention comprises the steps of:
a) gathering samples of many print jobs encoded in various printer control
languages,
b) using statistical techniques, analyzing said samples to build data sets
defining distinguishing characteristics for each printer control language,
c) storing said data sets in said printer,
d) providing means in said printer for capturing a portion (usually the
initial portion) of a new print job and testing said portion against said
data sets to identify the printer control language in which the new print
job is encoded, and
e) printing the new print job using an interpreter or emulation suitable
for the printer control language identified in the preceding step.
There is also provided, according to this invention, an improvement in
computer printers which receive input streams of electrical signals
defining print jobs. The print jobs may be encoded by a specific computer
application in any one of a plurality of printer control languages. The
printer has the capability, usually implemented by a combination of local
digital computer hardware and software, for interpreting each of said
plurality of printer control languages to define a bit mapped image. The
printer further comprises suitable apparatus for converting the bit mapped
image into a visual display of said image. The improvement comprises the
following. The printer is provided with a buffer means for capturing a
portion of the start of any print job. The printer has stored therein an
algorithm and a plurality of statistically derived characterizing data
sets for analyzing the captured portion of the input stream to identify
the printer control language in which it is encoded. Data sets are
provided for each printer control language which the printer can
interpret. In accordance with the control language identified by the
analyzing means, the printer processes the print job in the appropriate
control language.
According to this invention, the characterizing data sets comprise
statistical data reflecting the likelihood of selected n-grams (short
sequences of characters) occurring individually or in combination in print
jobs encoded in a given printer control language. The data sets comprise
lists of selected n-grams and weighted pattern vectors (ordered lists) of
real numbers corresponding to the selected n-grams for each printer
control language. The values in the weighted pattern vectors are
indicative of the likelihood of n-grams occurring or co-occurring in a
given language and the diagnostic value of the n-grams. According to a
preferred embodiment, the data sets comprise a plurality of pattern
vectors for each printer control language and a threshold value
corresponding to each weighted pattern vector. There may be more than one
data set for a given printer control language, for example, based upon
print jobs created by different applications that differently use the same
printer control language. The data within the data sets are weighted by
the ability of n-grams to distinguish a given printer control language
from other printer control languages.
An aspect of this invention is the selection of the particular n-grams to
be used to characterize a particular printer control language. Preferably,
the n-grams for which data is included in the printer control language
data sets do not include sequences of signals representing device
dependent characteristics or parameters, sequences of signals which are
application dependent, and subsets of longer n-grams. Preferably, the
n-grams included in the data sets comprise command sequences which have
correlation with a printer control language.
At run-time, the captured portion of the input stream is analyzed to
develop a sample vector indicative of the presence and frequency of
certain n-grams in the captured or sampled portion of the input stream.
This sample vector is used with the weighted pattern vectors associated
with each printer control language to calculate scores which can be used
to select the correct interpreter for processing the input stream. The
values in the sample vectors correspond to the same n-grams for which data
is included in the weighted pattern vectors. In one embodiment, a score is
computed for each language and the language receiving an extreme score
(highest or lowest, depending upon the details of the calculation) is
selected. A procedure is provided to handle tie scores. A procedure is
also provided to handle an inability to select a language according to the
n-gram patterns found in the sample vector.
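The run-time scoring just described can be illustrated with a minimal Python sketch. It is not the printer-resident algorithm of this invention. The function names, the n-gram list, the weight values and the fallback rule for ties or empty sample vectors are all assumptions made for illustration, and the highest score is assumed to be the extreme score.

```python
from typing import Dict, List


def sample_vector(sample: bytes, ngrams: List[bytes]) -> List[int]:
    """Count (non-overlapping) occurrences of each selected n-gram."""
    return [sample.count(g) for g in ngrams]


def score(vec: List[int], weights: List[float]) -> float:
    """Weighted sum of the sample vector with one weighted pattern vector."""
    return sum(c * w for c, w in zip(vec, weights))


def pick_language(sample: bytes,
                  ngrams: List[bytes],
                  pattern_vectors: Dict[str, List[float]],
                  fallback: str = "text") -> str:
    """Return the language with the highest score.

    Assumption: ties, and samples containing none of the selected
    n-grams, fall back to a default; the source only says that a
    procedure exists for these cases.
    """
    vec = sample_vector(sample, ngrams)
    if not any(vec):
        return fallback
    scores = {lang: score(vec, w) for lang, w in pattern_vectors.items()}
    best = max(scores.values())
    winners = [lang for lang, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 else fallback


# Hypothetical n-grams and weights, chosen only to exercise the code.
NGRAMS = [b"%!", b"\x1bE", b"showpage"]
PATTERNS = {
    "postscript": [2.0, -1.0, 1.5],
    "pcl": [-1.0, 2.0, -0.5],
}
```

With these made-up weights, a sample beginning with `%!` and containing `showpage` scores highest for "postscript", while a sample beginning with the escape sequence `\x1bE` scores highest for "pcl".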
According to a preferred embodiment, for each printer command language a
score is computed based upon the number of occurrences and/or
co-occurrences of selected n-grams in the sampled portion of the print job
as represented by a sample vector with each weighted pattern vector of
data sets for each printer command language until the score computed with
a given weighted pattern vector when compared to a threshold associated
with the given weighted pattern vector indicates the print job is encoded
in the printer command language to which that weighted pattern vector
corresponds.
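The thresholded procedure of the preferred embodiment can be sketched as a loop over (language, weighted pattern vector, threshold) triples that stops at the first vector whose score passes its threshold. This is a minimal sketch. The name `first_match`, the triple layout, the "score greater than or equal to threshold" comparison and the numeric values are assumptions, not details taken from this description.

```python
from typing import List, Optional, Sequence, Tuple

# One discriminant: (language, weighted pattern vector, threshold).
Discriminant = Tuple[str, Sequence[float], float]


def first_match(sample_vec: Sequence[int],
                discriminants: List[Discriminant],
                default: Optional[str] = None) -> Optional[str]:
    """Try each weighted pattern vector in order and stop at the first hit.

    Assumption: a score at or above the threshold indicates the language;
    the direction of the comparison depends on how weights are derived.
    """
    for language, weights, threshold in discriminants:
        s = sum(c * w for c, w in zip(sample_vec, weights))
        if s >= threshold:
            return language
    return default


# Hypothetical discriminants; a language may own more than one vector.
DISCRIMINANTS: List[Discriminant] = [
    ("pcl", [0.0, 2.0], 1.5),
    ("postscript", [1.0, 0.0], 0.5),
    ("postscript", [0.5, 0.5], 2.0),
]
```

Because evaluation stops at the first passing threshold, the order of the discriminants matters. Listing several vectors for one language allows different applications' usage of that language to be recognized separately.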
THE DRAWINGS
Further features and other objects and advantages will become apparent from
the following detailed description of the preferred embodiments in which:
FIG. 1 is a schematic drawing illustrating the organization of a computing
system and a printer with self-selecting of print job interpreter;
FIG. 2 is a diagram illustrating the multiple weighted pattern vectors or
discriminants used when identifying the printer command language of print
jobs; and
FIG. 3 is a simplified diagram illustrating the assignment of weighted
pattern vectors to the decision structure used to implement the process
illustrated in FIG. 2.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring to FIG. 1, there is shown schematically a computer printer 10
connected to a host computer 11 for receiving print jobs through the port
12. The printer may be considered to comprise a printer controller 13 and
a print engine 14. The printer controller receives the print job and
controls the print engine to generate the hard copy 15 of the print job. A
number of types of print engines are known including thermal printers,
laser printers and dot matrix printers. This invention is not specific to
any particular printer or type of print engine.
Printer controllers are electronic circuits usually including a local
microcomputer, including digital processor, memory for storing control
programs and a page memory for storing all or a portion of the bit mapped
definition of the text and graphics to be printed. A portion of the
control programs for the printer controllers are standard interpreters of
established printer control languages or emulations thereof. The details
of the printer controller or the interpreters for the various printer
command languages are not a part of this invention. As shown in FIG. 1,
the printer controller may have stored therein the interpreter or emulator
for more than one printer control language. In this case, there must be
means to select the correct interpreter for an incoming print job.
According to this invention, the controller has a buffered input so that a
short initial portion of the input data stream comprising a new print job,
say 64 to 512 bytes, can be captured and analyzed to determine the printer
control language in which it is encoded. Also, stored in the controller
memory, is an algorithm or algorithms and characterizing data sets that
permit the determination of the correct interpreter or emulation to be
used with the print job.
Practice of the preferred embodiment according to this invention may be
broken down into two stages: off-line training and on-line application.
Off-line training comprises the development of the characterizing data
sets using statistical methods to be stored in the printer controller.
On-line application comprises the run-time use of the data sets and a
suitable algorithm to identify the control language of new print jobs.
Practically speaking, the data sets must be developed and tested on a
computer prior to being installed in and used by a printer.
Off-line training comprises determining the identifying characteristics of
each printer command language and appropriate weights to be given those
characteristics. Print jobs at the most primitive level consist of
sequences of data (character codes and graphic elements) and
command sequences (instructions) in the printer control language. Some
command sequences are unique to a given printer control language and
others are not. During off-line training, a representative sample of print
jobs encoded in each printer control language of interest and jobs in pure
text form are analyzed using statistical methods to empirically derive
characteristic sequences for each printer command language, referred to
herein as n-grams.
Applications programs tend to make use of different subsets of the
available command sequences in a printer command language. This subset
selection is dependent upon the software architecture of the applications
and the use of custom or generic printer driver interfaces. In any event,
it is desirable to obtain samples of print jobs encoded in the various
command languages generated by diverse applications such as word
processors, spreadsheets, graphics packages, page layout utilities,
CAD/CAM packages and other important applications. A variety of samples
from each application in the environment of interest should be gathered.
Samples should be drawn from different application modes (e.g., text,
graphics, mixed), different printer initializations and page setups (e.g.,
portrait, landscape, and some different margins), different document
lengths (e.g., one page, two pages) and other major features. A large
number of samples of each application, release and configuration is
desired to guard against statistical anomalies.
Selecting n-gram Sets
The analysis of the samples may be automated by use of a computer program,
performed by simple examination, or both. Practically speaking,
automation is required for off-line analysis and generation of printer
control language data sets. Characteristic sequences or n-grams are chosen
such that each occurs in a significant portion of the samples for a given
control language, or such that the n-gram is expected to occur in a subset
of the samples not previously considered. To reduce the potential
combinatorial explosion in the number of n-grams found, analysis may be
limited to subsets of the initial set of characters in the sampled print
jobs, for example, the first 64, 128, 256 or 512 bytes, of the print jobs.
A justification for this restriction is that the first few lines of a
print job will tend to perform similar actions (relying on the same
commands) across a wide variety of applications (e.g., prologues,
initializations). In addition, it will also reduce memory storage
requirements and interpreter selection time.
More precisely, n-grams are one or more distinct character codes (perhaps
ASCII character codes) concatenated in a fixed order to produce a string
of length one or more. Some desirable attributes of the characteristic
n-grams include the following:
a) Longer strings are preferred to shorter strings as being more likely to
be unique to a printer command language.
b) Strings that are substrings of other strings should be avoided, where
possible. Substrings may be desired where unique semantics are entailed.
c) Single character strings should be avoided where possible because the
proper weights for these features would require analysis of vast numbers
of samples to eliminate special case bias.
d) Sequences containing printable numeric codes should be avoided as
numerics usually correspond to parameters that are application or site
specific.
e) Proper names, dates or other identification references should be avoided
as application or site specific.
f) Command sequences that have strong correlation to printer command
languages should be utilized.
g) Optionally, upper and lower case characters may be mapped to the same
character code.
h) Manual intervention can be used to fine tune the n-gram selection.
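Several of the attributes listed above (a through d and g) lend themselves to a mechanical filter over candidate strings. The following Python sketch is one possible reading, not the procedure used by the program described herein. The function name and the exact rules (drop single characters, drop anything containing an ASCII digit, drop strings contained in a longer surviving string) are assumptions.

```python
from typing import Iterable, List


def filter_candidates(candidates: Iterable[bytes],
                      fold_case: bool = True) -> List[bytes]:
    """Apply attribute-style filters to candidate n-grams.

    Assumed rules, loosely following attributes a) to d) and g):
      g) optionally map upper and lower case to the same code,
      c) drop single-character strings,
      d) drop strings containing printable numerics (ASCII '0'..'9'),
      b) drop strings that are substrings of a longer surviving string,
      a) return longer strings first.
    """
    pool = {c.lower() for c in candidates} if fold_case else set(candidates)
    kept = [c for c in pool
            if len(c) > 1 and not any(0x30 <= b <= 0x39 for b in c)]
    result = [c for c in kept
              if not any(c != other and c in other for other in kept)]
    return sorted(result, key=lambda c: (-len(c), c))


# Hypothetical candidates for illustration only.
CANDIDATES = {b"%!PS-Adobe", b"%!", b"E", b"\x1b&l0O", b"showpage", b"ShowPage"}
```

Here `b"%!"` is dropped as a substring of the longer `b"%!ps-adobe"`, `b"\x1b&l0O"` is dropped because it contains a numeric parameter, and the two spellings of `showpage` collapse to one entry. Attribute b) notes that a substring may still be wanted where it carries unique semantics, which is where the manual fine tuning of attribute h) would come in.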
A computer program may be used for extracting and identifying a set of
n-grams from the print job samples and producing frequency vectors the
elements of which are the frequency of each n-gram across the sample set
for that printer control language. The program is written to examine a
number of sample files containing the initial snapshot of print jobs.
Given a set of print job samples for a specific printer command language
and a set of options to indicate the useful characteristics of desirable
command sequences, a set of n-grams is derived from the snapshots. The
n-grams are selected according to the frequency of occurrence across a
number of print job samples or based upon patterns of co-occurrence and
their ability to diagnose previously unrecognized subsets of the samples.
So, for example, in a fast method of analysis, n-grams may only be
selected if found in a minimum percent of samples examined. Typically, the
minimum percent chosen is between 50 and 80.
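The fast method described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Analysis program itself: the function name `select_frequent_ngrams`, the n-gram length bounds and the sample snapshots are all hypothetical.

```python
from collections import Counter


def select_frequent_ngrams(samples, n_min=2, n_max=4, min_percent=50.0):
    """Return the n-grams found in at least min_percent of the samples."""
    sample_counts = Counter()
    for sample in samples:
        seen = set()
        for n in range(n_min, n_max + 1):
            for i in range(len(sample) - n + 1):
                seen.add(sample[i:i + n])
        # Count each n-gram once per sample, however often it repeats.
        sample_counts.update(seen)
    cutoff = len(samples) * min_percent / 100.0
    return {g for g, c in sample_counts.items() if c >= cutoff}


# Hypothetical snapshots: two PostScript-like jobs and one PCL-like job.
samples = [b"%!PS-Adobe-3.0\n", b"%!PS-Adobe-2.0\n", b"\x1bE\x1b&l0O"]
selected = select_frequent_ngrams(samples, n_min=4, n_max=4, min_percent=60.0)
```

With a 60 percent minimum, a string such as `%!PS` (present in two of three samples) is retained, while version-specific strings such as `-3.0` are not.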
An alternative correlation algorithm would select n-grams which are highly
diagnostic of large subsets of samples but which do not represent a
duplication of information (low co-occurrence with previously selected
n-grams). Minimum percent requirements do not apply as each candidate
n-gram is evaluated in the context of sample subsets which are not
appropriately represented by the previously selected n-grams.
If an n-gram is highly diagnostic of a particular but rarely used subset of
samples that are difficult to classify, it may be included even though it
is rarely seen.
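One way to read the alternative correlation algorithm is as a greedy covering procedure: each candidate is scored only against the samples that the previously selected n-grams fail to represent. The sketch below is an assumption about how such a selection could work, not the algorithm of the Analysis program; `greedy_select` and `max_ngrams` are hypothetical names.

```python
def greedy_select(samples, candidates, max_ngrams=10):
    """Pick n-grams that diagnose samples not yet covered by earlier picks."""
    uncovered = set(range(len(samples)))
    chosen = []
    while uncovered and len(chosen) < max_ngrams:
        # Score each candidate only on the still-uncovered samples.
        best = max(candidates,
                   key=lambda g: sum(1 for i in uncovered if g in samples[i]))
        gain = {i for i in uncovered if best in samples[i]}
        if not gain:
            break  # no candidate adds new information
        chosen.append(best)
        uncovered -= gain
    return chosen


samples = [b"%!PS-Adobe-3.0", b"%!PS-Adobe-2.0 EPSF", b"%%BoundingBox: showpage"]
candidates = [b"%!PS", b"Adobe", b"showpage"]
chosen = greedy_select(samples, candidates)
```

Here `Adobe` is never chosen because it always co-occurs with `%!PS`, whereas `showpage` is chosen even though it appears in only one sample, since it is the only candidate that diagnoses that sample.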
Co-occurrence probability or conditional probability is defined to be the
probability that a given string A can be found in a sample given that
another string B also occurs. The probability is defined as occ(A&B) /
occ(B) where occ(X) is the integer number of samples where the event X is
true. This information is gathered by cycling through the current set of
samples, and computing all frequencies of occ(B) and occ(A&B).
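The definition occ(A&B) / occ(B) can be computed directly by cycling through the samples. A minimal Python sketch follows; the function names and the sample data are illustrative, and returning 0.0 when B never occurs is an assumption, since the text does not address that case.

```python
def occ(samples, *strings):
    """Number of samples that contain every one of the given strings."""
    return sum(1 for s in samples if all(x in s for x in strings))


def cooccurrence(samples, a, b):
    """P(A | B) = occ(A&B) / occ(B); 0.0 by convention if B never occurs."""
    occ_b = occ(samples, b)
    return occ(samples, a, b) / occ_b if occ_b else 0.0


samples = [b"%!PS-Adobe showpage", b"%!PS-Adobe", b"showpage", b"\x1bE"]
p = cooccurrence(samples, b"showpage", b"%!PS")  # 1 of the 2 "%!PS" samples
```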
A program named "Analysis" has been written to select the n-grams from
samples of a given command language. This program examines the beginning
of sample data files in order to locate commonly used data patterns and
counts the number of occurrences. A number of options may be provided to
control how much effort the program will exert to find character patterns.
The program seeks to select diagnostic sets of n-grams based either a)
upon frequency of occurrence, co-occurrence probabilities of the n-grams
and string lengths or b) according to the frequency of occurrence across
samples. Analysis makes use of procedures Processfile (which in turn makes
use of Patternscan and Wordscan), Trim_by_options,
Trim_percent and Selectstrings, which are all described herein.
##SPC1##
Assigning Weights
It would be desirable if the n-grams described features which are unique
command sequences for each candidate printer control language.
Unfortunately, this is not always possible. Hence, it is necessary to
assign weights to n-grams or combinations of n-grams to designate their
ability to distinguish one language from all others. The set of n-grams
and weights attached to each enable the differentiation between print jobs
encoded in a given language from all others.
Two methods of assigning weights are herein disclosed. The method used
depends upon the confidence in the set of n-grams derived and the
strictness of the requirement to make the correct selection. The method of
assigning weights is directly related to the run-time method of selecting
the correct interpreter to be used by the printer.
In the first method, a single pattern vector is generated which uses the
relative observed occurrences of n-grams and other features to compute a
single weight for each n-gram. For example, the weight (w.sub.s) of each
n-gram is computed as follows:
$$w_s = \frac{n_s}{n_a + 1} \cdot l_s^{2}$$
where
$n_s$ = the number of occurrences of the n-grams,
$n_a$ = the total number of occurrences of all n-grams in all samples, and
$l_s$ = the length in characters of each n-gram.
The above weighting equation was determined by empirical evaluation of a
number of sample printer command languages and example print jobs as
providing a reasonable way to balance the frequency of occurrence of
useful n-grams against the diagnostic importance of longer strings. It is
not the only possibility, since the improvements in performance observed
may be an artifact of the languages studied. In addition, certain command
sequences recognized as unique to one printer command language may be
assigned added weight. The result of applying this calculation for
a printer command language is a single weight pattern vector (ordered
list) of real numbers containing one value for each of the n-grams
diagnostic of that language. This vector may be used in a single vector
statistical frequency method to identify print jobs.
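As a worked illustration of the single-vector weighting equation, the following Python sketch computes one weight per n-gram from occurrence counts. The function name and the counts are hypothetical; only the formula, occurrences divided by (total occurrences + 1) and multiplied by the squared string length, comes from the text.

```python
def single_vector_weights(ngram_counts):
    """Weight each n-gram as (n_s / (n_a + 1)) * l_s**2."""
    n_a = sum(ngram_counts.values())  # total occurrences of all n-grams
    return {g: (n_s / (n_a + 1)) * len(g) ** 2
            for g, n_s in ngram_counts.items()}


# Hypothetical counts: "%!PS" seen 6 times, "showpage" seen 3 times.
counts = {b"%!PS": 6, b"showpage": 3}
weights = single_vector_weights(counts)
```

With these counts the shorter but more frequent `%!PS` receives a weight of about 9.6, while the longer `showpage` receives about 19.2, which shows how the squared length term favors longer strings.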
It may be the case that a single linear discriminant (i.e., a mathematical
function based upon a single vector of weighted values) is not sufficient
to differentiate print jobs. This may be due, for example, to the
interaction between the characteristic n-grams selected. For this reason,
a second method involving a more complex weighting scheme which produces
multiple weighted pattern vectors for each printer command language has
been implemented. Essentially, a tree of weighted pattern vectors is
provided for each command language which will determine that the
candidate language should be selected because all other languages are
eliminated or that the determination cannot be made. This process
continues until a select or reject decision is determined for the current
language. An algorithm to handle ties may also be provided. A procedure is
provided to handle the inability to select a language based on the n-gram
patterns in evidence. The process is illustrated schematically in FIG. 3
for the current language. The training set may be developed from the
original samples for n-gram analysis and other samples provided for this
specific purpose.
In order to use this method, a training set for each printer command
language to be considered must be formed. Each training set consists of a
set of positive samples (samples known to be encoded in the command
language) and a set of negative samples (samples known to be encoded in
other command languages or as pure text). Each sample is mapped or
represented as a vector of the frequency of occurrences of each of the
n-grams selected for the language for which the vector tree is to be
produced and a "similarity measure" of 1.0 for a positive sample and 0.0
for a negative sample. A linear discriminant (in effect, an equation
defining a hyperplane in multidimensional space) which best separates a
subset of the positive samples from the remainder of the samples in the
training set is computed (defined by a vector with values or weights for
each n-gram and a threshold value). The process of choosing a hyperplane is
shown schematically with reference to FIG. 2 which, of course, can only
show the trivial two-dimensional case. If the first found hyperplane
separates all positive and negative samples, no further separation is
required. If not, a further hyperplane is found that separates the
remaining subsets of positive and negative examples. This process is
repeated until all subsets bounded by hyperplanes contain only one set of
positive or negative samples. This process results in a plurality of
numeric decision vectors and thresholds.
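The mapping of a sample to a frequency vector and the test of that vector against one decision vector and threshold can be sketched as below. This Python sketch shows only how a single, already computed linear discriminant is applied; it does not show how the hyperplanes are found. The n-grams, weights and threshold are hypothetical values chosen for illustration.

```python
def sample_vector(snapshot, ngrams):
    """Frequency of occurrence of each selected n-gram in the snapshot."""
    vector = []
    for g in ngrams:
        count = sum(1 for i in range(len(snapshot) - len(g) + 1)
                    if snapshot[i:i + len(g)] == g)
        vector.append(count)
    return vector


def discriminant(vector, weights, threshold):
    """True if the sample lies on the positive side of the hyperplane."""
    return sum(v * w for v, w in zip(vector, weights)) > threshold


ngrams = [b"%!PS", b"\x1bE"]   # hypothetical selected n-grams
weights = [1.0, -1.0]           # hypothetical decision vector
threshold = 0.5                 # hypothetical threshold value

positive = sample_vector(b"%!PS-Adobe-3.0", ngrams)   # similarity measure 1.0
negative = sample_vector(b"\x1bE\x1b&l0O", ngrams)    # similarity measure 0.0
is_pos = discriminant(positive, weights, threshold)
is_neg = discriminant(negative, weights, threshold)
```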
The result of the weighting process (either method) is a prototype for each
language. Computer programs to be run off-line assist in the calculation
of the weighted vectors. The computer program for weighting the single
weighted vector is straightforward. The computer program for the second
(multiple vector) weighting scheme is, of course, more complicated. In an
implementation of this weighting process, three separate computer programs
are useful. The first program determines the mapping between selected
n-grams of the languages and sets of positive and negative print jobs. The
next program implements a version of the Athena classifier algorithm of C.
Koutsougeras. (See Israel, P. and Koutsougeras, C. (1989) "Associative
recall based on abstract object descriptions learned from observations:
The CBM neural net model," Proceedings of the IEEE workshop on Tools for
AI, Fairfax, Va. (October 1989).) The source code for the version of the
Athena algorithm used by the applicants, called Freya, is set forth in
the microfiche appendix.
Given a set of sample vectors representing positive and negative samples, a
tree of weight vectors is derived. The tree allows evaluation of
combinations of n-grams or "higher-order" attributes and their relations
to the positive and negative samples. A final program assembles the list
of n-grams selected for a given language and the tree of weighted vectors
into a "language prototype" for use in the run-time environment. Using the
tree of weights and a sample vector from a new print job, it is possible
to infer whether the new print job is encoded in a print language
represented by a weighted pattern vector tree.
One set of n-grams and corresponding weighted vectors may be insufficient
to characterize all (or most) of the print jobs encoded in a given
language. It is possible that several alternative sets of n-grams and
weighted pattern vectors may be necessary or desirable to accurately
identify the printer command language of print jobs encoded in a given
language for the following reasons: 1) A single set of n-grams may be too
large and unwieldy for "efficient" on-line application. 2) Some n-grams
may be useful diagnostic factors but are not sufficiently unique to a
single language. These sequences may be encoded but the degree to which
they contribute to the certainty of the identification of the language is
limited. 3) Different subsets of the command sequences may be used in
different environments. Specialization of the set of diagnostic n-grams
may improve performance in making decisions. 4) An n-gram may be added
after encountering valid, but rare examples of command sequences. For
these reasons and others, it is occasionally necessary to build more than
one language prototype for a single printer command language.
Language Selection
Once language prototypes have been developed for each of the printer
command languages to be considered, this information may be used to make a
selection of the language which is the nearest match to a new print job.
This is an iterative process. For each available language prototype and
with a sample vector derived from the snapshot of a new print job, a
similarity score is calculated. The language with an extreme score is said
to "win" and the interpreter or emulator for that language is used to
interpret the new print job. In more technical terms, a pointer to the
interpreter for the winning language is returned by the iterative
procedure that calculates the scores. It may be necessary to handle "ties"
when two languages receive the same or a very similar score. In such a
case, the ties can be resolved by reference to an auxiliary priority
assigned to each language (stored with the
language prototype). Other strategies for resolving ties might comprise
selecting the language with the "highest wins," "most recently used,"
"most frequently used," "first-come-first-served," or other method
specified by the user.
Finally, the language prototypes may not provide coverage of all
distinguishing n-grams of the languages considered. In such a case, it may
not be possible to make a choice between languages. An error value must be
returned by the iterative procedure that calculates the scores instead of
a pointer to an interpreter. The procedure that uses this language
inference procedure will have to detect the error value and initiate
alternative processing, such as using an alternative algorithm to select
an interpreter (e.g., default value) or it may elect to reject and ignore
the print job (possibly after issuing a warning message).
A pseudo-code representation of a language selection algorithm is set forth
hereafter.
______________________________________
function LanguageSelect (L : LangChars; J : PrintJob)
/* COMMENTS: The input parameters are L, a pointer to an
   indexed array of the language prototypes, and J, a
   pointer to the snapshot of the new print job. The
   function returns an integer pointer. This function calls
   two others, TestLanguage and WinTie. TestLanguage is
   described in more detail hereafter. Note: := means
   assignment; == means equals; != means does not equal. */
vars
  ptr      : pLanChars
  i, ret   : Integer
  scr, max : Real
begin
  ret := FAILURE_VALUE
  max := 0.0
  for i := 1 to (# of LangChars) do
    ptr := L[i]                    /* Get next prototype */
    scr := TestLanguage(ptr, J)    /* Call function to compute
                                      similarity score */
    if scr > max                   /* Handle clear wins */
    then
      max := scr
      ret := i
    else
      if ( (ret != FAILURE_VALUE) and
           (scr == max) and
           (WinTie(ret, i)) )      /* Call function to resolve ties */
      then
        ret := i
      endif
    endif
  endfor
  return ret                       /* Return pointer */
end LanguageSelect
______________________________________
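For readers who wish to experiment, the selection loop above can be transcribed into runnable Python as follows. This is a sketch under stated assumptions: `None` stands in for the failure value, the scoring and tie-breaking functions are passed in as parameters, and the prototypes, weights and priorities shown are hypothetical stand-ins for TestLanguage, WinTie and the stored language prototypes.

```python
FAILURE_VALUE = None  # assumption: None plays the role of the error value


def language_select(prototypes, job, test_language, win_tie):
    """Return the index of the winning prototype, or FAILURE_VALUE."""
    ret = FAILURE_VALUE
    best = 0.0
    for i, proto in enumerate(prototypes):
        scr = test_language(proto, job)      # similarity score
        if scr > best:                       # handle clear wins
            best, ret = scr, i
        elif ret is not FAILURE_VALUE and scr == best and win_tie(ret, i):
            ret = i                          # challenger wins the tie
    return ret


# Hypothetical prototypes with an auxiliary priority for resolving ties.
prototypes = [
    {"name": "PCL", "weights": {b"\x1bE": 4.0}, "priority": 1},
    {"name": "PostScript", "weights": {b"%!PS": 16.0}, "priority": 2},
]


def score(proto, job):
    """Stand-in for TestLanguage: sum the weights of the n-grams present."""
    return sum(w for g, w in proto["weights"].items() if g in job)


def prefer_higher_priority(current, challenger):
    """Stand-in for WinTie: the challenger wins if its priority is higher."""
    return prototypes[challenger]["priority"] > prototypes[current]["priority"]


winner = language_select(prototypes, b"%!PS-Adobe-3.0\n", score,
                         prefer_higher_priority)
unknown = language_select(prototypes, b"plain text", score,
                          prefer_higher_priority)
```

A caller would check the result against the failure value and, when no language scores above zero, fall back to a default interpreter or reject the job, as described in the text preceding the listing.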