|
Description  |
|
|
FIELD OF THE INVENTION
This invention relates to text compression and expansion methods and
devices in general and specifically to the class of digital to digital
converters such as those found in the patent classification Class 340,
Sub. 347.
PRIOR ART
A wide variety of prior patents and published papers exist in this field.
U.S. Pat. No. 4,545,032, for example teaches one of the basic techniques
of a dictionary or tabular conversion in which numerical codes for the
basic English language vocabulary are utilized. However, the words are
broken up into prefixes, suffixes and stems and it does not appear that
weighted ranking of the dictionary entries or tabular entries was
considered or applied. Also it does not appear that the concept of being
able to switch between different tables or dictionaries based upon the
field of usage was considered at all.
U.S. Pat. No. 4,597,057 has another typical example of a compression
technique in which standard ASCII coded text may be divided into alpha,
numeric and punctuation elements as "words" with the "words" being divided
into prefixes, suffixes and stems. The prefixes, suffixes and stems
utilize numeric encoding in a fashion similar to the aforementioned U.S.
Pat. No. 4,545,032.
U.S. Pat. No. 4,295,124 is an earlier example of a dictionary or tabular
look-up type of conversion for applying numeric codes to English text
characters. The input ASCII encoded textual words are utilized to generate
a second representative code by hashing. This code can be utilized as the
memory address for comparison against a prearranged dictionary or memory
of textual words. When a match is found, the hashing address is sent as an
identifier. If a match is not found, an auxiliary dictionary is built up
but the word is transmitted uncompressed the first time it is encountered.
The next time the same word is encountered, both dictionaries are checked
and if the word is found in the second built-up dictionary, its hashing
address is sent. While this description appears to be an efficient and
useful technique, it does not employ the flexibility of numerous
dictionaries that can be carefully selected for the field usage of the
text and does not appear to employ a weighted figure of merit for
assigning entries into the dictionaries.
U.S. Pat. No. 4,386,416 is similar to the aforementioned U.S. Pat. No.
4,295,124 but utilizes an escape code to indicate whether a code from a
memory library address or a differently encoded representation of words
not found in the library are utilized. Again, no weighted figure of merit
for the dictionary entries appears to have been employed nor is the
technique of utilizing a preamble or header to identify the particular
subset of dictionaries employed in a given compression of text indicated.
The basic technique of encoding utilizing dictionaries for prefixes,
suffixes and stems of words is found in numerous references such as the
Proceedings of the IEEE, Vol. 55, No. 3, March, 1967, pages 390-396 in an
article by H. E. White entitled, "Printed English Compression by
Dictionary Encoding." A compression of only about two to one was found in
this paper and it is found to be generally similar to those of the first
two mentioned patents above.
The use of multiple tables or dictionaries itself is not totally new to the
art. An example may be found in the article by J. Pike, "Text Compression
Using A Four-Bit Coding Scheme," appearing in the Computer Journal, Vol.
24, No. 4, 1981, pages 324-330. However, the article fails to recognize
that a truly flexible system utilizing a variety of dictionaries and a
header or preamble in the compressed text to indicate which selection of
dictionaries has been made for the given compression could be utilized.
Further, the use of weighted entries for selection of the words to
comprise the given dictionaries to nowhere described.
An early paper entitled, "Experiments in Text File Compression," by Rubin
appearing in the Communications of the ACM, November, 1976, Vol. 19, No.
11, pages 617-623 treates the subject of weighted frequency of occurrence
as a figure of merit for ranking words for dictionary construction for
text compression. However, based on the experiments conducted by that
author, it was concluded that frequency was the best method for maximizing
the savings in choosing input groups, i.e., dictionary entries. This
failed to take account of the possibility that different rankings would
occur in different dictionaries for the same word and that switching
between dictionaries that are better suited for a given text is a
possibility at all.
An early paper showing the use of escape characters to effectively extend
the size of the dictionary or change from the use of one dictionary to
another is in the article entitled, "A New Technique for Compression and
Data Storage," by Hahn appearing in the Communications of the ACM, August,
1974, Vol. 17, No. 8, pages 434-436. This article also shows the dynamic
method of adding new or unknown symbols or words encountered to the
dictionary. However, character groups rather than words are utilized and
the author notes that the experiments with the technique showed somewhat
lesser degree of compression than word codes for English text compression.
An overview paper dealing with several methods but not describing the use
of the weighted word frequency technique nor the use of a preamble or
header to indicate the use of multiple dictionaries is found in the IEEE
Transactions on Software Engineering, Vol. SE6, No. 4, July, 1980, pages
340-347 in an article entitled, "Overhead Storage Considerations and a
Multi-Linear Method for Data File Compression," by Young et al. A similar
overview is found in the Computer publication of April, 1981, pages 71-75
in an article entitled, "An Overview of Data Compression Techniques," by
Reghbati.
Finally, an article by Weiss et al appearing in the Journal of Library
Automation, Vol. 11, No. 2, June, 1978, pages 97-105 presents a limited
dictionary scheme utilizing an escape code or dictionary indication code
to show to the decompression system when a compressed word has been taken
from a dictionary file and to indicate, by its absence, that the word is
not compressed and is transmitted to ordinary ASCII or EBCDIC characters.
Pointers for escape characters identify themselves as pointers and
indicate the dictionary entry for the word that they represent. Source
files of EBCDIC or ASCII encoded text are read one word at a time and the
words are compared against the dictionary. If the word is not found, the
word and any trailing delimiter such as punctuation marks or spaces are
written unchanged into the compressed file. If the word is found, a
pointer indicating that the word was found in the dictionary and pointing
to the dictionary location is written to the compressed file. This is
essentially the same technique as the aforementioned U.S. Pat. No.
4,386,416 which differs only in the terminology employed to describe this
method. The use of multiple dictionaries and a weighted frequency of
occurrence method of assigning entries are not described in this article.
From all of the foregoing, it is apparent that while much work has been
done on text compression methods, techniques and systems, the
investigators to date have avoided the use of mutiple extensive
dictionaries. This may have been due to the complexity of handling the
data storage requirements and in switching among dictionaries.
Additionally, early studies of the weighted frequency of occurrence method
of assigning entries to the dictionaries have found, contrary to the
findings of the present invention, that the weighted frequency of use for
dictionary entries is not as beneficial as pure frequency of occurrence.
The best known compression ratios from the above-noted prior art appear to
be in the range of four or five to one, but we have unexpectedly
discovered that by the use of multiple dictionaries with their entries
being selected on the basis of weighted frequency of use, compression
ratios on the order of more than six to one to as high as eight to one are
routinely achievable.
OBJECTS OF THE INVENTION
In view of the shortcomings of the above-noted prior art, it is an object
of this invention to provide an improved text compression and
decompression technique of the dictionary look-up type in which the
dictionaries themselves have their entries arranged by a weighted
frequency of use for the words encountered.
Yet another subject of this invention is to provide an improved text
compression and expansion technique in which multiple dictionaries are
employed according to the specific field of use of the text being
compressed and in which a preamble or header is utilized in the
compression for indicating which selection of dictionaries has been made
for compression.
These and still other objects of the invention not specifically enumerated
are met in a preferred embodiment of the invention as will be further
described with reference to a preferred embodiment as depicted in the
drawings herein.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic illustration of a text compression and decompression
system in accordance with the invention.
FIG. 2 is a schematic illustration of the assignment of memory space into
segments containing the actual dictionaries employed, an index for each
dictionary segment and dictionary segment maps as well as control tables
for use in compression and decompression.
FIG. 3 illustrates schematically the decompression logic flow practiced by
the processor in FIG. 1.
FIGS. 4A-1, 4A-2, 4A, 4B, 4C, 4D, 4E, and 4G illustrate a fully augmented
compression routine flow chart as is implemented in the microprocessor of
FIG. 1 for compression of text.
FIG. 5 illustrates a typical example of the format of a compressed text
string utilizing the techniques of the present invention.
FIGS. 6A and 6B are schematic drawings illustrating the assignment of the
bit locations in the header or preamble section of a compressed data
record which enables interpretation by the receiver or text expansion
system for selecting the appropriate dictionaries and control tables for
expansion of the compressed text.
BRIEF SUMMARY OF THE INVENTION
In the present invention, the foregoing objects are met by providing a
plurality of language use specific dictionaries whose entries of words are
ranked in a weighted frequency of usage ranking based on statistical
studies of the areas of use employed. For example, words such as "docket"
or "versus" or "case" will appear much more frequently in legal texts than
in normal English usage. Similar professional jargon is found for other
fields as well, engineering, business, accounting, medicine, agriculture,
petro-chemicals, etc., etc., ad infinitum being possibilities. In the
present invention, the user of a text compression and decompression system
builds up dictionaries that are custom-tailored to the field of use. This
is done by utilizing a scanning and analysis technique that incorporates
counting both the number of characters in each unique word and the number
of occurrences of the word within the general usage over a sample of texts
from the user's environment. Multiple such dictionaries can be constructed
and applied to maximum beneficial effect to achieve a high degree of
compression for an individual user. A preamble or header is constructed in
compressed text to indicated the user's selection of the appropriate
dictionaries actually employed in doing the compression for a given text
source input. The header is employed to instruct the receiver as to which
dictionaries and control tables are to be loaded into memory for use in
the decompression routine. Simple, single-byte entries in a control table
can be employed for indicating the source in compressed text of each
individual word or group of words when the words are found outside of a
single byte control table itself. A hierarchy of dictionaries ranging from
the single byte control table which contains the control codes and
dictionary segment indicators as well as a list of somewhere between 80
and 224 frequently encountered words is the highest priority dictionary
for searching for compression or expansion. Two and three byte encoded
dictionary entries where the first byte indicates a specific dictionary
segment in memory and the second byte indicates an offset or memory
address position within the segment are also employed. The system may be
implemented on a typical microprocessor computer system that has a bulk
storage such as disk or cassette and a normal size working read/write
memory into which the various control tables and dictionaries can be
assembled for use by reading them from a bulk storage mechanism. Typical
flow charts for compression and decompression are included and will be
described in greater detail.
DETAILED SPECIFICATION
The compression technique of the present invention allows more data to be
stored on a given medium and/or it allows more data to be sent in reduced
time over a transmission link between any two locations. The normal
compression ratio achieved may be as high as 6 to 8 although, when the
effect of adding in the transmission header and dictionary indicators is
taken into account, the overall compression ratio falls slightly. In a
data processing organization, the delay for hard copy documentation is
excessively costly. In many such locations, the need for sharing large
quantities of information between additional remote locations results in
investment in extensive communications networks. The growth of the
networks and the major expense to the organizations employed is a factor
dictating the necessity of developing efficient text compression and
expansion methods and systems. In addition, visual displays and desk top
computers have proliferated and a major part of information processed is
textual data that is presented on the screen. Therefore, any effective
compression technique that will minimize the amount of storage space
and/or transmission medium employed will be of major benefit. Numerous
compression techniques have been developed as alluded to earlier under the
prior art section of this application and most of these techniques result
in an overall compression factor in the range of 1.5 or 2 to 1.
The present text compression technique in the invention at hand is one that
addresses the degree of repetitive occurrence and the length of certain
words in the basic language of the persons generating and utilizing the
text. Many options are provided. The options may be tailored to specific
environments and to specific documents themselves by selecting the basic
dictionary type employed. Either a predefined existing dictionary or
dictionaries that have been generated especially for text compression or
dictionaries that are created on-line for the individual text itself may
be utilized in combination or separately. The dictionaries are arranged on
byte wide boundaries of addresses to simplify processing. Those
dictionaries that have a single byte address utilize the highest degree of
compression but are also of limited length since only 256 different
entries can exist for a given 8-bit byte. Two byte compression word
dictionaries enable a much greater range of words to be defined, i.e.,
65,536 words or two byte entry addresses are possible. Extension to a
three byte address makes possible a maximum of 16,777,216 entries which is
far in excess of all words encountered in the English language.
Dictionaries for graphics displays and audio playback are also permitted
since these are merely digital representations of specific signals for
output by graphic or audio devices. Obviously, the technique is not
limited to English language. Virtually any character-spelled language may
be similarly treated by our methods. English is only used for convenience
herein in explaining our invention.
The basic approach of the preferred text compression technique is to
substitute a 1, 2 or 3 byte address pattern for actual English words found
in a given document. The word "document" as used herein is used to define
encoded textual data, graphics data or audio data that is created by or
utilized by human operators. Key stroke entries may be recorded on a disk,
a tape or the like or stored in memory as a source text of the typical
English language. Such source text is actually a series of ASCII coded
alphanumeric characters, space characters, punctuation marks, control
characters, such as capitalization, case shift, line feed and the like,
etc. The assignment of the dictionary word entries is based on their
relative weighted merit according to their frequency of use and length.
The relative weighted merit is based upon the length of each word and the
frequency of its occurrence in a specific document or within a specific
environment such as law, engineering, medicine or the like. The limitation
of the compressed bit patterns to lengths of 1, 2 or 3 bytes which would
be 8, 16 or 24 bits, makes implementation of these techniques on byte
oriented computers available today highly feasible.
The basic compression technique will utilize at least one and usually
several dictionaries. The most basic and compact dictionary is the 256
entry single byte compression table. This table is generated to contain
the control patterns necessary to decode indications telling which
dictionaries are being employed in a given compression, the most
frequently occurring words in the environment of concern, and any
necessary control patterns or signals. Single byte address compression is
obviously limited for 8 bit bytes to 256 unique binary patterns. Some of
these patterns must be reserved for control sequences and for definitions
of the individual dictionary segments being employed in a given
compression. The single byte compression table or memory will thus be
limited to words which have the highest relative merit and which will fit
into the 256 possible patterns minus the number of reserved control
patterns. This approach may be used for small documents to great
advantage. Small documents are those consisting of only 2 to 3 pages, for
example.
If a 2 byte compression dictionary or memory is constructed, it will have a
possible maximum of 65,536 entries defined. However, one bit must be
reserved to indicate that a two-byte pattern is used, so only 32,768
actual dictionary entries are then possible. Viewed differently, 128
segments of 256 words each may be encoded. In many cases, however, the
normal vocabulary in constant use for a specific group such as an
accountant's or lawyer's office would be only a few thousand words. If the
maximum basic dictionary size of 2 bytes or 65,536 possible entries is
selected, at least one bit of the first byte would have to be used to
identify the fact that a 2 byte encoding pattern is employed because the
decompression routine must know how to group the compressed bytes either
singly in 2's or in 3's. Thus, instead of 256 possibilities for the first
byte, only 128 patterns are left in the first byte alone and a total of
128 sections of 256 patterns each can actually be selected. One hundred
twenty-eight 8 bit patterns must be reserved in the control table to
indicate which 256 word segment of the 2 byte dictionary is being employed
in each instance. Thus, the first "byte" of the 2 byte address is taken
from the control table. This will leave, of the original 256 possible
patterns in the single byte control table, 128 patterns that may be
employed for control sequences, high relative merit words and the like.
If the basic dictionary size is arbitrarily limited from 32,768 to only
one-half that number, i.e., 16,384, then 192 bit patterns will be left
available in the single byte table for the control sequences and high
relative merit words. If a basic 2 byte dictionary size of only 8,192
patterns is selected, a likely and useful choice, then 224 bit patterns
will be left over in the single byte table for control sequences, encoding
format indicators and high relative merit words.
In general, the number of bit patterns left in a single byte control table
out of the 256 original possible ones will be 256 less the basic
dictionary size selected for containing the words in the 2 or 3 byte
dictionaries that may be employed, divided by 256. That is, the manner of
encoding the data in this invention is by direct table entries but
indications must be placed in the compressed text to indicate which table
is employed. A basic control table of 256 possible entries must have
reserved in it pattern for ever 256 entries in the 2 byte dictionaries
that may be employed and one pattern for every 65,536 entries in each
three byte dictionaries employed. These are termed "dictionary segments"
and a unique identifying pattern is reserved in the control table for each
segment of each dictionary employed in each text compression example. Of
course, the control table could be expanded to be a 2 byte table but this
would have a negative effect upon the overall compression ratio. In most
cases, higher compression factors can be achieved by selecting a smaller
number of entries for the basic word dictionary size. Many locations or
organizations have an actual working vocabulary of only a few thousand
words. Therefore, the compression factor can be optimized by selecting a
basic working vocabulary dictionary in the area of 8,192 words (1/4 of the
possible size of a 2 byte pattern dictionary). This will require only 32
reserved bit patterns for the single byte control table since 32 times 256
equals 8,192. In the example just given, 224 possible entries would remain
out of the 256 available originally in a single byte control table. The
224 entries could be advantageously assigned to high frequency of use
words, control characters and alphanumeric characters if desired.
This brief description of the flexibility inherent in the present invention
leads to a consideration of how the complex set of possible dictionaries
selected for a given text compression can be communicated to the
decompression or expansion user. This is accomplished in the present
invention by utilizing a preamble or a header to the compressed text.
FIG. 6 illustrates schematically the header encoding format which is
employed in the present invention. Line A illustrates the start of the
header with byte 0 being identified. Bits AA are utilized in the present
invention to define the type of single byte compression table that is
being used. Two types are possible. Either an existing default value
single byte compression table is employed which consists of a prearranged
set of assignments for the 256 possible patterns in a single byte table or
an alternative existing single byte compression table is to be used and
may be found on a predefined external storage medium. Alternatively, a
single byte compression table will actually be provided within the
compressed text byte string or a dictionary of two or three byte address
size could be transmitted. These four possible alternatives are encoded by
setting the value of bits AA as follows: if bits AA are 00, a default
single byte compression table is used in the remainder of the text
compression string which follows the header until a new header appears. If
the bit values for AA are 01, a reserved unassigned alternative is
employed. This pattern could be used to indicate that the dictionary to be
used will be transmitted in the compressed text, Bits 10 for AA define an
existing single byte compression table is to be used which should be found
by the user on the predefined external storage medium, such disk or tape.
An example would be that a single byte compression table which has been
optimized for cancer specialists in the medical field, for example, is to
be employed for a given text compression. The final alternative where bits
AA are 11, is used to indicate that a single byte compression table has
actually been generated and will be supplied within the compression string
which follows. The remainder of the bits B through G in byte 0, line A,
FIG. 6, are utilized as indicators to define which types or choices of
other word dictionaries are employed in the compression routine for the
given stream of compressed text that follows the header.
For example, bit B is defined in the present invention to be the indicator
for an extended 3-byte wide address dictionary. If the value of bit B is 0
then no 3-byte dictionary is employed and if it is a 1 then some
defineable 3-byte dictionary is employed. The definition of which 3-byte
dictionary, if any, is employed is treated later in the header format as
will become apparent.
Bit C through bit G are each individual bit indicators used in the same
fashion as bit B but to indicate instead the usage of alternative 2-byte
wide address dictionaries. In the example shown, five different 2-byte
wide address dictionaries could possibly be employed.
The remainder of the bytes in the header and the manner of employing them
in conjunction with the indicator bits in byte 0 will now be given.
Byte 1 will either be the start of compressed text or it will be
interpreted as a control table number if an individual control table other
than the default condition control table is to be employed. Byte 1 will be
null if bits AA in byte 0 are either 00 or 11. If bits AA equal 10, then a
control table number of eight bits' width will appear in byte 1 to
identify a specific control table to be accessed by the decompression
system from some external storage medium. If bits AA equal 11, then byte 1
will actually be the start of a supplied control table in the compressed
text stream and byte 1 through byte n will be the supplied control table
which will be the sum of the length of all the supplied words which are to
be assigned in the control table plus 1,024 more bytes. One thousand and
twenty-four is equal to 256 times 4, where 256 four-byte patterns are
written. Byte 1 of each group of 4 bytes is a length indicator of a
possible 256 byte sequence. The next 3 bytes in each group of 4 bytes are
the sequence number of the word in a supplied dictionary string. Thus, the
control table supplied, when appropriate, by bytes 1 through n will
contain first, all of the supplied word entries and second, their
assignment within the 256 byte array of a 1 byte control table
arrangement. Each assignment requires 4 bytes where the first byte in a
group of 4 indicates the lengths in bytes up to 256 bytes long of the
entry and the next 3 bytes out of a group of 4 indicate the sequence
number of the actual word which should be associated with the table entry
when the word is supplied in the dictionary string that follows the
control table definition portion.
In FIG. 6, line B, 3 byte dictionary table definition segments for the
header will begin. At byte n+1, 3 byte dictionary definitions will occur
based on whether byte 0 bit B was a 0 or a 1. For example, if bit B of
byte 0 was 0, then fields BA, BB and BC in bytes n+1 and n+2 in FIG. 6,
line B are null, i.e., no bits are present since no 3 byte dictionary is
employed that needs to be defined in greater detail. However, if byte 0,
bit B was equal to a 1, then field BA is 1 bit wide and field BB in byte
n+1 is 7 bits wide. Byte BB is interpreted as the number of 65,536 word
segments of a 3 byte dictionary that are employed. Field BC, which is byte
n+2, will be the number indicating which 3 byte dictionary was used. Field
BA defines whether fields BB and BC are necessary at all. If field BA is
0, then BC is null or no bits since no specific 3 byte dictionary needs to
be defined, i.e., a default condition or a prearranged 3 byte dictionary
will be used. If, however, field BA is set to 1, then field BC will be the
identifying number of the selected 3 byte dictionary to be utilized.
Continuing with the convention of establishing the header format, byte N+3
is a definition section for the first possible 2 byte auxiliary dictionary
which may or may not be employed. Byte 0 bit C defines whether or not a 2
byte dictionary of one type is to be employed. If byte 0 bit C is equal to
0, then the fields CA, CB, and CC in bytes n+3 and n+4 are all set to 0
since no further definition is necessary. If, however, byte 0 bit C is a
1, then field CA will be 1 bit and field CB will be 7 bits in width. Bit
CA is an indicator which defines whether a default 2 byte dictionary or a
specifically to be defined 2 byte dictionary has been employed. If CA is
set to 0, then a default condition or prearranged 2 byte dictionary is
used and byte n+4 field CC, which would ordinarily by the specific number
for a given 2 byte dictionary, does not need to be given at all since the
default condition is invoked. If, however, field CA is equal to 1, then
field CD is the number of 256 word long segments that have been utilized
for the first 2 byte dictionary being defined and field CC is the
identifying number for that 2 byte dictionary.
This format or convention continues in the same fashion for byte n+5
through byte n+12 as shown in lines B, C and D of FIG. 6 in accordance
with the status of bits D, E, F and G in byte 0, line A of FIG. 6 as just
described. If all five possible indicators for 2 byte dictionaries in byte
0 are employed, then the start of actual compressed text will not begin
until byte N+13 in line D of FIG. 6 and will continue for an indeterminant
number of bytes until the end of text occurs at byte n+n in line D of FIG.
6.
Several illustrative examples of the first byte, byte 0, in a given header
will now be described. Case 1: let us suppose that byte 0 of the header
has been set to all 0's. Referring to FIG. 6, and the description given
for it, a single byte compression technique has been defined as being used
since none of the 2 or 3 byte indicator bits B through G are set to a 1.
In addition, since both bits AA are set to 0, a default single byte
compression table has been utilized and, since this is the default table,
it is by convention already in existence in the address space for the
decompression program at the receiver or user's end. This implies that the
receiver and the sender have previously arranged via communication to
establish their default table values and enter them into their respective
systems. Since this is a 1 byte wide table, there are 256 possible
entries. Some of the entries will be allocated for control sequences, some
for special characters, and some for letters and digits. However, the bulk
of the entries will be reserved for words with a high relative figure of
merit as encountered in a particular user's installation vocabulary.
The actual number of control sequences, special character, letters or
digits and words of high relative figure of merit are chosen by the
particular user based on his own preferences and experience. A method of
assigning these patterns typically would be for the user to scan a
representative variety of texts from his or her environment to
statistically isolate the highest relative figure of merit words as they
occur in the working vocabulary of their installation. An example of a
technique for doing this appears later within this specification. However,
some of the 256 possible patterns will have to be assigned for the
normally occurring controls that will be encountered. For example, a space
character and all of the upper and lower case alpha and numeric characters
are usually defined with separate entries. End of line and end of document
control codes need to be reserved in each case and a multiple space code
is a useful addition. Commas and space combinations or period and space
combinations are also usual definitions, as are arbitrarily employed
escape characters to indicate change of coding technique or the like. A
table switching indicator byte pattern should also be reserved which can
be used by convention to indicate that the byte in the text following the
occurrence of this particular byte pattern will be a table number which is
to be employed instead of the table currently being employed.
In the example given above where byte 0 is all 0's, the compressed text
will start at byte 1 in the compressed text byte string and the header
itself is a single byte, byte 0. With the naming and format convention as
described in FIG. 6, the first byte, byte 0, of the header is the control
byte which defines the specific array and type of dictionaries that have
been employed and the remaining format of the header fields defines the
specific individual lengths and identities of the auxiliary dictionaries
that were in employed in the given compression of text being encoded.
Case 2: Let us assume that byte 2 is set to 1 and seven 0's. This by the
convention established for FIG. 6 in this example, indicates that a single
byte compression technique itself is employed but a non-default or
auxiliary singly byte definition table will be used by the decompression
algorithm which must access this table residing on some external storage
medium. The table will normally have been given a name by the user
consisting of an installation chosen prefix such as BASTAB followed by a
suffix of numerical value contained in byte 1, for example, in the header
string. A naming convention for single byte compression tables on an
external storage medium might be, for example, "BASTAB.0" which would be a
default table value or "BASTAB.1" which would be the table supplied within
a compressed text byte string, and so forth. "BASTAB.n would be an
additional table residing on an external storage medium where n could have
any value between 2 and 255. In such a scheme, the actual compressed text
string would then begin at relative byte position 2 within the compressed
text byte string. During decompression, following reading of byte 0, the
decompression program would access the indentified single byte compression
table on the external storage medium and read it into the decompression
program address space in read/write memory of the processor.
As noted above, one of the bit patterns within a single byte compression
table is normally reserved for a table switching indicator byte. This
would be used for switching from one single byte compression table to
another. Whenever a table switch pattern is encountered in the compressed
byte string, the next byte by convention would be encoded and interpreted
as containing the number of the table to be utilized next. The new table
would then be read from the external storage medium and placed in the
decompression program addressable memory space if it did not already
reside therein.
Case 3: Let us assume that byte 0 is set to 11000000. This convention in
accordance with the description of FIG. 6 indicates that a single byte
compression table has been employed and that the initial table to be used
for decompression of the text will be the one which is actually
transmitted within a compressed text byte string starting at the relative
byte position number 3. The relative byte positions 1 and 2 in the text
string following the header contain the binary value of the total length
of the supplied single byte compression table. That table is supplied
within the compressed byte strings and must be extracted from the
compressed byte string and placed into the decompression program
addressable memory space. The compressed text table will then begin at
byte 3 and will end at a count defined by the length of the table which
will then be followed by the beginning of the actual compressed text.
Case 4: Let us assume that byte 0 is set to 00100000. This pattern,
according to the convention of FIG. 6, defines that a combination of a
single byte compression table of a type defined as the default control
table and a 3-byte dictionary of a type defined in field BA has been used.
Byte n+1 in this case will be byte 1 since no control table number or
supplied control table are required in this configuration for the header
as specified. The seven low order bits of byte 1 will be the length as a
number (less 1) of 65,536 word-long segments that were utilized in the
3-byte dictionary employed. Field BC which will be byte 2 would then
define the identifying number for a 3-byte dictionary if a specific one is
employed and byte 3 would be the start of the compressed text or in the
event that field BA (the high order bit of byte 1) was 0 specifying the
default 3 byte table condition, byte 2 would be the start of the
compressed text.
The number of words in the 3-byte dictionary that has been specified is the
number of 65,536 word groupings employed. Given that 7 bits can define the
length, somewhere between 1 and 128 such groups could be defined.
Therefore, the number of words in the total length of the dictionary of
the 3 byte compression type will be at least 65,536 words but no more than
8,388,608. In a single byte compression control table a number of control
patterns equal to the number of 65,536 word groups employed will have to
be reserved to identify which section of the 3-byte dictionary is being
utilized at any given time. Thus to encode the occurrence of a word found
in a 3-byte dictionary, the first portion of its identifier will be the 1
byte identifier from the control table which will indicate that segment of
the 3-byte dictionary which is found to contain the word. The next 2 bytes
will indicate the relative offset within the segment of the dictionary
where the actual word encoded is found.
Thus, if a single group of 65,536 words were used in a 3 byte compression
table, a single 1 byte bit pattern in a single byte compression control
table will be used to tell the decompression program that the 3 byte
decompression dictionary is necessary. When the decompression routine
encounters the specific single byte pattern in the compressed text byte
string, the subsequent 2 bytes must be read together to specify which word
within the 65,536 word 3 byte dictionary is being represented from the
original text.
As an aside, it may be noted that if the number of such 65,536 word groups
is relatively small, then given the availability of random access memory
in small systems today, the entire 3 byte compression dictionary may be
loaded into the address space of the decompression program. Since each
group of 65,536 words will typically occupy somewhat less than one-half a
megabyte of storage, given the average word length of around five
characters, it is feasible on large processors, that several megabytes of
address space could be effectively allocated to a decompression program
space. On small processors, there is usually room for only a few such
gro | | |