WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Text compression and expansion method and apparatus    
United States Patent4843389   
Link to this pagehttp://www.wikipatents.com/4843389.html
Inventor(s)Lisle; Ronald J. (Austin, TX); Moss; Eual A. (Raleigh, NC); Ryder; John H. (Durham, NC)
AbstractA text compression method and apparatus are disclosed that enable overall compression ratios of more than six or eight to one for normal language text. Plural multiple-word dictionaries that are specialized for the particular field of use are employed together with a header transmission format that identifies which dictionaries are to be used. In addition, entries in these dictionaries are categorized by a weighted frequency of use ranking in which the product of the word length in characters and the frequency of occurrence of that word in the text is taken as the weighted figure of merit for ranking words to be placed in the individual dictionaries.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 4843389
Text compression and expansion method and apparatus - US Patent 4843389 Drawing
Text compression and expansion method and apparatus
Inventor     Lisle; Ronald J. (Austin, TX); Moss; Eual A. (Raleigh, NC); Ryder; John H. (Durham, NC)
Owner/Assignee     International Business Machines Corp. (Armonk, NY)
Patent assignment
All assignments
Publication Date     June 27, 1989
Application Number     06/937,799
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     December 4, 1986
US Classification     341/106 341/90
Int'l Classification     H03M 007/42
Examiner     Shoop Jr.; William M.
Assistant Examiner     Young; Brian
Attorney/Law Firm     Duffield; Edward H.
Address
Parent Case    
Priority Data    
USPTO Field of Search     340/347 DD 364/200 364/300 364/900 341/86 341/90 341/106
Patent Tags     text compression expansion
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
3348205



[0 after 0 votes]
4646061
Bledsoe
341/65
Feb,1987

[0 after 0 votes]
4597057
Snow
341/60
Jun,1986

[0 after 0 votes]
4545032
Mak
341/90
Oct,1985

[0 after 0 votes]
4386416
Giltner
710/68
May,1983

[0 after 0 votes]
4295124
Roybal
341/86
Oct,1981

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. A method of data compression, comprising steps of:

separating an uncompressed, coded data stream into units;

comparing said units against at least one user-selected dictionary of units having compressed code equivalents for each unit stored in association with the uncompressed encoded units; and

outputting a header for compressed data comprising indications for defining the identity of each of said user-selected dictionaries used in compressing said data;

outputting the compressed code equivalents for incoming units for which a true comparison is found in said comparing step and outputting the uncompressed encoded character stream for any said unit for which no true comparison is found; and

arranging the entries in said dictionaries for said units in an order of merit which is the weighted frequency of use established by multiplying the average frequency of occurrence of each said unit in the language in which it is used by its length in characters and arranging the resulting products of said multiplying in decreasing order of magnitude.

2. A method as described in claim 1 wherein;

at least one of said dictionaries comprises a control table of entries comprising segment identifiers for each segment of each said user selected dictionary used in compressing said data.

3. A data compression and decompression system, comprising:

a transmitter and a receiver and means for connecting said transmitter to said receiver for the transmission and reception of compressed data;

said transmitter including means for accepting an incoming uncompressed encoded data stream and for separating said stream into units;

said transmitter also comprising means for comparing said units against at least one user selected compiled dictionary of units having compressed code equivalents for each unit stored in association with the uncompressed encoded units and further comprising;

means for outputting a compressed data header comprising indications for defining the identity of each said user selected dictionary used in compressing said data; and

means for outputting the compressed code equivalents for incoming units for which a true comparison is found in said comparing step and outputting the uncompressed encoded data stream for any said unit for which no true comparison is found;

said receiver including means for receiving and analyzing said header for defining the identity of each of said user selected dictionaries used in compressing said compressed data stream and;

means for separating said incoming compressed encoded data stream into units;

means for comparing said units against at least one of said user selected and header identified compiled dictionary of units having compressed code equivalents for each unit stored in association with the uncompressed encoded units; and

means for outputting uncompressed equivalents for incoming units for which a true comparison is found in said comparing step and for outputting said data stream for any said unit for which no true comparison is found.
 Description Submit all comments and votes
 


FIELD OF THE INVENTION

This invention relates to text compression and expansion methods and devices in general and specifically to the class of digital to digital converters such as those found in the patent classification Class 340, Sub. 347.

PRIOR ART

A wide variety of prior patents and published papers exist in this field. U.S. Pat. No. 4,545,032, for example teaches one of the basic techniques of a dictionary or tabular conversion in which numerical codes for the basic English language vocabulary are utilized. However, the words are broken up into prefixes, suffixes and stems and it does not appear that weighted ranking of the dictionary entries or tabular entries was considered or applied. Also it does not appear that the concept of being able to switch between different tables or dictionaries based upon the field of usage was considered at all.

U.S. Pat. No. 4,597,057 has another typical example of a compression technique in which standard ASCII coded text may be divided into alpha, numeric and punctuation elements as "words" with the "words" being divided into prefixes, suffixes and stems. The prefixes, suffixes and stems utilize numeric encoding in a fashion similar to the aforementioned U.S. Pat. No. 4,545,032.

U.S. Pat. No. 4,295,124 is an earlier example of a dictionary or tabular look-up type of conversion for applying numeric codes to English text characters. The input ASCII encoded textual words are utilized to generate a second representative code by hashing. This code can be utilized as the memory address for comparison against a prearranged dictionary or memory of textual words. When a match is found, the hashing address is sent as an identifier. If a match is not found, an auxiliary dictionary is built up but the word is transmitted uncompressed the first time it is encountered. The next time the same word is encountered, both dictionaries are checked and if the word is found in the second built-up dictionary, its hashing address is sent. While this description appears to be an efficient and useful technique, it does not employ the flexibility of numerous dictionaries that can be carefully selected for the field usage of the text and does not appear to employ a weighted figure of merit for assigning entries into the dictionaries.

U.S. Pat. No. 4,386,416 is similar to the aforementioned U.S. Pat. No. 4,295,124 but utilizes an escape code to indicate whether a code from a memory library address or a differently encoded representation of words not found in the library are utilized. Again, no weighted figure of merit for the dictionary entries appears to have been employed nor is the technique of utilizing a preamble or header to identify the particular subset of dictionaries employed in a given compression of text indicated.

The basic technique of encoding utilizing dictionaries for prefixes, suffixes and stems of words is found in numerous references such as the Proceedings of the IEEE, Vol. 55, No. 3, March, 1967, pages 390-396 in an article by H. E. White entitled, "Printed English Compression by Dictionary Encoding." A compression of only about two to one was found in this paper and it is found to be generally similar to those of the first two mentioned patents above.

The use of multiple tables or dictionaries itself is not totally new to the art. An example may be found in the article by J. Pike, "Text Compression Using A Four-Bit Coding Scheme," appearing in the Computer Journal, Vol. 24, No. 4, 1981, pages 324-330. However, the article fails to recognize that a truly flexible system utilizing a variety of dictionaries and a header or preamble in the compressed text to indicate which selection of dictionaries has been made for the given compression could be utilized. Further, the use of weighted entries for selection of the words to comprise the given dictionaries to nowhere described.

An early paper entitled, "Experiments in Text File Compression," by Rubin appearing in the Communications of the ACM, November, 1976, Vol. 19, No. 11, pages 617-623 treates the subject of weighted frequency of occurrence as a figure of merit for ranking words for dictionary construction for text compression. However, based on the experiments conducted by that author, it was concluded that frequency was the best method for maximizing the savings in choosing input groups, i.e., dictionary entries. This failed to take account of the possibility that different rankings would occur in different dictionaries for the same word and that switching between dictionaries that are better suited for a given text is a possibility at all.

An early paper showing the use of escape characters to effectively extend the size of the dictionary or change from the use of one dictionary to another is in the article entitled, "A New Technique for Compression and Data Storage," by Hahn appearing in the Communications of the ACM, August, 1974, Vol. 17, No. 8, pages 434-436. This article also shows the dynamic method of adding new or unknown symbols or words encountered to the dictionary. However, character groups rather than words are utilized and the author notes that the experiments with the technique showed somewhat lesser degree of compression than word codes for English text compression.

An overview paper dealing with several methods but not describing the use of the weighted word frequency technique nor the use of a preamble or header to indicate the use of multiple dictionaries is found in the IEEE Transactions on Software Engineering, Vol. SE6, No. 4, July, 1980, pages 340-347 in an article entitled, "Overhead Storage Considerations and a Multi-Linear Method for Data File Compression," by Young et al. A similar overview is found in the Computer publication of April, 1981, pages 71-75 in an article entitled, "An Overview of Data Compression Techniques," by Reghbati.

Finally, an article by Weiss et al appearing in the Journal of Library Automation, Vol. 11, No. 2, June, 1978, pages 97-105 presents a limited dictionary scheme utilizing an escape code or dictionary indication code to show to the decompression system when a compressed word has been taken from a dictionary file and to indicate, by its absence, that the word is not compressed and is transmitted to ordinary ASCII or EBCDIC characters. Pointers for escape characters identify themselves as pointers and indicate the dictionary entry for the word that they represent. Source files of EBCDIC or ASCII encoded text are read one word at a time and the words are compared against the dictionary. If the word is not found, the word and any trailing delimiter such as punctuation marks or spaces are written unchanged into the compressed file. If the word is found, a pointer indicating that the word was found in the dictionary and pointing to the dictionary location is written to the compressed file. This is essentially the same technique as the aforementioned U.S. Pat. No. 4,386,416 which differs only in the terminology employed to describe this method. The use of multiple dictionaries and a weighted frequency of occurrence method of assigning entries are not described in this article.

From all of the foregoing, it is apparent that while much work has been done on text compression methods, techniques and systems, the investigators to date have avoided the use of mutiple extensive dictionaries. This may have been due to the complexity of handling the data storage requirements and in switching among dictionaries. Additionally, early studies of the weighted frequency of occurrence method of assigning entries to the dictionaries have found, contrary to the findings of the present invention, that the weighted frequency of use for dictionary entries is not as beneficial as pure frequency of occurrence. The best known compression ratios from the above-noted prior art appear to be in the range of four or five to one, but we have unexpectedly discovered that by the use of multiple dictionaries with their entries being selected on the basis of weighted frequency of use, compression ratios on the order of more than six to one to as high as eight to one are routinely achievable.

OBJECTS OF THE INVENTION

In view of the shortcomings of the above-noted prior art, it is an object of this invention to provide an improved text compression and decompression technique of the dictionary look-up type in which the dictionaries themselves have their entries arranged by a weighted frequency of use for the words encountered.

Yet another subject of this invention is to provide an improved text compression and expansion technique in which multiple dictionaries are employed according to the specific field of use of the text being compressed and in which a preamble or header is utilized in the compression for indicating which selection of dictionaries has been made for compression.

These and still other objects of the invention not specifically enumerated are met in a preferred embodiment of the invention as will be further described with reference to a preferred embodiment as depicted in the drawings herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic illustration of a text compression and decompression system in accordance with the invention.

FIG. 2 is a schematic illustration of the assignment of memory space into segments containing the actual dictionaries employed, an index for each dictionary segment and dictionary segment maps as well as control tables for use in compression and decompression.

FIG. 3 illustrates schematically the decompression logic flow practiced by the processor in FIG. 1.

FIGS. 4A-1, 4A-2, 4A, 4B, 4C, 4D, 4E, and 4G illustrate a fully augmented compression routine flow chart as is implemented in the microprocessor of FIG. 1 for compression of text.

FIG. 5 illustrates a typical example of the format of a compressed text string utilizing the techniques of the present invention.

FIGS. 6A and 6B are schematic drawings illustrating the assignment of the bit locations in the header or preamble section of a compressed data record which enables interpretation by the receiver or text expansion system for selecting the appropriate dictionaries and control tables for expansion of the compressed text.

BRIEF SUMMARY OF THE INVENTION

In the present invention, the foregoing objects are met by providing a plurality of language use specific dictionaries whose entries of words are ranked in a weighted frequency of usage ranking based on statistical studies of the areas of use employed. For example, words such as "docket" or "versus" or "case" will appear much more frequently in legal texts than in normal English usage. Similar professional jargon is found for other fields as well, engineering, business, accounting, medicine, agriculture, petro-chemicals, etc., etc., ad infinitum being possibilities. In the present invention, the user of a text compression and decompression system builds up dictionaries that are custom-tailored to the field of use. This is done by utilizing a scanning and analysis technique that incorporates counting both the number of characters in each unique word and the number of occurrences of the word within the general usage over a sample of texts from the user's environment. Multiple such dictionaries can be constructed and applied to maximum beneficial effect to achieve a high degree of compression for an individual user. A preamble or header is constructed in compressed text to indicated the user's selection of the appropriate dictionaries actually employed in doing the compression for a given text source input. The header is employed to instruct the receiver as to which dictionaries and control tables are to be loaded into memory for use in the decompression routine. Simple, single-byte entries in a control table can be employed for indicating the source in compressed text of each individual word or group of words when the words are found outside of a single byte control table itself. A hierarchy of dictionaries ranging from the single byte control table which contains the control codes and dictionary segment indicators as well as a list of somewhere between 80 and 224 frequently encountered words is the highest priority dictionary for searching for compression or expansion. Two and three byte encoded dictionary entries where the first byte indicates a specific dictionary segment in memory and the second byte indicates an offset or memory address position within the segment are also employed. The system may be implemented on a typical microprocessor computer system that has a bulk storage such as disk or cassette and a normal size working read/write memory into which the various control tables and dictionaries can be assembled for use by reading them from a bulk storage mechanism. Typical flow charts for compression and decompression are included and will be described in greater detail.

DETAILED SPECIFICATION

The compression technique of the present invention allows more data to be stored on a given medium and/or it allows more data to be sent in reduced time over a transmission link between any two locations. The normal compression ratio achieved may be as high as 6 to 8 although, when the effect of adding in the transmission header and dictionary indicators is taken into account, the overall compression ratio falls slightly. In a data processing organization, the delay for hard copy documentation is excessively costly. In many such locations, the need for sharing large quantities of information between additional remote locations results in investment in extensive communications networks. The growth of the networks and the major expense to the organizations employed is a factor dictating the necessity of developing efficient text compression and expansion methods and systems. In addition, visual displays and desk top computers have proliferated and a major part of information processed is textual data that is presented on the screen. Therefore, any effective compression technique that will minimize the amount of storage space and/or transmission medium employed will be of major benefit. Numerous compression techniques have been developed as alluded to earlier under the prior art section of this application and most of these techniques result in an overall compression factor in the range of 1.5 or 2 to 1.

The present text compression technique in the invention at hand is one that addresses the degree of repetitive occurrence and the length of certain words in the basic language of the persons generating and utilizing the text. Many options are provided. The options may be tailored to specific environments and to specific documents themselves by selecting the basic dictionary type employed. Either a predefined existing dictionary or dictionaries that have been generated especially for text compression or dictionaries that are created on-line for the individual text itself may be utilized in combination or separately. The dictionaries are arranged on byte wide boundaries of addresses to simplify processing. Those dictionaries that have a single byte address utilize the highest degree of compression but are also of limited length since only 256 different entries can exist for a given 8-bit byte. Two byte compression word dictionaries enable a much greater range of words to be defined, i.e., 65,536 words or two byte entry addresses are possible. Extension to a three byte address makes possible a maximum of 16,777,216 entries which is far in excess of all words encountered in the English language. Dictionaries for graphics displays and audio playback are also permitted since these are merely digital representations of specific signals for output by graphic or audio devices. Obviously, the technique is not limited to English language. Virtually any character-spelled language may be similarly treated by our methods. English is only used for convenience herein in explaining our invention.

The basic approach of the preferred text compression technique is to substitute a 1, 2 or 3 byte address pattern for actual English words found in a given document. The word "document" as used herein is used to define encoded textual data, graphics data or audio data that is created by or utilized by human operators. Key stroke entries may be recorded on a disk, a tape or the like or stored in memory as a source text of the typical English language. Such source text is actually a series of ASCII coded alphanumeric characters, space characters, punctuation marks, control characters, such as capitalization, case shift, line feed and the like, etc. The assignment of the dictionary word entries is based on their relative weighted merit according to their frequency of use and length. The relative weighted merit is based upon the length of each word and the frequency of its occurrence in a specific document or within a specific environment such as law, engineering, medicine or the like. The limitation of the compressed bit patterns to lengths of 1, 2 or 3 bytes which would be 8, 16 or 24 bits, makes implementation of these techniques on byte oriented computers available today highly feasible.

The basic compression technique will utilize at least one and usually several dictionaries. The most basic and compact dictionary is the 256 entry single byte compression table. This table is generated to contain the control patterns necessary to decode indications telling which dictionaries are being employed in a given compression, the most frequently occurring words in the environment of concern, and any necessary control patterns or signals. Single byte address compression is obviously limited for 8 bit bytes to 256 unique binary patterns. Some of these patterns must be reserved for control sequences and for definitions of the individual dictionary segments being employed in a given compression. The single byte compression table or memory will thus be limited to words which have the highest relative merit and which will fit into the 256 possible patterns minus the number of reserved control patterns. This approach may be used for small documents to great advantage. Small documents are those consisting of only 2 to 3 pages, for example.

If a 2 byte compression dictionary or memory is constructed, it will have a possible maximum of 65,536 entries defined. However, one bit must be reserved to indicate that a two-byte pattern is used, so only 32,768 actual dictionary entries are then possible. Viewed differently, 128 segments of 256 words each may be encoded. In many cases, however, the normal vocabulary in constant use for a specific group such as an accountant's or lawyer's office would be only a few thousand words. If the maximum basic dictionary size of 2 bytes or 65,536 possible entries is selected, at least one bit of the first byte would have to be used to identify the fact that a 2 byte encoding pattern is employed because the decompression routine must know how to group the compressed bytes either singly in 2's or in 3's. Thus, instead of 256 possibilities for the first byte, only 128 patterns are left in the first byte alone and a total of 128 sections of 256 patterns each can actually be selected. One hundred twenty-eight 8 bit patterns must be reserved in the control table to indicate which 256 word segment of the 2 byte dictionary is being employed in each instance. Thus, the first "byte" of the 2 byte address is taken from the control table. This will leave, of the original 256 possible patterns in the single byte control table, 128 patterns that may be employed for control sequences, high relative merit words and the like.

If the basic dictionary size is arbitrarily limited from 32,768 to only one-half that number, i.e., 16,384, then 192 bit patterns will be left available in the single byte table for the control sequences and high relative merit words. If a basic 2 byte dictionary size of only 8,192 patterns is selected, a likely and useful choice, then 224 bit patterns will be left over in the single byte table for control sequences, encoding format indicators and high relative merit words.

In general, the number of bit patterns left in a single byte control table out of the 256 original possible ones will be 256 less the basic dictionary size selected for containing the words in the 2 or 3 byte dictionaries that may be employed, divided by 256. That is, the manner of encoding the data in this invention is by direct table entries but indications must be placed in the compressed text to indicate which table is employed. A basic control table of 256 possible entries must have reserved in it pattern for ever 256 entries in the 2 byte dictionaries that may be employed and one pattern for every 65,536 entries in each three byte dictionaries employed. These are termed "dictionary segments" and a unique identifying pattern is reserved in the control table for each segment of each dictionary employed in each text compression example. Of course, the control table could be expanded to be a 2 byte table but this would have a negative effect upon the overall compression ratio. In most cases, higher compression factors can be achieved by selecting a smaller number of entries for the basic word dictionary size. Many locations or organizations have an actual working vocabulary of only a few thousand words. Therefore, the compression factor can be optimized by selecting a basic working vocabulary dictionary in the area of 8,192 words (1/4 of the possible size of a 2 byte pattern dictionary). This will require only 32 reserved bit patterns for the single byte control table since 32 times 256 equals 8,192. In the example just given, 224 possible entries would remain out of the 256 available originally in a single byte control table. The 224 entries could be advantageously assigned to high frequency of use words, control characters and alphanumeric characters if desired.

This brief description of the flexibility inherent in the present invention leads to a consideration of how the complex set of possible dictionaries selected for a given text compression can be communicated to the decompression or expansion user. This is accomplished in the present invention by utilizing a preamble or a header to the compressed text.

FIG. 6 illustrates schematically the header encoding format which is employed in the present invention. Line A illustrates the start of the header with byte 0 being identified. Bits AA are utilized in the present invention to define the type of single byte compression table that is being used. Two types are possible. Either an existing default value single byte compression table is employed which consists of a prearranged set of assignments for the 256 possible patterns in a single byte table or an alternative existing single byte compression table is to be used and may be found on a predefined external storage medium. Alternatively, a single byte compression table will actually be provided within the compressed text byte string or a dictionary of two or three byte address size could be transmitted. These four possible alternatives are encoded by setting the value of bits AA as follows: if bits AA are 00, a default single byte compression table is used in the remainder of the text compression string which follows the header until a new header appears. If the bit values for AA are 01, a reserved unassigned alternative is employed. This pattern could be used to indicate that the dictionary to be used will be transmitted in the compressed text, Bits 10 for AA define an existing single byte compression table is to be used which should be found by the user on the predefined external storage medium, such disk or tape. An example would be that a single byte compression table which has been optimized for cancer specialists in the medical field, for example, is to be employed for a given text compression. The final alternative where bits AA are 11, is used to indicate that a single byte compression table has actually been generated and will be supplied within the compression string which follows. The remainder of the bits B through G in byte 0, line A, FIG. 6, are utilized as indicators to define which types or choices of other word dictionaries are employed in the compression routine for the given stream of compressed text that follows the header.

For example, bit B is defined in the present invention to be the indicator for an extended 3-byte wide address dictionary. If the value of bit B is 0 then no 3-byte dictionary is employed and if it is a 1 then some defineable 3-byte dictionary is employed. The definition of which 3-byte dictionary, if any, is employed is treated later in the header format as will become apparent.

Bit C through bit G are each individual bit indicators used in the same fashion as bit B but to indicate instead the usage of alternative 2-byte wide address dictionaries. In the example shown, five different 2-byte wide address dictionaries could possibly be employed.

The remainder of the bytes in the header and the manner of employing them in conjunction with the indicator bits in byte 0 will now be given.

Byte 1 will either be the start of compressed text or it will be interpreted as a control table number if an individual control table other than the default condition control table is to be employed. Byte 1 will be null if bits AA in byte 0 are either 00 or 11. If bits AA equal 10, then a control table number of eight bits' width will appear in byte 1 to identify a specific control table to be accessed by the decompression system from some external storage medium. If bits AA equal 11, then byte 1 will actually be the start of a supplied control table in the compressed text stream and byte 1 through byte n will be the supplied control table which will be the sum of the length of all the supplied words which are to be assigned in the control table plus 1,024 more bytes. One thousand and twenty-four is equal to 256 times 4, where 256 four-byte patterns are written. Byte 1 of each group of 4 bytes is a length indicator of a possible 256 byte sequence. The next 3 bytes in each group of 4 bytes are the sequence number of the word in a supplied dictionary string. Thus, the control table supplied, when appropriate, by bytes 1 through n will contain first, all of the supplied word entries and second, their assignment within the 256 byte array of a 1 byte control table arrangement. Each assignment requires 4 bytes where the first byte in a group of 4 indicates the lengths in bytes up to 256 bytes long of the entry and the next 3 bytes out of a group of 4 indicate the sequence number of the actual word which should be associated with the table entry when the word is supplied in the dictionary string that follows the control table definition portion.

In FIG. 6, line B, 3 byte dictionary table definition segments for the header will begin. At byte n+1, 3 byte dictionary definitions will occur based on whether byte 0 bit B was a 0 or a 1. For example, if bit B of byte 0 was 0, then fields BA, BB and BC in bytes n+1 and n+2 in FIG. 6, line B are null, i.e., no bits are present since no 3 byte dictionary is employed that needs to be defined in greater detail. However, if byte 0, bit B was equal to a 1, then field BA is 1 bit wide and field BB in byte n+1 is 7 bits wide. Byte BB is interpreted as the number of 65,536 word segments of a 3 byte dictionary that are employed. Field BC, which is byte n+2, will be the number indicating which 3 byte dictionary was used. Field BA defines whether fields BB and BC are necessary at all. If field BA is 0, then BC is null or no bits since no specific 3 byte dictionary needs to be defined, i.e., a default condition or a prearranged 3 byte dictionary will be used. If, however, field BA is set to 1, then field BC will be the identifying number of the selected 3 byte dictionary to be utilized.

Continuing with the convention of establishing the header format, byte N+3 is a definition section for the first possible 2 byte auxiliary dictionary which may or may not be employed. Byte 0 bit C defines whether or not a 2 byte dictionary of one type is to be employed. If byte 0 bit C is equal to 0, then the fields CA, CB, and CC in bytes n+3 and n+4 are all set to 0 since no further definition is necessary. If, however, byte 0 bit C is a 1, then field CA will be 1 bit and field CB will be 7 bits in width. Bit CA is an indicator which defines whether a default 2 byte dictionary or a specifically to be defined 2 byte dictionary has been employed. If CA is set to 0, then a default condition or prearranged 2 byte dictionary is used and byte n+4 field CC, which would ordinarily by the specific number for a given 2 byte dictionary, does not need to be given at all since the default condition is invoked. If, however, field CA is equal to 1, then field CD is the number of 256 word long segments that have been utilized for the first 2 byte dictionary being defined and field CC is the identifying number for that 2 byte dictionary.

This format or convention continues in the same fashion for byte n+5 through byte n+12 as shown in lines B, C and D of FIG. 6 in accordance with the status of bits D, E, F and G in byte 0, line A of FIG. 6 as just described. If all five possible indicators for 2 byte dictionaries in byte 0 are employed, then the start of actual compressed text will not begin until byte N+13 in line D of FIG. 6 and will continue for an indeterminant number of bytes until the end of text occurs at byte n+n in line D of FIG. 6.

Several illustrative examples of the first byte, byte 0, in a given header will now be described. Case 1: let us suppose that byte 0 of the header has been set to all 0's. Referring to FIG. 6, and the description given for it, a single byte compression technique has been defined as being used since none of the 2 or 3 byte indicator bits B through G are set to a 1. In addition, since both bits AA are set to 0, a default single byte compression table has been utilized and, since this is the default table, it is by convention already in existence in the address space for the decompression program at the receiver or user's end. This implies that the receiver and the sender have previously arranged via communication to establish their default table values and enter them into their respective systems. Since this is a 1 byte wide table, there are 256 possible entries. Some of the entries will be allocated for control sequences, some for special characters, and some for letters and digits. However, the bulk of the entries will be reserved for words with a high relative figure of merit as encountered in a particular user's installation vocabulary.

The actual number of control sequences, special character, letters or digits and words of high relative figure of merit are chosen by the particular user based on his own preferences and experience. A method of assigning these patterns typically would be for the user to scan a representative variety of texts from his or her environment to statistically isolate the highest relative figure of merit words as they occur in the working vocabulary of their installation. An example of a technique for doing this appears later within this specification. However, some of the 256 possible patterns will have to be assigned for the normally occurring controls that will be encountered. For example, a space character and all of the upper and lower case alpha and numeric characters are usually defined with separate entries. End of line and end of document control codes need to be reserved in each case and a multiple space code is a useful addition. Commas and space combinations or period and space combinations are also usual definitions, as are arbitrarily employed escape characters to indicate change of coding technique or the like. A table switching indicator byte pattern should also be reserved which can be used by convention to indicate that the byte in the text following the occurrence of this particular byte pattern will be a table number which is to be employed instead of the table currently being employed.

In the example given above where byte 0 is all 0's, the compressed text will start at byte 1 in the compressed text byte string and the header itself is a single byte, byte 0. With the naming and format convention as described in FIG. 6, the first byte, byte 0, of the header is the control byte which defines the specific array and type of dictionaries that have been employed and the remaining format of the header fields defines the specific individual lengths and identities of the auxiliary dictionaries that were in employed in the given compression of text being encoded.

Case 2: Let us assume that byte 2 is set to 1 and seven 0's. This by the convention established for FIG. 6 in this example, indicates that a single byte compression technique itself is employed but a non-default or auxiliary singly byte definition table will be used by the decompression algorithm which must access this table residing on some external storage medium. The table will normally have been given a name by the user consisting of an installation chosen prefix such as BASTAB followed by a suffix of numerical value contained in byte 1, for example, in the header string. A naming convention for single byte compression tables on an external storage medium might be, for example, "BASTAB.0" which would be a default table value or "BASTAB.1" which would be the table supplied within a compressed text byte string, and so forth. "BASTAB.n would be an additional table residing on an external storage medium where n could have any value between 2 and 255. In such a scheme, the actual compressed text string would then begin at relative byte position 2 within the compressed text byte string. During decompression, following reading of byte 0, the decompression program would access the indentified single byte compression table on the external storage medium and read it into the decompression program address space in read/write memory of the processor.

As noted above, one of the bit patterns within a single byte compression table is normally reserved for a table switching indicator byte. This would be used for switching from one single byte compression table to another. Whenever a table switch pattern is encountered in the compressed byte string, the next byte by convention would be encoded and interpreted as containing the number of the table to be utilized next. The new table would then be read from the external storage medium and placed in the decompression program addressable memory space if it did not already reside therein.

Case 3: Let us assume that byte 0 is set to 11000000. This convention in accordance with the description of FIG. 6 indicates that a single byte compression table has been employed and that the initial table to be used for decompression of the text will be the one which is actually transmitted within a compressed text byte string starting at the relative byte position number 3. The relative byte positions 1 and 2 in the text string following the header contain the binary value of the total length of the supplied single byte compression table. That table is supplied within the compressed byte strings and must be extracted from the compressed byte string and placed into the decompression program addressable memory space. The compressed text table will then begin at byte 3 and will end at a count defined by the length of the table which will then be followed by the beginning of the actual compressed text.

Case 4: Let us assume that byte 0 is set to 00100000. This pattern, according to the convention of FIG. 6, defines that a combination of a single byte compression table of a type defined as the default control table and a 3-byte dictionary of a type defined in field BA has been used. Byte n+1 in this case will be byte 1 since no control table number or supplied control table are required in this configuration for the header as specified. The seven low order bits of byte 1 will be the length as a number (less 1) of 65,536 word-long segments that were utilized in the 3-byte dictionary employed. Field BC which will be byte 2 would then define the identifying number for a 3-byte dictionary if a specific one is employed and byte 3 would be the start of the compressed text or in the event that field BA (the high order bit of byte 1) was 0 specifying the default 3 byte table condition, byte 2 would be the start of the compressed text.

The number of words in the 3-byte dictionary that has been specified is the number of 65,536 word groupings employed. Given that 7 bits can define the length, somewhere between 1 and 128 such groups could be defined. Therefore, the number of words in the total length of the dictionary of the 3 byte compression type will be at least 65,536 words but no more than 8,388,608. In a single byte compression control table a number of control patterns equal to the number of 65,536 word groups employed will have to be reserved to identify which section of the 3-byte dictionary is being utilized at any given time. Thus to encode the occurrence of a word found in a 3-byte dictionary, the first portion of its identifier will be the 1 byte identifier from the control table which will indicate that segment of the 3-byte dictionary which is found to contain the word. The next 2 bytes will indicate the relative offset within the segment of the dictionary where the actual word encoded is found.

Thus, if a single group of 65,536 words were used in a 3 byte compression table, a single 1 byte bit pattern in a single byte compression control table will be used to tell the decompression program that the 3 byte decompression dictionary is necessary. When the decompression routine encounters the specific single byte pattern in the compressed text byte string, the subsequent 2 bytes must be read together to specify which word within the 65,536 word 3 byte dictionary is being represented from the original text.

As an aside, it may be noted that if the number of such 65,536 word groups is relatively small, then given the availability of random access memory in small systems today, the entire 3 byte compression dictionary may be loaded into the address space of the decompression program. Since each group of 65,536 words will typically occupy somewhat less than one-half a megabyte of storage, given the average word length of around five characters, it is feasible on large processors, that several megabytes of address space could be effectively allocated to a decompression program space. On small processors, there is usually room for only a few such gro