|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to data storage and communication
systems, and more particularly to data compression systems and methods
which improve the capacity of data storage and communication.
2. Description of the Prior Art
Due to the insignificant differences between data compression in data
storage and data communication systems, only data storage systems are
referred to; particularly the data files stored in such systems. However,
all data storage systems can easily be extended to cover data
communications systems and other applications as well. A file is assumed
to be a sequential stream of bytes or characters, where a byte consists of
some fixed number of bits (typically 8), and the compression system
transforms this input byte stream into a "compressed" output stream of
bytes from which the original file contents can be reconstructed by a
decompression unit.
It is well-established that computer data files typically contain a
significant amount of redundancy. Many techniques have been applied over
the years to "compress" these files so that they will occupy less space on
the disk or tape storage medium or so that they can be transmitted in less
time over a communications channel such as a 1200 baud modem line. For
example, there are several widely used commercial programs available for
personal computers (e.g., ARC Software by Systems Enhancement Associates,
Inc., Wayne, N.J., 1985) which perform the compression and decompression
functions on files. It is not uncommon for such programs to reduce the
size of a given file by a 2:1 ratio (or better), although the amount of
reduction varies widely depending on the contents of the file.
There are many approaches in the prior art for compressing data. Some of
these approaches make implicit assumptions about certain types of files or
data within the files. For example, a bit image of a page produced using a
scanner typically has most of its pixels blank, and this tendency can be
exploited by a compression algorithm to greatly reduce the size of such
files. Similarly, word processing files contain many ASCII characters
which are easily compressed using knowledge of which characters (or words)
occur most frequently in the language of interest (e.g., English). Other
compression methods are independent of the file type and attempt to
"adapt" themselves to the data. In general, type-specific compression
techniques may provide higher compression performance than general-purpose
algorithms on the file for which the techniques are optimized, however
they tend to have much lower compression performance if the file model is
not correct. For instance, a compression method optimized for English text
might work poorly on files containing French text.
Typically, a storage system does not "know" what type of data is stored
within it. Thus, data-specific compression techniques are avoided, or they
are only used as one of a set of possible techniques. For example, ARC
uses many methods and picks the one that performs best for each file; note
however that this approach requires significant computational overhead
compared to using a single compression method.
Another important aspect of any compression method is the speed at which a
file can be processed. If the speed of compression (or decompression) is
so low as to significantly degrade system performance, then the
compression method is unacceptable even though it may achieve higher
compression ratios than competing methods. For example, with streaming
tape systems, if the file cannot be compressed fast enough to provide data
at the required rate for the tape drive, the tape will fall out of
streaming and the performance and/or capacity gains due to compression
will be nullified.
One of the most common compression techniques is known as run-length
encoding. This approach takes advantage of the fact that files often have
repeated strings of the same byte (character), such as zero or the space
character. Such strings are encoded using an "escape" character, followed
by the repeat count, followed by the character to be repeated. All other
characters which do not occur in runs are encoded by placing them as
"plain text" into the output stream. The escape character is chosen to be
a seldom used byte, and its occurrence in the input stream is encoded as a
run of length one with the escape character itself as the character.
Run-length encoding performs well on certain types of files, but can have
poor compression ratios if the file does not have repeated characters (or
if the escape character occurs frequently in the file). Thus, the
selection of the escape character in general requires an extra pass on the
data to find the least used byte, lowering the throughput of such a
system.
A more sophisticated approach is known as Huffman encoding (see, Huffman,
David A., "A Method for the Construction of Minimum Redundancy Codes",
Proceedings of the IRE, pp. 1098-1110, September 1952). In this method, it
is assumed that certain bytes occur more frequently in the file than
others. For example, in English text the letter "t" or "T" is much more
frequent than the letter "Q". Each byte is assigned a bit string, the
length of which is inversely related to the relative frequency of that
byte in the file. These bit strings are chosen to be uniquely decodeable
if processed one bit at a time. Huffman derived an algorithm for optimally
assigning the bit strings based on relative frequency statistics for the
file.
The Huffman algorithm guarantees that asymptotically the compression
achieved will approach the "entropy" of the file, which is precisely
defined as,
H=SUM-[p(i) log2(p(i))];
where
p(i)=probability of character i within the file=(# occurrences of i)/(total
# characters in file).
The units of H are in bits, and it measures how many bits (on the average)
are required to represent a character in the file. For example, if the
entropy were 4.0 bits using an 8-bit byte, a Huffman compression system
could achieve 2:1 compression on the file. The higher the entropy, the
more "random" (and thus less compressible) is the data.
Huffman encoding works very well on many types of files. However,
assignment of bit strings to bytes presents many practical difficulties.
For example, if a pre-assigned encoding scheme is used (e.g., based on
frequency of occurrence of letters in English), Huffman encoding may
greatly expand a file if the pre-assigned scheme assumes considerably
different frequency statistics than are actually present in the file.
Additionally, computing the encoding scheme based on the file contents not
only requires two passes over the data as well as applying the Huffman
algorithm to the frequency statistics (thus lowering system throughput),
but it also requires that the encoding table be stored along with the
data, which has a negative impact on the compression ratio. Furthermore,
the relative frequency of bytes can easily change dynamically within the
file, so that at any point the particular encoding assignment may perform
poorly.
There have been many variations on the Huffman approach (e.g., Jones,
Douglas W., "Application of Splay Trees to Data Compression",
Communications of the ACM, pp. 996-1007, Vol. 31, No. 8, August 1988) and
they usually involve dynamic code assignment based on the recent history
of input bytes processed. Such schemes circumvent the problems discussed
above. Other approaches include looking at two byte words (bi-grams) at
the same time and performing Huffman encoding on the words.
A recent variation of Huffman encoding is present in U.S. Pat. No.
4,730,348 to MacCrisken (and other patents referenced therein). In
MacCrisken, Huffman codes are assigned to bytes in the context of the
previous byte. In other words, a plurality of encoding tables are used,
each table being selected according to the previous byte. This approach is
based on the observation that, for example, in English the letter "u" does
not occur very frequently, but following a "q" it appears almost always.
Thus, the code assigned to "u" would be different depending on whether or
not the previous letter was "q" (or "Q"). For a similar scheme using
multiple tables and dynamic code assignment see, Jones, Douglas W.,
"Application of Splay Trees to Data Compression".
The above described Huffman-type approaches tend to be computationally
intensive and do not exceptionally achieve high compression ratios. One
explanation for this observation is that a pure Huffman code based on
8-bit bytes can achieve at best an 8:1 compression ratio, and only in the
optimal situation when the file consists of the same byte repeated over
and over (i.e. entropy=0). In the same scenario a simple run-length
encoding scheme could achieve better than a 50:1 compression ratio. The
average performance will be some combination of best and worst case
numbers, and limiting the best case must also limit the average. An ideal
Huffman code should be able to use "fractional" bits to optimize code
assignment, but the practical limitation of integral numbers of bits in
each code limits the Huffman performance to well below its theoretical
limit.
A totally different approach to compression was developed by Ziv and Lempel
(see, Ziv, J. and A. Lempel, A., "Compression of Individual Sequences via
Variable-Rate Coding", IEEE Transactions on Information Theory, Vol.
IT-24, pp. 530-536, September 1978 and then refined by Welch (See, Welch,
Terry A., "A Technique for High-Performance Data Compression", IEEE
Computer, pp. 8-19, June 1984). Instead of assigning variable length codes
to fixed size bytes, the Ziv-Lempel algorithm ("ZL") assigns fixed-length
codes to variable size strings. As input bytes from the file are
processed, a table of strings is built up, and each byte or string of
bytes is compressed by outputting only the index of the string in the
table. Typically this index is in the range 11-14 bits, and 12 bits is a
common number because it lends itself to a simple implementation. Since
the table is constructed using only previously encoded bytes, both the
compression and the decompression system can maintain the same table
without any extra overhead required to transmit table information. Hashing
algorithms are used to find matching strings efficiently. At the start of
the file, the table is initialized to one string for each character in the
alphabet, thus ensuring that all bytes will be found in at least one
string, even if that string only has length one.
The Ziv-Lempel algorithm is particularly attractive because it adapts
itself to the data and requires no pre-assigned tables predicated on the
file contents. Furthermore, since a string can be extremely long, the best
case compression ratio is very high, and in practice ZL outperforms
Huffman schemes on most file types. It is also quite simple to implement,
and this simplicity manifests itself in high throughput rates.
There are also some drawbacks, however, to the ZL compression method. The
ZL string search is a "greedy" algorithm. For example, consider the
string:
ABCDEFBCDEF;
where A,B,C,D,E,F are any distinct bytes. Note that the ZL string search
would add the following strings to its string table: AS, BC, CD, DE, EF,
BCD, DEF, the only strings of length two or greater that can be output
using this algorithm, up to the point shown, are BC and DE. In actuality
the string BCDEF has already occurred in the input. Thus, while ideally
the second BCDEF string would be referenced back to the original BCDEF, in
practice this does not occur.
A more significant disadvantage to the ZL approach is that the string table
for holding the compressed data will tend to fill up on long files. The
table size could be increased, however, this approach would require more
bits to represent a string and thus it would be less efficient. One
approach to handling this deficiency would be to discard all or part of
the table when it fills. Because of the structure of the algorithm, the
most recently found strings have to be discarded first, since they refer
back to previous strings. However, it is the most recent strings that have
been dynamically adapting to the local data, so discarding them is also
inefficient. Basically, the ZL string table has infinite length memory, so
changes in the type of data within the file can cause great encoding
inefficiencies if the string table is full.
It is also possible to design a compression system that utilizes more than
one method simultaneously, dynamically switching back and forth depending
on which method most efficient within the file. From an implementation
standpoint, such a scheme may be very costly (i.e., slow and/or
expensive), however the resulting compression rate could be very high.
One such method of dynamically switching back and forth is disclosed in
MacCrisken. As mentioned above, a hi-gram Huffman method is utilized as
the primary compression technique. Typically the compression and
decompression system start with a pre-defined (i.e. static) set of code
tables. There may be a set of such tables, perhaps one each for English,
French, and Pascal source code. The compression unit (sender) first
transmits or stores a brief description of which table is to be used. The
decompression unit (receiver) interprets this code and selects the
appropriate table. During compression, if it is determined that the
current table is not performing well, the sender transmits a special
("escape") Huffman code that tells the receiver to either select another
specific pre-defined table or to compute a new table based on the previous
data it has decompressed. Both sender and receiver compute the table using
the same algorithm, so there is no need to send the entire table, although
it takes some time to perform the computation. Once the new table is
computed, compression proceeds as before. It should be noted that although
there is considerable computational overhead, there is no reason why this
technique could not be further adapted to a dynamic Huffman scheme.
In addition to the Huffman encoding, MacCrisken used a secondary
string-based compression method. Both sender and receiver maintain a
history buffer of the most recently transmitted input bytes. For each new
input byte (A), the hi-gram Huffman code is generated, but an attempt is
also made to find the string represented by the next three input bytes
(ABC) in the history using a hashing scheme. The hash is performed on
three byte strings and a doubly-linked hash list is maintained to allow
discarding of old entries in the hash list. If a string is found, a
special Huffman escape code can be generated to indicate that a string
follows, and the length and offset of the string in the history buffer is
sent. The offset is encoded in 10 bits, while the length is encoded into 4
bits, representing lengths from 3-18 bytes. Before such a string is sent
however, the compression unit generates the Huffman codes for all the
bytes in the string and compares the size of the Huffman codes with the
size of the string bits. Typically the Huffman string escape code is four
bits, so it takes 19 bits to represent a string. The smaller of the two
quantities is sent.
Note that the MacCrisken string method avoids the problems of the
Ziv-Lempel method in that the string "table" never fills up, since the old
entries are discarded by removing them from the hash list. Thus, only the
most recent (within 1K bytes) strings occupy the table. Also it is not
"greedy" since in principle all matching strings can be found. In
practice, a limit on the length of the string search is imposed.
Additionally, the MacCriskin method is computationally inefficient because
it is effectively performing two compression algorithms at once, and thus
the computational overhead is quite high.
SUMMARY OF THE INVENTION
The present invention is a compression/decompression system which increases
the capacity of digital storage or transmission media, such as magnetic
disk or tape storage devices. The compression method is fully adaptive,
requiring no pre-initialized encoding tables, and is optimized for
byte-oriented character streams, such as computer files. It overcomes many
of the difficulties found in the prior art and generally achieves higher
compression ratios than the previous techniques as discussed above.
During compression, a history buffer of previously processed bytes is
maintained in the compression apparatus. Compression is achieved by
locating repeated strings of bytes in the history buffer. If no matching
string containing the byte currently being examined is found, the byte is
appended to the output data stream after a special tag bit to indicate
that the byte is "raw" (i.e., not a string). If such a string is found,
its length and relative position within the history buffer are encoded and
appended to the output (compressed) data stream. String length and
positions are encoded in such a fashion that even two-byte repeated
strings result in a compression ratio better than 1:1. In other words,
only in the case of a single "raw" byte does data "expansion" occur.
The string length encoding is variable length, and the string position may
also be encoded as a variable length field. Thus, the present invention
maps variable length input strings to variable length output codes.
A hash table is used to perform efficient string searches, and a hash
"refresh" method is utilized to minimize the computation overhead required
for maintaining the hash data structures. These techniques allow for
high-speed compression of the input data, at input rates up to several
megabytes/second using currently available integrated circuit technology.
The following is a more detailed description of the preferred embodiment of
the present invention which includes a method and apparatus for converting
an input data character string into a variable length encoded data string
in a data compression system. The data compression system comprises a
history array means. The history array means has a plurality of entries
and each entry of the history array means is for storing a portion of an
input data stream. The method of the preferred embodiment comprises the
following steps.
The first step includes performing a search in the history array means for
the longest data string which matches the input data stream. If such a
matching data string is found within the history array means, the second
step includes encoding the matching data string found in the history array
means by appending to the variable length encoded data stream a tag
indicating that the matching data string was found and by appending a
string substitution code. The string substitution code includes a variable
length indicator of the length of the matching data string and a pointer
to the location within the history array means of the matching data
string.
If a matching input data string is not found within the history array
means, the second step includes the step of encoding the first character
of the input data stream by appending to the variable length encoded data
stream a "raw" data tag which indicates that no matching data string was
found in the history array means and the first character of the input data
stream is also appended to the variable length encoded data stream. In
this way, the input data stream is converted into a variable length
encoded data stream.
The step of performing the search in the history array means for the
longest matching data string may further include the step of limiting the
search to a predetermined number of inquiries into the history array means
for the longest matching data string. Additionally, the step for
performing the search for the longest matching data string can also
include the step of performing a hashing function.
In order to perform the hashing function, a data compression system
includes certain hash data structures including a history array pointer, a
hash table means and an offset array means. The history array pointer
points to the latest entry in the history array means. The hash table
means has a plurality of entries and each entry in the hash table means
stores a pointer which points into the history array means. The offset
array means has a plurality of entries, and each entry provides a link to
one of the entries in the history array means. The step for performing the
hash function typically includes the following steps.
First, obtaining the result of the hashing function which provides a
pointer to one of the entries in the hash table means. Then, obtaining the
pointer stored in the hash table entry pointed to by the result of the
hash function. Next, calculating the difference between the history array
pointer and the pointer read from the hash table means and storing the
difference into the offset array entry pointed to by the history array
pointer. Lastly, storing the history array pointer into the hash table
entry pointed to by the hash function.
The preferred embodiment of the invention also includes a refresh function.
The refresh function periodically examines the pointers stored in the
entries of the hash table to determine whether the pointer of each entry
differs from the history pointer by a predetermined amount. If the
difference in the pointer and the history array pointer is greater than a
predetermined amount, then the entry in the hash table is replaced by an
invalid value which reinitializes the entry.
Additionally, the preferred embodiment provides an initialization routine
which effectively replaces all entries of the hash table with invalid
values which effectively initializes the table.
The preferred embodiment of the invention also includes a method for
decoding the variable length encoded data stream which is output from the
compression unit. The method for decomposition includes the following
steps.
First, the variable length encoded data stream is parsed into separate
portions and each separate portion starts with one of the tags. Next, the
tag of each separate portion is evaluated to determine whether the tag is
the raw data tag or the tag indicating an encoded matching data string.
When the tag indicates that there is an encoded matching data string, the
next step includes interpreting the length indicator and the pointer of
the substitution code for generating the matching data string. In this
way, a portion of the original input data stream is reconstructed.
Alternatively, when the tag is a raw data tag, then the first character of
the encoded input data stream is obtained and in this way a portion of the
original input data stream is reconstructed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1a is a block diagram of a compression unit accepting uncompressed
data and outputting compressed data according to the present invention.
FIG. 1b is a block diagram of a decompression unit accepting compressed
data and outputing decompressed data according to the present invention.
FIG. 2 depicts the compression format used by the preferred embodiment of
the present invention.
FIG. 3 depicts a simplified example of compression encodings according to
the compression format depicted in FIG. 2.
FIG. 4 shows the data structures implemented by the preferred embodiment of
invention for performing searches on the input data stream.
FIG. 5a is a flow block diagram of the COMPRESSION OPERATION Routine
performed by the compression unit (FIG. 1a) for encoding the input data
stream.
FIG. 5b is a flow block diagram of the INITIALIZATION Routine referenced
during the COMPRESSION OPERATION Routine (FIG. 5a) for initializing the
hash table of the data structures shown in FIG. 4.
FIG. 5c is a flow block diagram of the REFRESH HASH Routines referenced
during the COMPRESSION OPERATION Routine (FIG. 5a) for partially
reinitializing the hash table of the data structures shown in FIG. 4.
FIG. 6 is a flow block diagram of the DECOMPRESSION OPERATION Routine.
FIG. 7 is a schematic block diagram of a hardwired representation of the
COMPRESSION OPERATION Routine (FIG. 5a).
FIG. 8 is a block diagram of the external RAM FIFO.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring to FIG. la and lb a compression unit 4 and a block diagrams of a
decompression unit 6 according to the present invention are depicted. Both
units 4 and 6 can be hardware modules o | | |