|
|
|
| United States Patent | 5614899 |
| Link to this page | http://www.wikipatents.com/5614899.html |
| Inventor(s) | Tokuda; Katsumi (Osaka, JP);
Sugimura; Ryoichi (Hyogo, JP) |
| Abstract | A text compressing apparatus comprising morpheme-parsing unit for taking
out words from sentences retrieved from an external storage unit,
dictionary retrieving unit for converting the post-morpheme-parse words
into form marks indicating the form of the detected words while referring
to a parse dictionary, a structure-parsing unit for generating equation
trees for each sentence, an expression rewriting unit for rewriting the
words which are referred to as nodes into their representative words, an
equation converting unit for converting the equation trees written into
the representative expressions into words, and Huffman coding unit for
converting the words from the equation tree converting unit into strings
of bits. |
|
|
|
Title Information  |
|
|
|
|
|
Drawing from US Patent 5614899 |
|
|
Apparatus and method for compressing texts |
|
|
|
|
|
| Publication Date |
March 25, 1997 |
|
|
|
|
|
| Filing Date |
December 2, 1994 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| Priority Data |
Dec 03, 1993[JP]5-304137 |
|
|
|
|
|
|
|
|
|
|
|
Title Information  |
|
|
References  |
|
|
| *references marked with an asterisk below are user-added references |
|
U.S. References |
|
|
|
|
|
|
U.S. References |
|
|
Foreign References |
|
|
|
|
|
|
Foreign References |
|
|
Other References |
|
|
|
|
|
|
Other References |
|
|
|
|
|
References  |
|
|
|
|
|
| Market Size |
|
Estimate the gross annual revenues of the relevant market
sector:
|
| | |
| |
|
|
| Market Share |
|
Estimate the percentage of the relevant market sector this invention will capture:
|
| | |
| |
|
|
| Reasonable Royalty |
|
What percentage of gross sales should the inventor or assignee be paid?
|
| | |
| |
|
|
|
Public's "Guesstimation" of Royalty Value
|
| Market Size | N/A | [No votes] | | x | Market Share | N/A | [No votes] | | x | Reasonable Royalty | N/A | [No votes] |
| | N/A | |
| |
|
|
|
|
|
|
|
|
|
|
|
|
Market Review  |
|
|
Technical Review  |
|
|
Claims  |
|
|
What is claimed is:
1. A text file compressing apparatus comprising:
word detecting means for detecting words in sentences comprising a text
file;
representative word storage means for classifying a plurality of words into
a plurality of sets of synonymous words and to store the sets with a
representative word for each set;
representative word retrieval means for retrieving the representative words
for the detected words from their respective sets in said representative
word storage means;
representative word rewrite means for rewriting the detected words into the
retrieved representative words; and
coding means for coding the text file after the detected words are
rewritten by said representative word rewrite means.
2. The text file compressing apparatus of claim 1, wherein said coding
means includes:
correspondence table storage means for storing a correspondence table
showing a relation between a plurality of words and string of bits; and
string-of-bits conversion means for converting each detected word in the
text file into the string of bits after the detected words are rewritten
into the representative words by said representative word rewrite means.
3. The text file compressing apparatus of claim 2, wherein:
the strings of bits in said correspondence table are variable; and
the representative word in each set is assigned a shortest string of bits
compared with other words in the set.
4. The text file compressing apparatus of claim 3, wherein words with high
probabilities with which the words appear in the texts are assigned short
strings of bits, and words with small probabilities are assigned long
strings of bits.
5. The text file compressing apparatus of claim 2, wherein words with high
probabilities with which the words appear in the texts are assigned short
strings of bits, and words with small probabilities are assigned long
strings of bits.
6. The text file compressing apparatus of claim 1, wherein said coding
means includes:
dictionary storage means for storing a dictionary including a plurality of
words arranged in an order of their respective probabilities; and
string-of-bits conversion means for converting the detected words in the
text file into strings of bits after the detected words are rewritten into
the representative words by said representative word conversion means,
said strings of bits indicating a location of each word in the dictionary.
7. The text file compressing apparatus of claim 6 further comprising:
probability computing means for detecting the words in the sentences
comprising the text file to compute probabilities with which the words
appear in the text; and
dictionary generation means for generating a dictionary by arranging the
detected words in relation with their respective probabilities, the
resulting dictionary being stored in said dictionary storage means,
whereby the text file compressing apparatus yields the text file including
strings of bits, and the dictionary generated by the dictionary generating
means as a result of an operation.
8. The text file compressing apparatus of claim 6, wherein the dictionary
stored in said dictionary storage means is shared by a plurality of text
files.
9. The text file compressing apparatus of claim 1 further comprising:
take-out means for taking out sentences separately from the text file;
structure-parse means for structure-parsing each sentence to generate an
equation tree, said equation tree having a tree structure with a plurality
of nodes linked one with another;
string-of-characters conversion means for converting the equation tree into
strings of characters, a parent-child relation being represented by a
parenthesis; and
sentence conversion means for converting each sentence taken out by said
take-out means into the strings of characters, and
wherein said correspondence table storage means stores the parenthesis in
relation with strings of bits,
whereby said converting means converts the parenthesis into a string of
bits.
10. The text file compressing apparatus of claim 9, wherein said
string-of-characters conversion means includes:
a string-of-characters buffer in which the equation tree in the form of
strings of characters are arranged;
first placement means for placing a word corresponding to a parent node at
a top of the equation tree, and the words corresponding to all children
nodes within said string-of-character buffer, all the children nodes being
parenthesized;
current node set means for making each child node into a current node in
turn;
second placement means for placing a word corresponding to the current node
and the words corresponding to all children nodes, all the children nodes
being parenthesized; and
repetition means for repeatedly activating said second placement means for
each child node.
11. A text file compressing apparatus for compressing a text file, said
text file compressing apparatus comprising:
detecting means for detecting words in sentences comprising a text file;
form storage means for storing form information for a plurality of words,
said form information comprising infinitives and form marks indicating
forms of the words;
form information retrieval means for retrieving the form information that
matches with the words detected by said detection means from said form
storage means;
form information rewrite means for rewriting the words detected by said
detection means into their respective form information; and
coding means for coding the text file after the words detected by said
detecting means are rewritten into the representative words by said form
information rewrite means.
12. The text file compressing apparatus of claim 11, wherein said coding
means includes:
correspondence table storage means for storing a correspondence table
showing a relation between a plurality of words and string of bits; and
string-of-bits conversion means for converting each detected word in the
text file into the string of bits after the detected words are rewritten
into the representative words by said representative word rewrite means.
13. The text file compressing apparatus of claim 12, wherein the
infinitives are assigned shorter strings of bits compared with their
different forms in said correspondence table.
14. The text file compressing apparatus of claim 13, wherein words with
high probabilities with which the words appear in the texts are assigned
short strings of bits, and words with small probabilities are assigned
long strings of bits.
15. The text file compressing apparatus of claim 14, wherein said
correspondence storage means further stores the form marks in relation
with the strings of bits,
whereby said string-of-bits conversion means converts the form marks into
the strings of bits.
16. The text file compressing apparatus of claim 12, wherein said coding
means includes:
dictionary storage means for storing a dictionary including a plurality of
words arranged in an order of their respective probabilities; and
string-of-bits conversion means for converting the detected words in the
text file into strings of bits after the detected words are rewritten into
the representative words by said representative word conversion means,
said strings of bits indicating a location of each word in the dictionary.
17. A text file compressing apparatus comprising:
first detecting means for detecting words in sentences comprising a text
file;
form storage means for storing form information for a plurality of words,
said form information comprising infinitives and form marks indicating
forms of the words;
form information retrieval means for retrieving the form information that
matches with the words detected by said first detection means from said
form storage means;
form information rewrite means for rewriting the words detected by said
first detection means into their respective form information;
second detecting means for detecting words in the sentences comprising the
text file;
representative word storage means for classifying a plurality of words into
a plurality of sets of synonymous words to store the sets with a
representative word for each set;
representative word retrieval means for retrieving the representative words
for the words detected by said second detection means from their
respective sets in said representative word storage means;
representative word rewrite means for rewriting the words detected by said
second detection means into the retrieved representative words; and
coding means for coding the text file after the words detected by said
second detection means are rewritten by said representative word rewrite
means.
18. The text file compressing apparatus of claim 17, wherein said coding
means includes:
correspondence table storage means for storing a correspondence table
showing a relation between a plurality of words and string of bits; and
string-of-bits conversion means for converting each detected word in the
text file into the string of bits after the detected words are rewritten
into the representative words by said representative word rewrite means.
19. The text file compressing apparatus of claim 18, wherein:
the strings of bits in said correspondence table are variable; and
the representative word in each set is assigned a shortest string of bits
compared with other words in the set and the infinitives are assigned
shorter strings of bits compared with their different forms in said
correspondence table.
20. The text file compressing apparatus of claim 19, wherein the
representative words and infinitives with high probabilities with which
the words appear in the texts are assigned short strings of bits, and
words with small probabilities are assigned long strings of bits in said
correspondence table.
21. The text file compressing apparatus of claim 20, wherein said
correspondence storage means further stores the form marks in relation
with the strings of bits,
whereby said string-of-bits conversion means converts the form marks into
the strings of bits.
22. A method for compressing a text file for a text file compressing
apparatus including a text file storage unit for storing text files, a
representative word storage unit for classifying a plurality of words into
a plurality of sets of synonymous words to store the sets with a
representative word for each set, and a code information storage unit for
storing code information used to code words in the text file, said method
comprising the steps of:
detecting words in sentences comprising the text file;
retrieving the representative words from the sets for the detected words;
rewriting the detected words with the retrieved representative words; and
coding the text file after the detected words are rewritten into the
representative words while referring to the code information in said code
information storage unit.
23. The method of claim 22, wherein said coding step includes:
converting the words in the text file into string of bits after the
detected words are rewritten into the representative words while referring
to the code information in said code information storage unit, the code
information being stored in the form of a correspondence table showing a
relation between a plurality of words and strings of bits.
24. The method of claim 22, wherein said coding step includes:
converting the words in the text file into string of bits after the words
are rewritten into the representative words while referring to the code
information, the code information being stored in said code information
storage unit in the form of a dictionary including words arranged in an
order of probabilities with which the words appear in the file text, said
strings of bits showing an order of each word in the code information.
25. The method of claim 24 further comprising the steps of:
detecting the words in the sentences comprising of the text file and
computing probabilities with which the words appear in the text; and
generating a dictionary by arranging the detected words in relation with
their respective probabilities, the resulting dictionary being stored in
said code information storage unit.
26. The method of claim 22 further comprising the steps of:
taking out sentences separately from the text file;
structure-parsing each sentence and generating an equation tree, said
equation tree having a tree structure with a plurality of nodes linked on
with another;
converting the equation tree into strings of characters, a parent-child
relation being represented by a parenthesis; and
converting each sentence taken out by said sentence taking out step into
the strings of characters,
wherein said code information storage unit stores the parenthesis in
relation with strings of bits,
whereby the parenthesis is converted into a string of bits in said
sentence-to-string-of-characters converting step.
27. The method of claim 26, wherein said
equation-tree-to-string-of-characters converting step includes:
arranging the equation tree in the form of strings of characters;
placing a word corresponding to a parent node at a top of the equation
tree, and the words corresponding to all children nodes arranged in said
arranging sub-step, all the children nodes being parenthesized;
making each child node into a current node in turn;
placing a word corresponding to the current node and the words
corresponding to all children nodes, all the children nodes being
parenthesized; and
repeating said secondly mentioned placing sub-step for each child node.
28. A method for compressing a text file for a text file compressing
apparatus including a text file storage unit for storing text files, a
form storage unit for storing a plurality of words in the form of
infinitives and form marks specifying a form of each word, and a code
information storage unit for storing code information used to code words
in the text file, said method comprising the steps of:
detecting words in sentences comprising the text file;
retrieving form information that matches with the detected words from said
form storage unit;
rewriting the detected words with their respective form information; and
coding the text file after the words are rewritten into the form
information while referring to the code information in said code
information storage unit.
29. The method of claim 28, wherein said coding step includes:
converting the words in the text file into string of bits after the
detected words are rewritten into the representative words while referring
to the code information in said code information storage unit, the code
information being stored in the form of a correspondence table showing a
relation between a plurality of words and strings of bits.
30. The method of claim 29, wherein said rewriting step includes:
rewriting the form marks in the text file into strings of bits while
referring to said correspondence table, said correspondence table further
showing a relation between the strings of bits and form marks.
31. A method for compressing a text file for a text file compressing
apparatus including a text file storage unit for storing text files, a
representative word storage unit for classifying a plurality of words into
a plurality sets of synonymous words to store the sets with a
representative word for each set, a form storage unit for storing a
plurality of words in the form of infinitives and form marks specifying a
form of each word, and a code information storage unit for storing code
information used to code words in the text file, said method comprising
the steps of:
detecting words in sentences comprising the text file;
retrieving form information that matches with the detected words from said
form storage unit;
rewriting the detected words with the form information;
detecting words in post-rewrite sentence in the rewriting step;
retrieving the representative words for the words detected in said
post-rewrite-word detecting step from the sets from said representative
word storage unit;
rewriting the post-rewrite words into their respective representative
words; and
coding the text file after the post-rewrite words are rewritten into their
respective representative words.
32. The method of claim 31, wherein said coding step includes:
converting the words in the text file into string of bits after the
detected words are rewritten into the representative words while referring
to the code information in said code information storage unit, the code
information being stored in the form of a correspondence table showing a
relation between a plurality of words and strings of bits.
33. A text file compressing apparatus comprising:
word detecting means for detecting words in sentences comprising a text
file;
representative word storage means for classifying a plurality of words into
a plurality of sets of synonymous words and to store the sets with a
representative word for each set;
representative word retrieval means for retrieving the representative words
for the detected words from their respective sets in said representative
word storage means;
representative word rewrite means for rewriting the detected words into the
retrieved representative words;
coding means for coding the text file after the detected words are
rewritten by said representative word rewrite means;
take-out means for taking out sentences separately from the text file;
structure-parse means for structure-parsing each sentence to generate an
equation tree, said equation tree having a tree structure with a plurality
of nodes linked one with another;
string-of-characters conversion means for converting the equation tree into
strings of characters, a parent-child relation being represented by a
parenthesis; and
sentence conversion means for converting each sentence taken out by said
take-out means into the strings of characters, and
wherein said correspondence table storage means stores the parenthesis in
relation with strings of bits,
whereby said converting means converts the parenthesis into a string of
bits.
34. The text file compressing apparatus of claim 33, wherein said
string-of-characters conversion means includes:
a string-of-characters buffer in which the equation tree in the form of
strings of characters are arranged;
first placement means for placing a word corresponding to a parent node at
a top of the equation tree, and the words corresponding to all children
nodes within said string-of-character buffer, all the children nodes being
parenthesized;
current node set means for making each child node a current node in turn;
second placement means for placing a word corresponding to the current node
and the words corresponding to all children nodes, all the children nodes
being parenthesized; and
repetition means for repeatedly activating said second placement means for
each child node.
35. A method for compressing a text file for a text file compressing
apparatus, the apparatus including a text file storage unit for storing
text files, a representative word storage unit for classifying a plurality
of words into a plurality of sets of synonymous words to store the sets
with a representative word for each set, and a code information storage
unit for storing code information used to code words in the text file,
said method comprising the steps of:
detecting words in sentences comprising the text file;
retrieving the representative words from the sets for the detected words;
rewriting the detected words with the retrieved representative words;
coding the text file after the detected words are rewritten into the
representative words while referring to the code information in said code
information storage unit;
taking out sentences separately from the text file;
structure-parsing each sentence and generating an equation tree, said
equation tree having a tree structure with a plurality of nodes linked one
with another;
converting the equation tree into strings of characters, a parent-child
relation being represented by a parenthesis; and
converting each sentence taken out by said sentence taking out step into
the strings of characters,
wherein said code information storage unit stores the parenthesis in
relation with strings of bits,
whereby the parenthesis is converted into a string of bits in said
sentence-to-string-of-characters converting step.
36. The method of claim 35, wherein said
equation-tree-to-string-of-characters converting step further comprises
the steps of:
arranging the equation tree in the form of strings of characters;
placing a word corresponding to a parent node at a top of the equation
tree, and the words corresponding to all children nodes arranged in said
arranging sub-step, all the children nodes being parenthesized;
making each child node into a current node in turn;
placing a word corresponding to the current node and words corresponding to
all children nodes, all the children nodes being parenthesized; and
repeating said secondly mentioned placing sub-step for each child node.
37. A text file compressing apparatus comprising:
word detecting means for detecting words in sentences comprising a text
file;
representative word storage means for classifying a plurality of words into
a plurality of sets of synonymous words and to store the sets with a
representative word for each set;
representative word retrieval means for retrieving the representative words
for the detected words from their respective sets in said representative
word storage means;
representative word rewrite means for rewriting the detected words into the
retrieved representative words;
coding means for coding the text file after the detected words are
rewritten by said representative word rewrite means;
take-out means for taking out sentences separately from the text file;
structure-parse means for structure-parsing each sentence to generate an
equation tree, said equation tree having a tree structure with a plurality
of nodes linked one with another;
string-of-characters conversion means for converting the equation tree into
strings of characters, a parent-child relation being represented by a
predetermined character; and
sentence conversion means for converting each sentence taken out by said
take-out means into the strings of characters, and
wherein said correspondence table storage means stores the predetermined
character in relation with strings of bits,
whereby said converting means converts the predetermined character into a
string of bits.
38. The text file compressing apparatus of claim 37, wherein said
string-of-characters conversion means includes:
a string-of-characters buffer in which the equation tree in the form of
strings of characters are arranged;
first placement means for placing a word corresponding to a parent node at
a top of the equation tree, and the words corresponding to all children
nodes within said string-of-character buffer, all the children nodes being
marked by the predetermined character;
current node set means for making each child node a current node in turn;
second placement means for placing a word corresponding to the current node
and the words corresponding to all children nodes, all the children nodes
being marked by the predetermined character; and
repetition means for repeatedly activating said second placement means for
each child node.
39. A method for compressing a text file for a text file compressing
apparatus, the apparatus including a text file storage unit for storing
text files, a representative word storage unit for classifying a plurality
of words into a plurality of sets of synonymous words to store the sets
with a representative word for each set, and a code information storage
unit for storing code information used to code words in the text file,
said method comprising the steps of:
detecting words in sentences comprising the text file;
retrieving the representative words from the sets for the detected words;
rewriting the detected words with the retrieved representative words;
coding the text file after the detected words are rewritten into the
representative words while referring to the code information in said code
information storage unit;
taking out sentences separately from the text file;
structure-parsing each sentence and generating an equation tree, said
equation tree having a tree structure with a plurality of nodes linked one
with another;
converting the equation tree into strings of characters, a parent-child
relation being represented by a predetermined character; and
converting each sentence taken out by said sentence taking out step into
the strings of characters,
wherein said code information storage unit stores the predetermined
character in relation with strings of bits,
whereby the predetermined character is converted into a string of bits in
said sentence-to-string-of-characters converting step.
40. The method of claim 39, wherein said
equation-tree-to-string-of-characters converting step further comprises
the steps of:
arranging the equation tree in the form of strings of characters;
placing a word corresponding to a parent node at a top of the equation
tree, and the words corresponding to all children nodes arranged in said
arranging sub-step, all the children nodes being marked by the
predetermined character;
making each child node into a current node in turn;
placing a word corresponding to the current node and words corresponding to
all children nodes, all the children nodes being marked by the
predetermined character and
repeating said secondly mentioned placing sub-step for each child node. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
(1) Field of the Invention
The present invention relates to an apparatus and method for compressing
electronic texts to reduce the data size thereof.
(2) Description of the Related Art
The introduction of electronic texts has advanced a so-called "paperless
system" in both the civil and public sectors. The electronic texts, or
otherwise known as text files, are the texts represented by an array of
character codes. The electronic texts are advantageous in that they are
easily copied or delivered, and occupy only a small storage area.
When an electronic text is transmitted to a distant correspondent via a
communication network, or delivered by means of a recording medium, such
as a floppy disk or CD-ROM, the data size of the electronic text is
reduced by compressing the character codes to save the communication cost
and increase the number of the texts to be stored on one disk.
The electronic texts are generally compressed by lossless coding methods,
such as LZW coding and Huffman coding. In the following, Huffman coding
will be explained as an example.
In Huffman coding, the probabilities with which words appear in the texts
are first determined. Next, bit strings are assigned to the words and an
example of the bit strings is shown in FIG. 1: the words assigned short
strings of bits have larger probabilities, whereas the words with smaller
probabilities map to longer strings of bits. The words in the electronic
text are converted into the strings of bits while referring to the
compression table shown in FIG. 1. Assume that a sentence "XXX Electric
will release a new product." consisting of six words, "XXX Electric",
"will", "release", "a", "new", and "product", is converted into a string
of bits. If a character code is 8 bits long, then, for example, a 32 bits
long (8.times.4=32) word "will" is converted into a 4-bit string "1000" in
Huffman coding, reducing the code size to one eighths (4/32=1/8).
Huffman coding is most effective when the probabilities are determined as
accurately as possible and the short strings of bits are assigned to the
words with large probabilities.
However, in practice, a writer avoids repeating words or phrases and
instead uses synonyms to have a wide vocabulary; for example, the writer
may use "an apology" for the first occurrence and "an excuse" for the
second. This reduces the probabilities for each word, and the resulting
Huffman coding is not satisfactorily efficient. In addition, the fact that
verbs and auxiliary verbs have different forms makes it difficult to
increase compression efficiency. For example, an irregular verb "be" has
six forms: "be", "is", "are", "was", "were", and "been". If the verb "be"
invariably appears as "is", then the word is compressed efficiently;
however, having the different forms reduces the probabilities for the verb
"be", and hence making Huffman coding less effective.
There are other coding methods known as "lossy coding". An example of the
lossy coding is disclosed in Japanese Laid-open Patent Application No.
4-156663. In the lossy coding, each sentence in one electronic text is
morpheme-parsed while referring to a certain dictionary to be partitioned
into words. Then, semantic information of each word is retrieved from the
dictionary, based on which the importance of each word is determined. If
the determined importance is below a predetermined level, such words are
deleted to reduce the amount of data. However, because the reduction is
limited to the data amount of the deleted words, the compression
efficiency is not satisfactory either. Besides, the restored data as a
result of compression and subsequent expansion are not true to the
original data because some words are omitted.
Huffman coding may be combined with the lossy coding to enhance the
compression efficiency: a text is coded after less important words are
deleted. However, the text is restored by omitting some words as is with
the case in the lossy coding alone.
SUMMARY OF THE INVENTION
Accordingly, the present invention has a first object to provide an
apparatus and method for compressing texts efficiently by rewriting
synonymous words with a representative word first and then applying a
lossless coding method.
The present invention has a second object to provide an apparatus and
method for compressing texts efficiently by rewriting words in different
forms with their respective infinitives first and then applying a lossless
coding method.
The first object can be fulfilled by a text file compressing apparatus
comprising: a word detecting unit for detecting words in sentences
composing a text file; a representative word storage unit for classifying
a plurality of words into a plurality of sets of synonymous words to store
the sets with a representative word for each set; a representative word
retrieval unit for retrieving the representative words for the detected
words from their respective sets in the representative word storage unit;
a representative word rewrite unit for rewriting the detected words into
the retrieved representative words; and a coding unit for coding the text
file after the detected words are rewritten by the representative word
rewrite unit.
The coding unit may include: a correspondence table storage unit for
storing a correspondence table showing a relation between a plurality of
words and string of bits; and a string-of-bits conversion unit for
converting each detected word in the text file into the string of bits
after the detected words are rewritten into the representative words by
the representative word rewrite unit.
The strings of bits in the correspondence table may be variable; and the
representative word in each set may be assigned a shortest string of bits
compared with other words in the set.
Words with high probabilities with which the words appear in the texts may
be assigned short strings of bits, and words with small probabilities may
be assigned long strings of bits.
The first object can be also fulfilled by a method for compressing a text
file for a text file compressing apparatus including a text file storage
unit for storing text files, a representative word storage unit for
classifying a plurality of words into a plurality of sets of synonymous
words to store the sets with a representative word for each set, and a
code information storage unit for storing code information used to code
words in the text file, the method comprising the steps of: detecting
words in sentences comprising the text file; retrieving the representative
words from the sets for the detected words; rewriting the detected words
with the retrieved representative words; and coding the text file after
the detected words are rewritten into the representative words while
referring to the code information in the code information storage unit.
The coding step may include: converting the words in the text file into
string of bits after the detected words are rewritten into the
representative words, while referring to the code information in the code
information storage unit, the code information being stored in the form of
a correspondence table showing a relation between a plurality of words and
strings of bits.
The coding step may include: converting the words in the text file into
string of bits after the words are rewritten into the representative words
while referring to the code information, the code information being stored
in the code information storage unit in the form of a dictionary including
words arranged in an order of probabilities with which the words appear in
the file text, the strings of bits showing an order of each word in the
code information.
According to the present invention, the synonymous words are rewritten into
an adequate representative word first and subsequently compressed based on
the probabilities with which the representative word appears in one text.
Thus, if a text includes a wide vocabulary, not only is the text
compressed efficiently, but also the resorted text is substantially true
to the original.
The second object can be fulfilled by a text file compressing apparatus for
compressing a text file, the text file compressing apparatus comprising: a
detecting unit for detecting words in sentences composing a text file; a
form storage unit for storing form information for a plurality of words,
the form information composing infinitives and form marks indicating forms
of the words; a form information retrieval unit for retrieving the form
information that matches with the words detected by the detection unit
from the form storage unit; a form information rewrite unit for rewriting
the words detected by the detection unit into their respective form
information; and a coding unit for coding the text file after the words
detected by the detecting unit are rewritten into the representative words
by the form information rewrite unit.
The coding unit may include: a correspondence table storage unit for
storing a correspondence table showing a relation between a plurality of
words and string of bits; and a string-of-bits conversion unit for
converting each detected word in the text file into the string of bits
after the detected words are rewritten into the representative words by
the representative word rewrite unit.
The infinitives may be assigned shorter strings of bits compared with their
different forms in the correspondence table.
Words with high probabilities with which the words appear in the texts may
be assigned short strings of bits, and words with small probabilities may
be assigned long strings of bits.
The correspondence storage unit may further store the form marks in
relation with the strings of bits,
whereby the string-of-bits conversion unit converts the form marks into the
strings of bits.
The second object can be also fulfilled by a method for compressing a text
file for a text file compressing apparatus including a text file storage
unit for storing text files, a form storage unit for storing a plurality
of words in the form of infinitives and form marks specifying a form of
each word, and a code information storage unit for storing code
information used to code words in the text file, the method comprising the
steps of: detecting words in sentences composing the text file; retrieving
form information that matches with the detected words from the form
storage unit; rewriting the detected words with their respective form
information; and coding the text file after the words are rewritten into
the form information while referring to the code information in the code
information storage unit.
The coding step may include: converting the words in the text file into
string of bits after the detected words are rewritten into the
representative words while referring to the code information in the code
information storage unit, the code information being stored in the form of
a correspondence table showing a relation between a plurality of words and
strings of bits.
The rewriting step may include: rewriting the form marks in the text file
into strings of bits while referring to the correspondence table, the
correspondence table further showing a relation between the strings of
bits and form marks.
According to the above structure, the words in different forms are
rewritten into their respective infinitives before they are compressed to
increase the probabilities. Thus, the compression efficiency can be
enhanced.
BRIEF DESCRIPTION OF THE DRAWING
These and other objects, advantages and features of the invention will
become apparent from the following description thereof taken in
conjunction with the accompanying drawings which illustrate specific
embodiments of the invention. In the drawings:
FIG. 1 is a view showing a compression table used in Huffman coding;
FIG. 2 is a view depicting a structure of a text compressing apparatus in
accordance with a first embodiment of the present invention;
FIG. 3 is a view showing a content of a parse dictionary;
FIGS. 4A through 4C are views showing sentence-to-equation-tree transition;
FIG. 5 is a view showing a content of an expression conversion dictionary;
FIG. 6 is a view showing an equation tree after the rewriting of a word
with a representative expression;
FIG. 7 is a view showing another compression table;
FIG. 8 is a flowchart detailing the operation of the text compressing
apparatus in the first embodiment;
FIG. 9 is a flowchart detailing the operation of an equation tree
converting unit in accordance with a second embodiment of the present
invention; and
FIGS. 10A through 10F are views showing a transition from the equation tree
to a string of characters.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
FIRST EMBODIMENT
FIG. 2 is a view depicting the structure of a text compressing apparatus in
accordance with the first embodiment of the present invention. The text
compressing apparatus is installed in a text generating apparatus to
reduce the memory size of electronic texts stored in an external storage
unit. In this embodiment, assume that the electronic texts are written in
English.
The text compressing apparatus comprises a job buffer 10, a control unit
11, a parse dictionary 12, a morpheme-parsing unit 14, a dictionary
retrieving unit 15, a structure-parsing unit 16, an expression rewrite
dictionary 17, an expression rewriting unit 18, an equation tree
converting unit 19, a compression table 20, and a Huffman coding unit 21.
More precisely, the job buffer 10 is a text buffer with a capacity for
storing character codes of one electronic text.
The control unit 11 opens a file for a desired electronic text stored in an
external storage unit (not shown), and takes out all the sentences
separately therefrom to deliver the same to the morpheme-parsing unit 14.
Subsequently, the control unit 11 activates the morpheme-parsing unit 14,
dictionary retrieving unit 15, structure-parsing unit 16, expression
rewriting unit 18, equation tree converting unit 19, and Huffman coding
unit 21 in turn.
The parse dictionary 12 stores words referred to as vocabulary indexes in
relation with their parts of speech, form marks, and parse patterns as
shown in FIG. 3. The form marks identify infinitives of the vocabulary
indexes if they are in different forms. The form marks consist of
infinitives appended with marks "ORIG", "PRES", "PAST", or "PP" specifying
the infinitive, present tense, past tense, and past participle,
respectively. The parse pattern is the information indicating which
elements of sentence vocabulary indexes are; the elements of a sentence
includes a subject, a predicate, an object, a complement, etc.
In the drawing, a form mark "publish. PRES" indicates a vocabulary index to
its left is a present tense of a verb "publish", and a parsing pattern
"SVO" in its right means that the vocabulary index is placed between a
subject and a predicate. Similarly, a form mark "buy.PAST" indicates a
vocabulary index to its left is the past tense of a verb "buy", i.e.,
"bought"; a form mark "be.ORIG" indicates a vocabulary index in to left is
| | |