BACKGROUND OF THE INVENTION
Field of the Invention
This invention relates generally to the storage and reproduction of human
speech by electronic means, and more particularly to the storage and
reproduction of human speech by means of a digital computer.
Means for electronically storing and reproducing human speech are well
known. Such means generally process the speech entirely in analog form,
and while storing and reproducing analog speech signals performs its
intended function satisfactorily, there are many desirable operations
which cannot be performed on analog speech signals. Among these desired
operations are synthesis of spoken words, computer recognition of spoken
words, visual display of phonetic elements of spoken words, and the like.
It has long been desired to use digital computers to perform the above
operations as well as others which cannot be performed or which cannot be
performed satisfactorily solely by means of analog storage and
reproduction of speech. Many techniques of storing and reproducing speech
in digital form have been tried. One such technique comprises the storage
in digital form of a highly accurate representation of a relatively small
number of words. This technique enables a computer to reproduce the stored
words in audible form with a very natural sound. However, only a small
number of words can be stored in this manner, and the process of
generating the necessary data for storage is relatively difficult and
expensive. Accordingly, this technique finds a primary application in
reproducing, under digital control, one of a selected number of words.
Examples of devices embodying this technique include a machine, such as an
automobile, which automatically warns its operator in audible form of a
dangerous condition, and a talking toy.
Another technique for generation of speech by digital means comprises the
use of a phoneme generator. Such a device can generate many words and
requires much less data than does the previous technique, but a phoneme
generator provides words which are of low quality and, although
understandable, tend to have a flat, mechanical sound.
The above techniques, in addition to the drawbacks already discussed, can
only be used as devices to provide an audible output of speech data which
has been previously stored in a computer; an entirely different technique
must be used to enter data indicative of speech into a computer.
A primary method of entering speech data into a computer comprises a
microphone coupled to an analog to digital converter. The analog to
digital converter periodically samples an analog waveform provided by the
microphone and provides the samples to a computer for storage and further
processing. If the samples produced by this technique are later applied by
the computer to a digital to analog converter, an audible output
approximating the original speech input can be obtained. This technique
can provide a very high quality of reproduced speech which has
characteristics similar to the characteristics of the voice of the person
who originally spoke the words which were stored. However, to store very
much speech in this manner requires enormous amounts of computer memory.
Moreover, although this method can store and reproduce speech accurately,
the stored data is highly speaker dependent. The speaker dependence of
this technique severely limits its value insofar as any analysis of the
speech or actual recognition of the spoken words by the computer is
concerned.
Accordingly, there is a need for a way to store human speech in a computer
without using unduly large amounts of storage and in a manner which
enables the computer to analyze the speech and identify the spoken words
independent of the particular voice characteristics of the speaker.
SUMMARY OF THE INVENTION
The present invention provides a computer speech system which compresses a
segment of speech by storing companded representations of the differences
between successive maxima and minima of a speech waveform, together with
the elapsed times between said maxima and minima. The companded
differences and the elapsed times are packed together in a single byte for
storage, thereby enabling the data to be stored in a relatively small
amount of storage space. The particular stored data can be recognized by
the computer as representing specific words, or other parts of speech, and
this recognition is largely speaker independent.
A method of compressing a segment of speech according to the invention
comprises examining a stream of samples of the speech to find a sample
which comprises a maximum in relation to its adjacent predecessor and
successor samples, examining the following samples to find a sample which
comprises a minimum, and calculating (1) a first delta number indicative
of the algebraic difference between the minimum and maximum samples and
(2) a first time number indicative of how much time elapsed between the
occurrences of said samples. The following samples are then examined to
find a sample which comprises a maximum in relation to its adjacent
neighboring samples, and a second delta number indicative of the algebraic
difference between said maximum sample and the previously found minimum
sample is then calculated. A second time number indicative of how much
time elapsed between the occurrences of said samples is also calculated.
The delta number may comprise a companded representation of the algebraic
difference between successive maxima and minima. A delta number and its
corresponding time number may be packed, for example into a single data
byte for storage.
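By way of illustration only, the search for successive maxima and minima described above may be sketched as follows. The function name `extrema_deltas`, the treatment of equal neighboring samples as the end of a rise, and the treatment of the final sample as a turning point are assumptions made for this sketch and are not taken from the specification. Companding and packing are omitted here.

```python
def extrema_deltas(samples):
    """Return (delta, elapsed) pairs between successive turning points.

    delta is the signed difference between a turning point and the
    previous one; elapsed is the number of sample intervals between them.
    """
    if len(samples) < 2:
        return []
    pairs = []
    last_value = samples[0]   # value at the previous turning point
    last_index = 0            # index of the previous turning point
    rising = samples[1] > samples[0]
    for i in range(1, len(samples)):
        at_end = i == len(samples) - 1
        if not at_end:
            still_rising = samples[i + 1] > samples[i]
            if still_rising == rising:
                continue      # slope has not reversed yet
        # samples[i] is a maximum (if rising) or a minimum (if falling)
        pairs.append((samples[i] - last_value, i - last_index))
        last_value, last_index = samples[i], i
        rising = not rising
    return pairs


print(extrema_deltas([0, 3, 5, 2, 1, 4]))  # [(5, 2), (-4, 2), (3, 1)]
```

In this sketch a positive delta corresponds to a rise from a minimum to a maximum and a negative delta to a fall from a maximum to a minimum.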
Various operations may be performed on the segment of speech as represented
by the delta numbers and time numbers. The delta numbers and their
corresponding time numbers may be compared with previously stored data,
and an output indicative of the result of said comparison may be provided.
This would be done, for example, if it were desired to compare a segment
of speech with a previously stored segment to determine whether the two
segments represented the same word or word part (phoneme).
The delta numbers and time numbers may be analyzed in accordance with a
predetermined procedure such as a mathematical rule or set of rules.
A visual display indicative of the delta numbers and the time numbers can
be generated, for example by plotting points on a computer monitor
according to the values of the delta numbers and the time numbers. One
such point might be plotted, for example, by reference to an x-y
coordinate system in which the x value is determined by the magnitude of
the delta number and the y value is determined by the magnitude of the
time number. Delta numbers representing a decreasing slope of the speech
segment, that is, a slope extending from a maximum to a minimum, may be
plotted on one side of the monitor screen and delta numbers representing
an increasing slope may be plotted on the other side of the screen to
provide a visual indication of the relative phase of the delta numbers.
Varying colors on the screen may be used to indicate repetition rates of
the various plotted points.
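A minimal sketch of how such plot coordinates and repetition counts might be derived is given below. The name `plot_points`, the parameter `half_width`, and the choice of the left half of the screen for decreasing slopes are hypothetical; the specification does not state which side is used for which slope sense, nor how counts map to colors.

```python
from collections import Counter


def plot_points(pairs, half_width=16):
    """Count how often each (x, y) screen point occurs.

    pairs holds (delta, time) tuples. Increasing slopes (delta > 0) are
    placed to the right of centre, all others to the left (an assumption).
    """
    counts = Counter()
    for delta, time in pairs:
        magnitude = abs(delta)
        if delta > 0:
            x = half_width + magnitude
        else:
            x = half_width - 1 - magnitude
        counts[(x, time)] += 1   # the count could later select a colour
    return counts


points = plot_points([(3, 2), (-3, 2), (3, 2)])
print(points[(19, 2)], points[(12, 2)])  # 2 1
```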
A delta number may be computed by taking the algebraic difference between a
sample and a previously determined reference point. For example, the first
delta number may be computed by taking the difference between the first
minimum sample and a max reference determined according to the preceding
maximum sample.
The delta numbers may be companded, for example by a logarithmic expansion,
to provide a companded delta number. The companded delta number may be
compared with a test value and, if the companded delta number exceeds the
test value, the companded delta number may be set equal to the test value.
This may be done, for example, to prevent overflow of a companded delta
number beyond the maximum size of the storage unit in which the companded
delta number is to be stored.
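The companding and limiting steps may be sketched as shown below. This section does not give the companding law, so the sketch substitutes a simple base-two logarithmic curve (the bit length of the magnitude) purely as an assumption; `compand` and `clamp_companded` are hypothetical names.

```python
def compand(delta):
    """Map a difference onto a small code using a logarithmic curve.

    Assumption: the code is the number of bits needed to hold |delta|,
    which equals floor(log2(|delta|)) + 1 for non-zero values.
    """
    return abs(int(delta)).bit_length()


def clamp_companded(code, test_value):
    """Limit a companded code so it cannot exceed the test value."""
    return test_value if code > test_value else code


print(compand(255))                       # 8
print(clamp_companded(compand(255), 5))   # 5
```

With a curve of this kind, large differences are represented coarsely and small differences finely, which is what allows a difference to fit within a few bits.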
If a time number exceeds a predetermined time, the first delta number and
the first time number may be scaled according to a time adjustment factor,
and the scaled first delta number then companded to provide the companded
delta number. This may be done, for example, to break a long interval into
two shorter intervals whereby no stored time interval exceeds the maximum
storage space available to store a number representing a time interval.
A plurality of segments of human speech, each segment represented by a
stream of samples, may be stored according to the invention by selecting a
speech segment to be stored; selecting a segment type, one of which
comprises compressing; storing a value indicative of which type was
selected; deriving data indicative of the speech segment to be stored
according to which type was selected; and then storing the derived data.
The compressing type may comprise a method of compressing as outlined
above. Other segment types which may also be utilized include fricative
synthesis, hold, raw data, repeat, and an automatic type which
automatically selects between two or more of the remaining types according
to previously provided parameters.
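The storing of a type value ahead of the derived data may be sketched as follows. The numeric type codes, the one-byte tag, and the name `store_segment` are hypothetical and chosen only for illustration; the automatic type is omitted because it resolves to one of the other types before storage.

```python
from enum import IntEnum


class SegmentType(IntEnum):
    # Hypothetical codes; the specification does not give numeric values.
    COMPRESS = 0
    FRICATIVE = 1
    HOLD = 2
    RAW = 3
    REPEAT = 4


def store_segment(segment_type, derived_data):
    """Return the stored form: a type byte followed by the derived data."""
    return bytes([segment_type]) + bytes(derived_data)


print(store_segment(SegmentType.RAW, [1, 2, 3]))  # b'\x03\x01\x02\x03'
```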
Previously stored data indicative of a segment of human speech can be used
to generate a speech signal by computing a numerical value indicative of
the data; deriving a computer address according to the computed numerical
value; causing a computer to perform an operation accessing the derived
address; and applying the derived address to a data input of a signal
generating device such as a digital to analog converter. In other words,
the data for the digital to analog converter can be transferred from the
computer to the converter over an address bus rather than over a data bus.
Of course, it would also be possible to transfer the data to the digital
to analog converter by means of the data bus if it were more convenient to
do so.
The computer address may be derived by, for example, combining the
numerical value indicative of the data with a predetermined segment
address value. The offset portion of the derived address may then be
applied to the data input of the digital to analog converter.
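The address derivation may be sketched as follows, assuming a 16-bit address, an 8-bit converter, and a segment address whose low byte is zero. The constant `SEGMENT_BASE` and both function names are hypothetical; the actual address values depend on the hardware and are not given in this section.

```python
SEGMENT_BASE = 0xC000  # hypothetical address range decoded by the converter


def derive_address(value, segment_base=SEGMENT_BASE):
    """Combine an 8-bit output value with the segment address."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    return segment_base + value


def converter_input(address):
    """Offset portion of the address, as seen at the converter data input."""
    return address & 0xFF


address = derive_address(0x5A)
print(hex(address), hex(converter_input(address)))  # 0xc05a 0x5a
```

In such an arrangement the computer need only perform an access to the derived address; the low-order address lines themselves carry the value to the converter.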
Apparatus for generating a speech signal as outlined above comprises a
computer including data storage means and an address bus, means for
storing the data, means for computing a numerical value indicative of the
data, means for deriving a computer address according to the numerical
value, means for causing the computer to put the derived address on the
address bus, and means for applying the derived address to a data input of
the signal generating device.
Other aspects and advantages of the present invention will become apparent
from the following detailed description, taken in conjunction with the
accompanying drawings, which illustrate by way of example the principles
of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
FIGS. 1A, 1B and 1C show a flow chart illustrating a method of compressing
speech according to the invention.
FIG. 2 is a graph illustrating certain of the steps of the flow chart of
FIG. 1.
FIG. 3 is a graph illustrating others of the steps of the flow chart of
FIG. 1.
FIG. 4 is a flow chart illustrating a compression procedure including the
step of comparing or the step of analyzing the data.
FIG. 5 is a flow chart similar to that of FIG. 4 but illustrating
displaying the data on a visual display.
FIG. 6 is a flow chart illustrating a method of reproducing speech which
has previously been compressed. FIG. 6 comprises two sheets, FIGS. 6a and
6b.
FIG. 7 illustrates a direct connection of computer processing circuitry to
a ROM socket of a computer.
FIG. 8 illustrates a conventional connection of speech processing circuitry
to a computer.
FIG. 9 illustrates an interface block as shown in FIG. 8.
FIGS. 10 through 12 show a schematic diagram of the blocks (except block
601) illustrated in FIGS. 7 and 8.
DESCRIPTION OF THE PREFERRED EMBODIMENT
As shown in the drawings for purposes of illustration, the invention is
embodied in a method and apparatus for compressing, storing and
reproducing human speech by means of a computer.
Various methods of storing and reproducing speech by means of computers
have been developed, among them the use of precomputed speech segments,
the use of phoneme speech synthesizers, and the storage of actual speech
by a process of sampling.
A computer speech system according to the invention provides a method for
compressing speech by storing key parameters of the speech in a relatively
small volume of storage. The parameters tend to be speaker independent and
can be displayed or analyzed, for example, by comparison with previously
stored speech data.
A method of compressing a segment of human speech represented by a stream
of samples, as illustrated in FIGS. 1a, 1b and 1c comprises: examining the
samples to find a sample which comprises a maximum in relation to its
adjacent predecessor and successor samples, as indicated by a "second
sample greater than first sample?" decision block 101, and a "next sample
greater than previous sample?" decision block 103. The next step comprises
examining the samples which follow the maximum sample to find a sample
which comprises a minimum in relation to its adjacent predecessor and
successor samples, as indicated by a "next sample less than previous
sample?" decision block 105. The next step comprises calculating (1) a
first delta number indicative of the algebraic difference between the
minimum and maximum samples found in the preceding steps, as indicated by
a "let diff equals max minus previous sample" block 107, and (2) a first
time number indicative of how much time elapsed between the occurrences of
said samples, as indicated by a calculation block 109. The next step
comprises examining the samples which follow said minimum sample to find a
sample which comprises a maximum in relation to its adjacent predecessor
and successor samples, as again indicated by the decision block 103. The
next step comprises calculating (1) a second delta number indicative of the
algebraic difference between the maximum sample found in the preceding step
and the minimum sample found previously, as indicated by a "let delta equal
previous sample minus min" block 111, and (2) a second time number
indicative of how much time elapsed between the occurrences of said
samples, as indicated by a calculation block 113.
A delta number indicative of an algebraic difference may comprise a
companded representation of said difference, as indicated in the
calculation blocks 109 and 113. A delta number and its corresponding time
number may be packed for storage, for example by putting the delta number
into one nibble of a data byte and the time number into the other nibble
of the same byte.
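The nibble packing may be sketched as follows. Placing the delta number in the high nibble and the time number in the low nibble is an assumption; the text states only that each occupies one nibble of the same byte. The names `pack` and `unpack` are hypothetical.

```python
def pack(delta_code, time_code):
    """Pack two 4-bit values into one byte (delta high, time low)."""
    if not (0 <= delta_code <= 15 and 0 <= time_code <= 15):
        raise ValueError("both values must fit in four bits")
    return (delta_code << 4) | time_code


def unpack(byte):
    """Recover (delta_code, time_code) from a packed byte."""
    return byte >> 4, byte & 0x0F


print(hex(pack(9, 3)))   # 0x93
print(unpack(0x93))      # (9, 3)
```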
The first delta number may be calculated by taking the algebraic difference
between the minimum sample and a max reference, the max reference being
indicated by the designation "max" in the block 107, the max reference
having been determined according to the maximum sample. Similarly, the
second delta number may be calculated by taking the algebraic difference
between the maximum and a min reference value, as indicated by the
designation "min" in the block 111, the min reference value having been
determined according to the minimum sample.
A companded delta number may be compared with a test value, and if the
companded delta number exceeds the test value, the companded delta number
may be set equal to the test value, as indicated in the computation blocks
109 and 113.
If the first time number exceeds a predetermined time, as indicated in a
decision block 115, the first delta number and the first time number may
be scaled according to a time adjustment factor indicated by the
designation "adjtime" in a calculation block 117, and the scaled first
delta number may then be companded to provide a first companded delta
number. Similarly, if the second time number exceeds the predetermined
time as indicated by a decision block 119, the second delta number and the
second time number may be scaled according to a time adjustment factor and
the scaled delta number then companded to provide the companded second
delta number, as indicated in a computation block 121.
A "min" reference value may have been determined by taking a value of the
first of the stream of samples, as indicated in a block 123. Similarly, a
reference time number may be calculated indicative of how much time
elapsed between the first maximum sample and the first sample. A reference
delta number, indicative of the algebraic difference between the first
maximum sample and the min reference value, may then be companded to
provide a companded reference delta number, and a max reference value may
then be determined by adding the companded reference delta number to said
min reference value, as indicated in the calculation block 113. A new min
reference value may
thereafter be determined by subtracting the first companded delta number
as determined in the calculation block 109, from the max reference value
as also indicated in the calculation block 109, and then replacing the
previously determined min reference value with the new min reference
value. Similarly, after the second delta number has been calculated, a new
max reference value can be determined by adding the second companded delta
number to the min reference value and replacing the previously determined
max reference value with the new max reference value all as indicated in
the calculation block 113.
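The alternating update of the max and min reference values may be sketched as follows. The function name `track_references` and its parameters are hypothetical. One plausible reading, offered here as an inference rather than a statement of the specification, is that the references are rebuilt from the companded numbers so that they track the values a reproducing routine would later reconstruct.

```python
def track_references(first_sample, comp_deltas, starts_rising=True):
    """Return the successive max/min reference values.

    Starting from the first sample, each companded delta is added when
    the slope is increasing (giving a new max reference) and subtracted
    when it is decreasing (giving a new min reference).
    """
    references = []
    current = first_sample
    rising = starts_rising
    for comp_delta in comp_deltas:
        current = current + comp_delta if rising else current - comp_delta
        references.append(current)
        rising = not rising
    return references


print(track_references(2, [5, 3, 4]))  # [7, 4, 8]
```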
Following the flow chart of FIGS. 1a, 1b and 1c in sequence, the compression
routine begins with a calculation block 125 comprising three steps. The
first step, "store type equal compress", stores a value indicative of the
compression type to distinguish the type of segment from some other type
which may be selected, as will be more particularly discussed in a
subsequent paragraph. The second step of the block 125, "let total equal
one", sets a counter designated as "total" equal to one, representing the
first sample. The third step of the block 125, "store value of first
sample", indicates that the value of the first of the stream of samples is to
be stored. The block 125 leads to the decision block 101 as indicated by a
line 127.
If the second sample exceeds the first sample, as indicated by the decision
block 101, the steps in calculation block 123 are performed, as indicated
by a line 129 extending from a "yes" output of the decision block 101 to
the block 123. The first step of the block 123, "store slope sense equal
increase", indicates that a value indicating an increasing slope is to be
stored. The second step of the block 123, "let min equal value of first
sample", determines an initial min reference value.
A block 131 follows the block 123 as indicated by a line 135. The block 131
comprises a single step, "let time equal zero", to set a counter
designated as "time" to zero. The next step is the decision block 103 as
indicated by a line 137 connecting the block 131 to the decision block
103. If the next sample exceeds the previous sample, the "time" is
incremented, as indicated by a block 139 connected to a "yes" output of
the block 103 as indicated by a line 141. After the time is incremented,
another sample is tested, as indicated by a line 143 extending from block
139 back to the input of the decision block 103.
Eventually, a maximum will have been reached, which will be indicated by a
"no" output from the decision block 103. When this happens, the step in
the block 111 is next performed, as indicated by a line 145 extending from
the "no" output of the block 103 to the block 111.
After performing the step in the block 111, the decision block 119 is next
encountered, as indicated by a line 147 connecting the block 111 to the
decision block 119. A "no" output of the decision block 119 is connected
to the computation block 113 as indicated by a line 149, and a "yes"
output of the decision block 119 is connected to the computation block 121
as indicated by a line 151.
The computation block 113 comprises the steps of "compand delta to get comp
delta", "let test equal fifteen minus min", "if comp delta greater than
test, let comp delta equal test", "let max equal min plus comp
delta", "store comp delta as next value", "store time as next time", and
"increment total".
The steps of calculation block 121 include "let adjdelta equal delta/2",
"compand adjdelta to get comp adjdelta", "let adjtime equal time/2
(truncate remainder)", "let adjtest equal (15 minus min)/2", "if comp
adjdiff greater than adjtest, let comp adjdiff equal adjtest", "let
max equal min plus (2 × comp adjdiff)", "store comp adjdiff as next
value", "if time even, store (adjtime minus one) as next time else store
adjtime as next time", "store 14 as next value", "store 0 as next time",
"store comp adjdiff as next value", "store adjtime as next time", and "let
total equal total plus three".
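The steps of the calculation blocks 113 and 121 may be sketched together as follows. Several points are assumptions made for this sketch: the companding law (a bit-length placeholder, since none is given here), the use of integer division for "delta/2" and "(15 minus min)/2", the reading of "comp adjdiff" and "comp adjdelta" as the same quantity, and the representation of stored data as a list of (value, time) records. The counter "total" would be advanced by the number of records returned.

```python
def compand(delta):
    """Placeholder companding law (an assumption): bit length of |delta|."""
    return abs(int(delta)).bit_length()


def block_113(delta, time, minimum):
    """Ordinary interval: one record is stored."""
    comp_delta = compand(delta)
    test = 15 - minimum
    if comp_delta > test:
        comp_delta = test            # keep the new max within 0..15
    maximum = minimum + comp_delta
    return maximum, [(comp_delta, time)]


def block_121(delta, time, minimum):
    """Long interval: split into two halves joined by a (14, 0) record."""
    adjdelta = delta // 2
    comp_adjdelta = compand(adjdelta)
    adjtime = time // 2              # remainder truncated
    adjtest = (15 - minimum) // 2
    if comp_adjdelta > adjtest:
        comp_adjdelta = adjtest
    maximum = minimum + 2 * comp_adjdelta
    first_time = adjtime - 1 if time % 2 == 0 else adjtime
    records = [(comp_adjdelta, first_time), (14, 0), (comp_adjdelta, adjtime)]
    return maximum, records


print(block_113(200, 4, 10))   # (15, [(5, 4)])
print(block_121(100, 20, 2))   # (14, [(6, 9), (14, 0), (6, 10)])
```

In the second example the two stored times sum to one less than the original time of 20, which suggests that the intermediate (14, 0) record accounts for one time unit; this interpretation is an inference from the listed steps.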
After performing the steps in either calculation block 113 or calculation
block 121, whichever has been performed, the step in a decision block
153, specifically "more samples?", is next performed, as indicated by a
line 155 extending from both blocks 113 and 121 to decision block 153.
If there are more samples to be processed, a step in a block 157 is next
performed, as indicated by a line 159 connecting a "yes" output from the
decision block 153 to the block 157. If no more samples are to be
processed, execution of the compress type segment ends with storing the
contents of the "total" counter, as indicated by a block 161 connected to
a "no" output of the decision block 153 through a line 163.
Returning now to the decision block 101, if the second sample does not
exceed the first sample, the steps of a block 165 are next performed, as
indicated by
a line 167 connecting a "no" output of the decision block 101 to the block
165. The steps of the block 165 are similar to those of the block 123
except that the initial slope sense is "decrease" and the initially
defined reference value is designated as "max" rather than "min".
After the block 165, the block 157 is performed as indicated by the line
159 connecting the output of the block 165 with the block 157. The block
157 is similar to the block 131. The output of the block 157 is connected
to the input of the decision block 105 as indicated by a line 169.