Description
The present invention relates to speech encoding.
In a number of applications, a signal representing spoken language is
encoded in such a manner that it can be stored digitally so that it can be
transmitted at a later time, or reproduced locally by some particular
device.
In these two cases, a very low bit rate may be necessary either in order to
correspond with the parameters of the transmission channel, or to allow
for the memorization of a very extensive vocabulary.
A low bit rate can be obtained by synthesizing speech from a text.
The code can then be an orthographic representation of the text itself,
which yields a bit rate of 50 bits per second.
To simplify the decoder used in an installation for processing information
coded in this way, the code can be composed of a sequence of phoneme codes
and prosodic markers obtained from the text, which entails a slight
increase in the bit rate.
Unfortunately, speech reproduced in this manner is not natural and, at
best, is very monotonic.
The principal reason for this drawback is the "synthetic" intonation which
one obtains with such a process.
This is understandable given the complexity of intonation phenomena, which
must not only comply with linguistic rules but should also reflect certain
aspects of the personality and the state of mind of the speaker.
At the present time, it is difficult to predict when prosodic rules capable
of giving language "human" intonations will be available for all
languages.
Coding processes with much higher bit rates also exist.
Such processes yield satisfactory results but have the principal drawback
of requiring memories having such large capacities that their use is often
impractical.
The invention seeks to remedy these difficulties by providing a speech
synthesis process which, while requiring only a relatively low bit rate,
reproduces speech with intonations that closely approach the natural
intonations of the human voice.
The invention therefore has as an object a speech encoding process
consisting of coding the written version of a message to be coded,
characterized in that it further includes coding the spoken version of the
same message and combining, with the codes of the written message, the
codes of the intonation parameters taken from the spoken message.
The invention will be better understood with the aid of the description
which follows, which is given only as an example, and with reference to
the figures.
FIG. 1 is a diagram showing the path of optimal correspondence between the
spoken and synthetic versions of a message to be coded by the process
according to the invention.
FIG. 2 is a schematic view of a speech encoding device utilizing the
process according to the invention.
FIG. 3 is a schematic view of a decoding device for a message coded
according to the process of the invention.
The written form of the message is used to produce an acoustical model of
the message in which the phonetic limits are known.
This can be obtained by using one of the following speech synthesis
techniques:
Synthesis by rule, in which each acoustical segment corresponding to a
phoneme of the message is obtained using acoustical/phonetic rules, and
which consists of calculating the acoustical parameters of the phoneme in
question according to the context in which it is to be realized.
G. Fant et al., O.V.E. II Synthesis Strategy, Proc. of Speech Comm. Seminar,
Stockholm 1962.
L. R. Rabiner, Speech Synthesis by Rule: An Acoustic Domain Approach. Bell
Syst. Tech. J. 47, 17-37, 1968.
L. R. Rabiner, A Model for Synthesizing Speech by Rule. I.E.E.E. Trans. on
Audio and Electr. AU 17, pp. 7-13, 1969.
D. H. Klatt, Structure of a Phonological Rule Component for a Synthesis by
Rule Program, I.E.E.E. Trans. ASSP-24, 391-398, 1976.
Synthesis by concatenation of phonetic units stored in a dictionary. These
units can be diphones (N. R. Dixon and H. D. Maxey, Terminal Analog
Synthesis of Continuous Speech using the Diphone Method of Segment
Assembly, I.E.E.E. Trans. AU-16, 40-50, 1968; F. Emerard, Synthese par
Diphone et Traitement de la Prosodie, Thesis, Third Cycle, University of
Languages and Literature, Grenoble 1977).
The phonetic units can also be allophones (Kun Shan Lin et al., Text to
Speech Using LPC Allophone Stringing, IEEE Trans. on Consumer Electronics,
CE-27, pp. 144-152, May 1981), demi-syllables (M. J. Macchi, A Phonetic
Dictionary for Demi-Syllabic Speech Synthesis, Proc. of ICASSP 1980, p.
565) or other units (G. V. Benbassat, X. Delon, Application de la
Distinction Trait-Indice-Propriete a la construction d'un Logiciel pour la
Synthese, Speech Comm. J., Volume 2, No. 2-3, July 1983, pp. 141-144).
Phonetic units are selected according to more or less sophisticated rules,
as a function of the nature of the units and of the written input.
The written message can be given either in its regular orthographic form or
in a phonologic form. When the message is given in orthographic form, it
can be transcribed into phonologic form by an appropriate algorithm
(B. A. Sherwood, Fast Text-to-Speech Algorithms for Esperanto, Spanish,
Italian, Russian and English, Int. J. Man-Machine Studies, 10, 669-692,
1978) or be converted directly into a set of phonetic units.
The coding of the written version of the message is effected by one of the
known processes mentioned above. The process of coding the corresponding
spoken message will now be described.
The spoken version of the message is first digitized and then analyzed in
order to obtain an acoustical representation of the speech signal similar
to that generated from the written form of the message, which will be
called the synthetic version.
For example, the spectral parameters can be obtained from a Fourier
transformation or, in a more conventional manner, from a linear predictive
analysis (J. D. Markel, A. H. Gray, Linear Prediction of Speech-Springer
Verlag, Berlin, 1976).
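As an illustration of the linear predictive analysis mentioned above, the
following is a minimal sketch of the autocorrelation method solved with the
Levinson-Durbin recursion. It is not taken from the source: the function
names, the absence of windowing and pre-emphasis, and the choice of predictor
order are all assumptions made for the example.

```python
def autocorrelation(frame, order):
    """Return the autocorrelation lags r[0..order] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation lags r.

    Returns (a, err): a[k-1] is the predictor coefficient a_k in the model
    x[n] ~ sum_k a_k * x[n-k], and err is the residual prediction energy.
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:               # silent or degenerate frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err
```

In such a sketch each frame of the digitized spoken version would be passed
through `autocorrelation` and then `levinson_durbin` to obtain one set of
prediction coefficients per frame.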
These parameters can then be stored in a form which is appropriate for
calculating a spectral distance between each frame of the spoken version
and the synthetic version.
For example, if the synthetic version of the message is obtained by
concatenations of segments analysed by linear prediction, the spoken
version can be also analysed using linear prediction.
The linear prediction parameters can easily be converted to the form of
spectral parameters (J. D. Markel, A. H. Gray), and a Euclidean distance
between the two sets of spectral coefficients provides a good measure of
the distance between the log amplitude spectra.
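The conversion and distance described above can be sketched as follows. The
sketch assumes that the spectral coefficients are cepstral coefficients
derived from the prediction coefficients, which is one common choice and is
not confirmed by the text; it also assumes the model
H(z) = 1 / (1 - sum_k a_k z^-k) and ignores the gain term. The function names
are hypothetical.

```python
import math


def lpc_to_cepstrum(a, n_ceps):
    """Convert predictor coefficients a_1..a_p into cepstral coefficients.

    Uses the standard recursion
        c_n = a_n + sum_{k=1}^{n-1} (k/n) * c_k * a_{n-k}
    with a_n taken as 0 for n > p. The gain term c_0 is omitted.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)         # c[0] is unused
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


def euclidean_distance(u, v):
    """Euclidean distance between two coefficient vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

The distance between one spoken frame and one synthetic frame would then be
`euclidean_distance(lpc_to_cepstrum(a_spoken, n), lpc_to_cepstrum(a_synth, n))`
for some chosen number of coefficients `n`.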
The pitch of the spoken version can be obtained using one of the numerous
existing algorithms for determining the pitch of speech signals (L. R.
Rabiner et al., A Comparative Performance Study of Several Pitch Detection
Algorithms, IEEE Trans. Acoust. Speech and Signal Process., Vol. ASSP-24,
pp. 399-417, Oct. 1976; B. Secrest, G. Doddington, Post Processing
Techniques For Voice Pitch Trackers, Procs. of the ICASSP 1982, Paris,
pp. 172-175).
The spoken and synthetic versions are then compared using a dynamic
programming technique operating on the spectral distances, in a manner
which is now classic in global speech recognition (H. Sakoe and S. Chiba,
Dynamic Programming Algorithm Optimization For Spoken Word Recognition,
IEEE Trans. ASSP-26, No. 1, Feb. 1978).
This technique is also called dynamic time warping since it provides an
element by element correspondence (or projection) between the two versions
of the message so that the total spectral distance between them is
minimized.
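A minimal sketch of such a dynamic time warping is given below. It uses the
simplest local constraint (diagonal, horizontal and vertical steps with equal
weight) and not necessarily the constraints of the cited reference; the
function name `dtw_align` and the frame representation are assumptions. Both
sequences are assumed to be non-empty.

```python
def dtw_align(synthetic, spoken, dist):
    """Align two frame sequences by dynamic programming.

    Returns (total_cost, path), where path is a list of index pairs
    (i, j) matching synthetic frame i to spoken frame j, from (0, 0)
    to the last frame of both sequences.
    """
    n, m = len(synthetic), len(spoken)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(synthetic[i], spoken[j])
            if i == 0 and j == 0:
                cost[i][j] = d
                continue
            candidates = []
            if i > 0 and j > 0:      # diagonal step
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            if i > 0:                # advance in the synthetic version only
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:                # advance in the spoken version only
                candidates.append((cost[i][j - 1], (i, j - 1)))
            best, prev = min(candidates)
            cost[i][j] = best + d
            back[i][j] = prev
    # Trace the optimal path back from the final cell.
    path = []
    node = (n - 1, m - 1)
    while node is not None:
        path.append(node)
        node = back[node[0]][node[1]]
    path.reverse()
    return cost[n - 1][m - 1], path
```

Here `dist` would be the spectral distance between two frames; any function
of two frames returning a non-negative number can be supplied.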
In FIG. 1, the abscissa shows the phonetic units up_1 to up_5 of the
synthetic version of a message and the ordinate shows the spoken version
of the same message, the segments s_1 to s_5 of which correspond
respectively to the phonetic units up_1 to up_5 of the synthetic version.
In order to match the duration of the synthetic version to that of the
spoken version, it suffices to adjust the duration of each phonetic unit
so that it equals the duration of the corresponding segment of the spoken
version.
After this adjustment, since the durations are equal, the pitch of the
synthetic version can be rendered equal to that of the spoken version
simply by rendering the pitch of each frame of the phonetic unit equal to
the pitch of the corresponding frame of the spoken version.
The prosody is then composed of the duration warping to apply to each
phonetic unit and the pitch contour of the spoken version.
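The extraction of the prosody from the optimal path can be sketched as
follows. The sketch assumes that the path is a list of pairs (synthetic
frame, spoken frame) that is monotonic and covers every synthetic frame, and
that each phonetic unit is identified by the index of its first synthetic
frame; these conventions and the function name are hypothetical.

```python
def prosody_per_unit(path, unit_starts, spoken_pitch):
    """Return one (duration, pitch_contour) pair per phonetic unit.

    path         -- list of (synthetic_frame, spoken_frame) index pairs
    unit_starts  -- first synthetic frame index of each unit, ascending
    spoken_pitch -- one pitch value per frame of the spoken version
    The duration is the number of spoken frames assigned to the unit.
    """
    # First spoken frame matched to each synthetic frame. The path is
    # monotonic, so the first pair seen for a frame has the smallest j.
    first_match = {}
    for i, j in path:
        first_match.setdefault(i, j)
    bounds = [first_match[s] for s in unit_starts] + [len(spoken_pitch)]
    return [(bounds[k + 1] - bounds[k],
             spoken_pitch[bounds[k]:bounds[k + 1]])
            for k in range(len(unit_starts))]
```

With this convention a unit whose first frame shares a spoken frame with the
next unit receives a duration of zero, so a practical implementation would
need a rule for that case.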
The encoding of the prosody will now be examined. The prosody can be coded
in different manners depending upon the compromise required between
fidelity and bit rate.
A very accurate way of encoding is as follows.
For each frame of the phonetic units, the corresponding optimal path can be
vertical, horizontal or diagonal.
If the path is vertical, the part of the spoken version corresponding to
this frame is longer than the frame by a factor equal to the length of the
path, expressed in frames.
Conversely, if the path is horizontal, all of the frames of the phonetic
units under that portion of the path must be shortened by a factor equal
to the length of the path. If the path is diagonal, the frames
corresponding to the phonetic units keep the same length.
With an appropriate local constraint of the time warping, the length of the
horizontal and vertical paths can be reasonably limited to three frames.
Then, for each frame of the phonetic units, the duration warping can be
encoded with three bits.
The pitch of each frame of the spoken version can be copied in each
corresponding frame of the phonetic units using a zero or one order
interpolation.
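The two interpolation orders can be illustrated by the following sketch,
which maps a pitch contour onto a different number of frames. Zero order is
taken here to mean holding the value of the preceding sample and first order
to mean linear interpolation between neighbouring samples; this reading, the
function name and the uniform spacing of the frames are assumptions.

```python
def copy_pitch(pitch, n_out, order=1):
    """Map a pitch contour onto n_out frames.

    order=0 holds the value of the sample at or before each position;
    order=1 interpolates linearly between the two neighbouring samples.
    """
    m = len(pitch)
    if n_out == 1 or m == 1:
        return [pitch[0]] * n_out
    out = []
    for t in range(n_out):
        pos = t * (m - 1) / (n_out - 1)   # position in the source contour
        lo = int(pos)
        if order == 0:
            out.append(pitch[lo])
        else:
            hi = min(lo + 1, m - 1)
            frac = pos - lo
            out.append(pitch[lo] * (1 - frac) + pitch[hi] * frac)
    return out
```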
The pitch values can be efficiently encoded with six bits.
As a result, such a coding leads to nine bits per frame for the prosody.
Assuming there is an average of forty frames per second, this entails about
four hundred bits per second, including the phonetic code.
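The arithmetic behind this figure can be laid out as follows. The prosody
terms come directly from the values above; the phonetic code term is an
assumption that borrows the figures given later in the description (five bits
per phoneme, 9 to 10 phonemes per second).

```latex
\begin{align*}
\text{prosody bits per frame} &= 3 + 6 = 9 \\
\text{prosody rate} &= 9 \times 40 = 360 \ \text{bits/s} \\
\text{phonetic code rate} &\approx 5 \times (9 \text{ to } 10)
    = 45 \text{ to } 50 \ \text{bits/s} \\
\text{total} &\approx 360 + 50 \approx 400 \ \text{bits/s}
\end{align*}
```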
A more compact way of coding can be obtained by using a limited number of
patterns to encode both the duration warping and the pitch contour.
Such patterns can be identified for segments containing several phonetic
units.
A convenient choice of such segments is the syllable. A practical
definition of the syllable is the following:
[consonant cluster] vowel [consonant cluster], where [ ] denotes an
optional element.
A syllable corresponds to several phonetic units, and its limits can be
automatically determined from the written form of the message. The limits
of the syllable can then be identified in the spoken version. If a set of
characteristic syllable pitch contours has been selected as representative
patterns, each of them can be compared with the actual pitch contour of
the syllable in the spoken version, and the one closest to the real pitch
contour is chosen.
For example, if there were thirty-two patterns, the pitch code for a
syllable would occupy five bits.
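The selection of the closest pattern can be sketched as a nearest-neighbour
search. The sketch assumes that every contour has already been resampled to
the same number of points as the stored patterns and that closeness is
measured by squared Euclidean distance; neither point is specified in the
text, and the function names are hypothetical.

```python
def nearest_pattern(contour, codebook):
    """Return the index of the codebook pattern closest to the contour."""
    best_index, best_dist = 0, float("inf")
    for index, pattern in enumerate(codebook):
        d = sum((x - y) ** 2 for x, y in zip(contour, pattern))
        if d < best_dist:
            best_index, best_dist = index, d
    return best_index


def bits_for(n_patterns):
    """Number of bits needed to index n_patterns distinct patterns."""
    return max(1, (n_patterns - 1).bit_length())
```

The index returned by `nearest_pattern` is the pitch code of the syllable;
with thirty-two patterns, `bits_for(32)` gives the five bits stated above.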
As regards the duration, a syllable can be split into three segments
following the definition given above: the initial consonant cluster, the
vowel and the final consonant cluster.
The duration warping factor can be calculated for each of these zones as
explained for the previous method.
The sets of three duration warping factors can be limited to a finite
number by selecting the closest one in a set of patterns.
For thirty-two patterns, this again entails five bits per syllable.
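The split into three zones and the computation of one warping factor per zone
can be sketched as follows. The sketch assumes that the zones are the initial
consonant cluster, the vowel and the final consonant cluster, that the
warping factor of a zone is the ratio of its spoken duration to its synthetic
duration, and that an absent zone is given a neutral factor of 1.0; all names
are hypothetical.

```python
def split_syllable(phonemes, is_vowel):
    """Split a syllable into (initial cluster, vowel part, final cluster)."""
    vowel_idx = [k for k, ph in enumerate(phonemes) if is_vowel(ph)]
    if not vowel_idx:
        raise ValueError("a syllable must contain a vowel")
    first, last = vowel_idx[0], vowel_idx[-1]
    return phonemes[:first], phonemes[first:last + 1], phonemes[last + 1:]


def zone_warp_factors(synth_frames, spoken_frames):
    """Return one spoken/synthetic duration ratio per zone.

    Both arguments hold three frame counts, one per zone. A zone with no
    synthetic frames (an absent consonant cluster) gets the factor 1.0.
    """
    return tuple(sp / sy if sy else 1.0
                 for sy, sp in zip(synth_frames, spoken_frames))
```

The resulting triple of factors would then be replaced by the index of the
closest triple in the stored set.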
The approach which has just been described requires about ten bits per
syllable for the prosody, which entails a total of 120 bits per second
including the phonetic code.
In FIG. 2, there is shown a schematic of a speech encoding device utilizing
the process according to the invention.
The input of the device is the output of a microphone.
The input is connected to the input of a linear prediction encoding and
analysis circuit 2; the output of the circuit 2 is connected to the input
of an adaptation algorithm operating circuit comprising a control circuit
3.
Another input of control circuit 3 is connected to the output of memory 4
which constitutes an allophone dictionary.
Finally, over a third input 5, the adaptation algorithm operating circuit
or control circuit 3 receives the sequences of allophones. The control
circuit 3 produces at its output an encoded message containing the
duration and the pitches of the allophones.
To assign a phrase prosody to an allophone chain, the phrase is recorded
and analysed in the control circuit 3 using linear prediction encoding.
The allophones are then compared with the linear prediction encoded phrase
in control circuit 3, and the prosody information, such as the duration of
the allophones and the pitch, is taken from the phrase and assigned to the
allophone chain.
If the data rate from the microphone to the input of the circuit 2 of
FIG. 2 is, for example, 96,000 bits per second, the corresponding encoded
message available at the output of the control circuit 3 has a rate of
120 bits per second.
The distribution of the bits is as follows.
Five bits for the designation of an allophone/phoneme (32 values).
Three bits for the duration (8 values).
Five bits for the pitch (32 values).
This makes up a total of thirteen bits per phoneme.
Taking into account that there are on the order of 9 to 10 phonemes per
second, a rate on the order of 120 bits per second is obtained.
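The bit distribution above can be illustrated by packing the three fields
into one 13-bit word. The order of the fields within the word and the
function names are assumptions; the text gives only the field widths.

```python
# Field widths from the text; the order within the word is an assumption.
FIELDS = (("allophone", 5), ("duration", 3), ("pitch", 5))


def pack_phoneme(allophone, duration, pitch):
    """Pack the three codes into a single 13-bit integer."""
    word = 0
    for (name, width), value in zip(FIELDS, (allophone, duration, pitch)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} code {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_phoneme(word):
    """Recover (allophone, duration, pitch) from a 13-bit integer."""
    values = []
    for _, width in reversed(FIELDS):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return tuple(reversed(values))


def bit_rate(phonemes_per_second):
    """Bits per second for a given phoneme rate."""
    return sum(width for _, width in FIELDS) * phonemes_per_second
```

At 9 and 10 phonemes per second, `bit_rate` gives 117 and 130 bits per
second respectively, which brackets the figure of about 120 bits per second.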
The circuit shown in FIG. 3 is the decoding circuit for the signals
generated by the control circuit 3 of FIG. 2.
This device includes a concatenation algorithm elaboration circuit 6, one
input of which is adapted to receive the message encoded at 120 bits per
second.
At another input, the circuit 6 is connected to an allophone dictionary 7.
The output of circuit 6 is connected to the input of a synthesizer 8, for
example of the type TMS 5200 A, available from Texas Instruments
Incorporated of Dallas, Texas. The output of the synthesizer 8 is
connected to a loudspeaker 9.
Circuit 6 produces a linear prediction encoded message having a rate of
1,800 bits per second, and the synthesizer 8 in turn converts this message
into a message having a bit rate of 64,000 bits per second which is usable
by loudspeaker 9.
For the English language, there has been developed an allophone dictionary
including 128 allophones of a length between 2 and 15 frames, the average
length being 4 or 5 frames.
For the French language, the allophone concatenation method is different in
that the dictionary includes 250 stable states and this same number of
transitions.
The interpolation zones are utilized for rendering the transitions between
the allophones of the English dictionary more regular.
The interpolation zones are also utilized for regularizing the energy at
the beginning and at the end of the phrases. To obtain a data rate of 120
bits per second, three bits per phoneme are reserved for the duration
information.
The duration code is the ratio of the number of frames in the modified
allophone to the number of frames in the original. This encoding ratio is
necessary for the allophones of the English language as their length can
vary from one to fifteen frames.
On the other hand, as the totality of transitions plus stable states in the
French language has a length of four to five frames, their modified length
can be equal to two to nine frames and the duration code can be a number
of frames in the totality of stable states plus modified transitions.
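The two forms of duration code can be contrasted in a short sketch. For the
ratio form, the table of eight ratios is purely hypothetical, since the text
does not give the values. For the frame-count form, the sketch relies on the
observation that lengths of two to nine frames are exactly eight values and
therefore fit in three bits; storing the length minus two is an assumption.

```python
# Hypothetical table: the text does not specify the eight ratio values.
RATIOS = (0.5, 0.67, 0.8, 1.0, 1.25, 1.5, 2.0, 3.0)


def encode_ratio(original_frames, modified_frames):
    """3-bit code: index of the table ratio closest to modified/original."""
    ratio = modified_frames / original_frames
    return min(range(len(RATIOS)), key=lambda c: abs(RATIOS[c] - ratio))


def decode_ratio(code, original_frames):
    """Modified length in frames, at least one frame."""
    return max(1, round(RATIOS[code] * original_frames))


def encode_frames(modified_frames):
    """3-bit code for a length of 2 to 9 frames, stored as length - 2."""
    if not 2 <= modified_frames <= 9:
        raise ValueError("length must be between 2 and 9 frames")
    return modified_frames - 2


def decode_frames(code):
    """Length in frames for a 3-bit frame-count code."""
    return code + 2
```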
The invention which has been described provides for speech encoding with a
data rate which is relatively low with respect to the rate obtained in
conventional processes.
The invention is therefore particularly applicable to books whose pages
include, in parallel with written lines or images, a corresponding encoded
text which is reproducible by a synthesizer.
The invention is also advantageously used in video text systems developed
by the applicant and in particular in devices for the audition of
synthesized spoken messages and for the visualization of graphic messages
corresponding to the type described in the French patent application No.
FR 8309194, filed 2 June 1983, by the applicant.
* * * * *