Claims
We claim:
1. A speech encoding apparatus comprising:
input means for receiving speech including one or more words of human
language;
analysis means connected to said input means for analyzing said received
speech, generating a sequence of phonological linguistic unit indicia
corresponding to said received speech, grouping said phonological
linguistic unit indicia into syllables, and generating duration data
corresponding to the duration of said received speech for each
phonological linguistic unit indicia, said analysis means including:
a linear predictive coding analyzer connected to said input means for
providing linear predictive coding speech parameters including energy
parameters, pitch parameters, and reflection coefficient parameters from
the received speech,
phonological linguistic unit recognition means connected to said linear
predictive coding analyzer for receiving said linear predictive coding
speech parameters therefrom,
phonemic memory means having a plurality of templates of digital signals
representative of phonological linguistic unit speech parameters and
including standard durations corresponding to each of said phonological
linguistic unit speech parameters, said phonemic memory means being
connected to said phonological linguistic unit recognition means, and
said phonological linguistic unit recognition means producing speech
parameters indicative of the sequence of phonological linguistic unit
indicia in response to comparing said linear predictive coding speech
parameters with said plurality of templates in said phonemic memory means
to select the template from said phonemic memory means providing the best
match to respective linear predictive coding speech parameters and
generating said duration data as duration parameters based upon a standard
duration of the corresponding phonological linguistic unit data stored in
said phonemic memory means;
syllable recognition means connected to the output of said phonological
linguistic unit recognition means and being responsive to the sequence of
phonological linguistic unit indicia produced therefrom to determine
syllables in which said phonological linguistic unit indicia are grouped;
duration pattern memory means storing a plurality of predetermined duration
patterns corresponding to each syllable grouping of phonological
linguistic unit indicia;
duration pattern recognizer means operably connected to said analysis means
via said syllable recognition means and to said duration pattern memory
means for selecting a duration pattern from said plurality of
predetermined duration patterns for each syllable grouping of phonological
linguistic unit indicia as generated by said analysis means, said duration
pattern being selected in dependence upon said duration data corresponding
to each syllable grouping of phonological linguistic unit indicia; and
transmission means operably connected to said analysis means and said
duration pattern recognizer means for transmitting said speech parameters
indicative of said phonological linguistic unit indicia and duration
parameters indicative of duration pattern indicia corresponding to said
selected duration patterns as encoded speech data from which audible
synthesized speech having a duration contour approximating the duration
contour of the original speech as received by said input means may be
produced while employing a relatively low data rate.
2. A speech encoding apparatus as claimed in claim 1, wherein:
said analysis means generates phonological linguistic unit indicia
corresponding to phonemes of said received speech.
3. A speech encoding apparatus as claimed in claim 1, wherein:
said analysis means generates phonological linguistic unit indicia
corresponding to allophones of said received speech.
4. A speech encoding apparatus as claimed in claim 1, wherein:
said analysis means generates phonological linguistic unit indicia
corresponding to diphones of said received speech.
5. A speech encoding apparatus as claimed in claim 1, wherein:
said duration pattern recognizer means includes comparison means for
comparing said duration data for each syllable grouping of phonological
linguistic unit indicia with each of said duration patterns of said
duration pattern memory means and generating a measure of the similarity
therebetween, and selection means for selecting the duration pattern from
said plurality of predetermined duration patterns having the best measure
of similarity for each syllable grouping of phonological linguistic unit
indicia.
6. A speech encoding apparatus as claimed in claim 1, wherein:
said analysis means further includes syllable classifying means for
classifying each of said syllables as one of a predetermined plurality of
syllable types depending upon the type of phonological linguistic unit
indicia therein; and
said duration pattern recognizer means further includes means for selecting
said duration pattern corresponding to each syllable from among a
predetermined subset of said plurality of predetermined duration patterns,
said predetermined subset selected being based upon the syllable type of
said syllable.
7. A speech encoding apparatus as claimed in claim 6, wherein:
said syllable classifying means classifies said syllables dependent upon
the presence or absence of unvoiced initial consonant phonological
linguistic unit indicia and the presence or absence of unvoiced final
consonant phonological linguistic unit indicia.
8. A speech encoding apparatus as claimed in claim 7, wherein:
said syllable classifying means classifies said syllables into one of four
differing syllable types, firstly those having unvoiced initial consonant
phonological linguistic unit indicia and having unvoiced final consonant
phonological linguistic unit indicia, secondly, those having unvoiced
initial consonant phonological linguistic unit indicia and having no
unvoiced final consonant phonological linguistic unit indicia, thirdly
those having no unvoiced initial consonant phonological linguistic unit
indicia and having unvoiced final consonant phonological linguistic unit
indicia, and fourthly those having no unvoiced initial consonant
phonological linguistic unit indicia and having no unvoiced final
consonant phonological linguistic unit indicia.
9. A speech encoding apparatus as claimed in claim 1, wherein:
said analysis means generates said duration data by comparison of the
duration of said received speech corresponding to each phonological
linguistic unit indicia to a predetermined reference duration for said
phonological linguistic unit indicia.
10. A speech encoding apparatus as claimed in claim 9, wherein:
said duration pattern recognizer means selects said duration pattern
dependent upon the comparisons of said received speech duration and said
reference duration for each phonological linguistic unit indicia in any
initial consonant phonological linguistic unit indicia, the vowel
phonological linguistic unit indicia and any final consonant phonological
linguistic unit indicia.
11. A speech encoding apparatus as claimed in claim 1, wherein:
said transmission means further includes means for transmitting additional
speech parameters providing an indication of the grouping of phonological
linguistic unit indicia into syllables.
12. A speech encoding apparatus as claimed in claim 1, wherein:
said transmission means comprises machine readable optical bar code.
13. A speech producing apparatus comprising:
input means for receiving a sequence of encoded speech data including a
first part containing a sequence of phonological linguistic unit indicia,
a second part containing syllable indicia for grouping said phonological
linguistic unit indicia into syllables, and a third part containing a
sequence of duration pattern indicia, each duration pattern indicia
indicating one of a plurality of predetermined duration patterns;
control means connected to said input means for converting said sequence of
encoded speech data into a sequence of speech synthesis parameters
including duration control parameters for parts of each syllable grouping
of said phonological linguistic unit indicia corresponding to said
sequence of duration pattern indicia, said control means including
phonemic memory means for storing speech synthesis parameters corresponding
to each of said phonological linguistic unit indicia,
duration pattern memory means for storing duration control parameters
corresponding to each of said plurality of predetermined duration
patterns,
recall means for recalling speech parameters corresponding to said sequence
of phonological linguistic unit indicia and for recalling duration control
parameters corresponding to said sequence of duration pattern indicia, and
concatenation means for combining said recalled duration control parameters
with said recalled speech synthesis parameters corresponding to syllable
groupings of said sequence of phonological linguistic unit indicia; and
speech synthesis means connected to said concatenation means of said
control means for generating one or more audible words of human language
corresponding to said speech synthesis parameters.
14. A speech producing apparatus as claimed in claim 13, wherein:
said phonological linguistic unit indicia correspond to phonemes.
15. A speech producing apparatus as claimed in claim 13, wherein:
said phonological linguistic unit indicia correspond to allophones.
16. A speech producing apparatus as claimed in claim 13, wherein:
said phonological linguistic unit indicia correspond to diphones.
17. A speech producing apparatus as claimed in claim 13, wherein:
said control means further includes syllable classifying means for
classifying each of said syllables into one of a predetermined plurality
of syllable types depending upon the type of phonological linguistic unit
indicia therein, the classification of said syllables by said syllable
classifying means being dependent upon the presence or absence of unvoiced
initial consonant phonological linguistic unit indicia and the presence or
absence of unvoiced final consonant phonological linguistic unit indicia.
18. A speech producing apparatus as claimed in claim 17, wherein:
said syllable classifying means classifies said syllables into one of four
differing syllable types, firstly those having unvoiced initial consonant
phonological linguistic unit indicia and having unvoiced final consonant
phonological linguistic unit indicia, secondly those having unvoiced
initial consonant phonological linguistic unit indicia and having no
unvoiced final consonant phonological linguistic unit indicia, thirdly
those having no unvoiced initial consonant phonological linguistic unit
indicia and having unvoiced final consonant phonological linguistic unit
indicia, and fourthly those having no unvoiced initial consonant
phonological linguistic unit indicia and having no unvoiced final
consonant phonological linguistic unit indicia.
19. A speech producing apparatus as claimed in claim 13, wherein:
said duration pattern memory means stores a first duration control parameter
for initial consonant phonological linguistic unit indicia, a second
duration control parameter for vowel phonological linguistic unit indicia
and a third duration control parameter for final consonant phonological
linguistic unit indicia; and
said concatenation means combines recalled first duration control
parameters and recalled speech synthesis parameters corresponding to any
initial consonant phonological linguistic unit indicia, combines recalled
second duration control parameters and recalled speech synthesis
parameters corresponding to vowel phonological linguistic unit indicia,
and combines recalled third duration control parameters and recalled
speech synthesis parameters corresponding to any final consonant
phonological linguistic unit indicia for each syllable.
20. A speech producing apparatus as claimed in claim 13, wherein:
said input means comprises an optical bar code reader.
Description
BACKGROUND OF THE INVENTION
The present invention falls in the category of improvements to low data
rate speech apparatuses and may be employed in electronic learning aids,
electronic games, computers and small appliances. The problem of low data
rate speech apparatuses is to provide electronically produced synthetic
speech of modest quality while retaining a low data rate. This low data
rate is required in order to reduce the amount of memory needed to store
the desired speech or in order to reduce the amount of information which
must be transmitted in order to specify the desired speech.
Previous solutions to the problem of providing acceptable quality low data
rate speech have employed the technique of storing or transmitting data
indicative of the string of phonological linguistic units corresponding to
the desired speech. The speech synthesis apparatus would include a memory
for storing speech synthesis parameters corresponding to each of these
phonological linguistic units. Upon reception of the string of
phonological linguistic units, either by recall from a phrase memory or by
data transmission, the speech synthesis apparatus would successively
recall the speech synthesis parameters corresponding to each phonological
linguistic unit indicated, generate the speech corresponding to that unit
and repeat. This technique has the advantage that the phonetic memory thus
employed need only include the speech parameters for each phonological
linguistic unit once, although such phonological linguistic unit may be
employed many times in production of a single phrase. The amount of data
required to specify one of these phonological linguistic units from among
the phonetic library is much less than that required to specify the speech
parameters for generation of that particular phonological linguistic unit.
Therefore, whether the phrase specifying data is stored in an additional
memory or transmitted to the apparatus, an advantageous reduction in the
data rate is thus achieved.
This technique has a problem in that the naturalness and intelligibility of
the speech thus produced are of a low quality. By recall of speech
synthesis parameters corresponding to individual phonological linguistic
units occurring in the phrase to be spoken rather than storing the speech
synthesis parameters corresponding directly to that phrase, the natural
intonation contour of the speech is destroyed. This has the disadvantage
of reducing the naturalness and intelligibility of the speech. The
naturalness and intelligibility and hence the quality of the speech thus
produced may be increased by storing or transmitting an indication of the
original, natural intonation contour for intonation control upon
synthesis. Storage or transmission of an indication of the natural
intonation contour increases the data rate required for specification of a
particular phrase or word. Thus, it is highly advantageous to provide a
manner of specifying the natural intonation contour at a low bit rate. By
combining the technique of specifying phonological linguistic units
together with a coded form of the natural intonation contour, a low data
rate speech system may be achieved having the required speech quality.
SUMMARY OF THE INVENTION
The object of the present invention is to provide an improvement in the
quality of low data rate speech by providing an indication of the original
spoken duration. In the present invention a low data rate is achieved by
encoding spoken input as a series of phonological linguistic units such as
phonemes, allophones or diphones and transmitting indicia corresponding to
these phonological linguistic units. Ordinarily such a technique destroys
the original duration contour of the spoken input. Some of this original
spoken duration contour is recovered by the use of syllable duration
patterns which represent an approximation of the original duration
contour.
In accordance with the principles of the present invention, the spoken
input is analyzed to determine the phonological linguistic units and the
syllables which it includes. In addition the relation of the duration of
individual phonological linguistic units to a standard length for each
type is also determined. This measure of the relative length of the
phonological linguistic units, allophones in the preferred embodiment, is
matched against a set of duration patterns for syllables. Once the best
match is found then an indication of the syllable duration pattern is
transmitted together with allophone indicia. The synthesis apparatus then
combines this data in order to produce speech. The syllable duration
patterns enable the synthesis apparatus to provide an approximation of the
duration contour of the original spoken input without sacrificing a low
data rate. This is achieved because it requires much less data to identify
syllable duration patterns than to transmit the actual duration contour.
In the preferred embodiment each syllable is classified as one of four
different types. These syllable types are determined depending on the
presence or absence of unvoiced consonants in any initial or final
consonant cluster. In accordance with this embodiment, the syllable
duration pattern indicia is interpreted differently for the different
syllable types. This data can be further compressed by using the allophone
indicia corresponding to each syllable to convey some of the duration
information.
In the preferred embodiment of the present invention, each syllable
duration pattern specifies three different duration parameters. The
duration patterns specify the duration of any initial consonants of the
syllable, the vowel of the syllable and any final consonants of the
syllable. Upon synthesis the allophone indicia and the duration pattern
indicia are combined for control of the speech produced.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other objects of the present invention will become clear from the
detailed description of the invention which follows in conjunction with
the drawings in which:
FIG. 1 illustrates a block diagram of the system required to analyze the
pitch and duration patterns of specified speech in order to provide the
encoding in accordance with the present invention;
FIG. 2 illustrates an example of a natural pitch contour for a syllable
together with the corresponding pitch pattern;
FIG. 3 illustrates a flow chart of the steps required in the pitch pattern
analysis in accordance with the present invention;
FIG. 4 illustrates a flow chart of the steps required for the duration
pattern analysis in accordance with the present invention;
FIG. 5 illustrates an example of a speech synthesis system for production
of speech in accordance with the pitch and duration patterns of the
present invention;
FIGS. 6A and 6B illustrate a flow chart of the steps required for speech
synthesis based upon pitch and duration patterns in accordance with the
present invention;
FIG. 7 illustrates a flow chart corresponding to the steps necessary for
preprocessing in a text-to-speech embodiment of the present invention;
FIG. 8 illustrates the steps for preprocessing in an embodiment of the
present invention in which allophone, word boundary and prosody data are
transmitted to the speech synthesis apparatus;
FIG. 9 illustrates the steps required for determining the syllable type
from allophone data;
FIGS. 10A and 10B illustrate a flow chart of the steps required for
identifying syllable boundaries from allophone and word boundary data;
FIG. 11 is a flow chart illustrating the overall steps in an automatic
stress analysis technique;
FIGS. 12A and 12B illustrate a flow chart showing the assignment of delta
pitch and pitch pattern in the falling intonation mode, which is called as
a subroutine of the flow chart illustrated in FIG. 11;
FIGS. 13A and 13B illustrate a flow chart showing the assignment of delta
pitch and pitch pattern in a rising intonation mode, which is called as a
subroutine of the flow chart illustrated in FIG. 11;
FIG. 14 illustrates the steps for conversion of allophone data from word
mode to phrase mode in accordance with another embodiment of the present
invention; and
FIG. 15 illustrates the steps for conversion of allophone data specified in
a phrase mode into an individual word mode in accordance with a further
embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is in the field of low data rate speech, that is
speech in which the data required to specify a particular segment of human
speech is relatively low. Low data rate speech, if it is of acceptable
speech quality, has the advantage of requiring storage or transmission of
a relatively low amount of data for specifying a particular set of spoken
sounds. One previously employed method for providing low data rate speech
is to analyze speech and identify individual phonological linguistic units
within a string of speech. Each phonological linguistic unit represents a
humanly perceivable sub-element of speech. Once the string of phonological
linguistic units corresponding to a given segment of spoken source has been
identified, this low bit rate speech technique specifies the speech to be
produced by storing or sending a string of indicia corresponding to the
string of phonological linguistic units making up that segment of speech.
The specification of speech to be produced in this manner has a
disadvantage in that the natural intonation contour of the original spoken
input is destroyed. Therefore, the intonation contour of the reproduced
speech is wholly artificial. This results in an artificial intonation
contour which may be described as choppy or robot-like. The provision of
such an intonation contour may not be disadvantageous in some applications
such as toys or games. However, it is considered advantageous in most
applications to provide an approximation of the original intonation
contour. The present invention is concerned with techniques for encoding
the natural intonation contour for transmission with the phonological
linguistic unit indicia in order to specify a more natural-sounding
speech.
In the preferred embodiment of the present invention, the speech is
produced via linear predictive coding by a single integrated chip
designated TMS5220A manufactured by Texas Instruments Incorporated. In
linear predictive coding speech synthesis a mathematical model of the
human vocal tract is produced and individual features of the model vocal
tract are controlled by changing data called reflection coefficients. This
causes the mathematical model to change in analogy to the change in the
human vocal tract corresponding to movement of the lips, tongue, teeth and
throat. The TMS5220A integrated circuit speech synthesis device allows
independent control of speech pitch via control of the pitch period of an
excitation function. In addition, the TMS5220A speech synthesis device
permits independent control of speech duration by control of the amount of
time assigned for each data frame of speech produced. By independent
control of both the pitch and duration of the produced speech, a much more
natural intonation contour may be produced.
FIG. 1 illustrates the encoding apparatus 100 necessary for generating
speech parameter data corresponding to spoken or written text input in
accordance with the present invention. The output of the encoding
apparatus 100 includes a string of indicia corresponding to the
phonological linguistic units of the input, a string of pitch pattern
indicia selected from a pitch pattern library corresponding to the pitch
of the received input and a string of duration pattern indicia selected
from among a set of duration patterns within a duration pattern library
corresponding to a particular syllable type.
Encoding apparatus 100 includes two alternate input paths, the first via
microphone 101 for receiving spoken speech and the second via text input
114 for receiving inputs corresponding to printed text. The speech input
channel through microphone 101 will be first described. Microphone 101
receives spoken input and converts this into a varying electrical signal.
This varying electrical signal is applied to analog to digital converter
102. In accordance with known principles, analog to digital converter 102
converts the time varying electrical signal generated by microphone 101
into a set of digital codes indicative of the amplitude of the signal at
sampled times. This set of sampled digital code values is applied to LPC
analyzer 103. LPC analyzer 103 takes the digital data from analog to
digital converter 102 and converts it into linear predictive coding
parameters for speech synthesis. LPC analyzer 103 generates an indication
of energy, pitch and reflection coefficients for successive time samples
of the input data. This set of energy, pitch and reflection coefficient
parameters could be employed directly for speech synthesis by the
aforementioned TMS5220A speech synthesis device. However, in accordance
with the principles of the present invention, these speech parameters are
subjected to further analysis in order to reduce the amount of data
necessary to specify a particular portion of speech. The present invention
operates in accordance with the principles set forth in U.S. Pat. No.
4,398,059 entitled "Speech Producing System" by Kun-Shan Lin, Kathleen M.
Goudie, and Gene A. Frantz. In this patent, the speech to be produced is
broken up into component allophones. Allophones are variants of phonemes
which form the basic elements of spoken speech. Allophones differ from
phonemes in that allophones are variants of phonemes depending upon the
speech environment within which they occur. For example, the P in "Push"
and the P in "Spain" are different allophone variants of the phoneme P.
Thus, the use of allophones in speech synthesis enables better control of
the transition between adjacent phonological linguistic units. Table 1
lists the allophones employed in the system of the present invention
together with an example illustrating the pronunciation of that allophone.
The allophones listed in Table 1 are set forth in a variety of categories
which will be further explained below.
The energy, pitch and reflection coefficient data from LPC analyzer 103 is
applied to allophone recognizer 104. Allophone recognizer 104 matches the
received energy, pitch and reflection coefficient data to a set of
templates stored in allophone library 105. Allophone library 105 stores
energy, pitch and reflection coefficient parameters corresponding to each
of the allophones listed in Table 1. Allophone recognizer 104 compares the
energy, pitch and reflection coefficient data from LPC analyzer 103
corresponding to the actual speech input to the individual allophone
energy, pitch and reflection coefficient parameters stored within
allophone library 105. Allophone recognizer 104 then selects a string of
allophone indicia which best matches the received data corresponding to
the actual spoken speech. Allophone recognizer 104 also produces an
indication of the relationship of the duration of the received allophone
to the standardized duration of the corresponding allophone data stored in
allophone library 105.
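The template comparison performed by allophone recognizer 104 can be pictured as a nearest-template search followed by a duration ratio. The Python sketch below is illustrative only. The three-element parameter vectors, the squared-distance measure, the library contents and the names `ALLOPHONE_LIBRARY`, `best_allophone` and `relative_duration` are assumptions introduced for the example and are not taken from the specification.

```python
from typing import Dict, List, Tuple

# Hypothetical library: allophone name -> (parameter template, standard duration in frames).
# The parameter vector stands in for energy, pitch and reflection coefficient data.
ALLOPHONE_LIBRARY: Dict[str, Tuple[List[float], int]] = {
    "P1": ([0.2, 0.0, 0.5], 4),
    "AE": ([0.9, 0.6, -0.3], 8),
    "S": ([0.4, 0.0, 0.8], 6),
}


def distance(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance (an assumed measure of mismatch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_allophone(params: List[float]) -> str:
    """Return the library entry whose template is closest to the measured parameters."""
    return min(
        ALLOPHONE_LIBRARY,
        key=lambda name: distance(params, ALLOPHONE_LIBRARY[name][0]),
    )


def relative_duration(name: str, observed_frames: int) -> float:
    """Ratio of the observed duration to the standard duration stored in the library."""
    return observed_frames / ALLOPHONE_LIBRARY[name][1]
```

Under these assumptions, a value of `relative_duration` above 1.0 marks an allophone spoken longer than its standardized form, and a value below 1.0 marks one spoken shorter.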
The string of allophone indicia from allophone recognizer 104 is then
applied to syllable recognizer 106. Syllable recognizer 106 determines the
syllable boundaries from the string of allophone indicia from allophone
recognizer 104. In accordance with the principles of the present
invention, pitch and duration patterns are matched to syllables of the
speech to be produced. It has been found that the variation in pitch and
duration within smaller elements of speech is relatively minor and that
generation of pitch and duration patterns corresponding to syllables
results in an adequate speech quality. The output of syllable recognizer
106 determines the boundaries of the syllables within the spoken speech.
Speech encoding apparatus 100 may alternatively use a speech to syllable
recognizer (not shown) for determining the syllable boundaries within the
spoken speech input. A speech to syllable recognizer would receive the
energy, pitch and reflection coefficient parameters from LPC analyzer 103
and directly generate the syllable boundaries without the necessity for
determining allophones as an intermediate step. A further alternative
method for determining the syllable boundaries is hand editing (not
shown). This corresponds to a trained listener who inserts syllable
boundaries upon careful observation by listening to the input speech. In
any event, by this point the input speech has been analyzed to determine
the energy, pitch, reflection coefficients, allophones and syllable
boundaries.
This data, and in particular the pitch and syllable boundary data are
applied to pitch pattern recognizer 109. Pitch pattern recognizer 109
encodes the indication of the pitch of the original speech into one of a
predetermined set of pitch patterns for each syllable. An indication of
these syllable pitch patterns is stored within pitch pattern library 110.
Pitch pattern recognizer 109 compares the indication of the actual pitch
for each syllable with each of the pitch patterns stored within pitch
pattern library 110 and provides an indication of the best match. The
output of pitch pattern recognizer 109 is a pitch pattern code
corresponding to the best match for the pitch shape of each syllable to
the pitch patterns within pitch pattern library 110.
An indication of the pitch patterns stored within pitch pattern library 110
is shown in Table 2. Table 2 identifies each pitch pattern by an
identification number, an initial slope, a final slope and a turning
point. In accordance with the present invention, the pitch within each
syllable is permitted two differing slopes with an adjustable turning
point. It should be noted that the slope is restricted within the range of
±2 in the preferred embodiment. Also it should be noted that the
preferred speech synthesis device, the TMS5220A, permits independent
variation of the pitch period rather than of the pitch frequency. A
negative number indicates a reduction in pitch period and therefore an
increase in frequency while a positive number indicates an increase in
pitch period and therefore a decrease in frequency. In the preferred
embodiment, the turning point occurs either at 1/4 of the syllable
duration, 1/2 of the syllable duration or 3/4 of the syllable duration.
Note that no turning point has been listed for those pitch patterns in
which the initial slope and the final slope are identical. In such a case
there is no need to specify a turning point, since wherever such a turning
point occurs, the change in pitch period will be identical. With an
allowed group of five initial slopes, five final slopes and three turning
points, one would ordinarily expect a total of 75 possible pitch patterns.
However, because some of these patterns are redundant, particularly those
in which the initial and final slopes are identical, there are only the 53
variations listed. Because of this limitation upon the number of pitch
patterns, it is possible to completely specify a particular one of these
patterns with only six bits of data.
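The two-slope pitch pattern and the six-bit figure can be illustrated with a short Python sketch. It assumes that each slope is a change in pitch period per frame and that the turning point is a fraction of the syllable's frame count. The specification does not state these units, and the function names `pitch_period_contour` and `bits_needed` are hypothetical.

```python
import math
from typing import List


def pitch_period_contour(start_period: int, initial_slope: int, final_slope: int,
                         turning_point: float, n_frames: int) -> List[int]:
    """Build a per-frame pitch period contour from a two-slope pattern.

    Assumption: each slope is the change in pitch period per frame, and the
    turning point is the fraction of the syllable at which the slope switches.
    """
    if not (-2 <= initial_slope <= 2 and -2 <= final_slope <= 2):
        raise ValueError("slopes are restricted to the range -2..+2")
    turn_frame = int(turning_point * n_frames)
    contour = [start_period]
    for frame in range(1, n_frames):
        slope = initial_slope if frame <= turn_frame else final_slope
        contour.append(contour[-1] + slope)
    return contour


def bits_needed(n_patterns: int) -> int:
    """Smallest number of bits that can index n_patterns distinct codes."""
    return math.ceil(math.log2(n_patterns))
```

With these definitions, `bits_needed(53)` is 6 while `bits_needed(75)` is 7, which is why removing the redundant patterns saves one bit per syllable. A negative slope shortens the pitch period and therefore raises the pitch frequency.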
After the pitch pattern has been selected by pitch pattern recognizer 109,
the data is applied to syllable type recognizer 111. Syllable type
recognizer 111 classifies each syllable as one of four types depending
upon whether or not there are initial or final unvoiced consonant
clusters. Syllable type recognizer 111 examines the allophone indicia
making up each syllable and determines whether there are any consonant
allophone indicia prior to the vowel allophone indicia or any consonant
allophone indicia following the vowel allophone indicia which fall within
the class of unvoiced consonants. Based upon this determination, the
syllable is classified as one of four types.
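The four-way classification made by syllable type recognizer 111 can be sketched as follows. The symbol sets `VOWELS` and `UNVOICED_CONSONANTS` are hypothetical placeholders for the allophone categories of Table 1, and the numbering of the types as 1 through 4 follows the order in which the four cases are recited in the claims. That numbering is an assumption about the encoding, not a stated detail.

```python
from typing import List

# Hypothetical symbol sets; Table 1 of the specification defines the real allophones.
VOWELS = {"AE", "IY", "UH"}
UNVOICED_CONSONANTS = {"P", "T", "K", "S", "F"}


def syllable_type(allophones: List[str]) -> int:
    """Classify a syllable as type 1-4 by unvoiced consonants before and after the vowel."""
    vowel_positions = [i for i, a in enumerate(allophones) if a in VOWELS]
    if not vowel_positions:
        raise ValueError("syllable has no vowel allophone")
    first, last = vowel_positions[0], vowel_positions[-1]
    initial_unvoiced = any(a in UNVOICED_CONSONANTS for a in allophones[:first])
    final_unvoiced = any(a in UNVOICED_CONSONANTS for a in allophones[last + 1:])
    if initial_unvoiced and final_unvoiced:
        return 1  # unvoiced initial and unvoiced final consonants
    if initial_unvoiced:
        return 2  # unvoiced initial consonant only
    if final_unvoiced:
        return 3  # unvoiced final consonant only
    return 4      # no unvoiced consonants on either side
```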
Duration pattern recognizer 112 receives the syllable type data from
syllable type recognizer 111 as well as allophone and duration data. In
this regard it should be understood that each allophone may be pronounced
in a manner either longer or shorter than the standardized form stored
within allophone library 105. As previously noted, allophone recognizer
104 generates data corresponding to a comparison of the duration of the
actual allophone data received from LPC analyzer 103 and the standardized
allophone data stored within allophone library 105. Based upon this
comparison, an allophone duration parameter is derived. The aforementioned
TMS5220A speech synthesis device enables production of speech at one of
four differing rates covering a four to one time range. Duration pattern
library 113 stores a plurality of duration patterns for each of the
syllable types determined by syllable type recognizer 111. Each duration
pattern within duration pattern library 113 includes a first duration
control parameter for any initial consonant allophones, a second duration
control parameter for the vowel allophones and a third duration control
parameter for any final consonant allophones. The duration pattern
recognizer 112 compares the actual duration of speaking for the particular
allophone generated by allophone recognizer 104 with each of the duration
patterns stored within duration pattern library 113 for the corresponding
syllable type. Duration pattern recognizer 112 then determines the best
match between the actual duration of the spoken speech and the set of
duration patterns corresponding to that syllable type. This best match
duration pattern is then output by duration pattern recognizer 112. The
output of duration pattern recognizer 112 comprises the allophone indicia
corresponding to the string of allophones within the spoken input, and the
pitch and duration patterns corresponding to each syllable of the spoken
input. In addition, duration pattern recognizer 112 may optionally also
output some indication of the syllable boundaries.
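The best match selection described above might be sketched as follows. The mismatch measure (a sum of absolute differences), the numeric duration values and the library contents are all assumptions; the text does not state how the match is scored.

```python
def best_duration_pattern(measured, patterns):
    """Return the index of the stored pattern closest to the measured durations.

    measured and each pattern are (initial consonants, vowel, final consonants)
    duration control parameters. The scoring rule is an assumed one.
    """
    def mismatch(pattern):
        return sum(abs(m - p) for m, p in zip(measured, pattern))

    return min(range(len(patterns)), key=lambda i: mismatch(patterns[i]))


# Hypothetical library: duration patterns keyed by syllable type.
duration_library = {
    3: [(1.0, 1.0, 1.0), (1.0, 2.0, 1.0), (2.0, 2.0, 1.0)],
}

measured = (1.1, 1.9, 1.0)   # hypothetical measured parameters
index = best_duration_pattern(measured, duration_library[3])
print(index, duration_library[3][index])  # 1 (1.0, 2.0, 1.0)
```

Only the patterns stored for the syllable's own type are searched, which keeps the number of comparisons per syllable small.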
Elements 114 and 115 illustrate an alternative input to the speech encoding
apparatus 100. Text input device 114 receives the input of data
corresponding to ordinary printed text in plain language. This text input
is applied to text to allophone translator 115 which generates a string of
allophone indicia which corresponds to the printed text input. Such a text
to allophone conversion may take place in accordance with copending U.S.
Patent Application Ser. No. 240,694 filed Mar. 5, 1981. As an optional
further step, hand allophone editing 116 permits a trained operator to
edit the allophones from text to allophone translator 115 in order to
optimize the allophone string for the desired text input. The allophone
string corresponding to the text input is then applied to syllable
recognizer 106 where this data is processed as described above.
FIG. 2 illustrates an example of hypothetical syllable pitch data together
with the corresponding best match pitch pattern. Pitch track 200
corresponds to the actual primary pitch of the hypothetical syllable.
During the first part of the syllable 201, the speech is unvoiced,
therefore the pitch is set to 0. During a second portion 202, the
frequency begins at a level and gradually declines. During a middle
portion 203, the frequency gradually rises to a peak at 204 and then
declines. During a final portion 205, the decline has a change in slope
and becomes more pronounced.
The actual pitch track 200 is approximated by one of the plurality of
stored pitch patterns 210. Note that pitch pattern 210 has a first portion 211
having an initial upward slope matching the initial portions of speech
segment 203. Pitch pattern 210 then has a falling final slope 212 which is
a best fit match to the part of speech segment 203 following a peak 204 as
well as the declining frequency portion 205. Note that the change between
the initial slope 211 and the final slope 212 occurs at a time 213, which
in this case is 1/2 the duration of the syllable. Upon resynthesis of the
syllable represented by pitch track 200, the pitch pattern 210 is
employed.
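A pitch pattern of this kind, an initial slope, a final slope and a turning point, can be expanded into a frame-by-frame contour as sketched below. The function name, the per-frame slope units and the example values are assumptions. Slopes are expressed in pitch period, so the rising then falling frequency of FIG. 2 corresponds to a negative initial slope and a positive final slope.

```python
def pattern_contour(initial_slope, final_slope, turning_point, n_frames, start=0.0):
    """Pitch-period offset per frame for a two-segment pitch pattern.

    turning_point is a fraction of the syllable duration (1/4, 1/2 or 3/4).
    Slopes are assumed to be in pitch-period units per frame.
    """
    turn_frame = turning_point * (n_frames - 1)
    contour = []
    for i in range(n_frames):
        if i <= turn_frame:
            value = start + initial_slope * i
        else:
            value = (start + initial_slope * turn_frame
                     + final_slope * (i - turn_frame))
        contour.append(value)
    return contour


# Period falls (frequency rises) until mid-syllable, then rises more steeply.
print(pattern_contour(-1, 2, 0.5, 5))  # [0.0, -1.0, -2.0, 0.0, 2.0]
```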
FIG. 3 illustrates flow chart 300 showing the steps required for
determination of the best pitch pattern for a particular syllable. Pitch
pattern recognizer 109 preferably performs the steps illustrated in flow
chart 300 in order to generate an optimal pitch pattern for each syllable.
In the preferred embodiment, flow chart 300 is performed by a programmed
general purpose digital computer. It should be understood that flow chart
300 does not illustrate the exact details of the manner for programming
such a general purpose digital computer, but rather only the general
outlines of this programming. However, it is submitted that one skilled in
the art of programming general purpose digital computers would be able to
practice this aspect of the present invention from flow chart 300 once
the design choice of the particular general
purpose digital computer and the particular applications language has been
made. Therefore, the exact operation of the apparatus performing the steps
listed in flow chart 300 will not be described in greater detail.
Flow chart 300 starts by reading the speech data (processing block 301)
generated by LPC analyzer 103. Program 300 next reads the syllable
boundaries (processing block 302) generated by syllable recognizer 106.
Program 300 next locates the pitch data corresponding to a particular
syllable (processing block 303). Program 300 then locates the segments of
data (known as frames) which correspond to voiced speech (processing block
304). In the hypothetical example illustrated in FIG. 2, the syllable
includes eight frames, a single initial unvoiced frame and seven following
voiced frames. Because speech primary pitch corresponds only to voiced
speech, those unvoiced portions of the speech are omitted. It is well
known that each syllable includes at least one vowel which is voiced and
which may have initial and/or final voiced consonants. The hypothetical
example illustrated in FIG. 2 includes an unvoiced portion 201 which
corresponds to an unvoiced initial allophone. The remaining portions of
the syllable illustrated in FIG. 2 are voiced.
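Selecting the voiced frames (processing block 304) can be sketched as below, assuming, as in FIG. 2, that an unvoiced frame is marked by a pitch value of 0. The pitch period values are invented for illustration.

```python
def voiced_frames(pitch_per_frame):
    """Keep only frames with a non-zero pitch value (assumed voiced marker)."""
    return [p for p in pitch_per_frame if p != 0]


# Hypothetical eight-frame syllable: one unvoiced frame, then seven voiced.
syllable_pitch = [0, 10, 9, 8, 9, 10, 11, 12]
print(voiced_frames(syllable_pitch))  # [10, 9, 8, 9, 10, 11, 12]
```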
The comparison of the pitch data to the respective pitch shapes occurs in
four different loops. Program 300 first tests to determine whether or not
the program is in the first loop (decision block 305). If this is true,
then the comparison of pitch data to pitch shapes is made on all voiced
frames (processing block 306). This comparison is made in a loop including
processing blocks 307-309 and decision block 310. Processing block 307
recalls the next pitch shape. A figure of merit corresponding to the
amount of similarity between the actual pitch data and the pitch shape is
calculated (processing block 308). This figure of merit for the particular
pitch shape is then stored in correspondence to that pitch shape
(processing block 309). Program 300 then tests to determine whether or not
the last pitch shape in the set of pitch shapes has been compared
(decision block 310). In the event that the last pitch shape has not been
compared then program 300 returns to processing block 307 to repeat this
loop. In the event that the last pitch shape within the set of pitch
shapes has been compared, then program 300 returns to decision block 305.
Upon subsequent loops, program 300 tests to determine whether or not this
is the second loop (decision block 311). If this is the second loop,
program 300 causes the comparisons to be made based upon the actual pitch
data omitting the first frame of pitch data (processing block 312).
Similarly, if it is the third loop as determined by decision block 313,
then the comparison is made omitting the last frame of pitch data
(processing block 314). Lastly, upon the fourth loop as determined by
decision block 315, the pitch shape comparison is made with the pitch data
by omitting both the first and the last frames (processing block 316).
After passing through each of the four above-mentioned loops, program 300
locates the best figure of merit previously calculated (processing block
317). Program 300 then identifies the pitch shape which corresponds to
this best figure of merit (processing block 318). At this point, program
300 is exited (exit block 319).
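The four passes of flow chart 300 can be sketched as follows. The figure of merit used here (negative mean squared error after aligning the mean levels) is an assumption, since the text does not define it; the shape representation and the example data are likewise hypothetical.

```python
def contour(shape, n):
    """Two-segment pitch-period contour for shape = (initial, final, turn)."""
    initial, final, turn = shape
    t = turn * (n - 1)
    return [initial * min(i, t) + final * max(i - t, 0.0) for i in range(n)]


def figure_of_merit(pitch, shape):
    """Assumed merit: negative mean squared error; higher is better."""
    model = contour(shape, len(pitch))
    offset = (sum(pitch) - sum(model)) / len(pitch)   # align mean levels
    error = sum((p - (m + offset)) ** 2 for p, m in zip(pitch, model))
    return -error / len(pitch)


def best_pitch_shape(voiced_pitch, shapes):
    """Score every shape on four frame selections and return the best shape."""
    selections = [
        voiced_pitch,        # first loop: all voiced frames
        voiced_pitch[1:],    # second loop: omit the first frame
        voiced_pitch[:-1],   # third loop: omit the last frame
        voiced_pitch[1:-1],  # fourth loop: omit first and last frames
    ]
    scored = [
        (figure_of_merit(frames, shape), shape)
        for frames in selections if len(frames) >= 2
        for shape in shapes
    ]
    if not scored:
        raise ValueError("need at least two voiced frames")
    return max(scored, key=lambda item: item[0])[1]


shapes = [(-1, 1, 0.5), (1, -1, 0.5), (0, 0, 0.5)]   # hypothetical set
print(best_pitch_shape([10, 9, 8, 9, 10], shapes))   # (-1, 1, 0.5)
```

The error is averaged per frame so that passes with fewer frames are not favored merely for being shorter; this normalization is a design choice of the sketch, not something the text specifies.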
FIG. 4 illustrates program 400 which shows the general steps for performing
the duration pattern selection. As explained above in conjunction with
FIG. 3, in the preferred embodiment the procedures illustrated in program
400 are executed by a general purpose digital computer. Although program
400 does not describe the detailed steps required for any particular
general purpose computer to perform this procedure, it is believed that
this description is sufficient to enable one skilled in the art to
properly program a general purpose digital computer once the design choice
of that computer and that language to be employed has been made.
Program 400 begins by reading the speech data (processing block 401).
Program 400 next reads the allophone durations (processing block 402). The
allophone durations are generated by allophone recognizer 104 which
compares the standard allophone length stored within allophone library 105
with the actual length of the received allophone. Program 400 next reads
the syllable boundaries (processing block 403). Program 400 next
determines the syllable type (processing block 404). This syllable type
determination will be more fully described below in conjunction with FIG.
9.
Program 400 next enters a loop for comparison of the allophone durations
with the stored duration patterns. Program 400 first recalls the next
duration pattern corresponding to the previously determined syllable type
(processing block 405). Program 400 then calculates a figure of merit
based upon the comparison of the actual allophone durations with the