WikiPatents - Community Patent Review
Low data rate speech encoding employing syllable pitch patterns    
United States Patent     4802223
Link to this page     http://www.wikipatents.com/4802223.html
Inventor(s)     Lin; Kun-Shan (Lubbock, TX); Reimer; Jay B. (Lubbock, TX)
Abstract
The present invention is a speech encoding technique useful in low data rate speech. Spoken input is analyzed to determine its basic phonological linguistic units and syllables. The pitch track for each syllable is compared with each of a predetermined set of pitch patterns. A pitch pattern forming the best match to the actual pitch track is selected for each syllable. Phonological linguistic unit indicia and pitch pattern indicia are transmitted to a speech synthesis apparatus. This synthesis apparatus matches the pitch pattern indicia to syllable groupings of the phonological linguistic unit indicia. During speech synthesis, sounds are produced corresponding to the phonological linguistic unit indicia with their primary pitch controlled by the pitch pattern indicia of the corresponding syllable. This achieves some measure of approximation to the primary pitch of the original spoken input at a low data rate. In the preferred embodiment, each pitch pattern includes an initial pitch slope, which may be zero indicating no change in pitch, a final pitch slope and a turning point between these two slopes.

Owner/Assignee     Texas Instruments Incorporated (Dallas, TX)
Publication Date     January 31, 1989
Application Number     06/548,262
Filing Date     November 3, 1983
US Classification     704/207
Int'l Classification     G10L 005/00
Examiner     Kemeny; Emanuel S.
Attorney/Law Firm     Hiller; William E., Merrett; N. Rhys, Sharp; Melvin
USPTO Field of Search     381/51 381/52 381/53 364/513.5

 References

 U.S. References

Patent No.     Inventor     US Class     Date
4489433     Suehiro     704/221     Dec 1984
4398059     Lin     704/267     Aug 1983
3892919     Ichikawa     704/267     Jul 1975

 Claims

What is claimed is:

1. A speech encoding apparatus comprising:

input means for receiving speech including one or more words of human language;

analysis means connected to said input means for analyzing said received speech, generating a sequence of phonological linguistic unit indicia corresponding to said received speech, grouping said phonological linguistic unit indicia into syllables, and generating pitch track data corresponding to said received speech;

pitch pattern memory means storing a plurality of predetermined pitch patterns;

pitch pattern recognizer means connected to said analysis means and to said pitch pattern memory means for selecting a pitch pattern from said plurality of predetermined pitch patterns for each syllable grouping of phonological linguistic unit indicia as generated by said analysis means, said pitch pattern being selected in dependence upon said pitch track data corresponding to each syllable grouping of phonological linguistic unit indicia; and

transmission means connected to said analysis means and said pitch pattern recognizer means for transmitting said phonological linguistic unit indicia and pitch pattern indicia corresponding to said selected pitch patterns.

2. A speech encoding apparatus as claimed in claim 1, wherein:

said analysis means generates phonological linguistic unit indicia corresponding to phonemes of said received speech.

3. A speech encoding apparatus as claimed in claim 1, wherein:

said analysis means generates phonological linguistic unit indicia corresponding to allophones of said received speech.

4. A speech encoding apparatus as claimed in claim 1, wherein:

said analysis means generates phonological linguistic unit indicia corresponding to diphones of said received speech.

5. A speech encoding apparatus as claimed in claim 1, wherein:

said pitch pattern recognizer means includes comparison means connected to said analysis means and said pitch pattern memory means for comparing the pitch track data for each syllable grouping of phonological linguistic unit indicia with each of said pitch patterns of said pitch pattern memory means and generating a measure of the similarity of said pitch track data to each of said pitch patterns, and selection means for selecting the pitch pattern from said plurality of predetermined pitch patterns having the best measure of similarity for each syllable grouping of phonological linguistic unit indicia.

6. A speech encoding apparatus as claimed in claim 5, wherein:

said analysis means generates said pitch track data in a plurality of frames of data for each syllable; and

said comparison means further includes first recomparison means for comparing said pitch track data omitting the first frame of data for each syllable with each of said pitch patterns and generating a measure of similarity, and second recomparison means for comparing said pitch track data omitting the last frame of data for each syllable with each of said pitch patterns and generating a measure of similarity.

7. A speech encoding apparatus as claimed in claim 6, wherein:

said comparison means further includes third recomparison means for comparing said pitch track data omitting the first and last frames of data for each syllable with each of said pitch patterns and generating a measure of similarity.

8. A speech encoding apparatus as claimed in claim 5, wherein:

said pitch pattern memory means in storing said plurality of predetermined pitch patterns includes therein a plurality of predetermined pitch slopes from which an initial pitch slope, a final pitch slope, and a turning point may be selected for each of said plurality of pitch patterns.

9. A speech encoding apparatus as claimed in claim 1, wherein:

said transmission means further includes means for transmitting an indication of the grouping of said phonological linguistic unit indicia into syllables.

10. A speech encoding apparatus as claimed in claim 1, wherein:

said transmission means comprises machine readable optical bar codes.
 Description

BACKGROUND OF THE INVENTION

The present invention falls in the category of improvements to low data rate speech apparatuses and may be employed in electronic learning aids, electronic games, computers and small appliances. The problem of low data rate speech apparatuses is to provide electronically produced synthetic speech of modest quality while retaining a low data rate. This low data rate is required in order to reduce the amount of memory needed to store the desired speech or in order to reduce the amount of information which must be transmitted in order to specify the desired speech.

Previous solutions to the problem of providing acceptable quality low data rate speech have employed the technique of storing or transmitting data indicative of the string of phonological linguistic units corresponding to the desired speech. The speech synthesis apparatus would include a memory for storing speech synthesis parameters corresponding to each of these phonological linguistic units. Upon reception of the string of phonological linguistic units, either by recall from a phrase memory or by data transmission, the speech synthesis apparatus would successively recall the speech synthesis parameters corresponding to each phonological linguistic unit indicated, generate the speech corresponding to that unit and repeat. This technique has the advantage that the phonetic memory thus employed need only include the speech parameters for each phonological linguistic unit once, although such phonological linguistic unit may be employed many times in production of a single phrase. The amount of data required to specify one of these phonological linguistic units from among the phonetic library is much less than that required to specify the speech parameters for generation of that particular phonological linguistic unit. Therefore, whether the phrase specifying data is stored in an additional memory or transmitted to the apparatus, an advantageous reduction in the data rate is thus achieved.

This technique has a problem in that the naturalness and intelligibility of the speech thus produced is of a low quality. By recall of speech synthesis parameters corresponding to individual phonological linguistic units occurring in the phrase to be spoken rather than storing the speech synthesis parameters corresponding directly to that phrase, the natural intonation contour of the speech is destroyed. This has the disadvantage of reducing the naturalness and intelligibility of the speech. The naturalness and intelligibility and hence the quality of the speech thus produced may be increased by storing or transmitting an indication of the original, natural intonation contour for intonation control upon synthesis. Storage or transmission of an indication of the natural intonation contour increases the data rate required for specification of a particular phrase or word. Thus, it is highly advantageous to provide a manner of specifying the natural intonation contour at a low bit rate. By combining the technique of specifying phonological linguistic units together with a coded form of the natural intonation contour, a low data rate speech system may be achieved having the required speech quality.

SUMMARY OF THE INVENTION

The object of the present invention is to provide an improvement in the quality of low data rate speech by providing an indication of the original spoken pitch track. In the present invention, a low data rate is achieved by encoding spoken input as a series of phonological linguistic units such as phonemes, allophones or diphones and transmitting indicia corresponding to these phonological linguistic units. Ordinarily such a technique destroys the original pitch contour of the spoken input. Some of this original spoken pitch contour is recovered by the use of syllable pitch patterns which represent an approximation of the original pitch contour.

In accordance with the principles of the present invention, the spoken input is analyzed to determine the phonological linguistic units and the syllables which it includes. In addition, the pitch track for each syllable is also determined. This measure of the pitch track of the syllables of the spoken input is compared with a predetermined set of pitch patterns. Once the best match is found, an indication of the syllable pitch pattern is transmitted together with the phonological linguistic unit indicia. The synthesis apparatus then combines this data in order to produce speech. The syllable pitch patterns enable the synthesis apparatus to provide an approximation of the pitch contour of the original spoken input without sacrificing a low data rate. This is achieved because it requires much less data to identify syllable pitch patterns than to transmit the actual pitch contour.

In the preferred embodiment the actual pitch contour is compared to the set of pitch patterns in more than one manner. Firstly the comparison is made based upon all voiced frames in the syllable. Then a recomparison is made while omitting the first voiced frame of the syllable. A second recomparison is made omitting the final voiced frame. Lastly a third recomparison is made omitting both the first and last voiced frames. The pitch pattern with the best match in any of these comparisons is the pattern selected for transmission with the phonological linguistic unit indicia.
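
The four comparisons described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patent's implementation: patterns are represented as hypothetical (initial slope, final slope, turning point) tuples, the track is a list of per-frame pitch period values for the voiced frames of one syllable, and the similarity measure (mean squared error after removing the mean level) is an assumption, since the text does not define one.

```python
def pattern_contour(pattern, n):
    """Relative contour of n frames for (initial slope, final slope, turning point)."""
    initial, final, turn = pattern
    turn_frame = round(turn * (n - 1))
    contour = [0.0]
    for i in range(1, n):
        slope = initial if i <= turn_frame else final
        contour.append(contour[-1] + slope)
    return contour


def mismatch(track, pattern):
    """Mean squared error between track and pattern, ignoring the mean level."""
    n = len(track)
    contour = pattern_contour(pattern, n)
    t_mean = sum(track) / n
    c_mean = sum(contour) / n
    return sum(((t - t_mean) - (c - c_mean)) ** 2
               for t, c in zip(track, contour)) / n


def best_pattern(voiced_track, patterns):
    """Return (pattern index, error) of the best match over the four comparisons."""
    variants = [
        voiced_track,        # all voiced frames
        voiced_track[1:],    # first voiced frame omitted
        voiced_track[:-1],   # last voiced frame omitted
        voiced_track[1:-1],  # first and last voiced frames omitted
    ]
    best = None
    for frames in variants:
        if len(frames) < 2:
            continue  # too short to carry a slope
        for index, pattern in enumerate(patterns):
            error = mismatch(frames, pattern)
            if best is None or error < best[1]:
                best = (index, error)
    return best
```

With a track such as `[50, 40, 41, 42, 41, 40]`, whose first frame is an outlier, the recomparison that omits the first frame lets a rise-then-fall pattern match almost exactly, which is the motivation for trying the trimmed variants.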

In the preferred embodiment of the present invention each syllable pitch pattern specifies three different pitch parameters. The pitch patterns specify an initial pitch slope for control of the change of pitch during an initial portion of the syllable and a final pitch slope for control of the change in pitch for the final portion of the syllable. Finally the pitch pattern specifies the turning point or place within the syllable where the slope changes from the specified initial slope to the final slope. With this specification of pitch the speech produced is controlled to provide a greater quality of speech.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects of the present invention will become clear from the detailed description of the invention which follows in conjunction with the drawings in which:

FIG. 1 illustrates a block diagram of the system required to analyze the pitch and duration patterns of specified speech in order to provide the encoding in accordance with the present invention;

FIG. 2 illustrates an example of a natural pitch contour for a syllable together with the corresponding pitch pattern;

FIG. 3 illustrates a flow chart of the steps required in the pitch pattern analysis in accordance with the present invention;

FIG. 4 illustrates a flow chart of the steps required for the duration pattern analysis in accordance with the present invention;

FIG. 5 illustrates an example of a speech synthesis system for production of speech in accordance with the pitch and duration patterns of the present invention;

FIGS. 6A and 6B illustrate a flow chart of the steps required for speech synthesis based upon pitch and duration patterns in accordance with the present invention;

FIG. 7 illustrates a flow chart corresponding to the steps necessary for preprocessing in a text-to-speech embodiment of the present invention;

FIG. 8 illustrates the steps for preprocessing in an embodiment of the present invention in which allophone, word boundary and prosody data are transmitted to the speech synthesis apparatus;

FIG. 9 illustrates the steps required for determining the syllable type from allophone data;

FIGS. 10A and 10B illustrate a flow chart of the steps required for identifying syllable boundaries from allophone and word boundary data;

FIG. 11 is a flow chart illustrating the overall steps in an automatic stress analysis technique;

FIGS. 12A and 12B illustrate a flow chart showing the assignment of delta pitch and pitch pattern in the falling intonation mode, which is called as a subroutine of the flow chart illustrated in FIG. 11;

FIGS. 13A and 13B illustrate a flow chart showing the assignment of delta pitch and pitch pattern in a rising intonation mode, which is called as a subroutine of the flow chart illustrated in FIG. 11;

FIG. 14 illustrates the steps for conversion of allophone data from word mode to phrase mode in accordance with another embodiment of the present invention; and

FIG. 15 illustrates the steps for conversion of allophone data specified in a phrase mode into an individual word mode in accordance with a further embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is in the field of low data rate speech, that is speech in which the data required to specify a particular segment of human speech is relatively low. Low data rate speech, if it is of acceptable speech quality, has the advantage of requiring storage or transmission of a relatively low amount of data for specifying a particular set of spoken sounds. One previously employed method for providing low data rate speech is to analyze speech and identify individual phonological linguistic units within a string of speech. Each phonological linguistic unit represents a humanly perceivable sub-element of speech. Once the string of phonological linguistic units corresponding to a given segment of spoken source has been identified, this low bit rate speech technique specifies the speech to be produced by storing or sending a string of indicia corresponding to the string of phonological linguistic units making up that segment of speech.

The specification of speech to be produced in this manner has a disadvantage in that the natural intonation contour of the original spoken input is destroyed. Therefore, the intonation contour of the reproduced speech is wholly artificial. This results in an artificial intonation contour which may be described as choppy or robot like. The provision of such an intonation contour may not be disadvantageous in some applications such as toys or games. However, it is considered advantageous in most applications to provide an approximation of the original intonation contour. The present invention is concerned with techniques for encoding the natural intonation contour for transmission with the phonological linguistic unit indicia in order to specify a more natural-sounding speech.

In the preferred embodiment of the present invention, the speech is produced via linear predictive coding by a single integrated chip designated TMS5220A manufactured by Texas Instruments Incorporated. In linear predictive coding speech synthesis a mathematical model of the human vocal tract is produced and individual features of the model vocal tract are controlled by changing data called reflection coefficients. This causes the mathematical model to change in analogy to the change in the human vocal tract corresponding to movement of the lips, tongue, teeth and throat. The TMS5220A integrated circuit speech synthesis device allows independent control of speech pitch via control of the pitch period of an excitation function. In addition, the TMS5220A speech synthesis device permits independent control of speech duration by control of the amount of time assigned for each data frame of speech produced. By independent control of both the pitch and duration of the produced speech, a much more natural intonation contour may be produced.
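
As a rough illustration of the synthesis model described above, the sketch below drives an all-pole lattice filter, controlled by reflection coefficients, with an impulse train whose spacing is the pitch period. It is a generic textbook structure and does not reproduce the TMS5220A's internals; the function names and the sign convention for the reflection coefficients are assumptions.

```python
def impulse_train(pitch_period, n_samples):
    """Voiced excitation: one unit impulse every pitch_period samples."""
    return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n_samples)]


def lattice_synthesize(excitation, k):
    """All-pole lattice filter; k[i-1] is the reflection coefficient of stage i."""
    p = len(k)
    b = [0.0] * (p + 1)  # b[i] holds the backward signal of stage i, one sample ago
    output = []
    for e in excitation:
        f = e
        new_b = [0.0] * (p + 1)
        for i in range(p, 0, -1):
            f = f - k[i - 1] * b[i - 1]          # forward signal of stage i-1
            new_b[i] = k[i - 1] * f + b[i - 1]   # backward signal of stage i
        new_b[0] = f
        b = new_b
        output.append(f)
    return output
```

Changing `pitch_period` alters only the excitation, while changing `k` alters only the modeled vocal tract, which is the independence the paragraph relies on.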

FIG. 1 illustrates the encoding apparatus 100 necessary for generating speech parameter data corresponding to spoken or written text input in accordance with the present invention. The output of the encoding apparatus 100 includes a string of indicia corresponding to the phonological linguistic units of the input, a string of pitch pattern indicia selected from a pitch pattern library corresponding to the pitch of the received input and a string of duration pattern indicia selected from among a set of duration patterns within a duration pattern library corresponding to a particular syllable type.

Encoding apparatus 100 includes two alternate input paths, the first via microphone 101 for receiving spoken speech and the second via text input 114 for receiving inputs corresponding to printed text. The speech input channel through microphone 101 will be first described. Microphone 101 receives spoken input and converts this into a varying electrical signal. This varying electrical signal is applied to analog to digital converter 102. In accordance with known principles, analog to digital converter 102 converts the time varying electrical signal generated by microphone 101 into a set of digital codes indicative of the amplitude of the signal at sampled times. This set of sampled digital code values is applied to LPC analyzer 103. LPC analyzer 103 takes the digital data from analog to digital converter 102 and converts it into linear predictive coding parameters for speech synthesis. LPC analyzer 103 generates an indication of energy, pitch and reflection coefficients for successive time samples of the input data. This set of energy, pitch and reflection coefficient parameters could be employed directly for speech synthesis by the aforementioned TMS5220A speech synthesis device. However, in accordance with the principles of the present invention, these speech parameters are subjected to further analysis in order to reduce the amount of data necessary to specify a particular portion of speech. The present invention operates in accordance with the principles set forth in U.S. Pat. No. 4,398,059 entitled "Speech Producing System" by Kun-Shan Lin, Kathleen M. Goudie, and Gene A. Frantz. In this patent, the speech to be produced is broken up into component allophones. Allophones are variants of phonemes which form the basic elements of spoken speech. Allophones differ from phonemes in that allophones are variants of phonemes depending upon the speech environment within which they occur. For example, the P in "Push" and the P in "Spain" are different allophone variants of the phoneme P. Thus, the use of allophones in speech synthesis enables better control of the transition between adjacent phonological linguistic units. Table 1 lists the allophones employed in the system of the present invention together with an example illustrating the pronunciation of that allophone. The allophones listed in Table 1 are set forth in a variety of categories which will be further explained below.
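
The text does not state how LPC analyzer 103 derives its parameters. One common textbook approach, assumed here purely for illustration, computes the autocorrelation of each frame and applies the Levinson-Durbin recursion to obtain reflection coefficients, with the zero-lag autocorrelation serving as the frame energy. The function names are hypothetical.

```python
def autocorrelation(samples, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag; lag 0 is the frame energy."""
    n = len(samples)
    return [sum(samples[i] * samples[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def reflection_coefficients(r, order):
    """Levinson-Durbin recursion: return (reflection coefficients, residual energy)."""
    if r[0] <= 0:
        raise ValueError("frame has no energy")
    a = [1.0] + [0.0] * order   # prediction-error filter coefficients
    error = r[0]
    ks = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error
        ks.append(k)
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k
    return ks, error
```

For an autocorrelation sequence `[1.0, 0.5, 0.25]`, which a first-order process would produce, the recursion yields a first coefficient of -0.5 and a second of zero, showing that one stage already captures the signal.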

The energy, pitch and reflection coefficient data from LPC analyzer 103 is applied to allophone recognizer 104. Allophone recognizer 104 matches the received energy, pitch and reflection coefficient data to a set of templates stored in allophone library 105. Allophone library 105 stores energy, pitch and reflection coefficient parameters corresponding to each of the allophones listed in Table 1. Allophone recognizer 104 compares the energy, pitch and reflection coefficient data from LPC analyzer 103 corresponding to the actual speech input to the individual allophone energy, pitch and reflection coefficient parameters stored within allophone library 105. Allophone recognizer 104 then selects a string of allophone indicia which best matches the received data corresponding to the actual spoken speech. Allophone recognizer 104 also produces an indication of the relationship of the duration of the received allophone to the standardized duration of the corresponding allophone data stored in allophone library 105.
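
The template matching performed by allophone recognizer 104 can be pictured with a deliberately simplified sketch. It assumes each allophone template is a single fixed-length parameter vector and that similarity is plain Euclidean distance; a practical recognizer would also need segmentation and time alignment, which the text does not detail. The library contents and names are hypothetical.

```python
import math


def nearest_allophone(frame, library):
    """Return the name of the library template closest to the observed frame."""
    return min(library, key=lambda name: math.dist(frame, library[name]))


def duration_ratio(observed_frames, standard_frames):
    """Observed allophone duration relative to the library's standardized duration."""
    return observed_frames / standard_frames


# Hypothetical templates: (energy, pitch, first reflection coefficient)
library = {
    "AA": (10.0, 40.0, 0.5),
    "S": (3.0, 0.0, -0.2),
}
```

Here `nearest_allophone((9.0, 42.0, 0.4), library)` selects `"AA"`, and `duration_ratio(12, 8)` reports that the spoken allophone lasted 1.5 times its standardized length.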

The string of allophone indicia from allophone recognizer 104 is then applied to syllable recognizer 106. Syllable recognizer 106 determines the syllable boundaries from the string of allophone indicia from allophone recognizer 104. In accordance with the principles of the present invention, pitch and duration patterns are matched to syllables of the speech to be produced. It has been found that the variation in pitch and duration within smaller elements of speech is relatively minor and that generation of pitch and duration patterns corresponding to syllables results in an adequate speech quality. The output of syllable recognizer 106 determines the boundaries of the syllables within the spoken speech.

Speech encoding apparatus 100 may alternatively use a speech to syllable recognizer (not shown) for determining the syllable boundaries within the spoken speech input. A speech to syllable recognizer would receive the energy, pitch and reflection coefficient parameters from LPC analyzer 103 and directly generate the syllable boundaries without the necessity for determining allophones as an intermediate step. A further alternative method for determining the syllable boundaries is hand editing (not shown) 108. This corresponds to a trained listener who inserts syllable boundaries upon careful observation by listening to the input speech. In any event, by this point the input speech has been analyzed to determine the energy, pitch, reflection coefficients, allophones and syllable boundaries.

This data, and in particular the pitch and syllable boundary data, is applied to pitch pattern recognizer 109. Pitch pattern recognizer 109 encodes the indication of the pitch of the original speech into one of a predetermined set of pitch patterns for each syllable. An indication of these syllable pitch patterns is stored within pitch pattern library 110. Pitch pattern recognizer 109 compares the indication of the actual pitch for each syllable with each of the pitch patterns stored within pitch pattern library 110 and provides an indication of the best match. The output of pitch pattern recognizer 109 is a pitch pattern code corresponding to the best match for the pitch shape of each syllable to the pitch patterns within pitch pattern library 110.

An indication of the pitch patterns stored within pitch pattern library 110 is shown in Table 2. Table 2 identifies each pitch pattern by an identification number, an initial slope, a final slope and a turning point. In accordance with the present invention, the pitch within each syllable is permitted two differing slopes with an adjustable turning point. It should be noted that the slope is restricted within the range of ±2 in the preferred embodiment. Also it should be noted that the preferred speech synthesis device, the TMS5220A, permits independent variation of the pitch period rather than of the pitch frequency. A negative number indicates a reduction in pitch period and therefore an increase in frequency while a positive number indicates an increase in pitch period and therefore a decrease in frequency. In the preferred embodiment, the turning point occurs either at 1/4 of the syllable duration, 1/2 of the syllable duration or 3/4 of the syllable duration. Note that no turning point has been listed for those pitch patterns in which the initial slope and the final slope are identical. In such a case there is no need to specify a turning point, since wherever such a turning point occurs, the change in pitch period will be identical. With an allowed group of five initial slopes, five final slopes and three turning points, one would ordinarily expect a total of 75 possible pitch patterns. However, because some of these patterns are redundant, particularly those in which the initial and final slopes are identical, there are only the 53 variations listed. Because of this limitation upon the number of pitch patterns, it is possible to completely specify a particular one of these patterns with only six bits of data.
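
A short sketch may make the pattern parameters and the six-bit figure concrete. It assumes, for illustration only, that a slope is applied as a fixed change in pitch period per frame and that the turning point is a fraction of the syllable's frame count; the text does not give the slope's unit, and the function names are hypothetical.

```python
def pitch_period_contour(start_period, initial_slope, final_slope,
                         turning_point, n_frames):
    """Expand one pitch pattern into a per-frame pitch period contour."""
    periods = [start_period]
    for step in range(n_frames - 1):
        # Steps before the turning point use the initial slope, the rest the final slope.
        slope = initial_slope if step < turning_point * (n_frames - 1) else final_slope
        periods.append(periods[-1] + slope)
    return periods


def bits_needed(n_codes):
    """Smallest number of bits that can distinguish n_codes different codes."""
    return (n_codes - 1).bit_length()
```

For example, `pitch_period_contour(40, -1, 2, 0.5, 9)` gives a period that falls for the first half of the syllable (rising pitch) and then grows (falling pitch). `bits_needed(53)` is 6, whereas all 75 combinations would require `bits_needed(75)`, which is 7.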

After the pitch pattern has been selected by pitch pattern recognizer 109, the data is applied to syllable type recognizer 111. Syllable type recognizer 111 classifies each syllable as one of four types depending upon whether or not there are initial or final unvoiced consonant clusters. Syllable type recognizer 111 examines the allophone indicia making up each syllable and determines whether there are any consonant allophone indicia prior to the vowel allophone indicia or any consonant allophone indicia following the vowel allophone indicia which fall within the class of unvoiced consonants. Based upon this determination, the syllable is classified as one of four types.
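
The four-way classification can be sketched as two yes/no tests, one on the consonants before the vowel and one on those after it. The allophone labels and the set of unvoiced consonants below are hypothetical stand-ins for the entries of Table 1, and the result is returned as a pair of flags rather than a numbered type because the numbering is not given here.

```python
# Hypothetical labels for unvoiced consonant allophones.
UNVOICED_CONSONANTS = {"P", "T", "K", "F", "S", "SH", "CH", "TH", "H"}


def syllable_type(allophones, vowel_index):
    """Return (initial unvoiced cluster present, final unvoiced cluster present)."""
    before = allophones[:vowel_index]
    after = allophones[vowel_index + 1:]
    has_initial = any(a in UNVOICED_CONSONANTS for a in before)
    has_final = any(a in UNVOICED_CONSONANTS for a in after)
    return (has_initial, has_final)
```

The two flags yield exactly four combinations, matching the four syllable types: `syllable_type(["S", "T", "AA", "P"], 2)` is `(True, True)`, while `syllable_type(["M", "AA", "N"], 1)` is `(False, False)`.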

Duration pattern recognizer 112 receives the syllable type data from syllable type recognizer 111 as well as allophone and duration data. In this regard it should be understood that each allophone may be pronounced in a manner either longer or shorter than the standardized form stored within allophone library 105. As previously noted, allophone recognizer 104 generates data corresponding to a comparison of the duration of the actual allophone data received from LPC analyzer 103 and the standardized allophone data stored within allophone library 105. Based upon this comparison, an allophone duration parameter is derived. The aforementioned TMS5220A speech synthesis device enables production of speech at one of four differing rates covering a four to one time range. Duration pattern library 113 stores a plurality of duration patterns for each of the syllable types determined by syllable type recognizer 111. Each duration pattern within duration pattern library 113 includes a first duration control parameter for any initial consonant allophones, a second duration control parameter for the vowel allophone and a third duration control parameter for any final consonant allophone. The duration pattern recognizer 112 compares the actual duration of speaking for the particular allophone generated by allophone recognizer 104 with each of the duration patterns stored within duration pattern library 113 for the corresponding syllable type. Duration pattern recognizer 112 then determines the best match between the actual duration of the spoken speech and the set of duration patterns corresponding to that syllable type. This best match duration pattern is then output by duration pattern recognizer 112. At the output of duration pattern recognizer 112 is the allophone indicia corresponding to the string of allophones within the spoken input, and the pitch and duration patterns corresponding to each syllable of the spoken input. 
In addition, duration pattern recognizer 112 may optionally also output some indication of the syllable boundaries.
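
The duration pattern selection can be sketched in the same spirit as the pitch pattern match. The sketch assumes each pattern is a triple of duration control values (initial consonants, vowel, final consonants) and that the best match minimizes the summed absolute difference; the distance measure and the library values are assumptions, not taken from the text.

```python
def best_duration_pattern(observed, patterns):
    """Return the index of the pattern closest to the observed duration triple."""
    def distance(pattern):
        return sum(abs(o - p) for o, p in zip(observed, pattern))
    return min(range(len(patterns)), key=lambda i: distance(patterns[i]))


# Hypothetical library, keyed by syllable type flags (initial, final unvoiced cluster).
duration_library = {
    (True, False): [(1.0, 1.0, 1.0), (2.0, 1.0, 1.0), (1.0, 2.0, 1.0)],
}
```

Given an observed triple of `(1.1, 1.8, 1.0)` for a syllable of type `(True, False)`, `best_duration_pattern` returns index 2, the pattern with the lengthened vowel. Only that index needs to be transmitted.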

Elements 114 and 115 illustrate an alternative input to the speech encoding apparatus 100. Text input device 114 receives the input of data corresponding to ordinary printed text in plain language. This text input is applied to text to allophone translator 115 which generates a string of allophone indicia which corresponds to the printed text input. Such a text to allophone conversion may take place in accordance with copending U.S. patent application Ser. No. 240,694 filed Mar. 5, 1981. As an optional further step, hand allophone editing 106 permits a trained operator to edit the allophones from text to allophone converter 115 in order to optimize the allophone string for the desired text input. The allophone string corresponding to the text input is then applied to syllable recognizer 106 where this data is processed as described above.

FIG. 2 illustrates an example of hypothetical syllable pitch data together with the corresponding best match pitch pattern. Pitch track 200 corresponds to the actual primary pitch of the hypothetical syllable. During the first part of the syllable 201, the speech is unvoiced, therefore the pitch is set to 0. During a second portion 202, the frequency begins at a level and gradually declines. During a middle portion 203, the frequency gradually rises to a peak at 204 and then declines. During a final portion 205, the decline has a change in slope and becomes more pronounced.

The actual pitch track 200 is approximated by one of the plurality of stored pitch patterns 210. Note pitch pattern 210 has a first portion 211 having an initial upward slope matching the initial portions of speech segment 203. Pitch pattern 210 then has a falling final slope 212 which is a best fit match to the part of speech segment 203 following peak 204 as well as the declining frequency portion 205. Note that the change between the initial slope 211 and the final slope 212 occurs at a time 213, which in this case is 1/2 the duration of the syllable. Upon resynthesis of the syllable represented by pitch track 200, the pitch pattern 210 is employed.
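The three-parameter pitch pattern described above (initial slope, final slope and turning point) may be pictured as a two-segment line. The following is only an illustrative sketch in Python and is not part of the disclosed apparatus. The names PitchPattern and pattern_contour, the per-frame slope units and the expression of the turning point as a fraction of the syllable duration are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PitchPattern:
    """Hypothetical container for one stored pitch pattern."""
    initial_slope: float   # assumed unit: pitch change per frame before the turn
    final_slope: float     # assumed unit: pitch change per frame after the turn
    turning_point: float   # assumed: fraction of the syllable duration, 0.0 to 1.0


def pattern_contour(pattern, n_frames):
    """Return the pitch offset of the pattern at each frame, starting from 0."""
    turn_frame = pattern.turning_point * (n_frames - 1)
    contour = []
    for i in range(n_frames):
        if i <= turn_frame:
            # Still on the initial slope.
            contour.append(pattern.initial_slope * i)
        else:
            # Past the turning point: continue from the turn on the final slope.
            at_turn = pattern.initial_slope * turn_frame
            contour.append(at_turn + pattern.final_slope * (i - turn_frame))
    return contour


# A rising then falling pattern with the turn at 1/2 the syllable, as in FIG. 2.
rise_fall = PitchPattern(initial_slope=1.0, final_slope=-2.0, turning_point=0.5)
offsets = pattern_contour(rise_fall, 5)
```

An initial slope of zero in this sketch yields a flat first segment, corresponding to the case of no initial change in pitch.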

FIG. 3 illustrates flow chart 300 showing the steps required for determination of the best pitch pattern for a particular syllable. Pitch pattern recognizer 109 preferably performs the steps illustrated in flow chart 300 in order to generate an optimal pitch pattern for each syllable. In the preferred embodiment, flow chart 300 is performed by a programmed general purpose digital computer. It should be understood that flow chart 300 does not illustrate the exact details of the manner for programming such a general purpose digital computer, but rather only the general outlines of this programming. However, it is submitted that one skilled in the art of programming general purpose digital computers would be able to practice this aspect of the present invention from flow chart 300 once the design choice of the particular general purpose digital computer and the particular applications language has been made. Therefore, the exact operation of the apparatus performing the steps listed in flow chart 300 will not be described in greater detail.

Flow chart 300 starts by reading the speech data (processing block 301) generated by LPC analyzer 103. Program 300 next reads the syllable boundaries (processing block 302) generated by syllable recognizer 106. Program 300 next locates the pitch data corresponding to a particular syllable (processing block 303). Program 300 then locates the segments of data (known as frames) which correspond to voiced speech (processing block 304). In the hypothetical example illustrated in FIG. 2, the syllable includes eight frames, a single initial unvoiced frame and seven following voiced frames. Because speech primary pitch corresponds only to voiced speech, those unvoiced portions of the speech are omitted. It is well known that each syllable includes at least one vowel which is voiced and which may have initial and/or final voiced consonants. The hypothetical example illustrated in FIG. 2 includes an unvoiced portion 201 which corresponds to an unvoiced initial allophone. The remaining portions of the syllable illustrated in FIG. 2 are voiced.
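By way of illustration only, the steps of processing blocks 303 and 304 might be sketched as follows in Python. The function names, the representation of the pitch data as one number per frame and the sample pitch values are assumptions made for the example. The convention that an unvoiced frame carries a pitch of 0 follows the description of FIG. 2.

```python
def syllable_pitch(pitch_frames, start, end):
    """Return the pitch frames of one syllable given its boundaries.

    start and end are assumed to be frame indices, with end exclusive.
    """
    return pitch_frames[start:end]


def voiced_frames(syllable_frames):
    """Keep only voiced frames; unvoiced frames are assumed to have pitch 0."""
    return [p for p in syllable_frames if p > 0]


# Made-up pitch values for an eight frame syllable like that of FIG. 2:
# a single initial unvoiced frame followed by seven voiced frames.
utterance = [0, 0, 0, 120, 118, 116, 119, 122, 118, 110, 0]
syllable = syllable_pitch(utterance, 2, 10)
voiced = voiced_frames(syllable)
```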

The comparison of the pitch data to the respective pitch shapes occurs in four different loops. Program 300 first tests to determine whether or not the program is in the first loop (decision block 305). If this is true, then the comparison of pitch data to pitch shapes is made on all voiced frames (processing block 306). This comparison is made in a loop including processing blocks 307-309 and decision block 310. Processing block 307 recalls the next pitch shape. A figure of merit corresponding to the amount of similarity between the actual pitch data and the pitch shape is calculated (processing block 308). This figure of merit for the particular pitch shape is then stored in correspondence to that pitch shape (processing block 309). Program 300 then tests to determine whether or not the last pitch shape in the set of pitch shapes has been compared (decision block 310). In the event that the last pitch shape has not been compared then program 300 returns to processing block 307 to repeat this loop. In the event that the last pitch shape within the set of pitch shapes has been compared, then program 300 returns to decision block 305.

Upon subsequent loops, program 300 tests to determine whether or not this is the second loop (decision block 311). If this is the second loop, program 300 causes the comparisons to be made based upon the actual pitch data omitting the first frame of pitch data (processing block 312). Similarly, if it is the third loop as determined by decision block 313, then the comparison is made omitting the last frame of pitch data (processing block 314). Lastly, upon the fourth loop as determined by decision block 315, the pitch shape comparison is made with the pitch data by omitting both the first and the last frames (processing block 316).

After passing through each of the four above-mentioned loops, program 300 locates the best figure of merit previously calculated (processing block 317). Program 300 then identifies the pitch shape which corresponds to this best figure of merit (processing block 318). At this point, program 300 is exited (exit block 319).
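The four loops of flow chart 300 may be sketched as follows in Python. This is a sketch under stated assumptions rather than the actual programming of pitch pattern recognizer 109. In particular, the figure of merit is not specified above; the example assumes the negative mean squared error between the pitch data and the pitch shape after the mean of each has been removed, so that a larger value indicates a better match. The names contour, figure_of_merit and best_pitch_shape, and the representation of a pitch shape as a tuple of initial slope, final slope and turning point, are likewise hypothetical.

```python
def contour(initial_slope, final_slope, turning_point, n_frames):
    """Two-segment pitch shape sampled at n_frames points (assumed form)."""
    turn = turning_point * (n_frames - 1)
    return [initial_slope * min(i, turn) + final_slope * max(i - turn, 0.0)
            for i in range(n_frames)]


def figure_of_merit(frames, shape):
    """Assumed metric: negative mean squared error after removing each mean."""
    model = contour(*shape, len(frames))
    frame_mean = sum(frames) / len(frames)
    model_mean = sum(model) / len(model)
    error = sum(((f - frame_mean) - (m - model_mean)) ** 2
                for f, m in zip(frames, model))
    return -error / len(frames)


def best_pitch_shape(voiced, shapes):
    """Return (index of best shape, its figure of merit) over the four loops."""
    framings = [
        voiced,         # first loop: all voiced frames
        voiced[1:],     # second loop: omit the first frame
        voiced[:-1],    # third loop: omit the last frame
        voiced[1:-1],   # fourth loop: omit both the first and last frames
    ]
    best_index, best_merit = None, float("-inf")
    for frames in framings:
        if len(frames) < 2:
            continue  # too few frames remain to compare against a shape
        for index, shape in enumerate(shapes):
            merit = figure_of_merit(frames, shape)
            if merit > best_merit:
                best_index, best_merit = index, merit
    return best_index, best_merit


# Made-up library of three shapes: flat, rise then fall, steady fall.
shapes = [(0.0, 0.0, 0.5), (2.0, -2.0, 0.5), (-2.0, -2.0, 0.5)]
index, merit = best_pitch_shape([100, 102, 104, 102, 100], shapes)
```

The mean squared error is used here, rather than a summed error, so that the loops which omit frames are not favored merely because they contain fewer frames. This is a design choice of the example only.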

FIG. 4 illustrates program 400 which shows the general steps for performing the duration pattern selection. As explained above in conjunction with FIG. 3, in the preferred embodiment the procedures illustrated in program 400 are executed by a general purpose digital computer. Although program 400 does not describe the detailed steps required for any particular general purpose computer to perform this procedure, it is believed that this description is sufficient to enable one skilled in the art to properly program a general purpose digital computer once the design choice of that computer and the language to be employed has been made.

Program 400 begins by reading the speech data (processing block 401). Program 400 next reads the allophone durations (processing block 402). The allophone durations are generated by allophone recognizer 104 which compares the standard allophone length stored within allophone library 105 with the actual length of the received allophone. Program 400 next reads the syllable boundaries (processing block 403). Program 400 next determines the syllable type (processing block 404). This syllable type determination will be more fully described below in conjunction with FIG. 9.

Program 400 next enters a loop for comparison of the allophone durations with the stored duration patterns. Program 400 first recalls the next duration pattern corresponding to the previously determined syllable type (processing block 405). Program 400 then calculates a figure of merit based upon the comparison of the actual allophone durations with the allophone durations of the duration pattern (processing block 406). This comparison takes place by comparing the relative length of the initial consonant allophones with the first parameter of the duration pattern, comparing the relative length of the vowel allophone with the second parameter of the duration pattern and comparing the relative duration of any final consonant allophones with the third parameter of the duration pattern. Once this figure of merit has been calculated, it is stored in conjunction with the particular duration pattern (processing block 407). At this point program 400 tests to determine whether the last duration pattern has been compared (decision block 408). If the last duration pattern has not been compared, then program 400 returns to processing block 405 to begin the loop again.

In the event that the comparison has been made for each of the duration patterns of the corresponding syllable type, then program 400 finds the best figure of merit (processing block 409). Program 400 next identifies the particular duration pattern having the previously discovered best figure of merit (processing block 410). This duration pattern is the duration pattern which speech encoding apparatus 100 transmits. At this point program 400 is exited by an exit block 411.
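The loop of processing blocks 405 through 410 may be sketched as follows in Python, again only as an illustration. The figure of merit is assumed to be the negative sum of squared differences between the three relative durations of the syllable and the three parameters of the duration pattern. The use of None for an absent consonant cluster, the function names and the sample values are assumptions made for the example.

```python
def duration_merit(actual, pattern):
    """Assumed metric: negative squared difference summed over present parts.

    actual and pattern are (initial consonants, vowel, final consonants)
    relative durations; a part absent from the syllable is given as None.
    """
    return -sum((a - p) ** 2 for a, p in zip(actual, pattern) if a is not None)


def best_duration_pattern(actual, patterns):
    """Return (index, pattern) of the stored pattern with the best merit."""
    merits = [duration_merit(actual, p) for p in patterns]   # blocks 405 to 408
    best = max(range(len(patterns)), key=merits.__getitem__)  # block 409
    return best, patterns[best]                               # block 410


# Made-up duration patterns for one syllable type.
patterns = [(1.0, 1.0, 1.0), (0.5, 2.0, 1.0), (1.0, 2.0, 0.5)]
# A syllable with a short initial cluster, a long vowel and no final cluster.
choice, chosen_pattern = best_duration_pattern((0.6, 1.8, None), patterns)
```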

This technique may be used in other manners. As an example it is possible to form speech parameter patterns of speech energy sequences, linear predictive coding reflection coefficients or formant frequencies. These types of speech parameters may be matched against prestored patterns in the manner disclosed in regard to pitch and duration. After the best match is found the indicia corresponding to the best speech parameter pattern is identified for transmission to the speech synthesis apparatus. These other speech parameter patterns may be related to other phonological linguistic indicia than the syllables previously disclosed. For example, these other speech parameter patterns may be related to the phonemes, allophones, diphones, demisyllables as well as the syllables disclosed above. As will be further detailed below in relation to pitch and duration patterns, upon synthesis the information of the phonological linguistic unit indicia and the speech pattern indicia are combined to generate the speech.

FIG. 5 illustrates speech producing apparatus 500 in accordance with a preferred embodiment of the present invention. Speech producing apparatus 500 receives input in the form of printed bar code by an optical wand 501. This input data has been encoded in the format described above including allophone indicia, syllable pitch pattern indicia and syllable duration pattern indicia. This data is transmitted to analog to digital converter 502 for conversion into a digital form.

The digital data from analog to digital converter 502 is applied to microprocessor unit 503. Also coupled to microprocessor unit 503 are Random Access Memory 504 and Read Only Memory 505. In accordance with the programming permanently stored within Read Only Memory 505, microprocessor unit 503 identifies the proper allophone indicia and transmits these to stringer 506. In addition, microprocessor unit 503 calculates the proper pitch and duration control parameters from the pitch pattern indicia and the duration pattern indicia. The pitch and duration pattern data are also stored within Read Only Memory 505. Microprocessor unit 503 employs Random Access Memory 504 for storing intermediate values of calculations and for buffering both input and output data.

Stringer 506 combines control data received from microprocessor unit 503 and speech parameters recalled from phonetic memory 507 to generate the speech synthesis parameters for application to synthesizer 508. Phonetic memory 507 includes speech parameters corresponding to each of the permitted allophone indicia. Phonetic memory 507 corresponds substantially to allophone library 105 used as a template for allophone recognizer 104. Stringer 506 recalls the speech parameters from phonetic memory 507 corresponding to received allophone indicia and combines these speech parameters with speech control parameters generated by microprocessor unit 503 in order to control speech synthesizer 508 to generate the desired words.

Speech synthesizer 508 receives the speech parameters from stringer 506 and generates electrical signals corresponding to spoken sounds. These signals are amplified by amplifier 509 and reproduced by speaker 510.

It should be understood that the optical bar code input illustrated in FIG. 5 is merely a preferred embodiment of the use of the present invention. Other forms of input into speaking apparatus 500 may be found advantageous in other applications.

FIG. 6 illustrates program 600 which outlines the major steps required of microprocessor unit 503 in order to generate the proper control parameters for transmission to stringer 506. As in the examples illustrated in FIGS. 3 and 4, program 600 is not intended to illustrate the exact detailed steps required of the microprocessor unit 503, but rather is intended to convey sufficient information to enable one skilled in the art to produce such a detailed program once the selection of the particular microprocessor unit and its associated instruction set is made.

Program 600 starts by input 601 in which microprocessor unit 503 receives the digital data from analog to digital converter 502. Program 600 next deciphers the data received from analog to digital converter 502. In the preferred embodiment, the optical bar code which is read by optical wand 501 is enciphered in some manner to increase its redundancy, thereby increasing the probability of correctly reading this data. Program 600 next identifies the allophone indicia and the overhead data for later use. The allophone indicia corresponds to the allophones to be spoken by speaking apparatus 500. The overhead data corresponds to such things as the initial pitch, which may be called the base pitch, the permitted pitch range or phrase delta pitch for the particular phrase for control of the expressiveness of the phrase, the word endings, the particular pitch and duration patterns corresponding to each syllable and additional redundancy data such as the number of allophone indicia within the phrase. This data, in particular the pitch pattern data and the duration pattern data corresponding to syllables made up of groups of allophone indicia, is employed for generation of speech control parameters for transmission to stringer 506.

Program 600 next identifies the next syllable to be spoken. This identification of the syllable to be spoken may be by means of overhead codes which identify the particular allophone indicia within each syllable. In addition, as will be shown below, microprocessor unit 503 may be programmed in order to determine the syllable boundaries from the types of allophone codes and word boundaries. In any event, program 600 now is concerned with the allophone indicia corresponding to a particular syllable and the overhead data which is employed to control the intonation of that particular syllable. Program 600 then identifies the syllable type based upon the presence or absence of any unvoiced initial consonant allophone indicia and unvoiced final consonant allophone indicia. This determination is more clearly illustrated in conjunction with FIG. 9.
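The identification based upon the presence or absence of unvoiced initial and final consonant allophones may be sketched as follows in Python. This is an illustration only and does not reproduce FIG. 9. The allophone labels in the set UNVOICED and the function name syllable_type are hypothetical, and the result is left as a pair of true or false values rather than any particular numbering of syllable types.

```python
# Hypothetical labels standing in for the unvoiced consonant allophone indicia.
UNVOICED = frozenset({"P", "T", "K", "F", "S", "SH", "CH", "TH", "HH"})


def syllable_type(initial_consonants, final_consonants):
    """Classify a syllable by its unvoiced initial and final consonants.

    Returns (has unvoiced initial consonant, has unvoiced final consonant),
    which distinguishes four cases.
    """
    has_unvoiced_initial = any(a in UNVOICED for a in initial_consonants)
    has_unvoiced_final = any(a in UNVOICED for a in final_consonants)
    return (has_unvoiced_initial, has_unvoiced_final)


# A syllable with an unvoiced initial cluster and a voiced final consonant.
kind = syllable_type(["S", "T"], ["N"])
```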

Program 600 next selects the particular duration control pattern to be applied to synthesizer 508 during the synthesis of the particular allophone. This is accomplished by recalling the syllable duration pattern (processing block 606) which it should be noted is dependent upon the syllable type. Program 600 next tests to determine whether the next allophone to be spoken is in an initial consonant cluster (decision block 607) and if so assigns the initial duration from the duration pattern to this allophone (processing block 608). If this is not an initial consonant cluster allophone, then program 600 checks to determine whether it is a vowel allophone (decision block 609). If this is the case, then program 600 assigns the medial duration of the duration pattern to this allophone (processing block 610). In the event that the allophone is neither one of the initial consonant allophones nor the vowel allophone, then it must be one of the allophones of the final consonant cluster. In such a case the final duration of the duration pattern is assigned to this allophone (processing block 611).
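The assignment of decision blocks 607 and 609 and processing blocks 608, 610 and 611 may be sketched as follows in Python. The sketch assumes that the position of the vowel allophone within the syllable is already known and that the duration pattern is held as a tuple of initial, medial and final duration parameters. The allophone labels and the numeric duration values are made up for the example.

```python
def assign_durations(allophones, vowel_index, duration_pattern):
    """Pair each allophone of a syllable with a duration control parameter."""
    initial, medial, final = duration_pattern
    assigned = []
    for position, allophone in enumerate(allophones):
        if position < vowel_index:
            duration = initial   # initial consonant cluster (block 608)
        elif position == vowel_index:
            duration = medial    # vowel allophone (block 610)
        else:
            duration = final     # final consonant cluster (block 611)
        assigned.append((allophone, duration))
    return assigned


# Hypothetical syllable: two initial consonants, a vowel, one final consonant.
result = assign_durations(["S", "T", "AA", "P"], 2, (1, 2, 3))
```

Every allophone of the initial consonant cluster receives the same initial duration in this sketch, which follows the description above of a single first duration control parameter for any initial consonant allophones.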

Program 600 next assigns the pitch to be used in speaking the allophone under consideration. It will be recalled that in the preferred embodiment, synthesizer 508 is embodied by a TMS5220A speech synthesis device available from Texas Instruments Incorporated. This speech synthesis device allows independent control of primary speech pitch by independent control of the pitch period of an excitation function. The following illustrates the manner in which this pitch period is set.

Program 600 first recalls the pitch pattern data corresponding to the particular syllable (processing block 612). As can be seen from a study of Table 2, each particular pitch pattern generally has an initial slope, a final slope and a turning point. As will be more fully understood below, the initial and final slopes enable change of the pitch period of the excitation function of the speech synthesizer 508 during the time that a particular syllable is synthesized.

The pitch period is then set to be equal to the base pitch which is used to determine the register of the voice to be produced and i