| United States Patent | 5652828 |
| Link to this page | http://www.wikipatents.com/5652828.html |
| Inventor(s) | Silverman; Kim Ernest Alexander (Danbury, CT) |
| Abstract | Improved automated synthesis of human audible speech from text is
disclosed. Performance enhancement of the underlying text
comprehensibility is obtained through prosodic treatment of the
synthesized material, improved speaking rate treatment, and improved
methods of spelling words or terms for the system user. Prosodic shaping
of text sequences appropriate for the discourse in large groupings of text
segments, with prosodic boundaries developed to indicate conceptual units
within the text groupings, is implemented in a preferred embodiment. |
Title Information  |
Automated voice synthesis employing enhanced prosodic treatment of text,
spelling of text and rate of annunciation |
| Publication Date |
July 29, 1997 |
| Filing Date |
March 1, 1996 |
| Parent Case |
RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No.
08/460,030 filed Jun. 2, 1995 which is a continuation of U.S. patent
application Ser. No. 08/033,528 now abandoned filed Mar. 19, 1993 both of
which are titled "IMPROVED AUTOMATED VOICE SYNTHESIS EMPLOYING ENHANCED
PROSODIC TREATMENT OF TEXT, SPELLING OF TEXT AND RATE OF ANNUNCIATION". |
Claims  |
What is claimed is:
1. A method of synthesizing human audible speech from restricted text having a predetermined information content and predetermined format characteristics, the method comprising the steps of:
generating prosody indicia for the restricted text as a function of the predetermined information content and predetermined format characteristics by performing the steps of:
a) identifying major prosodic groupings within the restricted text by utilizing major demarcation features which are a function of the predetermined format characteristics to define the beginning and end of the major prosodic groupings;
b) identifying prosodic subgroupings within the major prosodic groupings according to prosodic rules for analyzing the restricted text as a function of the predetermined information content for predetermined textual markers indicative of prosodically isolatable subgroupings not delineated by the major demarcations dividing the prosodic major groupings;
c) identifying within the prosodic subgroupings prosodically separable subgroup components;
d) generating prosodic indicia which include salience signifiers, the salience signifiers controlling the salience of segments of the synthesized speech, the step of generating the prosodic indicia including the steps of:
(i) generating salience signifiers within the prosodic subgroupings in accordance with predetermined salience placement rules relating to the components of the subgroupings themselves;
(ii) modifying the salience at the beginning and end of each prosodic subgroup; and
(iii) modifying the salience at the beginning and end of each major prosodic grouping; and
generating and outputting audible speech from the restricted text and prosodic indicia.
2. The method of claim 1,
wherein the predetermined information content includes a carrier phrase including word strings that have a structuring purpose and information words;
wherein the step of identifying major prosodic groupings includes the step of identifying the carrier phrase.
3. The method of claim 2, wherein the information words include names with prefixed titles and wherein the method further comprises the steps of:
increasing a speaking rate of the word strings that have a structuring purpose relative to a speaking rate of the information words.
4. The method of claim 3, wherein the information words include names which include prefixed titles followed by a word of the name, the method further comprising the step of:
modifying the generated salience indicators to assign less salience to the prefixed title than the word following the prefixed title.
5. The method of claim 4, wherein a first time speech is generated from a word it is assigned greater salience than when speech is subsequently generated from the same word.
6. The method of claim 5, further comprising the steps of:
repeatedly outputting the audible speech corresponding to a first segment of text;
decreasing a rate of annunciation of the first segment of text after a first number of successive repeats of the audible speech corresponding to the first segment of text.
7. The method of claim 6,
wherein the step of modifying the salience at the beginning and end of each prosodic subgroup includes the steps of:
modifying the generated salience signifiers to increase the salience at the beginning of each prosodic subgroup; and
modifying the generated salience signifiers to decrease the salience at the end of each prosodic subgroup; and
wherein the step of modifying the salience at the beginning and end of each major prosodic grouping includes the steps of:
modifying the generated salience signifiers to increase the salience at the beginning of each major prosodic grouping; and
modifying the generated salience signifiers to decrease the salience at the end of each prosodic subgroup.
8. The method of claim 6, wherein each word of a name includes a plurality of letters, the method further comprising the steps of:
arranging the letters of a word of a name into groups; and
generating indicia of prosodic boundaries between the groups of letters to insert a slight pause between the groups of letters when audible speech is generated therefrom.
9. The method of claim 8, further comprising the step of:
generating audible speech representing the spelling of the name following the generation of audible speech from the groups of letters.
10. The method of claim 9, further comprising the steps of:
allowing users to obtain repeats of audible speech segments generated from text segments;
changing the rate of annunciation of a first audible speech segment after a first number of successive repeats of the first audible speech segment for the first user;
decreasing the rate of annunciation of a second audible speech segment generated from a second text segment for the first user after the first number of successive repeats of the first audible speech segment; and
increasing the rate of annunciation for a third audible speech segment generated from a third text segment if the first user does not obtain repeats of the second audible speech segment.
11. The method of claim 10, further comprising the step of:
adjusting the initial annunciation rate for subsequent users as a function of the number of consecutive prior users for whom the rate of annunciation has been altered.
12. The method of claim 1, wherein the step of identifying within the prosodic subgroupings prosodically separable subgroup components includes the steps of:
a) identifying predetermined textual indicators which mark divisions of text groupings around them;
b) utilizing the predetermined textual indicators to separate the text within the prosodic subgrouping into units of nominal text which do not include said predetermined textual indicators; and
c) identifying within the units of nominal text other indicators of textual groupings that are not predetermined textual indicators.
13. The method of claim 12, further comprising the steps of:
repeatedly outputting the audible speech corresponding to a first segment of text;
decreasing a rate of annunciation of the first segment of text after a first number of successive repeats of the audible speech corresponding to the first segment of text.
14. The method of claim 13,
wherein the prosodic indicia are generated by a set of prosody rules with predetermined discourse constraints which are a function of the context of the synthesis of the restricted text; and
wherein the restricted text includes name and address information.
15. The method of claim 14,
wherein a major prosodic grouping is a sentence, a prosodic subgrouping is a name including a plurality of words, and a subgroup component is a word in a name.
16. The method of claim 15, wherein the salience signifiers are indicia of pitch.
17. The method of claim 16, further comprising the step of:
arranging letters of a name into groups;
generating indicia of prosodic boundaries between the groups of letters.
18. The method of claim 17, wherein the generated indicia of prosodic boundaries between groups of letters results in the insertion of a slight pause between the groups of letters when audible speech is generated therefrom.
19. The method of claim 18, further comprising the step of:
generating audible speech representing the spelling of the name following the generation of audible speech from the groups of letters.
20. The method of claim 16, further comprising the step of:
generating audible speech representing the spelling of a name.
21. The method of claim 1, wherein the audible speech is generated for a plurality of users, the method further comprising the steps of:
outputting at a first annunciation rate and to a first user, a first segment of audible speech corresponding to a first segment of text;
repeatedly outputting to the first user the first segment of audible speech; and
decreasing a rate of annunciation of the first segment of audible speech after a first number of successive repeats of the first segment of audible speech.
22. The method of claim 21, further comprising the step of:
outputting the first segment of audible speech corresponding to the first segment of text to a second user at a second annunciation rate which is determined as a function of the number of times the first segment of audible speech was output to the first user.
23. The method of claim 22, wherein the second annunciation rate is lower than the first annunciation rate.
24. The method of claim 1, further comprising the steps of:
allowing users to obtain repeats of audible speech segments generated from text segments;
changing the rate of annunciation of a first audible speech segment after a first number of successive repeats of the first audible speech segment for the first user;
decreasing the rate of annunciation of a second audible speech segment generated from a second text segment for the first user after the first number of successive repeats of the first audible speech segment; and
increasing the rate of annunciation for a third audible speech segment generated from a third text segment if the first user does not obtain repeats of the second audible speech segment.
25. The method of claim 24, further comprising the step of:
adjusting the initial annunciation rate for subsequent users as a function of the number of consecutive prior users for whom the rate of annunciation has been altered.
26. A method of synthesizing human audible speech from text including a predetermined information content and having predetermined format characteristics, the method comprising the steps of:
generating prosody indicia for the text as a function of the predetermined information content and predetermined format characteristics of the text by performing the steps of:
a) identifying major prosodic groupings within the restricted text by utilizing major demarcation features which are a function of the predetermined format characteristics to define the beginning and end of the major prosodic groupings;
b) identifying prosodic subgroupings within the major prosodic groupings according to prosodic rules for analyzing the restricted text as a function of the predetermined information content for predetermined textual markers indicative of prosodically isolatable subgroupings not delineated by the major demarcations dividing the prosodic major groupings;
c) identifying within the prosodic subgroupings prosodically separable subgroup components, at least one subgroup component being a word in the name;
d) generating prosodic indicia which include salience signifiers, the salience signifiers controlling the salience of segments of the synthesized speech, the step of generating the prosodic indicia including the steps of:
(i) generating salience signifiers within the prosodic subgroupings in accordance with salience placement rules solely relating to the components of the subgroupings themselves;
(ii) modifying the generated salience signifiers to increase the salience at the start of each prosodic subgroup and to further signify the salience at the end of each prosodic subgroup; and
(iii) further modifying the salience signifiers to further increase the salience of the beginning of the major prosodic grouping and further signify the salience of the end of the major prosodic grouping.
27. The method of claim 26, further comprising the steps of:
arranging letters of the name into groups;
generating indicia of prosodic boundaries between the groups of letters, the generated indicia of prosodic boundaries between groups of letters resulting in the insertion of a slight pause between the groups of letters when audible speech is generated therefrom.
28. The method of claim 27, wherein the audible speech is generated for a plurality of users, the method further comprising the steps of:
outputting to a first user at a first annunciation rate a first segment of audible speech corresponding to a first segment of text;
repeatedly outputting to the first user the first segment of audible speech; and
decreasing the rate of annunciation of the first segment of audible speech after a first number of successive repeats of the first segment of audible speech.
29. An apparatus for synthesizing human audible speech from a machine readable representation of restricted text having a predetermined information content and predetermined format characteristics, comprising:
prosody preprocessor means for receiving the restricted text and for generating prosody indicia by assigning the prosody indicia on the basis of the predetermined informational content of the restricted text, means for:
a) identifying major prosodic groupings by utilizing major demarcation features to define the beginning and end of the major prosodic groupings;
b) identifying prosodic subgroupings within the major prosodic groupings according to prosodic rules for analyzing the text for predetermined textual markers indicative of prosodically isolatable subgroupings not delineated by the major demarcations dividing the prosodic major groupings;
c) identifying within the prosodic subgroupings prosodically separable subgroup components; and
d) generating prosodic indicia which include salience signifiers utilizable by the speech synthesizer means to vary the salience of segments of the synthesized speech such that:
(i) the salience signifiers within the prosodic subgroupings are first generated in accordance with predetermined salience placement rules solely relating to the components themselves,
(ii) thereafter the first generated salience signifiers are modified to increase the salience at the start of the prosodic subgroup and further signify the salience at the end of the prosodic subgroup, and
(iii) the salience signifiers are subsequently further modified to further increase the salience of the beginning of the major prosodic grouping and further signify the salience of the end of the major prosodic grouping; and
speech synthesizer means for synthesizing human audible speech from text, the speech synthesizer means including means for generating prosody indicia on unrestricted text and for interpreting and executing prosody indicia received from the prosody preprocessor means, the prosody indicia from the prosody preprocessor means being used to override and supplement the prosody indicia generated by the internal prosody indicia generating means. |
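For illustration only, the three-level salience shaping recited in step d) of claim 1, with the increase/decrease directions spelled out in claim 7, can be sketched in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the names `Word` and `shape_salience`, the numeric values `base`, `boost` and `cut`, and the uniform placement rule in step (i) are all hypothetical, since the claims do not specify them.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    salience: float


def shape_salience(major_grouping, base=1.0, boost=0.5, cut=0.5):
    """Return salience-annotated words for one major prosodic grouping.

    major_grouping is a list of subgroupings, each a list of word strings.
    base, boost and cut are illustrative values, not taken from the patent.
    """
    # Step (i): placeholder placement rule; every component gets the same base value.
    shaped = [[Word(w, base) for w in sub] for sub in major_grouping if sub]

    # Step (ii): raise the start and lower the end of each prosodic subgrouping.
    for sub in shaped:
        sub[0].salience += boost
        sub[-1].salience -= cut

    # Step (iii): raise the start and lower the end of the major prosodic grouping.
    if shaped:
        shaped[0][0].salience += boost
        shaped[-1][-1].salience -= cut
    return shaped


# One sentence (major grouping) holding a name and a place (subgroupings).
sentence = [["Kim", "Ernest", "Silverman"], ["Danbury", "Connecticut"]]
result = shape_salience(sentence)
saliences = [[w.salience for w in sub] for sub in result]
```

In this sketch the adjustments are additive, so a word that opens both a subgrouping and the major grouping is raised twice. A real system would map the resulting values onto synthesizer controls such as pitch (claim 16); that mapping is outside the sketch.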
|
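The repeat-driven speaking-rate adjustment of claims 10 and 24 can likewise be sketched as a small state holder. This is a hedged illustration: the class name `RateController`, the use of words per minute as the rate unit, and the default values for the threshold, step and minimum rate are assumptions, not figures from the patent. The cross-user adjustment of claims 11 and 25 is omitted.

```python
class RateController:
    """Tracks one user's speaking rate across successive text segments."""

    def __init__(self, initial_rate=180.0, repeat_threshold=2, step=20.0, min_rate=100.0):
        # All defaults are hypothetical; the rate is assumed to be in words per minute.
        self.initial_rate = initial_rate
        self.repeat_threshold = repeat_threshold
        self.step = step
        self.min_rate = min_rate
        self.rate = initial_rate
        self.repeats = 0  # successive repeats of the current segment

    def on_repeat(self):
        """User asked to hear the current segment again; return the rate to use."""
        self.repeats += 1
        # Slow down once, when the repeat count reaches the threshold.
        if self.repeats == self.repeat_threshold:
            self.rate = max(self.min_rate, self.rate - self.step)
        return self.rate

    def on_next_segment(self):
        """Advance to the next segment; return the rate to use for it."""
        # Speed back up only if the segment just finished needed no repeats.
        if self.repeats == 0:
            self.rate = min(self.initial_rate, self.rate + self.step)
        self.repeats = 0
        return self.rate


controller = RateController()
first_repeat = controller.on_repeat()         # below threshold, rate unchanged
second_repeat = controller.on_repeat()        # threshold reached, rate lowered
second_segment = controller.on_next_segment() # lowered rate carries over
third_segment = controller.on_next_segment()  # no repeats requested, rate restored
```

The lowered rate carries over to the second segment and is raised again for the third only if the second was not repeated, matching the order of steps in the claims.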
Description  |
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to automated synthesis of human speech from computer readable text, such as that stored in databases or generated by data processing systems automatically or via a user. Such systems are under current consideration and are being placed in use, for example, by banks or telephone companies to enable customers to readily access information about accounts, telephone numbers, addresses and the like.
Text-to-speech synthesis is seen to be potentially useful to automate or create many information services. Unfortunately to date most commercial systems for automated synthesis remain too unnatural | | |