Method and apparatus for automatic generation of vocal emotion in a synthetic text-to-speech system    

United States Patent 5860064
Link to this page: http://www.wikipatents.com/5860064.html
Inventor(s): Henton; Caroline G. (Santa Cruz, CA)
Abstract: A method and apparatus for the automatic application of vocal emotion parameters to text in a text-to-speech system. Predefining vocal parameters for various vocal emotions allows simple selection and application of vocal emotions to text to be output from a text-to-speech system. Further, the present invention is capable of generating vocal emotion with the limited prosodic controls available in a concatenative synthesizer.
   














Owner/Assignee: Apple Computer, Inc. (Cupertino, CA)
Publication Date: January 12, 1999
Application Number: 08/805,893
Filing Date: February 24, 1997
US Classification: 704/260 204/266
Int'l Classification: G10L 005/00
Examiner: Dorvil; Richemond
Attorney/Law Firm: Carr & Ferrell LLP
Parent Case: This application is a continuation of application Ser. No. 08/062,363, filed May 13, 1993, now abandoned.
USPTO Field of Search: 395/2.09 395/2.69 395/2.79 395/2.67 704/260 704/259 704/270 704/200 704/266 704/272 704/276
U.S. References

Patent No.   Inventor     US Class   Date
5396577      Oikawa       704/260    Mar. 1995
5278943      Gasper       704/200    Jan. 1994
5151998      Capps        704/278    Sep. 1992
4406626      Anderson     704/270    Sep. 1983
4397635      Samuels      434/178    Aug. 1983
4337375      Freeman      704/260    Jun. 1982
3704345      Coker        704/266    Nov. 1972
4779209      Stapleford   704/278    Dec. 1969
Claims
 


What is claimed is:

1. A method for automatic application of vocal emotion to previously entered text to be outputted by a synthetic text-to-speech system, said method comprising:

selecting a portion of said previously entered text;

manipulating a visual appearance of the selected text to selectively choose a vocal emotion to be applied to said selected text;

obtaining vocal emotion parameters associated with said selected vocal emotion; and

applying said obtained vocal emotion parameters to said selected text to be outputted by said synthetic text-to-speech system.

2. The method of claim 1 wherein said vocal emotion parameters comprise pitch mean, pitch range, volume and speaking rate.

3. The method of claim 2 wherein said text-to-speech system is a concatenative system.

4. The method of claim 3 wherein said vocal emotion is one of multiple vocal emotions available for selection.

5. The method of claim 4 wherein said multiple vocal emotions comprises anger, happiness, curiosity, sadness, boredom, aggressiveness, tiredness and disinterest.

6. A method for providing vocal emotion to previously entered text in a concatenative synthetic text-to-speech system, said method comprising:

selecting said previously entered text;

manipulating a visual appearance of the selected text to select a vocal emotion from a set of vocal emotions;

obtaining vocal emotion parameters predetermined to be associated with said selected vocal emotion, said vocal emotion parameters specifying pitch mean, pitch range, volume and speaking rate;

applying said obtained vocal emotion parameters to said selected text; and

synthesizing speech from the selected text.

7. The method of claim 6 wherein said set of vocal emotions comprises anger, happiness, curiosity, sadness, boredom, aggressiveness, tiredness and disinterest.

8. An apparatus for automatic application of vocal emotion parameters to previously entered text to be outputted by a synthetic text-to-speech system, said apparatus comprising:

a display device for displaying said previously entered text;

an input device for permitting a user to selectively manipulate a visual appearance of the entered text and thereby select a vocal emotion;

memory for holding said vocal emotion parameters associated with said selected vocal emotion; and

logic circuitry for obtaining said vocal emotion parameters associated with said selected vocal emotion from said memory and for applying said obtained vocal emotion parameters to the manipulated text to be outputted by said synthetic text-to-speech system.

9. The apparatus of claim 8 wherein said vocal emotion parameters comprise pitch mean, pitch range, volume and speaking rate.

10. The apparatus of claim 9 wherein said text-to-speech system is a concatenative system.

11. The apparatus of claim 10 wherein said vocal emotion is one of multiple vocal emotions available for selection.

12. The apparatus of claim 11 wherein said multiple vocal emotions comprises anger, happiness, curiosity, sadness, boredom, aggressiveness, tiredness and disinterest.

13. A method for converting text to speech that enables a user to interactively apply vocal parameters to user-selectable text, comprising the steps of:

selecting a portion of visually displayed text;

selectively manipulating the selected portion of text to modify a visual appearance of the selected portion of text and to modify certain vocal parameters associated with the selected portion of text; and

applying the modified vocal parameters associated with the selected portion of text to synthesize speech from the modified text.

14. The method of claim 13 further comprising the step of, in response to manipulation, generating corresponding vocal parameter control data for transfer, in conjunction with said text, to an electronic text-to-speech synthesizer.

15. The method of claim 13 wherein said vocal parameters include a volume parameter, said control means include a volume handle and the step of responding includes, in response to said user vertically dragging said volume handle, the step of manipulating said volume parameter and modifying said selected portion of text to occupy a different amount of vertical space.

16. The method of claim 15 wherein said step of manipulating modifies a text-height display characteristic.

17. The method of claim 13 wherein the step of manipulation is performed by control means, said vocal parameters include a rate parameter, said control means include a rate handle and the step of responding includes, in response to said user horizontally dragging said rate handle, modifying said rate parameter and modifying said selected portion of text to occupy a different amount of horizontal space.

18. The method of claim 17 wherein said step of manipulating modifies a text-width display characteristic.

19. The method of claim 13 wherein said vocal parameters include a volume parameter and a rate parameter, said control means include a volume/rate handle and the step of manipulating includes, in response to said user vertically dragging said volume/rate handle, modifying said volume parameter and modifying said selected portion of text to occupy a different amount of vertical space, and, in response to said user horizontally dragging said volume/rate handle, modifying said rate parameter and modifying said selected portion of text to occupy a different amount of horizontal space.

20. The method of claim 13 wherein said vocal parameters include volume, rate and pitch, each of said vocal parameters has a predetermined base value, and a plurality of predetermined combinations of said vocal parameters each defines a respective emotion grouping.

21. The method of claim 20 wherein the step of manipulation is performed by control means, and said control means include a plurality of emotion controls which are each user activatable to select a corresponding one of said emotion groupings.

22. The method of claim 21 wherein said emotion controls include a plurality of differently colored emotion buttons each indicating a different emotion.

23. The method of claim 22 wherein said user selecting one of said emotion buttons selects one of said emotion groupings and correspondingly modifies a color characteristic of said selected portion of text.

24. The method of claim 13 wherein said vocal parameters are specified as a variance from a predetermined base value.

25. A computer-readable storage medium storing program code for causing a computer to perform the steps of:

permitting a user to select a portion of text;

permitting a user to manipulate the selected text with a plurality of user-manipulatable control means;

responding to each user-manipulation of one of said control means by modifying a plurality of corresponding vocal parameters of the selected text and modifying a displayed appearance of said portion of text; and

synthesizing speech from the modified text.

26. A system for converting text to speech that enables a user to interactively apply vocal parameters to user-selectable text, comprising:

means for a user to select a portion of text;

a plurality of interactive user manipulatable means for controlling vocal parameters associated with the selected portion of text;

means, responsive to said control means, for modifying a plurality of vocal parameters associated with the portion of text and for modifying a displayed appearance of said portion of text; and

means for synthesizing speech from the modified text.

27. A method of converting text to speech, comprising:

entering text;

displaying a portion of the entered text;

selecting a portion of the displayed text;

manipulating an appearance of the selected text to selectively change a set of vocal emotion parameters associated with the selected text; and

synthesizing speech having a vocal emotion from the manipulated portion of text;

whereby the vocal emotion of the synthesized speech depends on the manner in which the appearance of the text is manipulated.

28. A method according to claim 27 wherein the step entering is followed immediately by the step of displaying.
Description
 


CROSS-REFERENCE TO RELATED APPLICATIONS

This application is related to co-pending patent application Ser. No. 08/061,608 entitled "GRAPHICAL USER INTERFACE FOR SPECIFICATION OF VOCAL EMOTION IN A SYNTHETIC TEXT-TO-SPEECH SYSTEM" having the same inventive entity, assigned to the assignee of the present application, and filed with the United States Patent and Trademark Office on the same day as the present application.

FIELD OF THE INVENTION

The present invention relates generally to the field of sound manipulation, and more particularly to graphical interfaces for user specification of sound attributes in synthetic text-to-speech systems. Still further, the present invention relates to the parameters which are specified and/or altered by user interaction with the graphical interface. More particularly, the present invention relates to providing vocal emotion sound qualities to synthetic speech through user interaction with a graphical interface editor to specify such vocal emotion.

BACKGROUND OF THE INVENTION

For a considerable time in the history of speech synthesis, the speech produced has been mostly `neutral` in tone, or in the worst case, monotone, i.e., it has sounded disinterested, or deficient, in vocal emotionality. This is why the synthesized intonation produced by prior art systems frequently sounded robotic, wooden and otherwise unnatural. Furthermore, synthetic speech research has been directed primarily towards maximizing intelligibility rather than including naturalness or variety. Recent investigations into techniques for adding emotional affect to synthesized speech have produced mixed results, and have concentrated on parametric synthesizers which generate speech through mathematical manipulations rather than on concatenative systems which combine segments of stored natural speech.

Text-to-speech systems usually incorporate rules for the application of intonational attributes for the text submitted for synthetic output. However, these rule systems generate generally neutral tones and, further, are not well suited for authoring or editing emotional prose at a high level. The problem lies not only in the terminology, for example "baseline-pitch", but also in the difficulty of quantifying these terms. If given the task of entering a stage play into a synthetic speech environment, it would be unbearable (or, at the very least, highly challenging for the layperson) to have to choose numerical values for the various speech parameters in order to incorporate vocal emotion into each word spoken.

For example, prior art speech synthesizers have provided for the customization of the prosody or intonation of synthetic speech, generally using either high-level or low-level controls. The high-level controls generally include text mark-up symbols, such as a pause indicator or pitch modifier. An example of prior art high-level text mark-up phonetic controls is taken from the Digital Equipment Corporation DECtalk DTC03 (a commercial text-to-speech system) Owner's Manual where the input text string:

It's a mad mad mad mad world.

can have its prosody customized as follows:

It's a [/]mad [\]mad [/]mad [\]mad [/\]world.

where [/] indicates pitch rise, and [\] indicates pitch fall.
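
As a rough illustration of this kind of high-level mark-up, the sketch below prefixes words with pitch marks to rebuild the example string. It assumes the rise and fall marks are written as [/] and [\], and that a mark is placed directly before the word it modifies; the function name `mark_up` and the treatment of the combined [/\] mark as a single rise-fall symbol are assumptions made for illustration, not part of any DECtalk interface.

```python
# Assumed notation: "[/]" marks a pitch rise and "[\]" a pitch fall,
# each placed directly before the word it modifies.
RISE = "[/]"
FALL = "[\\]"
RISE_FALL = "[/\\]"  # assumed reading of the combined mark on "world"


def mark_up(words, marks):
    """Prefix each word with its pitch mark; unmarked words stay plain.

    `marks` maps a word index to one of the mark strings above.
    """
    return " ".join(marks.get(i, "") + word for i, word in enumerate(words))


words = "It's a mad mad mad mad world.".split()
marks = {2: RISE, 3: FALL, 4: RISE, 5: FALL, 6: RISE_FALL}
print(mark_up(words, marks))
# prints: It's a [/]mad [\]mad [/]mad [\]mad [/\]world.
```

Even in this small form, the author must decide a rise or fall for every word by hand, which is the coarseness the following paragraphs criticize.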

Some prior art synthesizers also provide the user with direct control over the output duration and pitch of phonetic symbols. These are the low-level controls. Again, examples from DECtalk:

[ow<1000>]

causes the sound [ow] (as in "over") to receive a duration specification of 1000 milliseconds (ms); while

[ow<,90>]

causes [ow] to receive its default duration, but it will achieve a pitch value of 90 Hertz (Hz) at the end; while

[ow<1000,90>]

causes [ow] to be 1000 ms long, and to be 90 Hz at the end.
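
The three low-level forms above follow one pattern: a phonetic symbol in square brackets, optionally followed by a duration in milliseconds and an end pitch in Hertz inside angle brackets. The sketch below builds such strings from that pattern alone. The helper name `phoneme_control` is hypothetical, and the format is inferred only from the three examples in the text, not from the DECtalk manual.

```python
def phoneme_control(phoneme, duration_ms=None, end_pitch_hz=None):
    """Build a low-level control string such as [ow<1000,90>].

    Inferred from the three examples in the text: the duration comes
    first, the end pitch second, and an omitted duration leaves the
    slot before the comma empty.
    """
    if duration_ms is None and end_pitch_hz is None:
        return f"[{phoneme}]"
    duration = "" if duration_ms is None else str(duration_ms)
    if end_pitch_hz is None:
        return f"[{phoneme}<{duration}>]"
    return f"[{phoneme}<{duration},{end_pitch_hz}>]"


print(phoneme_control("ow", duration_ms=1000))                   # [ow<1000>]
print(phoneme_control("ow", end_pitch_hz=90))                    # [ow<,90>]
print(phoneme_control("ow", duration_ms=1000, end_pitch_hz=90))  # [ow<1000,90>]
```

Every phoneme in an utterance would need its own call with hand-measured values, which illustrates why this level of control is impractical for an average user.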

So, on the one hand, the disadvantage of the high-level controls is that they give only a very approximate effect and lack intuitiveness or direct connection between the control specification and the resulting or desired vocal emotion of the synthetic speech. Further, it may be impossible to achieve the desired intonational or vocal emotion effect with such a coarse control mechanism.

And on the other hand, the disadvantage of the low-level controls is that even the intonational or vocal emotion specification for a single utterance can take many hours of expert analysis and testing (trial and error), including measuring and entering detailed Hertz and milliseconds specifications by hand. Further, this is clearly not a task an average user can tackle without considerable knowledge and training in the various speech parameters available.

What is needed, therefore, is an intuitive graphical interface for specification and modification of vocal emotion of synthetic speech. Of course, other graphical interfaces for modification of sound currently exist. For example, commercial products such as SoundEdit®, by Farallon Computing, Inc., provide for manipulation of raw sound waveforms. However, SoundEdit® does not provide for direct user manipulation of the waveform (instead, the portion of the waveform to be modified is selected and then a menu selection is made for the particular modification desired).

Further, manipulation of raw waveforms does not provide a clear intuitive means to specify vocal emotion in the synthetic speech because of the lack of clear connection between the displayed waveform and the desired vocal emotion. Simply put, by looking at a waveform of human speech, a user cannot easily ascertain how it (or modifications to it) will sound when played through a loudspeaker, particularly if the user is attempting to provide some sort of vocal emotion to the speech.

By contrast, the present invention is completely intuitive. The present invention provides for authoring, direct manipulation and visual representation of emotional synthetic speech in a simplified format with a high level of abstraction. A user can easily predict how the text authored with the graphical editor of the present invention will sound because of the power of the explicit and intuitive visual representation of vocal parameters.

Further, the present invention provides for the automatic specification of prosodic controls which create vocal emotional affect in synthetic speech produced with a concatenative speech synthesizer.

First of all, it is important to understand that speech has two main components: verbal (the words themselves), and vocal (intonation and voice quality). The importance of vocal components in speech may be indicated by the fact that children can understand emotions in speech before they can understand words. Intonation is effected by changes in the pitch, duration and amplitude of speech segments. Voice quality (e.g. nasal, breathy, or hoarse) is intrasegmental, depending on the individual vocal tract. Note that a glossary has been included as Appendix A for further clarification of some of the terms used herein.

Along a sliding scale of `affect`, voices may be heard to contain personalities, moods, and emotions. Personality has been defined as the characteristic emotional tone of a person over time. A mood may be considered a maintained attitude; whereas an emotion is a more sudden and more subtle response to a particular stimulus, lasting for seconds or minutes. The personality of a voice may therefore be regarded as its largest effect, and an emotion its smallest. The term `vocal emotion` will be used herein to encompass the full range of `affect` in a voice.

The full range of attributes may be created in synthesized speech. Voice parameters affected by emotion are the pitch envelope (a combination of the speaking fundamental frequency, the pitch range, the shape and timing of the pitch contour), overall speech rate, utterance timing (duration of segments and pauses), voice quality, and intensity (loudness).

If computer memory and processing speed were unlimited, one method for creating vocal emotions would be to simply store words spoken in varying emotional ways by a human being. In the present state of the art, this approach is impractical. Rather than being stored, emotions have to be synthesized on-line and in real-time. In parametric synthesizers (of which DECtalk is the best known and most successful), there may be as many as thirty basic acoustic controls available for altering pitch, duration and voice quality. These include, e.g., separate control of formants' values and bandwidths; pitch movements on, and duration of, individual segments; breathiness; smoothness; richness; assertiveness; etc. Precision of articulation of individual segments (e.g., fully released stops, degree of vowel reduction), which is controllable in DECtalk, can also contribute to the perception of emotions such as tenderness and irony. These parameters may be manipulated to create voice personalities; DECtalk is supplied with nine different `Voices` or personalities. It should be noted that intensity (volume) is not controllable within an utterance in DECtalk.

With a concatenative speech synthesizer, the type used in the preferred embodiment of the present invention, the range of acoustic controls is severely limited. Firstly, it is not possible to alter the voice quality of the speaker, since the speech is created from the recording of only one live speaker (who has their individual voice quality) speaking in one (neutral) vocal mode, and parameters for manipulating positions of the vocal folds are not possible in this type of synthesizer. Secondly, precision of articulation of individual segments is not controllable with concatenative synthesizers. It is nonetheless possible with the speech synthesizer used in the preferred embodiment of the present invention to control the parameters listed below:

TABLE 1

Parameter                    Speech Synthesizer Commands
1. Average speaking pitch    Baseline Pitch (pbas)
2. Pitch range               Pitch Modulation (pmod)
3. Speech rate               Speaking rate (rate)
4. Volume                    Volume (volm)
5. Silence                   Silence (slnc)
6. Pitch movements           Pitch rise (/), pitch fall (\)
7. Duration                  Lengthen (>), shorten (<)

Although there are seven parameters listed in the table above, the present invention claims that for concatenative synthesizers, it is possible to produce a wide range of emotional affect using the interplay of only five parameters--since Speech rate and Duration, and Pitch range and Pitch movements are, respectively, effected by the same acoustic controls. In other words, the present invention is capable of providing an automatic application of vocal emotion to synthetic speech through the interplay of only the first five elements listed in the table above.
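
The reduction from seven listed parameters to five can be sketched as a small lookup. The pairing below follows the statement that Speech rate and Duration share an acoustic control, as do Pitch range and Pitch movements; the names `TABLE_1`, `SHARED_CONTROL`, and `independent_parameters` are illustrative assumptions, not identifiers from the synthesizer.

```python
# The seven rows of Table 1: (parameter, synthesizer command or symbols).
TABLE_1 = [
    ("Average speaking pitch", "pbas"),
    ("Pitch range", "pmod"),
    ("Speech rate", "rate"),
    ("Volume", "volm"),
    ("Silence", "slnc"),
    ("Pitch movements", "/ \\"),
    ("Duration", "> <"),
]

# Parameters the text describes as effected by the same acoustic
# control as an earlier row (key shares its control with the value).
SHARED_CONTROL = {
    "Duration": "Speech rate",
    "Pitch movements": "Pitch range",
}


def independent_parameters(table, shared):
    """Return the rows that are not covered by another row's control."""
    return [(name, cmd) for name, cmd in table if name not in shared]


for name, cmd in independent_parameters(TABLE_1, SHARED_CONTROL):
    print(f"{name}: {cmd}")
```

Run as written, the loop lists the first five rows of the table, which are the parameters the text says are sufficient.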

Further, the present invention is not concerned with the details of how emotions are perceived in speech (since this is known to be idiosyncratic and varies among users), but rather with the optimal means of producing synthesized emotions from a restricted number of parameters, while still maintaining optimal quality in the visual interface and synthetic speech domains.

SUMMARY AND OBJECTS OF THE INVENTION

It is an object of the present invention to provide a synthetic speech utterance with a more natural intonation.

It is a further object of the present invention to provide a synthetic speech utterance with one or more desired vocal emotions.

It is a still further object of the present invention to provide a synthetic speech utterance with one or more desired vocal emotions by the mere selection of the one or more desired vocal emotions.

The foregoing and other advantages are provided by a method for automatic application of vocal emotion to text to be output by a text-to-speech system, said automatic vocal emotion application method comprising: i) selecting a portion of said text; ii) selecting a vocal emotion to be applied to said selected text; iii) obtaining vocal emotion parameters associated with said selected vocal emotion; and iv) applying said obtained vocal emotion parameters to said selected text to be output by said text-to-speech system.
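
The four steps of this method can be sketched as a table lookup applied to a selected span of text. Everything named below (`EmotionParameters`, `EMOTION_TABLE`, `Segment`, `apply_emotion`) is hypothetical, and the numeric values are arbitrary placeholders chosen only so the example runs; they are not the parameter values of the invention.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmotionParameters:
    pitch_mean: int      # average speaking pitch
    pitch_range: int     # pitch modulation
    volume: float
    speaking_rate: int


# Placeholder values for illustration only; NOT taken from the patent.
EMOTION_TABLE = {
    "anger": EmotionParameters(40, 60, 0.9, 200),
    "sadness": EmotionParameters(30, 10, 0.4, 120),
}


@dataclass(frozen=True)
class Segment:
    text: str
    parameters: Optional[EmotionParameters] = None  # None means default voice


def apply_emotion(text, start, end, emotion):
    """Steps i-iv: select text[start:end], choose an emotion, look up its
    predefined parameters, and attach them to the selected span."""
    params = EMOTION_TABLE[emotion]
    parts = [
        Segment(text[:start]),
        Segment(text[start:end], params),
        Segment(text[end:]),
    ]
    return [part for part in parts if part.text]


sentence = "Pete's goldfish was delicious."
start = sentence.index("was")
segments = apply_emotion(sentence, start, start + len("was"), "anger")
for segment in segments:
    print(repr(segment.text), segment.parameters)
```

The user supplies only the selection and the emotion name; the numeric parameters come from the predefined table, which is the sense in which the application is automatic.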

The foregoing and other advantages are also provided by an apparatus for automatic application of vocal emotion parameters to text to be output by a text-to-speech system, said automatic vocal emotion application apparatus comprising: i) a display device for displaying said text; ii) an input device for user selection of said text and for user selection of a vocal emotion to be applied to said selected text; iii) memory for holding said vocal emotion parameters associated with said selected vocal emotion; and iv) logic circuitry for obtaining said vocal emotion parameters associated with said selected vocal emotion from said memory and for applying said obtained vocal emotion parameters to said selected text to be output by said text-to-speech system.

Other objects, features and advantages of the present invention will be apparent from the accompanying drawings and from the detailed description which follows.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which:

FIG. 1 is a block diagram of a computer system which might utilize the present invention;

FIG. 2 is a screen display of the graphical user interface editor of the present invention;

FIG. 3 is a screen display of the graphical user interface editor of the present invention depicting an example of volume and duration text-to-speech modification;

FIG. 4 is a screen display of the graphical user interface editor of the present invention depicting an example of vocal emotion text-to-speech modification;

FIG. 5 is a flowchart of the graphical user interface editor to vocal emotion text-to-speech modification communication and translation of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 is a generalized block diagram of an appropriate computer system 10 which might utilize the present invention and includes a CPU/memory unit 11 that generally comprises a microprocessor, related logic circuitry, and memory circuitry. A keyboard 13, or other textual input device such as a write-on tablet or touch screen, provides input to the CPU/memory unit 11, as does input controller 15 which by way of example can be a mouse, a 2-D trackball, a joystick, etc. External storage 17, which can include fixed disk drives, floppy disk drives, memory cards, etc., is used for mass storage of programs and data. Display output is provided by display 19, which by way of example can be a video display or a liquid crystal display. Note that for some configurations of computer system 10, input device 13 and display 19 may be one and the same, e.g., display 19 may also be a tablet which can be pressed or written on for input purposes.

Referring now to FIG. 2, the preferred embodiment of the graphical user interface editor 201 of the present invention can be seen (note that the emotion/color/font style indications in parentheses are not shown in the screen display of the present invention and are only included in FIG. 2 for purposes of clarity of the present invention). Editor 201, shown residing within a window running on an Apple Macintosh computer in the preferred embodiment, provides the user with the capability to interactively manipulate text in such a way as to intuitively alter the vocal emotion of the synthetic speech generated from the text.

As will be explained more fully herein, graphical editor 201 provides for user modification of the volume and duration of speech synthesized text. As will also be explained more fully herein, graphical editor 201 also provides for user modification of the vocal emotion of speech synthesized text via selection buttons 211 through 217. User interaction is further provided by selection pointer 205, manipulable via input controller 15 of FIG. 1, and insertion point cursor 203.

Text Selection

In the preferred embodiment of the present invention, the user selects a word of text by manipulating input controller 15 so that pointer 205 is placed on or alongside the desired word and then initiating the necessary selection operation, e.g., depressing a button on the mouse in the preferred embodiment. Note that letters, words, phrases, sentences, etc., are all selectable in a similar fashion, by manipulating pointer 205 during the selection operation, as is well known in the art and commonly referred to as `clicking and dragging` or `double clicking`. Similarly, other well known text selection mechanisms, such as keyboard control of cursor 203, are equally applicable to the present invention.

Volume and Duration

Once a portion of text has been selected, the volume and duration of the resulting speech output can be modified by the user. In the preferred embodiment of the present invention, when a portion of text has been selected a box surrounding the selected portion of text is displayed. Note that other well known text selection display indicating mechanisms, such as reverse video, background highlighting, etc., are equally applicable to the present invention. In the preferred embodiment of the present invention, this surrounding selection box further includes three types of sizing grips or handles which can be utilized to modify the volume and duration of the selected portion of text.

Referring now to FIG. 3, the textual portion of the graphical editor 201 of FIG. 2 can be seen (with different textual examples than in the earlier figure). FIG. 3 depicts a series of selections and modifications of a sample sentence using the graphical editor of the present invention. Throughout this example, note the surrounding selection box 311 which is displayed whenever a portion of text is selected. Further, note the sizing grips or handles 313 through 317 on the surrounding selection box 311.

As was stated above, whenever a portion of text is selected, that portion becomes surrounded by a selection box 311 having handles 313 through 317. In the preferred embodiment of the present invention, manipulation of handle 313 affects the volume of the selected portion of text while manipulation of handle 317 affects the duration (for how long the text-to-speech system will play that portion of text) of the selected portion of text. In the preferred embodiment of the present invention, manipulation of handle 315 affects both the volume and duration of the selected portion of text.

By way of further explanation, manipulating handles 313-317 of surrounding selection box 311 provides an intuitive graphical metaphor for the desired result of the synthetic speech generated from the selected text. Manipulating handle 313 either raises or lowers the height of the selected portion of text and thereby alters the resulting synthetic text-to-speech system volume of that portion of text upon output through a loudspeaker. Similarly, manipulating handle 317 either lengthens or shortens the selected portion of text and thereby alters the resulting synthetic text-to-speech system duration of that portion of text upon output through a loudspeaker. Further, manipulating handle 315 affects both volume and duration by simultaneously affecting both the height and length of the selected portion of text.
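By way of illustration only, the handle behavior described above can be sketched in Python. This is a minimal model under stated assumptions, not the implementation of the preferred embodiment: the class `SelectedText`, its fields `height_scale` and `width_scale`, and the function `drag_handle` are hypothetical names, and the handle numerals of FIG. 3 are reused merely as labels.

```python
from dataclasses import dataclass

# Handle labels borrowed from FIG. 3; using them as constants is an assumption.
VOLUME_HANDLE = 313           # vertical drag: height, hence volume
VOLUME_DURATION_HANDLE = 315  # diagonal drag: height and length together
DURATION_HANDLE = 317         # horizontal drag: length, hence duration


@dataclass
class SelectedText:
    """Hypothetical model of a selected portion of text."""
    text: str
    height_scale: float = 1.0  # displayed height relative to default (tracks volume)
    width_scale: float = 1.0   # displayed length relative to default (tracks duration)


def drag_handle(sel: SelectedText, handle: int,
                dx: float = 0.0, dy: float = 0.0) -> SelectedText:
    """Return a new selection after dragging one handle.

    dx and dy are fractional changes in length and height
    (0.5 means 50% larger than the default size).
    """
    if handle not in (VOLUME_HANDLE, VOLUME_DURATION_HANDLE, DURATION_HANDLE):
        raise ValueError(f"unknown handle: {handle}")
    height, width = sel.height_scale, sel.width_scale
    if handle in (VOLUME_HANDLE, VOLUME_DURATION_HANDLE):
        height += dy  # only the vertical component affects volume
    if handle in (DURATION_HANDLE, VOLUME_DURATION_HANDLE):
        width += dx   # only the horizontal component affects duration
    return SelectedText(sel.text, height, width)
```

In this sketch, a drag on the volume handle ignores any horizontal movement and a drag on the duration handle ignores any vertical movement, so that each handle changes only the property it is described as controlling.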

Reviewing the example of FIG. 3, the first sentence 301, which states "Pete's goldfish was delicious." (intended to represent a comment by Pete's cat, of course), is shown in its original unaltered default or Normal condition (and is therefore displayed in black, as will be explained more fully below). In the second sentence 303 the same sentence as sentence 301 is shown after the word "was" has been selected. By way of explanation of the manipulation of volume and duration of synthetic speech generated from a text string, sample text string 303 comprising the sentence "Pete's goldfish was delicious." has had the word "was" selected according to the method described above. Again, once a portion of text has been selected, manipulation handles 313-317 are displayed on surrounding selection box 311. In this example, and according to the method described above, the resulting synthetic text-to-speech system output volume of the word "was" has been increased by manipulating volume handle 313 in an upward direction via pointer 205 and input controller 15. This increased volume is evident by comparing the height of the word "was" in text example 303 (before modification) to text example 305 (after modification). The word "was" in text example 305 is taller than the word "was" in text example 303 and will therefore be output at a louder volume by the synthetic text-to-speech system.

As a further example of the present invention, the word "goldfish" has been selected in text example 305, as is evident by selection box 311 and handles 313-317. In this example, and according to the method described above, the resulting synthetic text-to-speech system output duration of the word "goldfish" has been increased by manipulating duration handle 317 in a rightward direction via pointer 205 and input controller 15. This increased duration is evident by comparing the length of the word "goldfish" in text example 305 (before modification) to text example 307 (after modification). The word "goldfish" in text example 307 is longer than the word "goldfish" in text example 305 and will therefore be output for a longer duration by the synthetic text-to-speech system.

As a still further example of the graphical interface editor of the present invention, the word "Pete's" has been selected in text example 307, as is evident by selection box 311 and handles 313-317. In this example, and according to the method described above, the resulting synthetic text-to-speech system output volume and duration of the word "Pete's" has been increased by manipulating volume/duration handle 315 in a diagonally upward and rightward direction via pointer 205 and input controller 15. This increased volume and duration is evident by comparing the height and length of the word "Pete's" in text example 307 (before modification) to text example 309 (after modification). The word "Pete's" in text example 309 is taller and longer than the word "Pete's" in text example 307 and will therefore be output at a louder volume and for a longer duration by the synthetic text-to-speech system.

Thus, in the graphical interface editor of the present invention, the control of text volume and duration, as output from the text-to-speech system, takes advantage of the two natural intuitive spatial axes of a computer display: volume is mapped to the vertical axis and duration to the horizontal axis.

Further, note button 218 of FIG. 2. If a user desires to return a portion of text to its default size (volume and duration) settings, once that portion has again been selected, rather than requiring the user to manipulate any of the handles 313-317, the user need merely select button 218, again via pointer 205 and input controller 15 of FIG. 1, which automatically returns the selected text to its default size and volume/duration settings.

Emotion

Once a portion of text has been selected (again, according to the methods explained above as well as other well known methods), the vocal emotion of that selected text can be modified by the user. Again, in the preferred embodiment of the present invention, when a portion of text has been selected a selection box surrounding the selected portion of text is displayed.

Referring now to FIG. 4 (note that the emotion/color/font style indications in parentheses are not shown in the screen display of the present invention and are only included in the figure for purposes of clarity of the present invention), as with the examples of FIG. 3, only the textual portion of the graphical editor 201 of FIG. 2 can be seen (with further textual examples than the earlier figures). By comparison to text example 309 of FIG. 3, the first sentence 401 of FIG. 4 is shown after the text has been selected and an emotion (`Happy` in this example) has been selected or specified. In the preferred embodiment of the present invention, when a portion of text has been selected, referring again to the graphical interface editor 201 of FIG. 2, an emotional state or intonation can be chosen via pointer 205, input controller 15, and emotion selection buttons 211-217. As such, referring back to FIG. 4, sentence 401 can be specified as `Happy` via selection button 212 of FIG. 2. Conversely, after the text has been selected, sentence 403 of FIG. 4 comprising "You'll have no dinner tonight." (intended to be Pete's response to his cat) can likewise be specified as `Angry` via selection button 211 of FIG. 2. Note also the variations in volume and duration (evident by the variations in text height and length of the sentence) previously specified according to the methods described above.

In the preferred embodiment of the present invention, when a portion of text is specified as having a certain emotional quality, the specified text is displayed in a color intended to convey that emotion to the user of the text-to-speech or graphical interface editor system. For example, in the preferred embodiment of the present invention, sentence 401 of FIG. 4 was specified as `Happy`, via emotion selection button 212, and is therefore displayed in yellow (not shown in the figure--but indicated within the parentheses) while sentence 403 was specified as `Angry`, via emotion selection button 211, and is therefore displayed in red (also not shown in the figure--but indicated within the parentheses).

By comparison, sentence 403 is specified according to the default emotion of `Normal` and is therefore displayed in black (not shown in the figure--but indicated within the parentheses). Note that although the emotion of `Normal` is the default emotion (meaning that `Normal` is the default emotional specification given all text until some other emotion is specified), selection of the `Normal` emotion selection button 217 is useful whenever a portion of text has previously received a different emotional specification and the user now desires to return that portion to a normal or neutral emotional characterization.

Note that the present invention is not limited to the particular vocal emotions indicated by emotion selection buttons 211-217 of FIG. 2. Other vocal emotions, either in place of or in addition to those shown in FIG. 2 are equally applicable to the present invention. Selection of other vocal emotions in place of or in addition to those of FIG. 2 would be a simple modification by the system implementor and/or the user to the graphical user editor interface of the present invention.

Note further that the particular colors/font styles indicating vocal emotional states of the preferred embodiment are user alterable such that if a particular user preferred to have pink indicate `Happy`, for example, this would be a simple modification (by the system implementor and/or by the user) to the graphical interface editor (which would then alter any displayed text having a vocal emotion of `Happy` specified). This customization capability provides for personal preferences of different users and also provides for differences in cultural interpretations of various colors. Further, note that some vocal emotions are particularly amenable to textual display indicia rather than, or in addition to, color representation. For example, the vocal emotion of `Emphasis` (see emotion selection button 216 of FIG. 2) is particularly well-suited to textual display in boldface, rather than using a particular color to indicate that vocal emotion (also indicated within the parentheses in FIG. 2). Again, color choice and font style (e.g., italic, boldface, underline, etc.) are system implementor and/or user definable/selectable thus making the present invention more broadly applicable and user friendly.
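By way of illustration only, the association between vocal emotions and display indicia, together with its user customization, can be sketched in Python as a lookup table with overrides. This is a minimal sketch under stated assumptions: `DisplayStyle`, `DEFAULT_STYLES`, and `style_for` are hypothetical names, only the emotions whose indicia are stated above are included, and displaying `Emphasis` in black boldface is an assumption, since the description specifies only boldface.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DisplayStyle:
    """Hypothetical display indicia for one vocal emotion."""
    color: str = "black"
    bold: bool = False


# Defaults taken from the examples in the description. Emotions whose
# indicia are not stated there are deliberately left out.
DEFAULT_STYLES: Dict[str, DisplayStyle] = {
    "Normal": DisplayStyle(color="black"),
    "Happy": DisplayStyle(color="yellow"),
    "Angry": DisplayStyle(color="red"),
    "Emphasis": DisplayStyle(bold=True),  # black is an assumption here
}


def style_for(emotion: str,
              user_overrides: Optional[Dict[str, DisplayStyle]] = None) -> DisplayStyle:
    """Look up the display style for an emotion, preferring user overrides."""
    styles = {**DEFAULT_STYLES, **(user_overrides or {})}
    return styles[emotion]


# A user who prefers pink for `Happy` supplies a single override.
pink_happy = {"Happy": DisplayStyle(color="pink")}
```

Because overrides are merged over the defaults at lookup time, a change of preference in this sketch applies to any text already carrying that emotion, consistent with the behavior described above.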

Graphical User Interface to Speech Synthesizer Translation

The preferred manner in which this invention would be implemented is in the context of creating vocal emotions that may be associated with text that is to be read by a text-to-speech synthesizer. The user would be provided with a list or display, as was explained more fully above, of the controls available for the specification of vocal emotions. To explain more fully the preferred embodiment of the present invention, the following reviews the specifics of how speech synthesizer parameters are specified for the text receiving vocal emotion qualities.

The translation of graphical modifications to speech synthesizer volume and duration parameters is a straight-forward application of linear scaling and offset. Visually, graphical modifications to the text (as was explained above with reference to FIG. 3) are displayed in a font at x % of normal size horizontally and y %