Speech synthesis and analysis of dialects    
United States Patent 5636325
Inventor(s): Farrett; Peter W. (Austin, TX)
Abstract: A set of intonation intervals for a chosen dialect is applied to the intonational contour of a phoneme string derived from a single set of stored linguistic units, e.g., phonemes. Sets of intonational intervals are stored to simulate or recognize different dialects or languages from a single set of stored phonemes. The interval rules preferably use a prosodic analysis of the phoneme string or other cues to apply a given interval to the phoneme string. A second set of interval data is provided for semantic information. The speech system is based on the observation that each dialect and language possesses its own set of musical relationships or intonation intervals. These musical relationships are used by a human listener to identify the particular dialect or language. The speech system may be either a speech synthesis or speech analysis tool or may be a combined speech synthesis/analysis system.
   














Owner/Assignee     International Business Machines Corporation (Armonk, NY)
Publication Date     June 3, 1997
Application Number     08/176,819
Filing Date     January 5, 1994
US Classification     704/258 704/231 704/260 704/267 704/268
Int'l Classification     G10L 009/00
Examiner     MacDonald; Allen R.
Assistant Examiner     Mattson; Robert
Attorney/Law Firm     LaBaw; Jeffrey S.
Parent Case     This is a continuation of application Ser. No. 07/976,151 filed Nov. 13, 1992, now abandoned.
USPTO Field of Search     381/29 381/30 381/31 381/32 381/33 381/34 381/35 381/36 381/37 381/38 381/39 381/40 381/41 381/42 381/43 381/44 381/45 381/46 381/47 381/48 381/49 381/50 381/51 381/52 381/53 395/2 395/2.2 395/2.4 395/2.55 395/2.6 395/2.67 395/2.69 395/2.76 395/2.77 360/135
U.S. References
Patent No.   Inventor    Classification   Date
5133010      Borth       704/264          Jul. 1992
5113449      Blanton     704/261          May 1992
5029211      Ozawa       704/266          Jul. 1991
4980917      Hutchins    704/254          Dec. 1990
4908727      Ezaki       360/135          Mar. 1990
4896359      Yamamoto    704/260          Jan. 1990
4852170      Bordeaux    704/277          Jul. 1989
4833718      Sprague     704/229          May 1989
4802223      Lin         704/207          Jan. 1989
4692941      Jacks       704/260          Sep. 1987
4624012      Lin         704/261          Nov. 1986
4613944      Hashimoto   704/277          Sep. 1986
4455615      Tanimoto    704/277          Jun. 1984
Claims
 


I claim:

1. A method of operating a speech synthesis system comprising the steps of:

generating a string of linguistic units containing pitch data by selecting linguistic units from a first memory segment of the system which correspond to characters in a text string and concatenating the selected linguistic units together in a second memory segment of the system;

selecting locations within the pitch data of the string of linguistic units;

retrieving a first set of dialect intervals for a first selected dialect, the first set of dialect intervals selected from a set of melodic intervals as being indicative of the first selected dialect and stored in a dialect table in a third memory segment of the system; and

applying the first set of dialect intervals to the pitch data at the selected locations so that synthesized speech of the first selected dialect is produced.

2. The method as recited in claim 1 wherein the applying step comprises changing at least one interval at a selected location in the pitch data to at least one dialect interval of the first set of dialect intervals.

3. A method of operating a speech recognition system comprising the steps of:

providing a digitized speech sample of human speech;

selecting a set of melodic intervals in the digitized speech sample;

retrieving a first set of dialect intervals for a first selected dialect, the first set of dialect intervals being melodic intervals which are indicative of the first selected dialect and stored in a dialect table; and

comparing the set of melodic intervals to the first set of dialect intervals to determine whether the digitized speech sample is from human speech of the first selected dialect.

4. The method as recited in claim 3 which further comprises the step of sending a message to the user interface of the system if there is a match between the set of melodic intervals and the first set of dialect intervals.

5. The method as recited in claim 3 which further comprises the steps of:

retrieving a second set of dialect intervals for a second selected dialect;

comparing the set of melodic intervals to the second set of dialect intervals to determine whether the digitized speech sample is from human speech of the second selected dialect; and,

sending a message to a user interface of the system indicating that there is a match between the set of melodic intervals and the second set of dialect intervals.

6. The method as recited in claim 3 wherein the selecting step comprises identifying a melodic interval in the digitized speech sample which exceeds a predetermined threshold as a melodic interval in the set of melodic intervals.

7. The method as recited in claim 3 which further comprises the steps of:

comparing the digitized speech sample with a code book which contains stored speech samples corresponding to phonemes to generate a string of phonemes corresponding to the digitized speech sample; and

comparing the digitized speech sample to pitch data in the string of phonemes to select the set of melodic intervals.

8. The method as recited in claim 3 wherein the selecting step comprises the steps of:

analyzing the digitized speech sample to generate prosodic data; and,

selecting the set of melodic intervals according to the prosodic data.

9. The method as recited in claim 1 wherein the dialect table includes sets of dialect intervals for a plurality of dialects.

10. The method as recited in claim 1 wherein the dialect table includes a set of dialect intervals for a first language.

11. The method as recited in claim 9 wherein the sets of dialect intervals are based on the diatonic scale.

12. The method as recited in claim 1 which further comprises the steps of:

generating prosody data for the string of linguistic units according to prosody rules of the system; and

altering the pitch data within the string of linguistic units according to the prosody data;

wherein the selected locations are chosen within the altered pitch data.

13. The method as recited in claim 1 which further comprises the steps of:

selecting a set of keywords located in the text string; and

locating a set of locations which correspond to the keywords in the string of linguistic units;

wherein the selected locations are selected according to locations in the pitch data which correspond to the locations of the set of keywords in the text string.

14. The method as recited in claim 2 which further comprises the steps of:

retrieving a second set of dialect intervals for a second selected dialect, the second set of dialect intervals selected from a set of melodic intervals as being indicative of the second selected dialect stored in the dialect table; and

changing at least one melodic interval at a selected location in the pitch data to one of the second set of dialect intervals to produce synthesized speech of the second selected dialect.

15. The method as recited in claim 5 which further comprises the steps of:

determining a probability of match for the first and second selected dialects; and,

sending a message to a user interface indicating the probability that the string of linguistic units represents speech of the first or second dialect.

16. The method as recited in claim 1 wherein the first dialect is British English and the first set of dialect intervals comprises an octave, a major seventh and a minor seventh.

17. The method as recited in claim 1 wherein the first dialect is Japanese and the first set of dialect intervals comprises a perfect fifth, a perfect fourth, a major second and a minor second.

18. The method as recited in claim 1 wherein the first dialect is Irish and the first set of dialect intervals comprises a major sixth, a minor sixth and a major third.

19. The method as recited in claim 1 wherein the first dialect is Midwestern English and the first set of dialect intervals comprises a perfect fifth, a major third, a perfect fourth and a minor third.

20. A computer program product on a computer readable medium for speech synthesis, the computer program product executable in a computer system comprising:

program code means for generating a string of linguistic units containing pitch data by selecting linguistic units from a first memory segment of the system which correspond to characters in a text string and concatenating the selected linguistic units together in a second memory segment of the system;

program code means for selecting locations within the pitch data of the string of linguistic units;

program code means for retrieving a first set of dialect intervals for a first selected dialect, the first set of dialect intervals selected from a set of melodic intervals as being indicative of the first selected dialect stored in a dialect table in a third memory segment of the system; and

program code means for applying the first set of dialect intervals to the set of melodic intervals.

21. The product as recited in claim 20 wherein the applying means changes at least one melodic interval at a selected location in the pitch data to at least one dialect interval of the first set of dialect intervals.

22. A computer program product in a computer readable medium for speech recognition, the computer program product executable in a computer system, comprising:

program code means for providing a digitized speech sample of human speech;

program code means for selecting a set of melodic intervals in the digitized speech sample;

program code means for retrieving a first set of dialect intervals for a first selected dialect, the first set of dialect intervals being melodic intervals which are indicative of the first selected dialect and stored in a dialect table in a third memory segment of the system; and

program code means for comparing the set of melodic intervals to the first set of dialect intervals to determine whether the digitized speech sample is from speech of the first selected dialect.

23. The product as recited in claim 22 which further comprises program code means for sending a message to a user interface of the system if there is a match between the set of melodic intervals and the first set of dialect intervals.

24. The product as recited in claim 22 which further comprises:

program code means for retrieving a second set of dialect intervals for a second selected dialect;

program code means for comparing the set of melodic intervals to the second set of dialect intervals to determine whether the digitized speech sample is from human speech of the second selected dialect; and,

program code means for sending a message to a user interface of the system indicating that there is a match between the set of melodic intervals and the second set of dialect intervals.

25. The product as recited in claim 22 which further comprises:

program code means for comparing the digitized speech sample with a code book which contains stored speech samples corresponding to phonemes to generate a string of phonemes corresponding to the digitized speech sample; and

program code means for comparing the digitized speech sample to pitch data in the string of phonemes to select the set of melodic intervals.

26. The product as recited in claim 22 wherein the selecting means comprises:

program code means for analyzing the digitized speech sample to generate prosodic data; and,

program code means for selecting the melodic intervals according to the prosodic data.

27. The product as recited in claim 21 wherein the identifying means comprises:

program code means for generating prosody data for the string of linguistic units according to prosody rules of the system; and

program code means for altering the pitch data within the string of linguistic units according to the prosody data;

wherein the selected locations are chosen within the altered pitch data.

28. A speech synthesis system comprising:

a memory for storing sets of instructions to perform speech processing and speech data;

a processor coupled to the memory for executing the sets of instructions;

means for generating a string of linguistic units containing pitch data by selecting dialect neutral linguistic units from a first memory segment of the system which correspond to characters in a text string and concatenating the selected linguistic units together in a second memory segment of the system;

means for selecting locations within the pitch data of the string of linguistic units;

means for retrieving a first set of dialect intervals for a first selected dialect, the first set of dialect intervals selected from a set of melodic intervals as being indicative of the first selected dialect and stored in a dialect table in a third memory; and

means for applying the first set of dialect intervals to the pitch data at the selected locations so that synthesized speech of the first selected dialect is produced.

29. The system as recited in claim 28 wherein the applying means changes at least one melodic interval at a selected location in the pitch data to at least one dialect interval of the first set of dialect intervals.

30. A speech recognition system comprising:

a memory for storing sets of instructions to perform speech processing and speech data;

a processor coupled to the memory for executing the sets of instructions;

means for providing a digitized speech sample of human speech;

means for selecting a set of melodic intervals in the digitized speech sample;

means for retrieving a first set of dialect intervals for a first selected dialect, the first set of dialect intervals being melodic intervals which are indicative of the first selected dialect and stored in a dialect table; and

means for comparing the set of melodic intervals to the first set of dialect intervals to determine whether the digitized speech sample is from human speech of the first selected dialect.

31. The system as recited in claim 30 which further comprises means for sending a message to a user interface of the system if there is a match between the set of melodic intervals and the first set of dialect intervals.

32. The system as recited in claim 30 which further comprises:

means for retrieving a second set of dialect intervals for a second selected dialect;

means for comparing the set of melodic intervals to the second set of dialect intervals to determine whether the digitized speech sample is from human speech of the second selected dialect; and,

means for sending a message to a user interface of the system indicating that there is a match between the set of melodic intervals and the second set of dialect intervals.

33. The system as recited in claim 30 wherein the selecting means identifies a melodic interval in the digitized speech sample which exceeds a predetermined threshold as a melodic interval in the set of melodic intervals.

34. The system as recited in claim 30 wherein the selecting means comprises:

means for comparing the digitized speech sample with a code book which contains stored speech samples corresponding to phonemes to generate a string of phonemes corresponding to the digitized speech sample; and

means for comparing the digitized speech sample to pitch data in the string of phonemes to select the set of melodic intervals.

35. The system as recited in claim 30 wherein the identifying means comprises:

means for analyzing the digitized speech sample to generate prosodic data; and,

means for selecting the set of melodic intervals according to the prosodic data.

36. The system as recited in claim 28 wherein the dialect table includes sets of dialect intervals for a plurality of dialects.

37. The system as recited in claim 28 wherein the dialect table includes a set of dialect intervals for a first language.

38. The system as recited in claim 29 wherein the identifying means comprises:

means for generating prosody data for the string of linguistic units according to prosody rules of the system; and

means for altering the pitch data within the string of linguistic units according to the prosody data;

wherein the selected locations are chosen within the altered pitch data.

39. The system as recited in claim 28 wherein the first dialect is British English and the first set of dialect intervals comprises an octave, a major seventh and a minor seventh.

40. The system as recited in claim 28 wherein the first dialect is Japanese and the first set of dialect intervals comprises a perfect fifth, a perfect fourth, a major second and a minor second.

41. The system as recited in claim 28 wherein the first dialect is Irish and the first set of dialect intervals comprises a major sixth, a minor sixth and a major third.

42. The system as recited in claim 28 wherein the first dialect is Midwestern English and the first set of dialect intervals comprises a perfect fifth, a major third, a perfect fourth and a minor third.
Description
 


BACKGROUND OF THE INVENTION

This invention generally relates to improvements in speech synthesis and analysis. More particularly, it relates to improvements in handling a plurality of dialects in a speech I/O system having a single set of stored phonemes.


There has been substantial research in the field of text-to-speech or speech-to-text input/output (I/O) systems in the past decades. Yet, analyzing, synthesizing and coding human speech has proven to be a very difficult problem whose complete solution has continued to elude researchers and engineers. The complexity of the frequency spectrum of phonemes in speech, the number of different phonemes in the same language, the number of different dialects and languages and the variety of ways the sounds are formed by different speakers are all factors which add to the problem. For a speech program, it is difficult to either identify a string of phonemes spoken continuously by a random human speaker or to synthesize speech from a set of phonemes which will be identified as a set of words by those hearing them.

Most text-to-speech conversion systems convert an input text string into a corresponding string of linguistic units such as consonant and vowel phonemes, or phoneme variants such as allophones, diphones, or triphones. An allophone is a variant of the phoneme based on surrounding sounds. For example, the aspirated "p" of the word "pawn" and the unaspirated "p" of the word "spawn" are both allophones of the phoneme "P". Phonemes are the basic building blocks of speech corresponding to the sounds of a particular language or dialect. Diphones and triphones are concatenations of phonemes and are related to allophones in that the pronunciation of each of the phonemes depends on the other phonemes, diphones or triphones. Two techniques, "synthesis by rule" and linear predictive coding (LPC) or a variation thereof, are generally used for converting the phonemes into synthetic speech. Other speech synthesis and analysis techniques are known to the art.

For a speech synthesis system, a text string is the initial input, which is parsed into individual words and punctuation characters. Generally, a dictionary lookup is performed for those words which do not follow the standard system rules of pronunciation to convert the text of these words to a set of phonemes or other linguistic units. The remainder of the text is converted to a set of phonemes according to the text-to-phoneme rules.
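The two-stage conversion described above can be illustrated with a minimal sketch. The exception dictionary, the letter rules and the phoneme symbols below are hypothetical placeholders chosen for illustration only; they are not taken from any particular speech system, and a practical rule set would be far larger and context sensitive.

```python
import re

# Hypothetical exception dictionary: words that do not follow the letter rules.
EXCEPTIONS = {
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
}

# Hypothetical, deliberately tiny letter-to-phoneme rules (one letter, one phoneme).
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "g": "G",
    "i": "IH", "n": "N", "o": "AA", "s": "S", "t": "T",
}


def text_to_phonemes(text):
    """Parse text into words and punctuation, then map each word to phonemes.

    Returns a list of (token, phonemes) pairs. Punctuation tokens carry an
    empty phoneme list so that later prosodic analysis can still see them.
    """
    tokens = re.findall(r"[A-Za-z]+|[.,;?!]", text)
    result = []
    for token in tokens:
        word = token.lower()
        if not word.isalpha():
            # Punctuation character: no phonemes, kept as a prosodic cue.
            result.append((token, []))
        elif word in EXCEPTIONS:
            # Dictionary lookup for words that break the standard rules.
            result.append((token, list(EXCEPTIONS[word])))
        else:
            # Remaining words are converted by the letter rules.
            result.append((token, [LETTER_RULES[ch] for ch in word if ch in LETTER_RULES]))
    return result
```

For example, `text_to_phonemes("One cat.")` takes "One" from the exception dictionary, converts "cat" by rule, and keeps the period as a token with no phonemes.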

Transitions between the individual phonemes in the phoneme string developed from the dictionary lookup and text-to-phoneme conversion must be developed if the synthesized speech is not to sound unnaturally discontinuous from one phoneme to the next. It is well known that the pronunciation of a particular phoneme is context dependent, i.e., the pronunciation depends upon what phonemes precede and follow the phoneme. If allophones, diphones or triphones are used as the linguistic unit, the transitions between at least some phonemes may be less harsh, as the relationship with the surrounding phonemes is part of the linguistic unit. Nonetheless, a more pleasing result will be accomplished if transitions are smoothed between linguistic units. Smoothing the transitions is usually accomplished by choosing a stored transition curve from a table of transitions or by an interpolation technique.
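As an illustration of the interpolation alternative, the following sketch fills the gap between two adjacent linguistic units with evenly spaced pitch values. Linear interpolation and the function name `smooth_transition` are assumptions made for this example; the text does not specify which interpolation technique is used.

```python
def smooth_transition(pitch_a, pitch_b, steps):
    """Return `steps` intermediate pitch values (Hz) between two units.

    Assumes simple linear interpolation; the end points themselves are not
    included in the returned list.
    """
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return [pitch_a + (pitch_b - pitch_a) * k / (steps + 1) for k in range(1, steps + 1)]
```

Calling `smooth_transition(100.0, 200.0, 3)` yields three values that rise evenly from the first pitch toward the second.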

A prosodic routine is included in many prior art text-to-speech systems. These routines determine the duration and fundamental frequency pattern of the linguistic units in the text string, typically on a sentence level. Prosodic routines can be written for other portions of the text string such as phrases. The prosodic analyzer section will identify clauses within text sentences by locating punctuation and conjunctions. Keywords such as pronouns, prepositions and articles are used to determine the sentence structure. Once the sentence structure is detected, the prosody rules can be applied to the phoneme string which resulted from the dictionary lookup and the text-to-phoneme rules. The parsing of the text string into phonemes and prosody determination steps may be performed in different orders in different speech systems.

The prosody information, phonemes and transitions are converted into formant and pitch parameters. A speech synthesizer uses the parameters to generate a synthetic speech waveform. Formants are used to characterize the successive maxima in the speech spectrum: the first formant (f.sub.1) for the lowest resonance frequency, the second formant (f.sub.2) for the next lowest resonance frequency, the third formant (f.sub.3) for the third lowest resonance frequency, etc. Generally, the fundamental pitch, f.sub.0, and the first three formants, f.sub.1, f.sub.2 and f.sub.3, will be adequate for intelligibility. The pitch and formant data for each phoneme can be stored in a lookup table. Alternatively, the pitch and formant data for large sets of phonemes, allophones, etc. can be efficiently stored using code books of parameters selected using vector quantization methods. The intended result is an intonational contour which gives the synthesized speech an approximation to the rhythm and melody of human speech.

In a speech recognition system, a digitized audio signal is sampled many times per second to match the signal to code books to identify the individual phonemes which comprise the waveform. Transitions between phonemes and words are determined as well as prosodic information such as the punctuation in the sentences. A phoneme is easily related to an ASCII character. The output of a speech recognition system is usually a text string, in ASCII or other character representation, but can be some other predetermined output. Techniques similar to those used in speech synthesis, e.g., LPC, are used in speech recognition. Indeed, many speech systems are combined speech analysis/synthesis systems where a learning process analyzing speech samples is used to generate the code books subsequently used to synthesize speech from a text string.

One of the more interesting problems in speech synthesis and analysis is the different dialects and languages in human speech. Regardless of the storage method used, a huge amount of data is required for adequate speech synthesis even for a single voice. When a speech system is to produce or analyze a variety of dialects, the storage and cost problems can be multiplied for each new dialect. For example, some prior art systems use stored speech waveforms generated by a speaker of a desired dialect to produce the synthesized speech. It would be relatively easy to extend such a system for several dialects or other speech characteristics such as male or female by having several sets of waveforms generated by speakers of the dialects the system is to emulate. Storage, however, becomes a problem.

Further, it is desirable to efficiently switch from one dialect or language to the next. It might be possible to produce speech of a first dialect from a first set of waveforms and then, when a second dialect is to be emulated, dump all of the first set of waveforms from active memory and load a second set of waveforms. Given the vast amount of data required, however, this would not be quickly accomplished. Thus, it would be difficult in such a system, given limited memory, to simulate more than one dialect at a time.

One prior art speech system teaches that a single set of speech data can be used to generate multiple voices by altering the pitch or formant data according to an algorithm or ratio. The method separates the pitch period, the formants which model the vocal tract and the speech rate as independent factors. The voice characteristics of the synthesized speech from the source are then modified by varying the magnitudes of the signal sampling rate, the pitch period and the speech rate or timing in a preselected manner depending on the desired output voice characteristics for the output synthesized speech. This technique is used to change the apparent sex and/or species of the synthesized speaker, but does not address different dialects or languages.

SUMMARY OF THE INVENTION

It is therefore an object of the invention to minimize storage requirements of producing or analyzing speech samples from a plurality of dialects.

It is another object of the invention to produce or analyze speech samples of a plurality of dialects concurrently.

These and other objects and features of the invention are accomplished by applying a set of intonation interval and timing parameters for a chosen dialect from sets of data for a plurality of dialects to a single set of stored linguistic units, e.g., phonemes. The speech system is based on the observation that each dialect and language possesses its own set of musical relationships, e.g., intonation intervals. These musical relationships are used by a human listener to identify the particular dialect or language. The speech system may be either a speech synthesis or speech analysis tool or may be a combined speech synthesis/analysis system. After the text string or speech sample has been differentiated into a string of phonemes, a dialect table lookup is performed. In the case of a text string which is to be synthesized into speech, the user or speech system chooses a particular dialect for output. The table lookup extracts the interval and timing information for the selected dialect and applies them to the phoneme string according to interval rules. The interval rules use the prosodic analysis of the phoneme string or other cues to apply a given interval to the phoneme string. A separate semantic table lookup may be performed for semantic information, i.e., relating to punctuation. The semantic interval and timing information found are applied to the phoneme string according to semantic interval rules using the prosodic analysis.
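The dialect table lookup for synthesis may be sketched as follows. The interval names for each dialect are those recited in the claims. The frequency ratios are common just-intonation values assumed for this example and are not necessarily those of FIG. 5B; the nearest-interval rule and the names `INTERVAL_RATIOS`, `DIALECT_TABLE` and `apply_dialect` are likewise hypothetical stand-ins for the interval rules, which are not detailed here. Timing information is omitted.

```python
from fractions import Fraction

# Assumed just-intonation frequency ratios for the named melodic intervals.
INTERVAL_RATIOS = {
    "minor second": Fraction(16, 15), "major second": Fraction(9, 8),
    "minor third": Fraction(6, 5), "major third": Fraction(5, 4),
    "perfect fourth": Fraction(4, 3), "perfect fifth": Fraction(3, 2),
    "minor sixth": Fraction(8, 5), "major sixth": Fraction(5, 3),
    "minor seventh": Fraction(9, 5), "major seventh": Fraction(15, 8),
    "octave": Fraction(2, 1),
}

# Dialect table: interval sets as recited in the claims.
DIALECT_TABLE = {
    "British English": ["octave", "major seventh", "minor seventh"],
    "Japanese": ["perfect fifth", "perfect fourth", "major second", "minor second"],
    "Irish": ["major sixth", "minor sixth", "major third"],
    "Midwestern English": ["perfect fifth", "major third", "perfect fourth", "minor third"],
}


def apply_dialect(pitches, locations, dialect):
    """Replace the interval ending at each selected location with a dialect interval.

    `pitches` is a list of pitch values in Hz and `locations` holds indices into
    it. The interval between pitches[i - 1] and pitches[i] is changed to the
    nearest interval of the chosen dialect, keeping its rising or falling
    direction. Choosing the nearest interval is an assumption of this sketch.
    """
    ratios = [float(INTERVAL_RATIOS[name]) for name in DIALECT_TABLE[dialect]]
    out = list(pitches)
    for i in sorted(locations):
        if i <= 0 or i >= len(out):
            continue  # no preceding pitch to form an interval with
        prev = out[i - 1]
        current = out[i] / prev
        rising = current >= 1.0
        size = current if rising else 1.0 / current
        nearest = min(ratios, key=lambda r: abs(r - size))
        out[i] = prev * nearest if rising else prev / nearest
    return out
```

With this sketch the same pitch data, e.g. `[100.0, 140.0, 120.0]` with location `1` selected, is moved to a perfect fourth for "Midwestern English" and to a minor seventh for "British English", so a single stored contour serves both dialects.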

For an analysis of a speech sample in recognition mode, the speech system will compare the speech sample to successive sets of interval and timing information for the various dialects retrieved by a table lookup. Alternatively, the speech system will compare the stored waveform of the captured speech sample to a waveform assembled from the stored phonemes. The differences between the two waveforms are used in the table lookup and compare step to identify the dialect of the speaker.
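The first recognition approach, comparing the sample against successive sets of dialect intervals, might be sketched as below. Several simplifications are assumed: intervals are measured in equal-tempered semitones rather than diatonic frequency ratios, timing information is ignored, and the score is simply the fraction of observed intervals that fall within a tolerance of some interval of the dialect. The names `DIALECT_SEMITONES`, `melodic_intervals` and `score_dialects`, and the threshold and tolerance values, are hypothetical.

```python
import math

# Interval sets from the claims, expressed as sizes in semitones.
DIALECT_SEMITONES = {
    "British English": [12, 11, 10],    # octave, major seventh, minor seventh
    "Japanese": [7, 5, 2, 1],           # fifth, fourth, major and minor second
    "Irish": [9, 8, 4],                 # major sixth, minor sixth, major third
    "Midwestern English": [7, 4, 5, 3], # fifth, major third, fourth, minor third
}


def melodic_intervals(pitches, threshold=0.5):
    """Return sizes (in semitones) of consecutive pitch jumps above `threshold`."""
    sizes = []
    for a, b in zip(pitches, pitches[1:]):
        size = abs(12.0 * math.log2(b / a))
        if size > threshold:
            sizes.append(size)
    return sizes


def score_dialects(pitches, tolerance=0.5):
    """Score each dialect by the fraction of observed intervals it accounts for."""
    observed = melodic_intervals(pitches)
    scores = {}
    for dialect, intervals in DIALECT_SEMITONES.items():
        hits = sum(1 for s in observed
                   if any(abs(s - d) <= tolerance for d in intervals))
        scores[dialect] = hits / len(observed) if observed else 0.0
    return scores
```

The highest-scoring entry, e.g. `max(scores, key=scores.get)`, could then be reported to the user interface, and the scores themselves give a crude indication of how strongly each dialect matches.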

For speech synthesis, the system also envisions a transition smoothing table lookup. After the best transition curve is chosen from a table of transition curves, a constant may be added to the resulting intonational curve according to the particular phonemes which precede and follow the transition.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects and features will become more easily understood by reference to the attached drawings and following description.

FIG. 1 is a representation of a personal computer system including the system unit, keyboard, mouse and display.

FIG. 2 is a block diagram of the computer system components in FIG. 1.

FIG. 3 is a block diagram of the speech analysis/synthesis system according to the present invention.

FIGS. 4A and 4B are flow diagrams of the table lookup process for the speech synthesis and analysis procedures respectively in the present invention.

FIG. 5A is a table of the frequency values of a portion of the diatonic scale which is used for human speech.

FIG. 5B is a table of intervals in the diatonic scale with their respective frequency ratios.

FIG. 6 is a representation of the lookup table including intervals and timing information for a plurality of dialects.

FIG. 6A depicts a text string and a representation of the phonemes, transitions, and prosodic and keyword events into which the text string is parsed.

FIG. 7 depicts an audio controller card which can be used to control the speaker or microphone used in the present invention.

FIG. 8 is a flow diagram of the transition smoothing process in the present invention.

DETAILED DESCRIPTION OF THE DRAWINGS

The invention can be implemented on a variety of computer platforms. The processor unit could be, for example, a personal computer, a minicomputer or a mainframe computer running a plurality of computer terminals. The computer may be a standalone system, or part of a network such as a local area network, a wide area network or a larger teleprocessing system. Most preferably, however, the invention described below is implemented on a standalone multimedia personal computer, such as one of IBM's PS/2 series, although the specific choice of computer is limited only by the memory and disk storage requirements. For additional information on IBM's PS/2 series of computers, the reader is referred to Technical Reference Manual Personal System/2 (Model 50, 60 Systems), IBM Corporation, Part Number 68X2224, Order

In FIG. 1, a personal computer 10, comprising a system unit 11, a keyboard 12, a mouse 13 and a display 14, is depicted. The keyboard 12 and mouse 13 are user input devices. The screen 16 of display device 14 is used to present visual feedback to the user on the results of the computer operations. Typically, the graphical user interface supported by the operating system allows the user to use a point and shoot input method, moving the pointer 15 to an icon representing a data object at a particular location on the screen and pressing one of the mouse buttons to make a user command selection. In the case of this invention, the data object may be an audio speech sample or a speech library comprising a plurality of audio speech signals. Not depicted is the speaker, which resides in the system unit 11 and is used to produce the synthesized speech. Alternatively, the synthesized speech could be produced on external speakers coupled to the audio controller 31 (FIG. 2).

FIG. 2 shows a block diagram of the components of the personal computer shown in FIG. 1. The system unit 11 includes a system bus or system busses 21 to which various components are coupled and by which communication between the various components is accomplished. A microprocessor 22 is connected to the system bus 21 and is supported by read only memory (ROM) 23 and random access memory (RAM) 24, also connected to system bus 21. The microprocessor 22 in the IBM PS/2 series of computers is one of the Intel family of microprocessors, including the 8088, 286, 386 or 486 microprocessors. However, other microprocessors may be used in the specific computer, including, but not limited to, Motorola's family of microprocessors such as the 68000, 68020 or 68030 microprocessors and various Reduced Instruction Set Computer (RISC) microprocessors manufactured by IBM, Hewlett Packard, Sun, Intel, Motorola and others.

The ROM 23 contains, among other code, the Basic Input/Output System (BIOS), which controls basic hardware operations such as interaction with the disk drives and the keyboard. The RAM 24 is the main memory into which the operating system and speech programs are loaded. The memory management chip 25 is connected to the system bus 21 and controls direct memory access operations, including passing data between the RAM 24 and the hard disk drive 26 and floppy disk drive 27. A CD ROM 28, also coupled to the system bus 21, is used to store a large amount of data, for example, a multimedia program or presentation.

Also connected to this system bus 21 are various I/O controllers: the keyboard controller 28, the mouse controller 29, the video controller 30, and the audio controller 31. As might be expected, the keyboard controller 28 provides the hardware interface for the keyboard 12, the mouse controller 29 provides the hardware interface for the mouse 13, and the video controller 30 is the hardware interface for the display 14. The audio controller 31 is the hardware interface for external speakers 32, which may be used to produce the synthesized speech. The audio controller 31 is also the hardware interface for a microphone 33 used to receive speech samples from the user. Lastly, also coupled to the system bus is a digital signal processor 34, which is preferably incorporated into the audio controller 31.

FIG. 3 is an architectural block diagram of the speech synthesis/analysis system of the present invention. The text source 50 may be CD ROM storage or magnetic disk storage, or may be the result of alphanumeric input from the keyboard of the computer. Alternatively, it may be a set of data transmitted over a network to a local computer. For purposes of this invention, it does not matter greatly where the ASCII or other character string originates.

A pronunciation system 52 may be architected according to any number of speech synthesis techniques, such as synthesis by rule or LPC conversion. What is important, however, is that pronunciation system 52 produces both the concatenated phoneme string 54 and prosody data 56 relating to the text string. For the purposes of this application, the term phoneme should be understood to be a general term for the linguistic unit used by the speech system. Allophones, diphones and triphones are all particular phoneme variants. One skilled in the art would recognize that the text string could be converted into a stream of allophones or diphones rather than phonemes and that the invention would work equally well. The phoneme string at 54 is not a concatenated series of phoneme codes, but rather the numerical data of the phonemes. Prosody data 56 may also include key word data such as pronouns, prepositions, articles and proper nouns, which may be useful in applying the intonational intervals to the phoneme string. In the case of speech synthesis, the system or user also chooses which dialect and semantic meaning are to be applied to the phoneme string. These inputs are made in data stream 57. The semantic information for speech synthesis could alternatively be included in the ASCII text stream in terms of punctuation.
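
The data passed on by the pronunciation system can be pictured with the following minimal sketch. It assumes simple record types for the phoneme string 54, the prosody data 56 and the selections in data stream 57, and it derives a semantic label from trailing punctuation as one reading of the alternative mentioned above. The type names, field names and semantic labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeData:
    symbol: str
    parameters: List[float]  # numerical data of the phoneme, not a phoneme code


@dataclass
class ProsodyEvent:
    position: int  # index into the phoneme string
    kind: str      # e.g. "stress" or a key word class such as "pronoun"


@dataclass
class SynthesisInput:
    phonemes: List[PhonemeData]  # corresponds to phoneme string 54
    prosody: List[ProsodyEvent]  # corresponds to prosody data 56
    dialect: str                 # chosen by the system or user (data stream 57)
    semantic: str                # chosen, or derived from punctuation


def semantic_from_punctuation(text: str) -> str:
    """Derive an assumed semantic label from the final punctuation mark."""
    stripped = text.rstrip()
    if stripped.endswith("?"):
        return "question"
    if stripped.endswith("!"):
        return "exclamation"
    return "statement"
```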

One pronunciation syst