WikiPatents - Community Patent Review
Speech segment coding and pitch control methods for speech synthesis systems    
United States Patent 5617507
Link to this page: http://www.wikipatents.com/5617507.html
Inventor(s): Lee; Chong R. (Seoul, KR); Park; Yong K. (Seoul, KR)
Abstract: The present invention relates to a method and system for synthesizing speech utilizing a periodic waveform decomposition and relocation coding scheme. According to the scheme, signals of voiced sound interval among original speech are decomposed into wavelets, each of which corresponds to a speech waveform for one period made by each glottal pulse. These wavelets are respectively coded and stored. The wavelets nearest to the positions where the wavelets are to be located are selected from stored wavelets and decoded. The decoded wavelets are superposed to each other such that original sound quality can be maintained and duration and pitch frequency of speech segment can be controlled arbitrarily.

Owner/Assignee: Korea Telecommunication Authority (Seoul, KR)
Publication Date: April 1, 1997
Application Number: 08/275,940
Filing Date: July 14, 1994
Examiner: MacDonald; Allen R.
Assistant Examiner: Chowdhury; Indranil
Attorney/Law Firm: Seed and Berry LLP
Parent Case: CROSS-REFERENCE TO RELATED APPLICATION This application is a continuation of U.S. patent application Ser. No. 07/972,283, filed Nov. 5, 1992, abandoned.
Priority Data: Nov 06, 1991 [KR] 91-19617
References

U.S. References

Patent No.   Inventor     U.S. Class   Date
4914701      Zibman       704/203      Apr. 1990
4912768      Benbassat    704/260      Mar. 1990
3700815      Doddington   704/246      Oct. 1972
Claims
 


We claim:

1. A speech coding method for use in speech synthesis, comprising:

obtaining a set of spectral envelope parameters that represents an estimated spectral envelope of a voiced speech signal by using a spectrum estimation technique;

deconvolving said voiced speech signal, with an impulse response that is a time-domain representation of said estimated spectral envelope of said voiced speech signal, into a pitch pulse train signal having a sequence of periodically located pitch pulses;

forming an excitation signal by appending zero-valued samples to each pitch pulse signal of one period such that one pitch pulse is contained in each period;

convolving said excitation signal with said impulse response into wavelets;

obtaining wavelet codes by coding the wavelets of all periods; and

storing in memory wavelet codes and information of corresponding pitch pulse locations of all wavelets, for use in speech synthesis.

2. A speech synthesis method in a speech synthesis system which uses the speech coding method of claim 1, comprising:

determining appropriate time points which represent a desired pitch pattern;

selecting from all wavelet codes a wavelet code whose pitch pulse location is nearest to each of said time points;

obtaining a wavelet signal by decoding each selected wavelet code;

localizing said wavelet signal so that the pitch pulse location of said wavelet signal coincides with said time point; and

superposing all of said localized wavelet signals, thereby obtaining a synthetic speech.

3. The speech coding method of claim 1 wherein a wavelet code is formed by mating information obtained by coding said pitch pulse signal of one period, with information obtained by coding a set of said spectral envelope parameters of the same period as the one period of said pitch pulse signal.

4. A speech synthesis method in a speech synthesis system which uses the speech coding method of claim 3, comprising:

determining appropriate time points which represent a desired pitch pattern;

selecting from all wavelet codes a wavelet code whose pitch pulse location is nearest to each of said time points;

decoding a coded pitch pulse signal and a set of coded spectral envelope parameters of each selected wavelet code;

forming an excitation signal by appending zero-valued samples after each decoded pitch pulse signal;

obtaining a wavelet signal by convolving said excitation signal with an impulse response which is a time-domain representation of a set of said decoded spectral envelope parameters;

localizing said wavelet signal so that pitch pulse location of said wavelet signal coincides with said time point; and

superposing all of said localized wavelet signals, thereby obtaining a synthetic speech.

5. A speech synthesis method in a speech synthesis system which uses the speech coding method of claim 3, comprising:

determining appropriate time points which represent a desired pitch pattern;

selecting from all wavelet codes a wavelet code whose pitch pulse location is nearest to each of said time points;

decoding a coded pitch pulse signal and a set of coded spectral envelope parameters in each selected wavelet code;

localizing said decoded pitch pulse signal so that the pitch pulse location of said decoded pitch pulse signal coincides with said time point;

forming an excitation signal by superposing all of said localized pitch pulse signals; and

convolving said excitation signal with an impulse response which is a time-domain representation of a set of said decoded spectral envelope parameters, thereby obtaining a synthetic speech.

6. A speech coding method for use in speech synthesis, comprising:

obtaining a set of spectral envelope parameters of a voice speech signal by spectrum estimation;

deconvolving the voice speech signal, with an impulse response that is representative of the spectral envelope parameters set of the voice speech signal, into a pitch pulse train signal having a plurality of pitch pulses;

forming an excitation signal by segmenting the pitch pulse train signal such that one pitch pulse is contained in each period;

convolving the excitation signal with the impulse response into a plurality of wavelets; and

storing the plurality of wavelets for use in speech synthesis.

7. The speech coding method of claim 6 wherein the step of forming an excitation signal further includes the step of appending zero-valued samples to each segmented pitch pulse train signal of one period.

8. A speech coding method for use in speech synthesis, comprising:

obtaining a set of spectral envelope parameters of a voice speech signal by spectrum estimation;

deconvolving the voice speech signal, with an impulse response that is representative of the set of spectral envelope parameters, into a pitch pulse train signal having a substantially flat spectral envelope and a sequence of periodically located pitch pulses;

forming an excitation signal by adding zero-valued samples to each pitch pulse train signal of one period such that one pitch pulse is contained in each period;

convolving the excitation signal with the impulse response into wavelets with each wavelet being associated with one pitch pulse; and

storing the wavelets and the locations of the associated pitch pulses in memory for use in speech synthesis.
Description
 


BACKGROUND OF INVENTION

1. Field of the Invention

The invention relates to a speech synthesis system and a method of synthesizing speech, and more particularly, to a speech segment coding method and a pitch control method which significantly improve the quality of the synthesized speech.

The principle of the present invention can be directly applied not only to speech synthesis but also to the synthesis of other sounds with properties similar to those of speech, such as the sounds of musical instruments or singing, as well as to very-low-rate speech coding or speech rate conversion. The present invention will be described below with a focus on speech synthesis.

Several speech synthesis methods exist for implementing a text-to-speech synthesis system, which can synthesize an unlimited vocabulary by converting text, that is, character strings, into speech. The method that is easiest to implement and most generally used is the speech segmental synthesis method, also called the synthesis-by-concatenation method. In this method, human speech is sampled and analyzed into phonetic units, such as demisyllables or diphones, to obtain short speech segments, which are then coded and stored in memory. When text is inputted, it is converted into phonetic transcriptions. Speech segments corresponding to the phonetic transcriptions are then sequentially retrieved from the memory and decoded to synthesize the speech corresponding to the input text.

In this type of segmental speech synthesis method, one of the most important factors governing the quality of the synthesized speech is the method used to code the speech segments. In the prior art speech segmental synthesis method, a vocoding method of low speech quality is mainly used as the speech coding method for storing speech segments, and this is one of the most important causes of the low quality of the synthesized speech. A brief description of the prior art speech segment coding methods follows.

Speech coding methods can be broadly classified into waveform coding methods, which offer good speech quality, and vocoding methods, which offer low speech quality. Since a waveform coding method aims to transfer the speech waveform as it is, it is very difficult to change the pitch frequency and duration, so the intonation and rate of speech cannot be adjusted when performing speech synthesis. The speech segments also cannot be joined smoothly to one another, so the waveform coding method is basically not suitable for coding the speech segments.

By contrast, when the vocoding method (also called an analysis-synthesis method) is used, the pitch pattern and the duration of the speech segment can be arbitrarily changed. Further, the speech segments can be smoothly joined by interpolating the spectral envelope estimation parameters. Because these properties make the vocoding method suitable as the coding means for text-to-speech synthesis, vocoding methods such as linear predictive coding (LPC) or formant vocoding are adopted in most present speech synthesis systems. However, since the quality of decoded speech is low when the speech is coded using the vocoding method, the synthesized speech obtained by decoding the stored speech segments and concatenating them cannot have better speech quality than that offered by the vocoding method.
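As an illustration of the kind of spectral envelope estimation that LPC vocoding relies on, the following is a minimal sketch of the autocorrelation method with the Levinson-Durbin recursion. It is a generic textbook formulation, not a procedure taken from this patent; the function names and the convention that the coefficient list starts with 1.0 are assumptions made for the example.

```python
def autocorrelation(x, order):
    """Autocorrelation values r[0..order] of the frame x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]


def levinson_durbin(r, order):
    """Solve for LPC coefficients a = [1, a1, ..., ap] of A(z).

    The prediction error is e[n] = s[n] + a1*s[n-1] + ... + ap*s[n-p].
    Returns the coefficients and the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        # Correlation of the current predictor with lag i.
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err
```

For a frame that decays like 0.9^n (the impulse response of a one-pole filter), the recursion should return a first coefficient close to -0.9, which is the pole of the envelope that generated the frame.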

Attempts made so far to improve the speech quality offered by the vocoding method replace the impulse train with an excitation signal that has a less artificial waveform. One such attempt was to utilize a waveform having peakiness lower than that of the impulse, for example a triangular waveform, a half-circle waveform or a waveform similar to a glottal pulse. Another attempt was to select a sample pitch pulse from one or several pitch periods of the residual signal obtained by inverse filtering and to utilize that one sample pulse, instead of the impulse, for the entire time period or for a substantially long time period. However, such attempts to replace the impulse with an excitation pulse of another waveform have improved the speech quality only slightly, if at all, and have never obtained synthesized speech with a quality approximating that of natural speech.

It is the object of the present invention to synthesize high quality speech having naturalness and intelligibility comparable to those of human speech by utilizing a novel speech segment coding method that enables both good speech quality and pitch control. The method of the present invention combines the merits of the waveform coding method, which provides good speech quality but no ability to control the pitch, and the vocoding method, which provides pitch control but has low speech quality.

The present invention utilizes two methods. The first is a periodic waveform decomposition method, a coding method which decomposes a signal in a voiced sound sector of the original speech into wavelets, each equivalent to a one-period speech waveform made by a glottal pulse, and codes and stores the decomposed signal. The second is a time warping-based wavelet relocation method, a waveform synthesis method capable of arbitrary adjustment of the duration and pitch frequency of the speech segment while maintaining the quality of the original speech; it selects, from among the stored wavelets, the wavelets nearest to the positions where wavelets are to be placed, then decodes the selected wavelets and superposes them. For purposes of this invention, musical sounds are treated as voiced sounds.

The preceding objects should be construed as merely presenting a few of the more pertinent features and applications of the invention. Many other beneficial results can be obtained by applying the disclosed invention in a different manner or modifying the invention within the scope of the disclosure. Accordingly, other objects and a fuller understanding of the invention may be had by referring to both the summary of the invention and the detailed description, below, which describe the preferred embodiment in addition to the scope of the invention defined by the claims considered in conjunction with the accompanying drawings.

SUMMARY OF THE INVENTION

Speech segment coding and pitch control methods for speech synthesis systems of the present invention are defined by the claims with specific embodiments shown in the attached drawings. For the purpose of summarizing the invention, the invention relates to a method capable of synthesizing speech that approximates the quality of natural speech while adjusting its duration and pitch frequency, by waveform-coding wavelets of each period, storing them in memory, and at the time of synthesis, decoding them and locating them at appropriate time points such that they have the desired pitch pattern and then superposing them to generate natural speech, singing, music and the like.

The present invention includes a speech segment coding method for use with a speech synthesis system, where the method comprises the forming of wavelets by obtaining parameters which represent a spectral envelope in each analysis time interval. This is done by analyzing a periodic or quasi-periodic digital signal, such as voiced speech, with the spectrum estimation technique. An original signal is first deconvolved into an impulse response represented by the spectral envelope parameters and a periodic or quasiperiodic pitch pulse train signal having a nearly flat spectral envelope. An excitation signal obtained by appending zero-valued samples after a pitch pulse signal of one period obtained by segmenting the pitch pulse train signal period by period so that one pitch pulse is contained in each period and an impulse response corresponding to a set of spectral envelope parameters in the same time interval as the excitation signal are convolved to form a wavelet for that period.
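The decomposition described in this paragraph can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: a single all-pole coefficient set `a = [1, a1, ..., ap]` stands in for the spectral envelope parameters of the whole segment (the text uses one set per analysis interval), the period boundaries are assumed to be already known, and all function names are hypothetical.

```python
def inverse_filter(signal, a):
    """Deconvolve: e[n] = sum_k a[k] * s[n-k], with a[0] == 1 (filter A(z))."""
    out = []
    for n in range(len(signal)):
        out.append(sum(a[k] * signal[n - k]
                       for k in range(len(a)) if n - k >= 0))
    return out


def synthesis_filter(excitation, a):
    """Convolve with the impulse response of 1/A(z), zero initial state."""
    out = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out


def decompose(signal, a, boundaries, tail):
    """Return (start, wavelet) pairs, one wavelet per pitch period.

    Each one-period slice of the pitch pulse train is extended with
    `tail` zero-valued samples and filtered back through 1/A(z).
    """
    residual = inverse_filter(signal, a)
    wavelets = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        excitation = residual[start:end] + [0.0] * tail
        wavelets.append((start, synthesis_filter(excitation, a)))
    return wavelets


def overlap_add(wavelets, length):
    """Superpose wavelets at their start positions."""
    out = [0.0] * length
    for start, wavelet in wavelets:
        for i, value in enumerate(wavelet):
            if start + i < length:
                out[start + i] += value
    return out
```

Because both filters are linear, superposing the wavelets at their original positions should reproduce the original signal, provided the zero-valued tail is long enough to hold each wavelet's decay. That property is what lets the wavelets be relocated later to change pitch and duration.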

The wavelets, rather than being formed by waveform-coding and stored in memory in advance, may be formed by mating information obtained by waveform-coding a pitch pulse signal of each period interval, obtained by segmentation, with information obtained by coding a set of spectral envelope estimation parameters with the same time interval as the above information, or with an impulse response corresponding to the parameters and storing the wavelet information in memory. There are two methods of producing synthetic speech by using the wavelet information stored in memory. The first method is to constitute each wavelet by convolving an excitation signal obtained by appending zero-valued samples after a pitch pulse signal of one period obtained by decoding the information and an impulse response corresponding to the decoded spectral envelope parameters in the same time interval as the excitation signal, and then to assign the wavelets to appropriate time points such that they have desired pitch pattern and duration pattern, locate them at the time points, and then superpose them.

The second method is to constitute a synthetic excitation signal by assigning the pitch pulse signals obtained by decoding the wavelet information to appropriate time points such that they have desired pitch pattern and duration pattern and locating them at the time points, and constitute a set of synthetic spectral envelope parameters either by temporally compressing or expanding the set of time functions of the parameters on a subsegment-by-subsegment basis, depending on whether the duration of a subsegment in a speech segment to be synthesized is shorter or longer than that of a corresponding subsegment in the original speech segment, respectively, or by locating the set of time functions of the parameters of one period synchronously with the mated pitch pulse signal of one period located to form the synthetic excitation signal, and to convolve the synthetic excitation signal and an impulse response corresponding to the synthetic spectral envelope parameter set by utilizing a time-varying filter or by using a fast convolution technique based on the FFT (Fast Fourier Transform). In the latter method, a blank interval occurs when a desired pitch period is longer than the original pitch period and an overlap interval occurs when the desired pitch period is shorter than the original pitch period.

In the overlap interval, the synthetic excitation signal is obtained by adding the overlapped pitch pulse signals to each other or by selecting one of them, and the spectral envelope parameter is obtained by selecting either one of the overlapped spectral envelope parameters or by using an average value of the two overlapped parameters.

In the blank interval, the synthetic excitation signal is obtained by filling it with zero-valued samples, and the synthetic spectral envelope parameter is obtained by repeating the values of the spectral envelope parameters at the beginning and ending points of the preceding and following periods before and after the center of the blank interval, or by repeating one of the two values or an average value of the two values, or by filling it with values that smoothly connect the two values.
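The handling of overlap and blank intervals can be sketched as below. The sketch picks one option from each list in the text: overlapped pitch pulse signals are added together, and a blank interval in a parameter track is filled by linear interpolation as one way of smoothly connecting the two neighbouring values. The function names are hypothetical, and a single scalar stands in for a spectral envelope parameter.

```python
def place_pitch_pulses(pulses, positions, length):
    """Build a synthetic excitation signal of `length` samples.

    Samples not covered by any pulse stay zero (blank interval);
    samples covered by two pulses hold their sum (overlap interval).
    """
    out = [0.0] * length
    for pulse, pos in zip(pulses, positions):
        for i, value in enumerate(pulse):
            if 0 <= pos + i < length:
                out[pos + i] += value
    return out


def fill_blank(prev_value, next_value, gap):
    """Parameter values for a blank interval of `gap` samples.

    Linearly interpolates between the value at the end of the
    preceding period and the value at the start of the following one.
    """
    step = (next_value - prev_value) / (gap + 1)
    return [prev_value + step * (i + 1) for i in range(gap)]
```

Placing two three-sample pulses at positions 0 and 2 produces one overlapped sample, while placing them at positions 0 and 4 leaves one zero-valued sample between them.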

The present invention further includes a pitch control method of a speech synthesis system capable of controlling duration and pitch of a speech segment by a time warping-based wavelet relocation method which makes it possible to synthesize speech with almost the same quality as that of natural speech, by coding important boundary time points such as the starting point, the end point and the steady-state points in a speech segment and pitch pulse positions of each wavelet or each pitch pulse signal and storing them in memory simultaneously at the time of storing each speech segment, and at the time of synthesis, obtaining a time-warping function by comparing desired boundary time points and original boundary time points stored corresponding to the desired boundary time points, finding out the original time points corresponding to each desired pitch pulse position by utilizing the time-warping function, selecting wavelets having pitch pulse positions nearest to the original time points and locating them at desired pitch pulse positions, and superposing the wavelets.
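The selection step of the time warping-based wavelet relocation method can be sketched as follows. The sketch assumes a piecewise-linear time-warping function between the stored boundary time points and the desired ones, which is one plausible reading of the text rather than a detail it states; all names are hypothetical and times are in samples.

```python
from bisect import bisect_right


def warp_to_original(t, desired_bounds, original_bounds):
    """Map a synthesis time point back to the original time axis."""
    i = bisect_right(desired_bounds, t) - 1
    i = max(0, min(i, len(desired_bounds) - 2))  # clamp to a valid piece
    d0, d1 = desired_bounds[i], desired_bounds[i + 1]
    o0, o1 = original_bounds[i], original_bounds[i + 1]
    return o0 + (t - d0) * (o1 - o0) / (d1 - d0)


def select_wavelets(desired_pulses, desired_bounds,
                    original_bounds, original_pulses):
    """For each desired pitch pulse position, pick the index of the stored
    wavelet whose pitch pulse position is nearest in original time."""
    chosen = []
    for t in desired_pulses:
        t_orig = warp_to_original(t, desired_bounds, original_bounds)
        nearest = min(range(len(original_pulses)),
                      key=lambda j: abs(original_pulses[j] - t_orig))
        chosen.append(nearest)
    return chosen
```

When a segment is stretched to twice its duration, several desired pulse positions map to the same stored wavelet, so wavelets are repeated; when it is shortened, some wavelets are skipped. The selected wavelets would then be placed at the desired positions and superposed.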

The pitch control method may further include producing synthetic speech by selecting pitch pulse signals of one period and spectral envelope parameters corresponding to the pitch pulse signals, instead of the wavelets, and locating them, and convolving the located pitch pulse signals and impulse response corresponding to the spectral envelope parameters to produce wavelets and superposing the produced wavelets, or convolving a synthetic excitation signal obtained by superposing the located pitch pulse signals and a time-varying impulse response corresponding to a set of synthetic spectral envelope parameters made by concatenating the located spectral envelope parameters.

A voiced speech synthesis device of a speech synthesis system is disclosed and includes a decoding subblock 9 producing wavelet information by decoding wavelet codes from the speech segment storage block 5. A duration control subblock 10 produces time-warping data from input of duration data from a prosodics generation subsystem 2 and boundary time points included in header information from the speech segment storage block 5. A pitch control subblock 11 produces pitch pulse position information such that it has an intonation pattern as indicated by an intonation pattern data from input of the header information from the speech segment storage block 5, the intonation pattern data from the prosodics generation subsystem and the time-warping information from the duration control subblock 10. An energy control subblock 12 produces gain information such that synthesized speech has the stress pattern as indicated by stress pattern data from input of the stress pattern data from the prosodics generation subsystem 2, the time-warping information from the duration control subblock 10 and pitch pulse position information from the pitch control subblock 11. A waveform assembly subblock 13 produces a voiced speech signal from input of the wavelet information from the decoding subblock 9, the time-warping information from the duration control subblock 10, the pitch pulse position information from the pitch control subblock 11 and the gain information from the energy control subblock 12.

Thus, according to the present invention, text is inputted to the phonetic preprocessing subsystem 1 where it is converted into phonetic transcriptive symbols and syntactic analysis data. The syntactic analysis data is outputted to a prosodics generation subsystem 2. The prosodics generation subsystem 2 outputs prosodic information to the speech segment concatenation subsystem 3. The phonetic transcriptive symbols output from the preprocessing subsystem are also inputted to the speech segment concatenation subsystem 3. The phonetic transcriptive symbols are then inputted to the speech segment selection block 4 and the corresponding prosodic data are inputted to the voiced sound synthesis block 6 and to the unvoiced sound synthesis block 7. In the speech segment selection block 4 each input phonetic transcriptive symbol is matched with a corresponding speech segment synthesis unit, and the memory address of the matched synthesis unit is looked up in a speech segment table in the speech segment storage block 5. The address of the matched synthesis unit is then outputted to the speech segment storage block 5 where the corresponding speech segment in coded wavelet form is selected for each of the addresses of the matched synthesis units. The selected speech segment in coded wavelet form is outputted to the voiced sound synthesis block 6 for voiced sound and to the unvoiced sound synthesis block 7 for unvoiced sound. The voiced sound synthesis block 6, which uses the time warping-based wavelet relocation method to synthesize speech sound, and the unvoiced sound synthesis block 7 output digital synthetic speech signals to the digital-to-analog converter, which converts the input digital signals into analog signals which are the synthesized speech sounds.

To utilize the present invention, speech and/or music is first recorded on magnetic tape. The resulting sound is then converted from analog signals to digital signals by low-pass filtering the analog signals and then feeding the filtered signals to an analog-to-digital converter. The resulting digitized speech signals are then segmented into a number of speech segments having sounds which correspond to synthesis units, such as phonemes, diphones, demisyllables and the like, by using known speech editing tools. Each resulting speech segment is then differentiated into voiced and unvoiced speech segments by using known voiced/unvoiced detection and speech editing tools. The unvoiced speech segments are encoded by known vocoding methods which use white random noise as an unvoiced speech source. The vocoding methods include LPC, homomorphic, formant vocoding methods, and the like.

The voiced speech segments are used to form wavelets sj(n) according to the method disclosed below in FIG. 4. The wavelets sj(n) are then encoded by using an appropriate waveform coding method. Known waveform coding methods include Pulse Code Modulation (PCM), Adaptive Differential Pulse Code Modulation (ADPCM), Adaptive Predictive Coding (APC) and the like. The resulting encoded voiced speech segments are stored in the speech segment storage block 5 as shown in FIGS. 6A and 6B. The encoded unvoiced speech segments are also stored in the speech segment storage block 5.
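Of the waveform coding methods named above, PCM is the simplest to illustrate. The following is a minimal sketch of uniform (linear) PCM applied to one wavelet; the 16-bit word length, the assumption that samples are normalized to the range -1.0 to 1.0, and the function names are choices made for the example, not values given in the text.

```python
def pcm_encode(wavelet, bits=16):
    """Quantize samples in [-1.0, 1.0] to signed integer codes."""
    full_scale = 2 ** (bits - 1) - 1
    codes = []
    for x in wavelet:
        x = max(-1.0, min(1.0, x))  # clip out-of-range samples
        codes.append(int(round(x * full_scale)))
    return codes


def pcm_decode(codes, bits=16):
    """Map integer codes back to floating-point samples."""
    full_scale = 2 ** (bits - 1) - 1
    return [c / full_scale for c in codes]
```

With this scheme the round-trip error per in-range sample is at most half a quantization step. ADPCM and APC would reduce the bit rate further by coding prediction errors instead of raw samples.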

The more pertinent and important features of the present invention have been outlined above in order that the detailed description of the invention which follows will be better understood and that the present contribution to the art can be fully appreciated. Additional features of the invention described hereinafter form the subject of the claims of the invention. Those skilled in the art can appreciate that the conception and the specific embodiment disclosed herein may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Further, those skilled in the art can realize that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

For a fuller understanding of the nature and objects of the invention, reference should be had to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 illustrates the text-to-speech synthesis system of the speech segment synthesis method;

FIG. 2 illustrates the speech segment concatenation subsystem;

FIGS. 3A through 3T illustrate waveforms for explaining the principle of the periodic waveform decomposition method and the wavelet relocation method according to the present invention;

FIG. 4 illustrates a block diagram for explaining the periodic waveform decomposition method;

FIGS. 5A through 5E illustrate block diagrams for explaining the procedure of the blind deconvolution method;

FIGS. 6A and 6B illustrate code formats for the voiced speech segment information stored at the speech segment storage block;

FIG. 7 illustrates the voiced speech synthesis block according to the present invention; and

FIGS. 8A and 8B illustrate graphs for explaining the duration and pitch control method according to the present invention.

Similar reference characters refer to similar parts throughout the several views of the drawings.

DETAILED DESCRIPTION OF THE INVENTION

The structure of the text-to-speech synthesis system of the prior art speech segment synthesis method consists of three subsystems:

A. A phonetic preprocessing subsystem (1);

B. A prosodics generation subsystem (2); and

C. A speech segment concatenation subsystem (3).

These subsystems are shown in FIG. 1. When the text is input from a keyboard, a computer or any other system, to the text-to-speech synthesis system, the phonetic preprocessing subsystem (1) analyzes the syntax of the text and then changes the text to a string of phonetic transcriptive symbols by applying thereto phonetic recoding rules. The prosodics generation subsystem (2) generates intonation pattern data and stress pattern data utilizing syntactic analysis data so that appropriate intonation and stress can be applied to the string of phonetic transcriptive symbols, and then outputs the data to the speech segment concatenation subsystem (3). The prosodics generation subsystem (2) also provides the data with respect to the duration of each phoneme to the speech segment concatenation subsystem (3).

The above three prosodic data, i.e. the intonation pattern data, the stress pattern data and the data regarding the duration of each phoneme are, in general, sent to the speech segment concatenation subsystem (3) together with the string of the phonetic transcriptive symbols generated by the phonetic preprocessing subsystem (1), although they may be transferred to the speech segment concatenation subsystem (3) independently of the string of the phonetic transcriptive symbols.

The speech segment concatenation subsystem (3) generates continuous speech by sequentially fetching appropriate speech segments which are coded and stored in memory thereof according to the string of the phonetic transcriptive symbols (not shown) and by decoding them. At this time the speech segment concatenation subsystem (3) can generate synthetic speech having the intonation, stress and speech rate as intended by the prosodics generation subsystem (2) by controlling the energy (intensity), the duration and the pitch period of each speech segment according to the prosodic information.

The present invention remarkably improves speech quality in comparison with synthesized speech of the prior art by improving the coding method for storing the speech segments in the speech segment concatenation subsystem (3). A description with respect to the operation of the speech segment concatenation subsystem (3) with reference to FIG. 2 follows.

When the string of the phonetic transcriptive symbols formed by the phonetic preprocessing subsystem (1) is inputted to the speech segment selection block (4), the speech segment selection block (4) sequentially selects the synthesis units, such as diphones and demisyllables, by continuously inspecting the string of incoming phonetic transcriptive symbols, and finds out the addresses of the speech segments corresponding to the selected synthesis units from the memory thereof as in Table 1. Table 1 shows an example of the speech segment table kept in the speech segment selection block (4) which selects diphone-based speech segments. This results in the formation of an address of the selected speech segment being output to the speech segment storage block (5).

The speech segments corresponding to the addresses of the speech segment are coded according to the method of the present invention, to be described later, and are stored at the addresses of the memory of the speech segment storage block (5).

TABLE 1
______________________________________
phonetic transcriptive         memory address
symbol of speech segment       (in hexadecimal)
______________________________________
/ai/                           0000
/au/                           0021
/ab/                           00A3
/ad/                           00FF
. . .                          . . .
______________________________________
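The lookup performed by the speech segment selection block (4) can be sketched as follows. This is only an illustration, assuming the table is held in a Python dictionary; the names SEGMENT_TABLE and select_addresses are hypothetical, and only the four entries listed in Table 1 are included.

```python
# Hypothetical in-memory copy of the four entries listed in Table 1.
SEGMENT_TABLE = {
    "ai": 0x0000,
    "au": 0x0021,
    "ab": 0x00A3,
    "ad": 0x00FF,
}


def select_addresses(units):
    """Return the storage address of each selected synthesis unit, in order."""
    return [SEGMENT_TABLE[unit] for unit in units]


# Addresses that would be passed on to the speech segment storage block (5).
addresses = select_addresses(["ab", "ai"])
print([f"{address:04X}" for address in addresses])
```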

When the address of the selected speech segment from the speech segment selection block (4) is inputted to the speech segment storage block (5), the speech segment storage block (5) fetches the corresponding speech segment data from the memory in the speech segment storage block (5) and sends it to a voiced sound synthesis block (6) if it is a voiced sound or a voiced fricative sound, or to an unvoiced sound synthesis block (7) if it is an unvoiced sound. That is, the voiced sound synthesis block (6) synthesizes a digital speech signal corresponding to the voiced sound speech segments, and the unvoiced sound synthesis block (7) synthesizes a digital speech signal corresponding to the unvoiced sound speech segments. Each digital synthesized speech signal of the voiced sound synthesis block (6) and the unvoiced sound synthesis block (7) is then converted into an analog signal.

Thus, the resulting digital synthesized speech signal output from the voiced sound synthesis block (6) or unvoiced sound synthesis block (7) is then sent to a D/A conversion block (8) consisting of a digital-to-analog converter, an analog low-pass filter and an analog amplifier, and is converted into an analog signal to provide synthesized speech sound.

When the voiced sound synthesis block (6) and the unvoiced sound synthesis block (7) concatenate the speech segments, they provide the prosody as intended by the prosodics generation subsystem (2) to synthesized speech by properly adjusting the duration, the intensity and the pitch frequency of the speech segment on the basis of the prosodic information, i.e., intonation pattern data, stress pattern data, duration data.

The preparation of the speech segment for storage in the speech segment storage block (5) is as follows. A synthesis unit is first selected. Such synthesis units include phoneme, allophone, diphone, syllable, demisyllable, CVC, VCV, CV, VC unit (here, "C" stands for a consonant, "V" stands for a vowel phoneme, respectively) or combinations thereof. The synthesis units which are most widely used in the current speech synthesis method are the diphones and the demisyllables.

The speech segment corresponding to each element of an aggregation of the synthesis units is segmented from the speech samples which are actually pronounced by a human. Accordingly, the number of elements of the synthesis unit aggregation is the same as the number of speech segments. For example, in the case where demisyllables are used as the synthesis units in English, the number of demisyllables is about 1000 and, accordingly, the number of the speech segments is also about 1000. In general, such speech segments consist of the unvoiced sound interval and the voiced sound interval.

In the present invention, the unvoiced speech segment and the voiced speech segment obtained by segmenting the prior art speech segment into the unvoiced sound interval and the voiced sound interval are used as the basic synthesis unit. The unvoiced sound speech synthesis portion is accomplished according to the prior art as discussed below. The voiced sound speech synthesis is accomplished according to the present invention.

Thus, the unvoiced speech segments are decoded at the unvoiced sound synthesis block (7) shown in FIG. 2. In the case of decoding the unvoiced sound, it has been noted in the prior art that the use of an artificial white random noise signal as an excitation signal for a synthesis filter does not degrade the quality of the decoded speech. Therefore, in the coding and decoding of the unvoiced speech segments, the prior art vocoding method, in which white noise is used as the excitation signal, can be applied as it is. For example, in the prior art synthesis of unvoiced sound, the white noise signal can be generated by a random number generation algorithm when synthesizing, or a white noise signal generated in advance and stored in memory can be retrieved from memory when synthesizing, or a residual signal obtained by filtering the unvoiced sound interval of the actual speech with an inverse spectral envelope filter and stored in memory can be retrieved from memory when synthesizing. If it is not necessary to change the duration of the unvoiced speech segment, an extremely simple coding method can be utilized in which the unvoiced sound portion is coded according to a waveform coding method such as Pulse Code Modulation (PCM) or Adaptive Differential Pulse Code Modulation (ADPCM) and is stored. It is then decoded for use when synthesizing.
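The first of these options, white noise from a random number generator driving a synthesis filter, can be sketched as follows. This is a minimal illustration and not the vocoder of any particular prior art system: the all-pole filter form, the sign convention A(z) = 1 + a[1]z^-1 + ..., the coefficient values and the names synthesis_filter and synthesize_unvoiced are all assumptions made for the example.

```python
import random


def synthesis_filter(excitation, a):
    """All-pole filter 1/A(z), assuming A(z) = 1 + a[1] z^-1 + ... and a[0] == 1."""
    out = []
    for n, sample in enumerate(excitation):
        acc = sample
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out


def synthesize_unvoiced(a, gain, num_samples, seed=0):
    """Drive the synthesis filter with scaled Gaussian white noise."""
    rng = random.Random(seed)
    noise = [gain * rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    return synthesis_filter(noise, a)


# Illustrative (made-up) coefficients; the duration is set by num_samples.
unvoiced = synthesize_unvoiced(a=[1.0, -0.5], gain=0.1, num_samples=160)
```

Because the excitation is generated on demand, the duration of the unvoiced segment can be changed simply by requesting more or fewer samples.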

The present invention relates to a coding and synthesis method for the voiced speech segments, which governs the quality of the synthesized speech. A description of such a method follows, with emphasis on the speech segment storage block (5) and the voiced sound synthesis block (6) shown in FIG. 2.

The voiced speech segments among the speech segments stored in the memory of the speech segment storage block (5) are decomposed into wavelets of pitch periodic component in advance according to the periodic-waveform decomposition method of the present invention and stored therein. The voiced sound synthesis block (6) synthesizes speech having the desired pitch and the duration patterns by properly selecting and arranging the wavelets according to the time warping-based wavelet relocation method. The principle of these methods is described below with reference to the drawings.

Voiced speech s(n) is a periodic signal obtained when a periodic glottal wave generated at the vocal cords passes through the acoustical vocal tract filter V(f) consisting of the oral cavity, pharyngeal cavity and nasal cavity. Here, it is assumed that the vocal tract filter V(f) includes frequency characteristic due to a lip radiation effect. A spectrum S(f) of voiced speech is characterized by:

1. A fine structure varying rapidly with respect to frequency "f"; and

2. A spectral envelope varying slowly thereto, the former being due to periodicity of the voiced speech signal and the latter reflecting the spectrum of a glottal pulse and frequency characteristic of the vocal tract filter.

The spectrum S(f) of the voiced speech takes the same form as that obtained when the fine structure of an impulse train, due to harmonic components which exist at integer multiples of the pitch frequency Fo, is multiplied by a spectral envelope function H(f). Therefore, voiced speech s(n) can be regarded as the output signal when a periodic pitch pulse train signal e(n), having a flat spectral envelope and the same period as the voiced speech s(n), is input to a time-varying filter having the same frequency response characteristic as the spectral envelope function H(f) of the voiced speech s(n). Viewing this in the time domain, the voiced speech s(n) is a convolution of an impulse response h(n) of the filter H(f) and the periodic pitch pulse train signal e(n). Since H(f) corresponds to the spectral envelope function of the voiced speech s(n), the time-varying filter having H(f) as its frequency response characteristic is referred to as a spectral envelope filter or a synthesis filter.
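The time-domain view, s(n) as the convolution of h(n) with e(n), can be illustrated with a short sketch. For simplicity it assumes a filter that does not vary with time, unit pitch pulses, and a made-up impulse response that is longer than one pitch period; the function names are hypothetical.

```python
def pitch_pulse_train(period, num_samples):
    """Periodic pitch pulse train e(n): a unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def convolve(e, h):
    """Linear convolution s(n) = sum over m of e(m) * h(n - m)."""
    s = [0.0] * (len(e) + len(h) - 1)
    for i, e_value in enumerate(e):
        if e_value != 0.0:
            for j, h_value in enumerate(h):
                s[i + j] += e_value * h_value
    return s


e = pitch_pulse_train(period=4, num_samples=12)
# Made-up impulse response, deliberately longer than one pitch period.
h = [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
s = convolve(e, h)
```

Since h(n) outlasts the pitch period, each sample of s(n) after the first period contains contributions from more than one pitch pulse.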

In FIG. 3A, a signal for 4 periods of a glottal waveform is illustrated. Commonly, the waveforms of the glottal pulses composing the glottal waveform are similar to each other but not completely identical, and the time intervals between adjacent glottal pulses are likewise similar to each other but not completely equal. As described above, the voiced speech waveform s(n) of FIG. 3C is generated when the glottal waveform g(n) shown in FIG. 3A is filtered by the vocal tract filter V(f). The glottal waveform g(n) consists of the glottal pulses g1(n), g2(n), g3(n) and g4(n) distinguished from each other in terms of time, and when they are filtered by the vocal tract filter V(f), the wavelets s1(n), s2(n), s3(n) and s4(n) shown in FIG. 3B are generated. The voiced speech waveform s(n) shown in FIG. 3C is generated by superposing such wavelets.

A basic concept of the present invention is that if one can obtain the wavelets which compose a voiced speech signal by decomposing the voiced speech signal, one can synthesize speech with arbitrary accent and intonation pattern by changing the intensity of the wavelets and the time intervals between them.
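This basic concept can be sketched in a few lines. The example below is an illustration only, assuming a single made-up wavelet reused for every period; the function name superpose and the positions and gains are hypothetical.

```python
def superpose(wavelets, positions, gains, length):
    """Add each scaled wavelet into the output, starting at its position."""
    out = [0.0] * length
    for wavelet, start, gain in zip(wavelets, positions, gains):
        for i, value in enumerate(wavelet):
            if 0 <= start + i < length:
                out[start + i] += gain * value
    return out


# Made-up wavelet, reused for three consecutive periods.
wavelet = [1.0, 0.5, 0.25, 0.125]

# Original spacing: one wavelet every 4 samples, unchanged intensity.
original = superpose([wavelet] * 3, [0, 4, 8], [1.0, 1.0, 1.0], 12)

# Shorter spacing (higher pitch) with falling intensity.
higher_pitch = superpose([wavelet] * 3, [0, 3, 6], [1.0, 0.8, 0.6], 12)
```

Moving the wavelets closer together raises the pitch, and scaling them changes the intensity, without altering the shape of any individual wavelet.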

Because the voiced speech waveform s(n) shown in FIG. 3C was generated by superposing the wavelets which overlap with each other in time, it is difficult to get the wavelets back from the speech waveform s(n).

In order for the waveform of each period not to overlap with each other in the time domain, the waveform must be a peaky waveform in which the energy is concentrated about one point in time, as seen in FIG. 3F.

Such a peaky waveform is a waveform that has a nearly flat spectral envelope in the frequency domain. When a voiced speech waveform s(n) is given, a periodic pitch pulse train signal e(n) having a flat spectral envelope, as shown in FIG. 3F, can be obtained as output by estimating the envelope H(f) of the spectrum S(f) of the waveform s(n) and inputting the waveform s(n) into an inverse spectral envelope filter 1/H(f) having the inverse of the envelope function H(f) as its frequency characteristic. FIGS. 4, 5A and 5B are related to this step.

Because the pitch pulse waveforms of each period composing the periodic pitch pulse train signal e(n) as shown in FIG. 3F do not overlap with one another in the time domain, they can be separated. The principle of the periodic-waveform decomposition method is that because the separated "pitch pulse signals for one period" e1(n), e2(n), . . . have a substantially flat spectrum, if they are input back to the spectral envelope filter H(f) so that the signals have the original spectrum, then the wavelets s1(n), s2(n), etc. as shown in FIG. 3B can be obtained.
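The principle can be checked numerically with a small sketch. It is not the method of the invention as claimed: it assumes a known, time-invariant all-pole envelope filter 1/A(z) with a single made-up coefficient, known pitch pulse positions, and hypothetical function names.

```python
def inverse_filter(s, a):
    """FIR filter A(z) = 1 + a[1] z^-1 + ...: removes the spectral envelope."""
    return [sum(a[k] * s[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(s))]


def envelope_filter(e, a):
    """All-pole filter 1/A(z): restores the spectral envelope."""
    out = []
    for n, value in enumerate(e):
        out.append(value - sum(a[k] * out[n - k]
                               for k in range(1, len(a)) if n - k >= 0))
    return out


def decompose(s, a, pulse_positions):
    """Split the residual period by period and refilter each piece."""
    e = inverse_filter(s, a)
    bounds = list(pulse_positions) + [len(s)]
    wavelets = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Keep one period of the residual and zero-pad the rest.
        one_period = [e[n] if start <= n < end else 0.0 for n in range(len(s))]
        wavelets.append(envelope_filter(one_period, a))
    return wavelets


a = [1.0, -0.9]                       # made-up envelope filter coefficient
excitation = [0.0] * 24
for position, amplitude in ((0, 1.0), (8, 0.8), (16, 1.1)):
    excitation[position] = amplitude
s = envelope_filter(excitation, a)    # overlapping "voiced" waveform

wavelets = decompose(s, a, [0, 8, 16])
rebuilt = [sum(w[n] for w in wavelets) for n in range(len(s))]
```

Each recovered wavelet extends well past its own period, yet superposing the wavelets reproduces s(n), because the filtering is linear.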

FIG. 4 is a block diagram of the periodic-waveform decomposition method of the present invention in which the voiced speech segment is analyzed into wavelets. The voiced speech waveform s(n), which is a digital signal, is obtained by band-limiting the analog voiced speech signal or musical instrument sound signal with a low-pass filter, converting the resulting signal to digital form with an analog-to-digital converter, and storing it on a magnetic disc in the Pulse Code Modulation (PCM) code format by grouping several bits at a time; it is then retrieved for processing when needed.

The first stage of wavelet preparation process according to the periodic-waveform decomposition method is a blind deconvolution in which the voiced speech waveform s(n) (periodic signal s(n)) is deconvolved into an impulse response h(n), which is a time domain function of the spectrum envelope function H(f) of the signal s(n), and a periodic pitch pulse train signal e(n) having a flat spectral envelope and the same period as the signal s(n). See FIGS. 5A and 5B and the discussion related thereto.

As described, for the blind deconvolution, the spectrum estimation technique with which the spectral envelope function H(f) is estimated from the signal s(n) is essential.

Prior art spectrum estimation techniques can be classified into 3 methods depending on the length of the analysis interval:

1. A block analysis method;

2. A pitch-synchronous analysis method; and

3. A sequential analysis method.

The block analysis method is a method in which the speech signal is divided into blocks of constant duration of the order of 10-20 ms (milliseconds), and then the analysis is done with respect to the constant number of speech samples existing in each block, obtaining one set (commonly 10-16 parameters) of spectral envelope parameters for each block, for which method a homomorphic analysis method and a block linear prediction analysis method are typical.

The pitch-synchronous analysis method obtains one set of spectral envelope parameters for each period by performing analysis on each period speech signal which was obtained by dividing the speech signal with the pitch period as the unit (as shown in FIG. 3C), for which method the analysis-by-synthesis method and the pitch-synchronous linear prediction analysis method are typical.

In the sequential analysis method, one set of spectral envelope parameters is obtained for each speech sample (as shown in FIG. 3D) by estimating the spectrum for each speech sample, for which method the least squares method and the recursive least squares method, which are a kind of adaptive filtering method, are typical.
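The practical difference between the three methods is how many spectral envelope parameter sets they produce for the same stretch of speech. The arithmetic can be sketched as follows, assuming for illustration a sampling rate of 8 kHz, a 20 ms block and a constant pitch of 100 Hz; these values and the function name are assumptions, not figures from the description.

```python
def parameter_set_counts(num_samples, sample_rate_hz, block_ms, pitch_hz):
    """Number of parameter sets each analysis method yields for one signal."""
    block_len = round(sample_rate_hz * block_ms / 1000.0)   # samples per block
    period_len = round(sample_rate_hz / pitch_hz)           # samples per period
    return {
        "block": num_samples // block_len,                # one set per block
        "pitch_synchronous": num_samples // period_len,   # one set per period
        "sequential": num_samples,                        # one set per sample
    }


# One second of speech under the assumed conditions.
counts = parameter_set_counts(8000, 8000, 20.0, 100.0)
```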

FIG. 3D shows variation with time of the first 4 reflection coefficients among 14 reflection coefficients k1, k2, . . . , k14 which constitute a spectral envelope parameter set obtained by the sequential analysis method. (Please refer to FIG. 5A.) As can be seen from the drawing, the values of the spectral envelope parameters change continuously due to continuous movement of the articulatory organs, which means that the impulse response h(n) of the spectral envelope filter continuously changes. Here, for convenience of explanation, assuming that h(n) does not change in an interval of one period, h(n) during the first, second and third period is denoted respectively as h(n)1, h(n)2, h(n)3 as shown in FIG. 3E.

A set of envelope parameters obtained by any of the various spectrum estimation techniques, such as a cepstrum CL(i), which is a parameter set obtained by the homomorphic analysis method, or a prediction coefficient set {ai}, a reflection coefficient set {ki}, a set of line spectrum pairs, etc., which is obtained by applying the recursive least squares method or the linear prediction method, is dealt with as equivalent to H(f) or h(n), because the frequency characteristic H(f) or the impulse response h(n) of the spectral envelope filter can be derived from it. Therefore, hereinafter, the impulse response is also referred to as the spectral envelope parameter set.
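As an illustration of this equivalence, the sketch below converts a reflection coefficient set into prediction coefficients and then into the impulse response h(n). It assumes the common step-up recursion and the sign convention A(z) = 1 + a[1]z^-1 + ... with H(z) = 1/A(z); the description does not fix a sign convention, so this choice, the coefficient values and the function names are assumptions.

```python
def reflection_to_prediction(k):
    """Step-up recursion: reflection coefficients -> coefficients of A(z)."""
    a = [1.0]
    for k_m in k:
        prev = a + [0.0]
        last = len(prev) - 1
        a = [prev[i] + k_m * prev[last - i] for i in range(len(prev))]
    return a


def impulse_response(a, num_samples):
    """Impulse response h(n) of the all-pole filter 1/A(z)."""
    h = []
    for n in range(num_samples):
        acc = 1.0 if n == 0 else 0.0
        for i in range(1, len(a)):
            if n - i >= 0:
                acc -= a[i] * h[n - i]
        h.append(acc)
    return h


a = reflection_to_prediction([0.5, 0.2])     # made-up reflection coefficients
h = impulse_response(reflection_to_prediction([-0.9]), 8)
```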

FIGS. 5A and 5B show methods of the blind deconvolution.

FIG. 5A shows a blind deconvolution method performed by using the linear prediction analysis method or the recursive least squares method, which are both prior art methods. Given the voiced speech waveform s(n), as shown in FIG. 3C, the prediction coefficients (a1, a2, . . . , aN) or the reflection coefficients (k1, k2, . . . , kN), which are the spectral envelope parameters representing the frequency characteristic H(f) or the impulse response h(n) of the spectral envelope filter, are obtained utilizing the linear prediction analysis method or the recursive least squares method. Normally a prediction order N of 10-16 is sufficient. Utilizing the prediction coefficients (a1, a2 . . . aN) or the reflection coefficients (k1, k2 . . . kN) as the spectral envelope parameters, an inverse spectral envelope filter (or simply an inverse filter) having the frequency characteristic 1/H(f), which is the inverse of the frequency characteristic H(f) of the spectral envelope filter, can easily be constructed by one skilled in the art. If the voiced speech waveform is input to the inverse spectral envelope filter, also referred to as a linear prediction error filter in the linear prediction analysis method or in the recursive least squares method, the periodic pitch pulse train signal of the type of FIG. 3F having the flat spectral envelope, called a prediction error signal or a residual signal, can be obtained as output from the filter.
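A minimal sketch of this kind of blind deconvolution follows. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common form of linear prediction analysis and not necessarily the variant used with FIG. 5A, and the sign convention A(z) = 1 + a[1]z^-1 + ...; the test signal and the function names are made up for illustration.

```python
def autocorrelation(s, order):
    """Autocorrelation values r[0..order] of the signal s."""
    return [sum(s[n] * s[n - lag] for n in range(lag, len(s)))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Return prediction coefficients a, reflection coefficients k, error."""
    a = [1.0] + [0.0] * order
    k = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k_m = -acc / err
        k.append(k_m)
        prev = a[:]
        for i in range(1, m):
            a[i] = prev[i] + k_m * prev[m - i]
        a[m] = k_m
        err *= 1.0 - k_m * k_m
    return a, k, err


def inverse_filter(s, a):
    """Linear prediction error filter A(z); its output is the residual."""
    return [sum(a[i] * s[n - i] for i in range(len(a)) if n - i >= 0)
            for n in range(len(s))]


# Made-up signal: the response of 1/(1 - 0.9 z^-1) to a single pulse.
s = [0.9 ** n for n in range(200)]
a, k, err = levinson_durbin(autocorrelation(s, 2), 2)
residual = inverse_filter(s, a)
```

For this signal the estimated filter is close to 1 - 0.9z^-1, and the residual collapses to a single pulse at n = 0, which is the peaky, spectrally flat excitation the deconvolution is meant to recover.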

FIGS. 5B and 5C show the blind deconvolution method utilizing the homomorphic analysis method, which is a block analysis method. FIG. 5B shows the method performed by a frequency division, and FIG. 5C shows the method performed by inverse filtering.

A description of FIG. 5B follows. Speech samples for analysis of one block are obtained by multiplying the voiced speech signal s(n) by a tapered window function, such as a Hamming window, having a duration of about 10-20 ms. A cepstral sequence c(i) is then obtained by processing the speech samples utilizing a series of homomorphic processing procedures consisting of a discrete Fourier transform, a complex logarithm and an inverse discrete Fourier transform, as shown in FIG. 5D. The cepstrum is a function of the quefrency, which is a unit similar to time.

A low-quefrency cepstrum CL(i) situated around an origin representing the spectral envelope of the voiced speech s(n) and a high-quefrency cepstrum CH(i) representing a periodic pitch pulse train signal e(n), are capable of being separated from each other in quefrency domain. That is, multiplying the cepstrum c(i) by a low-quefrency window function and a high-quefrency window function, respectively, gives CL(i) and CH(i), respectively. Taking them respectively through an inverse homomorphic processing procedure as shown in FIG. 5E gives the impulse response h(n) and the pitch pulse train signal e(n). In this case, because taking the CH(i) through the inverse homomorphic processing procedure does not directly give the pitch pulse train signal e(n) but gives the pitch pulse train signal of one block multiplied by a time window function w(n), e(n) can be obtained by multiplying again the pitch pulse train signal by an inverse time window function 1/w(n) corresponding to the inverse of w(n).
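The separation in the quefrency domain can be illustrated with a simplified sketch. It departs from the procedure above in several labeled ways: it uses the real cepstrum (logarithm of the spectral magnitude only) instead of the complex logarithm, a direct discrete Fourier transform, no time window, and a made-up block containing just two pitch pulses spaced 32 samples apart. The function names and the cutoff of 16 are assumptions.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct discrete Fourier transform (slow but dependency-free)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum (a simplification)."""
    log_magnitude = [math.log(abs(v)) for v in dft(x)]
    return [v.real for v in dft(log_magnitude, inverse=True)]


def split_cepstrum(c, cutoff):
    """Low-quefrency part (near the origin) and the high-quefrency rest."""
    n = len(c)
    low = [c[i] if (i < cutoff or i > n - cutoff) else 0.0 for i in range(n)]
    high = [c[i] - low[i] for i in range(n)]
    return low, high


N, PERIOD = 256, 32
x = [0.0] * N
for position, amplitude in ((0, 1.0), (PERIOD, 0.5)):
    for n in range(N):
        x[(position + n) % N] += amplitude * 0.6 ** n   # made-up h(n) = 0.6^n

c = real_cepstrum(x)
low, high = split_cepstrum(c, cutoff=16)
peak = max(range(16, N // 2), key=lambda i: high[i])
```

In this sketch the envelope information stays in the low-quefrency part, while the largest high-quefrency value falls at the pulse spacing, which is why the two parts can be separated by windowing the cepstrum.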

The method of FIG. 5C is the same as that of FIG. 5B, except that CL(i) instead of CH(i) is utilized in FIG. 5C in obtaining the periodic pitch pulse train signal e(n). This method utilizes the property that an impulse response h^-1(n) corresponding to 1/H(f), which is the inverse of the frequency characteristic H(f), can be obtained by processing -CL(i), the negative of CL(i), through the inverse homomorphic processing procedure. The periodic pitch pulse train signal e(n) can then be obtained as output by constructing a finite-duration impulse response (FIR) filter which has h^-1(n) as its impulse response and by inputting to the filter the original speech signal s(n), which is not multiplied by a window function. This method is an inverse filtering method which is basically the same as that of FIG. 5A, except that in the homomorphic analysis of FIG. 5C the inverse spectral envelope filter 1/H(f) is constructed by obtaining an impulse response h^-1(n) of the inverse spectral envelope filter, whereas in FIG. 5A the inverse spectral envelope filter 1/H(f) can be directly constructed from the prediction coefficients {ai} or the reflection coefficients {ki} obtained by the linear prediction analysis method.

In the blind deconvolution based on the homomorphic analysis, the impulse response h(n) or the low-quefrency cepstrum CL(i) shown by dotted lines in FIGS. 5B and 5C can be used as the spectral envelope parameter set. When using the impulse response {h(0), h(1), . . . , h(N-1)}, a spectral envelope parameter set normally comprises a large number of parameters, with N on the order of 90-120, whereas the number of parameters can be decreased to 50-60, with N being 25-30, when using the cepstrum {CL(-N), CL(-N+1), . . . , CL(0), . . . , CL(N)}.

As described above, the voiced speech waveform s(n) i