Description
BACKGROUND OF INVENTION
1. Field of the Invention
The invention relates to a speech synthesis system and a method of synthesizing speech, and more particularly, to a speech segment coding and a pitch control method which significantly improves the quality of the synthesized speech.
The principle of the present invention can be directly applied not only to speech synthesis but also to the synthesis of other sounds, such as the sounds of musical instruments or singing, each of which has a property similar to that of speech, or
to a very low rate speech coding or speech rate conversion. The present invention will be described below concentrating on speech synthesis.
There are various speech synthesis methods for implementing a text-to-speech synthesis system, which can synthesize an unlimited vocabulary by converting text, that is, character strings, into speech. However, a method which is easy to implement and most
generally utilized is the speech segmental synthesis method, also called the synthesis-by-concatenation method. In this method, human speech is sampled and analyzed into phonetic units, such as demisyllables or diphones, to obtain short speech segments, which are
then coded and stored in memory. When text is inputted, it is converted into phonetic transcriptions; speech segments corresponding to the phonetic transcriptions are then sequentially retrieved from the memory and decoded to synthesize the
speech corresponding to the input text.
In this type of segmental speech synthesis method, one of the most important elements governing the quality of the synthesized speech is the coding method of the speech segments. In the prior art speech segmental synthesis method of the speech
synthesis system, a vocoding method of low speech quality is mainly used as the speech coding method for storing speech segments. However, this is one of the most important causes of the lowered quality of synthesized speech. A brief description of the
prior art speech segment coding method follows.
Speech coding methods can be broadly classified into waveform coding methods of good speech quality and vocoding methods of low speech quality. Since a waveform coding method intends to transfer the speech waveform as it
is, it is very difficult to change the pitch frequency and duration, so it is impossible to adjust the intonation and rate of speech when performing speech synthesis. It is also impossible to conjoin the speech segments smoothly, so the
waveform coding method is basically not suitable for coding the speech segments.
On the contrary, when the vocoding method (also called an analysis-synthesis method) is used, the pitch pattern and the duration of the speech segment can be arbitrarily changed. Further, the speech segments can also be smoothly conjoined
by interpolating the spectral envelope estimation parameters, so the vocoding method is suitable as the coding means for text-to-speech synthesis; vocoding methods such as linear predictive coding (LPC) or formant vocoding are adopted in most
present speech synthesis systems. However, since the quality of decoded speech is low when the speech is coded using the vocoding method, the synthesized speech obtained by decoding the stored speech segments and concatenating them cannot have better
speech quality than that offered by the vocoding method.
Attempts made so far to improve the speech quality offered by the vocoding method replace the impulse train with an excitation signal that has a less artificial waveform. One such attempt was to utilize a waveform having peakiness lower than
that of the impulse, for example a triangular waveform, a half circle waveform, or a waveform similar to a glottal pulse. Another attempt was to select a sample pitch pulse from one or several of the residual signal pitch periods obtained by inverse filtering
and to utilize, instead of the impulse, that one sample pulse for the entire time period or for a substantially long time period. However, such attempts to replace the impulse with an excitation pulse of another waveform have not improved the speech quality, or
have improved it only slightly, and have never obtained synthesized speech with a quality approximating that of natural speech.
It is the object of the present invention to synthesize high quality speech having naturalness and intelligibility of the same degree as human speech by utilizing a novel speech segment coding method enabling good speech quality
and pitch control. The method of the present invention combines the merits of the waveform coding method, which provides good speech quality but no ability to control the pitch, and the vocoding method, which provides pitch control but has low
speech quality.
The present invention utilizes two methods. The first is a periodic waveform decomposition method, a coding method which decomposes a signal in a voiced sound sector of the original speech into wavelets equivalent to one-period speech waveforms made by glottal
pulses, and codes and stores the decomposed signal. The second is a time warping-based wavelet relocation method, a waveform synthesis method capable of arbitrary adjustment of the duration and pitch frequency of the speech segment while maintaining the
quality of the original speech by selecting, among the stored wavelets, those nearest to the positions where wavelets are to be placed, then decoding the selected wavelets and superposing them. For purposes of this invention musical sounds are treated as
voiced sounds.
The preceding objects should be construed as merely presenting a few of the more pertinent features and applications of the invention. Many other beneficial results can be obtained by applying the disclosed invention in a different manner or
modifying the invention within the scope of the disclosure. Accordingly, other objects and a fuller understanding of the invention may be had by referring to both the summary of the invention and the detailed description, below, which describe the
preferred embodiment in addition to the scope of the invention defined by the claims considered in conjunction with the accompanying drawings.
SUMMARY OF THE INVENTION
Speech segment coding and pitch control methods for speech synthesis systems of the present invention are defined by the claims with specific embodiments shown in the attached drawings. For the purpose of summarizing the invention, the invention
relates to a method capable of synthesizing speech that approximates the quality of natural speech while adjusting its duration and pitch frequency, by waveform-coding wavelets of each period, storing them in memory, and at the time of synthesis, decoding them
and locating them at appropriate time points such that they have the desired pitch pattern and then superposing them to generate natural speech, singing, music and the like.
The present invention includes a speech segment coding method for use with a speech synthesis system, where the method comprises the forming of wavelets by obtaining parameters which represent a spectral envelope in each analysis time interval.
This is done by analyzing a periodic or quasi-periodic digital signal, such as voiced speech, with a spectrum estimation technique. The original signal is first deconvolved into an impulse response represented by the spectral envelope parameters and a
periodic or quasi-periodic pitch pulse train signal having a nearly flat spectral envelope. A wavelet for each period is then formed by convolving two signals: an excitation signal, obtained by appending zero-valued samples after a one-period pitch pulse signal that results from segmenting the pitch pulse train signal period
by period so that one pitch pulse is contained in each period; and an impulse response corresponding to a set of spectral envelope parameters in the same time interval as the excitation signal.
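The wavelet formation step described above can be illustrated with a minimal sketch. The function names, the toy sample values, and the choice to truncate the wavelet to the length of the zero-padded excitation are assumptions made for illustration only; they are not taken from the specification.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def form_wavelet(pitch_pulse, impulse_response, n_zeros):
    """Append zero-valued samples to a one-period pitch pulse signal,
    then convolve the resulting excitation with the impulse response."""
    excitation = list(pitch_pulse) + [0.0] * n_zeros
    full = convolve(excitation, impulse_response)
    # Assumption: keep the wavelet as long as the zero-padded excitation.
    return full[:len(excitation)]


# Toy data (assumed): a 3-sample pitch pulse and a decaying impulse response.
pulse = [1.0, 0.5, 0.0]
h = [1.0, 0.5, 0.25]
wavelet = form_wavelet(pulse, h, n_zeros=3)
print(wavelet)
```

The zero-valued samples give the response of the filter room to decay, so the wavelet may extend well beyond one pitch period.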
The wavelets, rather than being formed by waveform-coding and stored in memory in advance, may be formed by mating information obtained by waveform-coding a pitch pulse signal of each period interval, obtained by segmentation, with information
obtained by coding a set of spectral envelope estimation parameters with the same time interval as the above information, or with an impulse response corresponding to the parameters and storing the wavelet information in memory. There are two methods of
producing synthetic speech by using the wavelet information stored in memory. The first method is to constitute each wavelet by convolving an excitation signal obtained by appending zero-valued samples after a pitch pulse signal of one period obtained
by decoding the information and an impulse response corresponding to the decoded spectral envelope parameters in the same time interval as the excitation signal, and then to assign the wavelets to appropriate time points such that they have desired pitch
pattern and duration pattern, locate them at the time points, and then superpose them.
The second method is to constitute a synthetic excitation signal by assigning the pitch pulse signals obtained by decoding the wavelet information to appropriate time points such that they have desired pitch pattern and duration pattern and
locating them at the time points, and constitute a set of synthetic spectral envelope parameters either by temporally compressing or expanding the set of time functions of the parameters on a subsegment-by-subsegment basis, depending on whether the
duration of a subsegment in a speech segment to be synthesized is shorter or longer than that of a corresponding subsegment in the original speech segment, respectively, or by locating the set of time functions of the parameters of one period
synchronously with the mated pitch pulse signal of one period located to form the synthetic excitation signal, and to convolve the synthetic excitation signal and an impulse response corresponding to the synthetic spectral envelope parameter set by
utilizing a time-varying filter or by using an FFT (Fast Fourier Transform)-based fast convolution technique. In the latter method, a blank interval occurs when a desired pitch period is longer than the original pitch period and an overlap interval
occurs when the desired pitch period is shorter than the original pitch period.
In the overlap interval, the synthetic excitation signal is obtained by adding the overlapped pitch pulse signals to each other or by selecting one of them, and the spectral envelope parameter is obtained by selecting either one of the overlapped
spectral envelope parameters or by using an average value of the two overlapped parameters.
In the blank interval, the synthetic excitation signal is obtained by filling it with zero-valued samples, and the synthetic spectral envelope parameter is obtained by repeating the values of the spectral envelope parameters at the beginning and
ending points of the preceding and following periods before and after the center of the blank interval, or by repeating one of the two values or an average value of the two values, or by filling it with values that smoothly connect the two values.
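The handling of overlap and blank intervals can be sketched as follows. This is an illustrative simplification under stated assumptions: each period carries a single scalar parameter value per sample, every period is non-empty, overlapped pitch pulses are added, overlapped parameters are averaged, and blank intervals are filled by straight-line interpolation (one of several options listed above). The function names are hypothetical.

```python
def build_excitation(pulses, desired_period):
    """Place one-period pitch pulse signals `desired_period` samples apart."""
    length = max(j * desired_period + len(p) for j, p in enumerate(pulses))
    excitation = [0.0] * length  # blank intervals remain zero-valued
    for j, pulse in enumerate(pulses):
        start = j * desired_period
        for k, value in enumerate(pulse):
            excitation[start + k] += value  # overlapped samples are added
    return excitation


def build_parameter_track(param_periods, desired_period):
    """Relocate per-sample parameter values in step with the pitch pulses."""
    length = max(j * desired_period + len(p) for j, p in enumerate(param_periods))
    total = [0.0] * length
    count = [0] * length
    for j, params in enumerate(param_periods):
        start = j * desired_period
        for k, value in enumerate(params):
            total[start + k] += value
            count[start + k] += 1
    # Overlap intervals: average the overlapped values.
    track = [t / c if c else None for t, c in zip(total, count)]
    # Blank intervals: connect the neighbouring values with a straight line.
    n = 0
    while n < length:
        if track[n] is None:
            left, right = n - 1, n
            while track[right] is None:
                right += 1
            for m in range(n, right):
                frac = (m - left) / (right - left)
                track[m] = track[left] + frac * (track[right] - track[left])
            n = right
        else:
            n += 1
    return track


pulses = [[1.0, 0.0], [2.0, 0.0]]   # original pitch period: 2 samples
params = [[0.0, 0.0], [4.0, 4.0]]   # one scalar parameter per sample (assumed)
longer = (build_excitation(pulses, 3), build_parameter_track(params, 3))
shorter = (build_excitation(pulses, 1), build_parameter_track(params, 1))
print(longer)
print(shorter)
```

With a desired period of 3 a one-sample blank interval appears between the two periods; with a desired period of 1 the two periods overlap by one sample.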
The present invention further includes a pitch control method of a speech synthesis system capable of controlling duration and pitch of a speech segment by a time warping-based wavelet relocation method which makes it possible to synthesize
speech with almost the same quality as that of natural speech, by coding important boundary time points such as the starting point, the end point and the steady-state points in a speech segment and pitch pulse positions of each wavelet or each pitch
pulse signal and storing them in memory simultaneously at the time of storing each speech segment, and at the time of synthesis, obtaining a time-warping function by comparing desired boundary time points and original boundary time points stored
corresponding to the desired boundary time points, finding out the original time points corresponding to each desired pitch pulse position by utilizing the time-warping function, selecting wavelets having pitch pulse positions nearest to the original
time points and locating them at desired pitch pulse positions, and superposing the wavelets.
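A minimal sketch of the selection step follows. It assumes a piecewise-linear time-warping function between corresponding boundary time points, which is one plausible reading of the description rather than a form the specification prescribes; all names and the sample positions are hypothetical.

```python
from bisect import bisect_right


def make_time_warp(desired_bounds, original_bounds):
    """Return a piecewise-linear map from desired time to original time,
    built from corresponding boundary time points (assumed form)."""
    def warp(t):
        i = bisect_right(desired_bounds, t) - 1
        i = max(0, min(i, len(desired_bounds) - 2))
        d0, d1 = desired_bounds[i], desired_bounds[i + 1]
        o0, o1 = original_bounds[i], original_bounds[i + 1]
        return o0 + (t - d0) * (o1 - o0) / (d1 - d0)
    return warp


def select_wavelets(desired_positions, original_positions, warp):
    """For each desired pitch pulse position, pick the index of the stored
    wavelet whose original pitch pulse position is nearest to the warped time."""
    selected = []
    for t in desired_positions:
        t_orig = warp(t)
        nearest = min(range(len(original_positions)),
                      key=lambda k: abs(original_positions[k] - t_orig))
        selected.append(nearest)
    return selected


# Toy segment: the steady-state interval (100..300) is stretched to 100..500.
original_bounds = [0, 100, 300, 400]
desired_bounds = [0, 100, 500, 600]
original_positions = [50, 150, 250, 350]          # four stored wavelets
desired_positions = [50, 150, 250, 350, 450, 550]  # same pitch, longer duration

warp = make_time_warp(desired_bounds, original_bounds)
selected = select_wavelets(desired_positions, original_positions, warp)
print(selected)
```

In this toy case the wavelets in the stretched interval are each used twice, while those in the unchanged intervals are used once.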
The pitch control method may further include producing synthetic speech by selecting pitch pulse signals of one period and spectral envelope parameters corresponding to the pitch pulse signals, instead of the wavelets, and locating them, and
convolving the located pitch pulse signals and impulse response corresponding to the spectral envelope parameters to produce wavelets and superposing the produced wavelets, or convolving a synthetic excitation signal obtained by superposing the located
pitch pulse signals and a time-varying impulse response corresponding to a set of synthetic spectral envelope parameters made by concatenating the located spectral envelope parameters.
A voiced speech synthesis device of a speech synthesis system is disclosed and includes a decoding subblock 9 producing wavelet information by decoding wavelet codes from the speech segment storage block 5. A duration control subblock 10
produces time-warping data from input of duration data from a prosodics generation subsystem 2 and boundary time points included in header information from the speech segment storage block 5. A pitch control subblock 11 produces pitch pulse position
information such that it has an intonation pattern as indicated by intonation pattern data, from input of the header information from the speech segment storage block 5, the intonation pattern data from the prosodics generation subsystem 2 and the
time-warping information from the duration control subblock 10. An energy control subblock 12 produces gain information such that synthesized speech has the stress pattern as indicated by stress pattern data from input of the stress pattern data from
the prosodics generation subsystem 2, the time-warping information from the duration control subblock 10 and pitch pulse position information from the pitch control subblock 11. A waveform assembly subblock 13 produces a voiced speech signal from input
of the wavelet information from the decoding subblock 9, the time-warping information from the duration control subblock 10, the pitch pulse position information from the pitch control subblock 11 and the gain information from the energy control subblock
12.
Thus, according to the present invention, text is inputted to the phonetic preprocessing subsystem 1 where it is converted into phonetic transcriptive symbols and syntactic analysis data. The syntactic analysis data is outputted to a prosodics
generation subsystem 2. The prosodics generation subsystem 2 outputs prosodic information to the speech segment concatenation subsystem 3. The phonetic transcriptive symbols output from the preprocessing subsystem 1 are also inputted to the speech segment
concatenation subsystem 3. The phonetic transcriptive symbols are then inputted to the speech segment selection block 4 and the corresponding prosodic data are inputted to the voiced sound synthesis block 6 and to the unvoiced sound synthesis block 7.
In the speech segment selection block 4 each input phonetic transcriptive symbol is matched with a corresponding speech segment synthesis unit and a memory address of the matched synthesis unit corresponding to each input phonetic transcriptive symbol is
looked up in a speech segment table in the speech segment storage block 5. The address of the matched synthesis unit is then outputted to the speech segment storage block 5 where the corresponding speech segment in coded wavelet form is selected for
each of the addresses of the matched synthesis units. The selected speech segment in coded wavelet form is outputted to the voiced sound synthesis block 6 for voiced sound and to the unvoiced sound synthesis block 7 for unvoiced sound. The voiced sound
synthesis block 6, which uses the time warping-based wavelet relocation method to synthesize speech sound, and the unvoiced sound synthesis block 7 output digital synthetic speech signals to the digital-to-analog converter, which converts the input
digital signals into analog signals which are the synthesized speech sounds.
To utilize the present invention, speech and/or music is first recorded on magnetic tape. The resulting sound is then converted from analog signals to digital signals by low-pass filtering the analog signals and then feeding the filtered signals
to an analog-to-digital converter. The resulting digitized speech signals are then segmented into a number of speech segments having sounds which correspond to synthesis units, such as phonemes, diphones, demisyllables and the like, by using known
speech editing tools. Each resulting speech segment is then differentiated into voiced and unvoiced speech segments by using known voiced/unvoiced detection and speech editing tools. The unvoiced speech segments are encoded by known vocoding methods
which use white random noise as an unvoiced speech source. The vocoding methods include LPC, homomorphic, formant vocoding methods, and the like.
The voiced speech segments are used to form wavelets sj(n) according to the method disclosed below in FIG. 4. The wavelets sj(n) are then encoded by using an appropriate waveform coding method. Known waveform coding methods include Pulse Code
Modulation (PCM), Adaptive Differential Pulse Code Modulation (ADPCM), Adaptive Predictive Coding (APC) and the like. The resulting encoded voiced speech segments are stored in the speech segment storage block 5 as shown in FIGS. 6A and 6B. The encoded
unvoiced speech segments are also stored in the speech segment storage block 5.
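As a rough illustration of waveform-coding a wavelet, the following sketch applies uniform PCM quantization. It assumes samples normalized to the range -1 to 1 and a 16-bit code; the function names are hypothetical, and ADPCM or APC would need additional predictor state that is not shown here.

```python
def pcm_encode(samples, bits=16):
    """Quantize samples in [-1, 1] to signed integer PCM codes."""
    scale = 2 ** (bits - 1) - 1
    return [max(-scale, min(scale, round(x * scale))) for x in samples]


def pcm_decode(codes, bits=16):
    """Map signed integer PCM codes back to samples in [-1, 1]."""
    scale = 2 ** (bits - 1) - 1
    return [c / scale for c in codes]


# Toy wavelet samples (assumed values).
wavelet = [0.0, 1.0, -0.5, 0.25]
codes = pcm_encode(wavelet)
decoded = pcm_decode(codes)
print(codes)
```

The decoded samples differ from the originals by at most half a quantization step, which is why waveform coding preserves the shape of each wavelet closely.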
The more pertinent and important features of the present invention have been outlined above in order that the detailed description of the invention which follows will be better understood and that the present contribution to the art can be fully
appreciated. Additional features of the invention described hereinafter form the subject of the claims of the invention. Those skilled in the art can appreciate that the conception and the specific embodiment disclosed herein may be readily utilized as
a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Further, those skilled in the art can realize that such equivalent constructions do not depart from the spirit and scope of the invention as
set forth in the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
For a fuller understanding of the nature and objects of the invention, reference should be had to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 illustrates the text-to-speech synthesis system of the speech segment synthesis method;
FIG. 2 illustrates the speech segment concatenation subsystem;
FIGS. 3A through 3T illustrate waveforms for explaining the principle of the periodic waveform decomposition method and the wavelet relocation method according to the present invention;
FIG. 4 illustrates a block diagram for explaining the periodic waveform decomposition method;
FIGS. 5A through 5E illustrate block diagrams for explaining the procedure of the blind deconvolution method;
FIGS. 6A and 6B illustrate code formats for the voiced speech segment information stored at the speech segment storage block;
FIG. 7 illustrates the voiced speech synthesis block according to the present invention; and
FIGS. 8A and 8B illustrate graphs for explaining the duration and pitch control method according to the present invention.
Similar reference characters refer to similar parts throughout the several views of the drawings.
DETAILED DESCRIPTION OF THE INVENTION
The structure of the text-to-speech synthesis system of the prior art speech segment synthesis method, shown in FIG. 1, consists of three subsystems:
A. A phonetic preprocessing subsystem (1);
B. A prosodics generation subsystem (2); and
C. A speech segment concatenation subsystem (3).
When the text is input from a keyboard, a computer or any other system, to the text-to-speech synthesis system, the phonetic preprocessing subsystem (1) analyzes the syntax of
the text and then changes the text to a string of phonetic transcriptive symbols by applying thereto phonetic recoding rules. The prosodics generation subsystem (2) generates intonation pattern data and stress pattern data utilizing syntactic analysis
data so that appropriate intonation and stress can be applied to the string of phonetic transcriptive symbols, and then outputs the data to the speech segment concatenation subsystem (3). The prosodics generation subsystem (2) also provides the data
with respect to the duration of each phoneme to the speech segment concatenation subsystem (3).
The above three prosodic data, i.e. the intonation pattern data, the stress pattern data and the data regarding the duration of each phoneme are, in general, sent to the speech segment concatenation subsystem (3) together with the string of the
phonetic transcriptive symbols generated by the phonetic preprocessing subsystem (1), although they may be transferred to the speech segment concatenation subsystem (3) independently of the string of the phonetic transcriptive symbols.
The speech segment concatenation subsystem (3) generates continuous speech by sequentially fetching appropriate speech segments which are coded and stored in memory thereof according to the string of the phonetic transcriptive symbols (not shown)
and by decoding them. At this time the speech segment concatenation subsystem (3) can generate synthetic speech having the intonation, stress and speech rate as intended by the prosodics generation subsystem (2) by controlling the energy (intensity),
the duration and the pitch period of each speech segment according to the prosodic information.
The present invention remarkably improves speech quality in comparison with synthesized speech of the prior art by improving the coding method for storing the speech segments in the speech segment concatenation subsystem (3). A description with
respect to the operation of the speech segment concatenation subsystem (3) with reference to FIG. 2 follows.
When the string of the phonetic transcriptive symbols formed by the phonetic preprocessing subsystem (1) is inputted to the speech segment selection block (4), the speech segment selection block (4) sequentially selects the synthesis units, such
as diphones and demisyllables, by continuously inspecting the string of incoming phonetic transcriptive symbols, and looks up the addresses of the speech segments corresponding to the selected synthesis units in a speech segment table held in its memory.
Table 1 shows an example of the speech segment table kept in the speech segment selection block (4) which selects diphone-based speech segments. The address of the selected speech segment is then output to the speech
segment storage block (5).
The speech segments corresponding to these addresses are coded according to the method of the present invention, to be described later, and are stored at those addresses in the memory of the speech segment storage block (5).
TABLE 1
______________________________________
phonetic transcriptive    memory address of speech segment
symbol                    (in hexadecimal)
______________________________________
/ai/                      0000
/au/                      0021
/ab/                      00A3
/ad/                      00FF
. . .                     . . .
______________________________________
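The lookup performed by the speech segment selection block can be sketched as a simple mapping from synthesis unit to memory address. The dictionary below holds only the four example entries of Table 1; the function name and the error handling are assumptions for illustration.

```python
# Example entries of Table 1: diphone symbol -> memory address (hexadecimal).
SEGMENT_TABLE = {
    "ai": 0x0000,
    "au": 0x0021,
    "ab": 0x00A3,
    "ad": 0x00FF,
}


def select_segment_addresses(symbols):
    """Return the memory address of the speech segment for each symbol."""
    addresses = []
    for symbol in symbols:
        if symbol not in SEGMENT_TABLE:
            # Assumed behavior: an unknown synthesis unit is an error.
            raise KeyError(f"no speech segment stored for /{symbol}/")
        addresses.append(SEGMENT_TABLE[symbol])
    return addresses


addresses = select_segment_addresses(["ab", "ai"])
print([f"{a:04X}" for a in addresses])
```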
When the address of the selected speech segment from the speech segment selection block (4) is inputted to the speech segment storage block (5), the speech segment storage block (5) fetches the corresponding speech segment data from the memory in
the speech segment storage block (5) and sends it to a voiced sound synthesis block (6) if it is a voiced sound or a voiced fricative sound, or to an unvoiced sound synthesis block (7) if it is an unvoiced sound. That is, the voiced sound synthesis
block (6) synthesizes a digital speech signal corresponding to the voiced sound speech segments; and, the unvoiced sound synthesis block (7) synthesizes a digital speech signal corresponding to the unvoiced sound speech segment. Each digital synthesized
speech signal of the voiced sound synthesis block (6) and the unvoiced sound synthesis block (7) is then converted into an analog signal.
Thus, the resulting digital synthesized speech signal output from the voiced sound synthesis block (6) or unvoiced sound synthesis block (7) is then sent to a D/A conversion block (8) consisting of a digital-to-analog converter, an analog
low-pass filter and an analog amplifier, and is converted into an analog signal to provide synthesized speech sound.
When the voiced sound synthesis block (6) and the unvoiced sound synthesis block (7) concatenate the speech segments, they provide the prosody as intended by the prosodics generation subsystem (2) to synthesized speech by properly adjusting the
duration, the intensity and the pitch frequency of the speech segment on the basis of the prosodic information, i.e., intonation pattern data, stress pattern data, duration data.
The preparation of the speech segment for storage in the speech segment storage block (5) is as follows. A synthesis unit is first selected. Such synthesis units include the phoneme, allophone, diphone, syllable, demisyllable, CVC, VCV, CV and VC units
(here, "C" stands for a consonant and "V" for a vowel phoneme) or combinations thereof. The synthesis units which are most widely used in current speech synthesis methods are the diphones and the demisyllables.
The speech segment corresponding to each element of an aggregation of the synthesis units is segmented from speech samples which are actually pronounced by a human. Accordingly, the number of elements of the synthesis unit aggregation is the
same as the number of speech segments. For example, in the case where demisyllables are used as the synthesis units in English, the number of demisyllables is about 1000 and, accordingly, the number of speech segments is also about 1000. In general,
such speech segments consist of an unvoiced sound interval and a voiced sound interval.
In the present invention, the unvoiced speech segment and the voiced speech segment obtained by segmenting the prior art speech segment into the unvoiced sound interval and the voiced sound interval are used as the basic synthesis unit. The
unvoiced sound speech synthesis portion is accomplished according to the prior art as discussed below. The voiced sound speech synthesis is accomplished according to the present invention.
Thus, the unvoiced speech segments are decoded at the unvoiced sound synthesis block (7) shown in FIG. 2. In decoding unvoiced sound, it has been noted in the prior art that the use of an artificial white random noise signal as an
excitation signal for a synthesis filter does not degrade the quality of the decoded speech. Therefore, in the coding and decoding of the unvoiced speech segments, the prior art vocoding method, in which
white noise is used as the excitation signal, can be applied as it is. For example, in the prior art synthesis of unvoiced sound, the white noise signal can be generated by a random number generation algorithm, or a white noise signal generated in advance
and stored in memory can be retrieved from memory when synthesizing, or a residual signal obtained by filtering the unvoiced sound interval of the actual speech with an inverse spectral envelope filter and stored in memory can be retrieved from
memory when synthesizing. If it is not necessary to change the duration of the unvoiced speech segment, an extremely simple coding method can be utilized in which the unvoiced sound portion is coded according to a waveform coding method such as Pulse
Code Modulation (PCM) or Adaptive Differential Pulse Code Modulation (ADPCM) and is stored; it is then decoded for use when synthesizing.
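The first of these options, white noise generated by a random number algorithm and passed through a synthesis filter, can be sketched as below. The all-pole filter form 1/A(z) with A(z) = 1 + a1 z^-1 + ... + ap z^-p is the usual LPC convention and is assumed here; the function name, the gain parameter, and the coefficient value are illustrative only.

```python
import random


def synthesize_unvoiced(lpc_coeffs, gain, n_samples, seed=0):
    """Excite an all-pole synthesis filter 1/A(z) with white Gaussian noise.

    lpc_coeffs holds a1..ap of A(z) = 1 + a1*z^-1 + ... + ap*z^-p (assumed form).
    """
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        x = gain * rng.gauss(0.0, 1.0)      # white noise excitation sample
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                x -= a * out[n - k]         # recursive (all-pole) part
        out.append(x)
    return out


# Toy first-order filter (assumed coefficient) driven by 50 noise samples.
noise = synthesize_unvoiced([], 1.0, 50, seed=1)
unvoiced = synthesize_unvoiced([-0.5], 1.0, 50, seed=1)
print(len(unvoiced))
```

Because the excitation is regenerated at synthesis time, the duration of the unvoiced segment can be changed simply by requesting more or fewer samples.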
The present invention relates to a coding and synthesis method of the voiced speech segments which governs the quality of the synthesized speech. A description of such a method follows, with emphasis on the speech segment storage block (5)
and the voiced sound synthesis block (6) shown in FIG. 2.
The voiced speech segments among the speech segments stored in the memory of the speech segment storage block (5) are decomposed into wavelets of pitch periodic component in advance according to the periodic-waveform decomposition method of the
present invention and stored therein. The voiced sound synthesis block (6) synthesizes speech having the desired pitch and the duration patterns by properly selecting and arranging the wavelets according to the time warping-based wavelet relocation
method. The principle of these methods is described below with reference to the drawings.
Voiced speech s(n) is a periodic signal obtained when a periodic glottal wave generated at the vocal cords passes through the acoustical vocal tract filter V(f) consisting of the oral cavity, pharyngeal cavity and nasal cavity. Here, it is
assumed that the vocal tract filter V(f) includes frequency characteristic due to a lip radiation effect. A spectrum S(f) of voiced speech is characterized by:
1. A fine structure varying rapidly with respect to frequency "f"; and
2. A spectral envelope varying slowly with respect to frequency.
The former is due to the periodicity of the voiced speech signal, and the latter reflects the spectrum of a glottal pulse and the frequency characteristic of the vocal tract filter.
The spectrum S(f) of the voiced speech takes the same form as the form obtained when the fine structure of an impulse train due to harmonic components which exist at integer multiples of the pitch frequency Fo is multiplied by a spectral envelope
function H(f). Therefore, voiced speech s(n) can be regarded as an output signal when a periodic pitch pulse train signal e(n) having a flat spectral envelope and the same period as the voiced speech s(n) is input to a time-varying filter having the
same frequency response characteristic as the spectral envelope function H(f) of the voiced speech s(n). Viewing this in the time domain, the voiced speech s(n) is a convolution of an impulse response h(n) of the filter H(f) and the periodic pitch pulse
train signal e(n). Since H(f) corresponds to the spectral envelope function of the voiced speech s(n), the time-varying filter having H(f) as its frequency response characteristic is referred to as a spectral envelope filter or a synthesis filter.
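Written out, the relation described in this paragraph takes the following form. This is a sketch that treats h(n) as fixed within one analysis interval, which is a simplifying assumption; E(f) denotes the spectrum of e(n), a symbol introduced here for illustration.

```latex
s(n) = e(n) * h(n) = \sum_{k} e(k)\, h(n-k)
\quad\Longleftrightarrow\quad
S(f) = E(f)\, H(f)
```

Here the envelope of |E(f)| is approximately flat, so the envelope of S(f) is carried entirely by H(f).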
In FIG. 3A, a signal for 4 periods of a glottal waveform is illustrated. Commonly, the waveforms of the glottal pulses composing the glottal waveform are similar to each other but not completely identical, and also the interval time between the
adjacent glottal pulses is similar to each other but not completely equal. As described above, the voiced speech waveform s(n) of FIG. 3C is generated when the glottal waveform g(n) shown in FIG. 3A is filtered by the vocal tract filter V(f). The
glottal waveform g(n) consists of the glottal pulses g1(n), g2(n), g3(n) and g4(n) distinguished from each other in terms of time, and when they are filtered by the vocal tract filter V(f), the wavelets s1(n), s2(n), s3(n) and s4(n) shown in FIG. 3B are
generated. The voiced speech waveform s(n) shown in FIG. 3C is generated by superposing such wavelets.
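The superposition property follows from the linearity of the filter, and can be checked numerically with a small sketch. The toy impulse response and glottal pulse values below are assumptions, and the filter is held fixed over the whole signal for simplicity.

```python
def convolve(x, h):
    """Full linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def place(pulse, start, length):
    """Return a zero signal of `length` samples with `pulse` placed at `start`."""
    out = [0.0] * length
    for k, value in enumerate(pulse):
        out[start + k] += value
    return out


v = [1.0, 0.5, 0.25, 0.125]      # toy vocal tract impulse response (assumed)
pulse = [1.0, -0.5]              # toy glottal pulse (assumed)
starts = [0, 3, 6, 9]            # four pitch periods of 3 samples each
length = 12

glottal_pulses = [place(pulse, s, length) for s in starts]   # g1(n)..g4(n)
g = [sum(column) for column in zip(*glottal_pulses)]          # g(n)

wavelets = [convolve(gj, v) for gj in glottal_pulses]         # s1(n)..s4(n)
s_superposed = [sum(column) for column in zip(*wavelets)]
s_direct = convolve(g, v)                                     # s(n)

max_diff = max(abs(a - b) for a, b in zip(s_superposed, s_direct))
print(max_diff)
```

Each toy wavelet here spans five samples while the pitch period is three, so adjacent wavelets overlap in time just as in FIG. 3B.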
A basic concept of the present invention is that if one can obtain the wavelets which compose a voiced speech signal by decomposing the voiced speech signal, one can synthesize speech with arbitrary accent and intonation pattern by changing the
intensity of the wavelets and the time intervals between them.
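This concept can be sketched as an overlap-add of wavelets at new positions with per-wavelet gains. The function name and the toy values are assumptions for illustration; the same wavelet is reused three times only to keep the example small.

```python
def superpose(wavelets, positions, gains, length):
    """Scale each wavelet by its gain, place it at its position, and add."""
    out = [0.0] * length
    for wavelet, position, gain in zip(wavelets, positions, gains):
        for k, value in enumerate(wavelet):
            if 0 <= position + k < length:
                out[position + k] += gain * value
    return out


wavelet = [1.0, 0.5, 0.25]            # toy wavelet (assumed)
wavelets = [wavelet, wavelet, wavelet]
positions = [0, 2, 4]                 # closer spacing gives a higher pitch
gains = [1.0, 1.0, 2.0]               # larger gain gives a stronger stress
speech = superpose(wavelets, positions, gains, length=8)
print(speech)
```

Changing `positions` alters the pitch pattern and changing `gains` alters the intensity pattern, while the shape of each wavelet is left untouched.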
Because the voiced speech waveform s(n) shown in FIG. 3C was generated by superposing the wavelets which overlap with each other in time, it is difficult to get the wavelets back from the speech waveform s(n).
In order for the waveform of each period not to overlap with each other in the time domain, the waveform must be a peaky waveform in which the energy is concentrated about one point in time, as seen in FIG. 3F.
A spiky waveform is a waveform that has a nearly flat spectral envelope in the frequency domain. When a voiced speech waveform s(n) is given, a periodic pitch pulse train signal e(n) having a flat spectral envelope as shown in FIG. 3F can be
obtained as output by estimating the envelope of the spectrum S(f) of the waveform s(n) and inputting the waveform s(n) into an inverse spectral envelope filter 1/H(f) whose frequency characteristic is the inverse of the envelope function H(f). FIGS. 4, 5A and 5B are
related to this step.
Because the pitch pulse waveforms of each period composing the periodic pitch pulse train signal e(n) as shown in FIG. 3F do not overlap with one another in the time domain, they can be separated. The principle of the periodic-waveform
decomposition method is that because the separated "pitch pulse signals for one period" e1(n), e2(n), . . . have a substantially flat spectrum, if they are input back to the spectral envelope filter H(f) so that the signals have the original spectrum,
then the wavelets s1(n), s2(n), etc. as shown in FIG. 3B can be obtained.
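The principle can be sketched, by way of example only, in Python as follows. The sketch assumes a fixed, already known all-pole spectral envelope filter with hypothetical prediction coefficients a, whereas the method described here estimates a time-varying envelope from the speech itself; the function names are likewise illustrative. The signal is inverse filtered, the residual is cut into one-period pieces, and each piece is passed back through the spectral envelope filter to give one wavelet.

```python
def synthesis_filter(e, a):
    """All-pole filter H: s(n) = e(n) + sum_k a[k-1] * s(n-k)."""
    s = []
    for n, en in enumerate(e):
        acc = en
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * s[n - k]
        s.append(acc)
    return s

def inverse_filter(s, a):
    """Inverse filter 1/H: e(n) = s(n) - sum_k a[k-1] * s(n-k)."""
    e = []
    for n, sn in enumerate(s):
        acc = sn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * s[n - k]
        e.append(acc)
    return e

a = [1.2, -0.72]          # hypothetical, stable envelope filter
period, n_periods = 30, 3
length = period * n_periods

# Hypothetical pitch pulse train: one short pulse per period.
e = [0.0] * length
for p, (amp0, amp1) in enumerate([(1.0, 0.3), (0.9, 0.35), (1.1, 0.25)]):
    e[p * period] = amp0
    e[p * period + 1] = amp1

s = synthesis_filter(e, a)          # voiced waveform with overlapping wavelets
residual = inverse_filter(s, a)     # pulses no longer overlap in time

# Separate the residual into "pitch pulse signals for one period".
pieces = [
    [residual[n] if p * period <= n < (p + 1) * period else 0.0
     for n in range(length)]
    for p in range(n_periods)
]

# Passing each piece back through H(f) yields one wavelet per period.
wavelets = [synthesis_filter(piece, a) for piece in pieces]
rebuilt = [sum(w[n] for w in wavelets) for n in range(length)]
```

Because both filters are linear, the wavelets obtained this way sum back to the original waveform, so their spacing and intensity can be altered before superposition.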
FIG. 4 is a block diagram of the periodic-waveform decomposition method of the present invention in which the voiced speech segment is analyzed into wavelets. The voiced speech waveform s(n), which is a digital signal, is obtained by
band-limiting the analog voiced speech signal or musical instrumental sound signal with a low pass filter, converting the resulting signal into digital form by analog-to-digital conversion, and storing it on a magnetic disc in the Pulse Code Modulation (PCM)
code format by grouping several bits at a time; it is then retrieved for processing when needed.
The first stage of wavelet preparation process according to the periodic-waveform decomposition method is a blind deconvolution in which the voiced speech waveform s(n) (periodic signal s(n)) is deconvolved into an impulse response h(n), which is
a time domain function of the spectrum envelope function H(f) of the signal s(n), and a periodic pitch pulse train signal e(n) having a flat spectral envelope and the same period as the signal s(n). See FIGS. 5A and 5B and the discussion related
thereto.
As described, for the blind deconvolution, a spectrum estimation technique with which the spectral envelope function H(f) is estimated from the signal s(n) is essential.
Prior art spectrum estimation techniques can be classified into 3 methods, depending on the length of the analysis interval:
1. A block analysis method;
2. A pitch-synchronous analysis method; and
3. A sequential analysis method.
The block analysis method is a method in which the speech signal is divided into blocks of constant duration of the order of 10-20 ms (milliseconds), and then the analysis is done with respect to the constant number of speech samples existing in
each block, obtaining one set (commonly 10-16 parameters) of spectral envelope parameters for each block, for which method a homomorphic analysis method and a block linear prediction analysis method are typical.
The pitch-synchronous analysis method obtains one set of spectral envelope parameters for each period by performing analysis on each period speech signal which was obtained by dividing the speech signal with the pitch period as the unit (as shown
in FIG. 3C), for which method the analysis-by-synthesis method and the pitch-synchronous linear prediction analysis method are typical.
In the sequential analysis method, one set of spectral envelope parameters is obtained for each speech sample (as shown in FIG. 3D) by estimating the spectrum for each speech sample, for which method the least squares method and the recursive
least squares method, which are kinds of adaptive filtering methods, are typical.
FIG. 3D shows variation with time of the first 4 reflection coefficients among 14 reflection coefficients k1, k2, . . . , k14 which constitute a spectral envelope parameter set obtained by the sequential analysis method. (Please refer to FIG.
5A.) As can be seen from the drawing, the values of the spectral envelope parameters change continuously due to continuous movement of the articulatory organs, which means that the impulse response h(n) of the spectral envelope filter continuously
changes. Here, for convenience of explanation, assuming that h(n) does not change in an interval of one period, h(n) during the first, second and third period is denoted respectively as h(n)1, h(n)2, h(n)3 as shown in FIG. 3E.
A set of envelope parameters obtained by various spectrum estimation techniques, such as a cepstrum CL(i) which is a parameter set obtained by the homomorphic analysis method, or a prediction coefficient set {ai}, a reflection coefficient set
{ki}, or a set of line spectrum pairs, etc. which is obtained by applying the recursive least squares method or the linear prediction method, is treated as equivalent to the H(f) or h(n), because the frequency characteristic H(f) or the impulse
response h(n) of the spectral envelope filter can be derived from it. Therefore, hereinafter, the impulse response is also referred to as the spectral envelope parameter set.
FIGS. 5A and 5B show methods of the blind deconvolution.
FIG. 5A shows a blind deconvolution method performed by using the linear prediction analysis method or by using the recursive least squares method which are both prior art methods. Given the voiced speech waveform s(n), as shown in FIG. 3C, the
prediction coefficients (a1, a2, . . . , aN) or the reflection coefficients (k1, k2, . . . , kN) which are the spectral envelope parameters representing the frequency characteristic H(f) or the impulse response h(n) of the spectral envelope filter are
obtained utilizing the linear prediction analysis method or the recursive least squares method. Normally a prediction order N of 10-16 is sufficient. Utilizing the prediction coefficients (a1, a2, . . . , aN) or the
reflection coefficients (k1, k2, . . . , kN) as the spectral envelope parameters, an inverse spectral envelope filter (or simply an inverse filter) having the frequency characteristic 1/H(f), which is the inverse of the frequency
characteristic H(f) of the spectral envelope filter, can easily be constructed by one skilled in the art. If the voiced speech waveform is input to the inverse spectral envelope filter, also referred to as a linear prediction error filter in the
linear prediction analysis method or in the recursive least squares method, the periodic pitch pulse train signal of the type of FIG. 3F having the flat spectral envelope, called a prediction error signal or a residual signal, can be obtained as output
from the filter.
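A minimal sketch of this inverse filtering route, given by way of example only, follows in Python. It assumes the autocorrelation form of block linear prediction solved by the Levinson-Durbin recursion, a rectangular analysis window, and a hypothetical second-order test signal; the function names are illustrative, and the sign convention of the reflection coefficients varies between texts.

```python
def autocorrelation(x, max_lag):
    """r(m) = sum_n x(n) * x(n + m) for m = 0..max_lag."""
    return [sum(x[n] * x[n + m] for n in range(len(x) - m))
            for m in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return prediction coefficients a1..aN, reflection coefficients
    k1..kN and the final prediction error energy."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    ks = []
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        ks.append(k)
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], ks, err

def prediction_error_filter(s, a):
    """Inverse filter: e(n) = s(n) - sum_k a[k-1] * s(n-k)."""
    return [s[n] - sum(ak * s[n - k]
                       for k, ak in enumerate(a, start=1) if n - k >= 0)
            for n in range(len(s))]

# Hypothetical voiced signal: pulses every 40 samples through a 2nd-order
# all-pole filter with coefficients true_a.
true_a = [1.2, -0.72]
s = []
for n in range(400):
    value = 1.0 if n % 40 == 0 else 0.0
    for k, ak in enumerate(true_a, start=1):
        if n - k >= 0:
            value += ak * s[n - k]
    s.append(value)

a_hat, k_hat, err = levinson_durbin(autocorrelation(s, 2), 2)
residual = prediction_error_filter(s, a_hat)   # approximates the pulse train
signal_energy = sum(v * v for v in s)
residual_energy = sum(v * v for v in residual)
```

For real speech the order would be in the 10-16 range mentioned above, and the analysis would be repeated block by block, period by period, or sample by sample according to the method chosen.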
FIGS. 5B and 5C show the blind deconvolution method utilizing the homomorphic analysis method, which is a block analysis method: FIG. 5B shows the method performed by a frequency division
and FIG. 5C shows the method performed by inverse filtering.
A description of FIG. 5B follows. Speech samples for analysis of one block are obtained by multiplying the voiced speech signal s(n) by a tapered window function, such as a Hamming window, having a duration of about 10-20 ms. A cepstral sequence
c(i) is then obtained by processing the speech samples utilizing a series of homomorphic processing procedures consisting of a discrete Fourier transform, a complex logarithm and an inverse discrete Fourier transform as shown in FIG. 5D. The cepstrum
is a function of the quefrency which is a unit similar to time.
A low-quefrency cepstrum CL(i) situated around an origin representing the spectral envelope of the voiced speech s(n) and a high-quefrency cepstrum CH(i) representing a periodic pitch pulse train signal e(n), are capable of being separated from
each other in quefrency domain. That is, multiplying the cepstrum c(i) by a low-quefrency window function and a high-quefrency window function, respectively, gives CL(i) and CH(i), respectively. Taking them respectively through an inverse homomorphic
processing procedure as shown in FIG. 5E gives the impulse response h(n) and the pitch pulse train signal e(n). In this case, because taking the CH(i) through the inverse homomorphic processing procedure does not directly give the pitch pulse train
signal e(n) but gives the pitch pulse train signal of one block multiplied by a time window function w(n), e(n) can be obtained by multiplying again the pitch pulse train signal by an inverse time window function 1/w(n) corresponding to the inverse of
w(n).
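The homomorphic processing and the separation in the quefrency domain can be sketched, by way of example only, in Python as follows. As a simplification the sketch takes the logarithm of the spectral magnitude only (the real cepstrum) instead of the complex logarithm described above, so phase is discarded and the inverse processing back to h(n) and e(n) is not shown. The test signal, the 256-sample block, the cutoff of 15 and the function names are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (adequate for a short block)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, xt in enumerate(x):
            acc += xt * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame):
    """DFT -> log magnitude -> inverse DFT (phase is discarded)."""
    log_mag = [math.log(max(abs(v), 1e-12)) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]

# Hypothetical voiced signal: pulses every 32 samples through an all-pole filter.
period, a = 32, [1.2, -0.72]
s = []
for n in range(512):
    value = 1.0 if n % period == 0 else 0.0
    for k, ak in enumerate(a, start=1):
        if n - k >= 0:
            value += ak * s[n - k]
    s.append(value)

# One analysis block multiplied by a Hamming window w(n).
size = 256
block = s[-size:]
window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]
frame = [x * w for x, w in zip(block, window)]

c = real_cepstrum(frame)

# Low-quefrency and high-quefrency window functions split c(i) in two.
cutoff = 15
c_low = [c[i] if (i <= cutoff or i >= size - cutoff) else 0.0 for i in range(size)]
c_high = [ci - li for ci, li in zip(c, c_low)]

# The high-quefrency part carries the pitch: its largest peak in the
# first half of the block marks the pitch period in samples.
pitch_lag = max(range(cutoff + 1, size // 2), key=lambda i: c_high[i])
```

The low-quefrency part c_low corresponds to CL(i) and the high-quefrency part c_high to CH(i) in the sense used above, subject to the magnitude-only simplification.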
The method of FIG. 5C is the same as that of FIG. 5B, except that CL(i) instead of CH(i) is utilized in FIG. 5C in obtaining the periodic pitch pulse train signal e(n). That is, in this method, by utilizing the property that an impulse
response h^-1(n) corresponding to 1/H(f), which is the inverse of the frequency characteristic H(f), can be obtained by processing -CL(i), which is obtained by taking the negative of CL(i), through the inverse homomorphic processing procedure, the
periodic pitch pulse train signal e(n) can be obtained as output by constructing a finite-duration impulse response (FIR) filter which has h^-1(n) as an impulse response and by inputting to the filter an original speech signal s(n) which is not
multiplied by a window function. This method is an inverse filtering method which is basically the same as that of FIG. 5A, except that while in the homomorphic analysis of FIG. 5C the inverse spectral envelope filter 1/H(f) is constructed by
obtaining an impulse response h^-1(n) of the inverse spectral envelope filter, in FIG. 5A the inverse spectral envelope filter 1/H(f) can be directly constructed by the prediction coefficients {ai} or the reflection coefficients {ki} obtained by the
linear prediction analysis method.
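The property relied on here, namely that negating the cepstrum yields the impulse response of the inverse filter, can be sketched in Python as follows, by way of example only. The sketch assumes a causal (minimum-phase) cepstrum, for which the impulse response follows from a standard recursion, rather than the two-sided CL(i) and DFT-based inverse homomorphic processing of FIG. 5C; the cepstral values, the pulse train and the function names are hypothetical.

```python
import math

def cepstrum_to_impulse_response(c, length):
    """Impulse response of exp(C(z)) for a causal cepstrum c(0), c(1), ...
    using h(n) = sum_{k=1..n} (k/n) * c(k) * h(n-k), with h(0) = exp(c(0))."""
    h = [math.exp(c[0])]
    for n in range(1, length):
        acc = 0.0
        for k in range(1, min(n, len(c) - 1) + 1):
            acc += (k / n) * c[k] * h[n - k]
        h.append(acc)
    return h

def fir_filter(x, taps):
    """FIR filtering, output truncated to the length of the input."""
    return [sum(taps[k] * x[n - k] for k in range(min(n, len(taps) - 1) + 1))
            for n in range(len(x))]

length = 32
c_low = [0.1, 0.8, -0.3, 0.1]                 # hypothetical low-quefrency cepstrum
h = cepstrum_to_impulse_response(c_low, length)
h_inv = cepstrum_to_impulse_response([-v for v in c_low], length)

# h convolved with h_inv is a unit impulse over the computed length.
identity = fir_filter(h, h_inv)

# FIR inverse filtering: recover a hypothetical pulse train e(n) from s(n).
e = [0.0] * length
e[0], e[10], e[20] = 1.0, 0.9, 1.1
s = fir_filter(e, h)
e_recovered = fir_filter(s, h_inv)
```

Within the computed length the recovery is exact up to rounding, because the first samples of a product of two power series depend only on the first samples of each factor.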
In the blind deconvolution based on the homomorphic analysis, the impulse response h(n) or the low-quefrency cepstrum CL(i) shown by dotted lines in FIGS. 5B and 5C can be used as the spectral envelope parameter set. When using the impulse
response {h(0), h(1), . . . , h(N-1)}, a spectral envelope parameter set normally comprises a large number of parameters, with N of the order of 90-120, whereas the number of parameters can be decreased to 50-60, with N being 25-30, when using the
cepstrum {CL(-N), CL(-N+1), . . . , 0, . . . , CL(N)}.
As described above, the voiced speech waveform s(n) i | | |