BACKGROUND OF THE INVENTION
The present invention relates to a method of analyzing an input speech
signal and a speech analysis apparatus thereof.
In a conventional speech-recognition apparatus, an utterance-practicing
apparatus for hearing-impaired people, a communications system using
speech analysis and synthesis, or a speech synthesizing apparatus, an
input speech signal is analyzed and its features are extracted, so as to
perform desired processing. The input speech signal is analyzed on the
basis of its frequency spectrum. Human auditory sensitivity for temporal
changes in waveform of the speech signal is worse than that for the
spectrum thereof. Signals having an identical spectrum are recognized as
an identical phoneme.
A voiced sound portion of a speech signal has a structure of a cyclic
signal generated by vibrations of the vocal cord. The frequency spectrum
of the voiced sound has a harmonic spectrum structure. However, an
unvoiced sound portion of the speech signal is not accompanied by vibrations
of the vocal cord. The sound source of the unvoiced sound is noise
generated by an air stream flowing through the vocal tract. As a result,
the frequency spectrum of the unvoiced sound does not have a cyclic
structure like that of the harmonic spectrum. There are two conventional speech
analysis schemes in accordance with these frequency spectra. One scheme
assumes a cyclic pulse source as a sound source of the input speech
signal, and the other assumes a noise source. The former is known as
speech analysis using cepstrum analysis, and the latter is known as
speech analysis using an autoregressive (AR) model.
According to these speech analysis schemes, microstructures are removed
from the spectrum of the input speech signal, to obtain a so-called
spectrum envelope.
In the analysis of the input speech signal according to the AR model or the
cepstrum analysis scheme to obtain the spectrum envelope, both schemes
assume a stationary stochastic process. If the phoneme changes as a
function of time, such a conventional analysis scheme cannot be applied.
In order to solve this problem, the signal is extracted in a short time
region such that the system does not greatly change. The extracted signal
is multiplied by a window function, such as a Hamming window or a Hanning
window, so as to eliminate the influence of an end point, thereby
obtaining a quasi-stationary signal as a function of time. The
quasi-stationary signal is analyzed to obtain the spectrum envelope. This
envelope is defined as the spectrum envelope at the extraction timing of
the signal.
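The short-time extraction described above can be sketched as follows. This is a minimal illustration only: the function names, the test signal, and the chosen frame length and frame period are assumptions made for the example and are not taken from any particular conventional apparatus.

```python
import math

def hamming(n):
    """Hamming window of length n (0.54 - 0.46*cos form)."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_frames(signal, frame_len, frame_period):
    """Cut `signal` into overlapping frames and taper each with a Hamming window."""
    win = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_period):
        frame = signal[start:start + frame_len]
        # Tapering reduces the influence of the frame end points.
        frames.append([s * w for s, w in zip(frame, win)])
    return frames

# Hypothetical test signal: a 1 kHz sine sampled every 50 microseconds (20 kHz).
fs = 20000
signal = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(2000)]

# 200 samples = 10 msec frame length, 40 samples = 2 msec frame period.
frames = windowed_frames(signal, frame_len=200, frame_period=40)
```

Each element of `frames` is one quasi-stationary segment that a stationary analysis such as the AR model or cepstrum analysis could then be applied to.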
In order to obtain the spectrum of the input speech signal according to the
conventional speech analysis scheme, an average spectrum of a signal
portion extracted for a given length of time (to be referred to as a frame
length hereinafter), is obtained. For this reason, in order to
sufficiently extract an abrupt change in spectrum, the frame length must
be shortened. In particular, at a leading edge of a consonant, the
spectrum changes abruptly within several milliseconds, and the
frame length must be on the order of several milliseconds. With this arrangement,
the frame length is approximately equal to the pitch period of vibrations
of the vocal cord. The precision of spectrum extraction largely depends on
the timing and degree of the vocal cord pulse included within the frame
length. As a result, the spectrum cannot be stably extracted.
The problem described above is assumed to arise because a dynamic
spectrum, which changes as a function of time, is analyzed by a model
that assumes a stationary stochastic process.
In conventional spectrum extraction, the time interval (to be referred to
as a frame period) must be shortened upon shifting the frame position for
extracting the signal, so as to follow rapid changes in the spectrum.
However, if the frame period is halved, for example, the
number of frames to be analyzed is doubled. In this manner, shortening of
the frame period greatly increases the amount of data to be processed. For
example, A/D-converting a 1-second continuous speech signal at a
50-μsec sampling interval yields 20,000 samples. However, if the
same signal is analyzed using a 10-msec frame length and a 2-msec
frame period, the number of frames to be analyzed is:
1 s ÷ 0.002 s = 500
As a result, the amount of data to be analyzed is:
(10 msec ÷ 0.05 msec) × 500 = 100,000
so that the amount of data is increased fivefold.
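The arithmetic of this example can be checked with a few lines of code. The variable names are chosen for illustration only; the numerical values are those given in the example.

```python
sample_interval_s = 50e-6   # 50 microsecond sampling interval
signal_length_s = 1.0       # 1 second of continuous speech
frame_length_s = 10e-3      # 10 msec frame length
frame_period_s = 2e-3       # 2 msec frame period

# Samples produced by A/D conversion of the whole signal.
raw_samples = round(signal_length_s / sample_interval_s)

# Frames obtained when the frame position is shifted by one frame period.
num_frames = round(signal_length_s / frame_period_s)

# Samples contained in one frame, and the total number of samples analyzed.
samples_per_frame = round(frame_length_s / sample_interval_s)
analyzed_samples = samples_per_frame * num_frames

# Growth of the data to be processed relative to the raw signal.
ratio = analyzed_samples / raw_samples
```

Because consecutive frames overlap by 8 msec, each sample is analyzed in five different frames, which is where the fivefold increase comes from.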
As is described above, in a conventional speech analysis scheme based on
the stationary stochastic process, abrupt changes in spectrum at a dynamic
portion such as a leading edge of the consonant, cannot be stably analyzed
with high precision. If the frame period is shortened, the amount of data
which must be processed is greatly increased.
Another conventional method for effectively analyzing a speech signal is
frequency analysis, using a filter bank. According to this analysis
method, an input speech signal is supplied to a plurality of bandpass
filters having different center frequencies, and outputs from the filters
are used to constitute a speech-power spectrum. This method has the advantages
of a simple hardware arrangement and real-time processing.
Most of the conventional speech analysis methods determine spectrum
envelopes of input speech signals. One method of analyzing the
speech signal from the determined spectrum envelope is known as formant
analysis, which extracts the formant frequency and width from each local
peak of the envelope. This analysis method is based on
the facts that each vowel has a specific formant frequency and width, and
that each consonant is characterized by the change in formant frequency in
the transition from the consonant to a vowel. For example, five Japanese
vowels ("a", "i", "u", "e", and "o") can be defined by two formant
frequencies F1 and F2, F1 being the lowest formant frequency and F2
the next lowest. Frequencies F1 and F2 are substantially equal
for voices uttered by persons of the same sex and about the same age.
Therefore, the vowels can be identified by detecting formant frequencies
F1 and F2.
Another conventional method is also known, for extracting local peaks of
the spectrum envelope and for analyzing these peaks, based on their
frequencies and temporal changes. This method is based on the assumption
that phonemic features appear in the frequencies of local peaks of the
vowel portion, or in the temporal changes in local peaks of the consonant
portion.
Still another conventional method has been proposed, which defines the spectrum
envelope curve itself as a feature parameter of the speech signal and
uses this feature parameter in the subsequent identification,
classification, or display.
In the analysis of a speech signal, it is important to extract the spectrum
envelope. In addition to the spectrum envelope itself, the formant frequency
and width derived from the envelope, and the frequency and transition of
the local peak, can be used as feature parameters.
When a person utters a sound, its phoneme is assumed to be defined by
resonance/antiresonance of the vocal tract. For example, a resonant
frequency appears as a formant on the spectrum envelope. Therefore, if
different persons have an identical vocal tract structure, substantially
identical spectra are obtained for an identical phoneme.
However, in general, if speakers (for example, a male and a female, or a child
and an adult) have greatly different vocal tract lengths, the resonant or
antiresonant frequencies are different from each other, and the resultant
spectrum envelopes are different accordingly. In this case, the local
peaks and formant frequencies are shifted from each other for an identical
phoneme. This fact is inconvenient for an analysis aiming at extracting
identical results for identical phonemes, regardless of the speakers, as
in the cases of speech recognition and visual display of speech for
hearing-impaired persons.
In order to solve the above problems, two conventional methods are known.
One is a method for preparing a large number of standard patterns, and the
other is a method for determining a formant frequency ratio.
In the former method, a large number of different spectrum envelopes of
males and females, adults and children, are registered as the standard
patterns. Unknown input patterns are classified on the basis of
similarities between these unknown patterns and the standard patterns.
Therefore, input speech signals from a large number of unspecified speakers can
be recognized. According to this method, in order to recognize
similarities between the standard patterns and any input speech patterns,
a very large number of standard patterns must be prepared. In addition, it
takes a long period of time to compare input patterns with the standard
patterns. Furthermore, this method does not extract the results normalized
by the vocal tract lengths, and therefore cannot be used for displaying
phonemic features not dependent on the vocal tract lengths.
The latter method, i.e., the method of determining the formant frequency
ratio, is known as a method of extracting phonemic features not based on
the vocal tract lengths. More specifically, among the local peaks in the
spectrum envelope, first, second, and third formant frequencies F1, F2,
and F3, which are assumed to be relatively stable, are extracted for
vowels, and ratios F1/F3 and F2/F3 are calculated to determine the feature
parameter values. If the vocal tract length is multiplied by a factor a, the
formant frequencies are scaled by 1/a, i.e., to F1/a, F2/a, and F3/a. However,
the ratios of the formant frequencies remain the same.
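The invariance of the formant frequency ratios can be illustrated numerically. The formant values and the scale factor below are hypothetical figures chosen only for the example and are not measured data.

```python
def formant_ratios(f1, f2, f3):
    """Return the feature pair (F1/F3, F2/F3) used by the ratio method."""
    return (f1 / f3, f2 / f3)

# Hypothetical formant frequencies in Hz, chosen only for illustration.
f1, f2, f3 = 700.0, 1200.0, 2600.0

# Multiplying the vocal tract length by a divides every formant by a.
a = 1.2
scaled = (f1 / a, f2 / a, f3 / a)

original_ratios = formant_ratios(f1, f2, f3)
scaled_ratios = formant_ratios(*scaled)
```

The two ratio pairs agree to within rounding error, although every individual formant frequency has moved.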
The above method is effective if the first, second, and third formants of
the vowels can be stably extracted. However, if these formants cannot be
stably extracted, the analytic reliability is greatly degraded.
Furthermore, this method is not applicable to consonants. That is, the
formant as the resonant characteristics of the vocal tract cannot be
defined for the consonants, and the local peaks corresponding to the
first, second, and third formants cannot always be observed on the
spectrum envelope. As a result, frequencies F1, F2, and F3 cannot be
extracted or used to calculate their ratios. At a leading or trailing edge
of a vowel as well as for a consonant, the formants are not necessarily
stable, and a wrong formant frequency is often extracted. In this case,
the ratio of the formant frequencies changes discontinuously and takes a
completely wrong value. Therefore, the above method is applicable only to
stable portions of vowels of the speech signal. Another method must be
used to analyze the leading and trailing edges of the vowels and the
consonants. Since different extraction parameters must be used for the
stable portions of the vowels and other portions including the consonants,
it is impossible to describe continuous changes from a consonant to a
vowel. In short, the method of calculating the formant frequency
ratios is applicable only to stationary vowel portions.
No conventional method has been proposed for extracting feature parameters
inherent to phonemes from the spectrum envelopes of a large number of
unspecified speakers having different vocal tract lengths.
SUMMARY OF THE INVENTION
The present invention has been made in consideration of the above
situation. It is an object of the present invention to provide a method
for calculating analytic results inherent to phonemes, without being
influenced by different vocal tract lengths of speakers, and for
calculating changes in the spectrum envelope in the transition from a
consonant to a vowel.
According to an aspect of the present invention, there is provided a method
comprising: receiving a spectrum envelope, to transform the spectrum
envelope such that the spectrum envelope has a suitable magnitude, and to
generate the transformed spectrum envelope; receiving the transformed
spectrum envelope, to integrate the transformed spectrum envelope with
respect to a predetermined variable, and to generate integrated data; and
receiving the transformed spectrum envelope and the integrated data, to
project the transformed spectrum envelope with respect to the integrated data.
It is another object of the present invention to provide a speech analysis
apparatus for practicing the above method.
According to another aspect of the present invention, there is provided an
apparatus comprising: transforming means for receiving a spectrum
envelope, for transforming the spectrum envelope such that the spectrum
envelope has a suitable magnitude, and for generating a transformed
spectrum envelope; integrating means for receiving the transformed
spectrum envelope, and for integrating the transformed spectrum envelope
with respect to a predetermined variable to generate integrated data; and
projecting means for receiving the transformed spectrum envelope and the
integrated data, and for projecting the transformed spectrum envelope with
respect to the integrated data.
According to the present invention, the analysis is stably performed for a
consonant as well as for a leading edge of a vowel, to allow smooth
displaying of the spectral changes.
The problem of variations in analysis results, caused by the different
vocal tract lengths of the speakers, can be solved. Thus, the best results
inherent to the phonemes can always be obtained. In this case, according
to the present invention, the method can be applied to any
spectrum envelope portion of the input speech signal, whether it belongs to
a vowel or a consonant, or to a voiced or an unvoiced sound. Since the analysis results
are independent of extraction precision and stability of the formant
frequency, the method is applicable to the entire range of the input
speech signal. In particular, the changes in spectrum envelope in the
transition from a consonant to a vowel can be determined without being
influenced by different individual vocal tract lengths, unlike in the
conventional method.
According to the present invention, a normalized logarithmic spectrum
envelope is used as a function to be integrated in place of the spectrum
envelope and the logarithmic spectrum envelope, and thus, the influences
of voice magnitudes for identical phonemes can be eliminated.
When transformation is performed by integrating the envelope with respect
to mels, a unit of pitch, such transformation is compatible with human
auditory sensitivity, thus minimizing the contributions of low-frequency
components.
According to a spectrum envelope extractor in the speech analyzing
apparatus of the present invention, a time frequency pattern of a
frequency spectrum in the analysis frame can be extracted, although
conventional speech analysis provides only an average spectrum of the
input speech signal in the analysis frame. Therefore, abrupt changes in
spectrum can be stably extracted, with high precision.
The time frequency pattern of the frequency spectrum thus obtained has a
definite meaning. Artificial parameters (analysis orders in the AR model,
a cutoff quefrency in cepstrum analysis, etc.) are not included in the
time frequency pattern, thus achieving high reliability.
Furthermore, since the time frequency pattern of the frequency spectrum,
which is obtained from frames including the unvoiced sounds and
consonants, includes many noise components, it cannot be used without
modifications. According to the present invention, however, the time
frequency pattern of the frequency spectrum produced by inverse Fourier
transformation is temporally smoothed to reduce the influences of noise,
thus obtaining a high-quality time frequency pattern output as a function
of time.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a speech analysis apparatus according to an
embodiment of the present invention;
FIG. 2A is a block diagram of a spectrum envelope extractor in the
apparatus of FIG. 1;
FIG. 2B is a block diagram showing a modification of the spectrum envelope
extractor in FIG. 2A;
FIG. 3 is a graph showing differences in logarithmic spectrum envelopes
caused by vocal tract length differences;
FIG. 4 is a graph showing different formants of a male and a female;
FIG. 5 is a graph obtained by plotting data of FIG. 4 on the basis of
formant ratios;
FIGS. 6A to 6F are graphs for explaining the principle of the present
invention;
FIGS. 7A and 7B are graphs for explaining transform U;
FIGS. 8A and 8B are graphs showing different male and female spectrum
envelopes;
FIGS. 9A and 9B are graphs obtained by performing transform U of the
spectrum envelopes of FIGS. 8A and 8B;
FIG. 10A is a graph showing a spectrum envelope of a word "ta" uttered by a
female;
FIG. 10B is a graph obtained by performing transform U of the spectrum
envelope of FIG. 10A;
FIGS. 11A and 11B are graphs obtained by male and female utterances of
Japanese phoneme "a";
FIGS. 12A and 12B are graphs obtained by performing transform U of male and
female utterances of Japanese phoneme "i" in units of mels according to
another embodiment of the present invention;
FIG. 13A is a graph showing a spectrum envelope of a female utterance of
"ta";
FIG. 13B is a graph showing the results of transform U of male and female
utterances of Japanese phoneme "a" in units of mels;
FIGS. 14A and 14B are graphs showing results of transform U of male and
female utterances of phoneme "a" in units of mels;
FIGS. 15A to 15D are schematic views illustrating a model for generation of
a speech signal;
FIG. 16 is a graph showing the result of Fourier transform of a pulse train
of FIG. 15A;
FIG. 17 is a graph showing the result of Fourier transform of the vocal
tract characteristics in FIG. 15C;
FIGS. 18A and 18B are graphs showing discrete spectra;
FIGS. 19A to 19C are views illustrating a time frequency pattern of a
frequency spectrum derived from the speech signal;
FIG. 20 is a flow chart for obtaining the spectrum envelope;
FIG. 21 is a graph showing the input speech signal;
FIGS. 22 and 23 are graphs showing real and imaginary parts of resultant
spectrum I(Ω);
FIGS. 24A to 24D are graphs showing data rearrangement according to an FFT
algorithm;
FIG. 25 is a graph showing a time frequency pattern of a frequency spectrum
obtained by this embodiment;
FIGS. 26 and 27 are graphs showing time frequency patterns of frequency
spectra obtained by the embodiment of FIGS. 2B and 2A;
FIG. 28 is a graph showing the relationship between the scale of mels and
the frequency;
FIG. 29 is a block diagram of a speech analysis apparatus according to
another embodiment of the present invention;
FIGS. 30 and 31 are detailed block diagrams of an arrangement shown in FIG.
29;
FIG. 32 is a block diagram of a filter bank of FIG. 30.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
A speech analysis apparatus according to an embodiment of the present
invention will be described with reference to the accompanying drawings.
FIG. 1 is a block diagram showing an arrangement of an embodiment. Before
describing the embodiment with reference to FIG. 1, the principle of the
embodiment of the present invention will be described with reference to
FIGS. 3 to 7B.
Comparison results of vowel spectrum envelopes for different
vocal tract lengths are illustrated in FIG. 3. FIG. 3 shows a
logarithmic plot of spectrum envelope P(f) of an identical phoneme
for two different vocal tract lengths l1 and l2. Referring to FIG. 3, in
a frequency range from several hundred hertz to about 5 kHz, spectrum
envelope P1(f) of long vocal tract length l1 is a version, scaled along the
frequency (f) axis with reference to a fixed origin, of spectrum envelope
P2(f) (log P2(f) in FIG. 3) of short vocal tract length l2. However, in
the range of 0 Hz to several hundred hertz, the difference between
envelopes P1(f) and P2(f) is pronounced, and the similarity between them is
reduced. This frequency range reflects differences in individual tone
color and is not so important in speech analysis. The resonant frequencies
are inversely proportional to the vocal tract length. If ratio l1/l2 of
length l1 to length l2 is given as r, the following relationship between
spectrum envelopes P1(f) and P2(f) is obtained in the frequency range of
several hundred hertz to 5 kHz when their magnitudes are normalized:
log|P1(f)| ≈ log|P2(rf)|   (1)
Magnitude-normalized logarithmic spectrum envelopes
log|P1(f)| and log|P2(rf)| are used in
place of spectrum envelopes P1(f) and P2(f) themselves to normalize the
magnitudes of the input speech signals.
In this case, if the first to third formants are extracted, their frequencies
F1, F2, F3, F1', F2', and F3' are plotted as shown in FIG. 3. Since these
frequencies satisfy the following relation:
F1'/F1 ≈ F2'/F2 ≈ F3'/F3 ≈ r   (2)
the ratios of the formant frequencies are kept unchanged, as given by
relations (3):
F1/F2 ≈ F1'/F2'
F1/F3 ≈ F1'/F3'   (3)
The above fact is supported by the results (FIGS. 4 and 5) of actual
measurements. FIG. 4 shows a distribution of F1 and F2 of males and
females in their twenties to thirties. As is apparent from FIG. 4, the
actual distributions for the males and females are greatly different. For
example, the formant frequency of a male utterance of Japanese phoneme "a"
is the same as that of a female utterance of Japanese phoneme "o", and the
formant frequency of a male utterance of Japanese phoneme "e" is the same
as that of a female utterance of Japanese phoneme "u".
FIG. 5 shows the distributions of ratios F1/F3 and F2/F3. Referring to FIG.
5, it is found that the differences between males and females
can be eliminated when the formant frequency ratios are used.
In the frequency range of several hundred hertz to about 5 kHz,
regardless of whether the spectrum envelope is stationary, transform R
given by equation (4) is performed for spectrum envelope P(f) to multiply
values on the frequency axis by a constant, i.e., to obtain r·f:
##EQU1##
In this case, if a transform U that projects spectrum envelopes P(f) and
P(r·f) into a functional space that is unchanged by transform R is found, spectrum
envelopes P(f) belonging to an identical phoneme must have identical
shapes in this space regardless of vocal tract lengths l.
The above operation is described as the principle in FIGS. 6A to 6F. These
figures show that, although a difference is present in spectrum envelopes
P(f) of Japanese phoneme "a" or "i" caused by different vocal tract
lengths l, these envelopes are transformed to spectrum envelopes P'(f)
having an identical distribution by means of transform U. More
specifically, spectrum envelope P1a(f) (FIG. 6A) of
Japanese phoneme "a" for length l1 and spectrum envelope P2a(f) (FIG. 6C)
thereof for length l2 are transformed into spectrum envelopes P'a(f) (FIG.
6E) of an identical shape by transform U. Similarly, spectrum envelope
P1i(f) (FIG. 6B) of Japanese phoneme "i" and P2i(f) (FIG. 6D) thereof are
transformed into spectrum envelopes P'i(f) (FIG. 6F) of an identical
shape.
In this embodiment, transform U is performed as follows. If a
magnitude-normalized logarithmic spectrum envelope is integrated on the
logarithmic scale along the frequency axis and the resultant integral is
defined as L(f), it is given by
##EQU2##
wherein ε is a very small positive value near 0 and is determined
by conditions to be described later.
L(f) in equation (5) depends on function P(f) and is therefore written as
LP(f). Applying the transform of equation (4) to LP(f) yields:
##EQU3##
For h = r·k, k = h/r and log k = log h − log r; therefore,
d log k = d log h − d log r. Since r is a constant, d log k = d log h,
and therefore
##EQU4##
If the second term of the right-hand side of equation (6) is sufficiently
small, then
LP'(f) ≈ LP(r·f)   (7)
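Since the equation images are not reproduced in this text, the chain of steps leading to relation (7) can be sketched as follows. This is a reconstruction from the surrounding definitions only (integration of the normalized logarithmic envelope on a logarithmic frequency scale from ε to f, the relation P'(f) = P(r·f), and the substitution h = r·k); the exact typeset form of equations (5) and (6) may differ. Here L_P and L_{P'} stand for LP and LP' in the text.

```latex
\begin{align*}
L_P(f)    &= \int_{\varepsilon}^{f} \log|P(k)| \, d\log k \\
L_{P'}(f) &= \int_{\varepsilon}^{f} \log|P(rk)| \, d\log k
           = \int_{r\varepsilon}^{rf} \log|P(h)| \, d\log h
           \qquad (h = rk) \\
          &= L_P(rf) - \int_{\varepsilon}^{r\varepsilon} \log|P(h)| \, d\log h \\
          &\approx L_P(rf)
           \qquad \text{when the second term is sufficiently small}
\end{align*}
```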
Consider the curve (P(f), LP(f)) obtained by plotting spectrum envelope P(f)
against LP(f) using frequency f as a parameter:
##EQU5##
Therefore, transform U maps envelopes related by transform R of equation (4)
onto an unchanged functional shape. If normalized logarithmic spectrum
envelope log|P'(f)| is proportionally elongated or
compressed along the frequency axis with respect to normalized logarithmic
spectrum envelope log|P(f)|, the replacement of the
logarithmic frequency axis with integral L(f) of equation (5) absorbs the
deviations of the normalized logarithmic spectrum envelopes on the
frequency axis.
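The absorbing effect of transform U can be checked numerically with a small sketch. Everything below is illustrative: the synthetic envelope `log_envelope`, the scale factor r = 1.25, the frequency grid, and the trapezoidal integration are assumptions made for the example, and the envelope is deliberately chosen to be close to zero near ε so that the neglected low-frequency term is small.

```python
import math

def log_envelope(f):
    """Hypothetical normalized log envelope with two resonance-like bumps (f in Hz)."""
    return (math.exp(-((f - 700.0) / 150.0) ** 2)
            + 0.8 * math.exp(-((f - 1800.0) / 250.0) ** 2))

def integral_on_log_axis(g, eps, f, steps=1000):
    """Trapezoidal estimate of the integral of g(k) d(log k) from eps to f."""
    a, b = math.log(eps), math.log(f)
    h = (b - a) / steps
    total = 0.5 * (g(math.exp(a)) + g(math.exp(b)))
    for i in range(1, steps):
        total += g(math.exp(a + i * h))
    return total * h

def transform_u(g, eps, freqs):
    """Return the points (L(f), g(f)) traced as frequency f varies."""
    return [(integral_on_log_axis(g, eps, f), g(f)) for f in freqs]

eps = 50.0   # lower integration limit in Hz
r = 1.25     # assumed frequency scale factor between two speakers

def scaled_envelope(f):
    """Envelope of the second speaker: P'(f) = P(r*f)."""
    return log_envelope(r * f)

freqs = [200.0 * 1.02 ** i for i in range(150)]   # roughly 200 Hz to 3.8 kHz

# On the raw frequency axis the two envelopes clearly disagree.
raw_mismatch = max(abs(log_envelope(f) - scaled_envelope(f)) for f in freqs)

# After transform U, corresponding points (f for P, f/r for P') nearly coincide.
curve_p = transform_u(log_envelope, eps, freqs)
curve_q = transform_u(scaled_envelope, eps, [f / r for f in freqs])
max_dx = max(abs(p[0] - q[0]) for p, q in zip(curve_p, curve_q))
max_dy = max(abs(p[1] - q[1]) for p, q in zip(curve_p, curve_q))
```

In this sketch `raw_mismatch` is large while `max_dx` and `max_dy` are close to zero, which is the behavior the principle predicts when the low-frequency term is negligible.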
FIGS. 7A and 7B are views for explaining the principle of transform U.
Logarithmic spectrum envelope log|P(f)| is used as
envelope data to be described later in place of spectrum envelope P(f).
Transform U is performed for the logarithmic spectrum envelope of FIG. 7A
to obtain a spectrum envelope of FIG. 7B. In this case, equation (8) can
be rewritten as follows:
(log|P(f)|, LP(f)) = (log|P'(f)|, LP'(f))   (10)
If normalized logarithmic spectrum envelope log|P(f)| is
used in place of spectrum envelope P(f) or logarithmic spectrum envelope
log|P(f)|, equation (8) is rewritten as follows:
(log|P(f)|, LP(f)) = (log|P'(f)|, LP'(f))   (11)
The condition for neglecting the second term of the right-hand side of
equation (6) will be described below. The condition is determined by
evaluating integral I given by equation (12). In this evaluation, the actual
range of ratio r falls within the range of 1/2 to about 2, and the normalized
logarithmic spectrum envelope in the range of ε to 2ε on
the frequency axis is substantially
constant:
##EQU6##
The magnitude of the speech spectrum envelope is greatly reduced at a
frequency smaller than half of the pitch frequency. If about 100 Hz is
used as ε in equation (6), it is apparent from equation (12) that
the second term of the right-hand side in equation (6) can be neglected.
However, if ε is excessively small, the influence of low-frequency
components on integral L(f) given by equation (5) is excessively large.
In this case, analysis sensitivity is increased near the origin of the
spectrum. Therefore, ε must not be less than 10 Hz, and preferably
falls within the range of 10 Hz to 100 Hz.
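Under the stated assumption that the normalized logarithmic spectrum envelope is approximately a constant C between ε and r·ε, the size of the neglected term can be estimated directly. The following is a sketch under that assumption only and is not a reproduction of equation (12); the symbol C is introduced here for illustration.

```latex
\begin{align*}
I &= \int_{\varepsilon}^{r\varepsilon} \log|P(h)| \, d\log h
   \approx C \int_{\varepsilon}^{r\varepsilon} d\log h
   = C \log r , \\
|I| &\lesssim |C| \log 2
   \qquad \text{for } \tfrac{1}{2} \le r \le 2 .
\end{align*}
```

The term does not depend on f, so it can at most shift the integrated axis by a fixed amount, and it is small whenever the envelope level C near ε is small.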
The principle of this embodiment has been described. The arrangement for
carrying out the above operation will now be described, referring back to
FIG. 1.
Referring to FIG. 1, spectrum envelope extractor 11 extracts spectrum
envelope P(f) of input speech signal AIN. Various spectrum envelope
extraction schemes may be used, such as extraction in AR model speech
analysis, extraction in cepstrum speech analysis, extraction in speech
frequency analysis with a filter bank, and so on.
Logarithm circuit 12 converts the magnitude of spectrum envelope P(f)
extracted by extractor 11 into a logarithmic value. Normalizing circuit 13
normalizes the magnitude of logarithmic spectrum envelope
log|P(f)| output from logarithm circuit 12. Examples of
the method for normalizing the magnitude of logarithmic spectrum envelope
log|P(f)| are a method using automatic gain control
(AGC), and a method of differentiating logarithmic spectrum envelope
log|P(f)| with respect to frequency f to eliminate a constant term
from the envelope, integrating the
differentiated value, and adding a constant value to the integrated value.
Transform section 10 is constituted by logarithm circuit 12 and
normalizing circuit 13.
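The second normalizing method (differentiate, integrate, add a constant) can be sketched for a sampled envelope as follows. The function name, the sample values, and the choice of fixing the mean level to `reference_level` are assumptions made for this illustration; the text only requires that some constant value be added.

```python
def normalize_log_envelope(log_env, reference_level=0.0):
    """Remove the overall level of a sampled log envelope, then add a fixed level."""
    # Differencing along frequency discards any constant (gain) term.
    diffs = [b - a for a, b in zip(log_env, log_env[1:])]
    # Re-integrating the differences rebuilds the shape, starting from zero.
    shape = [0.0]
    for d in diffs:
        shape.append(shape[-1] + d)
    # Assumed convention: choose the added constant so the mean equals reference_level.
    offset = reference_level - sum(shape) / len(shape)
    return [s + offset for s in shape]

# Hypothetical log envelope samples for a quiet and a loud utterance of one phoneme.
quiet = [1.0, 3.0, 2.0, 0.5]
loud = [x + 4.0 for x in quiet]   # same shape, 4 units higher on the log scale

normalized_quiet = normalize_log_envelope(quiet)
normalized_loud = normalize_log_envelope(loud)
```

Because a change in voice magnitude only adds a constant on the logarithmic scale, both inputs yield the same normalized envelope.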
Integrator 14 integrates normalized logarithmic envelope
log|P(f)| (output from normalizing circuit 13) using
the frequency on the logarithmic scale as a variable. More specifically,
integrator 14 integrates spectrum envelope log|P(f)|
according to the integral function of equation (5). It should be noted
that the value of ε is given as 50 Hz.
Projection circuit 15 receives logarithmic spectrum envelope
log|P(f)| output from logarithm circuit 12 and the
integrated result from integrator 14, projects |P(f)|
onto integral function L(f) (=LP(f)) by using frequency f as a parameter, as
shown in FIGS. 7A and 7B, and displays the projection result. In projection
circuit 15, LP(f) is plotted along the x-axis of the orthogonal coordinate system
and logarithmic spectrum envelope log|P(f)| is plotted
along the y-axis thereof, with frequency f used as the parameter,
thereby patterning the analysis results of input speech signal AIN.
In processing of projection circuit 15, as is apparent from equations (10)
and (11), spectrum envelope P(f) or normalized logarithmic spectrum
envelope log|P(f)| may be used as the value plotted
along the y-axis. Alternatively, normalized spectrum envelope P(f) may be
used. According to the present invention, the envelope
data subjected to projection may take at least any of the four forms
described above.
In processing of projection circuit 15, envelope data may be plotted along
the x-axis, and LP(f) may be plotted along the y-axis.
An example of practical measurement by speech analysis according to this
embodiment will be described below. FIGS. 8A and 8B respectively show
logarithmic spectrum envelopes log|P(f)| of male and
female utterances of Japanese phoneme "i". These envelopes
log|P(f)| may be determined as follows.
Speech signal AIN input at a condenser microphone is input to extractor 11
and sampled at a sampling period of 50 μsec to obtain a 12-bit
digital signal. An 8-kword wave memory is used to hold the sampled speech signal.
Extractor 11 determines spectrum envelope P(f) by analyzing the cepstrum of
signal AIN. Cepstrum analysis is performed as follows. A 1024-point frame
of a stable vowel portion is differentiated, and the differentiated result
is multiplied by a Hamming window. The result is then
Fourier-transformed by an FFT algorithm, thereby obtaining spectrum
envelope P(f).
Logarithm circuit 12 calculates a logarithm of the absolute value of
envelope P(f). The logarithm is subjected to inverse Fourier transform to
obtain its cepstrum. The cepstrum is windowed with a rectangular window
having a cutoff period of 1.7 to 2.5 msec on the quefrency axis. The
result is then Fourier-transformed to obtain logarithmic spectrum envelope
log|P(f)|.
In order to obtain logarithmic spectrum envelope
log|P(f)|, the cutoff range on the quefrency axis is
selected in correspondence with the pitch frequency. Furthermore, in order
to normalize the magnitude of envelope log|P(f)|,
envelope log|P(f)| is calculated after the value of the
0th order of the cepstrum is converted into a predetermined value.
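The cepstrum-based extraction described above can be sketched in a few steps. This is a minimal illustration under stated assumptions: the function names, the synthetic test frame, the 2-msec cutoff (one value inside the 1.7 to 2.5 msec range), the small constant added before taking the logarithm, and the choice of 0 as the predetermined 0th-order value are all hypothetical and are not the circuitry of extractor 11 or logarithm circuit 12.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    """Inverse FFT computed through the forward FFT of the conjugate."""
    n = len(x)
    return [v.conjugate() / n for v in fft([u.conjugate() for u in x])]

def cepstral_log_envelope(frame, fs, cutoff_s, c0=0.0):
    """Smoothed log spectrum envelope of one frame by cepstrum windowing."""
    n = len(frame)
    # Differentiate the frame (first difference), then apply a Hamming window.
    diff = [frame[0]] + [frame[i] - frame[i - 1] for i in range(1, n)]
    win = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    spectrum = fft([d * w for d, w in zip(diff, win)])
    # Logarithm of the magnitude; the small constant only avoids log(0).
    log_mag = [math.log(abs(s) + 1e-12) for s in spectrum]
    # Inverse transform of the log magnitude gives the cepstrum.
    ceps = [c.real for c in ifft(log_mag)]
    # Rectangular window: keep only quefrencies below the cutoff (both ends).
    cut = round(cutoff_s * fs)
    kept = [ceps[q] if (q <= cut or q >= n - cut) else 0.0 for q in range(n)]
    # Magnitude normalization: force the 0th-order term to a fixed value.
    kept[0] = c0
    return [v.real for v in fft(kept)]

# Hypothetical voiced frame: decaying 800 Hz bursts repeated every 10 msec.
fs, n, pitch_period = 20000, 1024, 200
frame = [0.0] * n
for start in range(0, n, pitch_period):
    for i in range(min(pitch_period, n - start)):
        frame[start + i] += math.exp(-i / 40.0) * math.sin(2.0 * math.pi * 800.0 * i / fs)

envelope = cepstral_log_envelope(frame, fs, cutoff_s=0.002)
```

With the 0th-order term set to 0, the mean of the returned envelope over all frequency bins is zero, so frames of different loudness are brought to a common level.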
The logarithmic spectrum envelopes shown in FIGS. 8A and 8B are obtained as
described above. When these envelopes in FIGS. 8A and 8B are compared,
their distributions are similar to each other within the range below about
5 kHz. However, the female spectrum shape is elongated along the frequency
axis as compared with the male spectrum shape.
LP(f) (expressed by equation (5)) for each envelope
log|P(f)| is calculated with ε given as 50
Hz. The calculated values are plotted along the x-axis, and envelopes
log|P(f)| are plotted along the y-axis, as shown in
FIGS. 9A and 9B. Although the peak heights and fine details are
different in these graphs, the deviations along the frequency direction in
FIGS. 8A and 8B are apparently eliminated.
FIG. 10A shows time-series changes of logarithmic spectrum envelope
log|P(f)| obtain