|
Description  |
|
|
BACKGROUND AND SUMMARY OF THE INVENTION
The present invention relates to voice messaging systems, wherein pitch and
LPC parameters (and usually other excitation information too) are encoded
for transmission and/or storage, and are decoded to provide a close
replication of the original speech input.
The present invention also relates to speech recognition and encoding
systems, and to any other system wherein it is necessary to estimate the
pitch of the human voice.
The present invention is particularly related to linear predictive coding
(LPC) systems for (and methods of) analyzing or encoding human speech
signals. In LPC modeling generally, each sample in a series of samples is
modeled (in the simplified model) as a linear combination of preceding
samples, plus an excitation function:
##EQU1##
where u.sub.k is the LPC residual signal. That is, u.sub.k represents the
residual information in the input speech signal which is not predicted by
the LPC model. Note that only N prior signals are used for prediction. The
model order (typically around 10) can be increased to give better
prediction, but some information will always remain in the residual signal
u.sub.k for any normal speech modelling application.
Within the general framework of LPC modeling, many particular
implementations of voice analysis can be selected. In many of these, it is
necessary to determine the pitch of the input speech signal. That is, in
addition to the formant frequencies, which in effect correspond to
resonances of the vocal tract, the human voice also contains a pitch,
modulated by the speaker, which corresponds to the frequency at which the
larynx modulates the airstream. That is, the human voice can be considered
as an excitation function applied to an acoustic passive filter, and the
excitation function will generally appear in the LPC residual function,
while the characteristics of the passive acoustic filter (i.e., the
resonance characteristics of mouth, nasal cavity, chest, etc.) will be
modeled by the LPC parameters. It should be noted that during unvoiced
speech, the excitation function does not have a well-defined pitch, but
instead is best modeled as broad band white noise or pink noise.
Estimation of the pitch period is not completely trivial. Among the
problems is the fact that the first formant will often occur at a
frequency close to that of the pitch. For this reason, pitch estimation is
often performed on the LPC residual signal, since the LPC estimation
process in effect deconvolves vocal tract resonances from the excitation
information, so that the residual signal contains relatively less of the
vocal tract resonances (formants) and relatively more of the excitation
information (pitch). However, such residual-based pitch estimation
techniques have their own difficulties. The LPC model itself will normally
introduce high frequency noise into the residual signal, and portions of
this high frequency noise may have a higher spectral density than the
actual pitch which should be detected. One prior art solution to this
difficulty is simply to low pass filter the residual signal at around 1000
Hz. This removes the high frequency noise, but also removes the legitimate
high frequency energy which is present in the unvoiced regions of speech,
and renders the residual signal virtually useless for voicing decisions.
A cardinal criterion in voice messaging applications is the quality of
speech reproduced. Prior art systems have had many difficulties in this
respect. In particular, many of these difficulties relate to problems of
accurately detecting the pitch and voicing of the input speech signal.
It is typically very easy to incorrectly estimate a pitch period at twice
or half its value. For example, if correlation methods are used, a good
correlation at a period P guarantees a good correlation at period 2P, and
also means that the signal is more likely to show a good correlation at
period P/2. However, such doubling and halving errors produce very
annoying degradation in voice quality. For example, erroneous halving of
the pitch period will tend to produce a squeaky voice, and erroneous
doubling of the pitch period will tend to produce a coarse voice.
Moreover, pitch period doubling or halving is very likely to occur
intermittently, so that the synthesized voice will tend to crack or to
grate, intermittently.
Thus, it is an object of the present invention to provide a voice messaging
system wherein errors of pitch period doubling and halving are avoided.
It is a further object of the present invention to provide a voice
messaging system wherein voices are not reproduced with erroneous squeaky,
cracking, coarse, or grating qualities.
A related difficulty in prior art voice messaging systems is voicing
errors. If a section of voiced speech is incorrectly determined to be
unvoiced, the reproduced speech will sound as through it was whispered
rather than spoken speech. If a section of unvoiced speech is incorrectly
estimated to be voiced, the regenerated speech in this section will show a
buzzing quality.
Thus, it is an object of the present invention to provide a voice messaging
system, wherein voicing errors are avoided.
It is a further object of the present invention to provide a voice
messaging system wherein spurious buzz and dropouts do not appear in the
reconstituted speech.
The pitch usually varies fairly smoothly across frames. In the prior art,
tracking of pitch across frames has been attempted, but the interrelation
of the pitch and voicing decisions can pose difficulties. That is, where
the voicing decision is made separately, the voicing and pitch decisions
must still be reconciled. Thus, this method poses a heavy processor load.
It is a further object of the invention to provide a voice messaging system
wherein pitch is tracked consistently with respect to plural frames in the
sequence of frames, without imposing a heavy processor load.
It is a further object of the present invention to provide a voice
messaging system wherein voicing decisions are made consistently across a
sequence of frames.
It is a further object of the present invention to provide a voice
messaging system wherein pitch and voicing decisions are made consistently
across a sequence of frames, without imposing a heavy processor load.
The present invention uses an adaptive filter to filter the residual
signal. By using a time-varying filter which has a single pole at the
first reflection coefficient (k.sub.1 of the speech input), the high
frequency noise is removed from the voiced periods of speech, but the high
frequency information in the unvoiced speech periods is retained. The
adaptively filtered residual signal is then used as the input for the
pitch decision.
It is necessary to retain the high frequency information in the unvoiced
speech periods to permit better voicing/unvoicing decisions. That is, the
"unvoiced" voicing decision is normally made when no strong pitch is
found, that is when no correlation lag of the residual signal provides a
high normalized correlation value. However, if only a low-pass filtered
portion of the residual signal during unvoiced speech periods is tested,
this partial segment of the residual signal may have spurious
correlations. That is, the danger is that the truncated residual signal
which is produced by the fixed low-pass filter of the prior art does not
contain enough data to reliably show that no correlation exists during
unvoiced periods, and the additional band width provided by the
high-frequency energy of unvoiced periods is necessary to reliably exclude
the spurious correlation lags which might otherwise be found.
Thus, it is an object of the present invention to provide a method for
filtering high-frequency noise out during voiced speech periods, without
making erroneous voicing decisions during unvoiced speech periods.
It is a further object of the invention to provide a voice messaging system
which does not make erroneous high-frequency pitch assignments during
voiced speech periods, and which also does not make erroneous voicing
decisions during unvoiced speech periods.
It is a further object of the present invention to provide a system for
making pitch and voicing estimates of speech which disregards
high-frequency noise during voiced speech segments and which uses
high-frequency information during unvoiced speech segments.
Improvement in pitch and voicing decisions is particularly critical for
voice messaging systems, but is also desirable for other applications. For
example, a word recognizer which incorporated pitch information would
naturally require a good pitch estimation procedure. Similarly, pitch
information is sometimes used for speaker verification, particularly over
a phone line, where the high frequency information is partially lost.
Moreover, for long-range future recognition systems, it would be desirable
to be able to take account of the syntactic information which is denoted
by pitch. Similarly, a good analysis of voicing would be desirable for
some advanced speech recognition systems, e.g., speech to text systems.
Thus, it is a further object of the present invention to provide a method
for making optimal pitch decisions in a series of frames of input speech.
It is a further object of the present invention to provide a method for
making optimal voicing decisions in a sequence of frames of input speech.
It is a further object of the present invention to provide a method for
making optimal speech and voicing decisions in a sequence of frames of
input speech.
The first reflection coefficient k.sub.1 is approximately related to the
high/low frequency energy ratio and a signal. See. R. J. McAulay, "Design
of a Robust Maximum Likelihood Pitch Estimator for Speech and Additive
Noise," Technical Note, 1979-28, Lincoln Labs, June 11, 1979, which is
hereby incorporated by reference. For k.sub.1 close to -1, there is more
low frequency energy in the signal than high-frequency energy, and vice
versa for k.sub.1 close to 1. Thus, by using k.sub.1 to determine the pole
of a 1-pole deemphasis filter, the residual signal is low pass filtered in
the voiced speech periods and is high pass filtered in the unvoiced speech
periods. This means that the formant frequencies are excluded from
computation of pitch during the voiced periods, while the necessary
high-band width information is retained in the unvoiced periods for
accurate detection of the fact that no pitch correlation exists.
Preferably a post-processing dynamic programming technique is used to
provide not only an optimal pitch value but also an optimal voicing
decision. That is, both pitch and voicing are tracked from frame to frame,
and a cumulative penalty for a sequence of frame pitch/voicing decisions
is accumulated for various tracks to find the track which gives optimal
pitch and voicing decisions. The cumulative penalty is obtained by
imposing a frame error in going from one frame to the next. The frame
error preferably not only penalizes large deviations in pitch period from
frame to frame, but also penalizes pitch hypotheses which have a
relatively poor correlation "goodness" value, and also penalizes changes
in the voicing decision if the spectrum is relatively unchanged from frame
to frame. This last feature of the frame transition error therefore forces
voicing transitions towards the points of maximal spectral change.
According to the present invention there is provided:
A voice messaging system, for encoding and regenerating human speech,
comprising:
means for receiving an analog input speech signal;
LPC analysis means, connected to said input means for analyzing said input
speech signal according to an LPC (linear predictive coding) model, said
LPC analysis means providing LPC parameters and a residual signal;
an adaptive filter connected to receive said residual signal and at least
one of said LPC parameters from said LPC analysis means, said adaptive
filter filtering said residual signal according to a filter characteristic
defined by at least one of said LPC parameters;
means operatively connected to said filter for extracting pitch and voicing
information from said filtered residual signal; and
means for encoding said pitch and voicing information and said LPC
parameters.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be described with reference to the accompanying
drawings, wherein:
FIG. 1 shows the configuration of a voice messaging system generally;
FIG. 2 shows generally the configuration of the portion of the system of
the present invention wherein improved selection of a set of pitch period
candidates is achieved;
FIG. 3 shows generally the configuration of the portion of the system of
the present invention wherein an optimal pitch and voicing decision is
made, after a set of pitch period candidates has previously been
identified;
FIGS. 4a and 4b show a composite block diagram illustrating generally the
configuration using the presently preferred embodiment for pitch tracking;
and
FIG. 5 shows an example of a trajectory in a dynamic programming process,
which is used to identify an optimal pitch and voicing decision at a frame
prior to the current frame.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
FIG. 2 shows generally the configuration of the system of the present
invention, whereby improved selection of pitch period candidates and
voicing decisions is achieved. A speech input signal, which is shown as a
time series s.sub.i, is provided to an LPC analysis block. The LPC
analysis can be done by a wide variety of conventional techniques, but the
end product is a set of LPC parameters and residual signal u.sub.i.
Background on LPC analysis generally, and on various methods for
extraction of LPC parameters, is found in numerous generally known
references, includng Markel and Gray, Linear Prediction of Speech (1976)
and Rabiner and Schafer, Digital Processing of Speech Signals (1978), and
references cited therein, all of which are hereby incorporated by
reference.
In the presently preferred embodiment, the analog speech waveform is
sampled at a frequency of 8 KHz and with a precision of 16 bits to produce
the input time series s.sub.1. Of course, the present invention is not
dependent at all on the sampling rate or the precision used, and is
applicable to speech sampled at any rate, or with any degree of precision,
whatsoever.
In the presently preferred embodiment, the set of LPC parameters which is
used includes a plurality of reflection coefficients k.sub.i, and a
10th-order LPC model is used (that is, only the reflection coefficients
k.sub.1 through k.sub.10 are extracted, and higher order coefficients are
not extracted). However, other model orders or other equivalent sets of
LPC parameters can be used, as is well known to those skilled in the art.
For example, the LPC predictor coefficients a.sub.k can be used, or the
impulse response estimates e.sub.k. However, the reflection coefficients
k.sub.i are most convenient.
In the presently preferred embodiment, the reflection coefficients are
extracted according to the Leroux-Gueguen procedure, which is set forth,
for example, in IEEE Transactions on Acoustics, Speech and Signal
Processing, p. 257 (June 1977), which is hereby incorporated by reference.
However, other algorithms well known to those skilled in the art, such as
Vurbin's, could be used to compute the coefficients.
A by-product of the computation of the LPC parameters will typically be a
residual signal u.sub.k. However, if the parameters are computed by a
method which does not automatically pop out the u.sub.k as a by-product,
the residual can be found simply by using the LPC parameters to configure
a finite-impulse-response digital filter which directly computes the
residual series u.sub.k from the input series s.sub.k.
The residual signal time series u.sub.k is not put through a very simple
digital filtering operation, which is dependent on the LPC parameters for
the current frame. That is, the speech input signal s.sub.k is a time
series having a value which can change once every sample, at a sampling
rate of, e.g., 8 Khz. However, the LPC parameters are normally recomputed
only once each frame period, at a frame frequency of, e.g., 100 Hz. The
residual signal u.sub.k also has a period equal to the sampling period.
Thus, the digital filter 14, whose value is dependent on the LPC
parameters, is preferably not readjusted at every residual signal u.sub.k.
In the presently preferred embodiment, approximately 80 values in the
residual signal time series u.sub.k pass through the filter 14 before a
new value of the LPC parameters is generated, and therefore a new
characteristic for the filter 14 is implemented.
More specifically, the first reflection coefficient k.sub.1 is extracted
from the set of LPC parameters provided by the LPC analysis section 12.
Where the LPC parameters themselves are the reflection coefficients
k.sub.I, it is merely necessary to look up the first reflection
coefficient k.sub.1. However, where other LPC parameters are used, the
transformation of the parameters to produce the first order reflection
coefficient is typically extremely simple, for example,
k.sub.1 =a.sub.1 /a.sub.0 (2)
Although the present invention preferably uses the first reflection
coefficient to define a 1-pole adaptive filter, the invention is not as
narrow as the scope of this principal preferred embodiment. That is, the
filter need not be a single-pole filter, but may be configured as a more
complex filter, having one more poles and or one or more zeros, some or
all of which may be adaptively varied according to the present invention.
It should also be noted that the adaptive filter characteristic need not be
determined by the first reflection coefficient k.sub.1. As is well known
in the art, there are numerous equivalent sets of LPC parameters, and the
parameters in other LPC parameter sets may also provide desirable
filtering characteristics. Particularly, in any set of LPC parameters, the
lowest order parameters are most likely to provide information about gross
spectral shape. Thus, an adaptive filter according to the present
invention could use a.sub.1 or e.sub.1 to define a pole, can be a single
or multiple pole and can be used alone or in combination with other zeros
and or poles. Moreover, the pole (or zero) which is defined adaptively by
an LPC parameter need not exactly coincide with that parameter, as in the
presently preferred embodiment, but can be shifted in magnitude or phase.
Thus, the 1-pole adaptive filter 14 filters the residual signal times
series u.sub.k to produce a filtered time series u'.sub.k. As discussed
above, this filtered time series u'.sub.k will have its high frequency
energy greatly reduced during the voiced speech segments, but will retain
nearly the full frequency band width during the unvoiced speech segments.
This filtered residual signal u'.sub.k is then subjected to further
processing, to extract the pitch candidates and voicing decision.
A wide variety of methods to extract pitch information from a residual
signal exist, and any of them can be used. Many of these are discussed
generally in the Markel and Gray book incorporated by reference above.
In the presently preferred embodiment, the candidate pitch values are
obtained by finding the peaks in the normalized correlation function of
the filtered residual signal, defined as follows:
##EQU2##
where u'.sub.j is the filtered residual signal, k.sub.min and k.sub.max
define the boundaries for the correlation lag k, and m is the number of
samples in one frame period (80 in the preferred embodiment) and therefore
defines the number of samples to be correlated. The candidate pitch values
are defined by the lags k* at which the value of C(k*) takes a local
maximum, and the scalar value of C(k) is used to define a "goodness" value
for each candidate k*.
Optionally a threshold value C.sub.min will be imposed on the goodness
measure C(k), and local maxima of C(k) which do not exceed the threshold
value C.sub.min will be ignored. If no k* exists for which C(k*) is
greater than C.sub.min, then the frame is necessarily unvoiced.
Alternately, the goodness threshold C.sub.min can be dispensed with, and
the normalized autocorrelation function 16' can simply be controlled to
report out a given number of candidates which have the best goodness
values, e.g., the 16 pitch period candidates k having the largest values
of C(k).
In one embodiment, no threshold at all is imposed on the goodness value
C(k), and no voicing decision is made at this stage. Instead, the 16 pitch
period candidates k*.sub.1, k*.sub.2, etc., are reported out, together
with the corresponding goodness value (C(k*.sub.i)) for each one. In the
presently preferred embodiment, the voicing decision is not made at this
stage, even if all of the C(k) values are extremely low, but the voicing
decision will be made in the succeeding dynamic programming step,
discussed below.
In the presently preferred embodiment, a variable number of pitch
candidates are identified, according to a peak-finding algorithm. That is,
the graph of the "goodness" values C(k) versus the candidate pitch period
k is tracked. Each local maximum is identified as a possible peak.
However, the existence of a peak at this identified local maximum is not
confirmed until the function has thereafter dropped by a constant amount.
This confirmed local maximum then provides one of the pitch period
candidates. After each peak candidate has been identified in this fashion,
the algorithm then looks for a valley. That is, each local minimum is
identified as a possible valley, but is not confirmed as a valley until
the function has thereafter risen by a predetermined constant value. The
valleys are not separately reported out, but a confirmed valley is
required after a confirmed peak before a new peak will be identified. In
the presently preferred embodiment, where the goodness values are defined
to be bounded by +1 or -1, the constant value required for confirmation of
a peak or for a valley has been set at 0.2, but this can be widely varied.
Thus, this stage provides a variable number of pitch candidates as output,
from zero up to 15.
In the presently preferred embodiment, the set of pitch period candidates
provided by the foregoing steps is then provided to a dynamic programming
algorithm. This dynamic programming algorithm tracks both pitch and
voicing decisions, to provide a pitch and voicing decision for each frame
which is optimal in the context of its neighbors.
Given the candidate pitch values and their goodness values C(k), dynamic
programming is now used to obtain an optimum pitch contour which includes
an optimum voicing decision for each frame. The dynamic programming
requires several frames of speech in a segment of speech to be analyzed
before the pitch and voicing for the first frame of the segment can be
decided. At each frame of the speech segment, every pitch candidate is
compared to the retained pitch candidates from the previous frame. Every
retained pitch candidate from the previous frame carries with it a
cumulative penalty, and every comparison between each new pitch candidate
and any of the retained pitch candidates also has a new distance measure.
Thus, for each pitch candidate in the new frame, there is a smallest
penalty which represents a best match with one of the retained pitch
candidates of the previous frame. When the smallest cumulative penalty has
been calculated for each new candidate, the candidate is retained along
with its cumulative penalty and a back pointer to the best match in the
previous frame. Thus, the back pointers define a trajectory which has a
cumulative penalty as listed in the cumulative penalty value of the last
frame in the project rate. The optimum trajector for any given frame is
obtained by choosing the trajectory with the minimum cumulative penalty.
The unvoiced state is defined as a pitch candidate at each frame. The
penalty function preferably includes voicing information, so that the
voicing decision is a natural outcome of the dynamic programming strategy.
In the presently preferred embodiment, the dynamic programming strategy is
16 wide and 6 deep. That is, 15 candidates (or fewer) plus the "unvoiced"
decision (stated for convenience as a zero pitch period) are identified as
possible pitch periods at each frame, and all 16 candidates, together with
their goodness values, are retained for the 6 previous frames. FIG. 5
shows schematically the operation of such a dynamic programming algorithm,
indicating the trajectories defined within the data points. For
convenience, this diagram has been drawn to show dynamic programming which
is only 4 deep and 3 wide, but this embodiment is precisely analogous to
the presently preferred embodiment.
The decisions as to pitch and voicing are made final only with respect to
the oldest frame contained in the dynamic programming algorithm. That is,
the pitch and voicing decision would accept the candidate pitch at frame
F.sub.K-5 whose current trajectory cost was minimal. That is, of the 16
(or fewer) trajectories ending at the most recent frame F.sub.K, the
candidate pitch in frame F.sub.K which has the lowest cumulative
trajectory cost identifies the optimal trajectory. This optimal trajectory
is then followed back and used to make the pitch/voicing decision for
frame F.sub.K-5. Note that no final decision is made as to pitch
candidates in succeeding frames (F.sub.k-4, etc.), since the optimal
trajectory may no longer appear optimal after more frames are evaluated.
Of course, as is well known to those skilled in the art of numerical
optimization, a final decision in such a dynamic programming algorithm can
alternatively be made at other times, e.g. in the next to last frame held
in the buffer. In addition, the width and depth of the buffer can be
widely varied. For example, as many as 64 pitch candidates could be
evaluated, or as few as two; the buffer could retain as few as one
previous frame, or as many as 16 previous frames or more, and other
modifications and variations can be instituted as will be recognized by
those skilled in the art. The dynamic programming algorithm is defined by
the transition error between a pitch period candidate in one frame and
another pitch period candidate in the succeeding frame. In the presently
preferred embodiment, this transition error is defined as the sum of three
parts: an error E.sub.p due to pitch deviations, an error E.sub.s due to
pitch candidates having a low "goodness" value, and an error E.sub.t due
to the voicing transition.
The pitch deviation error E.sub.P is a function of the current pitch period
and the previous pitch period as given by:
##EQU3##
if both frames are voiced, and E.sub.P =B.sub.P .times.D.sub.N otherwise;
where tau is the candidate pitch period of the current frame, tau.sub.P is
a retained pitch period of the previous frame with respect to which the
transition error is being computed, and B.sub.P, A.sub.D, and D.sub.N are
constants. Note that the minimum function includes provision for pitch
period doubling and pitch period halving. This provision is not strictly
necessary in the present invention, but is believed to be advantageous. Of
course, optionally, similar provision could be included for pitch period
tripling, etc.
The voicing state error, E.sub.S, is a function of the "goodness" value
C(k) of the current frame pitch candidate being considered. For the
unvoiced candidate, which is always included among the 16 or fewer pitch
period candidates to be considered for each frame, the goodness value C(k)
is set equal to the maximum of C(k) for all of the other 15 pitch period
candidates in the same frame. The voicing state error E.sub.S is given by
E.sub.S =B.sub.S (R.sub.V -C(tau), if the current candidate is voiced, and
E.sub.S =B.sub.S (C(tau)-R.sub.U) otherwise, where C(tau) is the "goodness
value" corresponding to the current pitch candidate tau, and B.sub.S,
R.sub.V, and R.sub.U are constants.
The voicing transition error E.sub.T is defined in terms of a spectral
difference measure T. The spectral difference measure T defines, for each
frame, gener | | |