BACKGROUND OF THE INVENTION
The field of this invention is speech technology generally and, in
particular, methods and devices for analyzing, digitally encoding and
synthesizing speech or other acoustic waveforms.
Systems for digital encoding and synthesis of speech are the subject of
considerable present interest, particularly at rates compatible with
existing transmission lines, which commonly carry digital information at
2.4-9.6 kilobits per second. At such rates, conventional systems based
upon speech waveform modeling are inadequate for coding applications and
yield poor quality speech transmission, even if linear predictive coding
(LPC) and other efficient coding techniques are used.
Typically, the problem of representing speech signals is approached by
using a speech production model in which speech is viewed as the result of
passing a glottal excitation waveform through a time-varying, linear
filter that models the resonant characteristics of the vocal tract. In a
so-called "binary excitation model," it is assumed that the glottal
excitation can be in one of two possible states corresponding to voiced or
unvoiced speech.
In the voiced speech state, the excitation is periodic with a period which
is allowed to vary slowly over time relative to the analysis frame rate,
typically 10-20 msec. For the unvoiced speech state, the glottal
excitation is modeled as random noise with a flat spectrum. In both cases,
the power level in the excitation is also considered to be slowly
time-varying.
While this binary model has been used successfully to design narrowband
vocoders and speech synthesis systems, its limitations are well known. For
example, the speech excitation is often mixed, having both voiced and
unvoiced components simultaneously, and often only portions of the
spectrum are truly harmonic. Additionally, the binary model requires that
each frame of data be classified as either voiced or unvoiced, a decision
which is difficult to make if the speech is subject to additive acoustic
noise.
The above-referenced parent application, U.S. Ser. No. 712,866, discloses
an alternative to the binary excitation model in which speech analysis and
synthesis, as well as coding, can be accomplished simply and effectively
by employing a time-frequency representation of the speech waveform which
is independent of the speech state. In particular, a sinusoidal model for
the speech waveform is utilized to develop a new analysis and synthesis
method.
The basic method of U.S. Ser. No. 712,866 includes the steps of (i)
selecting frames--i.e. windows of approximately 20-60 milliseconds--of
samples from the waveform; (ii) analyzing each frame of samples to extract
a set of frequency components; (iii) tracking the components from one
frame to the next; and (iv) interpolating the values of the components
from one frame to the next to obtain a parametric representation of the
waveform. A synthetic waveform can then be constructed by generating a set
of sine waves corresponding to the parametric representation. The
disclosures of U.S. Ser. No. 712,866 are incorporated herein by reference.
In one illustrated embodiment described in detail in U.S. Ser. No. 712,866,
the basic method is utilized to select amplitudes, frequencies and phases
corresponding to the largest peaks in a periodogram of the measured
signal, independently of the speech state. In order to reconstruct the
speech waveform, the amplitudes, frequencies and phases of the sine waves
estimated on one frame are matched and allowed to continuously evolve into
the corresponding parameter set on the next frame.
Because the number of estimated peaks is not constant and is slowly
varying, the matching process is not straightforward. Rapidly varying
regions of speech, such as unvoiced/voiced transitions, can result in
large changes in both the location and number of peaks.
To account for such rapid movements in spectral energy, the concept of
"birth" and "death" of sinusoidal components is employed in a
nearest-neighbor matching method based on the frequencies estimated on
each frame. If a new peak appears, a "birth" is said to occur and a new
track is initiated. If an old peak is not matched, a "death" is said to
occur and the corresponding track is allowed to decay to zero.
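The birth/death bookkeeping described above can be sketched as follows. This is a simplified greedy nearest-neighbor matcher written for illustration only; the function name `match_peaks` and the `max_jump_hz` limit on how far a track may move between frames are assumptions, and the complete matching procedure is the one set out in U.S. Ser. No. 712,866.

```python
def match_peaks(prev_freqs, next_freqs, max_jump_hz):
    """Greedy nearest-neighbor matching of peak frequencies between two frames.

    Returns (matches, deaths, births), where matches is a list of
    (prev_index, next_index) pairs, deaths lists unmatched old peaks and
    births lists unmatched new peaks. max_jump_hz is a hypothetical limit
    on the frequency change allowed within one track.
    """
    # All admissible pairs, closest in frequency first.
    candidates = sorted(
        (abs(fp - fn), i, j)
        for i, fp in enumerate(prev_freqs)
        for j, fn in enumerate(next_freqs)
        if abs(fp - fn) <= max_jump_hz
    )
    used_prev, used_next, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_prev and j not in used_next:
            matches.append((i, j))
            used_prev.add(i)
            used_next.add(j)
    # An old peak with no partner "dies"; a new peak with no partner is "born".
    deaths = [i for i in range(len(prev_freqs)) if i not in used_prev]
    births = [j for j in range(len(next_freqs)) if j not in used_next]
    return sorted(matches), deaths, births
```

For example, with peaks at 200, 410 and 900 Hz on one frame and 205, 400, 1500 and 620 Hz on the next, and a 50 Hz limit, the first two tracks continue, the 900 Hz track dies, and two new tracks are born.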
Once the parameters on successive frames have been matched, phase
continuity of each sinusoidal component is ensured by unwrapping the
phase. In one embodiment described in U.S. Ser. No. 712,866, the phase is
unwrapped using a cubic phase interpolation function having parameter
values that are chosen to satisfy the measured phase and frequency
constraints at the frame boundaries while maintaining maximal smoothness
over the frame duration.
In the final step of the illustrated embodiment, the corresponding
sinusoidal amplitudes are interpolated in a linear manner across each
frame.
In speech coding applications, U.S. Ser. No. 712,866 teaches that pitch
estimates can be used to establish a set of harmonic frequency bins to
which frequency components are assigned. The term "pitch" is used herein
to denote the fundamental rate at which a speaker's vocal cords are
vibrating. The amplitudes of the components are coded directly using
adaptive differential pulse code modulation (ADPCM) across frequency, or
indirectly using linear predictive coding (LPC).
In one embodiment of the coder, the peak in each harmonic frequency bin
having the largest amplitude is selected and assigned to the frequency at
the center of the bin. This results in a harmonic series based upon the
coded pitch period. An amplitude envelope can then be constructed by
connecting the resulting set of peaks and later sampled in a
pitch-adaptive fashion (either linearly or non-linearly) to provide
efficient coding at various bit rates. The phases can then be coded by
measuring the phases of the edited peaks and then coding such phases using
4 to 5 bits per phase peak. Further details on coding acoustic waveforms
in accordance with applicants' sinusoidal analysis techniques can be found
in commonly-owned, copending U.S. patent application Ser. No. 034,097,
entitled "Coding of Acoustic Waveforms," incorporated herein by reference.
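The selection of one peak per harmonic frequency bin can be illustrated by the following sketch. It assumes, for illustration only, that each bin is centered on a multiple of the pitch and is one pitch interval wide; the function name `assign_to_harmonic_bins` and its arguments are hypothetical.

```python
def assign_to_harmonic_bins(peaks, pitch_hz, max_hz):
    """Keep the largest peak in each harmonic bin and move it to the bin center.

    peaks is a list of (frequency_hz, amplitude) pairs. Bins are assumed to be
    centered on k * pitch_hz and to be pitch_hz wide (an illustrative choice).
    Returns a list of (harmonic_frequency_hz, amplitude) pairs.
    """
    n_harmonics = int(max_hz // pitch_hz)
    best = {}
    for freq, amp in peaks:
        k = int(round(freq / pitch_hz))  # index of the nearest harmonic
        if 1 <= k <= n_harmonics and amp > best.get(k, 0.0):
            best[k] = amp
    # Each surviving peak is reassigned to the center frequency of its bin.
    return [(k * pitch_hz, best[k]) for k in sorted(best)]
```

The result is a harmonic series whose amplitudes can then be connected to form the envelope described above.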
Analysis/synthesis systems constructed according to the invention disclosed
in U.S. Ser. No. 712,866, based on a sinusoidal representation of speech,
yield synthetic speech that is essentially indistinguishable from the
original. Coding techniques as disclosed in U.S. Ser. No. 034,097 have led
to the realization of multi-rate coders operating at rates from 2.4 to 9.6
kilobits per second. Such systems produce synthetic speech that is very
intelligible at all rates and, in general, produce speech having
progressively improving quality as the data rate is increased.
A practical limitation of the sinusoidal technique has been the
computational complexity required to perform the sinusoidal synthesis.
This complexity results because it is typically necessary to generate each
sine wave on a per-sample basis and then sum the resulting set of sine
waves. Good performance can be achieved in sinusoidal analysis/synthesis
while operating at a 50 Hz frame rate, provided that the sine wave
frequencies are matched from frame to frame and that either cubic phase or
piece-wise quadratic phase interpolators are used to ensure consistency
between the measured frequencies and phases at the frame boundaries. The
disadvantage of this approach is the computational overhead associated
with the interpolation process. Even if very powerful 125 nanosecond/cycle
microprocessors are utilized, such as the ADSP2100 DSP integrated circuits
manufactured by Analog Devices (Norwood, Mass.), two such microprocessors
typically are required to synthesize 80 sine waves.
An alternative method for performing sinusoidal synthesis includes
constructing a set of sine waves having constant amplitudes, frequencies
and linearly-varying phases, applying a triangular window of twice the
frame size, and then utilizing an overlap-and-add technique in conjunction
with the sine waves generated on the previous frame. Such a set of sine
waves can also be generated using conventional Fast Fourier Transform
(FFT) methods. In this approach, a Fast Fourier Transform (FFT) buffer is
filled out with non-zero entries at the sine wave frequencies, an inverse
FFT is executed, and then the overlap-and-add technique is applied. This
process also leads to synthetic speech that is perceptually
indistinguishable from the original, provided the frame rate is
approximately 100 Hz (10 ms/frame).
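The buffer-filling step of the FFT approach can be sketched as follows. To keep the example self-contained, a direct inverse discrete Fourier transform from the standard library stands in for an FFT routine, and each sine wave frequency is assumed to fall exactly on a transform bin; the function name `synthesize_frame_via_idft` is hypothetical.

```python
import cmath
import math

def synthesize_frame_via_idft(sines, nfft):
    """Generate nfft samples of a sum of sine waves from a frequency-domain buffer.

    sines is a list of (amplitude, bin_index, phase) with 0 < bin_index < nfft/2.
    Placing frequencies exactly on bins is an assumption of this sketch.
    """
    buf = [0j] * nfft
    for amp, k, phase in sines:
        # Non-zero entries at the sine wave frequency and its mirror image,
        # so that the inverse transform is real-valued.
        buf[k] += (amp / 2) * cmath.exp(1j * phase)
        buf[-k] += (amp / 2) * cmath.exp(-1j * phase)
    # Direct inverse DFT (no 1/nfft factor, since amplitudes were stored unscaled).
    return [
        sum(buf[k] * cmath.exp(2j * math.pi * k * n / nfft) for k in range(nfft)).real
        for n in range(nfft)
    ]
```

Each output sample equals the sum of A cos(2 pi k n / nfft + phase) over the listed sine waves; the triangular window and the overlap-and-add with the previous frame would then be applied to this buffer.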
However, for low-rate coding applications, it is necessary to operate at a
50 Hz frame rate (20 ms/frame) or lower. At these frame rates, the FFT
overlap-and-add method yields synthetic speech that sounds "rough" because
the triangular parametric window is at least 40 ms wide, and this is too
long a period compared to the rate of change of the vocal tract and vocal
cord articulators.
An apparatus for computationally efficient coding of acoustic waveforms at
frame rates of 50 Hz or less, without the "roughness" produced at low
coding rates by the above-described methods, would meet a substantial
need. In particular, speech processing devices and methods that reduce
frame-to-frame discontinuities at low coding rates would be particularly
advantageous for coding of speech.
Accordingly, there exists a need for computationally efficient methods and
devices for synthesizing sine waves for speech coding, analysis and
synthesis systems which operate at low coding rates requiring frame rates
of 50 Hz and below. In particular, techniques and apparatus for efficient
synthesis of sine waves in connection with sinusoidal transform coding
would satisfy long-felt needs and provide substantial contributions to the
art.
SUMMARY OF THE INVENTION
Sine wave synthesis and coding systems are further disclosed for processing
acoustic waveforms based on Fast Fourier Transform (FFT) overlap-and-add
techniques. A technique for sine wave synthesis is disclosed which
relieves computational choke points by generating mid-frame sine wave
parameters, thereby reducing frame-to-frame discontinuities, particularly
at low coding rates. The technique is applied to the sinusoidal model
after the frame-to-frame sine wave matching has been performed. Mid-frame
values are obtained by linearly interpolating the matched sine wave
amplitudes and frequencies and estimating a mid-point phase, such that the
mid-frame sine wave is best fit to the most recent half-frame segments of
the lagging and leading sine waves.
For example, the invention provides methods and apparatus for receiving
sets of sine wave parameters every 20 ms and for implementing an
interpolation technique that allows for resynthesis every 10 ms.
In synthesizing the mid-frame sine wave components, the mid-frame phase can
be estimated as follows:
$$\theta(M) = \frac{\theta_0 + \theta_1}{2} + \frac{\omega_0 - \omega_1}{2}\cdot\frac{N}{4} + \pi M$$
where $M$ is an integer whose value is chosen such that $\pi M$ is closest to
$$\frac{\theta_0 - \theta_1}{2} + \frac{\omega_0 + \omega_1}{2}\cdot\frac{N}{4}$$
and where $\theta_0$ is the phase of the lagging frame, $\theta_1$ is
the phase of the leading frame, $\omega_0$ is the frequency of the
lagging frame, $\omega_1$ is the frequency of the leading frame, and $N$
is the analysis frame length.
In another aspect of the invention, a system is disclosed which provides
improved quality, particularly for low-rate speech coding applications
where the speech has been corrupted by additive acoustic noise. For high
pitched speakers especially, background noise can have a tonal quality
when resynthesized that can be annoying if the signal-to-noise ratio (SNR)
is low. When a pitch-adaptive analysis window is used, the window will be
short for high pitched speakers and, when applied to the noise, will
result in relatively few resolved sine waves. The resulting synthetic
noise then sounds tonal. In addition to reducing the frame-to-frame
discontinuities, the present invention suppresses this tonal noise and
replaces it with a more "noise-like" signal which improves the robustness
of the system.
In one embodiment of the noise compensating system, the receiver can employ
a voicing measure to determine highly unvoiced frames (i.e., noisy
frames), and the spectra for successive noisy frames can then be averaged
to obtain an average background noise spectrum. This information can be
used to suppress the synthesized noise at the harmonics in accordance with
the SNR at each harmonic and used to replace the suppressed noise with a
broad band noise having the same spectral characteristic.
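The averaging of noisy-frame spectra can be sketched as follows. The voicing measure is assumed to run from 0 (fully unvoiced) to 1 (fully voiced), and the threshold of 0.2 is a hypothetical value. The text above does not give the suppression rule, so the gain SNR/(1+SNR) used here is only an illustrative stand-in, not the disclosed rule.

```python
def update_noise_spectrum(avg, count, frame_amps, voicing, unvoiced_threshold=0.2):
    """Fold one frame into the running average if it is highly unvoiced.

    voicing is assumed to lie in [0, 1]; the threshold is hypothetical.
    Returns the (possibly updated) average spectrum and frame count.
    """
    if voicing >= unvoiced_threshold:
        return avg, count  # not a noisy frame: leave the estimate unchanged
    count += 1
    avg = [a + (x - a) / count for a, x in zip(avg, frame_amps)]
    return avg, count

def suppression_gains(frame_amps, noise_avg):
    """Per-harmonic gains from an estimated SNR (illustrative rule only)."""
    gains = []
    for x, n in zip(frame_amps, noise_avg):
        if n <= 0.0:
            gains.append(1.0)  # no noise estimated at this harmonic
            continue
        snr = max(x * x - n * n, 0.0) / (n * n)
        gains.append(snr / (1.0 + snr))
    return gains
```

The averaged spectrum would also shape the broadband noise that replaces the suppressed components.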
Methods are also disclosed for phase regeneration of sine waves for which
no phase coding is possible. At low data rates (e.g., 2.4 kbps and below),
it is typically not possible to code any of the sine wave phases. Thus, in
another aspect of the invention, techniques are disclosed to reconstruct
an appropriate set of phases for use in synthesis, based on an assumption
that all the sine waves should come into phase every pitch onset time.
Reconstruction is achieved by defining a phase function for the pitch
fundamental obtained by integration of the instantaneous pitch frequency.
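The phase function for the pitch fundamental can be sketched as a running sum of the instantaneous pitch frequency. In this simplified illustration the phase of the k-th harmonic is taken to be k times the fundamental phase, so that all sine waves come into phase whenever the fundamental phase passes through a multiple of 2 pi; any contribution from the vocal tract is ignored, and the function name `regenerate_phases` is hypothetical.

```python
import math

def regenerate_phases(pitch_hz_per_sample, sample_rate_hz, n_harmonics):
    """Rebuild harmonic phases from a pitch contour given once per sample.

    Returns one list of n_harmonics phases (radians) for each input sample.
    Only the excitation phase is modeled; this is an assumption of the sketch.
    """
    phi0 = 0.0
    phases = []
    for f0 in pitch_hz_per_sample:
        # The running sum approximates the integral of the pitch frequency.
        phi0 += 2.0 * math.pi * f0 / sample_rate_hz
        phases.append([k * phi0 for k in range(1, n_harmonics + 1)])
    return phases
```

With a constant 100 Hz pitch at an 8 kHz sampling rate, the fundamental phase reaches 2 pi after 80 samples, at which point every harmonic is at a cosine peak.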
The invention will next be described in connection with certain illustrated
embodiments. However, it should be clear that various changes and
modifications can be made by those skilled in the art without departing
from the spirit and scope of the invention, as defined by the claims. For
example, although the description that follows is particularly adapted to
speech coding, it should be clear that various other acoustic waveforms
can be processed in a similar fashion.
BRIEF DESCRIPTION OF THE DRAWINGS
For a more thorough understanding of the nature and objects of the
invention, reference should be had to the following detailed description
and to the drawings, in which:
FIG. 1 is an illustration of a simple overlap-and-add interpolation
technique in accordance with the invention, showing a triangular
parametric window applied to sine wave parameters obtained at frame
boundaries to generate interpolated values between those measured at frame
boundaries;
FIG. 2 is an illustration of a further application of overlap-and-add
interpolation techniques according to the invention, showing the
generation of an artificial mid-frame sine wave to reduce the
discontinuities in the resynthesized waveform at low coding rates;
FIG. 3 is a flow chart showing the steps of a method of mid-frame sine wave
synthesis according to the invention;
FIG. 4 is a schematic block diagram of a mid-frame sine wave synthesis
system according to the invention; and
FIG. 5 is a further schematic block diagram showing a noise suppressing
receiver structure according to the invention.
DETAILED DESCRIPTION
In the present invention the speech waveform is modeled as a sum of sine
waves. If s(n) represents the sampled speech waveform, then
$$s(n) = \sum_i A_i(n)\cos[\theta_i(n)] \qquad (1)$$
where $A_i(n)$ and $\theta_i(n)$ are the time-varying amplitudes and
phases of the i'th tone.
To obtain a representation of the waveform over time, frequency components
measured on one analysis frame must be matched with frequency components
that are obtained on a successive frame. In particular, a frequency
component from one frame must be matched with a frequency component in the
next frame having the "closest" value. The matching technique is described
in more detail in parent case U.S. Ser. No. 712,866, herein incorporated
by reference. Once matched, the values of the components from one frame to
the next must be interpolated to obtain a parametric representation in
which the sine waves of one frame evolve into the corresponding parameter
set of the next frame.
FIG. 1 illustrates the basic process of interpolating exemplary frequency
components for frames K and K+1 in accordance with the invention by the
overlap-and-add method. The triangular windows A and B shown in FIG. 1 are
used to interpolate the sine wave components from frame K to frame K+1. In
the overlap-and-add method of filling in data values, the triangular
window is applied to the resulting sine waves generated during each frame.
The overlapped values in region C are then summed to fill in the values
between those measured at the frame boundaries.
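The summation in the overlapped region can be sketched in the time domain as follows. Each frame contributes constant-amplitude, constant-frequency sine waves whose phases are referenced to that frame's boundary, frequencies are assumed to be in radians per sample, and the function name `overlap_add_segment` is hypothetical.

```python
import math

def overlap_add_segment(sines_k, sines_k1, frame_len):
    """Fill in the frame_len samples between frame K and frame K+1.

    sines_k and sines_k1 are lists of (amplitude, omega, theta), with theta
    measured at the respective frame boundary (sample 0 for frame K, sample
    frame_len for frame K+1). Units are an assumption of this sketch.
    """
    out = []
    for n in range(frame_len):
        fade_out = 1.0 - n / frame_len  # falling half of the window on frame K
        fade_in = n / frame_len         # rising half of the window on frame K+1
        s_k = sum(a * math.cos(w * n + th) for a, w, th in sines_k)
        s_k1 = sum(a * math.cos(w * (n - frame_len) + th) for a, w, th in sines_k1)
        out.append(fade_out * s_k + fade_in * s_k1)
    return out
```

When both frames describe the same stationary sine wave, the two windowed contributions sum back to that sine wave exactly, because the two halves of the triangular windows add to one.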
The overlap-and-add technique illustrated in FIG. 1 yields good performance for
frame rates near 100 Hz, i.e. 10 ms frames. However, for most coding
applications, frame rates of approximately 50 Hz, i.e. 20 ms frames,
are required. When the overlap-and-add interpolation technique shown in
FIG. 1 is used in this case, the triangular window is effectively 40 ms
wide, which assumes a stationarity that is too long relative to the rate
of change of the human vocal tract and vocal cord articulators, and
significant frame-to-frame discontinuities result. Thus, a further
preferred embodiment of the invention provides a method for minimizing
such discontinuities.
If $A_0$, $\omega_0$, and $\theta_0$ represent the amplitude,
frequency and phase of a sine wave on frame K and $A_1$, $\omega_1$,
and $\theta_1$ represent the amplitude, frequency and phase of the
matched sine wave on frame K+1, then the equations:
$$A = \frac{A_0 + A_1}{2} \qquad (2)$$
and
$$\omega = \frac{\omega_0 + \omega_1}{2} \qquad (3)$$
represent a good approximation of the true amplitude and frequency at the
mid-point between frame K and frame K+1. Equations 2 and 3 represent one
set of interpolation functions which can be used to fill in data values
between those measured at frame boundaries.
In order to minimize any discontinuity between the sine wave at frame K and
its transition to the synthetic sine wave at the mid-point and between the
synthetic sine wave and its transition to the sine wave at frame K+1, the
invention calculates a phase that yields the minimum mean-squared-error at
times N/4 and 3N/4, where N is the analysis frame length. This phase is
calculated according to the equation:
$$\theta(M) = \frac{\theta_0 + \theta_1}{2} + \frac{\omega_0 - \omega_1}{2}\cdot\frac{N}{4} + \pi M \qquad (4)$$
where $M$ is an integer whose value is chosen such that $\pi M$ is closest to
$$\frac{\theta_0 - \theta_1}{2} + \frac{\omega_0 + \omega_1}{2}\cdot\frac{N}{4} \qquad (5)$$
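Equations 2 through 5 can be evaluated directly, as in the following sketch. Frequencies are assumed to be in radians per sample and N in samples; the function name `midframe_parameters` is hypothetical.

```python
import math

def midframe_parameters(a0, w0, th0, a1, w1, th1, n_frame):
    """Mid-frame amplitude, frequency and phase per Equations 2 through 5.

    w0 and w1 are assumed to be in radians per sample and n_frame (N) in
    samples; these units are an assumption of the sketch.
    """
    amp = (a0 + a1) / 2.0                                  # Equation 2
    omega = (w0 + w1) / 2.0                                # Equation 3
    quarter = n_frame / 4.0
    target = (th0 - th1) / 2.0 + (w0 + w1) / 2.0 * quarter # Equation 5
    m = round(target / math.pi)                            # pi*M closest to target
    theta = (th0 + th1) / 2.0 + (w0 - w1) / 2.0 * quarter + math.pi * m  # Equation 4
    return amp, omega, theta
```

As a check, suppose both boundary phases come from one stationary sine wave and the boundaries are N/2 samples apart (an assumption made only for this check). The leading phase is known only modulo 2 pi, and the integer M restores the missing multiple of pi, so the function returns the phase that the sine wave actually has halfway between the boundaries.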
In accordance with this preferred embodiment of the invention, an
artificial set of mid-frame sine waves is generated by applying the above
interpolation rules for all of the matched sine waves and then applying a
conventional FFT overlap-and-add technique. FIG. 2 illustrates this
overlap-and-add interpolation technique, showing an artificial sine wave
between frame K and frame K+1. The artificial sine wave S(n), generated
with values provided by the above interpolation rules, reduces the
discontinuities between $S_0(n)$ and $S_1(n)$ shown in FIG. 2.
Because the effective stationarity has been reduced from 40 ms to 20 ms,
the resulting synthetic speech is no longer "rough." Hence, the invention
provides a method for doubling the effective synthesis rate with no
increase in the actual transmission frame rate.
In FIG. 3, a flow chart of the processing steps for interpolation using
synthetic mid-frame parameters according to the invention is shown. Sine
wave parameters for each frame are received and sampled every T ms, where
T is the frame period for frames K and K+1. The sine wave parameters
include amplitude $A$, frequency $\omega$ and phase $\theta$. As shown in FIG.
3, the interpolation procedure begins in step 1 with the sine wave
parameters for frame K which are used to initialize the process. Next in
step 2, the sine wave parameters for frame K+1 are received.
The frequency components for frames K and K+1 are then matched in step 3,
preferably according to the method described in U.S. Ser. No. 712,866, and
in step 4 a mid-frame sine wave is constructed having an amplitude and
frequency given by Equations 2 and 3, and a phase is estimated for each
sine wave component, in accordance with Equation 4 above, such that each
mid-frame sine wave is best fit to the most recent half-frame segments of
the lagging and leading sine waves.
Finally in step 5, the overlap-and-add technique is applied to interpolate
between the frame K and mid-frame values and, likewise, to interpolate
between the mid-frame and frame K+1 values in order to synthesize a set of
waveforms at a virtual frame period of T/2 ms. Thus, the synthetic waveform
reduces the discontinuities between the frame K and frame K+1 waveforms,
in effect generating an artificial frame half the duration of the actual
frame.
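Steps 4 and 5 can be sketched together for a single matched sine wave as follows. The sketch assumes that the offset from each frame boundary to the mid-point, which appears as N/4 in Equations 4 and 5, equals half the spacing between frames K and K+1, and it passes that offset as `half`. The function names are hypothetical, and frequencies are assumed to be in radians per sample.

```python
import math

def midframe(s0, s1, half):
    """Mid-frame (amplitude, omega, theta) for one matched pair of sine waves."""
    (a0, w0, t0), (a1, w1, t1) = s0, s1
    m = round(((t0 - t1) / 2.0 + (w0 + w1) / 2.0 * half) / math.pi)
    theta = (t0 + t1) / 2.0 + (w0 - w1) / 2.0 * half + math.pi * m
    return ((a0 + a1) / 2.0, (w0 + w1) / 2.0, theta)

def crossfade(sa, sb, length):
    """Triangular overlap-and-add between two sine waves `length` samples apart."""
    out = []
    for n in range(length):
        g = n / length
        xa = sa[0] * math.cos(sa[1] * n + sa[2])
        xb = sb[0] * math.cos(sb[1] * (n - length) + sb[2])
        out.append((1.0 - g) * xa + g * xb)
    return out

def synthesize_frame(s0, s1, frame_len):
    """Frame K to mid-frame, then mid-frame to frame K+1 (step 5)."""
    half = frame_len // 2
    mid = midframe(s0, s1, half)
    return crossfade(s0, mid, half) + crossfade(mid, s1, frame_len - half)
```

Each crossfade now spans only half a frame, so the triangular window is half as wide as it would be without the mid-frame sine wave.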
FIG. 4 is a block diagram of an acoustic waveform processing apparatus,
according to the invention. The transmitter 10 includes a sine wave
parameter estimator 12, which samples the input acoustic waveform to obtain
discrete samples and generates a series of frames, each frame spanning a
plurality of samples. The estimator 12 further includes means for
extracting a set of frequency components having discrete amplitudes and
phases. The amplitude, frequency and phase information extracted from the
sampled frames of the input waveform is coded by coder 14 for
transmission. The sampling, analyzing and coding functions of elements 12
and 14 are more fully discussed in U.S. Ser. No. 712,866, as well as U.S.
Ser. No. 034,097 also incorporated herein by reference.
In the receiver section 16, the coded amplitude, frequency and phase
information is decoded by decoder 18 and then analyzed by frequency
tracker 20 to match frequency components from one frame to the next.
The interpolator 22 interpolates the values of components from one frame to
the next frame to obtain a parametric representation of the waveform, so
that a synthetic waveform can be synthesized by generating a set of sine
waves corresponding to the interpolated values of the parametric
representation.
In a preferred embodiment of the invention, the interpolator 22 includes a
mid-frame phase estimator 24 which implements a "best fit" phase
calculation, in accordance with Equations 4 and 5 above, and a linear
interpolator 26, which linearly interpolates matched amplitude and
frequency components from one frame to the next frame. The apparatus 16
further includes an FFT-based sine wave generator 28 which performs an
overlap-and-add function utilizing Fourier analysis.
The generator 28 further includes means for filling a buffer with amplitude
and phase values at the sine wave frequencies, means for taking an inverse
FFT of the buffered values, and means for performing an overlap-and-add
operation with transformed values and those obtained from the previous
frame.
Moreover, as shown generally in FIG. 4, the apparatus 10 can also
optionally include a noise estimator and generator 30. For high-pitched
speakers especially, the background noise has a tonal quality that can
become quite annoying, particularly when the signal-to-noise ratio (SNR)
is low. The noise dependence on pitch is due to the fact that the analysis
window typically is set at two and one-half times the average pitch.
Hence, for a high-pitched speaker, the window will be short (but no less
than 20 ms) which, when applied to the noise, results in relatively few
resolved sine waves. The resulting synthetic noise then sounds tonal.
Conversely, for low-pitched speakers, the window will be quite long. This
results in a more finely resolved noise spectrum, which leads to a larger number of
sine waves for synthesis, which in turn sounds more "noise-like," that is
to say, less tonal.
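The dependence of the window length on pitch can be illustrated with a short calculation. The sketch reads "two and one-half times the average pitch" as two and one-half average pitch periods and applies the 20 ms floor stated above; the function name `analysis_window_ms` is hypothetical.

```python
def analysis_window_ms(avg_pitch_hz, min_ms=20.0, periods=2.5):
    """Pitch-adaptive analysis window length in milliseconds.

    Assumes the window spans `periods` average pitch periods, with a floor
    of min_ms; both defaults are taken from the description above.
    """
    return max(min_ms, periods * 1000.0 / avg_pitch_hz)
```

Under these assumptions a 250 Hz speaker would receive the minimum 20 ms window, while an 80 Hz speaker would receive a 31.25 ms window and therefore a more finely resolved spectrum.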
In FIG. 5, a noise correction system 30 according to the invention is shown
in more detail. The noise correction system 30 operates in concert with a
speech (or other acoustic waveform) synthesizer 32 (e.g., frequency
tracking, interpolating and sine wave generating circuitry as described
above in connection with FIG. 4), and includes a noise envelope estimator
34, a noise suppression filter 36, a broadband noise generator 38, and a
summer 40. The noise envelope estimator 34 estimates the noise envelope
parameters from decoded sine waves and voicing measurements, as discussed
in more detail below. These noise envelope parameters drive the noise
suppression filter 36 to modify the waveforms from synthesizer 32 and also
drive the broadband noise generator 38. The modified, synthetic waveforms
and broadband noise are then added in summer 40 to obtain the output
waveform in which "tonal" noise is essentially eliminated.
Although the noise correction system 30 is illustrated by discrete
elements, it should be apparent that the functions of some or all of these
elements can be combined in operation. For example, the noise correction
system can be implemented as part of the synthesizer, itse