Description
TECHNICAL FIELD
The field of this invention is speech technology generally and, in
particular, methods and devices for analyzing, digitally-encoding,
modifying and synthesizing speech or other acoustic waveforms.
BACKGROUND OF THE INVENTION
Typically, the problem of representing speech signals is approached by
using a speech production model in which speech is viewed as the result of
passing a glottal excitation waveform through a time-varying linear filter
that models the resonant characteristics of the vocal tract. In many
speech applications it suffices to assume that the glottal excitation can
be in one of two possible states corresponding to voiced or unvoiced
speech. In the voiced speech state the excitation is periodic with a
period which is allowed to vary slowly over time relative to the analysis
frame rate (typically 10-20 msecs). For the unvoiced speech state the
glottal excitation is modelled as random noise with a flat spectrum. In
both cases the power level in the excitation is also considered to be
slowly time-varying.
While this binary model has been used successfully to design narrowband
vocoders and speech synthesis systems, its limitations are well known. For
example, often the excitation is mixed, having both voiced and unvoiced
components simultaneously, and often only portions of the spectrum are
truly harmonic. Furthermore, the binary model requires that each frame of
data be classified as either voiced or unvoiced, a decision which is
particularly difficult to make if the speech is also subject to additive
acoustic noise.
Speech coders at rates compatible with conventional transmission lines
(i.e. 2.4-9.6 kilobits per second) would meet a substantial need. At such
rates the binary model is ill-suited for coding applications.
Additionally, speech processing devices and methods that allow the user to
modify various parameters in reconstructing a waveform would find
substantial usage. For example, time-scale modification (without pitch
alteration) would be a very useful feature for a variety of speech
applications (e.g. slowing down speech for translation purposes or
speeding it up for scanning purposes) as well as for musical composition
or analysis. Unfortunately, time-scale (and other parameter) modifications
also are not accomplished with high quality by devices employing the
binary model.
Thus, there exists a need for better methods and devices for processing
audible waveforms. In particular, speech coders operable at mid-band rates
and in noisy environments as well as synthesizers capable of maintaining
the perceptual quality of speech while changing the rate of articulation
would satisfy long-felt needs and provide substantial contributions to the
art.
SUMMARY OF THE INVENTION
It has been discovered that speech analysis and synthesis as well as coding
and time-scale modification can be accomplished simply and effectively by
employing a time-frequency representation of the speech waveform which is
independent of the speech state. Specifically, a sinusoidal model for the
speech waveform is used to develop a new analysis-synthesis technique.
The basic method of the invention includes the steps of: (a) selecting
frames (i.e. windows of about 20-40 milliseconds) of samples from the
waveform; (b) analyzing each frame of samples to extract a set of
frequency components; (c) tracking the components from one frame to the
next; and (d) interpolating the values of the components from one frame to
the next to obtain a parametric representation of the waveform. A
synthetic waveform can then be constructed by generating a series of sine
waves corresponding to the parametric representation.
In one simple embodiment of the invention, a device is disclosed which uses
only the amplitudes and frequencies of the component sine waves to
represent the waveform. In this so-called "magnitude-only" system, phase
continuity is maintained by defining the phase to be the integral of the
instantaneous frequency. In a more comprehensive embodiment, explicit use
is made of the measured phases as well as the amplitudes and frequencies
of the components.
The invention is particularly useful in speech coding and time-scale
modification and has been demonstrated successfully in both of these
applications. Robust devices can be built according to the invention to
operate in environments of additive acoustic noise. The invention also can
be used to analyze single and multiple speaker signals, music or even
biological sounds. The invention will also find particular applications,
for example, in reading machines for the blind, in broadcast journalism
editing and in transmission of music to remote players.
In one illustrated embodiment of the invention, the basic method summarized
above is employed to choose amplitudes, frequencies, and phases
corresponding to the largest peaks in a periodogram of the measured
signal, independently of the speech state. In order to reconstruct the
speech waveform, the amplitudes, frequencies, and phases of the sine waves
estimated on one frame are matched and allowed to continuously evolve into
the corresponding parameter set on the successive frame. Because the
number of estimated peaks is neither constant nor slowly varying, the
matching process is not straightforward. Rapidly varying regions of speech
such as unvoiced/voiced transitions can result in large changes in both
the location and number of peaks. To account for such rapid movements in
spectral energy, the concept of "birth" and "death" of sinusoidal
components is employed in a nearest-neighbor matching method based on the
frequencies estimated on each frame. If a new peak appears, a "birth" is
said to occur and a new track is initiated. If an old peak is not matched,
a "death" is said to occur and the corresponding track is allowed to decay to
zero. Once the parameters on successive frames have been matched, phase
continuity of each sinusoidal component is ensured by unwrapping the
phase. In one preferred embodiment the phase is unwrapped using a cubic
phase interpolation function having parameter values that are chosen to
satisfy the measured phase and frequency constraints at the frame
boundaries while maintaining maximal smoothness over the frame duration.
Finally, the corresponding sinusoidal amplitudes are simply interpolated
in a linear manner across each frame.
In speech coding applications, pitch estimates are used to establish a set
of harmonic frequency bins to which the frequency components are assigned.
(Pitch is used herein to mean the fundamental rate at which a speaker's
vocal cords are vibrating). The amplitudes of the components can be coded
directly using adaptive pulse code modulation (ADPCM) across frequency or
indirectly using linear predictive coding. In each harmonic frequency bin
the peak having the largest amplitude is selected and assigned to the
frequency at the center of the bin. This results in a harmonic series
based upon the coded pitch period. The phases can then be coded by using
the frequencies to predict phase at the end of the frame, unwrapping the
measured phase with respect to this prediction and then coding the phase
residual using 4 bits per phase peak. If there are not enough bits
available to code all of the phase peaks (e.g. for low-pitch speakers),
phase tracks for the high frequency peaks can be artificially generated.
In one preferred embodiment, this is done by translating the frequency
tracks of the base band peaks to the high frequency of the uncoded phase
peaks. This new coding scheme has the important property of adaptively
allocating the bits for each speaker and hence is self-tuning to both low-
and high-pitched speakers. Although pitch is used to provide side
information for the coding algorithm, the standard voice-excitation model
for speech is not used. This means that recourse is never made to a
voiced-unvoiced decision. As a consequence the invention is robust in
noise and can be applied at various data transmission rates simply by
changing the rules for the bit allocation.
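By way of illustration only, the harmonic binning and phase-residual coding described above might be sketched as follows. The bin edges (centered on multiples of the pitch, one pitch wide), the uniform quantizer, and all function names are assumptions made for this sketch; the text specifies only that the largest peak in each bin is assigned to the bin center and that the residual is coded with 4 bits.

```python
import math

TWO_PI = 2.0 * math.pi

def wrap(x):
    """Wrap an angle into the interval [-pi, pi)."""
    return (x + math.pi) % TWO_PI - math.pi

def assign_to_harmonic_bins(peaks, f0, max_freq):
    """Keep the largest peak in each harmonic bin and move it to the bin center.

    peaks is a list of (frequency_hz, amplitude). Bin h is assumed to span
    [(h - 0.5) * f0, (h + 0.5) * f0). Returns {h: (center_hz, amplitude)}.
    """
    best = {}
    for freq, amp in peaks:
        h = int(math.floor(freq / f0 + 0.5))  # nearest harmonic number
        if h < 1 or h * f0 > max_freq:
            continue
        if amp > best.get(h, 0.0):
            best[h] = amp
    return {h: (h * f0, best[h]) for h in sorted(best)}

def predict_phase(prev_phase, freq_hz, frame_len, fs):
    """Predict the end-of-frame phase by advancing at the coded frequency."""
    return prev_phase + TWO_PI * freq_hz * frame_len / fs

def code_phase_residual(measured, predicted, bits=4):
    """Quantize the wrapped residual (measured - predicted) to an integer index."""
    levels = 1 << bits
    step = TWO_PI / levels
    residual = wrap(measured - predicted)
    return min(int((residual + math.pi) / step), levels - 1)

def decode_phase(index, predicted, bits=4):
    """Reconstruct the phase from the prediction and the mid-point of the cell."""
    step = TWO_PI / (1 << bits)
    return predicted - math.pi + (index + 0.5) * step
```

With a uniform 4-bit quantizer the reconstruction error is bounded by half a cell, that is, by pi/16 radians; a different bit allocation only changes the `bits` argument.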
The invention is also well-suited for time-scale modification, which is
accomplished by time-scaling the amplitudes and phases such that the
frequency variations are preserved. The time-scale at which the speech is
played back is controlled simply by changing the rate at which the matched
peaks are interpolated. This means that the time-scale can be sped up
or slowed down by any factor and this factor can be time-varying. This
rate can be controlled by a panel knob which allows an operator complete
flexibility for varying the time-scale. There is no perceptual delay in
performing the time-scaling.
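The effect of changing the interpolation rate can be illustrated with a minimal sketch for a single matched track. It uses the simpler magnitude-only interpolation (linear amplitude and frequency, phase obtained by integration) rather than the amplitude and phase scaling of the time-scale embodiment; the function name and the constant `rate` parameter are assumptions of this sketch.

```python
import math

def synthesize_track(amps, freqs_hz, frame_len, fs, rate=1.0):
    """Synthesize one track while interpolating at a modified rate.

    amps[k] and freqs_hz[k] are the values measured at frame boundary k.
    frame_len is the original frame spacing in samples. A rate above 1
    shortens each frame (faster speech); a rate below 1 lengthens it.
    The interpolated frequencies are unchanged, so pitch is preserved.
    """
    steps = max(1, int(round(frame_len / rate)))  # output samples per frame
    out = []
    phase = 0.0  # assumed starting phase
    for k in range(len(amps) - 1):
        for j in range(steps):
            u = j / steps  # fractional position within the frame
            a = amps[k] + (amps[k + 1] - amps[k]) * u
            f = freqs_hz[k] + (freqs_hz[k + 1] - freqs_hz[k]) * u
            phase += 2.0 * math.pi * f / fs  # integrate instantaneous frequency
            out.append(a * math.sin(phase))
    return out
```

For example, `synthesize_track(a, f, 200, 10000, rate=0.5)` produces twice as many samples as `rate=1.0` for the same parameter tracks. A time-varying factor could be obtained by recomputing `steps` for each frame.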
The invention will next be described in connection with certain illustrated
embodiments. However, it should be clear that various changes and
modifications can be made by those skilled in the art without departing
from the spirit and scope of the invention. For example other sampling
techniques can be substituted for the use of a variable frame length and
Hamming window. Moreover the length of such frames and windows can vary in
response to the particular application. Likewise, frequency matching can
be accomplished by various means. A variety of commercial devices are
available to perform Fourier analysis; such analysis can also be performed
by custom hardware or specially-designed programs.
Various techniques for extracting pitch information can be employed. For
example, the pitch period can be derived from the Fourier transform. Other
techniques such as the Gold-Malpass techniques can also be used. See
generally, M. L. Malpass, "The Gold Pitch Detector in a Real Time
Environment" Proc. of EASCON 1975 (Sept. 1975); B. Gold, "Description of a
Computer Program for Pitch Detection", Fourth International Congress on
Acoustics, Copenhagen Aug. 21-28, 1962 and B. Gold, "Note on Buzz-Hiss
Detection", J. Acoust. Soc. Amer. 365, 1659-1661 (1964), all incorporated
herein by reference.
Various coding techniques can also be used interchangeably with those
described below. Channel encoding techniques are described in J. N.
Holmes, "The JSRU Channel Vocoder", Inst. of Electrical Eng. Proceedings
(British), 27, 53-60 (1980). Adaptive pulse code modulation is described
in L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals
(Prentice Hall 1978). Linear predictive coding is described by J. D. Markel,
Linear Prediction of Speech, (Springer-Verlag, 1967). These teachings are also
incorporated by reference.
It should be appreciated that the term "interpolation" is used broadly in
this application to encompass various techniques for filling in data
values between those measured at the frame boundaries. In the
magnitude-only system linear interpolation is employed to fill in
amplitude and frequency values. In this simple system phase values are
obtained by first defining a series of instantaneous frequency values by
interpolating matched frequency components from one frame to the next and
then integrating the series of instantaneous frequency values to obtain a
series of interpolated phase values. In the more comprehensive system the
phase value of each frame is derived directly and a cubic polynomial
equation preferably is employed to obtain maximally smooth phase
interpolations from frame to frame.
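A cubic phase interpolation of this kind can be sketched as follows. The sketch assumes that "maximally smooth" means minimizing the integral of the squared second derivative of the phase over the frame, a criterion this section does not state explicitly; the coefficients then follow from the boundary constraints on phase and frequency. Frequencies are in radians per sample, and the function name is hypothetical.

```python
import math

def cubic_phase(theta0, omega0, theta1, omega1, T):
    """Build a cubic phase function over one frame of T samples.

    theta0, omega0: measured phase and frequency at the start of the frame.
    theta1, omega1: measured phase (modulo 2*pi) and frequency at the end.
    Returns (phase, M), where phase(t) is the interpolant for 0 <= t <= T
    and M is the number of 2*pi multiples added to theta1 (the unwrapping).
    """
    two_pi = 2.0 * math.pi
    # Unwrapping integer that gives the smoothest cubic under the assumed criterion.
    x = ((theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2.0) / two_pi
    M = round(x)
    # Solve the two end-of-frame constraints for the quadratic and cubic terms.
    A = theta1 + two_pi * M - theta0 - omega0 * T
    B = omega1 - omega0
    alpha = 3.0 * A / T ** 2 - B / T
    beta = -2.0 * A / T ** 3 + B / T ** 2

    def phase(t):
        return theta0 + omega0 * t + alpha * t ** 2 + beta * t ** 3

    return phase, M
```

When the frequency is constant and the end phase is exactly the advanced start phase, `alpha` and `beta` vanish and the interpolant reduces to a straight line, which is the behavior one would expect from a smooth phase track.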
Other techniques that accomplish the same purpose are also referred to in
this application as interpolation techniques. For example, the so-called
"overlap and add" method of filling in data values can also be used. In
this method a weighted overlapping function can be applied to the
resulting sine waves generated during each frame and then the overlapped
values can be summed to fill in the values between those measured at the
frame boundaries.
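The overlap-and-add alternative can be illustrated with a short sketch. It assumes a triangular weighting function with 50 percent overlap and holds each frame's sine-wave parameters fixed over the span of its window; these choices, and the function name, are illustrative assumptions rather than details taken from the text.

```python
import math

def overlap_add(frames, hop):
    """Overlap-and-add synthesis from per-frame sine-wave parameters.

    frames[k] is a list of (amplitude, omega, phase) triples, where omega is
    in radians per sample and phase is the phase at the center of frame k.
    Frame k is centered at sample (k + 1) * hop and spans 2 * hop samples.
    """
    out = [0.0] * (hop * (len(frames) + 1))
    for k, params in enumerate(frames):
        center = (k + 1) * hop
        for j in range(-hop, hop):
            # Triangular weight; weights of adjacent frames sum to one.
            weight = 1.0 - abs(j) / hop
            sample = sum(a * math.sin(w * j + p) for a, w, p in params)
            out[center + j] += weight * sample
    return out
```

If two consecutive frames carry the same sine wave with consistent phases, the weighted sum in their overlap region reproduces that sine wave exactly, because the two triangular weights add to one.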
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic block diagram of one embodiment of the invention in
which only the magnitude and frequencies of the components are used to
reconstruct a sampled waveform.
FIG. 2 is an illustration of the extracted amplitude and frequency
components of a waveform sampled according to the present invention.
FIG. 3 is a general illustration of the frequency matching method of the
present invention.
FIGS. 4A-4F are detailed schematic illustrations of a frequency matching
method according to the present invention.
FIG. 5 is an illustration of tracked frequency components of an exemplary
speech pattern.
FIG. 6 is a schematic block diagram of another embodiment of the invention
in which magnitude and phase of frequency components are used to
reconstruct a sampled waveform.
FIG. 7 is an illustrative set of cubic phase interpolation functions for
smoothing the phase functions useful in connection with the embodiment of
FIG. 6 from which the "maximally smooth" phase function is selected.
FIG. 8 is a schematic block diagram of another embodiment of the invention
particularly useful for time-scale modification.
FIG. 9 is a schematic block diagram showing an embodiment of the system
estimation function of FIG. 8.
FIG. 10 is a block diagram of one real-time implementation of the invention.
DETAILED DESCRIPTION
In the present invention the speech waveform is modelled as a sum of sine
waves. If s(n) represents the sampled speech waveform then
$$s(n) = \sum_i a_i(n) \sin[\phi_i(n)] \tag{1}$$
where $a_i(n)$ and $\phi_i(n)$ are time-varying amplitudes and
phases of the i'th tone.
In a simple embodiment the phase can be defined to be the integral of the
instantaneous frequency $f_i(n)$ and therefore satisfies the recursion
$$\phi_i(n) = \phi_i(n-1) + 2\pi f_i(n)/f_s \tag{2}$$
where $f_s$ is the sampling frequency. If the tones are harmonically
related, then
$$f_i(n) = i f_0(n) \tag{3}$$
where $f_0(n)$ represents the fundamental frequency at time n. One
particularly attractive property of the above model is the fact that phase
continuity, hence waveform continuity, is guaranteed as a consequence of
the definition of phase in terms of the instantaneous frequency. This
means that waveform reconstruction is possible from the "magnitude-only"
spectrum since a high-resolution spectral analysis reveals the amplitudes
and frequencies of the component sine waves.
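Equations (1) to (3) translate almost directly into code. The following is a minimal sketch, assuming an initial phase of zero for every tone (the text does not specify one) and per-sample amplitude and frequency arrays; the function names are hypothetical.

```python
import math

def synthesize(amps, freqs, fs):
    """Sum-of-sines synthesis following Equations (1) and (2).

    amps[i][n] and freqs[i][n] hold the amplitude and instantaneous
    frequency (in Hz) of tone i at sample n; fs is the sampling rate in Hz.
    """
    num_samples = len(amps[0]) if amps else 0
    out = [0.0] * num_samples
    for a_i, f_i in zip(amps, freqs):
        phase = 0.0  # assumed initial phase
        for n in range(num_samples):
            phase += 2.0 * math.pi * f_i[n] / fs  # Equation (2)
            out[n] += a_i[n] * math.sin(phase)    # Equation (1)
    return out

def harmonic_freqs(f0, num_harmonics):
    """Equation (3): tone i lies at i * f0(n), for i = 1..num_harmonics."""
    return [[i * f for f in f0] for i in range(1, num_harmonics + 1)]
```

Because each phase is accumulated sample by sample, the output has no phase jumps even when the frequency arrays change abruptly, which is the continuity property noted above.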
A block diagram of an analysis/synthesis system according to the invention
is illustrated in FIG. 1. As shown in FIG. 1, system 10 includes sampling
window 11, a discrete Fourier transform (DFT) analyzer 12, magnitude
computer 13, a frequency amplitude estimator 14, and an optional coder 16
in the transmitter segment and a frequency matching means 18, an
interpolator 20 and a sine wave generator 22 in the receiver segment of
the system. The peaks of the magnitude of the discrete Fourier transform
(DFT) of a windowed waveform are found simply by determining the locations
at which the slope changes from positive to negative (concave down). In addition, the total number of
peaks can be limited and this limit can be adapted to the expected average
pitch of the speaker.
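The slope-change test for peaks can be sketched in a few lines. The function name and the `max_peaks` parameter are assumptions; when the limit is exceeded this sketch keeps the largest peaks, which is one plausible way of limiting their number.

```python
def pick_peaks(magnitude, max_peaks=None):
    """Return the indices of local maxima of a magnitude spectrum.

    A peak is a bin where the slope changes from positive to non-positive.
    If max_peaks is given, only the largest peaks are kept, and the result
    is returned in order of increasing frequency.
    """
    peaks = []
    for k in range(1, len(magnitude) - 1):
        if magnitude[k - 1] < magnitude[k] >= magnitude[k + 1]:
            peaks.append(k)
    if max_peaks is not None and len(peaks) > max_peaks:
        largest = sorted(peaks, key=lambda k: magnitude[k], reverse=True)
        peaks = sorted(largest[:max_peaks])
    return peaks
```

The limit could be tied to the expected pitch, for instance by allowing roughly one peak per expected harmonic within the analysis bandwidth.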
In a simple embodiment the speech waveform can be digitized at a 10 kHz
sampling rate, low-pass filtered at 5 kHz, and analyzed at 20 msec frame
intervals with a 20 msec Hamming window. Speech representations according
to the invention can also be obtained by employing an analysis window of
variable duration. For some applications it is preferable to have the
width of the analysis window be pitch adaptive, being set, for example, at
2.5 times the average pitch period with a minimum width of 20 msec.
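As a small worked example of the pitch-adaptive rule, the window width can be computed and a Hamming window of that length generated as follows. The function names are hypothetical, and the standard Hamming coefficients 0.54 and 0.46 are assumed.

```python
import math

def window_length(avg_pitch_period_s, fs, factor=2.5, min_width_s=0.020):
    """Window length in samples: factor times the pitch period, at least min_width_s."""
    width_s = max(factor * avg_pitch_period_s, min_width_s)
    return int(round(width_s * fs))

def hamming(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]
```

At a 10 kHz sampling rate, an average pitch period of 10 msec gives a 250-sample window, while a 5 msec period falls back to the 20 msec minimum of 200 samples.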
Plotted in FIG. 2 is a typical periodogram for a frame of speech along with
the amplitudes and frequencies that are estimated using the above
procedure. The DFT was computed using a 512-point fast Fourier transform
(FFT). Different sets of these parameters will be obtained for each
analysis frame. To obtain a representation of the waveform over time,
frequency components measured on one frame must be matched with those that
are obtained on a successive frame.
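The per-frame spectral analysis might be sketched as follows: the frame is Hamming-windowed, zero-padded to 512 points, and transformed with a radix-2 FFT. This is an illustrative standard-library implementation with hypothetical function names; a practical system would use an optimized FFT routine.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def magnitude_spectrum(frame, fs, nfft=512):
    """Return (freqs_hz, magnitudes) for bins 0..nfft/2 of a windowed frame."""
    length = len(frame)
    if length < 2 or length > nfft:
        raise ValueError("frame length must be between 2 and nfft")
    # Hamming window, then zero-padding up to the FFT size.
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)))
                for n, s in enumerate(frame)]
    spectrum = fft(windowed + [0.0] * (nfft - length))
    half = nfft // 2
    freqs = [k * fs / nfft for k in range(half + 1)]
    mags = [abs(spectrum[k]) for k in range(half + 1)]
    return freqs, mags
```

At a 10 kHz sampling rate a 512-point transform gives a bin spacing of about 19.5 Hz, so a 200-sample (20 msec) frame is zero-padded by 312 samples before the transform.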
FIG. 3 illustrates the basic process of frequency component matching. If
the number of peaks were constant and slowly varying from frame to frame,
the problem of matching the parameters estimated on one frame with those
on a successive frame would simply require a frequency ordered assignment
of peaks. In practice, however, there will be spurious peaks that come and
go due to the effects of sidelobe interaction; the locations of the peaks
will change as the pitch changes; and there will be rapid changes in both
the location and the number of peaks corresponding to rapidly-varying
regions of speech, such as at voiced/unvoiced transitions. In order to
account for such rapid movements in the spectral peaks, the present
invention employs the concept of "birth" and "death" of sinusoidal
components as part of the matching process.
The matching process is further explained by consideration of FIG. 4.
Assume that peaks up to frame k have been matched and a new parameter set
for frame k+1 is generated. Let the chosen frequencies on frames k and k+1
be denoted by $\omega_0^k, \omega_1^k, \ldots, \omega_{N-1}^k$ and
$\omega_0^{k+1}, \omega_1^{k+1}, \ldots, \omega_{M-1}^{k+1}$ respectively,
where N and M represent the total number of peaks selected on each frame
($N \neq M$ in general). One process of matching each frequency in frame k,
$\omega_n^k$, to some frequency in frame k+1, $\omega_m^{k+1}$, is given in
the following three steps.
Step 1
Suppose that a match has been found for frequencies
$\omega_0^k, \omega_1^k, \ldots, \omega_{n-1}^k$. A match is now attempted
for frequency $\omega_n^k$. FIG. 4(a) depicts the case where all
frequencies $\omega_m^{k+1}$ in frame k+1 lie outside a "matching
interval" $\Delta$ of $\omega_n^k$, i.e.,
$$|\omega_n^k - \omega_m^{k+1}| \geq \Delta \tag{4}$$
for all m. In this case the frequency track associated with
$\omega_n^k$ is declared "dead" on entering frame k+1, and
$\omega_n^k$ is matched to itself in frame k+1, but with zero
amplitude. Frequency $\omega_n^k$ is then eliminated from further
consideration and Step 1 is repeated for the next frequency in the list,
$\omega_{n+1}^k$.
If, on the other hand, there exists a frequency $\omega_m^{k+1}$ in
frame k+1 that lies within the matching interval about
$\omega_n^k$, and is the closest such frequency, i.e.,
$$|\omega_n^k - \omega_m^{k+1}| < |\omega_n^k - \omega_i^{k+1}| < \Delta \tag{5}$$
for all $i \neq m$, then $\omega_m^{k+1}$ is declared to be a candidate
match to $\omega_n^k$. A definitive match is not yet made, since
there may exist a better match in frame k to the frequency
$\omega_m^{k+1}$, a contingency which is accounted for in Step 2.
Step 2
In this step, a candidate match from Step 1 is confirmed. Suppose that a
frequency $\omega_n^k$ of frame k has been tentatively matched to
frequency $\omega_m^{k+1}$ of frame k+1. Then, if
$\omega_m^{k+1}$ has no better match to the remaining unmatched frequencies
of frame k, the candidate match is declared to be a definitive match.
This condition, illustrated in FIG. 4(c), is given by
$$|\omega_m^{k+1} - \omega_n^k| < |\omega_m^{k+1} - \omega_{i+1}^k| \quad \text{for } i \geq n \tag{6}$$
where the first bracketed value in Equation 6 is illustrated as
$\sigma_2$ in FIG. 4 and the second bracketed value of Equation 6 is
illustrated as $\sigma_1$. When this occurs, frequencies
$\omega_n^k$ and $\omega_m^{k+1}$ are eliminated from further
consideration and Step 1 is repeated for the next frequency in the list,
$\omega_{n+1}^k$.
If the condition (6) is not satisfied, then the frequency
$\omega_m^{k+1}$ in frame k+1 is better matched to the frequency
$\omega_{n+1}^k$ in frame k than it is to the test frequency
$\omega_n^k$. Two additional cases are then considered. In the
first case, illustrated in FIG. 4(d), the adjacent remaining lower
frequency $\omega_{m-1}^{k+1}$ (if one exists) lies below the matching
interval, hence no match can be made. As a result, the frequency track
associated with $\omega_n^k$ is declared "dead" on entering frame
k+1, and $\omega_n^k$ is matched to itself with zero amplitude. In
the second case, illustrated in FIG. 4(e), the frequency
$\omega_{m-1}^{k+1}$ is within the matching interval about
$\omega_n^k$ and a definitive match is made. After either case Step
1 is repeated using the next frequency in the frame k list,
$\omega_{n+1}^k$. It should be noted that many other situations are
possible in this step, but to keep the tracker alternatives as simple as
possible only the two cases are discussed.
Step 3
When all frequencies of frame k have been tested and assigned to continuing
tracks or to dying tracks, there may remain frequencies in frame k+1 for
which no matches have been made. If $\omega_m^{k+1}$ is one
such frequency, then it is concluded that $\omega_m^{k+1}$ was "born"
in frame k and its match, a new frequency, $\omega_m^{k+1}$, is
created in frame k with zero magnitude. This is done for all such
unmatched frequencies. This last step is illustrated in FIG. 4(f).
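Steps 1 to 3 can be sketched as a single pass over the frame k frequencies. The sketch below is a simplified reading of the procedure, not a definitive implementation: it assumes both frequency lists are sorted in increasing order, handles only the two fallback cases discussed in Step 2, and uses hypothetical names throughout.

```python
def match_frames(prev, cur, delta):
    """Match frequencies of frame k (prev) to those of frame k+1 (cur).

    Returns (matches, deaths, births): matches is a list of (n, m) index
    pairs, deaths lists indices n in prev whose tracks end, and births
    lists indices m in cur that start new tracks.
    """
    matches, deaths = [], []
    used = set()  # indices of cur already matched
    for n, w in enumerate(prev):
        free = [m for m in range(len(cur)) if m not in used]
        if not free:
            deaths.append(n)
            continue
        # Step 1: nearest remaining frequency must lie inside the interval.
        m = min(free, key=lambda j: abs(cur[j] - w))
        dist = abs(cur[m] - w)
        if dist >= delta:
            deaths.append(n)
            continue
        # Step 2: confirm unless a later frame-k frequency is at least as close.
        later = range(n + 1, len(prev))
        if all(dist < abs(cur[m] - prev[i]) for i in later):
            matches.append((n, m))
            used.add(m)
            continue
        # Fallback: try the adjacent remaining lower frequency in frame k+1.
        lower = [j for j in free if cur[j] < cur[m]]
        if lower and abs(cur[lower[-1]] - w) < delta:
            matches.append((n, lower[-1]))
            used.add(lower[-1])
        else:
            deaths.append(n)
    # Step 3: anything left unmatched in frame k+1 is a birth.
    births = [m for m in range(len(cur)) if m not in used]
    return matches, deaths, births
```

For instance, with `prev = [100, 200, 300]`, `cur = [105, 290, 400]` and `delta = 30`, the track at 200 dies, the peak at 400 is born, and the other two tracks continue. During synthesis a dead track would be matched to itself with zero amplitude in frame k+1, and a born track would be given a zero-magnitude partner in frame k.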
The results of applying the tracker to a segment of real speech are shown in
FIG. 5, which demonstrates the ability of the tracker to adapt quickly
through transitory speech behavior such as voiced/unvoiced transitions,
and mixed voiced/unvoiced regions.
In the simple "magnitude-only" system, synthesis is accomplished in a
straightforward manner. Each pair of matched frequencies (and their
corresponding magnitudes) is linearly interpolated across consecutive
frame boundaries. As noted above, in the magnitude-only system, phase
continuity is guaranteed by the definition of phase in terms of the
instantaneous frequency. The interpolated values are then used to drive a
sine wave generator which yields the synthetic waveform as shown in FIG.
1. It should be noted that