Description
FIELD OF THE INVENTION
The present invention is directed to the manipulation of sounds and other
one-dimensional signals, and more particularly to the morphing of two
audio signals to generate a new sound having characteristics between those
of the original sounds.
BACKGROUND OF THE INVENTION
The manipulation of a sound, to produce a different sound, has
applicability to a number of different fields. For example, in musical
applications the transformation of one audio signal into another audio
signal can be used to produce new sounds with synthesizers and the like.
In the movie industry, the transformation of one sound into another sound,
such as changing a speaker's voice to sound like the voice of a different
person, can be used to create special effects. In a similar fashion, a
person's voice can be manipulated so that it is disguised, for security
purposes.
Different types of sound manipulation are employed for these various
purposes. A first type of sound modification involves the mixing of two or
more sounds. This type of modification might be employed in a musical
environment, for example, to provide equalization or reverberation. These
effects are achieved by passing the sounds through simple filters whose
operation is independent of the actual data being filtered.
A second type of sound modification is based upon data-dependent filtering.
For example, the pitch of a sound can be increased or decreased by a
predetermined percentage to disguise a person's voice.
A third type of manipulation, which is more heavily data-dependent, is
known as voice transformation. In this type of manipulation, an acoustic
feature of speech, such as its spectral profile or average pitch, is
analyzed to represent it as a sequence of numbers, and then modified from
the original speaker's voice, typically in accordance with the statistical
properties of a target voice. For example, histogram mapping might be
employed to transform the speaker's pitch to that of the target voice.
Each time a particular sound is spoken, its formant frequencies are
changed so they are similar to those of the target speaker. When the sound
is resynthesized with the new acoustical parameters, the target voice
results. Further information relating to this type of sound manipulation
is described in U.S. Pat. No. 5,327,521, as well as in Savic et al, "Voice
Personality Transformation", Digital Signal Processing 1, Academic Press,
Inc., 1991, pp. 107-110; and Valbret et al, "Voice Transformation Using
PSOLA Technique", Speech Communication 11, Elsevier Science Publishers,
1992, pp. 175-187.
A fourth type of audio manipulation, and the one to which the present
invention is directed, is known as audio morphing. Audio morphing differs
from sound filtering, from the standpoint that two or more sounds are used
as inputs to create a single sound having characteristics of each of the
original sounds. Audio morphing also differs from voice transformation by
virtue of the fact that the resulting sound is a smooth warp and blend of
two or more original sounds. The morphed sounds share some of the
properties of the original sounds.
Generally speaking, morphing is the process of changing one physical
sensation smoothly into another. Its most prevalent use today is in the
visual domain. In this context, the two images are warped, and then cross
fades are implemented so that one image blends smoothly into the other.
Typically, the beginning and ending images are static, i.e., they do not
change with time as the morphing process is carried out.
Audio morphing involves the process of generating sounds that lie between
two source sounds. For example, in a series of steps the sound of a human
scream might morph into the sound of a siren. Unlike images, sounds are
not static. The amplitude of a sound at any given time, by itself, does
not present meaningful information. Rather, it must be considered over a
period of time. Thus, audio morphing is more complex, because it must take
into consideration the time course of a sound during the morphed sequence.
In the past, audio morphing has been carried out by using a sinusoidal
analysis of the sounds used to create the morph. See, for example, Tellman
et al, "Timbre Morphing of Sounds with Unequal Numbers of Features", Jour.
of Audio Eng. Soc., Vol. 43, No. 9, September 1995. In sinusoidal
analysis, a sound is broken down into a number of discrete sinusoids. A
morph is generated by changing the amplitude and frequency of the
sinusoids. This technique only has applicability to harmonic sounds, such
as those from musical instruments. It cannot be used to morph other types
of sounds, such as noise or speech that includes fricatives, i.e.
inharmonic sounds, as exemplified by the consonant "c" in the word
"corner."
Another limitation associated with morphing based upon sinusoidal analysis
is that it does not readily lend itself to automation to correctly label
individual sinusoids in the two original sounds and match them to one
another. Often, there is a significant amount of manual tuning that is
required, to identify the discrete sinusoids that result in the best
sound.
An important requirement, and the source of difficulty in any type of
morph, is preserving the perception of objects. Except for fortuitous
circumstances, simply cross-fading two pictures of faces will give an
image that looks like two faces. The perception that one is looking at a
single object is lost because features (such as ear lobes) are duplicated.
Likewise in audio, a morph should preserve the perception that the result
has the same number of auditory objects as the original. Many of the
properties that cause sounds to be perceived as one object are described
in Bregman, "Auditory Scene Analysis", MIT Press. An audio morph should
preserve these properties.
It is desirable, therefore, to provide a technique for morphing any given
sound into any other sound, which is not limited to specific types of
sounds, such as harmonic sounds. It is further desirable to provide such a
technique which readily lends itself to automation, and thereby reduces
the manual effort required to produce a morphed sound.
BRIEF STATEMENT OF THE INVENTION
In accordance with the present invention, these objectives are achieved by
a sound morphing process that is based on the fact that the different
dimensions of sounds can be separated and individually operated upon. A
sound morphing process in accordance with the present invention is
comprised of a series of basic steps. As a first step, each sound which
forms the basis for the morph is converted into multiple representations
that encode different features of the sound and quantitatively depict one
or more salient features of the sounds. In a preferred embodiment of the
invention, the multiple representations are independent of one another.
After the representations have been obtained, the temporal axes of the two
sounds are matched, so that similar components of the two sounds, such as
onsets, harmonic regions and inharmonic regions, are aligned with one
another. After the temporal matching, other relevant characteristics of
the sounds, such as pitch, are also matched for each corresponding instant
of time in the two sounds. Once the energy in each of the sounds has been
accounted for and matched to that of the other sound, the two sounds can
be warped and cross-faded, to produce a representation of the morphed
sound, such as a new spectrogram. The interpolated representation is then
inverted, to generate the morphed sound.
By using a spectrogram or other dense representation of a sound, the
morphing process is not limited to harmonic sounds. Rather, any sound
which is capable of being represented can form the basis for an audio
morph. The particular representations that are chosen will be dependent
upon the characteristics of the sound that are important. The primary
criterion is that the representation be perceptually relevant, i.e. it
relates to some dimension of the sound which is detectable to the human
ear, and allows the sound to be smoothly interpolated along that
dimension. Using such representations, any two or more sounds can be
matched to one another to produce a morph.
Another advantage of the morphing process of the present invention is that
it can be easily automated. For example, the temporal warping of two
representations of a sound, to match them to one another, can be computed
using known techniques, such as dynamic time warping that produces the
lowest mean-squared-difference. Similarly, other components of the sound
can be automatically matched with one another, for example, by applying
dynamic time warping between two spectral frames.
Further features of the invention, and the advantages provided thereby, are
explained in greater detail hereinafter with reference to exemplary
embodiments illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram illustrating the overall process for morphing two
sounds in accordance with the present invention;
FIG. 2 is a more detailed block diagram of an embodiment of the invention
for morphing speech;
FIG. 3 is an illustration of the audio correspondence between two sounds;
FIG. 4 is a diagram of the procedure to warp and interpolate two signals;
FIGS. 5A and 5B are illustrations of a continuous morph and a
cyclostationary morph, respectively;
FIG. 6 is a spectrogram illustrating a morph in which the pitch of a spoken
vowel changes; and
FIG. 7 is an illustration of a sequence of spectrograms in a
cyclostationary morph.
DETAILED DESCRIPTION
Generally speaking, morphing is the process of generating a range of
sensations that move smoothly from one arbitrary entity to another. For
example, a video morph consists of a series of images which successively
show one object smoothly changing its shape and texture until it becomes
another object. The same objectives are desirable for an audio morph. A
sound that is perceived as coming from one object should smoothly change
into another sound, maintaining the shared properties of the starting and
ending sounds while smoothly changing other properties.
In the following discussion, the invention is described with
reference to its implementation in the morphing of two or more sounds. It
will be appreciated, however, that the principles of the invention are not
limited to sound signals. Rather, they are applicable to any type of
one-dimensional waveform.
In the context of the present invention, two different types of audio
morphing can be produced. One type of morph is temporally based. In this
situation, a sound is considered as a point in a multi-dimensional space.
The dimensions of this space can include the spectral shape, pitch, rhythm
and other perceptually relevant auditory dimensions. A morph is obtained
by defining a path between two sounds represented at two points in the
space. This type of morph is analogous to image morphing. For example, a
steady state clarinet tone might morph into the sound of an oboe or into a
singer's voice.
In the second type of morph, a sequence of individual sounds is generated
which smoothly change from one to another. For example, the spoken word
"corner" can change into the word "morning" in a sequence of small steps.
Each individual step represents a small difference from the previous word,
and in the middle of the sequence the word sounds like a cross between
"corner" and "morning." This type of morph is referred to as a
cyclostationary morph. It is cyclic because a sound is played repetitively
to transition from one word to the other. It is also stationary since each
sound instance is a static example of one of the in-between sounds in the
sequence.
Different variations of this second type of morph are possible. For
example, rather than generating a sequence of sounds that transition from
one word to another, the desired output may be just one of the
intermediate sounds. Alternatively, a sound can be produced that is a
mixture of different components of the original sounds. For example, the
output sound might utilize the pitch from one word, the timing from a
second word, and the spectral resonances from a third word.
The morphing of one sound into another, in accordance with one embodiment
of the present invention, is schematically illustrated in the block
diagram of FIG. 1. A brief description of the overall process is first
presented, and followed by a more detailed discussion of individual
aspects of the process. This particular embodiment relates to the morphing
of speech. It will be appreciated, however, that this example is for
illustrative purposes. The principles which underlie the invention are
equally applicable to music and other types of sound as well.
Referring to FIG. 1, two input sounds provide the basis from which the
morphed sound is produced. In practice, more than two sounds can be used
to provide the original input data. For purposes of the present
explanation, a two-sound example will be described. As a first step,
various representations 10 of each sound are generated. For example, the
representations might be two or more different kinds of spectrograms for
each sound. Corresponding representations of the two sounds are then
temporally matched, such as by means of a dynamic time warping process 12.
In this step, similar components of each sound, such as the onset or
attack portion, harmonic and inharmonic regions, and a decay region, are
temporally aligned with one another. After the temporal alignment, other
relevant features of the two sounds undergo a matching process 14. For
example, if the sounds contain harmonic components, the pitches of the two
sounds can be matched. The matching of the two sounds results in a dense
mapping of corresponding elements of the sounds to one another, for each
of the dimensions of interest.
After all of the relevant energy components in the two sound signals have
been matched, the sounds undergo warping, interpolation and cross fading
16. For example, if a morph from Sound 1 to Sound 2 is to take place in
five steps, the first interpolation of the sound in the sequence comprises
100% of Sound 1 and 0% of Sound 2. The second interpolated sound of the
sequence is comprised of 75% of Sound 1's components and 25% of Sound 2's
components. Successive interpolation steps comprise greater proportions of
Sound 2, until the final step is comprised entirely of Sound 2. For each
step in the sequence, the interpolation determines the appropriate
percentage of each of the two components to combine with one another.
These combined components form a new representation of the morphed sound,
e.g., a new spectrogram. This representation can then be inverted, at 18,
to generate the actual morphed sound for that step in the sequence. By
successively reproducing each of the sounds in the sequence, a smooth
transition from Sound 1 to Sound 2 can be heard.
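The five-step example above can be sketched in a few lines. This is only an illustration that assumes the proportions change linearly from step to step, which agrees with the 100%/0% and 75%/25% figures given in the text; the function names are hypothetical.

```python
def crossfade_weights(num_steps):
    """Return (weight_sound1, weight_sound2) for each step of a linear morph."""
    if num_steps < 2:
        raise ValueError("a morph needs at least two steps")
    weights = []
    for step in range(num_steps):
        lam = step / (num_steps - 1)  # 0.0 at Sound 1, 1.0 at Sound 2
        weights.append((1.0 - lam, lam))
    return weights


def crossfade(frame1, frame2, lam):
    """Blend two already-aligned frames (equal-length sequences of numbers)."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(frame1, frame2)]
```

Calling crossfade_weights(5) yields the sequence of proportions for Sound 1 and Sound 2 at each of the five steps. The blend is only meaningful after the matching and warping steps have lined up the corresponding components.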
The calculation of the representation 10 transforms the sound from a simple
waveform into a multi-dimensional representation that can be warped, or
modified, to produce a desired result. To be useful, the representation of
the sound must be one that is invertible, i.e. after one or more of its
parameters are modified, the result can be used to generate an audible
sound. The particular representation that is employed should preserve all
relevant dimensions of the sound. For example, in harmonic sounds pitch is
an important characteristic. Thus, for the morphing of harmonic sounds, a
representation which preserves the pitch information should be employed.
Examples of suitable representations for harmonic sound include
spectrograms, such as the short-term Fourier transform, as well as
cochleagrams and correlograms.
Inharmonic sounds, such as noise and spoken fricatives, do not have a pitch
component. Similarly, if a spoken word is whispered, its pitch is not
significant. Consequently, other types of representation may be more
appropriate for these types of sounds. For example, linear predictive
coding (LPC) coefficients might be used to represent the broad spectral
characteristics of an inharmonic sound.
Sinusoidal analysis is often accomplished by analyzing a sound with a
wide-band spectrogram. Individual sinusoids are displayed as peaks or
lines in the spectrogram. A sinusoidal analysis of the sound uses the
locations of the individual peaks or lines in the spectrum to model the
entire sound. This approach uses a sparse representation of the sound
since some sort of threshold is employed to pick the discrete sinusoids
that are used. This enforces a model on the signal, whether it fits or
not. In contrast, a spectrogram preserves the level of all components of
the sound; the representation is dense and continuous as a function of
frequency. In a dense representation, the entire spectrum is preserved,
not just the peaks.
Preferably, a multi-dimensional dense representation of sounds is employed,
where each dimension is independent and salient to the perceived result.
In the case of speech, two relevant dimensions of a sound are its pitch
and its broad spectral shape, i.e. its formant frequencies. These two
dimensions roughly correspond to the rate at which the human glottis
produces air pulses during speech (pitch) and the filtering of these
pulses that is carried out by the mouth and nasal passages (formants). As
discussed previously, another relevant dimension of sounds is their
timing.
FIG. 2 illustrates one embodiment of the invention in which each of these
three dimensions can be separately represented to generate a morph. At the
outset, a conventional narrow-band spectrogram of a sound is obtained by
processing it through a Fast Fourier Transform 20. The Fast Fourier
Transform provides a quantitative analysis of the sound in terms of its
frequency content. The spectrogram of the sound is then further analyzed
to determine its mel-frequency cepstral coefficients (MFCC) 22. For a
description of the procedure for calculating an MFCC representation, see
Hunt et al., "Experiments in Syllable-based Recognition of Continuous
Speech", Proceedings of the 1980 ICASSP, Denver, Colo., pp. 880-883, the
disclosure of which is incorporated herein by reference. Briefly, the MFCC
for a sound is computed by resampling the magnitude spectrum to match
critical bands that are related to auditory perception. This is carried
out by combining channels of the spectrogram to produce a filter bank
which approximates the auditory characteristics of the human ear. The
filter bank produces a number of output signals, e.g. forty signals, which
are compressed using a logarithm and undergo a discrete cosine transform
to rearrange the data values. A predetermined number of the lowest
frequency components, e.g. the thirteen lowest filter coefficients, are
then selected. These coefficients define a space where the Euclidean
distance between vectors provides a good measure of how close two sounds
are. Hence, they can be used to find a temporal match between two sounds,
as described in detail hereinafter.
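The MFCC computation for a single spectrogram frame might be sketched as follows. This is a minimal illustration, not the procedure of the cited reference: the mel-scale formula, the triangular filter shapes and the function names are assumptions, while the forty filters and thirteen retained coefficients follow the example figures given above.

```python
import numpy as np


def hz_to_mel(hz):
    # A common mel-scale convention (assumed here, not taken from the text).
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, num_bins, sample_rate):
    """Triangular filters over the bins of a one-sided magnitude spectrum."""
    bin_hz = np.linspace(0.0, sample_rate / 2.0, num_bins)
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges = mel_to_hz(np.linspace(0.0, top_mel, num_filters + 2))
    bank = np.zeros((num_filters, num_bins))
    for i in range(num_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def dct_matrix(num_coeffs, num_inputs):
    """Unnormalized DCT-II basis, one row per output coefficient."""
    n = np.arange(num_inputs)
    k = np.arange(num_coeffs)[:, None]
    return np.cos(np.pi * k * (n + 0.5) / num_inputs)


def mfcc_frame(magnitude, sample_rate, num_filters=40, num_coeffs=13):
    """Combine channels, compress with a logarithm, keep the lowest DCT terms."""
    bank = mel_filterbank(num_filters, magnitude.shape[0], sample_rate)
    energies = bank @ magnitude
    log_energies = np.log(energies + 1e-10)  # small floor avoids log(0)
    return dct_matrix(num_coeffs, num_filters) @ log_energies
```

Applying mfcc_frame to every column of the spectrogram gives one thirteen-element vector per frame, and the Euclidean distance between such vectors can then serve as the distance measure for temporal matching.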
Since the MFCC is a low dimensional representation of the sound, it can be
used to compute its broad spectral shape. To this end, the MFCC is
inverted at 24 by applying the inverse of the cosine transform, to provide
a smooth estimate of the filter bank output that was used to compute the
MFCC. After undoing the logarithm, this smooth estimate is then
reinterpolated, for example by means of an inverse Bark scale, to yield a
new spectrogram. This spectrogram corresponds to the original spectrogram,
without the high spatial-frequency variations due to pitch. In the context
of the present invention, this spectrogram is referred to as a "smooth
spectrogram", and provides a representation of the frequency formants in
the original sound.
Other types of processing, such as homomorphic filtering or LPC, can be
used to calculate a smooth spectrogram. However, MFCC processing is
preferred for many speech recognizers and is easier to apply to different
sounds such as music.
Furthermore, the smooth spectrogram can be used to obtain a representation
of the pitch information in a sound. More particularly, a conventional
spectrogram encodes all of the information in a sound signal, and the
smooth spectrogram describes the sound's overall spectral shape. The
conventional spectrogram is divided by the smooth spectrogram at 26, to
produce a residual spectrogram that contains the pitch and voicing
information in a sound. In the context of the present invention, the
residual spectrogram is referred to as a "pitch spectrogram."
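The split of one spectrogram frame into a smooth part and a residual part can be sketched as below. For brevity this sketch does not invert an MFCC through a filter bank; it assumes a plain cepstral (homomorphic) smoothing on a linear frequency axis, which the text mentions as an alternative way to obtain a smooth spectrogram. The function names and the choice of thirteen coefficients are illustrative.

```python
import numpy as np


def dct_basis(num_points):
    """Orthonormal DCT-II basis; its transpose is the inverse transform."""
    n = np.arange(num_points)
    k = n[:, None]
    basis = np.cos(np.pi * k * (n + 0.5) / num_points) * np.sqrt(2.0 / num_points)
    basis[0] /= np.sqrt(2.0)
    return basis


def smooth_and_pitch(frame, num_coeffs=13, floor=1e-10):
    """Split one magnitude-spectrum frame into smooth and residual parts."""
    basis = dct_basis(frame.shape[0])
    cepstrum = basis @ np.log(frame + floor)
    cepstrum[num_coeffs:] = 0.0          # drop fast ripples along frequency
    smooth = np.exp(basis.T @ cepstrum)  # undo the transform and the logarithm
    pitch = frame / smooth               # residual keeps the harmonic detail
    return smooth, pitch
```

Because the residual is obtained by division, multiplying the two parts together again recovers the original frame, which is what allows the two dimensions to be modified separately and then recombined.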
In the embodiment of FIG. 2, three representations are derived for each
sound, namely the MFCC transform which is used for temporal matching, the
smooth spectrogram which provides formant information, and the pitch
spectrogram which provides pitch and voicing information. In the
illustration of FIG. 2, the individual steps for obtaining these
representations are shown with respect to one sound. It will be
appreciated that similar processing is carried out to provide
representations for a second sound, which forms another component of the
audio morph. The corresponding representations of the two sounds are then
matched to one another at 28-32.
Temporal matching of sounds at 28 (FIG. 2) is desirable since, over the
course of a morph, features which are common to both sounds should be
matched and remain relatively fixed in time. Referring to FIG. 3, an
example of the temporal correspondence between two sounds is illustrated.
In the figure, a spectrogram for one sound, e.g. a beginning sound, is
shown at the bottom of the figure, and the spectrogram for an ending sound
is shown above and to the left of the spectrogram for the beginning sound.
In the spectrogram for the beginning sound, time is represented along the
horizontal axis, and frequency is depicted on the vertical axis. To
illustrate the temporal matching of the two sounds, the spectrogram for
the ending sound is rotated counter-clockwise 90° relative to the
spectrogram for the beginning sound.
In the preferred embodiment of the invention, dynamic time warping is
employed to find the best temporal match between two sounds, using the
distance metric provided by the MFCC transforms of the sounds. For
detailed information regarding dynamic time warping, reference is made to
Deller et al, "Dynamic Time Warping", Discrete-time Processing of Speech
Signals, New York, Macmillan Pub. Co., 1993, pp. 623-676, the disclosure
of which is incorporated herein by reference. The result of the dynamic
time warping process is to provide control points in time which identify
the frames of one sound that line up with those of the other sound. The
correspondence of the frames provides an indication of the amount by which
each segment of a sound must be temporally compressed or expanded to match
it to the corresponding features in the other sound.
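A textbook form of dynamic time warping over two sequences of feature vectors can be sketched as follows. It is a minimal illustration, not the specific formulation of the cited reference: the squared Euclidean frame distance and the three permitted step directions are assumptions, and dtw_path is a hypothetical name.

```python
import numpy as np


def dtw_path(features1, features2):
    """Align two sequences of feature vectors; return matched frame index pairs."""
    n, m = len(features1), len(features2)
    # Squared Euclidean distance between every pair of frames.
    diff = features1[:, None, :] - features2[None, :, :]
    cost = np.sum(diff ** 2, axis=2)

    # total[i, j]: cheapest cost of aligning the first i and first j frames.
    total = np.full((n + 1, m + 1), np.inf)
    total[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(total[i - 1, j - 1], total[i - 1, j], total[i, j - 1])
            total[i, j] = cost[i - 1, j - 1] + best_prev

    # Trace back from the end to recover the alignment.
    path = [(n - 1, m - 1)]
    i, j = n, m
    while (i, j) != (1, 1):
        moves = [(total[i - 1, j - 1], i - 1, j - 1),
                 (total[i - 1, j], i - 1, j),
                 (total[i, j - 1], i, j - 1)]
        _, i, j = min(moves)
        path.append((i - 1, j - 1))
    return path[::-1]
```

Each returned pair links a frame of the first sound to a frame of the second. Runs in which one index repeats while the other advances mark the segments that must be stretched or compressed in time.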
Once the two sounds have been aligned temporally at 28, they can be matched
at each corresponding time instant. For each pair of corresponding frames,
the relevant acoustical features that are indicated by the representations
of the two sounds need to be matched. For example, in the pitch
spectrogram, the pitch information in the sound is visible as a series of
peaks. The spacing of the peaks is proportional to the pitch. The matching
of the pitch data for two sounds at 30 essentially involves expanding or
compressing the pitch spectrograms to align the harmonic peaks. For any
given instant in time, the pitch of one sound can be represented as p1,
and the pitch of the other sound at the corresponding time is p2. For the
best match, the frequency axis of the second sound's pitch spectrogram
must be compressed by p1/p2. If p1 is larger than p2, the frequency axis
of the pitch spectrogram for the second sound is actually stretched. When
this process is carried out, the result is a dense match linking a
frequency f1 in the first pitch spectrogram and a corresponding
frequency f2 = (p2/p1) * f1 in the second pitch spectrogram.
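The frequency-axis match can be illustrated by resampling one frame of the second sound's pitch spectrogram so that its harmonic peaks land on those of the first. This is a sketch under simple assumptions: frequencies are expressed in bin indices, linear interpolation is used between bins, and warp_to_pitch is a hypothetical name.

```python
import numpy as np


def warp_to_pitch(pitch_frame2, p1, p2):
    """Resample the second sound's pitch-spectrogram frame onto the first's axis."""
    bins = np.arange(pitch_frame2.shape[0], dtype=float)
    # Bin f1 of the first sound corresponds to bin f2 = (p2 / p1) * f1 of the second.
    source_bins = (p2 / p1) * bins
    # np.interp holds the end value for positions beyond the last bin.
    return np.interp(source_bins, bins, pitch_frame2)
```

With p1 larger than p2 the second sound's peaks move outward, i.e. its frequency axis is stretched, which agrees with the description above.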
Some sounds contain both harmonic and inharmonic components. For example, a
spoken word may include both voiced and unvoiced sounds. An example of an
unvoiced sound is the consonant "c" in the word "corner". The unvoiced
components of the word do not contain pitch information. However, the
voiced, or harmonic, components have a pitch, which should be matched to
the pitch of another sound to form the morph. Another difficulty arises
when parts of a sound are only partially voiced. To ensure that the pitch
of the morphed sound is consistent and smoothly changing, an assumption is
made during the matching process that a pitch exists throughout the
duration of each of the sounds which forms the basis for the morph. Using
this assumption, a smoothly varying curve is estimated for pitch
throughout the entire sound, including the inharmonic regions where it is
normally absent. In a preferred implementation of the invention, a dynamic
programming technique can be used to calculate a smooth pitch function for
the duration of a sound. A suitable dynamic programming technique for
pitch tracking is disclosed, for example, in Secrest et al, "An
Integrated Pitch Tracking Algorithm for Speech Systems", Proceedings of
1983 ICASSP, Boston, Mass., vol. 3, pp. 1352-1355, 1983. In particular,
one implementation combines a clipped autocorrelation, as described in
Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall,
1978, p. 154, with the energy minimization technique described in Amini et
al, "Using Dynamic Programming for Solving Variational Problems in
Vision," IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 12, No. 9, September 1990, pp. 855-867. The pitch functions that are
calculated for respective sounds with such a technique can then be matched
to one another, as described previously.
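The idea of choosing a smooth pitch curve by dynamic programming can be sketched generically. The code below is not the algorithm of any of the cited references; it assumes that each frame already comes with a few candidate pitch values and a score for each (for instance, autocorrelation peak heights), and it simply trades those scores against the size of frame-to-frame pitch jumps. All names and the jump_penalty parameter are hypothetical.

```python
import numpy as np


def smooth_pitch_track(candidates, scores, jump_penalty=1.0):
    """Pick one pitch candidate per frame, trading local score against smoothness.

    candidates, scores: arrays of shape (num_frames, num_candidates).
    """
    num_frames, num_cands = candidates.shape
    cost = np.zeros((num_frames, num_cands))
    back = np.zeros((num_frames, num_cands), dtype=int)
    cost[0] = -scores[0]
    for t in range(1, num_frames):
        # jump[k, j]: penalty for moving from candidate j at t-1 to k at t.
        step = np.abs(candidates[t][:, None] - candidates[t - 1][None, :])
        through = cost[t - 1][None, :] + jump_penalty * step
        back[t] = np.argmin(through, axis=1)
        cost[t] = through[np.arange(num_cands), back[t]] - scores[t]
    # Follow the back-pointers from the cheapest final candidate.
    choice = int(np.argmin(cost[-1]))
    track = np.empty(num_frames)
    for t in range(num_frames - 1, -1, -1):
        track[t] = candidates[t, choice]
        choice = back[t, choice]
    return track
```

In frames where no candidate scores well, such as unvoiced regions, the jump penalty dominates and the track is carried through from the neighboring voiced frames, giving a pitch value for the whole duration of the sound.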
Once all of the relevant energy in each sound has been accounted for and
matched, the corresponding portions of the two sounds can be warped and
cross-faded to produce a representation for a new sound. Warping in both
the time and frequency dimensions lines up corresponding features in the
two sounds. A morph includes some type of interpolation or cross-fading
step. Scalar dimensions are easiest to morph. If one component of a sound
description is loudness, then the loudness of the morph should change
smoothly from the loudness of the first sound to the loudness of the
second. The same holds true for a scalar quantity like pitch. However,
acoustic information is not always scalar. Interpolations of temporal
information, smooth spectrograms, and pitch spectrograms present a more
complex problem, because they are based upon a dense match between pairs
of one-dimensional curves.
Audio morphing is simpler than image morphing because each dimension can be
considered independently. An important step in audio morphing is to warp
and interpolate two one-dimensional signals. The one-dimensional signals
might be cepstral coefficients over time as used to match the temporal
aspects of a sound, or spectral amplitudes over frequency when morphing
spectrogram slices. In each case, one-dimensional morphing involves a
determination of a dense set of matches. For each point in the output
signal, the best two points in the original waveforms are determined.
These points are then warped and interpolated to give the value of the
morphed signal. The process is the same whether the signal is scalar or a
vector value.
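The warp-and-interpolate step for two one-dimensional signals can be sketched as follows. The sketch assumes the dense match has been reduced to strictly increasing lists of corresponding sample positions, that positions and values are both interpolated linearly, and that a single parameter lam runs from 0 (first signal) to 1 (second signal); the function and argument names are hypothetical.

```python
import numpy as np


def morph_1d(s1, s2, match1, match2, lam, num_out):
    """Warp and cross-fade two 1-D signals given matched control points.

    match1[k] in s1 corresponds to match2[k] in s2 (both strictly increasing,
    measured in samples).
    """
    match1 = np.asarray(match1, dtype=float)
    match2 = np.asarray(match2, dtype=float)
    # Each matched feature lands between its two original positions.
    match_out = (1.0 - lam) * match1 + lam * match2
    x_out = np.linspace(match_out[0], match_out[-1], num_out)
    # For every output point, find the corresponding point in each source.
    x1 = np.interp(x_out, match_out, match1)
    x2 = np.interp(x_out, match_out, match2)
    v1 = np.interp(x1, np.arange(len(s1)), s1)
    v2 = np.interp(x2, np.arange(len(s2)), s2)
    # Cross-fade the two warped values.
    return x_out, (1.0 - lam) * v1 + lam * v2
```

If s1 has a peak at sample 2 and s2 has the matching peak at sample 6, a morph with lam equal to 0.5 produces a single peak at sample 4, whereas cross-fading without warping would produce two half-height peaks.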
With reference to FIG. 4, the data to be morphed is described as s1(t) and
s2(t). These two curves might represent slices of smooth spectrograms, for
example. The objective of the morph is to find a new curve s(lambda,t)