Claims
What is claimed is:
1. For use with a costume depicting a character having a defined voice with
a pre-established voice characteristic, a voice transformation system
comprising:
a microphone that is positionable to receive and transduce speech that is
spoken by a person wearing the costume into a source speech signal;
a mask that is positionable to cover the mouth of the person wearing the
costume to muffle the speech of the person wearing the costume to tend to
prevent communication of the speech beyond the costume, the mask enabling
placement of the microphone between the mouth and the mask;
a speaker disposed on or within the costume to broadcast acoustic waves
carrying speech in the defined voice of the character depicted by the
costume; and
a voice transformation device coupled to receive the signal from the
microphone representing source speech spoken by a person wearing the
costume, the voice transformation device transforming the received source
speech signal to a target speech signal representing the utterances of the
source speech signals in the defined voice of the character depicted by
the costume;
wherein the voice transformation device stores a plurality of
representations of the defined voice and transforms the voice of the
person wearing the costume into the same defined voice of the character
depicted by the costume, based upon association of the voice of the
particular person with particular ones of the stored representations.
2. A voice transformation system according to claim 1, wherein the voice
transformation device includes:
a processing subsystem segmenting and windowing the received source speech
signal to generate a sequence of preprocessed speech signal segments;
an analysis subsystem processing the received preprocessed speech signal
segments to generate for each segment a pitch signal indicating a dominant
pitch of the segment, a frequency domain vector representing a smoothed
frequency characteristic of the segment and an excitation signal
representing excitation characteristics of the segment;
a transformation subsystem storing target frequency domain vectors that are
representative of the target speech, substituting a corresponding target
frequency domain vector for the frequency domain vector derived by the
analysis subsystem, adjusting the pitch of the target excitation spectrum
in response to the pitch signal derived by the analysis subsystem, and
convolving the substituted target frequency domain vector with the
adjusted excitation spectrum to produce a segmented frequency domain
representation of the target voice; and
a post processing subsystem performing an inverse Fourier transform and an
inverse segmenting and windowing operation on each segmented frequency
domain representation of the target voice to generate a time domain signal
representing the source speech in the voice of the character depicted by
the costume.
3. A voice transformation system comprising:
a preprocessing subsystem receiving a source voice signal and digitizing
and segmenting the source voice signal to generate a segmented time domain
signal;
an analysis subsystem responding to each segment of the segmented time
domain signal by generating a source speech pitch signal representative of
a pitch thereof, an excitation signal representative of the excitation
thereof and a source vector that is representative of a smoothed spectrum
of the segment;
a transformation subsystem storing a plurality of source and target vectors
and voice pitch indications for the source voice and a target voice
different from the source voice, a correspondence between the source and
target vectors and the source and target voice pitch indications, the
transformation subsystem using the stored information to substitute a
target vector for each received source vector, adjusting the pitch of the
frequency domain excitation spectrum in response to the source and target
pitch indications to generate a pitch adjusted excitation spectrum, and
convolving the pitch adjusted excitation spectrum with a signal
represented by the substituted target vector to generate a sequence of
segmented target voice segments defining a segmented target voice signal;
and
a post processing subsystem converting the segmented target voice signal
into a segmented time domain target voice signal that represents the words
of the source signal with vocal characteristics of the different target
voice.
4. A voice transformation system according to claim 3, wherein the
preprocessing subsystem includes a digitizing sampling circuit that
samples the source voice signal to produce digital samples that are
representative thereof and a segmenting and windowing circuit that divides
the digital samples into overlapping segments having a shift distance of
at most 1/4 of a segment and applies a windowing function to each segment
that reduces aliasing during a subsequent transformation to the frequency
domain to produce a sequence of windowed source segments.
5. A voice transformation system according to claim 4, wherein each of the
segments represents 256 voice samples.
6. A voice transformation system according to claim 3, wherein the analysis
subsystem includes:
a discrete Fourier transform unit generating a frequency domain
representation of each segment;
an LPC cepstrum parametrization unit generating source cepstrum coefficient
voice vectors representing a smoothed spectrum of each frequency domain
segment;
an inverse convolution unit deconvolving each frequency domain segment with
the smoothed cepstrum coefficient representation thereof to produce the
excitation signal in the form of a frequency domain excitation spectrum;
a pitch adjustment unit responding to the source speech pitch signal and
adjusting the pitch of the excitation spectrum to generate a pitch
adjusted excitation spectrum;
a substitution unit substituting target cepstrum coefficient voice vectors
for the source cepstrum coefficient voice vectors for each corresponding
segment; and
a convolver convolving the pitch adjusted excitation spectrum with the
substituted target cepstrum coefficient voice vectors.
7. A voice transformation system according to claim 3, wherein the
transformation subsystem includes:
a store storing the target voice pitch information, a plurality of the
target vectors, a plurality of the source vectors and the correspondence
between the source and target vectors;
a pitch adjustment unit adjusting the pitch of the frequency domain
excitation spectrum to generate a pitch adjusted excitation spectrum;
a substitution unit receiving source vectors and responsive to the stored
voice and target vectors and substituting one of the stored target vectors
for each received source vector; and
a convolver convolving each substituted target vector with the
corresponding pitch adjusted excitation spectrum to generate a segmented
frequency domain target voice signal.
8. A voice transformation system according to claim 3, wherein the post
processing subsystem includes:
an inverse Fourier transform unit transforming the segmented target voice
signal to the segmented time domain target voice signal;
an inverse segmenting and windowing unit converting the segmented time
domain target voice signal to a sampled nonsegmented target voice signal;
and
a time duration adjustment unit adjusting the time duration of
representations of the sampled nonsegmented target voice signal.
9. A voice transformation system according to claim 8, further comprising a
digital-to-analog converter converting the time duration adjusted sampled
nonsegmented target voice signal to a continuous time varying signal
representing spoken utterances of the source voice with acoustical
characteristics of the target voice.
10. A method of transforming a source signal representing a source voice to
a target signal representing a target voice comprising the steps of:
preprocessing the source signal to produce a time domain sampled and
segmented source signal in response thereto;
analyzing the sampled and segmented source signal, the analysis including
executing a transformation of the source signal to the frequency domain,
generating a cepstrum vector representation of a smoothed spectrum of each
segment of the source signal, generating an excitation signal representing
the excitation of each segment of the source signal, determining a pitch
for each segment of the source signal, and adjusting the excitation signal
for each segment of the source signal in response to the pitch for each
segment of the source signal;
transforming each segment by storing cepstrum vectors representing target
speech and corresponding cepstrum vectors representing source speech,
substituting a stored target speech cepstrum vector for an analyzed source
cepstrum vector and convolving the substituted target cepstrum vector with
the excitation signal to generate a target segmented frequency domain
signal; and
post processing the target segmented frequency domain signal to provide
transformation to the time domain and inverse segmentation to generate the
target voice signal.
11. For use with a costume depicting a predefined character having a voice
with a pre-established voice characteristic, a voice transformation system
comprising:
a microphone that is positionable to receive and transduce speech that is
spoken by a person wearing the costume into a source speech signal;
a mask that is positionable to cover the mouth of the person wearing the
costume to muffle the speech of the person wearing the costume to tend to
prevent communication of the speech beyond the costume, the mask enabling
placement of the microphone between the mouth and the mask;
a speaker disposed on or within the costume to broadcast acoustic waves
carrying speech in the voice of the character depicted by the costume; and
a voice transformation device coupled to receive the signal from the
microphone representing source speech spoken by a person wearing the
costume, the voice transformation device transforming the received source
speech signal to a target speech signal by replacing vocal characteristics
of the speaker, represented by the signal, with predefined and stored
substitute vocal characteristics of the voice of the character depicted by
the costume, the target speech signal being communicated to the speaker
to be transduced and acoustically broadcast by the speaker.
Description
COPYRIGHT AUTHORIZATION
A portion of the disclosure of this patent document contains material which
is subject to copyright protection. The copyright owner has no objection
to the facsimile reproduction by anyone of the patent document or the
patent disclosure, as it appears in the Patent and Trademark Office patent
file or records, but otherwise reserves all copyright rights whatsoever.
BACKGROUND OF THE INVENTION
In 1928 Mickey Mouse was introduced to the public in the first "talking"
animated film, entitled "Steamboat Willie". Walt Disney, who created
Mickey Mouse, was also the voice of Mickey Mouse. Consequently, when Walt
Disney died in 1966 the world lost a creative genius and Mickey Mouse lost
his voice.
It is not unusual to discover during the editing of a dramatic production
that one or more scenes are artistically flawed. Minor background problems
can sometimes be corrected by altering the scene images. However, if the
problem lies with the performance itself or there is a major visual
problem, a scene must be done over. Not only is this expensive, but
occasionally an actor in the scene will no longer be available to redo the
scene. The editor must then either accept the artistically flawed scene or
make major changes in the production to circumvent the flawed scene.
A double could typically be used to visually replace a missing actor in a
scene that is being redone. However, it is extremely difficult to
convincingly imitate the voice of a missing actor.
A need thus exists for a high quality voice transformation system that can
convincingly transform the voice of any given source speaker to the voice
of a target speaker. In addition to its use for motion picture and
television productions, a voice transformation system would have great
entertainment value. People of all ages could take great delight in having
their voices transformed to those of characters such as Mickey Mouse or
Donald Duck or even to the voice of their favorite actress or actor.
Alternatively, an actor dressed in the costume of a character and
imitating a character could be even more entertaining if he or she could
speak the voice of the character.
A great deal of research has been conducted in the field of voice
transformation and related fields. Much of the research has been directed
to transformation of source voices to a standardized target voice that can
be more easily recognized by computerized voice recognition systems.
A more general speech transformation system is suggested by an article by
Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano and Hisao Kuwabara,
"Voice Conversion Through Vector Quantization," IEEE International
Conference on Acoustics, Speech and Signal Processing, (April 1988), pp.
655-658. While the disclosed method produced a voice transformation, the
transformed target voice was less than ideal. It contained a considerable
amount of distortion and was recognizable as the target voice less than
2/3 of the time in an experimental evaluation.
SUMMARY OF THE INVENTION
A high quality voice transformation system and method in accordance with
the invention provides transformation of the voice of a source speaker to
the voice of a selected target speaker. The pitch and tonal qualities of
the source voice are transformed while retaining the words and voice
emphasis of the source speaker. In effect the vocal cords and glottal
characteristics of the target speaker are substituted for those of the
source speaker. The words spoken by the source speaker thus assume the
voice characteristics of the target speaker while retaining the inflection
and emphasis of the source speaker. The transformation system may be
implemented along with a costume of a character to enable an actor wearing
the costume to speak with the voice of the character.
In a method of voice transformation in accordance with the invention, a
learning step is executed wherein selected matching utterances from source
and target speakers are divided into corresponding short segments. The
segments are transformed from the time domain to the frequency domain and
representations of corresponding pairs of smoothed spectral data are
stored as source and target code books in a table. During voice
transformation the source speech is divided into segments which are
transformed to the frequency domain and then separated into a smoothed
spectrum and an excitation spectrum. The closest match of the smoothed
spectrum for each segment is found in the stored source code book and the
corresponding target speech smoothed spectrum from the target code book is
substituted therefor in a substitution or transformation step. This
substituted target smoothed spectrum is convolved with the original source
excitation spectrum for the same segment and the resulting transformed
speech spectrum is transformed back to the time domain for amplification
and playback through a speaker or for storage on a recording medium.
It has been found advantageous to represent the original speech segments as
the cepstrum of the Fourier transform of each segment. The source
excitation spectrum is obtained by dividing or deconvolving the
transformed source speech spectrum by a smoothed representation thereof.
A real time voice transformation system includes a plurality of similar
signal processing circuits arranged in sequential pipelined order to
transform source voice signals into target voice signals. Voice
transformation thus appears to be instantaneous as heard by a normal
listener.
BRIEF DESCRIPTION OF THE DRAWINGS
A better understanding of the invention may be had from a consideration of
the following Detailed Description, taken in conjunction with the
accompanying drawings in which:
FIG. 1 is a pictorial representation of an actor wearing a costume that has
been fitted with a voice transformation system in accordance with the
invention;
FIG. 2 is a block diagram representation of a method of transforming a
source voice to a different target voice in accordance with the invention;
FIG. 3 is a block diagram representation of a digital sampling step used in
the processor shown in FIG. 2;
FIG. 4 is a pictorial representation of a segmentation of a sampled data
signal;
FIG. 5 is a graphical representation of a windowing function;
FIG. 6 is a block diagram representation of a training step used in a voice
transformation processor shown in FIG. 2;
FIG. 7 is a graphical representation of interpolation of the magnitude of
the excitation spectrum of a speech segment for linear pitch scaling;
FIG. 8 is a graphical representation of interpolation of the real part of
the excitation spectrum of a speech segment for linear pitch scaling;
FIG. 9 is a block diagram representation of a code book generation step
used by a training step shown in FIG. 2;
FIG. 10 is a block diagram representation of a generate mapping code book
step used by a training step shown in FIG. 2;
FIG. 11 is a pictorial representation useful in understanding the generate
mapping code book step shown in FIG. 10;
FIG. 12 is a block diagram representation of an initialize step used in the
time duration adjustment step shown in FIG. 16.
DETAILED DESCRIPTION OF THE INVENTION
Referring now to FIG. 1, a voice transformation system 10 in accordance
with the invention includes a battery powered, portable transformation
processor 12 electrically coupled to a microphone 14 and a speaker 16. The
microphone 14 is mounted on a mask 18 that is worn by a person 20. The
mask 18 muffles or contains the voice of the person 20 to at least limit,
and preferably block, the extent to which the voice of the person 20 can
be heard beyond a costume 22 which supports the speaker 16.
With the voice contained within costume 22, the person 20 can be an actor
portraying a character such as Mickey Mouse® or Pluto® that is
depicted by the costume 22. The person 20 can speak into microphone 14 and
have his or her voice transformed by transformation processor 12 into that
of the depicted character. The actor can thus provide the words and
emotional qualities of speech, while the speaker 16 broadcasts the speech
with the predetermined vocal characteristics corresponding to the voice of
a character being portrayed.
The voice transformation system 10 can be used for other applications as
well. For example, it might be used in a fixed installation where a person
selects a desired character, speaks a training sequence that creates a
correspondence between the voice of the person and the voice of the
desired character, and then speaks randomly into a microphone to have his
or her voice transformed and broadcast from a speaker as that of the
character. Alternatively, the person can be an actor substituting for an
unavailable actor to create a voice imitation that would not otherwise be
possible. The voice transformation system 10 can thus be used to recreate
a defective scene in a movie or television production at a time when an
original actor is unavailable. The system 10 could also be used to create
a completely new character voice that could subsequently be imitated by
other people using the system 10.
Referring now to FIG. 2, a voice transformation system 10 for transforming
a source voice into a selected target voice includes microphone 14 picking
up the acoustical sounds of a source voice and transducing them into a
time domain analog signal x(t), a voice transformation processor 12 and a
speaker 16 that receives a transformed target time domain analog voice
signal X_T(t) and transduces the signal into acoustical waves that
can be heard by people. Alternatively, the transformed speech signal can
be communicated to some kind of recording device 24 such as a motion
picture film recording device or a television recording device.
The transformation processor 12 includes a preprocessing unit or subsystem
30, an analysis unit or subsystem 32, a transformation unit or subsystem
34, and a post processing unit or subsystem 36.
The voice transformation system 10 may be implemented on any data
processing system 12 having sufficient processing capacity to meet the
real time computational demands of the transformation system 10. The
system 12 initially operates in a training mode, which need not be in real
time. In the training mode the system receives audio signals representing
an identical sequence of words from both source and target speakers. The
two speech signals are stored and compared to establish a correlation
between sounds spoken by the source speaker and the same sounds spoken by
the target speaker.
Thereafter the system may be operated in a real time transformation mode to
receive voice signals of the source speaker
and use the previously established correlations to substitute voice
signals of the target speaker for corresponding signals of the source
speaker. The tonal qualities of the target speaker may thus be substituted
for those of the source speaker in any arbitrary sequences of source
speech while retaining the emphases and word content provided by the
source speaker.
The preprocessing unit 30 includes a digital sampling step 40 and a
segmenting and windowing step 42. The digital sampling step 40 digitally
samples the analog voice signal x(t) at a rate of 10 kHz to generate a
corresponding sampled data signal x(n). Segmenting and windowing step 42
segments the sample data sequences into overlapping blocks of 256 samples
each with a shift distance of 1/4 segment or 64 samples. Each sample thus
appears redundantly in 4 successive segments. After segmentation, each
segment is subjected to a windowing function such as a Hamming window
function to reduce aliasing of the segment during a subsequent Fourier
transformation to the frequency domain. The segmented and windowed signal
is identified as X_w(mS,n), wherein m is the segment index, S is the shift
size of 64 samples and n is an index into the sampled data values of each
256 sample segment (0-255). The value mS thus indexes the starting point of
each segment within the original sampled data signal x(n).
The analysis unit 32 receives the segmented signal X_w(mS,n) and
generates from this signal an excitation signal E(k) representing the
excitation of each segment and a 24 term cepstrum vector K(mS,k)
representing a smoothed spectrum for each segment.
The analysis unit 32 includes a short time Fourier transform step 44 (STFT)
that converts the segmented signal X_w(mS,n) to a corresponding
frequency domain signal X_w(mS,k). An LPC cepstrum parametrization
step 46 produces for each segment a 24 term vector K(mS,k) representing a
smoothed spectrum of the voice signal represented by the segment.
A deconvolver 52 deconvolves the smoothed spectrum represented by the
cepstrum vectors K(mS,k) with the original spectrum X_w(mS,k) to
produce an excitation spectrum E(k) that represents the emotional energy
of each segment of speech.
The transformation unit 34 is operable during a training mode to receive
and store the sequence of cepstrum vectors K(mS,k) for both a target
speaker and a source speaker as they utter identical scripts containing
word sequences designed to elicit all of the sounds used in normal speech.
The vectors representing this training speech are assembled into target
and source code books, each unique to a particular speaker. These code
books, along with a mapping code book establishing a correlation between
target and source speech vectors, are stored for later use in speech
transformation. The average pitch of the target and source voices is also
determined during the training mode for later use during a transformation
mode.
The transformation unit 34 includes a training step 54 that receives the
cepstrum vectors K(mS,k) to generate and store the target, source and
mapping code books during a training mode of operation. Training step 54
also determines the pitch signals Ps for each segment so as to determine
and store indications of overall average pitch for both the target and the
source.
Thereafter, during real time transformation mode of operation, the cepstrum
vectors are received by a substitute step 56 that accesses the stored
target, source and mapping code books and substitutes a target vector for
each received source vector. A target vector is selected that best
corresponds to the same speech content as the source vector.
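The substitution step can be illustrated with a minimal sketch. It assumes a squared Euclidean distance for the closest-match search and a one-to-one mapping table between source and target code book entries; the function names, the toy 2-term vectors and the `mapping` list are hypothetical and are not taken from the patent, whose cepstrum vectors have 24 terms and whose mapping code book is generated in a separate training step.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_index(vector, code_book):
    """Index of the code book entry closest to `vector`."""
    return min(range(len(code_book)),
               key=lambda i: squared_distance(vector, code_book[i]))


def substitute(source_vector, source_book, mapping, target_book):
    """Replace a source vector with the target vector mapped to its nearest source entry."""
    return target_book[mapping[nearest_index(source_vector, source_book)]]


# Toy 2-term vectors for illustration only.
source_book = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
target_book = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
mapping = [2, 0, 1]  # hypothetical: source entry i maps to target entry mapping[i]

print(substitute([0.9, 1.2], source_book, mapping, target_book))
```

The distance measure is the main design choice in such a lookup; a weighted distance over cepstrum terms would slot into `squared_distance` without changing the rest.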
A pitch adjustment step 58 responds to the ratio of the pitch indication
P_TS for the source speech to the pitch indication P_TT for the
target speech determined by the training step 54 to adjust the excitation
spectrum E(k) for the change in pitch from source to target speech. The
adjusted signal is designated E_PA(k). A convolver 60 then combines
the target spectrum as represented by the substituted cepstrum vectors
K_T(mS,k) with the pitch adjusted excitation signal E_PA(k) to
produce a frequency domain, segmented transformed speech signal
X_WT(mS,k) representing the utterances and excitation of the source speaker
with the glottal or acoustical characteristics of the target speaker.
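A minimal sketch of the deconvolve and recombine steps follows. It assumes that deconvolution is carried out as a per-bin division of the segment spectrum by the source smoothed spectrum (the summary describes "dividing or deconvolving") and that the convolver performs the matching per-bin multiplication. The smoothed spectra are assumed to be already expanded from cepstrum vectors into one gain per frequency bin, pitch adjustment is omitted, and the function names and the `floor` guard are hypothetical.

```python
def excitation(spectrum, source_smoothed, floor=1e-12):
    """Per-bin division of a segment spectrum by its smoothed spectrum.

    `floor` is an assumed guard against division by a near-zero gain.
    """
    return [x / (h if abs(h) > floor else floor)
            for x, h in zip(spectrum, source_smoothed)]


def recombine(target_smoothed, excitation_spectrum):
    """Per-bin multiplication of the target smoothed spectrum and the excitation."""
    return [h * e for h, e in zip(target_smoothed, excitation_spectrum)]


# Three-bin toy spectra for illustration only.
spectrum = [2 + 2j, 3 + 0j, 0 - 4j]
source_smoothed = [2.0, 1.5, 4.0]
target_smoothed = [1.0, 3.0, 0.5]

e = excitation(spectrum, source_smoothed)
transformed = recombine(target_smoothed, e)
print(e)
print(transformed)
```

Recombining the excitation with the source smoothed spectrum instead of the target one returns the original segment spectrum, which is a convenient sanity check for any implementation of the two steps.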
The post processing unit 36 responds to the transformed speech signal
X_WT(mS,k) with an inverse discrete Fourier transform step 62, an inverse
segmenting and windowing step 64 that recombines the overlapping segments
into a single sequence of sampled data, and a time duration adjustment step
66 that uses an LSEE/MSTM algorithm to generate a time domain,
nonsegmented sampled data signal X_T(n) representing the transformed
speech. A digital-to-analog converter and amplifier converts the sampled
signal X_T(n) to a continuous analog electrical signal X_T(t).
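One way to picture the inverse segmenting and windowing step is a normalized overlap-add, sketched below. This is an assumption for illustration, not the patent's stated procedure: the time duration adjustment algorithm is not shown, and the function name `overlap_add` and the division by the summed window weights are choices made here. The segment length of 256 and shift of 64 follow the preferred embodiment.

```python
import math


def overlap_add(windowed_segments, window, shift):
    """Recombine overlapping windowed segments into one sample sequence.

    Each output sample is the sum of the overlapping windowed values
    divided by the sum of the window weights that covered it.
    """
    seg_len = len(window)
    total = shift * (len(windowed_segments) - 1) + seg_len
    acc = [0.0] * total
    weight = [0.0] * total
    for m, seg in enumerate(windowed_segments):
        start = m * shift
        for n in range(seg_len):
            acc[start + n] += seg[n]
            weight[start + n] += window[n]
    return [a / w if w > 1e-9 else 0.0 for a, w in zip(acc, weight)]


seg_len, shift = 256, 64
window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (seg_len - 1))
          for n in range(seg_len)]
x = [math.sin(0.05 * n) for n in range(1024)]  # stand-in for sampled speech
windowed = [[x[s + n] * window[n] for n in range(seg_len)]
            for s in range(0, len(x) - seg_len + 1, shift)]

y = overlap_add(windowed, window, shift)
max_err = max(abs(a - b) for a, b in zip(x, y))
print(len(y), max_err < 1e-9)
```

When the segments are unmodified, the normalization returns the input samples up to rounding error. Once the segments have been transformed, the overlapping copies no longer agree exactly and the result is a weighted average of them.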
Referring now to FIG. 3, the digital sampling step 40 includes a low pass
filter 80 and an analog-to-digital converter 82. The time varying source
voice signal, x(t), from speech source 14 is filtered by the low pass filter
80 with a cutoff frequency of 4.5 kHz. The signal is then converted from
analog to digital form by the analog-to-digital converter 82
(A/D converter), which derives the sequence x(n) by evaluating x(t) at
t=nT=(n/f), where f is the sampling frequency of 10 kHz, T is the sampling
period, and n increments from 0 to some count, X-1, at the end of a given
source voice utterance interval.
As shown in FIG. 4, the sampled source voice signal, x(n), goes through a
segmenting and windowing step 42 which breaks the signal into overlapping
segments. Then the segments are windowed by a suitable windowing function
such as a Hamming function illustrated in FIG. 5.
The combination of creating overlapping sequences of the speech signal and
then windowing of these overlapping sequences at window function step 42
is used to isolate short segments of the speech signal by emphasizing a
finite segment of the speech waveform in the vicinity of the sample and
de-emphasizing the remainder of the waveform. Thus, the waveform in the
time interval to be analyzed can be processed as if it were a short
segment from a sustained sound with fixed properties. Also, the windowing
function reduces the end point discontinuities when the windowed data is
subjected to the discrete Fourier transformation (DFT) at step 44.
As illustrated in FIG. 4, the segmentation step 42 segments the discrete
time signal into a plurality of overlapping segments or sections of the
sampled waveform 48, which segments are sequentially numbered from m=0 to
m=(M-1). Any specific sample can be identified as,
$$X(mS,n) = X(n)\big|_{n=(mS,\,n')}, \qquad 0 \le n \le L-1 \qquad (1)$$
In equation (1), S represents the number of samples in the time dimension
by which each successive window is shifted, otherwise known as the window
shift size, L is the window size, and mS defines the beginning sample of a
segment. The variable n is the ordinal position of a data sample within
the sampled source data and n' is the ordinal position of a data sample
within a segment. Because each sample, x(n), is redundantly represented in
four different quadrants of four overlapping segments, the original source
data, x(n), can be reconstructed with minimal distortion. In the preferred
embodiment the segment size is L=256 and the window shift size is S=64 or
1/4 of the segment size.
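The segmentation can be sketched in a few lines. The function name `segment` and the choice to drop trailing samples that do not fill a whole segment are assumptions made for this illustration; the segment size of 256 and shift of 64 are the preferred-embodiment values.

```python
def segment(samples, seg_len=256, shift=64):
    """Split samples into overlapping segments; segment m starts at sample m * shift."""
    segments = []
    start = 0
    while start + seg_len <= len(samples):
        segments.append(samples[start:start + seg_len])
        start += shift
    return segments


x = list(range(1024))  # stand-in for the sampled signal x(n); each value is its own index
segs = segment(x)

# Sample n = 300 lies in every segment m with m*64 <= 300 < m*64 + 256.
holders = [m for m, s in enumerate(segs) if 300 in s]
print(len(segs), holders)
```

Away from the start and end of the signal, every sample appears in four segments, which is the redundancy the text relies on for reconstruction.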
Now referring to FIG. 5, each segment is subjected to a conventional
windowing function, w(n), which is preferably a Hamming window function.
The window function is also indexed from mS (the start of each segment) so
as to multiply the speech samples in each segment directly with the
selected window function to produce windowed samples, X_w(mS,n), in
the time domain as follows:
$$X_w(mS,n) = X(mS,n)\,W(mS,n) \qquad (2)$$
The Hamming window has the function,
$$W(mS,n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{L-1}\right), \qquad 0 \le n \le L-1 \qquad (3)$$
The Hamming window reduces ripples at the expense of adding some
distortion and produces a further smoothing of the spectrum. The Hamming
window has tapered edges which allows periodic shifting of the analysis
frame along an input signal without a large effect on the speech
parameters created by pitch period boundary discontinuities or other
sudden changes in the speech signal. Some alternative windowing functions
are the Hanning, Blackman, Bartlett, and Kaiser windows which each have
known respective advantages and disadvantages.
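As an illustration, a Hamming window can be generated and applied to a segment as follows. The sketch assumes the conventional definition w(n) = 0.54 - 0.46 cos(2πn/(L-1)) for 0 ≤ n ≤ L-1; the function names are hypothetical.

```python
import math


def hamming(length=256):
    """Conventional Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(segment, window):
    """Multiply a segment by the window, sample by sample."""
    return [s * w for s, w in zip(segment, window)]


w = hamming(256)
flat = [1.0] * 256                 # constant test segment
tapered = apply_window(flat, w)
print(round(tapered[0], 4), round(max(tapered), 4))
```

The end samples are scaled to about 0.08 rather than to zero, so the taper softens the segment boundaries without discarding the edge samples entirely.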
The allowable window duration is limited by the desired time resolution
which usually corresponds to the rate at which spectral changes occur in
speech. Short windows are used when high time resolution is important and
when the smoothing of spectral harmonics into wider frequency formants is
desirable. Long windows are used when individual harmonics must be
resolved. The window size, L, in the preferred embodiment is 256 points
of speech sampled at 10,000 samples per second. An L-point Hamming window
requires a minimum time overlap of 4 to 1; thus, the sampling period (or
window shift size), S, must be less than or equal to L/4, or
S ≤ 256/4 = 64 samples. To be sure that S is small enough to
avoid time aliasing, a shift length of 64 samples has been chosen for the
preferred embodiment.
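The timing implied by these figures can be worked out directly. The short calculation below uses only the stated sampling rate, window size and 4 to 1 overlap; the variable names are hypothetical.

```python
SAMPLE_RATE_HZ = 10_000          # sampling frequency f
SEGMENT_LEN = 256                # window size L
SHIFT = SEGMENT_LEN // 4         # largest shift allowed by the 4-to-1 overlap

segment_ms = 1000.0 * SEGMENT_LEN / SAMPLE_RATE_HZ   # duration of one segment
shift_ms = 1000.0 * SHIFT / SAMPLE_RATE_HZ           # time between segment starts
segments_per_second = SAMPLE_RATE_HZ / SHIFT         # analysis rate

print(SHIFT, segment_ms, shift_ms, segments_per_second)
```

Each segment therefore spans about 25.6 ms of speech, and a new segment begins every 6.4 ms.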
Each windowed frame is subjected to a DFT 44 in the form of a 512 point
fast Fourier transform (FFT) to create a frequency domain speech signal,
X_w(mS,k),
$$X_w(mS,k) = \sum_{n=0}^{L-1} X_w(mS,n)\, e^{-j(2\pi k/N)n}, \qquad 0 \le k \le N-1 \qquad (4)$$
where k is frequency and the frame length, N, is preferably selected to be
512.
The exponential function in this equation is the short time Fourier
transform (STFT) function which transforms the frame from the time domain
to the frequency domain. The DFT is used instead of the standard Fourier
transform so that the frequency variable, k, will only take on N discrete
values where N corresponds to the frame length of the DFT. Since the DFT
is invertible, no information about the signal x(n) during the window is
lost in the representation, X_w(mS,k), as long as the transform is
sampled in frequency sufficiently often at N equally spaced values of k
and the transform X_w(mS,k) has no zero valued terms among its N
terms. Low values for N result in short frequency domain functions or
windows, and DFTs using few points give poor frequency resolution since the
window low pass filter is wide. Also, low values of segment length, L,
yield good time resolution since the speech properties are averaged only
over short time intervals. Large values of N, however, give poor time
resolution and good frequency resolution. N must be large enough to
minimize the interference of aliased copies of a segment on the copy of
interest near n=0. As the DFT of x(n) provides information about how x(n)
is composed of complex exponentials at different frequencies, the
transform, X_w(mS,k), is referred to as the spectrum of x(n). This
time dependent DFT can be interpreted as a smoothed version of the Fourier
transform of each windowed finite length speech segment.
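A direct, unoptimized DFT makes the transform concrete. The sketch below assumes that the 256-sample frame is zero padded to the 512 point frame length before transformation; the function name `dft` and the unwindowed cosine test frame are hypothetical and chosen only so that the spectral peak is easy to predict.

```python
import cmath
import math


def dft(frame, n_points=512):
    """Direct N-point DFT of a frame, zero padded to n_points samples."""
    padded = list(frame) + [0.0] * (n_points - len(frame))
    return [sum(padded[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]


# A cosine with 20 cycles across 256 samples equals 40 cycles per 512 points,
# so its energy should concentrate at frequency index k = 40.
frame = [math.cos(2.0 * math.pi * 20 * n / 256) for n in range(256)]
spectrum = dft(frame, 512)
peak_bin = max(range(256), key=lambda k: abs(spectrum[k]))
print(peak_bin)
```

The direct form costs on the order of N squared operations per frame, so it is suited to checking results rather than to real time use.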
The N values of the DFT, X_w(mS,k), can be computed very efficiently
by a set of computational algorithms known collectively as the fast
Fourier transform (FFT) in a time roughly proportional to N log_2 N
instead of the 4N^2 real multiplications and N(4N-2) real additions
required by the DFT. These algorithms exploit both the symmetry and
periodicity of the sequence e^{-j(2πk/N)n}. They also decompose the
DFT computation into successively smaller DFTs. (See A. Oppenheim and R.
Schafer, Digital Signal Processing, Prentice-Hall, 1975 (see especially
pages 284-327) and L. Rabiner and R. Schafer, Digital Processing of Speech
Signals, Prentice-Hall, 1978 (see especially pages 303-306) which are
hereb