Speech transformation system    

United States Patent 5327521
Inventor(s): Savic; Michael I. (Ballston Lake, NY); Tan; Seow-Hwee (Glendale, CA); Nam; Il-Hyun (Seoul, KR)
Abstract: A high quality voice transformation system and method operates during a training mode to store voice signal characteristics representing target and source voices. Thereafter, during a real time transformation mode, a signal representing source speech is segmented into overlapping segments and analyzed to separate the excitation spectrum from the tone quality spectrum. A stored target tone quality spectrum is substituted for the source spectrum and then convolved with the actual source speech excitation spectrum to produce a transformed speech signal having the word and excitation content of the source, but the acoustical characteristics of a target speaker. The system may be used to enable a talking, costumed character, or in other applications where a source speaker wishes to imitate the voice characteristics of a different, target speaker.
   














Owner/Assignee     The Walt Disney Company (Burbank, CA)
Publication Date     July 5, 1994
Application Number     08/114,603
Filing Date     August 31, 1993
US Classification     704/272 704/200 704/203
Int'l Classification     G10L 003/00
Examiner     Knepper; David D.
Attorney/Law Firm     Pretty, Schroeder, Brueggemann & Clark
Parent Case     This application is a continuation of a prior pending application, application Ser. No. 07/845,375, filed on Mar. 2, 1992, now abandoned.
USPTO Field of Search     381/61 381/62 381/36 381/37 381/38 381/39 381/40 381/43 381/45 381/49 381/50 381/53 381/54 395/2.67 395/2 395/2.7 395/2.79 395/2.81 395/2.87 395/2.12
U.S. References

Patent No.   Inventor        US Class   Date
5113449      Blanton         704/261    May 1992
5029211      Ozawa           704/266    Jul 1991
4937873      McAulay         704/265    Jun 1990
4885790      McAulay         704/265    Dec 1989
4864626      Yang            381/61     Sep 1989
4856068      Quatieri, Jr.   704/227    Aug 1989
4827516      Tsukahara       704/224    May 1989
4815135      Taguchi         704/217    Mar 1989
4683588      Goldberg        381/61     Jul 1987
4667340      Arjmand         704/207    May 1987
4400591      Jennings        381/61     Aug 1983
4058676      Wilkes          704/220    Nov 1977
Claims
 


What is claimed is:

1. For use with a costume depicting a character having a defined voice with a pre-established voice characteristic, a voice transformation system comprising:

a microphone that is positionable to receive and transduce speech that is spoken by a person wearing the costume into a source speech signal;

a mask that is positionable to cover the mouth of the person wearing the costume to muffle the speech of the person wearing the costume to tend to prevent communication of the speech beyond the costume, the mask enabling placement of the microphone between the mouth and the mask;

a speaker disposed on or within the costume to broadcast acoustic waves carrying speech in the defined voice of the character depicted by the costume; and

a voice transformation device coupled to receive the signal from the microphone representing source speech spoken by a person wearing the costume, the voice transformation device transforming the received source speech signal to a target speech signal representing the utterances of the source speech signals in the defined voice of the character depicted by the costume;

wherein the voice transformation device stores a plurality of representations of the defined voice and transforms the voice of the person wearing the costume into the same defined voice of the character depicted by the costume, based upon association of the voice of the particular person with particular ones of the stored representations.

2. A voice transformation system according to claim 1, wherein the voice transformation device includes:

a processing subsystem segmenting and windowing the received source speech signal to generate a sequence of preprocessed speech signal segments;

an analysis subsystem processing the received preprocessed speech signal segments to generate for each segment a pitch signal indicating a dominant pitch of the segment, a frequency domain vector representing a smoothed frequency characteristic of the segment and an excitation signal representing excitation characteristics of the segment;

a transformation subsystem storing target frequency domain vectors that are representative of the target speech, substituting a corresponding target frequency domain vector for the frequency domain vector derived by the analysis subsystem, adjusting the pitch of the target excitation spectrum in response to the pitch signal derived by the analysis subsystem, and convolving the substituted target frequency domain vector with the adjusted excitation spectrum to produce a segmented frequency domain representation of the target voice; and

a post processing subsystem performing an inverse Fourier transform and an inverse segmenting and windowing operation on each segmented frequency domain representation of the target voice to generate a time domain signal representing the source speech in the voice of the character depicted by the costume.

3. A voice transformation system comprising:

a preprocessing subsystem receiving a source voice signal and digitizing and segmenting the source voice signal to generate a segmented time domain signal;

an analysis subsystem responding to each segment of the segmented time domain signal by generating a source speech pitch signal representative of a pitch thereof, an excitation signal representative of the excitation thereof and a source vector that is representative of a smoothed spectrum of the segment;

a transformation subsystem storing a plurality of source and target vectors and voice pitch indications for the source voice and a target voice different from the source voice, a correspondence between the source and target vectors and the source and target voice pitch indications, the transformation subsystem using the stored information to substitute a target vector for each received source vector, adjusting the pitch of the frequency domain excitation spectrum in response to the source and target pitch indications to generate a pitch adjusted excitation spectrum, and convolving the pitch adjusted excitation spectrum with a signal represented by the substituted target vector to generate a sequence of segmented target voice segments defining a segmented target voice signal; and

a post processing subsystem converting the segmented target voice signal into a segmented time domain target voice signal that represents the words of the source signal with vocal characteristics of the different target voice.

4. A voice transformation system according to claim 3, wherein the preprocessing subsystem includes a digitizing sampling circuit that samples the source voice signal to produce digital samples that are representative thereof and a segmenting and windowing circuit that divides the digital samples into overlapping segments having a shift distance of at most 1/4 of a segment and applies a windowing function to each segment that reduces aliasing during a subsequent transformation to the frequency domain to produce a sequence of windowed source segments.

5. A voice transformation system according to claim 4, wherein each of the segments represents 256 voice samples.

6. A voice transformation system according to claim 3, wherein the analysis subsystem includes:

a discrete Fourier transform unit generating a frequency domain representation of each segment;

an LPC cepstrum parametrization unit generating source cepstrum coefficient voice vectors representing a smoothed spectrum of each frequency domain segment;

an inverse convolution unit deconvolving each frequency domain segment with the smoothed cepstrum coefficient representation thereof to produce the excitation signal in the form of a frequency domain excitation spectrum;

a pitch adjustment unit responding to the source speech pitch signal and adjusting the pitch of the excitation spectrum to generate a pitch adjusted excitation spectrum;

a substitution unit substituting target cepstrum coefficient voice vectors for the source cepstrum coefficient voice vectors for each corresponding segment; and

a convolver convolving the pitch adjusted excitation spectrum with the substituted target cepstrum coefficient voice vectors.

7. A voice transformation system according to claim 3, wherein the transformation subsystem includes:

a store storing the target voice pitch information, a plurality of the target vectors, a plurality of the source vectors and the correspondence between the source and target vectors;

a pitch adjustment unit adjusting the pitch of the frequency domain excitation spectrum to generate a pitch adjusted excitation spectrum;

a substitution unit receiving source vectors and responsive to the stored voice and target vectors and substituting one of the stored target vectors for each received source vector; and

a convolver convolving each substituted target vector with the corresponding pitch adjusted excitation spectrum to generate a segmented frequency domain target voice signal.

8. A voice transformation system according to claim 3, wherein the post processing subsystem includes:

an inverse Fourier transform unit transforming the segmented target voice signal to the segmented time domain target voice signal;

an inverse segmenting and windowing unit converting the segmented time domain target voice signal to a sampled nonsegmented target voice signal; and

a time duration adjustment unit adjusting the time duration of representations of the sampled nonsegmented target voice signal.

9. A voice transformation system according to claim 8, further comprising a digital-to-analog converter converting the time duration adjusted sampled nonsegmented target voice signal to a continuous time varying signal representing spoken utterances of the source voice with acoustical characteristics of the target voice.

10. A method of transforming a source signal representing a source voice to a target signal representing a target voice comprising the steps of:

preprocessing the source signal to produce a time domain sampled and segmented source signal in response thereto;

analyzing the sampled and segmented source signal, the analysis including executing a transformation of the source signal to the frequency domain, generating a cepstrum vector representation of a smoothed spectrum of each segment of the source signal, generating an excitation signal representing the excitation of each segment of the source signal, determining a pitch for each segment of the source signal, and adjusting the excitation signal for each segment of the source signal in response to the pitch for each segment of the source signal;

transforming each segment by storing cepstrum vectors representing target speech and corresponding cepstrum vectors representing source speech, substituting a stored target speech cepstrum vector for an analyzed source cepstrum vector and convolving the substituted target cepstrum vector with the excitation signal to generate a target segmented frequency domain signal; and

post processing the target segmented frequency domain signal to provide transformation to the time domain and inverse segmentation to generate the target voice signal.

11. For use with a costume depicting a predefined character having a voice with a pre-established voice characteristic, a voice transformation system comprising:

a microphone that is positionable to receive and transduce speech that is spoken by a person wearing the costume into a source speech signal;

a mask that is positionable to cover the mouth of the person wearing the costume to muffle the speech of the person wearing the costume to tend to prevent communication of the speech beyond the costume, the mask enabling placement of the microphone between the mouth and the mask;

a speaker disposed on or within the costume to broadcast acoustic waves carrying speech in the voice of the character depicted by the costume; and

a voice transformation device coupled to receive the signal from the microphone representing source speech spoken by a person wearing the costume, the voice transformation device transforming the received source speech signal to a target speech signal by replacing vocal characteristics of the speaker, represented by the signal, with predefined and stored substitute vocal characteristics of the voice of the character depicted by the costume, the target speech signal being communicated to the speaker to be transduced and acoustically broadcast by the speaker.
Description
 


COPYRIGHT AUTHORIZATION

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

BACKGROUND OF THE INVENTION

In 1928 Mickey Mouse was introduced to the public in the first "talking" animation film, entitled "Steamboat Willie". Walt Disney, who created Mickey Mouse, was also the voice of Mickey Mouse. Consequently, when Walt Disney died in 1966 the world lost a creative genius and Mickey Mouse lost his voice.

It is not unusual to discover during the editing of a dramatic production that one or more scenes are artistically flawed. Minor background problems can sometimes be corrected by altering the scene images. However, if the problem lies with the performance itself or there is a major visual problem, a scene must be done over. Not only is this expensive, but occasionally an actor in the scene will no longer be available to redo the scene. The editor must then either accept the artistically flawed scene or make major changes in the production to circumvent the flawed scene.

A double could typically be used to visually replace a missing actor in a scene that is being redone. However, it is extremely difficult to convincingly imitate the voice of a missing actor.

A need thus exists for a high quality voice transformation system that can convincingly transform the voice of any given source speaker to the voice of a target speaker. In addition to its use for motion picture and television productions, a voice transformation system would have great entertainment value. People of all ages could take great delight in having their voices transformed to those of characters such as Mickey Mouse or Donald Duck or even to the voice of their favorite actress or actor. Alternatively, an actor dressed in the costume of a character and imitating a character could be even more entertaining if he or she could speak the voice of the character.

A great deal of research has been conducted in the field of voice transformation and related fields. Much of the research has been directed to transformation of source voices to a standardized target voice that can be more easily recognized by computerized voice recognition systems.

A more general speech transformation system is suggested by an article by Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano and Hisao Kuwabara, "Voice Conversion Through Vector Quantization," IEEE International Conference on Acoustics, Speech and Signal Processing, (April 1988), pp. 655-658. While the disclosed method produced a voice transformation, the transformed target voice was less than ideal. It contained a considerable amount of distortion and was recognizable as the target voice less than 2/3 of the time in an experimental evaluation.

SUMMARY OF THE INVENTION

A high quality voice transformation system and method in accordance with the invention provides transformation of the voice of a source speaker to the voice of a selected target speaker. The pitch and tonal qualities of the source voice are transformed while retaining the words and voice emphasis of the source speaker. In effect the vocal cords and glottal characteristics of the target speaker are substituted for those of the source speaker. The words spoken by the source speaker thus assume the voice characteristics of the target speaker while retaining the inflection and emphasis of the source speaker. The transformation system may be implemented along with a costume of a character to enable an actor wearing the costume to speak with the voice of the character.

In a method of voice transformation in accordance with the invention, a learning step is executed wherein selected matching utterances from source and target speakers are divided into corresponding short segments. The segments are transformed from the time domain to the frequency domain and representations of corresponding pairs of smoothed spectral data are stored as source and target code books in a table. During voice transformation the source speech is divided into segments which are transformed to the frequency domain and then separated into a smoothed spectrum and an excitation spectrum. The closest match of the smoothed spectrum for each segment is found in the stored source code book and the corresponding target speech smoothed spectrum from the target code book is substituted therefor in a substitution or transformation step. This substituted target smoothed spectrum is convolved with the original source excitation spectrum for the same segment and the resulting transformed speech spectrum is transformed back to the time domain for amplification and playback through a speaker or for storage on a recording medium.

It has been found advantageous to represent the original speech segments as the cepstrum of the Fourier transform of each segment. The source excitation spectrum is obtained by dividing or deconvolving the transformed source speech spectrum by a smoothed representation thereof.

A real time voice transformation system includes a plurality of similar signal processing circuits arranged in sequential pipelined order to transform source voice signals into target voice signals. Voice transformation thus appears to be instantaneous as heard by a normal listener.

BRIEF DESCRIPTION OF THE DRAWINGS

A better understanding of the invention may be had from a consideration of the following Detailed Description, taken in conjunction with the accompanying drawings in which:

FIG. 1 is a pictorial representation of an actor wearing a costume that has been fitted with a voice transformation system in accordance with the invention;

FIG. 2 is a block diagram representation of a method of transforming a source voice to a different target voice in accordance with the invention;

FIG. 3 is a block diagram representation of a digital sampling step used in the processor shown in FIG. 2;

FIG. 4 is a pictorial representation of a segmentation of a sampled data signal;

FIG. 5 is a graphical representation of a windowing function;

FIG. 6 is a block diagram representation of a training step used in a voice transformation processor shown in FIG. 2;

FIG. 7 is a graphical representation of interpolation of the magnitude of the excitation spectrum of a speech segment for linear pitch scaling;

FIG. 8 is a graphical representation of interpolation of the real part of the excitation spectrum of a speech segment for linear pitch scaling;

FIG. 9 is a block diagram representation of a code book generation step used by a training step shown in FIG. 2;

FIG. 10 is a block diagram representation of a generate mapping code book step used by a training step shown in FIG. 2;

FIG. 11 is a pictorial representation useful in understanding the generate mapping code book step shown in FIG. 10;

FIG. 12 is a block diagram representation of an initialize step used in the time duration adjustment step shown in FIG. 16.

DETAILED DESCRIPTION OF THE INVENTION

Referring now to FIG. 1, a voice transformation system 10 in accordance with the invention includes a battery powered, portable transformation processor 12 electrically coupled to a microphone 14 and a speaker 16. The microphone 14 is mounted on a mask 18 that is worn by a person 20. The mask 18 muffles or contains the voice of the person 20 to at least limit, and preferably block, the extent to which the voice of the person 20 can be heard beyond a costume 22 which supports the speaker 16.

With the voice contained within costume 22, the person 20 can be an actor portraying a character such as Mickey Mouse® or Pluto® that is depicted by the costume 22. The person 20 can speak into microphone 14 and have his or her voice transformed by transformation processor 12 into that of the depicted character. The actor can thus provide the words and emotional qualities of speech, while the speaker 16 broadcasts the speech with the predetermined vocal characteristics corresponding to the voice of a character being portrayed.

The voice transformation system 10 can be used for other applications as well. For example, it might be used in a fixed installation where a person selects a desired character, speaks a training sequence that creates a correspondence between the voice of the person and the voice of the desired character, and then speaks randomly into a microphone to have his or her voice transformed and broadcast from a speaker as that of the character. Alternatively, the person can be an actor substituting for an unavailable actor to create a voice imitation that would not otherwise be possible. The voice transformation system 10 can thus be used to recreate a defective scene in a movie or television production at a time when an original actor is unavailable. The system 10 could also be used to create a completely new character voice that could subsequently be imitated by other people using the system 10.

Referring now to FIG. 2, a voice transformation system 10 for transforming a source voice into a selected target voice includes a microphone 14 that picks up the acoustical sounds of a source voice and transduces them into a time domain analog signal x(t), a voice transformation processor 12, and a speaker 16 that receives a transformed target time domain analog voice signal X_T(t) and transduces the signal into acoustical waves that can be heard by people. Alternatively, the transformed speech signal can be communicated to some kind of recording device 24 such as a motion picture film recording device or a television recording device.

The transformation processor 12 includes a preprocessing unit or subsystem 30, an analysis unit or subsystem 32, a transformation unit or subsystem 34, and a post processing unit or subsystem 36.

The voice transformation system 10 may be implemented on any data processing system 12 having sufficient processing capacity to meet the real time computational demands of the transformation system 10. The system 12 initially operates in a training mode, which need not be in real time. In the training mode the system receives audio signals representing an identical sequence of words from both source and target speakers. The two speech signals are stored and compared to establish a correlation between sounds spoken by the source speaker and the same sounds spoken by the target speaker.

Thereafter the system may be operated in a real time transformation mode to receive voice signals representing the voice signals of the source speaker and use the previously established correlations to substitute voice signals of the target speaker for corresponding signals of the source speaker. The tonal qualities of the target speaker may thus be substituted for those of the source speaker in any arbitrary sequences of source speech while retaining the emphases and word content provided by the source speaker.

The preprocessing unit 30 includes a digital sampling step 40 and a segmenting and windowing step 42. The digital sampling step 40 digitally samples the analog voice signal x(t) at a rate of 10 kHz to generate a corresponding sampled data signal x(n). Segmenting and windowing step 42 segments the sampled data sequence into overlapping blocks of 256 samples each with a shift distance of 1/4 segment or 64 samples. Each sample thus appears redundantly in 4 successive segments. After segmentation, each segment is subjected to a windowing function such as a Hamming window function to reduce aliasing of the segment during a subsequent Fourier transformation to the frequency domain. The segmented and windowed signal is identified as X_w(mS,n), wherein m is the segment number, S is the shift size of 64 and n is an index into the 256 sampled data values of each segment (0-255). The value mS thus indexes the starting point of each segment within the original sampled data signal x(n).
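The segmenting and windowing step can be sketched as follows. This is only an illustration under stated assumptions: the function names `hamming` and `segment_and_window` are hypothetical, the standard Hamming formula is assumed, and trailing samples that do not fill a whole segment are simply dropped, which the text does not specify.

```python
import math

def hamming(length):
    """Standard Hamming window: 0.54 - 0.46*cos(2*pi*n/(length-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def segment_and_window(x, seg_len=256, shift=64):
    """Split x into overlapping segments starting at m*shift and window each.

    Assumption: a trailing partial segment is discarded.
    """
    w = hamming(seg_len)
    segments = []
    m = 0
    while m * shift + seg_len <= len(x):
        start = m * shift
        segments.append([x[start + n] * w[n] for n in range(seg_len)])
        m += 1
    return segments

# One second of a 200 Hz tone sampled at 10 kHz (test signal, not from the patent).
fs = 10_000
x = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
segments = segment_and_window(x)
```

With a shift of 64 and a segment length of 256, every sample far enough from the ends of the signal falls inside four consecutive segments, which is the redundancy the paragraph above relies on.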

The analysis unit 32 receives the segmented signal X_w(mS,n) and generates from this signal an excitation signal E(k) representing the excitation of each segment and a 24 term cepstrum vector K(mS,k) representing a smoothed spectrum for each segment.

The analysis unit 32 includes a short time Fourier transform step 44 (STFT) that converts the segmented signal X_w(mS,n) to a corresponding frequency domain signal X_w(mS,k). An LPC cepstrum parametrization step 46 produces for each segment a 24 term vector K(mS,k) representing a smoothed spectrum of the voice signal represented by the segment.

A deconvolver 52 deconvolves the smoothed spectrum represented by the cepstrum vectors K(mS,k) with the original spectrum X_w(mS,k) to produce an excitation spectrum E(k) that represents the emotional energy of each segment of speech.
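The separation of a segment into a smoothed spectrum and an excitation spectrum can be illustrated roughly as below. Note the substitution: the text specifies an LPC cepstrum, whereas this sketch uses a plain FFT-style real cepstrum with low-quefrency liftering, chosen only because it is short. All function names and the `floor` parameter are hypothetical, and deconvolution is taken to mean bin-by-bin division of the spectrum by the smoothed envelope, following the "dividing or deconvolving" wording of the Summary.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for short segments)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def split_envelope_and_excitation(segment, n_cep=24, floor=1e-12):
    """Return (spectrum, smoothed envelope, excitation spectrum) for one segment."""
    spectrum = dft(segment)
    log_mag = [math.log(max(abs(v), floor)) for v in spectrum]
    cepstrum = [c.real for c in idft(log_mag)]
    n = len(cepstrum)
    # Keep the low-quefrency terms and their mirror images so the envelope is real.
    liftered = [c if (q < n_cep or q > n - n_cep) else 0.0
                for q, c in enumerate(cepstrum)]
    envelope = [math.exp(v.real) for v in dft(liftered)]
    # Deconvolution in this sketch: divide the spectrum by the smoothed envelope.
    excitation = [s / h for s, h in zip(spectrum, envelope)]
    return spectrum, envelope, excitation

# Small synthetic segment (64 samples, 8 cepstral terms) to keep the naive DFT cheap.
segment = [math.sin(0.3 * n) + 0.5 * math.sin(0.9 * n) for n in range(64)]
spectrum, envelope, excitation = split_envelope_and_excitation(segment, n_cep=8)
```

Multiplying `excitation` by `envelope` bin by bin recovers `spectrum`, so substituting a different envelope before that multiplication changes the tone quality while leaving the excitation untouched.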

The transformation unit 34 is operable during a training mode to receive and store the sequence of cepstrum vectors K(mS,k) for both a target speaker and a source speaker as they utter identical scripts containing word sequences designed to elicit all of the sounds used in normal speech. The vectors representing this training speech are assembled into target and source code books, each unique to a particular speaker. These code books, along with a mapping code book establishing a correlation between target and source speech vectors, are stored for later use in speech transformation. The average pitch of the target and source voices is also determined during the training mode for later use during a transformation mode.

The transformation unit 34 includes a training step 54 that receives the cepstrum vectors K(mS,k) to generate and store the target, source and mapping code books during a training mode of operation. Training step 54 also determines the pitch signals Ps for each segment so as to determine and store indications of overall average pitch for both the target and the source.

Thereafter, during real time transformation mode of operation, the cepstrum vectors are received by a substitute step 56 that accesses the stored target, source and mapping code books and substitutes a target vector for each received source vector. A target vector is selected that best corresponds to the same speech content as the source vector.
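A minimal sketch of the substitute step, assuming a Euclidean nearest-neighbour search over the source code book and a mapping code book that is a simple list of target indices. Both are assumptions for illustration; the text does not state the distance measure or the form of the mapping, and the names `nearest_index` and `substitute` are hypothetical.

```python
def nearest_index(vector, code_book):
    """Index of the code book entry closest to vector (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(range(len(code_book)), key=lambda i: dist2(vector, code_book[i]))

def substitute(source_vector, source_book, target_book, mapping):
    """Replace a source vector with the target vector its nearest source entry maps to."""
    i = nearest_index(source_vector, source_book)
    return target_book[mapping[i]]

# Toy two-dimensional code books; real vectors would be 24 term cepstrum vectors.
source_book = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
target_book = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
mapping = [2, 0, 1]  # hypothetical: source entry i corresponds to target entry mapping[i]

result = substitute([0.9, 1.2], source_book, target_book, mapping)
```

Here the input is closest to source entry 1, which maps to target entry 0, so `result` is `[10.0, 10.0]`.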

A pitch adjustment step 58 responds to the ratio of the pitch indication P_TS for the source speech to the pitch indication P_TT for the target speech determined by the training step 54 to adjust the excitation spectrum E(k) for the change in pitch from source to target speech. The adjusted signal is designated E_PA(k). A convolver 60 then combines the target spectrum as represented by the substituted cepstrum vectors K_T(mS,k) with the pitch adjusted excitation signal E_PA(k) to produce a frequency domain, segmented transformed speech signal X_WT(mS,k) representing the utterances and excitation of the source speaker with the glottal or acoustical characteristics of the target speaker.
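One way to picture the pitch adjustment and the combining step is sketched below. It assumes linear pitch scaling by interpolating the magnitude of the excitation spectrum (FIG. 7 refers to such interpolation) and treats the convolver as a bin-by-bin product with the target envelope, the inverse of the division described in the Summary. The function names and the `scale` parameter are hypothetical; how `scale` is derived from the stored pitch indications is left open here.

```python
def pitch_scale(excitation_mag, scale):
    """Stretch a magnitude spectrum so a peak at bin k moves to bin scale*k.

    Values between bins are linearly interpolated; bins that would read
    beyond the end of the input are set to zero.
    """
    n = len(excitation_mag)
    out = []
    for k in range(n):
        pos = k / scale
        lo = int(pos)
        if lo >= n - 1:
            out.append(excitation_mag[n - 1] if pos <= n - 1 else 0.0)
            continue
        frac = pos - lo
        out.append((1 - frac) * excitation_mag[lo] + frac * excitation_mag[lo + 1])
    return out

def combine(target_envelope, adjusted_excitation):
    """Bin-by-bin product of the target envelope and the adjusted excitation."""
    return [h * e for h, e in zip(target_envelope, adjusted_excitation)]

# Toy excitation with a single harmonic peak at bin 5.
excitation_mag = [0.0] * 32
excitation_mag[5] = 1.0
adjusted = pitch_scale(excitation_mag, 2.0)      # peak moves to bin 10
transformed = combine([2.0] * 32, adjusted)      # flat toy envelope
```

A full implementation would also have to handle the phase (or real and imaginary parts) of the excitation spectrum, which this magnitude-only sketch ignores.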

The post processing unit responds to the transformed speech signal X_WT(mS,k) with an inverse discrete Fourier transform step 62, an inverse segmenting and windowing step 64 that recombines the overlapping segments into a single sequence of sampled data and a time duration adjustment step 66 that uses an LSEE/MSTM algorithm to generate a time domain, nonsegmented sampled data signal X_T(n) representing the transformed speech. A digital-to-analog converter and amplifier converts the sampled signal X_T(n) to a continuous analog electrical signal X_T(t).
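The inverse segmenting and windowing step can be approximated by a weighted overlap-add, sketched below. This is an assumption for illustration and is not the LSEE/MSTM algorithm named above: each windowed segment is summed at its start position and the sum is divided by the accumulated window weights. The names `hamming` and `overlap_add` are hypothetical.

```python
import math

def hamming(length):
    """Standard Hamming window: 0.54 - 0.46*cos(2*pi*n/(length-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def overlap_add(segments, window, shift):
    """Recombine windowed, overlapping segments into one sample sequence."""
    seg_len = len(window)
    total = (len(segments) - 1) * shift + seg_len
    acc = [0.0] * total
    norm = [0.0] * total
    for m, seg in enumerate(segments):
        start = m * shift
        for n in range(seg_len):
            acc[start + n] += seg[n]
            norm[start + n] += window[n]
    # The Hamming window never reaches zero, so every covered sample has norm > 0.
    return [a / w for a, w in zip(acc, norm)]

# Round trip on a short test signal: segment length 16, shift 4 (1/4 segment).
x = [math.sin(0.2 * n) for n in range(36)]
w = hamming(16)
segs = [[x[m * 4 + n] * w[n] for n in range(16)] for m in range(6)]
y = overlap_add(segs, w, 4)
```

When the segments are unmodified, `y` reproduces `x`; once the spectra have been altered, the overlapping copies no longer agree exactly and the normalized sum acts as an average of them.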

Referring now to FIG. 3, the digital sampling step 40 includes a low pass filter 80 and an analog-to-digital converter 82. The time varying source voice signal, x(t), from speech source 14 is filtered by a low pass filter 80 with a cutoff frequency of 4.5 kHz. Then the signal is converted from an analog to a digital signal by using an analog to digital converter 82 (A/D converter) which derives the sequence x(n) by valuing x(t) at t=nT=(n/f) where f is the sampling frequency of 10 kHz, T is the sampling period, and n increments from 0 to some count, X-1, at the end of a given source voice utterance interval.
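
The relation x(n) = x(t) at t = nT = n/f can be sketched as follows. The low pass filter 80 is not modeled, and the 1 kHz test tone and the function names are hypothetical.

```python
import math

SAMPLING_FREQUENCY_HZ = 10_000                   # f, from the text
SAMPLING_PERIOD_S = 1.0 / SAMPLING_FREQUENCY_HZ  # T = 1/f

def sample(signal, num_samples, f=SAMPLING_FREQUENCY_HZ):
    """Evaluate a continuous-time signal at t = n/f for n = 0 .. num_samples - 1."""
    return [signal(n / f) for n in range(num_samples)]

def tone(t):
    # Hypothetical 1 kHz test tone, well below the 4.5 kHz filter cutoff.
    return math.sin(2 * math.pi * 1000.0 * t)

x = sample(tone, 20)
```

At a 10 kHz sampling frequency, one period of the 1 kHz tone spans ten samples.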

As shown in FIG. 4, the sampled source voice signal, x(n), goes through a segmenting and windowing step 42 which breaks the signal into overlapping segments. Then the segments are windowed by a suitable windowing function such as a Hamming function illustrated in FIG. 5.

The combination of creating overlapping sequences of the speech signal and then windowing of these overlapping sequences at window function step 42 is used to isolate short segments of the speech signal by emphasizing a finite segment of the speech waveform in the vicinity of the sample and de-emphasizing the remainder of the waveform. Thus, the waveform in the time interval to be analyzed can be processed as if it were a short segment from a sustained sound with fixed properties. Also, the windowing function reduces the end point discontinuities when the windowed data is subjected to the discrete Fourier transformation (DFT) at step 44.

As illustrated in FIG. 4, the segmentation step 42 segments the discrete time signal into a plurality of overlapping segments or sections of the sampled waveform 48, which segments are sequentially numbered from m=0 to m=(M-1). Any specific sample can be identified as,

X(mS,n) = X(n)|_{n=(mS,n')},  0 ≤ n ≤ L-1    (1)

In equation (1), S represents the number of samples in the time dimension by which each successive window is shifted, otherwise known as the window shift size, L is the window size, and mS defines the beginning sample of a segment. The variable n is the ordinate position of a data sample within the sampled source data and n' is the ordinate position of a data sample within a segment. Because each sample, x(n), is redundantly represented in four different quadrants of four overlapping segments, the original source data, x(n), can be reconstructed with minimal distortion. In the preferred embodiment the segment size is L=256 and the window shift size is S=64 or 1/4 of the segment size.
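
The segmentation with L=256 and S=64 can be sketched as below. Dropping trailing samples that do not fill a complete segment is an assumption, since the text does not say how the end of an utterance is handled; the function name and test data are hypothetical.

```python
def segment(x, L=256, S=64):
    """Return overlapping segments; segment m holds samples x[mS] .. x[mS + L - 1]."""
    if len(x) < L:
        return []
    num_segments = (len(x) - L) // S + 1  # M, counting only complete segments (assumed)
    return [x[m * S : m * S + L] for m in range(num_segments)]

# Hypothetical input: sample values equal to their own index, for easy inspection.
x = list(range(1024))
segments = segment(x)

# Indices m of the segments that contain sample n = 500.
covering = [m for m in range(len(segments)) if m * 64 <= 500 < m * 64 + 256]
```

Away from the ends of the signal, each sample falls in four consecutive segments, once in each quarter of the segment length.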

Now referring to FIG. 5, each segment is subjected to a conventional windowing function, w(n), which is preferably a Hamming window function. The window function is also indexed from mS (the start of each segment) so as to multiply the speech samples in each segment directly with the selected window function to produce windowed samples, X_W(mS, n), in the time domain as follows:

X_W(mS, n) = X(mS, n) W(mS, n)    (2)

The Hamming window has the function, ##EQU1## The Hamming window reduces ripples at the expense of adding some distortion and produces a further smoothing of the spectrum. The Hamming window has tapered edges which allow periodic shifting of the analysis frame along an input signal without a large effect on the speech parameters created by pitch period boundary discontinuities or other sudden changes in the speech signal. Some alternative windowing functions are the Hanning, Blackman, Bartlett, and Kaiser windows which each have known respective advantages and disadvantages.
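
The window equation itself is not reproduced in this text. The sketch below assumes the commonly used textbook form w(n) = 0.54 - 0.46 cos(2πn/(L-1)) for 0 ≤ n ≤ L-1, which may differ in detail from the expression in the patent, and applies it to a segment as in equation (2). The function names are hypothetical.

```python
import math

def hamming_window(L):
    """Textbook Hamming window of length L (assumed form, not taken from the patent)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1)) for n in range(L)]

def apply_window(segment, window):
    """Multiply a segment by the window sample by sample, as in equation (2)."""
    return [s * w for s, w in zip(segment, window)]

w = hamming_window(256)
windowed = apply_window([1.0] * 256, w)  # constant segment, so the output equals the window
```

The tapered edges fall to 0.08 rather than zero, which is what distinguishes this window from the Hanning window.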

The allowable window duration is limited by the desired time resolution which usually corresponds to the rate at which spectral changes occur in speech. Short windows are used when high time resolution is important and when the smoothing of spectral harmonics into wider frequency formants is desirable. Long windows are used when individual harmonics must be resolved. The window size, L, in the preferred embodiment is a 256 point speech segment having 10,000 samples per second. An L-point Hamming window requires a minimum time overlap of 4 to 1; thus, the sampling period (or window shift size), S, must be less than or equal to L/4, or S ≤ 256/4 = 64 samples. To be sure that S is small enough to avoid time aliasing, a shift length of 64 samples has been chosen for the preferred embodiment.
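
As a worked check of these figures, the window and shift durations follow directly from L = 256, S = 64 and the 10 kHz sampling frequency f. The symbols T_window and T_shift are introduced here only for illustration.

```latex
\begin{aligned}
T_{\text{window}} &= \frac{L}{f} = \frac{256}{10\,000\ \text{Hz}} = 25.6\ \text{ms} \\
T_{\text{shift}}  &= \frac{S}{f} = \frac{64}{10\,000\ \text{Hz}} = 6.4\ \text{ms} \\
\frac{L}{S}       &= \frac{256}{64} = 4
\end{aligned}
```

The chosen shift therefore sits exactly at the 4 to 1 overlap limit stated above.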

Each windowed frame is subjected to a DFT 44 in the form of a 512-point fast Fourier transform (FFT) to create a frequency domain speech signal, X_W(mS,k), ##EQU2## where k is the frequency index and the frame length, N, is preferably selected to be 512.
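
The transform equation is not reproduced in this text. The sketch below assumes the standard DFT sum over n of x(n) e^{-j(2πk/N)n}, and further assumes that the 256-sample segment is zero-padded to the N = 512 frame length, which this section does not state explicitly. A direct sum is used for clarity rather than an FFT, and the function name is hypothetical.

```python
import cmath

def dft_zero_padded(frame, N=512):
    """Direct DFT of a frame, zero-padded to N points (assumed padding scheme)."""
    if len(frame) > N:
        raise ValueError("frame is longer than the transform length N")
    padded = list(frame) + [0.0] * (N - len(frame))
    spectrum = []
    for k in range(N):
        acc = 0j
        for n, sample in enumerate(padded):
            acc += sample * cmath.exp(-2j * cmath.pi * k * n / N)
        spectrum.append(acc)
    return spectrum

# Hypothetical frame: 256 samples of constant value 1.0.
frame = [1.0] * 256
spectrum = dft_zero_padded(frame)
```

For a real-valued frame the N output values are conjugate symmetric, so only about half of them carry independent information.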

The exponential function in this equation is the short time Fourier transform (STFT) function which transforms the frame from the time domain to the frequency domain. The DFT is used instead of the standard Fourier transform so that the frequency variable, k, will only take on N discrete values where N corresponds to the frame length of the DFT. Since the DFT is invertible, no information about the signal x(n) during the window is lost in the representation, X_W(mS,k), as long as the transform is sampled in frequency sufficiently often at N equally spaced values of k and the transform X_W(mS,k) has no zero valued terms among its N terms. Low values for N result in short frequency domain functions or windows, and DFTs using few points give poor frequency resolution since the window low pass filter is wide. Also, low values of segment length, L, yield good time resolution since the speech properties are averaged only over short time intervals. Large values of N, however, give poor time resolution and good frequency resolution. N must be large enough to minimize the interference of aliased copies of a segment on the copy of interest near n=0. As the DFT of x(n) provides information about how x(n) is composed of complex exponentials at different frequencies, the transform, X_W(mS,k), is referred to as the spectrum of x(n). This time dependent DFT can be interpreted as a smoothed version of the Fourier transform of each windowed finite length speech segment.
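
The invertibility of the DFT can be illustrated with a small round trip. This is a sketch using the standard forward and inverse DFT definitions on a short hypothetical sequence, not the 512-point transform of the preferred embodiment.

```python
import cmath

def dft(x):
    """Forward DFT: X(k) = sum over n of x(n) * exp(-j*2*pi*k*n/N)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT: x(n) = (1/N) * sum over k of X(k) * exp(+j*2*pi*k*n/N)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

# Hypothetical short sequence standing in for one windowed segment.
x = [0.0, 1.0, 0.5, -0.25, 2.0, -1.0, 0.75, 0.1]
X = dft(x)
recovered = [value.real for value in idft(X)]
round_trip_error = max(abs(a - b) for a, b in zip(x, recovered))
```

The recovered samples match the input to within floating point rounding, which is the sense in which no information is lost.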

The N values of the DFT, X_W(mS,k), can be computed very efficiently by a set of computational algorithms known collectively as the fast Fourier transform (FFT) in a time roughly proportional to N log_2 N instead of the 4N^2 real multiplications and N(4N-2) real additions required by the DFT. These algorithms exploit both the symmetry and periodicity of the sequence e^{-j(2πk/N)n}. They also decompose the DFT computation into successively smaller DFTs. (See A. Oppenheim and R. Schafer, Digital Signal Processing, Prentice-Hall, 1975 (see especially pages 284-327) and L. Rabiner and R. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978 (see especially pages 303-306) which are hereb