WikiPatents - Community Patent Review
Processing of acoustic waveforms    
United States Patent     4885790
Link to this page     http://www.wikipatents.com/4885790.html
Inventor(s)     McAulay; Robert J. (Lexington, MA); Quatieri, Jr.; Thomas F. (Newton, MA)
Abstract     A sinusoidal model for acoustic waveforms is applied to develop a new analysis/synthesis technique which characterizes a waveform by the amplitudes, frequencies, and phases of component sine waves. These parameters are estimated from a short-time Fourier transform. Rapid changes in the highly-resolved spectral components are tracked using the concept of "birth" and "death" of the underlying sine waves. The component values are interpolated from one frame to the next to yield a representation that is applied to a sine wave generator. The resulting synthetic waveform preserves the general waveform shape and is perceptually indistinguishable from the original. Furthermore, in the presence of noise the perceptual characteristics of the waveform as well as the noise are maintained. The method and devices are particularly useful in speech coding, time-scale modification, frequency scale modification and pitch modification.
   














Processing of acoustic waveforms
Inventor     McAulay; Robert J. (Lexington, MA); Quatieri, Jr.; Thomas F. (Newton, MA)
Owner/Assignee     Massachusetts Institute of Technology (Cambridge, MA)
Publication Date     December 5, 1989
Application Number     07/339,957
Filing Date     April 18, 1989
US Classification     704/265 704/261
Int'l Classification     G10L 003/00 G10L 005/00
Examiner     Harkcom; Gary V.
Assistant Examiner     Knepper; David D.
Attorney/Law Firm     Engellenner; Thomas J.
USPTO Field of Search     381/29 381/30 381/31 381/32 381/33 381/34 381/35 381/36 381/37 381/38 381/39 381/40 381/41
 References
 U.S. References
Reference     Inventor     Class     Date
3296374
3360610
3484556
4701955     Taguchi     704/223     Oct 1987
4076958     Fulghum     704/268     Feb 1978
4058676     Wilkes     704/220     Nov 1977
4034160     Van Gerwen     704/201     Jul 1977
3982070     Flanagan     704/265     Sep 1976
3978287     Fletcher     704/231     Aug 1976
 Claims
 


We claim:

1. A method of processing an acoustic waveform, the method comprising:

sampling the waveform to obtain a series of discrete samples and constructing therefrom a series of frames, each frame spanning a plurality of samples;

analyzing each frame of samples to extract a set of variable frequency components having individual amplitudes;

matching said variable components from one frame to a next frame such that a component in one frame is matched with a component in a successive frame that has a similar value regardless of shifts in frequency and spectral energy; and

interpolating the matched values of the components from the one frame to the next frame to obtain a parametric representation of the waveform whereby a synthetic waveform can be constructed by generating a set of sine waves corresponding to the interpolated values of the parametric representation.

2. The method of claim 1 wherein the step of sampling further includes determining a pitch period for said waveform and varying the length of the frame in accordance with the pitch period, the length being at least twice the pitch period of the waveform.

3. The method of claim 2 wherein the step of sampling further includes sampling the waveform according to a pitch-adaptive Hamming window.

4. The method of claim 1 wherein the step of analyzing further includes analyzing each frame by Fourier analysis.

5. The method of claim 1 wherein the step of analyzing further includes selecting a harmonic series to approximate the frequency components.

6. The method of claim 5 wherein the step of selecting a harmonic series further includes determining a pitch period for the waveform and varying the number of frequency components in the harmonic series in accordance with the pitch period of the waveform.

7. The method of claim 1 wherein the step of tracking further includes matching a frequency component from the one frame with a component in the next frame having a similar value.

8. The method of claim 7 wherein said matching further provides for the birth of new frequency components and the death of old frequency components.

9. The method of claim 1 wherein the step of interpolating values further includes defining a series of instantaneous frequency values by interpolating matched frequency components from the one frame to the next frame and then integrating the series of instantaneous frequency values to obtain a series of interpolated phase values.

10. The method of claim 1 wherein the step of interpolating further includes deriving phase values from frequency and phase measurements taken at each frame and then interpolating the phase measurements.

11. The method of claim 1 wherein the step of interpolating is achieved by performing an overlap and add function.

12. The method of claim 1 wherein the method further includes coding the frequency components for digital transmission.

13. The method of claim 12 wherein the frequency components are limited to a predetermined number defined by a plurality of harmonic frequency bins.

14. The method of claim 13 wherein the amplitude of only one of said components is coded for gain and the amplitudes of the others are coded relative to the neighboring component at the next lowest frequency.

15. The method of claim 12 wherein the phases are coded by applying pulse code modulation techniques to a predicted phase residual.

16. The method of claim 12 wherein high frequency regeneration is applied.

17. The method of claim 1 wherein the method further comprises constructing a synthetic waveform by generating a series of constituent sine waves corresponding in frequency and amplitude to the extracted components.

18. The method of claim 17 wherein the time-scale of said reconstructed waveform is varied by changing the rate at which said series of constituent sine waves are interpolated.

19. The method of claim 18 wherein the time-scale is continuously variable over a defined range.

20. The method of claim 17 wherein the pitch of the synthetic waveform is varied by adjusting the frequency of each frequency component while maintaining the overall spectral envelope.

21. The method of claim 1 wherein the method further comprises constructing a synthetic waveform by generating a series of constituent sine waves corresponding in frequency, amplitude, and phase to the extracted components.

22. The method of claim 21 wherein the time-scale of said reconstructed waveform is varied by changing the rate at which said series of constituent sine waves are interpolated.

23. The method of claim 22 wherein the time-scale is continuously variable over a defined range.

24. The device of claim 22 wherein the device further comprises means for constructing a synthetic waveform by generating a series of constituent sine waves corresponding in frequency and amplitude to the extracted components.

25. The device of claim 24 wherein the device further includes means for varying the time-scale of said reconstructed waveform by changing the rate at which said series of constituent sine waves are interpolated.

26. The device of claim 25 wherein the means for varying the time-scale is continuously variable over a defined range.

27. The device of claim 24 wherein the constituent sine waves are further defined by system contributions and excitation contributions and wherein the means for varying the time-scale of said reconstructed waveform further includes means for changing the rate at which parameters defining the system contributions of the sine waves are interpolated.

28. The device of claim 27 wherein the device further includes a scaling means for scaling the frequency components.

29. The device of claim 27 wherein the device further includes a scaling means for scaling the excitation-contributed frequency components.

30. The method of claim 21 wherein the constituent sine waves are further defined by system contributions and excitation contributions and wherein the time-scale of said reconstructed waveform is varied by changing the rate at which parameters defining the system contributions of the sine waves are interpolated.

31. The method of claim 30 wherein the pitch of the synthetic waveform is altered by adjusting the frequencies of the excitation-contributed frequency components while maintaining the overall spectral envelope.

32. A device for processing an acoustic waveform, the device comprising:

sampling means for sampling the waveform to obtain a series of discrete samples and constructing therefrom a series of frames, each frame spanning a plurality of samples;

analyzing means for analyzing each frame of samples to extract a set of variable frequency components having individual amplitudes;

matching means for matching said variable components from one frame to a next frame such that a component in one frame is matched with a component in a successive frame that has a similar value regardless of shifts in frequency and spectral energy; and

interpolating means for interpolating the matched values of the components from the one frame to the next frame to obtain a parametric representation of the waveform whereby a synthetic waveform can be constructed by generating a set of sine waves corresponding to the interpolated values of the parametric representation.

33. The device of claim 32 wherein the sampling means further includes means for constructing a frame having variable length, which varies in accordance with the pitch period, the length being at least twice the pitch period of the waveform.

34. The device of claim 32 wherein the sampling means further includes means for sampling according to a Hamming window.

35. The device of claim 32 wherein the analyzing means further includes means for analyzing each frame by Fourier analysis.

36. The device of claim 32 wherein the analyzing means further includes means for selecting a harmonic series to approximate the frequency components.

37. The device of claim 36 wherein the number of frequency components in the harmonic series varies according to the pitch period of the waveform.

38. The device of claim 32 wherein the tracking means further includes means for matching a frequency component from the one frame with a component in the next frame having a similar value.

39. The device of claim 38 wherein said matching means further provides for the birth of new frequency components and the death of old frequency components.

40. The device of claim 38 wherein the frequency components are limited to a predetermined number defined by a plurality of harmonic frequency bins.

41. The device of claim 40 wherein the amplitude of only one of said components is coded for gain and the amplitudes of the others are coded relative to the neighboring component of the next lowest frequency.

42. The device of claim 32 wherein the interpolating means further includes means defining a series of instantaneous frequency values by interpolating matched frequency components from the one frame to the next frame and means for integrating the series of instantaneous frequency values to obtain a series of interpolated phase values.

43. The device of claim 32 wherein the interpolating means further includes means for deriving phase values from the frequency and phase measurements taken at each frame and then interpolating the phase measurements.

44. The device of claim 32 wherein the interpolating means further includes means for performing an overlap and add function.

45. The device of claim 32 wherein the device further includes coding means for coding the frequency components for digital transmission.

46. The device of claim 45 wherein the coding means further comprises means for applying pulse code modulation techniques to a predicted phase residual.

47. The device of claim 45 wherein the coding means further comprises means for generating high frequency components.

48. The device of claim 32 wherein the device further comprises means for constructing a synthetic waveform by generating a series of constituent sine waves corresponding in frequency, amplitude, and phase to the extracted components.

49. The device of claim 48 wherein the device further includes means for varying the time-scale of said reconstructed waveform by changing the rate at which said series of constituent sine waves are interpolated.

50. The device of claim 49 wherein the means for varying the time-scale is continuously variable over a defined range.

51. A coded speech transmission system comprising:

sampling means for sampling a speech waveform to obtain a series of discrete samples and for constructing therefrom a series of frames, each frame spanning a plurality of samples;

analyzing means for analyzing each frame of samples by Fourier analysis to extract a set of variable frequency components having individual amplitude values;

coding means for coding the component values;

decoding means for decoding the coded values after transmission and for reconstituting the variable components;

matching means for matching the reconstituted, variable components from one frame to a next frame such that a component in one frame is matched with a component in a successive frame that has a similar value regardless of shifts in frequency and spectral energy; and

interpolation means for interpolating the values of the frequency components from the one frame to the next frame to obtain a representation of the waveform whereby synthetic speech can be constructed by generating a set of sine waves corresponding to the interpolated values of the parametric representation.

52. The device of claim 51 wherein the coding means further includes means for selecting a harmonic series of bins to approximate the frequency components and the number of bins varies according to the pitch of the waveform.

53. The device of claim 51 wherein the amplitude of only one of said components is coded for gain and the amplitudes of the other components are coded relative to the neighboring component at the next lowest frequency.

54. The device of claim 51 wherein the amplitudes of the components are coded by linear prediction techniques.

55. The device of claim 51 wherein the amplitudes of the components are coded by adaptive delta modulation techniques.

56. The device of claim 51 wherein the analyzing means further comprises means for measuring phase values for each frequency component.

57. The device of claim 56 wherein the coding means further includes means for coding the phase values by applying pulse code modulations to a predicted phase residual.

58. A device for altering the time-scale of an audible waveform, the device comprising:

sampling means for sampling the waveform to obtain a series of discrete samples and constructing therefrom a series of frames, each frame spanning a plurality of samples;

analyzing means for analyzing each frame of samples to extract a set of variable frequency components having individual amplitudes;

matching means for matching said variable components from one frame to a next frame such that a component in one frame is matched with a component in a successive frame that has a similar value regardless of shifts in frequency and spectral energy;

interpolating means for interpolating the amplitude and frequency values of the components from the one frame to the next frame to obtain a representation of the waveform whereby a synthetic waveform can be constructed by generating a set of sine waves corresponding to the interpolated representation;

interpolation rate adjusting means for altering the rate of interpolation; and

synthesizing means for constructing a time-scaled synthetic waveform by generating a series of constituent sine waves corresponding in frequency and amplitude to the extracted components, the sine waves being generated at said alterable interpolation rate.

59. The device of claim 58 wherein the interpolation rate adjusting means is continuously variable over a defined range.

60. The device of claim 58 wherein the analyzing means further comprises means for measuring phase values for each frequency component.

61. The device of claim 60 wherein the component phase values are interpolated by cubic interpolation.

62. The device of claim 60 wherein the interpolation rate adjusting means is continuously variable over a defined range and further includes means for adjusting the rate of phase value interpolations.

63. The device of claim 60 wherein the device further comprises means for separating the measured frequency components into system contributions and excitation contributions and wherein the interpolation rate adjusting means varies the time-scale of the synthetic waveform by altering the rate at which values defining the system contributions are interpolated.

64. The device of claim 63 wherein the interpolation rate adjusting means alters the rate at which the system amplitudes and phases and the excitation amplitudes and frequencies are interpolated.
 Description
 


TECHNICAL FIELD

The field of this invention is speech technology generally and, in particular, methods and devices for analyzing, digitally-encoding, modifying and synthesizing speech or other acoustic waveforms.

BACKGROUND OF THE INVENTION

Typically, the problem of representing speech signals is approached by using a speech production model in which speech is viewed as the result of passing a glottal excitation waveform through a time-varying linear filter that models the resonant characteristics of the vocal tract. In many speech applications it suffices to assume that the glottal excitation can be in one of two possible states corresponding to voiced or unvoiced speech. In the voiced speech state the excitation is periodic with a period which is allowed to vary slowly over time relative to the analysis frame rate (typically 10-20 msecs). For the unvoiced speech state the glottal excitation is modelled as random noise with a flat spectrum. In both cases the power level in the excitation is also considered to be slowly time-varying.

While this binary model has been used successfully to design narrowband vocoders and speech synthesis systems, its limitations are well known. For example, the excitation is often mixed, having both voiced and unvoiced components simultaneously, and often only portions of the spectrum are truly harmonic. Furthermore, the binary model requires that each frame of data be classified as either voiced or unvoiced, a decision which is particularly difficult to make if the speech is also subject to additive acoustic noise.

Speech coders at rates compatible with conventional transmission lines (i.e. 2.4-9.6 kilobits per second) would meet a substantial need. At such rates the binary model is ill-suited for coding applications. Additionally, speech processing devices and methods that allow the user to modify various parameters in reconstructing a waveform would find substantial usage. For example, time-scale modification (without pitch alteration) would be a very useful feature for a variety of speech applications (e.g. slowing down speech for translation purposes or speeding it up for scanning purposes) as well as for musical composition or analysis. Unfortunately, time-scale (and other parameter) modifications also are not accomplished with high quality by devices employing the binary model.

Thus, there exists a need for better methods and devices for processing audible waveforms. In particular, speech coders operable at mid-band rates and in noisy environments as well as synthesizers capable of maintaining the perceptual quality of speech while changing the rate of articulation would satisfy long-felt needs and provide substantial contributions to the art.

SUMMARY OF THE INVENTION

It has been discovered that speech analysis and synthesis as well as coding and time-scale modification can be accomplished simply and effectively by employing a time-frequency representation of the speech waveform which is independent of the speech state. Specifically, a sinusoidal model for the speech waveform is used to develop a new analysis-synthesis technique.

The basic method of the invention includes the steps of: (a) selecting frames (i.e. windows of about 20-40 milliseconds) of samples from the waveform; (b) analyzing each frame of samples to extract a set of frequency components; (c) tracking the components from one frame to the next; and (d) interpolating the values of the components from one frame to the next to obtain a parametric representation of the waveform. A synthetic waveform can then be constructed by generating a series of sine waves corresponding to the parametric representation.
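Step (b) can be sketched as follows, assuming the frequency components are taken as the largest local maxima of a Hamming-windowed discrete Fourier transform of the frame. The function names `hamming` and `dft_peaks`, the `max_peaks` parameter, and the amplitude normalization are illustrative assumptions rather than anything specified in the patent; a direct DFT is used only to keep the sketch dependency-free.

```python
import cmath
import math

def hamming(n_samples):
    """Symmetric Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def dft_peaks(frame, fs, max_peaks=5):
    """Return up to max_peaks (freq_hz, amplitude, phase) tuples, largest first.

    Hypothetical helper: peaks are local maxima of the windowed DFT magnitude.
    """
    n = len(frame)
    win = hamming(n)
    x = [s * w for s, w in zip(frame, win)]
    half = n // 2
    # Direct O(N^2) DFT over the non-negative frequency bins.
    spec = [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(half + 1)]
    scale = 2.0 / sum(win)  # assumed normalization so a unit sine reads close to 1.0
    mags = [abs(c) * scale for c in spec]
    peaks = []
    for k in range(1, half):
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
            peaks.append((mags[k], k * fs / n, cmath.phase(spec[k])))
    peaks.sort(reverse=True)  # largest amplitude first
    return [(f, a, p) for a, f, p in peaks[:max_peaks]]
```

For a frame containing a single tone that falls exactly on a DFT bin, the first returned peak sits at the tone frequency with an amplitude close to the tone's amplitude.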

In one simple embodiment of the invention, a device is disclosed which uses only the amplitudes and frequencies of the component sine waves to represent the waveform. In this so-called "magnitude-only" system, phase continuity is maintained by defining the phase to be the integral of the instantaneous frequency. In a more comprehensive embodiment, explicit use is made of the measured phases as well as the amplitudes and frequencies of the components.

The invention is particularly useful in speech coding and time-scale modification and has been demonstrated successfully in both of these applications. Robust devices can be built according to the invention to operate in environments of additive acoustic noise. The invention also can be used to analyze single and multiple speaker signals, music or even biological sounds. The invention will also find particular applications, for example, in reading machines for the blind, in broadcast journalism editing and in transmission of music to remote players.

In one illustrated embodiment of the invention, the basic method summarized above is employed to choose amplitudes, frequencies, and phases corresponding to the largest peaks in a periodogram of the measured signal, independently of the speech state. In order to reconstruct the speech waveform, the amplitudes, frequencies, and phases of the sine waves estimated on one frame are matched and allowed to continuously evolve into the corresponding parameter set on the successive frame. Because the number of estimated peaks is not constant and slowly varying, the matching process is not straightforward. Rapidly varying regions of speech such as unvoiced/voiced transitions can result in large changes in both the location and number of peaks. To account for such rapid movements in spectral energy, the concept of "birth" and "death" of sinusoidal components is employed in a nearest-neighbor matching method based on the frequencies estimated on each frame. If a new peak appears, a "birth" is said to occur and a new track is initiated. If an old peak is not matched, a "death" is said to occur and the corresponding track is allowed to decay to zero. Once the parameters on successive frames have been matched, phase continuity of each sinusoidal component is ensured by unwrapping the phase. In one preferred embodiment the phase is unwrapped using a cubic phase interpolation function having parameter values that are chosen to satisfy the measured phase and frequency constraints at the frame boundaries while maintaining maximal smoothness over the frame duration. Finally, the corresponding sinusoidal amplitudes are simply interpolated in a linear manner across each frame.
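The nearest-neighbor matching with "birth" and "death" can be illustrated with the simplified sketch below. It is a greedy closest-pair version under stated assumptions, not the detailed procedure of FIGS. 4A-4F: the function name `match_tracks` and the matching interval `delta` (in Hz) are hypothetical, and conflicts are resolved simply by taking the closest pairs first.

```python
def match_tracks(prev_freqs, next_freqs, delta=50.0):
    """Greedy nearest-neighbour matching of peak frequencies between two frames.

    Returns (matches, deaths, births):
      matches: list of (index_in_prev, index_in_next) pairs
      deaths:  indices in prev_freqs with no partner (track decays to zero)
      births:  indices in next_freqs with no partner (new track starts)
    delta is an assumed matching interval in Hz.
    """
    # All candidate pairs within the matching interval, closest first.
    candidates = sorted(
        (abs(fp - fn), i, j)
        for i, fp in enumerate(prev_freqs)
        for j, fn in enumerate(next_freqs)
        if abs(fp - fn) <= delta
    )
    used_prev, used_next, matches = set(), set(), []
    for _, i, j in candidates:
        if i not in used_prev and j not in used_next:
            matches.append((i, j))
            used_prev.add(i)
            used_next.add(j)
    deaths = [i for i in range(len(prev_freqs)) if i not in used_prev]
    births = [j for j in range(len(next_freqs)) if j not in used_next]
    return sorted(matches), deaths, births
```

With `prev_freqs = [200, 400, 600]` and `next_freqs = [210, 590, 1500]`, the 200 Hz and 600 Hz tracks continue, the 400 Hz track dies, and a new track is born at 1500 Hz.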

In speech coding applications, pitch estimates are used to establish a set of harmonic frequency bins to which the frequency components are assigned. (Pitch is used herein to mean the fundamental rate at which a speaker's vocal cords are vibrating). The amplitudes of the components can be coded directly using adaptive pulse code modulation (ADPCM) across frequency or indirectly using linear predictive coding. In each harmonic frequency bin the peak having the largest amplitude is selected and assigned to the frequency at the center of the bin. This results in a harmonic series based upon the coded pitch period. The phases can then be coded by using the frequencies to predict phase at the end of the frame, unwrapping the measured phase with respect to this prediction and then coding the phase residual using 4 bits per phase peak. If there are not enough bits available to code all of the phase peaks (e.g. for low-pitch speakers), phase tracks for the high frequency peaks can be artificially generated. In one preferred embodiment, this is done by translating the frequency tracks of the base band peaks to the high frequency of the uncoded phase peaks. This new coding scheme has the important property of adaptively allocating the bits for each speaker and hence is self-tuning to both low- and high-pitched speakers. Although pitch is used to provide side information for the coding algorithm, the standard voice-excitation model for speech is not used. This means that recourse is never made to a voiced-unvoiced decision. As a consequence the invention is robust in noise and can be applied at various data transmission rates simply by changing the rules for the bit allocation.
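The bin assignment described above can be sketched as follows. This is a minimal illustration that assumes bin k is centered on the k-th harmonic of the pitch and is one pitch interval wide; the function name `assign_to_harmonic_bins` and the `max_freq` cutoff are hypothetical, and the quantization of amplitudes and phases is not shown.

```python
def assign_to_harmonic_bins(peaks, f0, max_freq):
    """Keep the largest-amplitude peak in each harmonic bin.

    peaks:    list of (freq_hz, amplitude) measured on one frame
    f0:       coded pitch (fundamental) in Hz
    max_freq: assumed upper band edge in Hz
    Returns a list of (bin_center_hz, amplitude), one entry per occupied bin.
    """
    n_bins = int(max_freq // f0)
    best = {}
    for freq, amp in peaks:
        k = int(round(freq / f0))  # nearest harmonic number
        if 1 <= k <= n_bins and amp > best.get(k, 0.0):
            best[k] = amp
    # Each surviving peak is moved to the center of its bin.
    return [(k * f0, best[k]) for k in sorted(best)]
```

Because the number of bins is `max_freq // f0`, a low-pitched speaker produces more bins than a high-pitched one, which is what makes the bit allocation speaker-dependent.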

The invention is also well-suited for time-scale modification, which is accomplished by time-scaling the amplitudes and phases such that the frequency variations are preserved. The time-scale at which the speech is played back is controlled simply by changing the rate at which the matched peaks are interpolated. This means that the time-scale can be speeded up or slowed down by any factor and this factor can be time-varying. This rate can be controlled by a panel knob which allows an operator complete flexibility for varying the time-scale. There is no perceptual delay in performing the time-scaling.
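A minimal sketch of this idea for a single matched track in the magnitude-only case is given below. The names `synth_track` and `rho` (the time-scale factor) are hypothetical; amplitude and frequency are interpolated linearly between the two frame boundaries over `rho` times the original frame length, and the phase is accumulated from the instantaneous frequency so that pitch is unchanged.

```python
import math

def synth_track(a0, a1, f0, f1, frame_len, fs, rho=1.0, phase0=0.0):
    """Synthesize one track between two frames at an altered interpolation rate.

    a0, a1: amplitudes at the start and end frame boundary
    f0, f1: frequencies (Hz) at the start and end frame boundary
    rho:    assumed time-scale factor (>1 slows playback, <1 speeds it up)
    Returns (samples, final_phase).
    """
    n_out = max(1, int(round(rho * frame_len)))
    out, phase = [], phase0
    for n in range(n_out):
        t = n / n_out                      # position within the stretched frame
        amp = a0 + (a1 - a0) * t           # linear amplitude interpolation
        freq = f0 + (f1 - f0) * t          # linear frequency interpolation
        phase += 2 * math.pi * freq / fs   # phase as running sum of frequency
        out.append(amp * math.sin(phase))
    return out, phase
```

With `rho=2.0` the same parameter trajectory is traversed over twice as many samples, so a 100 Hz track stays at 100 Hz but lasts twice as long. Since `rho` is just an argument, it can differ from frame to frame.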

The invention will next be described in connection with certain illustrated embodiments. However, it should be clear that various changes and modifications can be made by those skilled in the art without departing from the spirit and scope of the invention. For example other sampling techniques can be substituted for the use of a variable frame length and Hamming window. Moreover the length of such frames and windows can vary in response to the particular application. Likewise, frequency matching can be accomplished by various means. A variety of commercial devices are available to perform Fourier analysis; such analysis can also be performed by custom hardware or specially-designed programs.

Various techniques for extracting pitch information can be employed. For example, the pitch period can be derived from the Fourier transform. Other techniques such as the Gold-Malpass techniques can also be used. See generally, M. L. Malpass, "The Gold Pitch Detector in a Real Time Environment" Proc. of EASCON 1975 (Sept. 1975); B. Gold, "Description of a Computer Program for Pitch Detection", Fourth International Congress on Acoustics, Copenhagen Aug. 21-28, 1962 and B. Gold, "Note on Buzz-Hiss Detection", J. Acoust. Soc. Amer. 365, 1659-1661 (1964), all incorporated herein by reference.

Various coding techniques can also be used interchangeably with those described below. Channel encoding techniques are described in J. N. Holmes, "The JSRU Channel Vocoder", Inst. of Electrical Eng. Proceedings (British), 27, 53-60 (1980). Adaptive pulse code modulation is described in L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals (Prentice Hall 1978). Linear predictive coding is described by J. D. Markel, Linear Prediction of Speech (Springer-Verlag, 1976). These teachings are also incorporated by reference.

It should be appreciated that the term "interpolation" is used broadly in this application to encompass various techniques for filling in data values between those measured at the frame boundaries. In the magnitude-only system linear interpolation is employed to fill in amplitude and frequency values. In this simple system phase values are obtained by first defining a series of instantaneous frequency values by interpolating matched frequency components from one frame to the next and then integrating the series of instantaneous frequency values to obtain a series of interpolated phase values. In the more comprehensive system the phase value of each frame is derived directly and a cubic polynomial equation preferably is employed to obtain maximally smooth phase interpolations from frame to frame.
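The cubic phase interpolation of the more comprehensive system can be sketched as below. This section does not state the polynomial coefficients, so the sketch assumes the closed form commonly given for this kind of interpolation in the sinusoidal-modeling literature: the cubic matches the measured phase and frequency at both frame boundaries, and the integer number of 2π wraps `M` is chosen to make the phase track as smooth as possible. The function name `cubic_phase` and the units (radians and radians per sample) are assumptions.

```python
import math

def cubic_phase(theta0, omega0, theta1, omega1, T):
    """Cubic phase track between two frame boundaries T samples apart.

    theta0, theta1: measured phases (rad) at the start and end boundary
    omega0, omega1: measured frequencies (rad/sample) at the boundaries
    Returns (theta, M) where theta(t) is the interpolated phase for 0 <= t <= T
    and M is the chosen number of 2*pi wraps (assumed smoothness rule).
    """
    x = ((theta0 + omega0 * T - theta1) + (omega1 - omega0) * T / 2.0) / (2.0 * math.pi)
    M = round(x)
    D = theta1 - theta0 - omega0 * T + 2.0 * math.pi * M
    dw = omega1 - omega0
    alpha = 3.0 * D / T ** 2 - dw / T
    beta = -2.0 * D / T ** 3 + dw / T ** 2

    def theta(t):
        return theta0 + omega0 * t + alpha * t ** 2 + beta * t ** 3

    return theta, M
```

By construction `theta(0)` equals the measured start phase with slope `omega0`, and `theta(T)` equals the measured end phase plus `2*pi*M` with slope `omega1`.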

Other techniques that accomplish the same purpose are also referred to in this application as interpolation techniques. For example, the so-called "overlap and add" method of filling in data values can also be used. In this method a weighted overlapping function can be applied to the resulting sine waves generated during each frame and then the overlapped values can be summed to fill in the values between those measured at the frame boundaries.
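The overlap and add method can be illustrated with the following sketch. It assumes a triangular weighting function spanning two frame intervals, with each frame's sine waves generated at constant amplitude, frequency and phase referenced to the frame center; the function name `overlap_add` and the parameter `hop` (frame spacing in samples) are hypothetical.

```python
import math

def overlap_add(frames, hop, fs):
    """Overlap-add synthesis from per-frame sine-wave parameters.

    frames: list of frames, each a list of (amplitude, freq_hz, phase_rad)
            with phase referenced to the frame center
    hop:    assumed spacing between frame centers, in samples
    Returns the summed output as a list of floats.
    """
    out = [0.0] * (hop * (len(frames) + 1))
    for k, params in enumerate(frames):
        start = k * hop                      # frame k is centered at start + hop
        for n in range(2 * hop):
            w = 1.0 - abs(n - hop) / hop     # triangular weight, 1.0 at the center
            val = sum(a * math.sin(2 * math.pi * f * (n - hop) / fs + p)
                      for a, f, p in params)
            out[start + n] += w * val
    return out
```

Adjacent triangular weights sum to one between frame centers, so a steady tone with consistent phases is reproduced exactly there; when the parameters change from frame to frame, the weighting cross-fades between the two sets of sine waves.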

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of one embodiment of the invention in which only the magnitude and frequencies of the components are used to reconstruct a sampled waveform.

FIG. 2 is an illustration of the extracted amplitude and frequency components of a waveform sampled according to the present invention.

FIG. 3 is a general illustration of the frequency matching method of the present invention.

FIGS. 4A-4F are detailed schematic illustrations of a frequency matching method according to the present invention.

FIG. 5 is an illustration of tracked frequency components of an exemplary speech pattern.

FIG. 6 is a schematic block diagram of another embodiment of the invention in which magnitude and phase of frequency components are used to reconstruct a sampled waveform.

FIG. 7 is an illustrative set of cubic phase interpolation functions for smoothing the phase functions useful in connection with the embodiment of FIG. 6 from which the "maximally smooth" phase function is selected.

FIG. 8 is a schematic block diagram of another embodiment of the invention particularly useful for time-scale modification.

FIG. 9 is a schematic block diagram showing an embodiment of the system estimation function of FIG. 8.

FIG. 10 is a block diagram of one real-time implementation of the invention.

DETAILED DESCRIPTION

In the present invention the speech waveform is modelled as a sum of sine waves. If s(n) represents the sampled speech waveform then

$$s(n) = \sum_i a_i(n) \sin[\phi_i(n)] \qquad (1)$$

where $a_i(n)$ and $\phi_i(n)$ are the time-varying amplitude and phase of the $i$'th tone.

In a simple embodiment the phase can be defined to be the integral of the instantaneous frequency $f_i(n)$ and therefore satisfies the recursion

$$\phi_i(n) = \phi_i(n-1) + 2\pi f_i(n)/f_s \qquad (2)$$

where $f_s$ is the sampling frequency. If the tones are harmonically related, then

$$f_i(n) = i \, f_0(n) \qquad (3)$$

where $f_0(n)$ represents the fundamental frequency at time $n$. One particularly attractive property of the above model is the fact that phase continuity, hence waveform continuity, is guaranteed as a consequence of the definition of phase in terms of the instantaneous frequency. This means that waveform reconstruction is possible from the "magnitude-only" spectrum since a high-resolution spectral analysis reveals the amplitudes and frequencies of the component sine waves.
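As a minimal sketch of Equations 1 to 3, the following Python fragment sums harmonically related sine waves whose phases are advanced by the recursion of Equation 2. The function name `synthesize_harmonics` and the simplification that each harmonic amplitude is held constant over time are assumptions made for illustration only.

```python
import math


def synthesize_harmonics(amplitudes, f0_track, fs):
    """Sum-of-sines synthesis with phase defined by the frequency recursion.

    amplitudes : amplitude of each harmonic, index 0 holding the first harmonic
                 (held constant over time in this sketch)
    f0_track   : fundamental frequency in Hz, one value per output sample
    fs         : sampling frequency in Hz
    """
    phases = [0.0] * len(amplitudes)
    out = []
    for f0 in f0_track:
        sample = 0.0
        for i, a in enumerate(amplitudes, start=1):
            # Equation 2 with the harmonic frequency i * f0 of Equation 3
            phases[i - 1] += 2.0 * math.pi * (i * f0) / fs
            sample += a * math.sin(phases[i - 1])   # one term of Equation 1
        out.append(sample)
    return out
```

Since each phase is only ever incremented, never reset, the output remains continuous however the fundamental frequency varies from sample to sample.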

A block diagram of an analysis/synthesis system according to the invention is illustrated in FIG. 1. As shown in FIG. 1, system 10 includes sampling window 11, a discrete Fourier transform (DFT) analyzer 12, magnitude computer 13, a frequency amplitude estimator 14, and an optional coder 16 in the transmitter segment and a frequency matching means 18, an interpolator 20 and a sine wave generator 22 in the receiver segment of the system. The peaks of the magnitude of the discrete Fourier transform (DFT) of a windowed waveform are found simply by determining the locations of a change in slope (concave down). In addition, the total number of peaks can be limited and this limit can be adapted to the expected average pitch of the speaker.
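The peak-finding operation can be sketched as below. The function name `pick_peaks` is hypothetical, and the rule used here to enforce the limit on the number of peaks (retaining the largest ones) is an assumption; the specification states only that the total number of peaks can be limited.

```python
def pick_peaks(magnitudes, max_peaks=None):
    """Return the bin indices at which the DFT magnitude has a local maximum.

    A peak is taken to be a bin where the slope changes from positive to
    non-positive. If max_peaks is given, only the largest peaks are kept
    (an assumed selection rule), returned in order of increasing frequency.
    """
    peaks = [k for k in range(1, len(magnitudes) - 1)
             if magnitudes[k - 1] < magnitudes[k] >= magnitudes[k + 1]]
    if max_peaks is not None and len(peaks) > max_peaks:
        largest = sorted(peaks, key=lambda k: magnitudes[k], reverse=True)
        peaks = sorted(largest[:max_peaks])
    return peaks
```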

In a simple embodiment the speech waveform can be digitized at a 10 kHz sampling rate, low-pass filtered at 5 kHz, and analyzed at 20 msec frame intervals with a 20 msec Hamming window. Speech representations according to the invention can also be obtained by employing an analysis window of variable duration. For some applications it is preferable to have the width of the analysis window be pitch adaptive, being set, for example, at 2.5 times the average pitch period with a minimum width of 20 msec.
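The pitch-adaptive window rule given as an example above amounts to the following computation. The function names are hypothetical and the default values simply restate the example figures.

```python
def analysis_window_ms(avg_pitch_period_ms, factor=2.5, min_width_ms=20.0):
    """Window width in msec: factor times the average pitch period, with a floor."""
    return max(min_width_ms, factor * avg_pitch_period_ms)


def window_length_samples(width_ms, fs):
    """Convert a window width in msec to a whole number of samples at rate fs (Hz)."""
    return int(round(width_ms * fs / 1000.0))
```

For an average pitch period of 10 msec the width is 25 msec, or 250 samples at a 10 kHz sampling rate; for a 5 msec pitch period the 20 msec minimum applies.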

Plotted in FIG. 2 is a typical periodogram for a frame of speech along with the amplitudes and frequencies that are estimated using the above procedure. The DFT was computed using a 512-point fast Fourier transform (FFT). Different sets of these parameters will be obtained for each analysis frame. To obtain a representation of the waveform over time, frequency components measured on one frame must be matched with those that are obtained on a successive frame.
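A single analysis frame may be processed along the following lines, for illustration only. This sketch substitutes a direct (non-fast) evaluation of the zero-padded DFT for the FFT so that it needs only the Python standard library; the function name `analyze_frame` and the amplitude normalization by the window sum are assumptions.

```python
import cmath
import math


def analyze_frame(frame, fs, n_fft=512):
    """Estimate (frequency_hz, amplitude) pairs for one frame of samples.

    The frame is Hamming windowed, zero-padded to n_fft points, transformed
    by a direct DFT, and the peaks of the magnitude are located.
    """
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    windowed = [x * w for x, w in zip(frame, window)]
    scale = 2.0 / sum(window)                 # assumed scaling: unit sine gives peak near 1
    mags = []
    for k in range(n_fft // 2 + 1):           # non-negative frequencies only
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n_fft)
                  for i, x in enumerate(windowed))
        mags.append(abs(acc) * scale)
    return [(k * fs / n_fft, mags[k])
            for k in range(1, len(mags) - 1)
            if mags[k - 1] < mags[k] >= mags[k + 1]]
```

The list returned for a frame generally also contains small peaks arising from window sidelobes, which is one reason the frame-to-frame matching must tolerate spurious components.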

FIG. 3 illustrates the basic process of frequency component matching. If the number of peaks were constant and their locations varied only slowly from frame to frame, the problem of matching the parameters estimated on one frame with those on a successive frame would simply require a frequency-ordered assignment of peaks. In practice, however, there will be spurious peaks that come and go due to the effects of sidelobe interaction; the locations of the peaks will change as the pitch changes; and there will be rapid changes in both the location and the number of peaks corresponding to rapidly-varying regions of speech, such as at voiced/unvoiced transitions. In order to account for such rapid movements in the spectral peaks, the present invention employs the concept of "birth" and "death" of sinusoidal components as part of the matching process.

The matching process is further explained by consideration of FIG. 4. Assume that peaks up to frame k have been matched and a new parameter set for frame k+1 is generated. Let the chosen frequencies on frames k and k+1 be denoted by $\omega_0^k, \omega_1^k, \ldots, \omega_{N-1}^k$ and $\omega_0^{k+1}, \omega_1^{k+1}, \ldots, \omega_{M-1}^{k+1}$, respectively, where $N$ and $M$ represent the total number of peaks selected on each frame ($N \neq M$ in general). One process of matching each frequency in frame k, $\omega_n^k$, to some frequency in frame k+1, $\omega_m^{k+1}$, is given in the following three steps.

Step 1

Suppose that a match has been found for frequencies $\omega_0^k, \omega_1^k, \ldots, \omega_{n-1}^k$. A match is now attempted for frequency $\omega_n^k$. FIG. 4(a) depicts the case where all frequencies $\omega_m^{k+1}$ in frame k+1 lie outside a "matching interval" $\Delta$ of $\omega_n^k$, i.e.,

$$|\omega_n^k - \omega_m^{k+1}| \geq \Delta \qquad (4)$$

for all $m$. In this case the frequency track associated with $\omega_n^k$ is declared "dead" on entering frame k+1, and $\omega_n^k$ is matched to itself in frame k+1, but with zero amplitude. Frequency $\omega_n^k$ is then eliminated from further consideration and Step 1 is repeated for the next frequency in the list, $\omega_{n+1}^k$.

If on the other hand there exists a frequency $\omega_m^{k+1}$ in frame k+1 that lies within the matching interval about $\omega_n^k$, and is the closest such frequency, i.e.,

$$|\omega_n^k - \omega_m^{k+1}| < |\omega_n^k - \omega_i^{k+1}| < \Delta \qquad (5)$$

for all $i \neq m$, then $\omega_m^{k+1}$ is declared to be a candidate match to $\omega_n^k$. A definitive match is not yet made, since there may exist a better match in frame k to the frequency $\omega_m^{k+1}$, a contingency which is accounted for in Step 2.

Step 2

In this step, a candidate match from Step 1 is confirmed. Suppose that a frequency $\omega_n^k$ of frame k has been tentatively matched to frequency $\omega_m^{k+1}$ of frame k+1. Then, if $\omega_m^{k+1}$ has no better match to the remaining unmatched frequencies of frame k, the candidate match is declared to be a definitive match. This condition, illustrated in FIG. 4(c), is given by

$$|\omega_m^{k+1} - \omega_n^k| < |\omega_m^{k+1} - \omega_{i+1}^k| \quad \text{for } i \geq n \qquad (6)$$

where the first bracketed value in Equation 6 is illustrated as $\sigma_2$ in FIG. 4 and the second bracketed value of Equation 6 is illustrated as $\sigma_1$. When this occurs, frequencies $\omega_n^k$ and $\omega_m^{k+1}$ are eliminated from further consideration and Step 1 is repeated for the next frequency in the list, $\omega_{n+1}^k$.

If the condition (6) is not satisfied, then the frequency $\omega_m^{k+1}$ in frame k+1 is better matched to the frequency $\omega_{n+1}^k$ in frame k than it is to the test frequency $\omega_n^k$. Two additional cases are then considered. In the first case, illustrated in FIG. 4(d), the adjacent remaining lower frequency $\omega_{m-1}^{k+1}$ (if one exists) lies below the matching interval, hence no match can be made. As a result, the frequency track associated with $\omega_n^k$ is declared "dead" on entering frame k+1, and $\omega_n^k$ is matched to itself with zero amplitude. In the second case, illustrated in FIG. 4(e), the frequency $\omega_{m-1}^{k+1}$ is within the matching interval about $\omega_n^k$ and a definitive match is made. After either case Step 1 is repeated using the next frequency in the frame k list, $\omega_{n+1}^k$. It should be noted that many other situations are possible in this step, but to keep the tracker alternatives as simple as possible only the two cases are discussed.

Step 3

When all frequencies of frame k have been tested and assigned to continuing tracks or to dying tracks, there may remain frequencies in frame k+1 for which no matches have been made. Suppose that $\omega_m^{k+1}$ is one such frequency. It is then concluded that $\omega_m^{k+1}$ was "born" in frame k, and its match, a new frequency of the same value $\omega_m^{k+1}$, is created in frame k with zero magnitude. This is done for all such unmatched frequencies. This last step is illustrated in FIG. 4(f).
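The three steps above may be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a definitive implementation: the function name `match_frames` and its return format are hypothetical, both frequency lists are assumed to be sorted in increasing order, ties in distance are resolved toward the lower index, and only the two fallback cases discussed in Step 2 are handled.

```python
def match_frames(freqs_k, freqs_k1, delta):
    """Match peak frequencies of frame k to those of frame k+1.

    freqs_k, freqs_k1 : peak frequencies of frames k and k+1, ascending
    delta             : width of the matching interval
    Returns (matches, deaths, births) where matches is a list of (n, m)
    index pairs, deaths lists frame-k indices whose tracks die, and births
    lists frame-(k+1) indices that start new tracks.
    """
    unmatched = set(range(len(freqs_k1)))
    matches, deaths = [], []
    for n, w in enumerate(freqs_k):
        # Step 1: find the closest unmatched frequency inside the interval.
        in_range = [m for m in sorted(unmatched) if abs(w - freqs_k1[m]) < delta]
        if not in_range:
            deaths.append(n)                  # no candidate: the track dies
            continue
        m = min(in_range, key=lambda j: abs(w - freqs_k1[j]))
        # Step 2: confirm that no remaining frame-k frequency is a better match.
        d = abs(freqs_k1[m] - w)
        remaining = range(n + 1, len(freqs_k))
        if all(d < abs(freqs_k1[m] - freqs_k[i]) for i in remaining):
            matches.append((n, m))
            unmatched.discard(m)
            continue
        # Fallback: try the adjacent lower frequency of frame k+1.
        lower = m - 1
        if lower in unmatched and abs(w - freqs_k1[lower]) < delta:
            matches.append((n, lower))
            unmatched.discard(lower)
        else:
            deaths.append(n)
    # Step 3: whatever is left in frame k+1 is born.
    births = sorted(unmatched)
    return matches, deaths, births
```

In a synthesizer built on this sketch, each death would be extended into frame k+1 at the same frequency with zero amplitude, and each birth would be given a zero-amplitude predecessor in frame k, as described in Steps 1 and 3.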

The results of applying the tracker to a segment of real speech are shown in FIG. 5, which demonstrates the ability of the tracker to adapt quickly through transitory speech behavior such as voiced/unvoiced transitions, and mixed voiced/unvoiced regions.

In the simple "magnitude-only" system, synthesis is accomplished in a straightforward manner. Each pair of matched frequencies (and their corresponding magnitudes) is linearly interpolated across consecutive frame boundaries. As noted above, in the magnitude-only system, phase continuity is guaranteed by the definition of phase in terms of the instantaneous frequency. The interpolated values are then used to drive a sine wave generator which yields the synthetic waveform as shown in FIG. 1. It should be noted that