WikiPatents - Community Patent Review
Segment-based apparatus and method for speech recognition by analyzing multiple speech unit frames and modeling both temporal and spatial correlation    
United States Patent     5625749
Inventor(s)     Goldenthal; William D. (Cambridge, MA); Glass; James R. (Arlington, MA)
Abstract
Phonetic recognition is provided by capturing dynamical behavior and statistical dependencies of the acoustic attributes used to represent a subject speech waveform. A segment based framework is employed. Temporal behavior is modelled explicitly by creating dynamic templates, called tracks, of the acoustic attributes used to represent the speech waveform, and by generating the estimation of the acoustic spatio-temporal correlation structure. An error model represents this estimation as the temporal and spatial correlations between the input speech waveform and track generated speech segment. Models incorporating these two components (track and error estimation) are created for both phonetic units and for phonetic transitions. Phonetic contextual influences are accounted for by merging context-dependent tracks and pooling error statistics over the different contexts. This allows for a large number of contextual models without compromising the robustness of the statistical parameter estimates. The transition models also supply contextual information.
Owner/Assignee     Massachusetts Institute of Technology (Cambridge, MA)
Publication Date     April 29, 1997
Application Number     08/293,584
Filing Date     August 22, 1994
US Classification     704/254 704/237 704/239 704/240 704/241 704/242 704/253 704/255
Int'l Classification     G10L 005/00 G10L 005/06
Examiner     MacDonald; Allen R.
Assistant Examiner     Smits; Talivadis Ivars
Attorney/Law Firm     Hamilton, Brook, Smith & Reynolds, P.C.
USPTO Field of Search     395/2.55 395/2.6 395/2.64 395/2.62 395/2.46 395/2.48 395/2.49 395/2.5 395/2.51
U.S. References
Patent No.     Inventor     US Classification     Date
5333236     Bahl     704/256.4     Jul 1994
5199077     Wilcox     704/256     Mar 1993
5036539     Wrench, Jr.     704/246     Jul 1991
5023911     Gerson     704/253     Jun 1991
4994983     Landell     704/245     Feb 1991
Claims
 


What is claimed is:

1. In a digital processor, speech recognition apparatus for decoding an input speech signal to a corresponding speech unit, the apparatus comprising:

a source providing an input speech signal formed of multiple observation frames;

a plurality of unit templates, each unit template for representing acoustic attributes of a respective speech unit and each unit template generating a respective synthetic segment indicative of the respective speech unit;

a plurality of error models associated with the unit templates, each unit template having an error model for explicitly measuring and quantitatively representing temporal and spatial correlations between the synthetic segments and a subject speech signal, the temporal and spatial correlations being between acoustic attributes in the observation frames of the subject speech signal; and

processor means coupled to the unit templates and error models and coupled to the source to receive the input speech signal, the processor means comparing the synthetic segments to different plural observation frames of the input speech signal to define a set of error sequences and based on the error models, the processor means analyzing the error sequences and determining the corresponding speech unit of the input speech signal.

2. Apparatus as claimed in claim 1 wherein the unit templates employ a generation function to generate the synthetic segments.

3. Apparatus as claimed in claim 2 wherein the generation function is used to form each unit template.

4. Apparatus as claimed in claim 1 wherein each error model is formed from a probability density function; and

the processor means determines the corresponding speech unit of the input speech signal to be the respective speech unit of the unit template corresponding to the most likely error model.

5. Apparatus as claimed in claim 1 wherein each error model is formed from a distance metric; and

the processor means determines the corresponding speech unit of the input speech signal to be the respective speech unit of the unit template corresponding to the best error model.

6. Apparatus as claimed in claim 1 wherein each error sequence is normalized to a single error feature vector of fixed dimension before the processor means generates the error models.

7. Apparatus as claimed in claim 1 wherein the plurality of unit templates includes transition unit templates for representing acoustic transition dynamics between speech units within a speech signal.

8. Apparatus as claimed in claim 7 wherein the transition unit templates provide an indication of one of location of a transition in the input speech signal and speech units involved in the transition.

9. Apparatus as claimed in claim 1 further comprising a multiplicity of merged templates formed by a combination of a plurality of unit templates.

10. Apparatus as claimed in claim 1 wherein certain ones of the unit templates are templates for representing context-dependent acoustic attributes of a respective speech unit.

11. Apparatus as claimed in claim 1 wherein the respective speech unit for each unit template is a phonetic unit or a string of phonetic units.

12. In a digital processor, a method for decoding an input speech signal to a corresponding speech unit comprising the steps of:

providing an input speech signal formed of multiple observation frames;

providing a plurality of unit templates in stored memory of the digital processor, each unit template for representing acoustic attributes of a respective speech unit and for generating a respective target speech unit;

providing a plurality of error models associated with the unit templates in stored memory, each unit template having an error model for explicitly measuring and quantitatively representing temporal and spatial correlations between the synthetic segments and a subject speech signal, the temporal and spatial correlations being between acoustic attributes in the observation frames of the subject speech signal;

receiving the input speech signal in working memory of the digital processor;

comparing the target speech units with different plural observation frames of the input speech signal in working memory such that the comparison defines a set of error sequences in working memory;

and

using the error models, analyzing the error sequences and determining the corresponding speech unit of the input speech signal.

13. A method as claimed in claim 12 wherein the unit templates employ a generation function to generate the target speech units.

14. A method as claimed in claim 13 wherein the generation function is used to form each unit template.

15. A method as claimed in claim 12 wherein:

the step of generating the error models includes forming each error model from a probability density function; and

the step of determining the corresponding speech unit includes determining a most likely error model such that the respective speech unit of the unit template corresponding to the most likely error model is the corresponding speech unit of the input speech signal.

16. A method as claimed in claim 12 wherein:

the step of generating the error models includes forming each error model from a distance metric; and

the step of determining the corresponding speech unit includes determining a best error model, such that the respective speech unit of the unit template corresponding to the best error model is the corresponding speech unit of the input speech signal.

17. A method as claimed in claim 12 further comprising the step of normalizing each error sequence to a single error feature vector of fixed dimension before generating the error models.

18. A method as claimed in claim 17 wherein the step of normalizing includes averaging across each error sequence.

19. A method as claimed in claim 12 wherein the step of providing a plurality of unit templates includes providing transition unit templates for representing acoustic transition dynamics between speech units within a speech signal.

20. A method as claimed in claim 19 wherein the transition unit templates provide an indication of one of location of a transition in the input speech signal and speech units involved in the transition.

21. A method as claimed in claim 12 wherein the step of providing a plurality of unit templates includes combining a plurality of unit templates to form a multiplicity of merged templates that account for contextual effects on the respective speech units of the unit templates.

22. A method as claimed in claim 12 wherein the step of providing a plurality of unit templates includes providing a multiplicity of templates for representing context dependent acoustic attributes of a respective speech unit.

23. A method as claimed in claim 12 wherein the step of providing a plurality of unit templates includes providing phonetic unit templates for representing one of phonetic units of speech and strings of phonetic units of speech.

24. In a digital processor, speech recognition apparatus for decoding an input speech signal to a corresponding speech unit, the apparatus comprising:

a source providing an input speech signal formed of multiple observation frames;

a plurality of unit templates, each unit template for representing acoustic attributes of a respective speech unit and each unit template generating a respective synthetic segment indicative of the respective speech unit;

a plurality of error models associated with the unit templates, each unit template having an error model; and

processor means coupled to the unit templates and error models and coupled to the source to receive the input speech signal, the processor means comparing the synthetic segments to different plural observation frames of the input speech signal to define a set of error sequences, the processor means transforming each error sequence to a fixed dimension error feature vector independent of the number of observation frames, and based on the error models, the processor means computing a score for the error feature vector.

25. The apparatus of claim 24 wherein each error model explicitly measures and quantitatively represents temporal and spatial correlations between the synthetic segments and a subject speech signal, the temporal and spatial correlations being between acoustic attributes in the observation frames of the subject speech signal.

26. The apparatus of claim 25 wherein the temporal and spatial correlations are between different acoustic attributes in different observation frames of the subject speech signal.
Description
 


BACKGROUND

The task of automatic speech recognition (ASR) essentially consists of decoding a word sequence from a continuous speech signal. In order to achieve reasonable levels of performance, past ASR systems have constrained the permissible speech input in order to simplify the decoding task. Typical constraints are (i) speaker dependency, i.e., training the system for each individual speaker, (ii) word quantity, i.e., limiting the system vocabulary to a small number of words or requiring input to be isolated words only, and (iii) read speech (as opposed to also permitting spontaneous speech), or some combination of (i) through (iii). Recently however, state-of-the-art systems have been able to achieve reasonable performance levels for speaker independent, continuous/spontaneous speech systems, operating with vocabularies of greater than 5,000 words.

A block diagram of the major components of a typical ASR system 10 is shown in FIG. 1. Typically, the samples of the continuous speech signal 12 are first processed by a signal processor 14 to form a discrete sequence of observation vectors 18. The components of the observation vectors are the acoustic attributes that have been chosen to represent the signal 12. Examples of commonly chosen attributes are Discrete Fourier Transform based spectral coefficients or auditory model parameters. Each observation vector 18 is called a frame of speech, and the sequence of T frames forms the signal representation, $O = \{o_1, o_2, \ldots, o_T\}$. Acoustic and language models 20, 22 are then used to score the frame sequence O, search a lexicon and hypothesize word sequences. The models 20, 22, search and scoring procedure 24 are highly implementation dependent.

As the number of words in the lexicon 26 becomes large, the task of training individual word models becomes prohibitive. Consequently an intermediate level of representation is generally used. A common representation involves describing the pronunciation of a word in terms of phonemes. A phoneme is an abstract linguistic unit. Changing a phoneme changes the meaning of a word. For example, if the phoneme /p/ in the word "pit" is changed to a /b/, the word becomes "bit". A small number of phonemes can be used to describe all the words in a given language (English consists of roughly 40 phonemes). By representing word pronunciations as a sequence of phonemes, the number of acoustic models and the required training data can be drastically reduced.

Phonemes can be realized in a variety of acoustically distinct manners depending on the phonetic context (e.g., syllable position, neighboring phones), the stress, the speaker, and other factors. The actual acoustic realization of a phoneme is known as a phone. This distinction between a phoneme and a phone is an important one. The different acoustic realizations of the same phoneme do not affect the meaning of a word. An example of this often occurs in the word "butter" where the phoneme /t/ is frequently realized in American English as a "flap" (a particular phone). The acoustic variability that can occur when realizing the same phoneme is part of what makes the task of identifying a phoneme so challenging. The standard distinction is to utilize / / to indicate a phoneme and [ ] to indicate a phone.

The acoustic models are generally trained to recognize some set of phones (the exact set being a design decision). The task of decoding a phone sequence is known as "phonetic recognition," and the resulting output is known as a phonetic transcription. The phonetic transcription may or may not be mapped to a string of phonemes, but regardless, it is of fundamental importance to the ASR task since it is the foundation upon which the word string search is based. Virtually all modern, state-of-the-art speech systems utilize phonetic models as a basis for recognition.

Phonetic recognition methods tend to fall into two categories. The first, and most widely used, is "frame" based. Each observation frame in the sequence $O = \{o_1, \ldots, o_T\}$ receives a score for each phonetic model in the system. There is no presegmentation of the signal into larger units. An example of a frame-based phonetic recognition method is the Hidden Markov Model (HMM). An HMM consists of a set of states connected to each other via transition probabilities. While a state is occupied, observations are generated randomly from a probability density function. The transition probabilities and output distributions together constitute an HMM. The key assumption inherent in an HMM is that the observations are independent, given the state sequence up to the current time.

Thus HMM's handle certain temporal aspects of the speech problem in an elegant manner. The variability of durations over a phone training set is handled automatically by the fact that the state sequence can be as long or short as necessary. Another advantage of the HMM approach is that it does not require an explicit temporal alignment, or segmentation, of the speech signal. Since each frame in an utterance receives its own score, the likelihood scores for alternative segmentations can be directly compared to each other. The alignment which results in the best score for the entire utterance is then chosen. Finally, an efficient technique, the Baum-Welch reestimation algorithm, exists for training HMM's.
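The alignment step described above, choosing the state sequence with the best score for the whole utterance, is commonly carried out with the Viterbi algorithm. The following is a minimal sketch only, not taken from the patent: the two-state model, the discrete observation symbols "x" and "y", and all probability values are hypothetical, and discrete emission tables stand in for the probability density functions an actual recognizer would use.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_score, best_state_path) for an observation sequence."""
    # best[s] = best log score of any path ending in state s at the current frame
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one dict per later frame: best predecessor of each state
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][o]
            ptr[s] = p
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return best[last], path

def logs(table):
    """Convert a (possibly nested) table of probabilities to log probabilities."""
    return {k: logs(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}

# Hypothetical toy model: two states, two observation symbols.
STATES = ["a", "b"]
LOG_START = logs({"a": 0.9, "b": 0.1})
LOG_TRANS = logs({"a": {"a": 0.7, "b": 0.3}, "b": {"a": 0.1, "b": 0.9}})
LOG_EMIT = logs({"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.2, "y": 0.8}})

score, path = viterbi(["x", "x", "y", "y"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path, math.exp(score))
```

Because every frame contributes one term to the score, two candidate alignments of the same utterance can be compared directly, which is the property the text attributes to frame-based systems.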

In HMM's, temporal correlations are represented implicitly through the statistics of the state sequence and are not modelled explicitly. However, it has been demonstrated that significant temporal correlations do exist. See V. Digalakis, "Segment-Based Stochastic Models of Spectral Dynamics for Continuous Speech Recognition", Ph. D. Thesis, Boston University, 1992. Also see W. Goldenthal and J. Glass, "Modelling Spectral Dynamics for Vowel Classification," Proc. Eurospeech 93, pp. 289-292, Berlin, Germany, (September 1993), incorporated herein by reference.

There have also been attempts to explicitly model the dynamics of the acoustic attributes within an HMM framework. Generally this has been done, with some success, by incorporating first (and possibly second) order differences of the acoustic parameters in the observation vector. Other approaches are segmental HMM's proposed by Russell and Marcus and state-conditioned trend functions used by Deng. See "A Segmental HMM for Speech Pattern Modelling", by M. Russell in Proceedings of the ICASSP 93, pages 499-502, Minneapolis, Minn. April 1993; "Phonetic Recognition in a Segment-Based HMM" by J. Marcus in Proceedings of the ICASSP 93, pages 479-482 Minneapolis, Minn. April 1993; and "A Generalized Hidden Markov Model With State-Conditioned Trend Functions of Time for the Speech Signal" by L. Deng, Signal Processing 27, Vol. 1, pages 65-78 April 1992. None of these approaches has gained general acceptance within the community or been shown to generate results superior to more traditional HMM's.
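As an illustration of incorporating first order differences in the observation vector, the sketch below appends a simple backward difference to each frame. The function name `add_deltas` and the choice of a one-frame backward difference are assumptions made for illustration; practical systems often estimate the difference over a wider window.

```python
def add_deltas(frames):
    """Append first-order differences to each frame.

    frames is a list of equal-length lists of floats (one list per frame).
    The first frame has no predecessor, so its difference is taken as zero.
    """
    out = []
    for t, frame in enumerate(frames):
        prev = frames[t - 1] if t > 0 else frame
        delta = [c - p for c, p in zip(frame, prev)]
        out.append(list(frame) + delta)
    return out

augmented = add_deltas([[1.0, 2.0], [1.5, 1.0], [3.0, 1.0]])
print(augmented)
```

Each augmented frame is twice as long as the original: the static attributes followed by their frame-to-frame change.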

A second type of phonetic recognition method involves a "segment" based approach. These methods hypothesize start and end times of larger units within the speech signal which generally represent individual phonetic units of speech. An example of a segment-based method is the Stochastic Segment Model (SSM). SSM's attempt both to model the spectral dynamics of a phonetic unit and to capture the temporal correlation within a phonetic segment. However, SSM's impose a very high dimensionality on the Gaussian probability density functions used to estimate the correlations (on the order of 112 to 140). As a consequence, no implementation of this method has yet successfully incorporated the temporal correlation information. In fact, an implementation utilizing only the temporal correlations performed slightly worse than an implementation which assumed complete statistical independence. See S. Roucos, M. Ostendorf, H. Gish, A. Derr, "Stochastic Segment Modelling Using the Estimate-Maximize Algorithm", in Proceedings ICASSP 88, pages 127-130, April 1988.

As between segment-based and frame-based methods, segment-based systems offer the potential advantage of being able to accurately capture segment level dynamics as well as directly modelling temporal correlations within the segment. Also, segment level features, such as segment duration, are easily incorporated. The advantage of a frame-based system is that each frame receives its own score and the scores for different transcription candidates are directly comparable. In a segment-based framework, it can be difficult to compare utterance likelihoods which propose different numbers of segments. Also, a frame-based system tends to have a computational advantage since the segmentation step does not have to be explicitly performed.

Further, other methods for phonetic recognition include template-based approaches, statistical approaches and more recently approaches based on dynamic modeling and neural networks. A recurrent error propagation neural network approach has been used with the TIMIT speech corpus. See T. Robinson, "Several Improvements to a Recurrent Error Propagation Phone Recognition System", Technical Report CUED/F-INFENG/TR. 82, 1991. An inherent drawback of neural networks is the large amount of time needed to train the models.

SUMMARY OF THE PRESENT INVENTION

The present invention overcomes many of the problems and disadvantages of the prior art. In particular, the present invention provides improved phonetic recognition in an automatic speech recognition system, or any other system which utilizes phonetic transcription. The present invention specifically provides improved acoustic models.

The present invention phonetic recognition method is both template-based and statistical-based. The templates are used to capture dynamic characteristics at the segment level, and the statistics measure the spatial (meaning within the parameter space) and temporal correlations of the errors.

In particular, the present invention generates a dynamic representation of a phonetic unit, called a "track". The present invention also generates a statistical model of the error when a track is compared to a phonetic segment. This in effect creates a dynamic trajectory of the acoustic attributes (or measurements) used to represent the speech signal, and the incorporation of the temporal correlations into a statistical model for each phonetic unit. As mentioned above, the HMM's are not able to explicitly model the temporal correlations. The present invention approach provides a vehicle for modelling these correlations.

In the preferred embodiment, speech recognition apparatus of the present invention decodes an input speech signal to a corresponding speech unit (e.g. phonetic unit) in a digital processor as follows. A plurality of unit templates is provided. Each unit template represents acoustic attributes of a respective speech unit such as a phonetic unit or a string of phonetic units. In addition, each unit template generates a respective target speech unit or a synthetic segment. Processor means then compares the synthetic segments/target speech units to portions of the input speech signal to define a set of error sequences. The processor means generates therefrom a plurality of error models, one for each unit template. Each error model represents the temporal and spatial correlations in the error sequences defined between the synthetic segments and input speech signal. Based on the error models, a determination is made of the corresponding speech unit of the input speech signal. In particular, the respective speech unit of the unit template corresponding to the best or most likely error model (e.g. the one with greatest probability) is the transcription or translation of the input speech signal.

According to one aspect of the present invention, the unit templates employ a generation function to generate the target speech units or synthetic segments. In addition, the generation function is used to initially form each unit template.

In a preferred embodiment of the present invention, each error model is formed from a probability density function, such as a joint Gaussian probability density function. In addition, each error sequence is normalized to a fixed dimension before the processor means generates the error models. Preferably each error sequence is normalized by averaging.
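The flow summarized above (a template generates a synthetic segment, the segment is compared with the input frames to give an error sequence, the error sequence is normalized by averaging, and a probability density function scores the result) can be sketched as follows. This is an illustrative simplification under stated assumptions, not the patented implementation: the generation function is assumed to be linear resampling of the track to the segment length, the whole error sequence is averaged into a single vector, and a diagonal-covariance Gaussian stands in for the joint Gaussian density. All function names and the toy one-dimensional data are hypothetical.

```python
import math

def generate_segment(track, n_frames):
    """Assumed generation function: linearly resample a track to n_frames."""
    if n_frames == 1:
        return [list(track[len(track) // 2])]
    out = []
    for t in range(n_frames):
        pos = t * (len(track) - 1) / (n_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, len(track) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(track[lo], track[hi])])
    return out

def mean_error(segment, track):
    """Average the frame-by-frame error sequence into one fixed-size vector."""
    synth = generate_segment(track, len(segment))
    dim = len(track[0])
    total = [0.0] * dim
    for obs, syn in zip(segment, synth):
        for d in range(dim):
            total[d] += obs[d] - syn[d]
    return [x / len(segment) for x in total]

def log_gaussian(err, mean, var):
    """Log density of a diagonal-covariance Gaussian (a simplification)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (e - m) ** 2 / v)
               for e, m, v in zip(err, mean, var))

def classify(segment, models):
    """models maps unit -> (track, error_mean, error_var); return best unit."""
    return max(models, key=lambda u: log_gaussian(
        mean_error(segment, models[u][0]), models[u][1], models[u][2]))

# Hypothetical one-dimensional example with two units.
MODELS = {
    "aa": ([[0.0], [1.0], [2.0]], [0.0], [1.0]),
    "iy": ([[5.0], [5.0], [5.0]], [0.0], [1.0]),
}
SEGMENT = [[0.1], [0.6], [1.1], [1.4], [2.1]]
print(classify(SEGMENT, MODELS))
```

Averaging over the entire segment, as done here, yields a vector whose size does not depend on the number of frames, but it also discards the ordering of the errors; a diagonal covariance likewise ignores correlations between attributes. A model meant to capture temporal and spatial correlations would need a richer error feature vector and a full covariance.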

According to another feature of the present invention, the plurality of unit templates includes transition unit templates. The transition unit templates represent transitions between speech/phonetic units within a speech signal. Further, the transition unit templates provide an indication of either location of a transition in the input speech signal, or the speech units involved in the transition or both.

According to another aspect of the present invention, a combination of unit templates is used to form a multiplicity of merged templates. The merged templates account for contextual effects on the respective speech units of the initial unit templates.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments, as illustrated in the drawings, in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.

FIG. 1 is a block diagram of an automatic speech recognition system of the type in which an embodiment of the present invention may be employed.

FIG. 2 is a schematic diagram of one embodiment of the present invention.

FIG. 3 is a schematic diagram of a track and error model pair in the embodiment of FIG. 2.

FIGS. 4A-4D are graphs illustrating track alignment of each of the cepstral coefficients $C_0$-$C_3$ employed in the embodiment of FIG. 2.

FIG. 5 is an illustration of a matrix of error correlation coefficients employed by the present invention.

FIG. 6 is a graph of the distance between transition tracks in the clustering processes of an alternative feature of the present invention.

FIG. 7 is an illustration of a portion of an acoustic attribute partitioned into segments.

FIG. 8 is an illustration of a Viterbi search path employed by the search component of the embodiment of FIG. 2.

FIG. 9 is a table of the phone classes employed in the alternative feature of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

By way of background, speech is produced by the coordinated manipulation of a set of articulators, including the tongue, lips, jaw, vocal folds, and the velum. The speaker-dependent characteristics of the articulators and the vocal tract can cause a large amount of acoustic variability in the realization of the same phoneme sequence. The speaker's environment, mood, health, and prosody (pitch and emphasis) can all affect the acoustic realization of a phonemic sequence. In addition to these speaker-dependent effects, the phonemic context influences the motion of the articulators and the resulting acoustic output. It is frequently unclear where one phonetic segment ends and the next begins. The overlapping of phonetic segments stems from overlap in adjacent articulatory gestures. This phenomenon is known as co-articulation, and causes large variations in the acoustic realization of a phoneme.

Despite the high degree of variability in the speech signal, there exists much that is consistent both within a phonetic unit and across an utterance. This consistency is what makes spoken communication so robust. A given phone generally has a configuration of the articulators or target position associated with it. Whether or not the target position is reached, there tend to exist intervals of speech which are predominantly representative of a particular phone. Despite differences among different speakers' physical characteristics, the articulators will share similar relative motions when realizing the same phone. This similarity in the dynamics of the articulators generally translates into similar dynamics in the acoustic attributes of the phone.

Therefore, the applicants have discovered that the trajectories of the acoustic attributes share similar dynamic characteristics for a given sequence of phones as the articulators move through a similar sequence of gestures. The greater the similarity of the phonetic context, the greater the similarity of the motion of the acoustic attributes.

Statistical models of the phonetic units have historically provided a robust method for dealing with the variability between speakers. These statistical models may provide correlation information between the acoustic attributes at a specific time, and over a specified time interval. The applicants have found that the temporal correlation information can provide a means for accounting for the fact that the same vocal tract is producing the entire phonetic sequence in an utterance. These temporal correlations in the speech signal are not modeled directly in most prior art implementations. The most popular current method, the HMM (discussed above), is only able to model these correlations indirectly. The present invention demonstrates the importance of the temporal correlations and constructs models which utilize them effectively.

Turning now to the particulars, the present invention operates in an automatic speech recognition system 40 such as that depicted in FIG. 2 (and similar to that of FIG. 1). As noted earlier, the continuous speech (input) signal is digitally sampled and then processed via a temporal and/or spectral analysis into a sequence of observation frames. In the preferred embodiment, the input signal 12a is preprocessed by signal preprocessor 16 (FIG. 2) as follows. The signal representation 18a to be generated and used throughout the present invention consists of the Mel-frequency cepstral coefficients (MFCC's) described by P. Mermelstein and S. Davis, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences", IEEE Trans. ASSP, Vol. 23 No. 1, pages 67-72 (February 1975), incorporated herein by reference. These coefficients are based on the short-time Fourier transform of the speech signal 12a. The cepstral coefficients provide a high degree of data reduction over using values of the power spectral density directly, since the power spectrum at each frame is represented using relatively few parameters.

The key steps in producing the MFCC's are:

1. Analog-to-digital conversion of the continuous speech waveform 12a into digitized samples. Preferably the sample frequency is 16 kHz.

2. The digitized signal is then pre-emphasized via first differencing to reduce the effects of spectral tilt.

3. The digitized samples are blocked or rectangularly windowed into frames. The frames are typically on the order of 25 or 30 ms.

4. The frames are windowed using a Hamming, Hanning or other common window known in the art, to reduce the effects of assuming the signal 12a is zero outside the boundaries of the frame. In the preferred embodiment, a Hamming window of duration 25.6 ms is used.

5. The frames are computed using a fixed rate moving window at increments of 5 to 15 ms. Preferably, 5 ms increments or 200 frames per second are used. Hence, there is a large degree of overlap between frames. The idea is that the signal 12a can be considered quasi-stationary within a frame.

6. A 256 point (for example) Discrete Fourier Transform is then computed for each frame. Other types of transform-based or similar processing, common in the art, are suitable.

7. The Fourier transform coefficients are squared, and the resulting squared magnitude spectrum is passed through a set of 40 overlapping Mel-frequency triangular filter banks. The log energy outputs of these filters collectively form the 40 Mel-frequency spectral coefficients (MFSC), X.sub.j, j=1,2, . . . 40.

8. A cosine transform of the MFSC's is then used to generate the 15 MFCC's which are the acoustic attributes used in the present invention. The Mel-frequency filters consist of thirteen triangles spread evenly on a linear frequency scale from 130 Hz to 1 kHz, and 27 triangles evenly distributed on a logarithmic scale from 1 kHz to 6.4 kHz. Since the bandwidths of the triangular filters increase with center frequency, the area of each filter is normalized to avoid amplifying the higher frequency coefficients. The cosine transform which yields the MFCC, C.sub.i, i=0,1,2 . . . , 14, from the MFSC is: ##EQU1## Note that the lowest cepstral coefficient, C.sub.0, is a summation of the log energy from each filter. Therefore, it is indirectly related to the amount of energy in a frame.
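For illustration only, steps 2 through 8 can be sketched in Python with NumPy, starting from an already digitized signal. The sketch rests on several assumptions that the text does not state: the exact filter edge layout (14 linearly spaced and 28 logarithmically spaced edge points), zero-padding each 25.6 ms frame to a 512-point transform instead of the 256-point example, a small floor inside the logarithm, and the standard cosine transform C_i = sum over j of X_j cos(i (j - 1/2) pi / 40), which agrees with the statement that the lowest coefficient is the sum of the log energies. The names mel_filterbank and mfcc are hypothetical.

```python
import numpy as np

N_FILTERS = 40  # number of Mel-frequency triangular filters (step 7)


def mel_filterbank(n_fft, sample_rate):
    """Return a (40, n_fft // 2 + 1) bank of area-normalized triangles."""
    # Assumed edge layout: 14 linear points from 130 Hz to 1 kHz, then 28
    # log-spaced points up to 6.4 kHz. This puts 13 filter peaks on the
    # linear part and 27 on the logarithmic part.
    linear = np.linspace(130.0, 1000.0, 14)
    log = np.geomspace(1000.0, 6400.0, 29)[1:]
    edges = np.concatenate([linear, log])  # 42 edge points for 40 triangles
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((N_FILTERS, freqs.size))
    for k in range(N_FILTERS):
        lo, ctr, hi = edges[k], edges[k + 1], edges[k + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        tri = np.clip(np.minimum(rising, falling), 0.0, None)
        bank[k] = tri / max(tri.sum(), 1e-12)  # normalize the filter area
    return bank


def mfcc(signal, sample_rate=16000, frame_ms=25.6, step_ms=5.0,
         n_fft=512, n_ceps=15):
    """Return an (n_frames, n_ceps) array of cepstral coefficients."""
    signal = np.asarray(signal, dtype=float)
    emphasized = signal[1:] - signal[:-1]            # step 2: first difference
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    n_frames = 1 + (emphasized.size - frame_len) // step
    index = (np.arange(frame_len)[None, :]
             + step * np.arange(n_frames)[:, None])  # steps 3 and 5: framing
    frames = emphasized[index] * np.hamming(frame_len)   # step 4
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2     # step 6, squared
    mfsc = np.log(power @ mel_filterbank(n_fft, sample_rate).T + 1e-10)
    i = np.arange(n_ceps)
    j = np.arange(1, N_FILTERS + 1)
    basis = np.cos(np.outer(i, j - 0.5) * np.pi / N_FILTERS)  # step 8
    return mfsc @ basis.T
```

With these settings, one second of 16 kHz audio yields 195 frames of 15 coefficients each, and column 0 of the result is the sum of the 40 log filter energies for each frame.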

Once the signal representation 18a has been generated by the signal preprocessor 16, a search component 24a employs the acoustic model 30 of the present invention to incorporate dynamical models of the acoustic spectra into the phonetic recognition task as follows. First, the acoustic model 30 of the present invention determines a means of mapping a phone's (or a given unit of speech's) variable duration tokens onto a fixed length track. A track is defined to be a trajectory or temporal evolution of the acoustic attributes (or measurements) over a segment. That is, the purpose of the track is to accurately represent and account for the dynamic behavior of the acoustic attributes (or measurements) over the duration of a phone. A track consists of and is represented by a sequence of M state vectors T={t.sub.1, . . . , t.sub.M } which are used as the basis for generating a variable duration synthetic segment:

G=f(T,N)={g.sub.1, . . . , g.sub.N }

for any number of frames N, where f() is a generation function. To that end, the tracks serve as templates for the units of speech (e.g. phones) they are modelling and capture segment level spectral dynamics.

After a track is computed from the training tokens for a particular phone, the same tokens are used to generate an error model EM based on the differences between synthetic segments generated from the track and the training tokens. The error model (EM) is then processed to determine the identity of the speech segment. As such, the purpose of the error model is to represent the correlations, both temporal and spatial, that exist in the errors between the synthetic segments and the input tokens. The error model (EM) consists of a probability density function which is used to compute the likelihood scores used for phonetic classification. The error models in the preferred embodiment are jointly Gaussian probability density functions.
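As a minimal sketch of a jointly Gaussian error model, the following Python code estimates a mean and covariance from training error vectors and scores a new error vector by its log likelihood. It assumes that each token's error has already been reduced to a fixed-length vector; how a variable-length error sequence is brought to a fixed dimension is not addressed here. The small diagonal term added to the covariance and the names fit_error_model and log_likelihood are also assumptions made for illustration.

```python
import numpy as np


def fit_error_model(errors, ridge=1e-6):
    """Estimate the mean and covariance of training error vectors.

    errors: array of shape (n_tokens, d), one error vector per token.
    ridge:  assumed small diagonal term that keeps the covariance invertible.
    """
    errors = np.asarray(errors, dtype=float)
    mean = errors.mean(axis=0)
    cov = np.atleast_2d(np.cov(errors, rowvar=False))
    cov = cov + ridge * np.eye(errors.shape[1])
    return mean, cov


def log_likelihood(error, mean, cov):
    """Log of the multivariate Gaussian density evaluated at one error vector."""
    diff = np.asarray(error, dtype=float) - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    d = mean.size
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + mahalanobis)
```

In a classification setting, one such model would be fitted per phonetic unit and the unit with the highest score selected. The off-diagonal covariance entries are what carry the correlation information.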

The track T and its associated statistical error model EM form a baseline model for each phonetic unit (i.e. form a phonetic model 38). Although the baseline (phonetic) model 38 provides a robust general characterization of the phonetic unit it represents, details attributable to phonetic context and speaker dependencies tend to be "averaged out". That is, since the track represents the phone in all contexts, it tends not to contain contextual information which is critical to enhancing model accuracy due to co-articulation. One means to address this problem is to create context-dependent tracks. Another is to specifically model the transition dynamics between phonemes. Both of these approaches are discussed in detail below.

It is important to distinguish between phonetic recognition and phonetic classification. In phonetic classification, the segmentation boundaries and utterance are known, and the task is to correctly classify each segment. In phonetic recognition, the segment boundaries are not known. As a result, insertion and deletion errors are possible along with substitution errors (i.e., misclassification).
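To make the three error types concrete, the following Python sketch aligns a recognized phone sequence against a reference sequence by minimum edit distance and counts substitutions, insertions and deletions. The function name count_errors and the phone labels in the usage note are hypothetical, and equal costs for all three error types are an assumption.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, insertions, deletions) of a minimum-cost alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] holds (total, subs, ins, dels) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, 0, i)          # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, j, 0)          # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total, subs, ins, dels = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (total, subs, ins, dels)
            else:
                diagonal = (total + 1, subs + 1, ins, dels)
            total, subs, ins, dels = cost[i - 1][j]
            deletion = (total + 1, subs, ins, dels + 1)
            total, subs, ins, dels = cost[i][j - 1]
            insertion = (total + 1, subs, ins + 1, dels)
            # tuples compare on the total first, so the total stays minimal
            cost[i][j] = min(diagonal, deletion, insertion)
    return cost[n][m][1:]
```

For example, count_errors(["k", "ae", "t"], ["k", "ih", "t", "s"]) returns (1, 1, 0): one substitution and one insertion. In classification, where the segment boundaries are given, only the substitution count can be nonzero.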

A classification scheme which is compatible with the above components may be incorporated into the phonetic recognition task of the present invention. To that end, segmentation would be provided using existing methodologies common in the art, and an overall evaluation of the dynamic modelling approach of the present invention would be performed.

The foregoing components of FIG. 2 are implemented in computer code generally executed on a computer processor such as a VAX or similar computer/digital processing system. For purposes of illustration and not limitation, FIG. 2 depicts the search component 24a, present invention acoustic model 30 and associated parts operating in processor (memory) 28. Other computer configurations (in hardware, software or both) are in the purview of one skilled in the art.

In particular, a phonetic model 30 (and supporting track and error model pairs 38) of the present invention is implemented as follows and illustrated in FIG. 3.

Tracks T.sub..alpha. are computed from training data by mapping the training tokens for each phone to a sequence of M states. Each state is recorded as a vector, the sequence of vectors forming the track. The mapping function is known as a generation function f. When all the tokens in the training set for a particular phone have been mapped, the phone-dependent track is calculated from the maximum likelihood estimate of each state.

Once the tracks have been created, they serve as the initial stage in evaluating hypothesized speech segments. As shown in FIG. 3, to evaluate an N frame speech segment S, a synthetic segment G is generated. The generation function f (at 32) is used to compute the mapping from the M state track to the N frame synthetic segment 34. That is, for each state of track T, the generation function 32 aligns a data point from the frame values (stretched or compressed) and generates a template or synthetic segment G. The synthetic segment G produced by the generation function 32 is then compared directly to the N frame acoustic segment S to form an error sequence E as follows:

E=S-G={e.sub.1, . . . , e.sub.N }

where e.sub.i =s.sub.i -g.sub.i. See step 36 in FIG. 3. The error sequence is subsequently used to formulate the error model EM of the phonetic model 30 of FIG. 2.
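The evaluation step can be sketched in Python as follows. The sketch assumes one possible generation function, a linear nearest-state mapping in which the first and last frames are aligned with the first and last track states; the handling of a one-frame segment and the names generate_segment and error_sequence are likewise assumptions, not taken from the text.

```python
import numpy as np


def generate_segment(track, n_frames):
    """Map an (M, d) track onto an (n_frames, d) synthetic segment G."""
    track = np.asarray(track, dtype=float)
    m = len(track)
    if n_frames == 1:
        return track[[0]]                  # assumed choice for one-frame segments
    j = np.arange(n_frames)
    # nearest track state for each frame, with ties rounded up
    state = np.floor(j * (m - 1) / (n_frames - 1) + 0.5).astype(int)
    return track[state]


def error_sequence(segment, track):
    """Return E = S - G, frame by frame, for an (N, d) acoustic segment S."""
    segment = np.asarray(segment, dtype=float)
    return segment - generate_segment(track, len(segment))
```

For a three-state track with values 0, 10 and 20 and a five-frame segment, the frames are aligned with states 0, 1, 1, 2, 2, so the synthetic segment is 0, 10, 10, 20, 20 and the error is the frame-by-frame difference from it.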

Note that the generation function 32 used to map the track to a hypothesized number of frames is the same function that is used in the creation of the track. Hence, it is the generation function 32 which determines both the computation of the tracks and their alignment with speech segments during evaluation.

A key question that must be answered is how to map tokens of varying duration to a track. The fact that the same phone will have a large variability in its duration, even when spoken by the same speaker in the same context, must be accounted for in a robust manner. In consideration of durational variability, Applicants base the creation of tracks and their subsequent use on certain assumptions as follows.

Two simple contrasting assumptions that can be made concerning the durational variability of phonetic segments are:

1. The spectral dynamics involved in realizing an acoustic segment are invariant with duration. Differences in duration primarily reflect differences in speaking rate. Therefore, the trajectory followed by the acoustic attributes is the same. Generation functions which utilize this assumption are referred to as trajectory invariant generation functions. Trajectory invariant generation functions rescale the phonetic track in time, until it is of the same duration as the training or evaluation token. Trajectory invariance as defined here does not imply that the gestures themselves are invariant, only the resulting dynamics of the acoustic attributes.

2. The spectral dynamics involved in realizing an acoustic segment are not invariant with duration. Differences in duration reflect actual differences in the trajectories of the acoustic attributes. In this case, the key assumption is that the dynamics in shorter phones are identical to part of the dynamics expressed in longer phones, such as the initial, central or final portion. Generation functions which utilize this assumption are referred to as time invariant generation functions. Time invariant generation functions align all tokens for the same phone about a fixed reference point in time. Therefore, unlike the trajectory invariant functions, there is no temporal expansion or compression of the acoustic trajectory. Instead, the trajectory of the acoustic attributes through the space will vary with phone duration.
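The contrast between the two assumptions can be sketched in Python as follows. Both functions are illustrative assumptions rather than the procedures of the preferred embodiment: the first rescales the whole track in time by linear interpolation, and the second takes an unstretched portion of the track aligned at its initial, central or final point. The function names and the reference argument are hypothetical.

```python
import numpy as np


def trajectory_invariant(track, n_frames):
    """Rescale the full (M, d) track in time to n_frames by linear interpolation."""
    track = np.asarray(track, dtype=float)
    m = len(track)
    position = np.linspace(0.0, m - 1, n_frames)   # fractional state positions
    lower = np.floor(position).astype(int)
    upper = np.minimum(lower + 1, m - 1)
    weight = (position - lower)[:, None]
    return (1.0 - weight) * track[lower] + weight * track[upper]


def time_invariant(track, n_frames, reference="center"):
    """Take n_frames consecutive states about a fixed reference point, unscaled."""
    track = np.asarray(track, dtype=float)
    m = len(track)
    if n_frames > m:
        raise ValueError("segment is longer than the track")
    start = {"initial": 0,
             "center": (m - n_frames) // 2,
             "final": m - n_frames}[reference]
    return track[start:start + n_frames]
```

For a five-state track with values 0 through 4 and a three-frame segment, the trajectory invariant sketch yields 0, 2, 4 (the same trajectory traversed faster), while the time invariant sketch with a central reference yields 1, 2, 3 (only part of the trajectory).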

Trajectory invariance assumes that the trajectory through the acoustic space does not vary with the duration of a specific phonetic unit. Under this assumption, tracks of the preferred embodiment consist of a fixed sequence of vectors. Each vector is a state, and hence the track is considered to be a sequence of states that the phone is modelled as passing through. Short phones are aligned to a subset of the track states, and long phones are aligned with the same state more than once. Trajectory invariant generation functions also align observations in between states via interpolation.

The trajectory invariant generation functions determine the mapping of the track to the input token during both training (when the track is computed) and evaluation. Five alternative mapping procedures for generation function 32 are described below. In the first four procedures, each frame of the input token is utilized exactly once, both during track creation and evaluation. The fifth procedure is distinct in that data in long duration tokens is subsampled, and data in short tokens is augmented by interpolation. This allows each input token to contribute exactly one data point to each state of the track.

Table I provides pseudo-code for the trajectory invariant generation function Traj1. This method is based on a linearly interpolated mapping of a token's frames to the frames of the track. The initial and final frames of the token are always aligned with the initial and final frames of the track, with intermediate frames falling linearly between. If the token is longer than the track, the same procedure is followed, but some frames of the track are mapped to more than one frame from the token. This means that multiple frames of the token are averaged into the same track frame for longer tokens. One problem with this method is that, depending on the number of states in the track and the typical durations of the tokens it is representing, consecutive states of the track can receive disproportionate amounts of the training data due to the effects of mapping each frame to the nearest state.
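The linear frame-to-state mapping just described can be sketched in Python for a single phone model as follows. Several details are assumptions of the sketch: ties are rounded up, a one-frame token is mapped to the first state, and states that receive no frames are left at zero. The name traj1_track is hypothetical.

```python
import numpy as np


def traj1_track(tokens, track_duration):
    """Average training tokens into a (track_duration, d) track.

    tokens: list of arrays, each of shape (duration_i, d), for one phone model.
    Returns the track and the per-state frame counts.
    """
    dim = np.asarray(tokens[0]).shape[1]
    track = np.zeros((track_duration, dim))
    count = np.zeros(track_duration)
    num = track_duration - 1
    for token in tokens:
        token = np.asarray(token, dtype=float)
        den = len(token) - 1
        for j, frame in enumerate(token):
            # nearest track state; one-frame tokens go to state 0 (assumed)
            index = int(j * num / den + 0.5) if den > 0 else 0
            track[index] += frame
            count[index] += 1
    filled = count > 0                     # avoid dividing empty states by zero
    track[filled] /= count[filled][:, None]
    return track, count
```

The returned counts make the imbalance noted above visible: with a three-state track, a three-frame token and a two-frame token give the middle state one frame and each end state two.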

TABLE I--Traj1

1. FOR all phone models, .alpha.

2. Set all elements of T.sub..alpha. and count to zero

3. num=track_duration-1

4. FOR 1.ltoreq.i.ltoreq.M.sub..alpha.

(a) den=duration(i)-1

(b) FOR 0.ltoreq.j<duration(i)

i. track_index=round_to_nearest_integer(j*num/den)

ii. T.sub..alpha.(track_index)=T.sub..alpha.(track_index)+s.sub.i (j)

iii. count(track_index)=count(track_index)+1

5. FOR 0.ltoreq.j<track_duration

(a) T.sub..alpha.(j)=T.sub..alpha.(j)/count(j)

Where

track_duration is equal to a pre-specified duration (in frames) to be used for this track;

M.sub..alpha. is the number of tokens in the training set for phone model .alpha.;

Count is the vector whose e