BACKGROUND
The task of automatic speech recognition (ASR) essentially consists of
decoding a word sequence from a continuous speech signal. In order to
achieve reasonable levels of performance, past ASR systems have
constrained the permissible speech input in order to simplify the decoding
task. Typical constraints are (i) speaker dependency, i.e., training the
system for each individual speaker, (ii) word quantity, i.e., limiting the
system vocabulary to a small number of words or requiring input to be
isolated words only, and (iii) read speech (as opposed to also permitting
spontaneous speech), or some combination of (i) through (iii). Recently,
however, state-of-the-art systems have been able to achieve reasonable
performance levels for speaker independent, continuous/spontaneous speech
systems, operating with vocabularies of greater than 5,000 words.
A block diagram of the major components of a typical ASR system 10 is shown
in FIG. 1. Typically, the samples of the continuous speech signal 12 are
first processed by a signal processor 14 to form a discrete sequence of
observation vectors 18. The components of the observation vectors are the
acoustic attributes that have been chosen to represent the signal 12.
Examples of commonly chosen attributes are Discrete Fourier Transform
based spectral coefficients or auditory model parameters. Each observation
vector 18 is called a frame of speech, and the sequence of T frames forms
the signal representation, O={o.sub.1, o.sub.2, . . . , o.sub.T }.
Acoustic and language models 20, 22 are then used to score the frame
sequence O, search a lexicon and hypothesize word sequences. The models
20, 22, search and scoring procedure 24 are highly implementation
dependent.
As the number of words in the lexicon 26 becomes large, the task of
training individual word models becomes prohibitive. Consequently an
intermediate level of representation is generally used. A common
representation involves describing the pronunciation of a word in terms of
phonemes. A phoneme is an abstract linguistic unit. Changing a phoneme
changes the meaning of a word. For example, if the phoneme /p/ in the word
"pit" is changed to a /b/, the word becomes "bit". A small number of
phonemes can be used to describe all the words in a given language
(English consists of roughly 40 phonemes). By representing word
pronunciations as a sequence of phonemes, the number of acoustic models
and the required training data can be drastically reduced.
Phonemes can be realized in a variety of acoustically distinct manners
depending on the phonetic context (e.g., syllable position, neighboring
phones), the stress, the speaker, and other factors. The actual acoustic
realization of a phoneme is known as a phone. This distinction between a
phoneme and a phone is an important one. The different acoustic
realizations of the same phoneme do not affect the meaning of a word. An
example of this often occurs in the word "butter" where the phoneme /t/ is
frequently realized in American English as a "flap" (a particular phone).
The acoustic variability that can occur when realizing the same phoneme is
part of what makes the task of identifying a phoneme so challenging. The
standard distinction is to utilize / / to indicate a phoneme and [ ] to
indicate a phone.
The acoustic models are generally trained to recognize some set of phones
(the exact set being a design decision). The task of decoding a phone
sequence is known as "phonetic recognition," and the resulting output is
known as a phonetic transcription. The phonetic transcription may or may
not be mapped to a string of phonemes, but regardless, it is of fundamental
importance to the ASR task since it is the foundation upon which the word
string search is based. Virtually all modern, state-of-the-art speech
systems utilize phonetic models as a basis for recognition.
Phonetic recognition methods tend to fall into two categories. The first,
and most widely used, is "frame" based. Each observation frame in the
sequence O={o.sub.1, . . . , o.sub.T } receives a score for each phonetic
model in the system. There is no presegmentation of the signal into larger
units. An example of a frame-based phonetic recognition method is the
Hidden Markov Model (HMM). An HMM consists of a set of states connected
to each other via transition probabilities. While occupying a state,
observations are generated randomly from a probability density function.
The transition probabilities and output distributions together constitute
an HMM model. The key assumption inherent in an HMM is that the
observations are independent, given the state sequence up to the current
time.
Thus HMM's handle certain temporal aspects of the speech problem in an
elegant manner. The variability of durations over a phone training set is
handled automatically by the fact that the state sequence can be as long
or short as necessary. Another advantage of the HMM approach is that it
does not require an explicit temporal alignment, or segmentation, of the
speech signal. Since each frame in an utterance receives its own score,
the likelihood scores for alternative segmentations can be directly
compared to each other. The alignment which results in the best score for
the entire utterance is then chosen. Finally, an efficient technique, the
Baum-Welch reestimation algorithm, exists for training HMM's.
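As a rough illustration of how a frame-based system chooses the best-scoring alignment, the sketch below finds the most likely state sequence by dynamic programming (commonly called the Viterbi algorithm). It assumes the per-frame log output scores have already been computed. The names viterbi, log_init, log_trans and log_obs are illustrative only and do not come from the text.

```python
import math

def viterbi(log_init, log_trans, log_obs):
    """Return (best_path, best_log_score) for one utterance.

    log_init[s]     : log probability of starting in state s
    log_trans[i][j] : log probability of moving from state i to state j
    log_obs[t][s]   : log output density of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for j in range(n_states):
            # Best previous state i for arriving in state j at frame t.
            i_best = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            pointers.append(i_best)
            new_score.append(score[i_best] + log_trans[i_best][j] + log_obs[t][j])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    best = score[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1], best
```

Because every frame contributes one term to the total, the scores of two different alignments of the same utterance are directly comparable, which is the property the text attributes to frame-based methods.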
In HMM's, temporal correlations are represented implicitly through the
statistics of the state sequence and are not modelled explicitly. However,
it has been demonstrated that significant temporal correlations do exist.
See V. Digilakis, "Segment-Based Stochastic Models of Spectral Dynamics
for Continuous Speech Recognition", Ph. D. Thesis, Boston University,
1992. Also see W. Goldenthal and J. Glass, "Modelling Spectral Dynamics
for Vowel Classification," Proc. Eurospeech 93, pp. 289-292, Berlin,
Germany, (September 1993), incorporated herein by reference.
There have also been attempts to explicitly model the dynamics of the
acoustic attributes within an HMM framework. Generally this has been done
with some success, by incorporating first (and possibly second) order
differences of the acoustic parameters in the observation vector. Other
approaches are segmental HMM's proposed by Russell and Marcus and
state-conditioned trend functions used by Deng. See "A Segmental HMM for
Speech Pattern Modelling", by M. Russell in Proceedings of the ICASSP 93,
pages 499-502, Minneapolis, Minn. April 1993; "Phonetic Recognition in a
Segment-Based HMM" by J. Marcus in Proceedings of the ICASSP 93, pages
479-482 Minneapolis, Minn. April 1993; and "A Generalized Hidden Markov
Model With State-Conditioned Trend Functions of Time for the Speech
Signal" by L. Deng, Signal Processing 27, Vol. 1, pages 65-78 April 1992.
None of these approaches have gained general acceptance within the
community or been shown to generate results superior to more traditional
HMM's.
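The first-order differences mentioned above can be sketched as follows. This is a minimal illustration, not the method of any cited system: the function name add_deltas and the edge handling (repeating the first and last frames) are assumptions, since the text does not specify them.

```python
def add_deltas(frames):
    """Append first-order differences to each observation vector.

    frames: list of equal-length lists of floats, one per frame.
    The difference at frame t is taken as frames[t+1] - frames[t-1],
    with the first and last frames repeated at the edges (an assumed
    convention; the text does not specify one).
    """
    n_frames = len(frames)
    out = []
    for t in range(n_frames):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n_frames - 1)]
        delta = [b - a for a, b in zip(prev, nxt)]
        out.append(list(frames[t]) + delta)
    return out
```

Each output vector is twice as long as the input vector, so the dynamics are carried inside the observation rather than modelled across frames.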
A second type of phonetic recognition method involves a "segment" based
approach. These methods hypothesize start and end times of larger units
within the speech signal which generally represent individual phonetic
units of speech. An example of a segment-based method is the Stochastic
Segment Model (SSM). SSM's are a segment-based approach that attempts to
both model the spectral dynamics of a phonetic unit and to capture the
temporal correlation within a phonetic segment. However, SSM's impose a
very high dimensionality on the Gaussian probability density functions
used to estimate the correlations (on the order of 112 to 140). As a
consequence, no implementation of this method has yet successfully
incorporated the temporal correlation information. In fact, an
implementation utilizing only the temporal correlations performed slightly
worse than an implementation which assumed complete statistical
independence. See S. Roucos, M. Ostendorf, H. Gish, A. Derr, "Stochastic
Segment Modelling Using the Estimate-Maximize Algorithm", in Proceedings
ICASSP 88, pages 127-130, April 1988.
As between segment-based and frame-based methods, segment-based systems
offer the potential advantage of being able to accurately capture segment
level dynamics as well as directly modelling temporal correlations within
the segment. Also, segment level features, such as segment duration, are
easily incorporated. The advantage of a frame-based system is that each
frame receives its own score and the scores for different transcription
candidates are directly comparable. In a segment-based framework, it can
be difficult to compare utterance likelihoods which propose different
numbers of segments. Also, a frame-based system tends to have a
computational advantage since the segmentation step does not have to be
explicitly performed.
Further, other methods for phonetic recognition include template-based
approaches, statistical approaches and more recently approaches based on
dynamic modeling and neural networks. A recurrent error propagation neural
network approach has been used with the TIMIT speech corpus. See T.
Robinson, "Several Improvements to a Recurrent Error Propagation Phone
Recognition System", Technical Report CUED/TINFENG/TR. 82, 1991. An
inherent drawback of neural networks is the large amount of time needed to
train the models.
SUMMARY OF THE PRESENT INVENTION
The present invention overcomes many of the problems and disadvantages of
the prior art. In particular, the present invention provides improved
phonetic recognition in an automatic speech recognition system, or any
other system which utilizes phonetic transcription. The present invention
specifically provides improved acoustic models.
The phonetic recognition method of the present invention is both
template-based and statistically based. The templates are used to capture dynamic
characteristics at the segment level, and the statistics measure the
spatial (meaning within the parameter space) and temporal correlations of
the errors.
In particular, the present invention generates a dynamic representation of
a phonetic unit, called a "track". The present invention also generates a
statistical model of the error when a track is compared to a phonetic
segment. This in effect creates a dynamic trajectory of the acoustic
attributes (or measurements) used to represent the speech signal, and the
incorporation of the temporal correlations into a statistical model for
each phonetic unit. As mentioned above, HMM's are not able to
explicitly model the temporal correlations. The approach of the present
invention provides a vehicle for modelling these correlations.
In the preferred embodiment, speech recognition apparatus of the present
invention decodes an input speech signal to a corresponding speech unit
(e.g. phonetic unit) in a digital processor as follows. A plurality of
unit templates is provided. Each unit template represents acoustic
attributes of a respective speech unit such as a phonetic unit or a string
of phonetic units. In addition, each unit template generates a respective
target speech unit or a synthetic segment. Processor means then compares
the synthetic segments/target speech units to portions of the input speech
signal to define a set of error sequences. The processor means generates
therefrom a plurality of error models, one for each unit template. Each
error model represents the temporal and spatial correlations in the error
sequences defined between the synthetic segments and input speech signal.
Based on the error models, a determination is made of the corresponding
speech unit of the input speech signal. In particular, the respective
speech unit of the unit template corresponding to the best or most likely
error model (e.g. the one with greatest probability) is the transcription
or translation of the input speech signal.
According to one aspect of the present invention, the unit templates employ
a generation function to generate the target speech units or synthetic
segments. In addition, the generation function is used to initially form
each unit template.
In a preferred embodiment of the present invention, each error model is
formed from a probability density function, such as a joint Gaussian
probability density function. In addition, each error sequence is
normalized to a fixed dimension before the processor means generates the
error models. Preferably each error sequence is normalized by averaging.
According to another feature of the present invention, the plurality of
unit templates includes transition unit templates. The transition unit
templates represent transitions between speech/phonetic units within a
speech signal. Further, the transition unit templates provide an
indication of either location of a transition in the input speech signal,
or the speech units involved in the transition or both.
According to another aspect of the present invention, a combination of unit
templates is used to form a multiplicity of merged templates. The merged
templates account for contextual effects on the respective speech units of
the initial unit templates.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, features and advantages of the invention
will be apparent from the following more particular description of
preferred embodiments of the invention, as illustrated in the
accompanying drawings in which like reference characters
refer to the same parts throughout the different views. The drawings are
not necessarily to scale, emphasis instead being placed upon illustrating
the principles of the invention.
FIG. 1 is a block diagram of an automatic speech recognition system of the
type in which an embodiment of the present invention may be employed.
FIG. 2 is a schematic diagram of one embodiment of the present invention.
FIG. 3 is a schematic diagram of a track and error model pair in the
embodiment of FIG. 2.
FIGS. 4A-4D are graphs illustrating track alignment of each of the Cepstral
coefficients C0-C3 employed in the embodiment of FIG. 2.
FIG. 5 is an illustration of a matrix of error correlation coefficients
employed by the present invention.
FIG. 6 is a graph of the distance between transition tracks in the
clustering processes of an alternative feature of the present invention.
FIG. 7 is an illustration of a portion of an acoustic attribute partitioned
into segments.
FIG. 8 is an illustration of a Viterbi search path employed by the search
component of the embodiment of FIG. 2.
FIG. 9 is a table of the phone classes employed in the alternative feature
of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
By way of background, speech is produced by the coordinated manipulation of
a set of articulators, including the tongue, lips, jaw, vocal folds, and
the velum. The speaker-dependent characteristics of the articulators and
the vocal tract can cause a large amount of acoustic variability in the
realization of the same phoneme sequence. The speaker's environment, mood,
health, and prosody (pitch and emphasis) can all affect the acoustic
realization of a phonemic sequence. In addition to these speaker-dependent
effects, the phonemic context influences the motion of the articulators
and the resulting acoustic output. It is frequently unclear where one
phonetic segment ends and the next begins. The overlapping of phonetic
segments stems from overlap in adjacent articulatory gestures. This
phenomenon is known as co-articulation, and causes large variations in the
acoustic realization of a phoneme.
Despite the high degree of variability in the speech signal, there exists
much that is consistent both within a phonetic unit and across an
utterance. This consistency is what makes spoken communication so robust.
A given phone generally has a configuration of the articulators or target
position associated with it. Whether or not the target position is
reached, there tend to exist intervals of speech which are predominantly
representative of a particular phone. Despite differences among different
speakers' physical characteristics, the articulators will share similar
relative motions when realizing the same phone. This similarity in the
dynamics of the articulators generally translates into similar dynamics in
the acoustic attributes of the phone.
Therefore, the applicants have discovered that the trajectories of the
acoustic attributes share similar dynamic characteristics for a given
sequence of phones as the articulators move through a similar sequence of
gestures. The greater the similarity of the phonetic context, the greater
the similarity of the motion of the acoustic attributes.
Statistical models of the phonetic units have historically provided a
robust method for dealing with the variability between speakers. These
statistical models may provide correlation information between the
acoustic attributes at a specific time, and over a specified time
interval. The applicants have found that the temporal correlation
information can provide a means for accounting for the fact that the same
vocal tract is producing the entire phonetic sequence in an utterance.
These temporal correlations in the speech signal are not modeled directly
in most prior art implementations. The most popular current method, the HMM
(discussed above), is only able to model these correlations indirectly.
The present invention demonstrates the importance of the temporal
correlations and constructs models which utilize them effectively.
Turning now to the particulars, the present invention operates in an automatic
speech recognition system 40 such as that depicted in FIG. 2 (and similar
to that of FIG. 1). As noted earlier, the continuous speech (input) signal
is digitally sampled and then processed via a temporal and/or spectral
analysis into a sequence of observation frames. In the preferred
embodiment, the input signal 12a is preprocessed by signal preprocessor 16
(FIG. 2) as follows. The signal representation 18a to be generated and
used throughout the present invention consists of the Mel-frequency
cepstral coefficients (MFCC's) described by P. Mermelstein and S. Davis
"Comparison of Parametric Representations for Monosyllabic Word
Recognition in Continuously Spoken Sentences", IEEE Trans. ASSP, Vol. 23
No. 1, pages 67-72 (February 1975) incorporated herein by reference. These
coefficients are based on the short-time Fourier transform of the speech
signal 12a. The cepstrals provide a high degree of data reduction over
using values of the power spectral density directly, since the power
spectrum at each frame is represented using relatively few parameters.
The key steps in producing the MFCC's are:
1. Analog-to-digital conversion of the continuous speech waveform 12a into digitized
samples. Preferably the sample frequency is 16 kHz.
2. The digitized signal is then pre-emphasized via first differencing to
reduce the effects of spectral tilt.
3. The digitized samples are blocked or rectangularly windowed into frames.
The frames are typically on the order of 25 or 30 ms.
4. The frames are windowed using a Hamming, Hanning or other common window
known in the art, to reduce the effects of assuming the signal 12a is zero
outside the boundaries of the frame. In the preferred embodiment, a
Hamming window of duration 25.6 ms is used.
5. The frames are computed using a fixed rate moving window at increments
of 5 to 15 ms. Preferably, 5 ms increments or 200 frames per second are
used. Hence, there is a large degree of overlap between frames. The idea
is that the signal 12a can be considered quasi-stationary within a frame.
6. A 256 point (for example) Discrete Fourier Transform is then computed
for each frame. Other types of transform-based or similar processing,
common in the art, are suitable.
7. The Fourier transform coefficients are squared, and the resulting
squared magnitude spectrum is passed through a set of 40 overlapping
Mel-frequency triangular filter banks. The log energy outputs of
these filters collectively form the 40 Mel-frequency spectral coefficients
(MFSC), X.sub.j, j=1,2, . . . 40.
8. A cosine transform of the MFSC's is then used to generate the 15 MFCC's
which are the acoustic attributes used in the present invention. The
Mel-frequency filters consist of thirteen triangles spread evenly on a
linear frequency scale from 130 Hz to 1 kHz, and 27 triangles evenly
distributed on a logarithmic scale from 1 kHz to 6.4 kHz. Since the
bandwidths of the triangular filters increase with center frequency, the
area of each filter is normalized to avoid amplifying the higher frequency
coefficients. The cosine transform which yields the MFCC, C.sub.i, i=0,1,2
. . . , 14, from the MFSC is:
$$ C_i = \sum_{j=1}^{40} X_j \cos\left[ i \left( j - \frac{1}{2} \right) \frac{\pi}{40} \right] $$
Note that the lowest cepstral coefficient, C.sub.0, is a summation of the
log energy from each filter. Therefore, it is indirectly related to the
amount of energy in a frame.
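Parts of the front end described in the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the preferred embodiment: the function names are hypothetical, the pre-emphasis coefficient of 1.0 is one reading of "first differencing", the Mel filter bank itself is omitted, and the cosine transform is assumed to take the Davis and Mermelstein form, which is consistent with C.sub.0 being the sum of the log energies.

```python
import math

def pre_emphasize(samples, coeff=1.0):
    """First difference y[n] = x[n] - coeff * x[n-1] (coeff=1.0 is an assumption)."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - coeff * samples[n - 1]
                           for n in range(1, len(samples))]

def hamming_frames(samples, frame_len, hop):
    """Cut overlapping frames and apply a Hamming window to each."""
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames

def mfsc_to_mfcc(mfsc, n_cepstra=15):
    """Cosine transform of log filter-bank energies X_1..X_J into C_0..C_14.

    Assumes the form C_i = sum_j X_j cos(i (j - 1/2) pi / J).
    """
    n_filters = len(mfsc)
    return [sum(x * math.cos(i * (j - 0.5) * math.pi / n_filters)
                for j, x in enumerate(mfsc, start=1))
            for i in range(n_cepstra)]
```

At a 16 kHz sampling rate, a 5 ms increment corresponds to a hop of 80 samples and a 25.6 ms window to roughly 410 samples. With a constant input of 40 log energies, the zeroth coefficient is simply their sum and the first coefficient vanishes by symmetry.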
Once the signal representation 18a has been generated from the digital
signal processor 16, a search component 24a employs the acoustic model 30
of the present invention to incorporate dynamical models of the acoustic
spectra into the phonetic recognition task as follows. First, the acoustic
model 30 of the present invention determines a means of mapping a phone's
(or a given unit of speech's) variable duration tokens onto a fixed length
track. A track is defined to be a trajectory or temporal evolution of the
acoustic attributes (or measurements) over a segment. That is, the purpose
of the track is to accurately represent and account for the dynamic
behavior of the acoustic attributes (or measurements) over the duration of
a phone. A track consists of and is represented by a sequence of M state
vectors T={t.sub.1, . . . , t.sub.M } which are used as the basis for
generating a variable duration synthetic segment:
G=f(T,N)={g.sub.1, . . . , g.sub.N }
for any number of frames N where f() is a generation function. To that end,
the tracks serve as a template for the units of speech (e.g. phones) they
are modelling and capture segment-level spectral dynamics.
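One possible generation function G=f(T,N) is sketched below. It is only an illustration: linear interpolation between neighboring track states is an assumption made here for concreteness, and the name generate_segment is hypothetical.

```python
def generate_segment(track, n_frames):
    """Map an M-state track onto n_frames frames by linear interpolation.

    track: list of M state vectors (lists of floats), M >= 1.
    Returns a list of n_frames vectors (the synthetic segment G).
    """
    n_states = len(track)
    if n_frames == 1 or n_states == 1:
        # Degenerate cases: use the middle state for every frame.
        return [list(track[(n_states - 1) // 2]) for _ in range(n_frames)]
    segment = []
    for n in range(n_frames):
        pos = n * (n_states - 1) / (n_frames - 1)  # fractional track position
        lo = min(int(pos), n_states - 2)
        frac = pos - lo
        segment.append([(1.0 - frac) * a + frac * b
                        for a, b in zip(track[lo], track[lo + 1])])
    return segment
```

The same track can therefore be stretched or compressed to any hypothesized segment length N while the first and last frames stay aligned with the first and last states.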
After a track is computed from the training tokens for a particular phone,
the same tokens are used to generate an error model EM based on the
differences between synthetic segments generated from the track and the
training tokens. The error model (EM) is then processed to determine
the identity of the speech segment. As such, the purpose of the error model is
to represent the correlations, both temporal and spatial, that exist in
the errors between the synthetic segments and the input tokens. The error
model (EM) consists of a probability density function which is used to
compute the likelihood scores used for phonetic classification. The error
models in the preferred embodiment are jointly Gaussian probability
density functions.
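A jointly Gaussian error model of this kind can be sketched as follows. The sketch assumes that each error sequence has already been reduced to a vector of fixed length; the function names and the small diagonal floor added to the covariance are assumptions made for illustration.

```python
import math

def fit_gaussian(vectors, floor=1e-6):
    """Estimate mean and full covariance from fixed-length error vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(d)]
    cov = [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in vectors) / n
            for b in range(d)] for a in range(d)]
    for k in range(d):
        cov[k][k] += floor  # small diagonal floor keeps the matrix invertible
    return mean, cov

def cholesky(cov):
    """Lower-triangular L with L L^T = cov (cov must be positive definite)."""
    d = len(cov)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def log_likelihood(error, mean, cov):
    """Log density of one error vector under the joint Gaussian."""
    d = len(mean)
    low = cholesky(cov)
    # Solve low * z = (error - mean) by forward substitution.
    z = []
    for i in range(d):
        r = error[i] - mean[i] - sum(low[i][k] * z[k] for k in range(i))
        z.append(r / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))
```

The off-diagonal covariance terms are what carry the temporal and spatial correlations of the errors. For classification, the score of a segment would be computed under each phone's error model and the phone with the greatest likelihood selected.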
The track T and its associated statistical error model EM form a baseline
model for each phonetic unit (i.e. form a phonetic model 38). Although the
baseline (phonetic) model 38 provides a robust general characterization of
the phonetic unit it represents, details attributable to phonetic context
and speaker dependencies tend to be "averaged out". That is, since the
track represents the phone in all contexts, it tends not to contain
contextual information which is critical to enhancing model accuracy due
to co-articulation. One means to address this problem is to create
context-dependent tracks. Another is to specifically model the transition
dynamics between phonemes. Both of these approaches are discussed in
detail below.
It is important to distinguish between phonetic recognition and phonetic
classification. In phonetic classification, the segmentation boundaries
and utterance are known, and the task is to correctly classify each
segment. In phonetic recognition, the segment boundaries are not known. As
a result, insertion and deletion errors are possible along with
substitution errors (i.e., misclassification).
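The three error types can be counted by aligning a hypothesized phone string against a reference string. The sketch below uses a standard minimum edit distance alignment with unit costs. The unit costs and the name count_errors are assumptions for illustration, not a scoring procedure taken from the text.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimum-cost alignment."""
    n_ref, n_hyp = len(reference), len(hypothesis)
    # cost[i][j] = (total, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (n_hyp + 1) for _ in range(n_ref + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n_ref + 1):
        cost[i][0] = (i, 0, i, 0)        # everything deleted
    for j in range(1, n_hyp + 1):
        cost[0][j] = (j, 0, 0, j)        # everything inserted
    for i in range(1, n_ref + 1):
        for j in range(1, n_hyp + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                match = (t, s, d, n)
            else:
                match = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            delete = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insert = (t + 1, s, d, n + 1)
            cost[i][j] = min(match, delete, insert)
    return cost[n_ref][n_hyp][1:]
```

In a classification task the boundaries are given, so only substitutions can occur. In recognition, a missed boundary shows up as a deletion and a spurious boundary as an insertion.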
A classification scheme which is compatible with the above components may
be incorporated into the phonetic recognition task of the present
invention. To that end, segmentation would be provided using existing
methodologies common in the art, and an overall evaluation of the dynamic
modelling approach of the present invention would be performed.
The foregoing components of FIG. 2 are implemented in computer code
generally executed on a computer processor such as a VAX or similar
computer/digital processing system. For purposes of illustration and not
limitation, FIG. 2 depicts the search component 24a, present invention
acoustic model 30 and associated parts operating in processor (memory) 28.
Other computer configurations (in hardware, software or both) are in the
purview of one skilled in the art.
In particular, a phonetic model 30 (and supporting track and error model
pairs 38) of the present invention are implemented as follows and
illustrated in FIG. 3.
Tracks T.sub..alpha. are computed from training data by mapping the
training tokens for each phone to a sequence of M states. Each state is
recorded as a vector, the sequence of vectors forming the track. The
mapping function is known as a generation function f. When all the tokens
in the training set for a particular phone have been mapped, the
phone-dependent track is calculated from the maximum likelihood estimate
of each state.
Once the tracks have been created, they serve as the initial stage in
evaluating hypothesized speech segments. As shown in FIG. 3, to evaluate
an N frame speech segment, S, a synthetic segment, G is generated. The
generation function f (at 32), is used to compute the mapping from the M
state track to the N frame synthetic segment 34. That is, for each state
of track T, the generation function 32 aligns a data point from the frame
values (stretched or compressed) and generates a template or synthetic
segment G. The synthetic segment G produced by the generation function 32
is then compared directly to the N frame acoustic segment S to form an
error sequence E as follows:
E=S-G={e.sub.1, . . . e.sub.N }
where e.sub.i =s.sub.i -g.sub.i. See step 36 in FIG. 3. The error sequence
is subsequently used to formulate the error model EM of the phonetic model
30 of FIG. 2.
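The error computation E=S-G, followed by a reduction of the variable-length error sequence to a fixed dimension, can be sketched as follows. The averaging scheme shown (splitting the N frames into a fixed number of roughly equal parts) is an assumption chosen for illustration, and both function names are hypothetical.

```python
def error_sequence(segment, synthetic):
    """Frame-by-frame difference E = S - G between two N-frame sequences."""
    if len(segment) != len(synthetic):
        raise ValueError("S and G must have the same number of frames")
    return [[s - g for s, g in zip(s_frame, g_frame)]
            for s_frame, g_frame in zip(segment, synthetic)]

def normalize_by_averaging(errors, n_parts):
    """Average the error frames within n_parts roughly equal sub-segments
    and concatenate the averages into one fixed-length vector."""
    n_frames, dim = len(errors), len(errors[0])
    vector = []
    for q in range(n_parts):
        start = q * n_frames // n_parts
        stop = max((q + 1) * n_frames // n_parts, start + 1)  # never empty
        part = errors[start:stop]
        vector.extend(sum(e[k] for e in part) / len(part) for k in range(dim))
    return vector
```

The output always has n_parts times the frame dimension entries, whatever the segment duration, so a single probability density function can score segments of any length.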
Note that the generation function 32 used to map the track to a
hypothesized number of frames is the same function that is used in the
creation of the track. Hence, it is the generation function 32 which
determines both the computation of the tracks and their alignment with
speech segments during evaluation.
A key question that must be answered is how to map tokens of varying
duration to a track. The fact that the same phone will have a large
variability in its duration, even when spoken by the same speaker in the
same context, must be accounted for in a robust manner. In consideration
of durational variability, Applicants base the creation of tracks and
their subsequent use on certain assumptions as follows.
Two simple contrasting assumptions that can be made concerning the
durational variability of phonetic segments are:
1. The spectral dynamics involved in realizing an acoustic segment are
invariant with duration. Differences in duration primarily reflect
differences in speaking rate. Therefore, the trajectory followed by the
acoustic attributes is the same. Generation functions which utilize this
assumption are referred to as trajectory invariant generation functions.
Trajectory invariant generation functions rescale the phonetic track in
time, until it is of the same duration as the training or evaluation
token. Trajectory invariance as defined here does not imply that the
gestures themselves are invariant, only the resulting dynamics of the
acoustic attributes.
2. The spectral dynamics involved in realizing an acoustic segment are not
invariant with duration. Differences in duration reflect actual
differences in the trajectories of the acoustic attributes. In this case,
the key assumption is that the dynamics in shorter phones is identical to
part of the dynamics expressed in longer phones, such as the initial,
central or final portion. Generation functions which utilize this
assumption are referred to as time invariant generation functions. Time
invariant generation functions align all tokens for the same phone about a
fixed reference point in time. Therefore, unlike the trajectory invariant
functions, there is no temporal expansion or compression of the acoustic
trajectory. Instead, the trajectory of the acoustic attributes through the
space will vary with phone duration.
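The contrast between the two assumptions can be made concrete by looking at which track state each frame of an N-frame token is aligned with. The sketch below is one interpretation only: the function names, the nearest-state rounding, and the choice of start, center or end as the fixed reference point are assumptions.

```python
def trajectory_invariant_indices(n_frames, n_states):
    """Rescale in time: the first and last frames land on the first and
    last states, with intermediate frames spread linearly between them."""
    if n_frames == 1:
        return [(n_states - 1) // 2]
    return [int(n * (n_states - 1) / (n_frames - 1) + 0.5)
            for n in range(n_frames)]

def time_invariant_indices(n_frames, n_states, anchor="center"):
    """No rescaling: frames keep their spacing and are placed about a fixed
    reference point on the track (start, center, or end). Frames that fall
    outside the track are reported as None."""
    if anchor == "start":
        offset = 0
    elif anchor == "end":
        offset = n_states - n_frames
    else:
        offset = (n_states - n_frames) // 2
    return [n + offset if 0 <= n + offset < n_states else None
            for n in range(n_frames)]
```

Under the first function a three-frame token covers the whole of a five-state track by skipping states. Under the second, the same token occupies only three consecutive states, so a short phone reproduces just part of the dynamics of a long one.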
Trajectory invariance assumes that the trajectory through the acoustic
space does not vary with the duration of a specific phonetic unit. Under
this assumption, tracks of the preferred embodiment consist of a fixed
sequence of vectors. Each vector is a state, and hence the track is
considered to be a sequence of states that the phone is modelled as
passing through. Short phones are aligned to a subset of the track states,
and long phones are aligned with the same state more than once. Trajectory
invariant generation functions also align observations in between states
via interpolation.
The trajectory invariant generation functions determine the mapping of the
track to the input token during both training (when the track is computed)
and evaluation. Five alternative mapping procedures for generation
function 32 are described below. In the first four procedures, each frame
of the input token is utilized exactly once, both during track creation
and evaluation. The fifth procedure is distinct in that data in long
duration tokens is subsampled, and data in short tokens is augmented by
interpolation. This allows each input token to contribute exactly one data
point to each state of the track.
Table I provides pseudo-code for the trajectory invariant generation
function Traj1. This method is based on a linearly interpolated mapping of
a token's frames to the frames of the track. The initial and final frames
of the token are always aligned with the initial and final frames of the
track with intermediate frames falling linearly between. If the token is
longer than the track, the same procedure is followed, but some frames of
the track are mapped to more than one frame from the token. This means
that multiple frames of the token are averaged into the same track frame
for longer tokens. One problem with this method is that, depending on the
number of states in the track, and the typical durations of the tokens it
is representing, consecutive states of the track can receive
disproportionate amounts of the training data due to the effects of
mapping the frame to the nearest state.
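The Traj1 mapping described above can be sketched in Python as follows. This is a hedged illustration rather than the code of the preferred embodiment: the name traj1_track is hypothetical, and the handling of single-frame tokens and of states that receive no frames is an assumption, since a literal reading of the procedure would divide by zero in both cases.

```python
def traj1_track(tokens, track_duration):
    """Estimate a track by averaging frames mapped linearly onto track states.

    tokens: list of tokens, each a list of frames (lists of floats).
    track_duration: pre-specified number of states in the track.
    Returns (track, count), where count[j] is the number of frames
    averaged into state j.
    """
    dim = len(tokens[0][0])
    track = [[0.0] * dim for _ in range(track_duration)]
    count = [0] * track_duration
    num = track_duration - 1
    for token in tokens:
        den = len(token) - 1
        for j, frame in enumerate(token):
            # Single-frame tokens go to the middle state (an assumption).
            idx = num // 2 if den == 0 else int(j * num / den + 0.5)
            for k in range(dim):
                track[idx][k] += frame[k]
            count[idx] += 1
    for j in range(track_duration):
        if count[j]:  # states that received no frames stay at zero
            track[j] = [v / count[j] for v in track[j]]
    return track, count
```

Returning the counts alongside the track makes the imbalance noted above visible: a five-frame token mapped onto a two-state track contributes two frames to the first state and three to the second.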
TABLE I--Traj1
1. FOR all phone models, .alpha.
2.   Set all elements of T.sub..alpha. and count to zero
3.   num = track_duration - 1
4.   FOR 1 <= i <= M.sub..alpha.
     (a) den = duration(i) - 1
     (b) FOR 0 <= j < duration(i)
         i.   track_index = round_to_nearest_integer(j * num / den)
         ii.  T.sub..alpha. (track_index) = T.sub..alpha. (track_index) + (frame j of token i)
         iii. count(track_index) = count(track_index) + 1
5.   FOR 0 <= j < track_duration
     (a) T.sub..alpha. (j) = T.sub..alpha. (j) / count(j)
Where
track_duration is equal to a pre-specified duration (in frames) to
be used for this track;
M.sub..alpha. is the number of tokens in the training set for phone model
.alpha.;
Count is the vector whose e | | |