Description
BACKGROUND AND SUMMARY OF THE INVENTION
Speaker-independent continuous speech recognition is ideal for man/machine
communication. However, the state-of-the-art modeling techniques still limit
the decoding accuracy of such systems. An inherent difficulty in
statistical modeling of speaker-independent continuous speech is that the
spectral variations of each phone unit come not only from allophone
contextual dependency, but also from the acoustic and phonologic
characteristics of individual speakers. These speaker variation factors
make the speaker-independent models less effective than speaker-dependent
ones in recognizing individual speakers' speech.
In order to improve speaker-independent continuous speech recognition, it
is of great interest to incorporate efficient learning mechanisms into
speech recognizers, so that speaker adaptation can be accomplished while a
user uses the recognizer and so that decoding accuracy can be gradually
improved to that of speaker-dependent recognizers.
In the parent application, of which this application is a
continuation-in-part, a speaker adaptation technique based on the
decomposition of spectral variation sources is disclosed. The technique
has achieved significant error reductions for a speaker-independent
continuous speech recognition system, where the adaptation requires short
calibration speech from both the training and test speakers. The current
work extends this adaptation technique into the paradigm of self-learning
adaptation, i.e. no adaptation speech is explicitly required from the
speaker, and the spectral characteristics of a speaker are learned via
statistical methods from the incoming speech utterances of the speaker
during his normal usage of the recognizer.
RELATED ART
Reference may be had to the following literature for a more complete
understanding of the field to which this invention relates.
S. J. Cox and J. S. Bridle (1989), "Unsupervised Speaker Adaptation by
Probabilistic Fitting," Proc. ICASSP, Glasgow, Scotland, May 1989, pp.
294-297.
M. H. Degroot (1970), Optimal Statistical Decisions, (McGraw-Hill Inc.)
A. P. Dempster, N. M. Laird, D. B. Rubin (1977), "Maximum Likelihood
Estimation From Incomplete Data Via the EM Algorithm," J. Royal
Statistical Society, B 39, No. 1, pp. 1-38.
S. Furui (1989), "Unsupervised Speaker Adaptation Method Based on
Hierarchical Spectral Clustering," Proc. ICASSP, Glasgow, Scotland, May
1989, pp. 286-289.
H. Hermansky, B. A. Hanson, H. J. Wakita (1985), "Perceptually Based Linear
Predictive Analysis of Speech," Proc. ICASSP, Tampa, Fla., March 1985, pp.
509-512.
M. J. Hunt (1981), "Speaker Adaptation for Word Based Speech Recognition
Systems," J. Acoust. Soc. Am., 69:S41-S42.
L. F. Lamel, R. H. Kassel, S. Seneff (1986), "Speech Database Development:
Design and Analysis of the Acoustic-Phonetic Corpus," Proc. of Speech
Recognition Workshop (DARPA).
C.-H. Lee, C.-H. Lin, B.-H. Juang (1990), "A Study on Speaker Adaptation of
Continuous Density HMM Parameters," Proc. ICASSP, Minneapolis, Minn.,
April 1990, pp. 145-148.
C.-H. Lee and Jean-L. Gauvain (1993), "Speaker Adaptation Based on MAP
Estimation of HMM Parameters," Proc. ICASSP, Minneapolis, Minn., April
1993, pp. 558-561.
K. Ohkura, M. Sugiyama, S. Sagayama (1992), "Speaker Adaptation Based on
Transfer Vector Field Smoothing With Continuous Mixture Density HMMs,"
Proc. of ICSLP, Banff, Canada, October 1992, pp. 369-372.
D. B. Paul and B. F. Necioglu (1993), "The Lincoln Large-Vocabulary
Stack-Decoder HMM CSR," Proc. ICASSP, Vol. II, Minneapolis, Minn., April
1993, pp. 660-664.
K. Shinoda, K. Iso, T. Watanabe (1991), "Speaker Adaptation for
Demi-Syllable Based Continuous Density HMM," Proc. of ICASSP, Toronto,
Canada, May 1991, pp. 857-860.
Y. Zhao, H. Wakita, X. Zhuang (1991), "An HMM Based Speaker-Independent
Continuous Speech Recognition System With Experiments on the TIMIT
Database," Proc. ICASSP, Toronto, Canada, May 1991, pp. 333-336.
Y. Zhao (1993a), "A Speaker-Independent Continuous Speech Recognition
System Using Continuous Mixture Gaussian Density HMM of Phoneme-Sized
Units," IEEE Trans. on Speech and Audio Processing, Vol. 1, No. 3, Jul.
1993 pp. 345-361.
Y. Zhao (1993b), "Self-Learning Speaker Adaptation Based on Spectral
Variation Source Decomposition," Proc. EuroSpeech '93, Berlin, Germany,
September 1993, pp. 359-362.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram illustrating how normalization of speaker
acoustic characteristics is performed in a system using short calibration
speech;
FIG. 2 is a block diagram illustrating how phone model adaptation is
performed in the system of FIG. 1; and
FIG. 3 is a block diagram illustrating the presently preferred embodiment
of a self-learning speaker-independent, continuous speech recognition
system according to the invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The speech system of the invention is capable of adapting to the voice
characteristics of a given speaker q using only a very short utterance of
calibration speech from the speaker. This is made possible by an initial
acoustic normalization and subsequent phone model adaptation. FIG. 1
illustrates how normalization of speaker acoustic characteristics is
performed. Normalization can also be performed to handle mismatched data
acquisition and recording conditions during training and test. FIG. 2 then
shows how phone model adaptation is performed. In FIGS. 1 and 2 a
distinction is made between the Training phase and the Test phase.
Training refers to the procedure by which the speech system is "trained"
using a set of known speech data and calibration speech from a plurality
of speakers. Test speech refers to the speech produced by individual
speaker q when the system is actually used in a speech recognition
application. In FIGS. 1 and 2 the Training and Test phases appear in
separate boxes, designated Training phase 10 and Test phase 12. In FIGS. 1
and 2 and in the mathematical equations appearing later in this
description, calibration speech spectra have been designated X.sub.c
whereas test speech spectra have been designated X.sub.t. These spectra are
in the logarithmic domain. FIGS. 1 and 2 are intended to give an overview
of the system. Complete implementation details are discussed later in
conjunction with the mathematical equations.
Referring to FIG. 1, the system is first calibrated by supplying
calibration speech from a plurality of speakers. This is designated at 14
where the speech from speaker 1 . . . speaker Q is input. The capital
letter Q on the left-hand side of the figure denotes the total number of
training speakers. The lower case q on the right-hand side of the figure
denotes a specific test speaker. Speaker q appears at 16 in FIG. 1.
The calibration speech spectra X.sub.c, representing specific Calibration
sentences, are supplied to a speaker-independent phone model estimation
process 18 which produces a set of speaker-independent phone models M1,
illustrated in oval 20. M1 has a set of unimodal Gaussian densities, in
which there is a single Gaussian density for each state of each phone
unit. M1 is then supplied to a process which estimates a spectral bias for
a speaker as a function of his or her calibration speech. This is
illustrated in h-estimator block 22 and also h-estimator block 24. Both
h-estimator blocks are constructed essentially in the same way. They
produce the estimated spectral bias parameter vector h, a factor which is
subtracted from the speech spectra in the logarithmic domain to produce
normalized spectra. The equations for obtaining this estimated spectral
bias are set forth as implementation details below.
On the Training side (box 10) the estimated spectral bias h for each of the
training speakers is subtracted from the speaker's training speech spectra
X.sub.t in the logarithmic domain to produce a set of normalized spectra
which is then modeled using a Hidden Markov model (HMM) at process 26.
This results in production of normalized speaker-independent HMM phone
models M2 and M3, illustrated at 28. Model set M2 is a set of Gaussian
mixture density phone models; M3 is a set of unimodal Gaussian density
phone models. The normalized phone models M2 and M3 are then supplied to
the decoder 30 for use in decoding the test speech of speaker q. The
training speech spectra X.sub.t is obtained using different sentences than
those used to obtain the calibration spectra X.sub.c.
Before speaker q uses the system to recognize sentences, a short utterance
of calibration speech X.sub.c is first supplied to h-estimator 24 to
produce an estimated spectral bias h.sup.(q) for that speaker. This
h.sup.(q) is subtracted from the test speech spectra X.sub.t when the
speaker q enters further speech after calibration. As before, the
estimated spectral bias parameter is subtracted in the logarithmic domain
resulting in acoustically normalized spectra. These normalized spectra are
then fed to decoder 30, which constructs decoded word strings using a dictionary
and grammar 32, and the HMM phone models 28.
To further improve performance, the system may also perform phone model
adaptation on M2 and M3. The technique for doing this is illustrated in
FIG. 2. In FIG. 2 the adapted mixture density phone models M2 and M3 are
shown in oval 34. As in FIG. 1, FIG. 2 also divides its functionality into
a training phase 10 and a test phase 12. Test phase 12 is essentially the
same as described for FIG. 1, with the exception that the decoder 30 is
supplied with adapted mixture density phone models M2 and M3. Since the
processes of phase 12 of FIG. 2 are essentially the same as those of phase
12 of FIG. 1, they will not be further described here. The focus for
review of FIG. 2 will be on phase 10 where the phone model adaptation
process takes place.
The calibration spectra X.sub.c for the plurality of training speakers
(Speaker 1, . . . Speaker Q) are normalized by subtracting the estimated
spectral bias parameters in the logarithmic domain as indicated at 36.
This is accomplished, for example, using the h parameters produced by
h-estimator 22 of FIG. 1.
Next, a Viterbi segmentation process is performed on the data at 38, thus
segmenting the data into phone units of defined boundaries. The Viterbi
segmentation process is performed using normalized mixture density phone
models M2 and M3. These models M2 and M3, illustrated by oval 40 in FIG.
2, may be the same models as those depicted by oval 28 in FIG. 1, that is,
the models produced after acoustic normalization.
Once Viterbi segmentation has been performed, the individual phone units
are used to determine context modulation vectors (CMV) by a maximum
likelihood estimation process depicted generally at 42. The resultant
context modulation vectors are depicted by oval 44. These context
modulation vectors are derived from the calibration speech X.sub.c and the
training speech X.sub.t of the training speakers (Speaker 1, . . . Speaker
Q).
The calibration speech X.sub.c for the test speaker, Speaker q, is
normalized by subtracting the estimated spectral bias at 46. Thereafter
Viterbi segmentation is performed at 48 to segment the normalized spectra
of speaker q into allophone subsegments. The spectra of the allophone
subsegments are then context modulated at 50, using the previously derived
context modulation vectors 44. These context modulated spectra are then
used in a Bayesian estimation process 52. The Bayesian estimation operates
on the normalized mixture density phone models M2 and M3, shown in oval
40, to produce the adapted mixture density phone models M2 and M3, shown
in oval 34. The adapted mixture density phone models are thus tuned to the
individual speaker q without requiring speaker q to speak any further
adaptation speech.
Having described a system for speaker adaptation using very short
calibration speech, we turn now to a system which is self-learning.
Referring to FIG. 3, the speech spectra of speaker q is acoustically
normalized by subtracting out an estimated spectral bias h.sup.(q). In
this case, the input speech spectra X.sub.t represents actual test speech,
that is, speech to be decoded by the recognizer, as opposed to calibration
speech. As previously noted, test speech is designated X.sub.t and
calibration speech is designated X.sub.c.
The actual acoustic normalization is performed by first generating the
estimated spectral bias h.sup.(q). This is done by h-estimator block 100,
which calculates the estimated spectral bias h.sup.(q) from X.sub.t and
the Gaussian density phone model set M3. This calculation is further
described in equation (3) below. Because the speech spectra X.sub.t is in
the logarithmic spectral domain, the estimated spectral bias is removed
from the speech spectra by subtraction. This is illustrated at 102 and 104
in FIG. 3. The Gaussian density phone models M3 used by h-estimator 100
are depicted at 110.
The normalized spectra resulting from the subtraction operation 102 are
supplied to the decoder 106 which produces the decoded word string, namely
a text string representing the recognized speech, using dictionary and
grammar 108 and the adapted Gaussian mixture density phone models M2 and
M3, 114.
As further explained below, the self-learning ability involves performing
phone model adaptation after each sentence is decoded. In FIG. 3, a dotted
line 112 has been drawn to visually separate the procedures performed
after sentence decoding (below) from the decoding procedures themselves
(above). Note that the decoder 106 uses the adapted mixture density phone
models M2 and M3, shown in oval 114. As will be seen, these models M2 and
M3 are adapted, in self-learning fashion, after each sentence is decoded.
Thus the adapted mixture density phone models M2 and M3 are depicted below
dotted line 112.
The phone model adaptation process begins with Viterbi segmentation 116.
The decoded word strings from decoder 106 and the adapted mixture density
phone models 114 are supplied to the Viterbi segmentation block. The
Viterbi segmentation process is performed on the acoustically normalized
spectra resulting from the subtraction process 104. In layman's terms,
Viterbi segmentation segments a sequence of speech spectra into segments
of phone units which are the physical units of actual speech that
correspond to phonemes. (Phonemes are the smallest speech units from a
linguistic or phonemic point of view. Phonemes are combined to form
syllables, syllables to form words and words to form sentences.)
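As a rough illustration of what a Viterbi segmentation does, the following sketch aligns a sequence of frames to a known left-to-right sequence of states. It is a simplification under stated assumptions, not the segmentation used in the described system: each state is represented only by a mean vector with unit covariance, transitions are uniform, no state may be skipped, and there are at least as many frames as states. The function name viterbi_segment and the example data are hypothetical.

```python
def viterbi_segment(frames, state_means):
    """Align frames to a left-to-right state sequence (no skips).

    Returns one state index per frame, chosen to minimize the total squared
    distance between frames and state means. Under the assumed
    unit-covariance Gaussians with uniform transitions, this is the
    maximum-likelihood alignment.
    """
    T, S = len(frames), len(state_means)
    INF = float("inf")

    def cost(t, s):
        return sum((x - m) ** 2 for x, m in zip(frames[t], state_means[s]))

    best = [[INF] * S for _ in range(T)]   # best[t][s]: cost of best path
    back = [[0] * S for _ in range(T)]     # back[t][s]: previous state
    best[0][0] = cost(0, 0)                # the path must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = best[t - 1][s]
            move = best[t - 1][s - 1] if s > 0 else INF
            if move < stay:
                best[t][s], back[t][s] = move + cost(t, s), s - 1
            else:
                best[t][s], back[t][s] = stay + cost(t, s), s

    # Trace back from the final state, which the path must end in.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical one-dimensional example with three states.
frames = [[0.1], [0.0], [1.1], [0.9], [2.0]]
state_means = [[0.0], [1.0], [2.0]]
alignment = viterbi_segment(frames, state_means)
```

Frames that share a state index form one segment; the frames in each segment are the adaptation data for that state.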
The Viterbi segmentation process 116 produces adaptation data for each
state of each phone unit. This is symbolized in oval 118. The output of
decoder 106 is supplied to the Viterbi segmentation process because, in
this case, the Viterbi segmentation process is not dealing with known
strings of calibration data.
In step 120 an interpolation parameter .lambda. is estimated for each
mixture component Gaussian density from the adaptation data. The
interpolation parameter is used at step 122 to determine whether there is
enough data to adapt the corresponding component Gaussian density in a
mixture density for a state of a phone unit in the model sets M2 and M3
illustrated in oval 114. If the data is sufficient, the mixture component
is categorized as belonging to Set A and the data is used to adapt the
parameters of the mixture component by Bayesian estimation. This is shown
at step 124. This adaptation process corresponds to equations (9) and (10)
in the mathematical description below.
In some cases, particularly when speaker q first starts using the system,
the amount of speech data may be insufficient for adapting certain mixture
component Gaussian densities of models 114. This condition is detected at
step 122 where the interpolation parameter .lambda. is below the
threshold, and the corresponding mixture component Gaussian density is
categorized as belonging to Set B. In this case, context modulation is
performed at step 126 on the data of the state of the phone unit for
adapting the parameters of the mixture component density, where the
parameters of context modulation have been estimated between sets A and B.
Context modulation supplements the adaptation data by producing
context-modulated adaptation data for mixture components in set B. This is
illustrated in oval 128. The context-modulated data are then used in step
124 to adapt the parameters of component Gaussian densities in set B
through Bayesian estimation.
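The decision at step 122 can be pictured as a simple partition of mixture components by their interpolation parameters. The sketch below is illustrative only; the function name partition_components and the numeric values, including the threshold, are hypothetical and are not taken from the described system.

```python
def partition_components(lambdas, eta):
    """Split mixture component indices by interpolation parameter.

    Components with lambda >= eta have enough adaptation data (Set A);
    the remaining components (Set B) rely on context-modulated data.
    """
    set_a = [i for i, lam in enumerate(lambdas) if lam >= eta]
    set_b = [i for i, lam in enumerate(lambdas) if lam < eta]
    return set_a, set_b


# Hypothetical interpolation parameters for a three-component mixture.
set_a, set_b = partition_components([0.9, 0.05, 0.4], eta=0.3)
```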
As a result of Bayesian estimation the adapted mixture density phone models
M2 and M3 are developed. Note that these models are fed back to the
Viterbi segmentation process 116, so that future Viterbi segmentation can
take advantage of the adapted mixture density phone models thus produced.
In layman's terms, the adapted mixture density phone models are modified
so that they better match the speech characteristics of the individual
speaker q. Context modulation is used in step 126 to "fill in the gaps" in
the data set, should the data set be insufficient for direct adaptation in
step 124. Context modulation takes into account that the component
Gaussian densities in an acoustically normalized mixture density of a
state of a given phone unit mainly model the context dependencies of
allophones, where a phone unit is pronounced
differently depending on the neighboring phone units. Through context
modulation, the adaptation data of a phone unit from different contexts of
neighboring phones can be used to adapt the parameters of a specific
component Gaussian density which models the allophone spectra of certain
similar contexts.
Implementation Details
The speaker-induced spectral variation sources are decomposed into two
categories, one acoustic and the other phone specific. The acoustic source
is attributed to speakers' physical individualities which cause spectral
variations independent of phone units; the phone-specific source is
attributed to speakers' idiosyncrasies which cause phone-dependent
spectral variations; each variation source is modeled by a linear
transformation system. Spectral biases from the acoustic variation source
are estimated using unsupervised maximum likelihood estimation proposed by
Cox et al. for speaker adaptation in isolated vowel and word recognition
(Cox et al. 1989).
Acoustic normalization is performed by removing such spectral variations
from the speech spectra of individual speakers. The phone-specific
spectral variations are handled by phone model adaptation, where the
parameters of speaker-independent Gaussian mixture density phone models
are adapted via Bayesian estimation. The derivations for the unsupervised
maximum likelihood estimation of spectral bias and Bayesian estimation of
Gaussian mixture density parameters are both cast in the mathematical
framework of the EM algorithm (Dempster et al. 1977).
The baseline speaker-independent continuous speech recognition system is
based on hidden Markov models of phone units: each phone model has three
tied-states, and each state is modeled by a Gaussian mixture density. To
enhance the adaptation effect when the adaptation data are limited, the
context dependency of allophones is modeled by context modulation between pairs
of mixture components within each Gaussian mixture density (Zhao, 1993b).
The proposed adaptation technique is shown to be effective in improving the
recognition accuracy of the baseline speaker-independent continuous speech
recognition system which was trained from the TIMIT database (Lamel et al.
1986). The evaluation experiments are performed on a subset of the TIMIT
database and on speech data collected in our laboratory.
This description of implementation details is presented in six sections,
including a general description of the self-learning adaptation method, a
detailed description of the statistical methods for acoustic normalization
and phone model adaptation, experimental results and a summary.
Self-Learning Adaptation
The acoustic and phone-specific variation sources are modeled as two
cascaded linear transformations on the spectra of a standard speaker.
Considering a speaker q, let H.sup.(q) and L.sub.i.sup.(q) be the linear
transformations representing the acoustic and the ith phone-specific
sources, respectively, for i=1, 2, . . . , M. Let X.sub.i,t.sup.(q) and
X.sub.i,t.sup.(o) be a pair of spectra of phone unit i at time t from the
speaker q and the standard speaker o. The composite mapping from the two
linear transformations is then
X.sub.i,t.sup.(q) =H.sup.(q) L.sub.i.sup.(q) X.sub.i,t.sup.(o),
.A-inverted.i,t (1)
In the logarithmic spectral domain, using lower case variables, the
multiplicative mappings become additive biases, i.e.
x.sub.i,t.sup.(q) =h.sup.(q) +l.sub.i.sup.(q) +x.sub.i,t.sup.(o),
.A-inverted.i,t. (2)
In the present adaptation technique, the acoustic bias h.sup.(q) is
explicitly estimated, whereas the phone-specific biases l.sub.i.sup.(q)
are handled implicitly via the adaptation of phone model parameters. The
subtraction of h.sup.(q) from x.sub.i,t.sup.(q) is called acoustic
normalization, yielding the acoustically normalized spectra
{tilde over (x)}.sub.i,t.sup.(q) =x.sub.i,t.sup.(q) -h.sup.(q),
.A-inverted.i,t. In the case that an unmatched recording condition
introduces a linear transformation distortion D, this distortion in the
logarithmic domain, d, is absorbed by the bias vector, which becomes
h.sup.(q) +d.
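The step from Equation (1) to Equation (2), and the effect of acoustic normalization, can be checked numerically. The sketch below assumes, for illustration only, that both transformations act as element-wise gains on each spectral channel; the function names and all numeric values are hypothetical.

```python
import math


def to_log(spectrum):
    """Element-wise natural logarithm of a linear-domain spectrum."""
    return [math.log(v) for v in spectrum]


def normalize(log_spectrum, h):
    """Acoustic normalization: subtract the bias h in the log domain."""
    return [x - b for x, b in zip(log_spectrum, h)]


# Hypothetical three-channel example.
standard = [1.0, 2.0, 4.0]   # spectrum of the standard speaker o
gain_H = [2.0, 0.5, 1.5]     # acoustic source, assumed element-wise
gain_L = [1.1, 0.9, 1.0]     # phone-specific source, assumed element-wise

# Equation (1): multiplicative mapping in the spectral domain.
observed = [H * L * x for H, L, x in zip(gain_H, gain_L, standard)]

# Equation (2): the same mapping is additive in the log domain.
h = to_log(gain_H)
l_i = to_log(gain_L)
x_q = to_log(observed)

# After removing h, only the phone-specific bias remains on top of the
# standard speaker's log spectrum.
normalized = normalize(x_q, h)
```

The remaining phone-specific term is what the phone model adaptation is left to absorb.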
The baseline recognition system uses both instantaneous and dynamic
spectral features (Zhao 1993a). As can be observed from Equation (2),
dynamic spectral features are not affected by the spectral bias h.sup.(q)
due to the inherent spectral difference computation in their extraction.
The dynamic features, on the other hand, could be affected by the
phone-specific bias l.sub.i.sup.(q) 's at the boundaries of phone units.
Presently, only the instantaneous spectral features and their models are
considered for adaptation. This separate treatment of the instantaneous
and dynamic spectral models is facilitated by the block-diagonal
covariance structure defined for the Gaussian mixture density phone models
(Zhao, 1993a), one block for instantaneous features, and the other for
dynamic features. For more details, see the work by Zhao (Zhao, 1993a).
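The insensitivity of dynamic features to a constant spectral bias follows from the difference computation and can be verified with a small sketch. The regression formula below is a common first-order form and is an assumption here, as are the window length, the edge handling and the example data; the baseline system's exact feature extraction is described in Zhao (1993a).

```python
def delta_features(frames, half_window=2):
    """First-order temporal regression coefficients per feature dimension.

    Frames beyond the sequence edges are replaced by the nearest edge frame.
    """
    T = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(T):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, half_window + 1):
                ahead = frames[min(t + k, T - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)   # a constant bias cancels here
            row.append(num / denom)
        deltas.append(row)
    return deltas


# Hypothetical two-dimensional log-spectral frames and a constant bias.
frames = [[0.0, 1.0], [0.5, 0.8], [1.5, 0.2], [1.0, 0.4], [0.2, 0.9]]
bias = [3.0, -2.0]
shifted = [[x + b for x, b in zip(f, bias)] for f in frames]

d_plain = delta_features(frames)
d_shift = delta_features(shifted)
```

The two results agree up to rounding error, whereas a bias that changes at a phone boundary would not cancel in frames whose window spans the boundary.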
Assuming a speaker speaks one sentence at a time during the course of using
the recognizer, then for each input sentence, speaker adaptation is
implemented in two sequential steps. Referring to FIG. 3, the first step
is carried out before recognizing the sentence, where the spectral bias of
the speaker's acoustic characteristics is estimated from the spectra of
the current sentence and the speech spectra of the same sentence are
subsequently normalized. The second step is carried out after recognizing
the sentence, where the parameters of phone models are adapted using
Bayesian estimation. In the second step, the adaptation data for each
phone unit is prepared via Viterbi segmentation of the spectral sequence
of the recognized sentence, and the segmentation is supervised by the
recognized word string. The adapted phone models are then used to
recognize the next sentence utterance from the speaker.
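The two sequential steps can be summarized as a per-sentence loop. The sketch below shows only the control flow; the four callables (estimate_bias, decode, segment, adapt) are hypothetical placeholders for the h-estimator, decoder, Viterbi segmentation and Bayesian adaptation, and a single models object stands in for the separate model sets of FIG. 3.

```python
def recognize_with_self_learning(sentences, models, estimate_bias, decode,
                                 segment, adapt):
    """Process sentences one at a time, adapting the models after each one.

    All four callables are placeholders supplied by the caller.
    """
    results = []
    for spectra in sentences:
        # Step 1 (before recognition): estimate the spectral bias from the
        # current sentence and remove it in the logarithmic domain.
        h = estimate_bias(spectra, models)
        normalized = [[x - b for x, b in zip(frame, h)] for frame in spectra]

        words = decode(normalized, models)

        # Step 2 (after recognition): segment the normalized spectra under
        # supervision of the recognized words, then adapt the models so that
        # they are used for the next sentence.
        adaptation_data = segment(normalized, words, models)
        models = adapt(models, adaptation_data)

        results.append(words)
    return results, models
```

Because adaptation runs after decoding, the first sentence is always recognized with the unadapted models.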
Speaker Normalization
Assuming the phone model parameters of the standard speaker are estimated
from the speech data of speakers in the training set, the phone models are
unimodal Gaussian densities N(.mu..sub.i, C.sub.i), i=1, 2, . . . M. For a
speaker q, a sentence utterance consists of the spectral sequence
x.sup.(q) ={x.sub.t.sup.(q), t=1, . . . , T.sup.(q) }. In the context of
the EM algorithm, the spectral vectors x.sub.t.sup.(q) are called the
observable data, and their phone labels i.sub.t are the unobservable
data. The complete data set consists of both the observable and
unobservable data (x.sub.1.sup.(q),x.sub.2.sup.(q), . . .
,x.sub.T.sup.(q),i.sub.1,i.sub.2, . . . ,i.sub.T). Using upper case
variables X.sup.(q) and I to denote the random variables for the
observable and unobservable data, respectively, the estimation of
h.sup.(q) is made through the iterative maximization of the expected value
of the conditional log likelihood of the complete data. Assuming an
initial value h.sub.0.sup.(q), the iterative estimation formula is then:
##EQU1##
where
##EQU2##
If the posterior probabilities P(i.sub.t
=i.vertline.x.sub.t.sup.(q),h.sub.n.sup.(q)) are each approximated by
the decision operation
##EQU3##
and the covariance matrices of the Gaussian densities are taken as the
unit matrix, the estimated spectral bias h.sup.(q) becomes simply the
average spectral deviation between the sentence spectra and the
corresponding mean vectors of the labeled phone models, i.e.
##EQU4##
In this study, Equation (5) is used for estimation of spectral biases and
the initial condition is set as h.sub.0.sup.(q) =0.
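The simplified estimate described in words above (hard labeling of each frame, unit covariance, bias equal to the average deviation from the labeled means, starting from a zero bias) can be sketched as follows. This is an illustrative reading of that description rather than a reproduction of Equation (5); the nearest-mean labeling rule, the function name estimate_bias, the iterations parameter and the example data are assumptions.

```python
def estimate_bias(frames, means, iterations=1):
    """Estimate a spectral bias h for one sentence.

    frames: log-spectral vectors x_t of the sentence
    means:  mean vectors mu_i of the unimodal phone models
            (unit covariance assumed)
    """
    dim = len(frames[0])
    h = [0.0] * dim                      # initial condition h_0 = 0
    for _ in range(iterations):
        total = [0.0] * dim
        for x in frames:
            shifted = [xd - hd for xd, hd in zip(x, h)]
            # Decision operation: label the frame with the nearest mean,
            # using the current bias estimate.
            best = min(means,
                       key=lambda mu: sum((s - m) ** 2
                                          for s, m in zip(shifted, mu)))
            for d in range(dim):
                total[d] += x[d] - best[d]
        # Average deviation between frames and their labeled means.
        h = [v / len(frames) for v in total]
    return h


# Hypothetical example: two phone means and frames offset by a known bias.
means = [[0.0, 0.0], [10.0, 10.0]]
frames = [[1.0, -1.0], [11.0, 9.0], [1.0, -1.0]]
h_est = estimate_bias(frames, means)
```

In this toy case the bias is small relative to the spacing of the means, so the labels are correct on the first pass and the estimate recovers the offset.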
It is advantageous to perform acoustic normalization on both training and
test data, where removing spectral biases from training spectra makes the
phone models more efficient in capturing statistical variations of
allophones. To construct phone models characterizing a standard speaker,
the training data are first used to estimate a set of unimodal Gaussian
density phone models. Using these models as reference, a spectral bias
vector is estimated for each sentence utterance from each speaker, and the
estimated spectral bias is subsequently removed from the sentence spectra.
Gaussian mixture density phone models are trained from the acoustically
normalized training data.
An alternative method of estimating a spectral bias for each speaker is to
iteratively update the estimate as more data from the speaker become
available. Although in general using more data produces more reliable
estimates, it has been observed in the experiments that the iterative
estimation scheme became sluggish in keeping up with random changes in a
speaker's voice characteristic, and in this instance it led to inferior
recognition results.
Phone Model Adaptation
For phone model adaptation, the acoustically normalized speech spectra are
segmented into states of phone units according to the recognized word
sequence. For each state of phone unit, the parameters of the Gaussian
mixture density are adapted via Bayesian estimation (Lee, 1990; Lee,
1993). In order to enhance the effect of adaptation when the amount of
adaptation data is limited, context modulation (Zhao, 1993b) is employed
for adapting the Gaussian component densities which have insufficient
adaptation data.
Bayesian Estimation of Gaussian Mixture Density Parameters
Considering a size-M Gaussian mixture density, the mean vectors and
covariance matrices of the component densities are denoted .theta..sub.i
=(.mu..sub.i, C.sub.i), .A-inverted.i. The mixture weights are .alpha..sub.i
.gtoreq.0, .A-inverted.i and
##EQU5##
.alpha..sub.i =1. Denoting .THETA.={.theta..sub.1,.theta..sub.2, . . .
,.theta..sub.M } and A={.alpha..sub.1,.alpha..sub.2, . . . ,.alpha..sub.M
}, the likelihood of a feature vector x.sub.t (the notation
x.sub.t.sup.(q) is dropped for simplicity of derivation and the feature
dimension is assumed as L) is computed as:
##EQU6##
with f(x.sub.t .vertline..theta..sub.i).about.N(.mu..sub.i, C.sub.i),
.A-inverted.i. The prior distributions of .theta..sub.i, i=1, 2, . . . , M
are assumed to be independent, and the mixture weights .alpha..sub.i 's
are taken as constant. The prior mean and covariance .mu..sub.o.sup.(i)
and C.sub.o.sup.(i) are the speaker-independent estimates from a training
sample size N.sub.i, .A-inverted.i. Defining the precision matrix r.sub.i
=C.sub.i.sup.-1, the joint distribution of mean and precision matrix
(.mu..sub.i,r.sub.i) is taken as a conjugate prior distribution (Degroot,
1970). Specifically, the conditional distribution of .mu..sub.i given
r.sub.i is Gaussian with mean .mu..sub.o.sup.(i) and precision matrix
.nu.r.sub.i, .nu. being a scaling constant, and the marginal distribution
of r.sub.i is a Wishart distribution with .rho. degrees of freedom and a
scaling matrix .tau..sub.i =N.sub.i C.sub.o.sup.(i), i.e.
##EQU7##
where .varies. signifies "proportional to." Since the prior mean and
covariance are estimated from N.sub.i data samples, the precision scale
.nu. and the degrees of freedom .rho. are both assigned the value of
training sample size N.sub.i (Degroot, 1970).
There is a set of observable feature data x={x.sub.1,x.sub.2, . . .
,x.sub.T } and a set of unobservable data {i.sub.1, i.sub.2, . . . ,
i.sub.T }, i.sub.t being the mixture index for x.sub.t, .A-inverted.t. The
estimation of .THETA. is, therefore, again formulated in the framework of
the EM algorithm. The difference from the EM formulation previously
referenced is that the conditional expectation is taken with respect to
the posterior likelihood of the complete data set (X,I), i.e.
##EQU8##
The initial .THETA..sup.(0) are speaker-independent model parameters. The
maximization of the expectation is decoupled for individual .theta..sub.i
's and leads to the posterior estimate of mean
.mu..sub.i.sup.(n+1) =(1-.lambda..sub.i.sup.(n)).mu..sub.o.sup.(i)
+.lambda..sub.i.sup.(n) .mu..sub.x.sup.(i)(n) (9)
and covariance (with approximation .rho.-L=N.sub.i)
C.sub.i.sup.(n+1) =(1-.lambda..sub.i.sup.(n))C.sub.o.sup.(i)
+.lambda..sub.i.sup.(n) C.sub.x.sup.(i)(n) +.lambda..sub.i.sup.(n)
(1-.lambda..sub.i.sup.(n))(.mu..sub.x.sup.(i)(n)
-.mu..sub.o.sup.(i))(.mu..sub.x.sup.(i)(n) -.mu..sub.o.sup.(i)).sup.T (10)
where .lambda..sub.i.sup.(n) is the interpolation parameter,
.mu..sub.x.sup.(i)(n) and C.sub.x.sup.(i)(n) are sample mean and
covariance of the adaptation data. Denoting the posterior probability
P(i.sub.t =i.vertline.x.sub.t,.theta..sub.i.sup.(n)) by
.gamma..sub.t,i.sup.(n), i.e.,
##EQU9##
parameters .lambda..sub.i.sup.(n), .mu..sub.x.sup.(i)(n) and
C.sub.x.sup.(i)(n) are computed as
##EQU10##
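The interpolation in Equations (9) and (10) and the posterior probability of a mixture component can be illustrated with a short sketch. It assumes diagonal covariances for brevity, so each covariance is a list of per-dimension variances, and it takes the interpolation parameter as an input rather than computing it. The function names and numeric values are hypothetical.

```python
import math


def gaussian_pdf(x, mu, var):
    """Gaussian density with diagonal covariance (assumed for brevity)."""
    log_p = 0.0
    for xd, md, vd in zip(x, mu, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)


def responsibilities(x, weights, mus, variances):
    """Posterior probability of each mixture component for one frame."""
    joint = [a * gaussian_pdf(x, m, v)
             for a, m, v in zip(weights, mus, variances)]
    total = sum(joint)   # the mixture likelihood of the frame
    return [j / total for j in joint]


def bayes_adapt(mu_0, var_0, mu_x, var_x, lam):
    """Equations (9) and (10) for one component, per dimension.

    mu_0, var_0: prior (speaker-independent) mean and variance
    mu_x, var_x: sample mean and variance of the adaptation data
    lam:         interpolation parameter in [0, 1]
    """
    mu_new = [(1.0 - lam) * m0 + lam * mx for m0, mx in zip(mu_0, mu_x)]
    var_new = [(1.0 - lam) * v0 + lam * vx
               + lam * (1.0 - lam) * (mx - m0) ** 2
               for m0, v0, mx, vx in zip(mu_0, var_0, mu_x, var_x)]
    return mu_new, var_new


# Hypothetical one-dimensional example.
gamma = responsibilities([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])
mu_new, var_new = bayes_adapt([0.0], [1.0], [2.0], [1.0], lam=0.5)
```

With the interpolation parameter at 0 the prior parameters are returned unchanged, and at 1 the sample statistics replace them; intermediate values add a variance term for the distance between the prior mean and the sample mean.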
Enhancement of Adaptation Effect
When a user initially starts using a recognizer, the amount of feedback
adaptation data is limited and most mixture components have only a small
amount or no adaptation data. In this scenario, the Gaussian component
densities lacking adaptation data are adapted using context-modulated
data. In the logarithmic domain, the relation between spectra of two
allophones a and b is x.sub.a,t =x.sub.b,t +.xi., with .xi. a context
modulation vector (CMV). When each Gaussian component density in a mixture
is conceptualized as modeling spectra of a generalized allophone context,
a CMV can be estimated between each pair of mixture components using the
respective training data. Denoting the mapping of training spectra in the
ith mixture component, x.sub.t, .A-inverted.t, to the jth mixture component
by c.sub.i,j (x.sub.t)=x.sub.t +.xi..sub.i,j, the CMV .xi..sub.i,j is
estimated by maximizing the joint likelihood of c.sub.i,j (x.sub.t),
.A-inverted.t, under the Gaussian density model .theta..sub.j
=(.mu..sub.j, C.sub.j), i.e.
##EQU11##
It is straightforward to derive that the estimate is .xi..sub.i,j
=.mu..sub.j -.mu..sub.i, which is the difference between the mean vectors
of the jth and ith component Gaussian densities.
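The estimate of the context modulation vector as a difference of means, and its use to map data from one mixture component to another, can be shown in a few lines. The function names and the example data are hypothetical.

```python
def mean_vector(vectors):
    """Sample mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def context_modulation_vector(mu_i, mu_j):
    """CMV from component i to component j: mu_j - mu_i."""
    return [mj - mi for mi, mj in zip(mu_i, mu_j)]


def modulate(frames, cmv):
    """Map spectra of component i to component j by adding the CMV."""
    return [[x + c for x, c in zip(f, cmv)] for f in frames]


# Hypothetical data clustered to component i, and a target mean mu_j.
data_i = [[1.0, 2.0], [3.0, 4.0]]
mu_i = mean_vector(data_i)
mu_j = [10.0, 0.0]

cmv = context_modulation_vector(mu_i, mu_j)
mapped = modulate(data_i, cmv)
```

The mapped data are centered on the mean of component j while keeping the scatter of the original data, which is what allows them to serve as supplementary adaptation data for component j.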
Based on the CMVs, the adaptation data clustered to individual Gaussian
component densities in a mixture can be mapped to a specific component
density for adapting its parameters. There are two potential problems with
this method. First, the component densities in a mixture are spaced apart
by different distances. The linear transformation model of context
modulation could be inappropriate for component density pairs which are
separated by large distances. Second, after a speaker uses a recognizer
for an extended period of time, the amount of adaptation data in a state
of a phone unit could become large, and using all these data for adapting
a specific Gaussian component density might lead to over-adaptation. In
the following, two cases are considered. In the first case, the
context-modulated adaptation data are straightforwardly used to adapt the
parameters of a specific Gaussian component density. In the second case,
constraints on adaptation are introduced by applying weights and a threshold
to the first case to handle the two potential problems mentioned above.
Unconstrained Adaptation
The interpolation parameter .lambda..sub.i defined in Equation (12)
measures the amount of adaptation data for the ith mixture component,
.A-inverted.i. Taking a threshold .eta.<1, a decision is made that if
.lambda..sub.i .gtoreq..eta., the parameters of the ith mixture component
are directly adapted using Equations (9) and (10), otherwise the
parameters are adapted using the context-modulated data. Assuming the jth
mixture component has insufficient adaptation data, i.e., .lambda..sub.j
<.eta., the model parameters .theta..sub.j =(.mu..sub.j, C.sub.j) can also
be estimated from an EM formulation. Denote the mapping of adaptation data
x={x.sub.1, x.sub.2, . . . , x.sub.T } from the individual Gaussian
component densities to the jth mixture component by C.sub.j (x), then
##EQU12##
Further define the weighting coefficients (note the use of the constant
##EQU13##
The posterior estimate of mean .mu..sub.j.sup.(n+1) is derived as
##EQU14##
As seen from Equation (16), the sum of the weighting coefficients
##EQU15##
serves as an interpolation parameter, and the estimate
.mu..sub.j.sup.(n+1) is the shift of the original mean .mu..sub.o.sup.(j)
by the vector
##EQU16##
Making use of the vector .delta..sup.(j)(n), the posterior estimate of
covariance matrix is derived as
##EQU17##
Constrained Adaptation
To take into account the distances between Gaussian component density
pairs, the EM formulation of Equation (14) is modified to weigh the
likelihood of each feature vector by a factor less than or equal to one,
i.e.
##EQU18##
where the factor .nu..sub.j,i is an inverse function of the Euclidean
distance d.sub.i,j =.parallel..mu..sub.i -.mu..sub.j .parallel. and is
defined as
##EQU19##
In Equation (19), the numerator in the case j.noteq.i is for normalizing
the largest value of .nu..sub.j,i to one, i.e.
##EQU20##
The purpose of the normalization is to achieve a larger adaptation
effect than without the normalization. It is easy to derive that the
estimation formulas for .mu..sub.j.sup.(n+1) and C.sub.j.sup.(n+1) remain
in the same form as in Equations (16) and (17), but the coefficients
.beta..sub.i.sup.(j)(n) are changed to
##EQU21##
To avoid over-adaptation the value of
##EQU22##
is checked against a threshold .epsilon.<1. If
##EQU23##
the weighting factor is modified to .xi..sub.i .nu..sub.j,i where
.xi..sub.i =1 for i=j, otherwise .xi..sub.i =.xi.<1, .A-inverted.i. The
value .xi. is determined by setting
##EQU24##
which leads to
##EQU25##
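The distance-based weighting in this section can be illustrated with one possible inverse-distance form. The sketch below is an assumption made for illustration and is not a reproduction of Equation (19): it sets the weight to one for the target component itself and otherwise divides the smallest distance from the target component to any other component by the distance to the component in question, so that the largest off-target weight is one. The function name distance_weights and the example means are hypothetical, and the component means are assumed to be distinct.

```python
import math


def distance_weights(means, j):
    """Illustrative weights for mapping data of each component i onto j.

    Assumed form: 1 for i == j, otherwise d_min / d(i, j), where d_min is
    the smallest Euclidean distance from component j to another component.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(means)
    d = [dist(means[i], means[j]) for i in range(n)]
    d_min = min(d[i] for i in range(n) if i != j)
    return [1.0 if i == j else d_min / d[i] for i in range(n)]


# Hypothetical one-dimensional component means; the target is component 0.
weights = distance_weights([[0.0], [1.0], [3.0]], 0)
```

Under this form, data from the nearest neighboring component count fully, and data from more distant components count progressively less.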
EXPERIMENTS
Experiments were performed on the TIMIT database and speech data collected
in our laboratory (STL) in the manner described below. The baseline
speaker-independent HMM phone models were trained from 325 speakers and
717 sentences from the TIMIT database. The TIMIT speech data were
down-sampled from 16 kHz to 10.67 kHz. The cepstrum coefficients of
Perceptually-Based Linear Prediction (PLP) analysis (8th order) (Hermansky
et al. 1985) and log energy were used as instantaneous features and their
1st order 50 msec temporal regression coefficients as dynamic features.
The task vocabulary size was 853, and the grammar perplexities were 104
and 105 for the TIMIT and STL test sets, respectively. The TIMIT test set