Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to information processing apparatuses that
integrate a plurality of feature parameters, and in particular, to an
information processing apparatus in which, when speech recognition based
on speech and on an image of lips observed when the speech was made is
performed, the information processing apparatus increases speech
recognition performance by integrating audio and image feature parameters
so that the parameters can be processed in optimal form.
2. Description of the Related Art
By way of example, speech is recognized by extracting feature parameters
from the speech, and comparing the feature parameters with normal
parameters (normal patterns) used as a reference.
When speech recognition is based on speech alone, there is a limit to how
far the recognition factor can be increased. Accordingly, speech
recognition may be performed based not only on the speech but also on a
captured image of the lips of the speaker.
In this case, it is also possible to integrate feature parameters extracted
from the speech and feature parameters extracted from the lip image to
form so-called "integrated parameters" and to use the integrated
parameters to perform speech recognition. The assignee of the present
patent application has proposed Japanese Patent Application No. 10-288038
(which was not open to the public when the present patent application was
filed) as a type of speech recognition that generates integrated
parameters by integrating feature parameters extracted from speech and
feature parameters extracted from a lip image and that uses the integrated
parameters to perform speech recognition.
With reference to FIGS. 1 to 16, Japanese Patent Application No. 10-288038
is described below.
FIG. 1 shows an example of a speech recognition apparatus that performs
speech recognition based on integrated parameters obtained by integrating
feature parameters based on a plurality of input data.
In addition to speech data (speech from a user) to be recognized, other
data useful in recognizing the user's speech are sequentially input in
time series to the speech recognition apparatus: image data obtained by
capturing an image of the user's lips when the user spoke, noise data on
noise in the environment where the user spoke, and, in the case where the
speech recognition apparatus is provided with an input unit for inputting
the place where the user speaks, a signal in accordance with the operation
of that input unit. The speech recognition apparatus takes these types of
data into consideration, as required, when performing speech recognition.
Specifically, the speech data, the lip-image data, the noise data, and
other data, which are in digital form, are input to a parameter unit 1.
The parameter unit 1 includes signal processors 11.sub.1 to 11.sub.N
(where N represents the number of data signals input to the parameter unit
1). The speech data, the lip-image data, the noise data, and other data
are processed by the signal processors 11.sub.1 to 11.sub.N corresponding
thereto, whereby extraction of feature parameters representing each type
of data, etc., is performed. The feature parameters extracted by the
parameter unit 1 are supplied to an integrated parameter generating unit
2.
In the parameter unit 1 shown in FIG. 1, the signal processor (lip-signal
processor) 11.sub.1 processes the lip-image data, the signal processors
(audio-signal processors) 11.sub.2 to 11.sub.N-1 process the speech data,
and the signal processor (audio-signal processor) 11.sub.N processes the
noise data, etc. The feature parameters of the speech (sound) data such as
the speech data and the noise data include, for example, linear prediction
coefficients, cepstrum coefficients, power, line spectrum pairs, and zero
crossings. The feature parameters of the lip-image data include, for example,
parameters (e.g., the longer diameter and shorter diameter of an ellipse)
defining an ellipse approximating the shape of the lips.
The integrated parameter generating unit 2 includes an intermedia
normalizer 21 and an integrated parameter generator 22, and generates
integrated parameters by integrating the feature parameters of the signals
from the parameter unit 1.
Specifically, the intermedia normalizer 21 normalizes the feature
parameters of the signals from the parameter unit 1 so that they can be
processed with the same weight, and outputs the normalized parameters to
the integrated parameter generator 22. The integrated parameter generator
22 integrates (combines) the normalized feature parameters of the signals
from the intermedia normalizer 21, thereby generating integrated
parameters, and outputs the integrated parameters to the matching unit 3.
The matching unit 3 compares the integrated feature parameters and normal
patterns (a model to be recognized), and outputs the matching results to a
determining unit 4. Specifically, the matching unit 3 includes a
distance-transition matching unit 31 and a spatial distribution matching
unit 32. The distance-transition matching unit 31 uses a
distance-transition model (described below) to perform the matching of the
integrated feature parameters by using a distance-transition method
(described below), and outputs the matching results to the determining
unit 4. The spatial distribution matching unit 32 performs the matching of
the integrated feature parameters by using a spatial distribution method
(described below), and outputs the matching results to the determining
unit 4.
The determining unit 4 recognizes the user's speech (sound), based on
outputs from the matching unit 3, i.e., the matching results from the
distance-transition matching unit 31 and the spatial distribution matching
unit 32, and outputs the result of recognition, e.g., a word. Accordingly,
in the determining unit 4, the unit processed by speech recognition is a
word. Alternatively, for example, a phoneme or the like can be the unit
processed by speech recognition.
With reference to the flowchart shown in FIG. 2, processing by the speech
recognition apparatus (shown in FIG. 1) is described below.
When the speech data, the lip-image data, the noise data, etc., are input
to the speech recognition apparatus, they are supplied to the parameter
unit 1.
In step S1, the parameter unit 1 extracts feature parameters from the
supplied data, and outputs them to the integrated parameter generating
unit 2.
In step S2, the intermedia normalizer 21 (in the integrated parameter
generating unit 2) normalizes the feature parameters from the parameter
unit 1, and outputs the normalized feature parameters to the integrated
parameter generator 22.
In step S3, the integrated parameter generator 22 generates integrated
feature parameters by integrating the normalized feature parameters from
the intermedia normalizer 21. The integrated feature parameters are
supplied to the distance-transition matching unit 31 and the spatial
distribution matching unit 32 in the matching unit 3.
In step S4, the distance-transition matching unit 31 performs the matching
of the integrated feature parameters by using the distance-transition
method, and the spatial distribution matching unit 32 performs the
matching of the integrated feature parameters by using the spatial
distribution method. Both matching results are supplied to the determining
unit 4.
In step S5, based on the matching results from the matching unit 3, the
determining unit 4 recognizes the speech data (the user's speech). After
outputting the result of (speech) recognition, the determining unit 4
terminates its process.
As described above, the intermedia normalizer 21 (shown in FIG. 1)
normalizes the feature parameters of the signals from the parameter unit 1
so that they can be processed with the same weight. The normalization is
performed by multiplying each feature parameter by a normalization
coefficient. This normalization coefficient is found by performing
learning (normalization-coefficient learning process). FIG. 3 shows an
example of a learning apparatus for performing the learning.
For brevity of description, a type of learning is described below that
finds normalization coefficients for setting the feature parameters of the
speech and the image as two different media (e.g., feature parameters of
speech and feature parameters of lips observed when the speech was made)
to have the same weight.
In FIG. 3, image feature parameter P.sub.i,j and speech feature parameter
V.sub.i,j, which are code-vector learning parameters (codebook-creating
data) for creating a codebook for use in vector quantization, are supplied
to a tentative normalizer 51. The tentative normalizer 51 tentatively
normalizes image feature parameter P.sub.i,j and speech feature parameter
V.sub.i,j by using normalization coefficients from a normalization
coefficient controller 55, and supplies the normalized feature parameters
to a codebook creator 52. In other words, in order to use the weight of
image feature parameter P.sub.i,j as a reference and to set the weight of
speech feature parameter V.sub.i,j to equal the reference, speech feature
parameter V.sub.i,j is multiplied by normalization coefficient .alpha. from
the normalization coefficient controller 55. Accordingly, image feature
parameter P.sub.i,j can be regarded as being multiplied by a normalization
coefficient of 1.
In FIG. 3, suffix "i" indicating the row of feature parameter P.sub.i,j or
V.sub.i,j represents a time (frame) at which the feature parameter
P.sub.i,j or V.sub.i,j was extracted, and suffix "j" indicating the column
of feature parameter P.sub.i,j or V.sub.i,j represents the order
(dimensions) of the feature parameter P.sub.i,j or V.sub.i,j. Therefore,
(P.sub.i,1, P.sub.i,2, . . . , P.sub.i,L, V.sub.i,1, V.sub.i,2, . . . ,
V.sub.i,M) are feature parameters (feature vectors) at time i. Expression
P.sup.(k).sub.i,j formed by adding a suffix in parentheses to image
feature parameter P.sub.i,j represents a feature parameter generated from
different learning data if "k" differs. This also applies to the suffix
(k) of expression V.sup.(k).sub.i,j.
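The tentative normalization and the layout of one frame of feature
parameters described above can be illustrated by the following minimal
sketch. The function name integrate and the sample values are hypothetical
and are not taken from the apparatus of FIG. 3; the sketch only assumes, as
stated above, that the image feature parameters are left unchanged (weight
1) and the speech feature parameters are multiplied by normalization
coefficient .alpha..

```python
from typing import List, Sequence


def integrate(image_feats: Sequence[float],
              speech_feats: Sequence[float],
              alpha: float) -> List[float]:
    """Build one frame (P_i,1..P_i,L, alpha*V_i,1..alpha*V_i,M).

    Hypothetical helper: the image part is used as the reference (weight 1)
    and only the speech part is scaled by the normalization coefficient.
    """
    normalized_speech = [alpha * v for v in speech_feats]
    return list(image_feats) + normalized_speech


# Illustrative values only: L = 2 image elements, M = 3 speech elements.
frame = integrate([2.0, 1.0], [0.5, -0.25, 4.0], alpha=0.5)
print(frame)
```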
The codebook creator 52 creates a codebook for use in vector quantization
by a vector quantizer 54, using code-vector learning parameters P.sub.i,j
and V.sub.i,j, and supplies it to the vector quantizer 54.
In the codebook creator 52, the codebook is created in accordance with,
e.g., the LBG (Linde, Buzo, Gray) algorithm. However, another type of
algorithm other than the LBG algorithm may be employed.
The LBG algorithm is a so-called "batch learning algorithm". Starting from
properly chosen initial code vectors, it makes the code vectors
(representative vectors) constituting the codebook converge locally to
optimal positions by repeating two operations: Voronoi division, which
optimally divides the feature parameter space in accordance with the
distance between each feature parameter serving as a learning sample
(learning data) and each code vector; and updating of each code vector to
the centroid of the corresponding partial region of the feature parameter
space obtained by the Voronoi division.
Here, when a set of learning samples is represented by x.sub.j (j=0, 1, . .
. , J-1), and a set of code vectors is represented by Y={y.sub.0, y.sub.1,
. . . , y.sub.N-1 }, the learning-sample set is divided into N subsets
S.sub.i (i=0, 1, . . . , N-1) by code-vector set Y in the Voronoi
division. In other words, when the distance between learning sample
x.sub.j and code vector y.sub.i is represented by d(x.sub.j, y.sub.i),
and the following expression holds with respect to all of t (t=0, 1, . . .
, N-1) that do not equal i,
d(x.sub.j,y.sub.i)<d(x.sub.j,y.sub.t) (1)
it is determined that learning sample x.sub.j belongs to subset S.sub.i
(x.sub.j .epsilon. S.sub.i).
In addition, when centroids C (v.sub.0, v.sub.1, . . . , v.sub.M-1) with
respect to vectors v.sub.0, v.sub.1, . . . , v.sub.M-1 are defined by the
following expression:
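C(v_0, v_1, \ldots, v_{M-1}) = \operatorname*{argmin}_{v} \left\{ \frac{1}{M} \sum_{m=0}^{M-1} d(v, v_m) \right\} \quad (2)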
code vector y.sub.i is updated in accordance with the following expression
y.sub.i =C({S.sub.i }) (3)
In expression (2), "argmin { }" on the right side means the vector v that
minimizes the value in { }. The clustering technique using expression (3)
is called "k-means clustering". The details of the LBG
algorithm are described in, for example, "Speech and Image Engineering"
written by Kazuo Nakata and Satoshi Minami, published by Shokodo in 1987,
pp. 29-31.
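The alternation of Voronoi division and centroid updating described above
can be sketched as follows. This is an illustrative sketch, not the
implementation of the codebook creator 52: it assumes that the distance d
is the squared Euclidean distance (for which the minimizing vector of the
centroid definition is the arithmetic mean), that the initial code vectors
are given by the caller, and that a tie in distance is resolved in favor of
the lower code number. The names sq_dist, nearest, centroid, and
train_codebook are hypothetical.

```python
from typing import List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance (assumed choice of d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x: Vector, codebook: List[Vector]) -> int:
    """Index i of the code vector y_i closest to x (Voronoi division)."""
    return min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))


def centroid(vectors: List[Vector]) -> Vector:
    """Arithmetic mean, which minimizes the mean squared distance."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train_codebook(samples: List[Vector], initial: List[Vector],
                   iterations: int = 20) -> List[Vector]:
    codebook = [list(y) for y in initial]
    for _ in range(iterations):
        # Divide the learning samples into subsets S_i, one per code vector.
        subsets: List[List[Vector]] = [[] for _ in codebook]
        for x in samples:
            subsets[nearest(x, codebook)].append(x)
        # Update each code vector to the centroid of its subset; an empty
        # subset keeps its previous code vector.
        updated = [centroid(s) if s else y for s, y in zip(subsets, codebook)]
        if updated == codebook:  # no change: locally converged
            break
        codebook = updated
    return codebook


# Illustrative learning samples forming two well-separated groups.
samples = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
codebook = train_codebook(samples, initial=[[1.0, 1.0], [9.0, 9.0]])
print(codebook)
```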
In the learning apparatus shown in FIG. 3, suffix i,j of elements
S.sub.i,j and T.sub.i,j in the codebook output by the codebook creator 52
indicates the j-th element of the code vector corresponding to code #i.
Thus, expression (S.sub.i,1, S.sub.i,2, . . . ,
S.sub.i,L, T.sub.i,1, T.sub.i,2, . . . , T.sub.i,M) represents a code
vector corresponding to code #i. Element S.sub.i,j of the code vector
corresponds to the image, and element T.sub.i,j corresponds to the speech.
A tentative normalizer 53 is supplied with image feature parameter
P.sub.i,j and speech feature parameter V.sub.i,j (in this example it is
assumed that both types of parameters are obtained from an image and
speech different from those for the code-vector learning parameters) as
normalization-coefficient learning parameters for learning normalization
coefficient .alpha.. Similarly to the tentative normalizer 51, the
tentative normalizer 53 tentatively normalizes image feature parameter
P.sub.i,j and speech feature parameter V.sub.i,j by using the
normalization coefficients from the normalization coefficient controller
55, and supplies the normalized parameters to the vector quantizer 54. In
other words, among image feature parameter P.sub.i,j and speech feature
parameter V.sub.i,j as normalization-coefficient learning parameters,
speech feature parameter V.sub.i,j is multiplied by normalization
coefficient .alpha. from the normalization coefficient controller 55 by the
tentative normalizer 53, and the tentative normalizer 53 outputs the
product to the vector quantizer 54.
The tentative normalizer 53 is supplied with a plurality of sets of
normalization-coefficient learning parameters. The tentative normalizer 53
performs normalization with respect to each of the
normalization-coefficient learning parameters.
The vector quantizer 54 performs vector quantization on the normalized
normalization-coefficient learning parameters supplied from the tentative
normalizer 53, using the latest codebook supplied from the codebook
creator 52, and supplies quantization errors caused by the vector
quantization to the normalization coefficient controller 55.
In other words, the vector quantizer 54 calculates, for the image and
speech, a distance between each code vector of the codebook and each
normalized normalization-coefficient learning parameter, and supplies the
calculated shortest distance as a quantization error to the normalization
coefficient controller 55. Specifically, the distance between image
feature parameter P.sub.i,j among the normalized normalization-coefficient
learning parameters, and image-related element S.sub.i,j of the code
vector, is calculated, and the calculated shortest distance is supplied as
an image-related quantization error to the normalization coefficient
controller 55. At the same time, the distance between speech feature
parameter .alpha.V.sub.i,j among the normalized normalization-coefficient
learning parameters, and speech-related element T.sub.i,j of the code
vector, is calculated, and the calculated shortest distance is supplied as
a speech-related quantization error to the normalization coefficient
controller 55.
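The image-related and speech-related quantization errors described above
can be sketched for one learning parameter as follows. The sketch rests on
two assumptions that the description does not state explicitly: the
distance is the Euclidean distance, and the shortest distance is searched
separately for the image-related elements and for the speech-related
elements of the code vectors. The function name media_errors and the
sample codebook are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def media_errors(p: Sequence[float], v: Sequence[float], alpha: float,
                 codebook: List[List[float]]) -> Tuple[float, float]:
    """Return (image-related error, speech-related error) for one frame.

    Each code vector is laid out as (S_1..S_L, T_1..T_M), where L = len(p).
    Assumption: Euclidean distance, minimized per medium.
    """
    num_image = len(p)
    scaled_v = [alpha * x for x in v]  # tentatively normalized speech part
    image_err = min(math.dist(p, c[:num_image]) for c in codebook)
    speech_err = min(math.dist(scaled_v, c[num_image:]) for c in codebook)
    return image_err, speech_err


# Illustrative codebook with L = 2 image elements and M = 1 speech element.
codebook = [[0.0, 0.0, 1.0], [3.0, 4.0, 5.0]]
image_err, speech_err = media_errors([3.0, 0.0], [2.0], 2.0, codebook)
print(image_err, speech_err)
```

Summing these two errors over all normalization-coefficient learning
parameters would give the accumulated values that the normalization
coefficient controller 55 compares.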
The normalization coefficient controller 55 accumulates, with respect to
all the normalization-coefficient learning parameters, image- and
speech-related quantization errors supplied from the vector quantizer 54,
and changes normalization coefficient .alpha. to be supplied to the
tentative normalizers 51 and 53 so that both accumulated values are equal.
With reference to the flowchart shown in FIG. 4, a normalization-coefficient
learning process performed by the learning apparatus shown in FIG. 3 is
described below.
In the learning apparatus shown in FIG. 3, at first, code-vector learning
parameters are supplied to the tentative normalizer 51, and
normalization-coefficient learning parameters are supplied to the tentative
normalizer 53. In addition, initial normalization coefficient .alpha. is
supplied from the normalization coefficient controller 55 to the tentative
normalizers 51 and 53.
In step S21, the tentative normalizer 51 tentatively normalizes the
code-vector learning parameters by multiplying speech feature parameter
V.sub.i,j among the code-vector learning parameters by normalization
coefficient .alpha. from the normalization coefficient controller 55, and
supplies the tentatively normalized parameters to the codebook creator 52.
When receiving the normalized code-vector learning parameters from the
tentative normalizer 51, the codebook creator 52 uses the received parameters
in step S22 to create, based on the LBG algorithm, a codebook used when
the vector quantizer 54 performs vector quantization. The codebook creator
52 supplies the created codebook to the vector quantizer 54.
In step S23, the tentative normalizer 53 tentatively normalizes the
normalization-coefficient learning parameters by multiplying speech
feature parameter V.sub.i,j among the normalization-coefficient learning
parameters by normalization coefficient .alpha. from the normalization
coefficient controller 55, and supplies the tentatively normalized
parameters to the vector quantizer 54.
When receiving the latest codebook from the codebook creator 52, and
receiving the latest normalized normalization-coefficient learning
parameters from the tentative normalizer 53, the vector quantizer 54 uses
the codebook from the codebook creator 52 in step S24 to perform vector
quantization for the image and the speech. The vector quantizer 54
supplies the image- and speech-related quantization errors to the
normalization coefficient controller 55.
In other words, in step S24, the vector quantizer 54 calculates a distance
between image feature parameter P.sub.i,j (among the normalized
normalization-coefficient learning parameters) and image-related element
S.sub.i,j of the code vector, and supplies the calculated shortest
distance as an image-related quantization error to the normalization
coefficient controller 55. The vector quantizer 54 also calculates a
distance between speech feature parameter .alpha.V.sub.i,j (among the
normalized normalization-coefficient learning parameters) and
speech-related element T.sub.i,j of the code vector, and supplies the
calculated shortest distance as a speech-related quantization error to the
normalization coefficient controller 55.
As described, the tentative normalizer 53 is supplied with a plurality of
sets of normalization-coefficient learning parameters. Thus, the vector
quantizer 54 is also supplied with a plurality of sets of normalized
normalization-coefficient learning parameters. The vector quantizer 54
successively finds, for each of the normalized normalization-coefficient
learning parameters, the above-described image- and speech-related
quantization errors, and supplies them to the normalization coefficient
controller 55.
In step S24, the normalization coefficient controller 55 accumulates, for
all the normalization-coefficient learning parameters, the image- and
speech-related quantization errors supplied from the vector quantizer 54,
thereby finding image-related quantization-error-accumulated value D.sub.P
and speech-related quantization-error-accumulated value D.sub.V. The
obtained image-related quantization-error-accumulated value D.sub.P and
speech-related quantization-error-accumulated value D.sub.V are stored in
the normalization coefficient controller 55.
In step S25, the normalization coefficient controller 55 determines whether
image-related quantization-error-accumulated value D.sub.P and
speech-related quantization-error-accumulated value D.sub.V have been
obtained with respect to all the values of normalization coefficient
.alpha.. In this example, accumulated values D.sub.P and D.sub.V are found
by, for example, initially setting normalization coefficient .alpha. at
0.001, and increasing normalization coefficient .alpha. in steps of 0.001
from 0.001 to 2.000. In
step S25, the normalization coefficient controller 55 determines, for the
image and the speech, whether quantization-error-accumulated values
D.sub.P and D.sub.V have been found for every value of normalization
coefficient .alpha. in this range.
If the normalization coefficient controller 55 has determined in step S25
that quantization-error-accumulated values D.sub.P and D.sub.V have not
been found with all the values of normalization coefficient .alpha., the
normalization coefficient controller 55 changes normalization coefficient
.alpha. in step S26, as described above, and supplies it to the tentative
normalizers 51 and 53. After that, the process returns to step S21, and
the same processing is repeated using the changed value of normalization
coefficient .alpha..
If the normalization coefficient controller 55 has determined in step S25
that quantization-error-accumulated values D.sub.P and D.sub.V have been
found with all the values of normalization coefficient .alpha., it
proceeds to step S27, and calculates the absolute value .vertline.D.sub.P
-D.sub.V.vertline. of the difference between image-related
quantization-error-accumulated value D.sub.P and speech-related
quantization-error-accumulated value D.sub.V (stored in step S24) with
respect to each value of normalization coefficient .alpha..
The normalization coefficient controller 55 also detects the value of
normalization coefficient .alpha. that gives the minimum value of
difference absolute value .vertline.D.sub.P -D.sub.V.vertline.. In other
words, the normalization coefficient controller 55 detects the value of
normalization coefficient .alpha. at which, ideally, image-related
accumulated value D.sub.P and speech-related accumulated value D.sub.V
are identical. The normalization coefficient controller 55 proceeds to
step S28, and terminates the process after outputting the normalization
coefficient .alpha. that gives the minimum value of absolute value
.vertline.D.sub.P -D.sub.V.vertline.. The output normalization coefficient
.alpha. is the coefficient for performing normalization so that image
feature parameter P.sub.i,j and speech feature parameter V.sub.i,j can be
treated with the same weight.
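The search over normalization coefficient .alpha. in steps S25 to S28 can
be sketched as follows. For brevity, the processing of steps S21 to S24
(tentative normalization, codebook creation, vector quantization, and
accumulation) is represented by a caller-supplied function that returns
(D.sub.P, D.sub.V) for a given .alpha.. The names search_alpha and
toy_errors are hypothetical, and toy_errors is only a stand-in with made-up
behavior, not a model of real quantization errors.

```python
from typing import Callable, Tuple


def search_alpha(accumulated_errors: Callable[[float], Tuple[float, float]],
                 step: float = 0.001, count: int = 2000) -> float:
    """Try alpha = step, 2*step, ..., count*step and return the value that
    minimizes |D_P - D_V| (the first one if several are tied)."""
    best_alpha = step
    best_gap = float("inf")
    for k in range(1, count + 1):
        alpha = k * step  # 0.001, 0.002, ..., 2.000 with the defaults
        d_p, d_v = accumulated_errors(alpha)  # stands in for steps S21-S24
        gap = abs(d_p - d_v)
        if gap < best_gap:
            best_alpha, best_gap = alpha, gap
    return best_alpha


def toy_errors(alpha: float) -> Tuple[float, float]:
    """Made-up stand-in: constant image error, speech error growing
    linearly with alpha, so the two are equal at alpha = 0.75."""
    return 3.0, 4.0 * alpha


alpha = search_alpha(toy_errors)
print(round(alpha, 3))
```

With the default arguments, 2000 values of .alpha. are evaluated, so the
codebook would be recreated 2000 times in an actual learning run.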
As described above, a codebook is created by tentatively normalizing
code-vector learning parameters, which are integrated parameters composed
of image and speech feature parameters, and using the normalized
code-vector learning parameters. Normalization-coefficient learning
parameters, which are also integrated parameters composed of image and
speech feature parameters, are likewise tentatively normalized. Accumulated
values of image- and speech-related quantization errors (minimum values of
distances to the code vectors) are then found by using the created codebook
to perform vector quantization on each of the image and speech feature
parameters among the normalized normalization-coefficient learning
parameters, and the normalization coefficient is changed so that the image-
and speech-related accumulated values are equal. Thereby, normalization
coefficients can be found for performing normalization so that feature
parameters of different media such as image and speech can be treated with
the same weight.
As a result, when speech recognition is performed by using the
normalization coefficients to normalize feature parameters extracted from
speech and feature parameters extracted from an image of lips of the
speaker, integrating the normalized feature parameters, and using the
integrated parameters, the recognition can be prevented from being greatly
affected by only one of the speech and the image, which would hinder an
increase in the recognition factor.
In addition, effects of the feature parameters (of the media) which
constitute the integrated parameters, on the recognition factor, can be
easily verified.
In the above-described case, the weights of the image feature parameters
are used as a reference (set to be 1), and normalization coefficient
.alpha. for setting the weights of the speech feature parameters to be
identical to those of the image feature parameters is found. Therefore,
the intermedia normalizer 21 (shown in FIG. 1) outputs the image feature
parameters without performing any processing, while it normalizes the
speech feature parameters by multiplying the speech feature parameters by
the normalization coefficient .alpha. found as described above, and
outputs the normalized speech feature parameters.
Although the learning that finds normalization coefficient .alpha. for
setting the weights of the feature parameters of two types (image and
speech) to be equal has been described with reference to FIG. 3, a type of
learning can be performed that finds normalization coefficients for
equalizing the weights of feature parameters of three or more types or the
weights of feature parameters of media other than the image and the
speech.
The above-described normalization coefficient learning can be applied
regardless of the type and order of feature parameters because it is not
dependent on the type and order of feature parameters.
FIG. 5 shows an example of the distance-transition matching unit 31 shown
in FIG. 1.
From the integrated parameter generating unit 2 (shown in FIG. 1), for
example, integrated parameters generated when a word was pronounced are
supplied in time series to a time-domain normalizer 61. The time-domain
normalizer 61 performs time-domain normalization on the supplied
integrated parameters.
When a speech time in which a word was pronounced is represented by t, a
time change of an element among integrated parameters generated when the
word was pronounced is as shown in, for example, FIG. 6A. Speech time t in
FIG. 6A varies depending on each speech, even if the same person
pronounced the same word. Accordingly, the time-domain normalizer 61
performs time-domain normalization so that speech time t is uniformly set
to be time T.sub.C, as shown in FIG. 6B. Assuming that the speech
recognition apparatus (shown in FIG. 1) performs word recognition, time
T.sub.C is set to be sufficiently longer than a general speech time
required when a word to be recognized is pronounced. Thus, the time-domain
normalizer 61 changes the integrated parameter shown in FIG. 6A so that it
is, so to speak, "extended" in the time-domain direction. The technique of
the time-domain normalization is not limited to that shown in FIGS. 6A and
6B.
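One possible form of the time-domain normalization can be sketched as
follows for a single element of the integrated parameters. The description
only states that the duration is uniformly set to time T.sub.C; resampling
by linear interpolation is an assumption made here for illustration, and
the function name normalize_duration is hypothetical.

```python
from typing import List


def normalize_duration(series: List[float], num_samples: int) -> List[float]:
    """Resample a time series to num_samples points so that every utterance
    has the same length. Assumption: linear interpolation between
    neighboring samples."""
    if not series or num_samples < 1:
        raise ValueError("need a non-empty series and num_samples >= 1")
    if len(series) == 1 or num_samples == 1:
        return [series[0]] * num_samples
    # Map output index n onto a fractional position in the input series.
    scale = (len(series) - 1) / (num_samples - 1)
    out = []
    for n in range(num_samples):
        pos = n * scale
        lo = int(pos)
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * series[lo] + frac * series[hi])
    return out


# A 3-sample utterance "extended" to 5 samples.
stretched = normalize_duration([0.0, 2.0, 4.0], 5)
print(stretched)
```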
The time-domain-normalized parameters are supplied from the time-domain
normalizer 61 to a vector quantizer 62. The vector quantizer 62
sequentially performs vector quantization on the time-domain-normalized
integrated parameters, using a codebook stored in a codebook storage unit
63, and sequentially supplies a distance calculator 64 with codes as the
vector-quantization results, that is, codes corresponding to code vectors
nearest to the integrated parameters.
The codebook storage unit 63 stores the codebook, which is used when the
vector quantizer 62 performs vector quantization on the integrated
parameters.
The distance calculator 64 accumulates, in units of time, each distance
between a distance-transition model of the word to be recognized and a
code vector obtained when a code series output by the vector quantizer 62
is observed, and supplies the accumulated value to a sorter 66.
A distance-transition-model storage unit 65 stores a distance-transition
model that represents, as shown in FIG. 7, the transition over time of the
distances between time-series integrated parameters (normal series) of the
word to be recognized and the code vectors of the codebook stored in the
codebook storage
unit 63. In other words, the distance-transition-model storage unit 65
stores a distance-transition model (as shown in FIG. 7) that is obtained
by learning (described below) for each word to be recognized.
In the example shown in FIG. 7, the codebook stored in the codebook storage
unit 63 has J+1 code vectors C.sub.0 to C.sub.J.
The sorter 66 selects, in increasing order, the Nb smallest values (where
Nb represents a natural number) among distance-accumulated values on the
distance-transition model of each word to be recognized, and outputs them,
as a result of matching between the integrated parameters and the
distance-transition model, to the determining unit 4.
The above-described distance-transition matching unit 31 performs matching
based on a distance-transition method. A matching process based on this
distance-transition method is described below with reference to the
flowchart shown in FIG. 8.
When receiving time-series integrated parameters corresponding to the
pronunciation of a word from the integrated parameter generating unit 2
(shown in FIG. 1), the time-domain normalizer 61 performs time-domain
normalization on the integrated parameters in step S31, and outputs the
time-domain-normalized parameters to the vector quantizer 62. In step S32,
the vector quantizer 62 sequentially performs vector quantization on the
time-domain-normalized parameters supplied from the time-domain normalizer
61 by referring to the codebook stored in the codebook storage unit 63,
and sequentially outputs a code series corresponding to code vectors
having the shortest distances with the integrated parameters, as the
vector-quantization results to the distance calculator 64.
In step S33, the distance calculator 64 accumulates each distance between
the distance-transition model of the word to be recognized and each code
vector obtained when the code series output by the vector quantizer 62 is
observed.
In other words, when among the code series output by the vector quantizer
62, a code at time t is represented by s.sub.t (t=0, 1, . . . , T.sub.C),
the distance calculator 64 finds, by referring to the distance-transition
model, the distance at time #0 to code vector C.sub.j (j=0, 1, . . . , J)
corresponding to code s.sub.0 initially output by the vector quantizer 62.
Specifically, when code s.sub.0 corresponds to, for example, code
vector C.sub.0, the distance at time #0, which is on the curve indicating
the distance transition from code vector C.sub.0, is found in FIG. 7.
The distance calculator 64 calculates the distance at time #1 to code
vector C.sub.j corresponding to code s.sub.1 secondly output by the vector
quantizer 62 by referring to the distance-transition model. Similarly, the
distance calculator 64 sequentially finds distances up to the distance at
time #T.sub.C to code vector C.sub.j corresponding to code s.sub.TC finally
output by the vector quantizer 62 by referring to the distance-transition
model, and calculates an accumulated value of the distances.
After calculating accumulated values of distances for all
distance-transition models stored in the distance-transition-model storage
unit 65, the distance calculator 64 outputs the accumulated values to the
sorter 66, and proceeds to step S34.
In step S34, the sorter 66 selects, in increasing order, the Nb smallest
of the accumulated values of distances on the distance-transition
models of words to be recognized, and proceeds to step S35. In step S35,
the sorter 66 outputs, to the determining unit 4, the selected values as a
result of matching between the integrated parameters and the
distance-transition models.
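Steps S33 to S35 can be sketched as follows. The sketch assumes that each
distance-transition model is held as a table in which entry [t][j] is the
distance at time #t to code vector C.sub.j; this data layout, the function
names accumulate and match, and the two toy models are hypothetical and
serve only as an illustration.

```python
from typing import Dict, List, Tuple

# model[t][j]: distance at time #t to code vector C_j (assumed layout)
Model = List[List[float]]


def accumulate(model: Model, codes: List[int]) -> float:
    """Sum, over time, the model distance to the code observed at each time."""
    return sum(model[t][code] for t, code in enumerate(codes))


def match(models: Dict[str, Model], codes: List[int],
          nb: int) -> List[Tuple[float, str]]:
    """Return the nb smallest accumulated distances, in increasing order,
    together with the word of the corresponding model."""
    scores = [(accumulate(m, codes), word) for word, m in models.items()]
    return sorted(scores)[:nb]


# Toy models for two words, with 3 time steps and 2 code vectors.
models = {
    "yes": [[0.0, 2.0], [1.0, 3.0], [4.0, 0.5]],
    "no": [[2.0, 0.0], [3.0, 1.0], [0.5, 4.0]],
}
codes = [0, 0, 1]  # code series s_0, s_1, s_2 from the vector quantizer
best = match(models, codes, nb=1)
print(best)
```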
With reference to FIG. 9, a learning apparatus for performing learning that
finds the distance-transition models to be stored in the
distance-transition-model storage unit 65 (shown in FIG. 5) is described
below.
A time-domain normalizer 71 is supplied with time-series, learning
integrated parameters. The time-domain normalizer 71 performs time-domain
normalization on the learning integrated parameters, similarly to the
time-domain normalizer 61 (shown in FIG. 5), and supplies the normalized
parameters to a distance calculator 72.
In other words, the time-domain normalizer 71 is supplied with, for
example, a plurality of sets of time-series, learning integrated
parameters for finding a distance-transition model of a word. The
time-domain normalizer 71 performs time-domain normalization on each of
the learning integrated parameters, and processes the normalized
parameters to generate one set of learning integrated parameters. Specifically, a
plurality of learning integrated parameters (Nc learning integrated
parameters in FIG. 10) (on a word) that do not always have the same
duration are supplied to the time-domain normalizer 71, as shown in column
(A), FIG. 10. The time-domain normalizer 71 performs time-domain
normalization on the supplied parameters so that each of their durations
is set to be time T.sub.C, as shown in column (B), FIG. 10. The
time-domain normalizer 71 calculates, for example, the mean of values
sampled at the same time from the time-domain-normalized parameters, as
shown by the graph (C) in FIG. 10, and generates one learning