| United States Patent | 5131043 |
| Link to this page | http://www.wikipatents.com/5131043.html |
| Inventor(s) | Fujii; Satoru (Sagamihara, JP);
Niyada; Katsuyuki (Sagamihara, JP) |
| Abstract | Linear prediction coefficients of a speech signal including unknown words
are derived for each of successive periodic frame intervals. For every
frame over the duration of an individual phoneme of the speech signal, the
degree of similarity of stored coefficients of known words and derived
coefficients of the unknown words are calculated so that at the end of the
individual phonemes, the degree of similarity is calculated. Phoneme
segmentation data are derived in response to the speech signal and
combined with the calculated degree of similarity over the individual
phoneme to derive phoneme strings of the speech signal. The derived and
stored phoneme strings are compared to indicate the words stored in a word
dictionary having the greatest similarity with the derived phoneme
strings. |
| Title | Method of and apparatus for speech recognition wherein decisions are
made based on phonemes |
| Publication Date |
July 14, 1992 |
| Filing Date |
November 20, 1989 |
| Parent Case |
This application is a continuation of application Ser. No. 06/647,186,
filed Sep. 4, 1984, now abandoned. |
| Priority Data |
Sep 05, 1983[JP]58-163537
Jul 27, 1984[JP]59-157813
Aug 16, 1984[JP]59-170659 |
Claims  |
What is claimed is:
1. A method for recognizing speech comprising:
(a) performing a linear prediction analysis of plural phonemes including
the vowels and a nasal sound to calculate p.sup.th order LPC cepstrum
coefficients in response to periodic frames derived for plural word
utterances by plural speakers;
(b) in response to the calculated LPC cepstrum coefficients calculating a
covariance matrix W that is a function of all the phonemes and a mean
value m.sub.i for each of the particular phonemes,
where
i represents the particular phoneme;
(c) deriving a weighting coefficient
##EQU25##
where j=1,2 . . . p
.delta..sup.jj' =value of element jj' of inverse matrix W.sup.-1 of
covariance matrix W;
(d) deriving the values a.sub.ij, .delta..sup.jj', m.sub.ij', and
m.sub.i.sup.t W.sup.-1 m.sub.i for each of said phonemes as coefficient
values for the phonemes;
(e) in response to known phoneme sounds being uttered by a speaker deriving
the value of an LPC cepstrum coefficient for each phoneme;
(f) storing these LPC cepstrum coefficients with the previously stored
coefficient values of the corresponding phonemes to derive standard
patterns for the phonemes;
(g) during a recognition mode while replicas of unknown words including the
phonemes are derived:
(i) performing phoneme segmentation of each unknown word and
(ii) for each segmented phoneme determining the similarity of LPC cepstrum
coefficients of each segmented phoneme of the unknown words with the
stored coefficient values of the standard patterns for the phonemes in
accordance with
##EQU26##
where t denotes the matrix transpose;
(h) selecting the standard phoneme most similar to the uttered phoneme in
response to the value of L.sub.i ;
(i) combining the selected standard phonemes to form a phoneme string for
an uttered word; and
(j) comparing the formed phoneme string for an uttered word with stored
phoneme strings for known words to determine which of the known words is
the uttered word.
2. The method of claim 1 wherein the plural speakers are divided into
plural groups each including multiple speakers, further including:
calculating the mean value of the LPC cepstrum coefficients for each
phoneme of each group,
from the calculated mean values calculating the inverse matrix for each
group,
calculating a weighting coefficient as
##EQU27##
for the j.sup.th order of each phoneme i of each group (n), where
.delta..sup.jj' is the value of element j, j' of inverse matrix W.sup.-1 of
covariance matrix W,
calculating an average distance of each phoneme (i) of each group (n) as
d.sub.i.sup.(n) =m.sub.i.sup.(n)t W.sup.-1(n) m.sub.i.sup.(n)
storing the values of
a.sub.ij.sup.(n) and d.sub.i.sup.(n) for each group,
selecting one of the groups prior to the recognition mode by performing
for each stored group a similarity calculation with a known uttered word
in accordance with
##EQU28##
determining a center frame of each phoneme of each uttered unknown word,
calculating the sum L.sup.(n) of center frame similarity l.sub.i.sup.(n)
for each phoneme of group n as
##EQU29##
where K=number of stored phonemes
N=number of center frames in group n;
comparing the values of L.sup.(n) for the different groups to select the
group to which the speaker of the unknown uttered word is a member,
during the recognition step comparing the LPC cepstrum coefficients of the
speaker of the unknown uttered words only with the LPC cepstrum
coefficients of the selected group.
3. The method of claim 2 wherein the center frame of each phoneme is
selected from the frame in the center of each phoneme.
4. The method of claim 2 wherein the center frame of each phoneme is
selected from the frame having the greatest similarity.
5. The method of claim 1 wherein the plural speakers are divided into
plural groups each including multiple speakers, further including:
calculating the mean value of the LPC cepstrum coefficients for each
phoneme of each group,
from the calculated mean values of all of groups n calculating a covariance
matrix R common to all of the uttered known phonemes of the n groups,
deriving a weighting coefficient with respect to the j.sup.th order of the
LPC cepstrum coefficients for each phoneme i of group n as
##EQU30##
where .nu.jj' is the value of element j, j' of inverse matrix R.sup.-1 of
covariance matrix R
deriving an average distance to phoneme i of group n as
d.sub.i.sup.(n) =m.sub.i.sup.(n)t R.sup.-1 m.sub.i.sup.(n)
where t is a matrix transpose,
storing the values of a.sub.ij.sup.(n) and d.sub.i.sup.(n) for each of the
n groups,
selecting one of the groups prior to the recognition mode by performing for
each stored group a similarity calculation with a known uttered word in
accordance with
##EQU31##
determining a center frame of each phoneme of each uttered unknown word,
calculating the sum L.sup.(n) of center frame similarity l.sub.i.sup.(n)
for each phoneme of group n as
##EQU32##
where N=number of center frames in group n,
selecting the two groups having the largest values of L, whereby the groups
r and s having the largest and next largest values of L respectively have
values of L.sup.(r) and L.sup.(s),
deriving a numerical indication of the relative values of L.sup.(r) and
L.sup.(s),
in response to the numerical indication having values in first and second
ranges selecting groups r and s respectively,
during the recognition step comparing the LPC cepstrum coefficients of the
speaker of the unknown uttered words only with the LPC cepstrum
coefficients of the selected group.
6. The method of claim 5 wherein the numerical indication is derived as
R.sub.e =L.sup.(r) -L.sup.(s),
selecting group r in response to R.sub.e being positive and in excess of a
predetermined value,
selecting group s in response to R.sub.e being negative and in excess of
the predetermined value,
selecting groups r and s for LPC cepstrum coefficient similarity in
response to R.sub.e being less in absolute value than the threshold.
7. The method of claim 5 wherein a pair of the numerical indications are
derived as R.sub.e.sup.(r) and R.sub.e.sup.(s), where
##EQU33##
selecting group r in response to R.sub.e.sup.(r) exceeding a predetermined
threshold,
selecting group s in response to R.sub.e.sup.(s) exceeding the
predetermined threshold and
in response to neither R.sub.e.sup.(r) nor R.sub.e.sup.(s) exceeding the
threshold determining which of L.sup.(r) or L.sup.(s) is greater, and
selecting the group (r or s) having the greater value of L.sup.(r) or
L.sup.(s). |
Description  |
BACKGROUND OF THE INVENTION
This invention relates generally to speech recognition apparatus and
method, and more particularly to a speech recognition apparatus and method using
phoneme recognition.
Apparatus for and methods of speech recognition wherein spoken words are
automatically recognized are extremely useful for supplying computers and
other devices with data and instructions. In the prior art,
pattern-matching is frequently used for word recognition. According to the
pattern-matching method, there are prepared and prestored in a memory
various standard patterns for all words to be recognized. The degree of
similarity between an input unknown pattern and the standard patterns is
computed to determine the input pattern data having the greatest
similarity to the stored pattern. In this pattern-matching method, it is
necessary to prepare standard patterns for all words to be recognized.
Hence, new standard patterns must be supplied and stored by the apparatus
when the apparatus is to recognize the words spoken by different people.
If several hundred words are to be recognized, time-consuming and
troublesome operations are performed to register all these words spoken by
each speaker. Furthermore, a memory used for storing such spoken words is
required to have an extremely large capacity. Moreover, when this method
is used for a large number of words, a long time period is required to
match an input pattern and the standard patterns.
Another method of obtaining the similarity between words prestored in a
word dictionary uses phonemes. Input sounds are recognized as a
combination of phonemes. In phoneme matching, the capacity of the memory
used as the word dictionary is small, the time required for pattern
matching comparison is short, and the contents of the word dictionary can
be readily changed. For instance, since the sound "AKAI" can be expressed
by way of a simple form of "a k a i" with three different phonemes /a/,
/k/ and /i/ being combined, a number of spoken words emitted from
unspecific speakers is easily handled.
In speech recognition for unspecific speakers, the characteristics of
sounds drastically change depending on sex distinction and age difference.
A problem with prior art phoneme devices is how to generalize various
sound characteristics so as to recognize words spoken by unspecific
persons.
In the case of recognition with a phoneme unit, phoneme standard patterns
are subjected to a large dispersion due to sex distinction and age
difference; for instance, in the case of a vowel /a/, there is a great
difference in the shape of spectrum patterns in a spectrum diagram between
male and female speakers.
In prior art devices this problem is solved by preparing plural standard
patterns for each phoneme; each pattern corresponds to the phoneme for
plural speakers. A calculation is performed for all the standard patterns
and an input sound to determine which standard pattern is most similar to
the input sound. However, this conventional technique suffers from the
following drawbacks:
(1) The speech recognition apparatus must be expensive, since it has to
perform a large number of similarity calculations at high speed.
(2) The recognition rate is somewhat low, since the phoneme is chosen by
finding the greatest similarity among all the standard patterns; because the
number of candidate patterns is large, confusion between phonemes
increases.
(3) The recognition rate is very low if a speaker utters sounds which do
not correspond to any of the prepared standard patterns.
SUMMARY OF THE INVENTION
The present invention has been developed to remove the above-described
drawbacks of conventional speech recognition apparatus.
It is, therefore, an object of the present invention to provide a new and
improved speech recognition apparatus which is capable of handling words
spoken by unspecific speakers, wherein the apparatus is not adversely
influenced by changes in the speakers or acoustic environment so that high
recognition rate is obtained in a stable manner.
Another object of the present invention is to provide a speech recognition
apparatus which is capable of selecting a most suitable standard pattern
group using unknown input sounds so that there is a high word recognition
rate from unspecific speakers wherein the number of similarity
calculations is remarkably reduced, leading to fast processing.
A further object of the present invention is to provide speech recognition
apparatus capable of recognizing sounds from unspecific speakers with high
recognition rate even if utterances from a speaker are not in prepared
standard patterns.
According to a feature of the present invention, standard patterns are
divided into several groups, one of which is automatically selected by
analyzing some spoken words. Then the standard patterns of a selected
group are automatically corrected.
In accordance with the present invention, a method of recognizing speech
comprises: performing a linear prediction analysis of plural phonemes
including the vowels and a nasal sound to calculate p.sup.th order LPC
cepstrum coefficients in response to periodic frames derived for plural
word utterances by plural speakers. In response to the calculated LPC
cepstrum coefficients there is calculated a covariance matrix W that is a
function of all the phonemes and a mean value m.sub.i for each of the
particular phonemes, where i represents the particular phoneme. A
weighting coefficient is derived in accordance with
##EQU1##
where j=1,2 . . . p
.delta..sup.jj' =value of element jj' of inverse matrix W.sup.-1 of
covariance matrix W.
The values a.sub.ij, .delta..sup.jj', m.sub.ij', and m.sub.i.sup.t W.sup.-1
m.sub.i for each of said phonemes are derived as coefficient values for
the phonemes. In response to known phoneme sounds being uttered by a
speaker, the value of an LPC cepstrum coefficient for each phoneme is
derived. These LPC cepstrum coefficients are stored with the previously
stored coefficient values of the corresponding phonemes to derive standard
patterns for the phonemes. During a recognition mode while replicas of
unknown words including the phonemes are derived: (i) phoneme segmentation
of each unknown word is performed and (ii) for each segmented phoneme the
similarity of LPC cepstrum coefficients of each segmented phoneme of the
unknown words with the stored coefficient values of the standard patterns
for the phonemes is determined in accordance with
##EQU2##
where t denotes the matrix transpose. The standard phoneme most
similar to the uttered phoneme is selected in response to the value of
L.sub.i. The selected standard phonemes are combined to form a phoneme
string for an uttered word. The formed phoneme string for an uttered word
is compared with stored phoneme strings for known words to determine which
of the known words is the uttered word.
In a preferred embodiment, the plural speakers are divided into plural
groups each including multiple speakers and the mean value of the LPC
cepstrum coefficients for each phoneme of each group is calculated. From
the calculated mean values the inverse matrix for each group is
calculated. A weighting coefficient is calculated as
##EQU3##
for the j.sup.th order of each phoneme (i) of each group (n), where
.delta..sup.jj' is the value of element j, j' of inverse matrix W.sup.-1 of
covariance matrix W. An average distance of each phoneme i of each group
(n) is calculated as
d.sub.i.sup.(n) =m.sub.i.sup.(n)t W.sup.-1(n) m.sub.i.sup.(n).
The values of a.sub.ij.sup.(n) and d.sub.i.sup.(n) are stored for each
group.
One of the groups prior to the recognition mode is selected by performing
for each stored group a similarity calculation with a known uttered word
in accordance with
##EQU4##
A center frame of each phoneme of each uttered unknown word is determined.
The sum L.sup.(n) of center frame similarity l.sub.i.sup.(n) for each
phoneme of group n is calculated as
##EQU5##
where K=number of stored phonemes and N=number of center frames in group
n.
The values of L.sup.(n) for the different groups are compared to select the
group to which the speaker of the unknown uttered word is a member. During
the recognition step the LPC cepstrum coefficients of the speaker of the
unknown uttered words are compared only with the LPC cepstrum coefficients
of the selected group.
In one embodiment, the center frame of each phoneme is selected from the
frame in the center of each phoneme. In another embodiment, the center
frame of each phoneme is selected from the frame having the greatest
similarity.
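The two ways of choosing a center frame can be sketched as follows. This is a minimal illustration only: the function names, the inclusive frame-index convention and the per-frame similarity list are assumptions and do not appear in the specification.

```python
def center_frame_by_position(start, end):
    """Return the index of the middle frame of a phoneme.

    Assumes the phoneme spans frames start..end inclusive.
    """
    return (start + end) // 2


def center_frame_by_similarity(start, end, frame_similarity):
    """Return the index of the frame with the greatest similarity.

    frame_similarity is a hypothetical list holding one similarity
    value per frame of the utterance.
    """
    return max(range(start, end + 1), key=lambda k: frame_similarity[k])
```

For a phoneme spanning frames 0 to 4, the first function picks frame 2, whereas the second picks whichever of those frames scored highest, so the two embodiments can select different frames for the same phoneme.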
In a further embodiment, the plural speakers are divided into plural groups
each including multiple speakers. In this case, the mean value of the LPC
cepstrum coefficients for each phoneme of each group is calculated. From
the calculated mean values of all of groups n, a covariance matrix R
common to all of the uttered known phonemes of the n groups is calculated.
A weighting coefficient with respect to the j.sup.th order of the LPC
cepstrum coefficients for each phoneme i of group n is derived as
##EQU6##
where .nu..sup.jj' is the value of element j, j' of inverse matrix R.sup.-1 of
covariance matrix R. An average distance to phoneme i of group n is
derived as
d.sub.i.sup.(n) =m.sub.i.sup.(n)t R.sup.-1 m.sub.i.sup.(n)
where t is a matrix transpose.
The values of a.sub.ij.sup.(n) and d.sub.i.sup.(n) are stored for each of
the n groups. One of the groups is selected prior to the recognition mode by
performing for each stored group a similarity calculation with a known
uttered word in accordance with
##EQU7##
A center frame of each phoneme of each uttered unknown word is determined.
The sum L.sup.(n) of center frame similarity l.sub.i.sup.(n) for each
phoneme of group n is calculated as
##EQU8##
where N=number of center frames in group n. The two groups having the
largest values of L are selected, whereby the groups r and s having the
largest and next largest values of L respectively have values of L.sup.(r)
and L.sup.(s). A numerical indication of the relative values of L.sup.(r)
and L.sup.(s) is derived. In response to the numerical indication having
values in first and second ranges, groups r and s are respectively
selected. During the recognition step the LPC cepstrum coefficients of the
speaker of the unknown uttered words are compared only with the LPC
cepstrum coefficients of the selected group.
In one embodiment the numerical indication is derived as
R.sub.e =L.sup.(r) -L.sup.(s).
Group r is selected in response to R.sub.e being positive and in excess of
a predetermined value. Group s is selected in response to R.sub.e being
negative and in excess of the predetermined value. Groups r and s for LPC
cepstrum coefficient similarity are selected in response to R.sub.e being
less in absolute value than the threshold.
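The three-way decision on R.sub.e described above can be sketched as follows. The function name and the return convention are hypothetical, and the handling of the exact boundary (R.sub.e equal in absolute value to the threshold) is an assumption, since the text does not specify it.

```python
def choose_groups(l_r, l_s, threshold):
    """Decide which standard-pattern group(s) to use for recognition.

    l_r and l_s are the similarity sums L(r) and L(s) of the two
    candidate groups; threshold is the predetermined value.
    Returns a tuple of the selected group labels.
    """
    r_e = l_r - l_s
    if r_e > threshold:
        return ("r",)        # R_e positive and in excess of the threshold
    if r_e < -threshold:
        return ("s",)        # R_e negative and in excess of the threshold
    return ("r", "s")        # |R_e| within the threshold: keep both groups
```

When neither group clearly dominates, both are retained, so the similarity calculation during recognition is performed against two groups instead of one.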
BRIEF DESCRIPTION OF THE DRAWINGS
The objects and features of the present invention will become more readily
apparent from the following detailed description of the preferred
embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a block diagram of a conventional speech recognition apparatus of
the phoneme recognition type;
FIG. 2 is a schematic block diagram of a first embodiment of the speech
recognition apparatus according to the present invention;
FIG. 3 is an explanatory graph of recognition rate for different speakers,
obtained according to the present invention;
FIG. 4 is an explanatory graph of speech recognition results according to
the present invention, as a function of standard deviation;
FIG. 5 is a schematic block diagram of a second embodiment of the speech
recognition apparatus according to the present invention;
FIG. 6 is an automatic selection flowchart of standard pattern groups in an
embodiment of the present invention;
FIG. 7 is an automatic correction flowchart of standard pattern groups in
another embodiment of the present invention;
FIG. 8 is a speech recognition flowchart according to the present
invention;
FIG. 9 is a graph wherein phoneme recognition rate according to the present
invention is compared with that of a conventional example;
FIG. 10 is a schematic block diagram of a third speech recognition
apparatus embodiment according to the present invention; and
FIG. 11 is a speech recognition flowchart for the embodiment illustrated in
FIG. 10.
The same or corresponding elements and parts are designated by like
reference numerals throughout the drawings.
DETAILED DESCRIPTION OF THE INVENTION
Prior to describing the embodiments of the present invention and to provide
a better understanding thereof, an example of a conventional phoneme
recognition type speech recognition apparatus is described with reference
to FIG. 1.
A standard pattern storage 11 stores groups of phoneme or syllable standard
patterns. The standard patterns are produced by dividing sound data from
plural speakers by a cluster analysis or the like. For simplicity of
description, it is assumed that standard pattern group 1 includes male
data, while standard pattern group 2 includes female data, such that six
standard patterns are provided for each group.
A speech signal transduced by microphone 1 is A/D converted by A/D
converter 2; the A/D converted data are fed to signal processing circuit 3
and to segmentation portion 5. In signal processing circuit 3, necessary
pre-emphasis is performed and window calculation is executed; and the
result of the calculation is fed to linear prediction analysis processor
4. In segmentation portion 5, the A/D converted data are band pass
filtered, calculations are performed thereon, sound periods are detected,
voiced and unvoiced features are determined and consonants are segmented.
The results of these operations are transmitted from portion 5 to main
memory 7 where they are stored. Similarity calculating portion 6
calculates the degree of similarity between standard patterns for groups
1, 2 etc. stored in memory 11 and LPC parameters derived by linear
prediction analysis processor 4. Standard patterns of standard pattern
group 1 stored in the memory 11 are transmitted to the similarity
calculating portion 6 so that similarity calculation is executed for
respective frames; the similarity calculation results are stored in main
memory 7. The similarity calculation is then performed between the
standard patterns of group 2 and the LPC parameters. Main processor 8
determines the phoneme or syllable in memory 7 having the greatest
similarity to a phoneme or syllable in memory 11. From the determined
greatest similarity and the result from segmentation portion 5, processor
8 then produces a phoneme or syllable string. Then the produced string is
compared with the contents of word dictionary 12 to derive a recognized
word that is fed to output portion 9.
As described at the beginning of the specification, in this conventional
technique, the number of standard patterns to be prepared in advance is
large, leading to a low recognition rate; this prior art method and
apparatus also require an extremely large amount of calculation.
Reference is now made to FIG. 2, a schematic functional block diagram of a
first embodiment according to the present invention. A sound or speech
signal transduced by microphone 31 is A/D converted by A/D converter 21
into 12-bit digital data using 12 kHz sampling pulses. The digital data
from A/D converter 21 are subjected to pre-emphasis and a Hamming window
of 20 msec in signal processing circuit 22, and then a linear prediction
analysis processor 23 calculates LPC cepstrum coefficients every 10 msec.
The LPC cepstrum coefficients obtained by the linear prediction analysis
processor 23 are fed to a similarity calculation portion 24 where the
degree of similarity to respective phonemes is calculated for every frame;
the results of the similarity calculations are stored in main memory 27.
Coefficient memory 25 stores for respective phonemes weighting
coefficients that are compared in calculator portion 24 with the LPC
cepstrum coefficients derived from processor 23.
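The framing performed ahead of the linear prediction analysis can be sketched as follows, using the figures given above (12 kHz sampling, a 20 msec Hamming window, one analysis every 10 msec). The pre-emphasis coefficient of 0.97 and all function names are assumptions; the specification gives neither.

```python
import math

SAMPLE_RATE_HZ = 12_000                     # 12 kHz sampling, from the text
FRAME_LEN = SAMPLE_RATE_HZ * 20 // 1000     # 20 msec window -> 240 samples
FRAME_HOP = SAMPLE_RATE_HZ * 10 // 1000     # analysis every 10 msec -> 120 samples
PRE_EMPHASIS = 0.97                         # assumed value, not given in the text


def pre_emphasize(samples, coeff=PRE_EMPHASIS):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    out = [float(samples[0])]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out


def hamming(length):
    """Hamming window coefficients for a window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_frames(samples):
    """Split pre-emphasized samples into overlapping Hamming-windowed frames."""
    emphasized = pre_emphasize(samples)
    window = hamming(FRAME_LEN)
    frames = []
    for start in range(0, len(emphasized) - FRAME_LEN + 1, FRAME_HOP):
        segment = emphasized[start:start + FRAME_LEN]
        frames.append([s * w for s, w in zip(segment, window)])
    return frames
```

Each returned frame would then be passed to the LPC analysis; with these figures successive frames overlap by half their length.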
Band-pass filter 26 responds to digital data from A/D converter 21 to
calculate band level of three or more channels and overall range power
level; the data derived by filter 26 are stored in main memory 27 as
segmentation data. Main processor 28 detects sound periods and segments
each phoneme in response to data fed from similarity calculating portion
24 and band-pass filter 26 to main memory 27. Processor 28 responds to
data read out of memory 27 to derive a phoneme string by determining the
phoneme derived from processor 23 having the greatest similarity in LPC
cepstrum coefficient during every phoneme period with the LPC cepstrum
coefficients stored in memory 25. The duration of the phoneme period is
determined by processor 28 in response to the output of filter 26, as
stored in memory 27. The degree of LPC coefficient similarity for each phoneme is
determined by processor 28 by comparing the signals stored in memory 27
resulting from the outputs of similarity calculating portion 24. The
phoneme string produced by the main processor 28 is then compared with
words stored in word dictionary memory 29 where words are expressed in
terms of phoneme strings. As a result of the comparison, the word in
dictionary 29 having the greatest similarity with the phoneme string
derived by main processor 28 is determined and fed to output portion 30.
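The comparison of a derived phoneme string with the word dictionary can be sketched as follows. The specification does not state the similarity measure used for this comparison, so edit (Levenshtein) distance is substituted here purely as an assumption. Only the entry "AKAI" is taken from the text; the other dictionary entries and all function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution or match
        prev = cur
    return prev[-1]


# Words expressed as phoneme strings. "AKAI" follows the example in the
# text; "AOI" and "KAU" are hypothetical entries added for illustration.
WORD_DICTIONARY = {
    "AKAI": ["a", "k", "a", "i"],
    "AOI": ["a", "o", "i"],
    "KAU": ["k", "a", "u"],
}


def recognize(phonemes, dictionary=WORD_DICTIONARY):
    """Return the dictionary word whose phoneme string is closest."""
    return min(dictionary, key=lambda w: edit_distance(phonemes, dictionary[w]))
```

For example, recognize(["a", "k", "a", "e"]) selects "AKAI" even though the last phoneme was misrecognized, which illustrates why the dictionary lookup is done on similarity and not on exact match.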
Although it is possible to recognize words spoken by unspecific persons
with only the above-described structure, since the contents of the
coefficient memory 25 corresponding to the standard pattern are fixed,
apparatus having only the above-described structure is apt to suffer from
a low recognition rate. To solve this problem and in accordance with the
invention, therefore, learning portion 32 is provided. Learning portion 32
produces learning data in response to LPC cepstrum coefficients derived
from linear prediction analysis portion 23 and the recognition result
derived from output portion 30, i.e. the word in dictionary 29 recognized
by processor 28 as being closest to the phoneme string calculated by the
processor in response to the output of memory 27. More specifically,
learning portion 32 calculates, for each phoneme, the weighting
(discriminating) coefficients most suitable for the present speaker on the
basis of variance and covariance obtained in advance, and feeds the
calculated weighting coefficients to coefficient memory 25.
The operation of the speech recognition apparatus according to the present
invention is further described in detail with reference to FIG. 2. Prior to
performing speech recognition some data are prepared as follows: A number
of words spoken by a number of speakers are transduced by microphone 31 so
that vowels /a/, /o/, /u/, /i/, /e/ and a nasal sound are derived from A/D
converter 21. Then a linear prediction analysis is performed every 10 msec
by linear prediction analysis processor 23 using obtained sound data to
calculate p.sup.th order LPC cepstrum coefficients. Using the LPC cepstrum
coefficients, a covariance matrix W that is a function of all the phonemes
and a mean value m.sub.i for each phoneme (where i represents the phoneme
type) are derived by processor 23. With this result, a weighting
coefficient a.sub.ij (j=1, 2, . . . , p) is derived as:
##EQU9##
where element (j, j') of inverse matrix W.sup.-1 of covariance matrix W is
expressed by .delta..sup.jj'.
Then the values of a.sub.ij, m.sub.ij', .delta..sup.jj', m.sub.i.sup.t
W.sup.-1 m.sub.i, described infra, are derived for each phoneme as
standard patterns to be stored in coefficient memory 25.
Then in response to a speaker uttering known sounds such as /a/, /i/, /u/,
/e/, /o/, during a learning mode, LPC cepstrum coefficients are derived by
linear prediction analysis processor 23 for the known sounds. Signals
representing the LPC cepstrum coefficients as derived from processor 23
are fed to learning portion 32 which controls loading of memory 25. On the
other hand, during a recognition mode, similarity calculating portion 24
determines the similarity of the LPC cepstrum coefficients derived from
processor 23 with standard patterns prestored in coefficient memory 25.
Similarity calculating portion 24 determines the similarity between the
output of processor 23 and signals stored in memory 25 as a function of
Mahalanobis' distance D.sub.i.sup.2, which is expressed as:
##EQU10##
wherein t denotes matrix transposition; and
x represents the LPC cepstrum coefficients of the input signal as derived
from processor 23.
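The equation images are not reproduced in this text. For reference, the standard definition of the Mahalanobis distance with a covariance matrix W shared by all phonemes, and its expansion, are given below. Whether the patent's own equation uses exactly this form and term order is an assumption; the expansion relies only on W being symmetric.

```latex
\begin{aligned}
D_i^2 &= (x - m_i)^t \, W^{-1} (x - m_i) \\
      &= x^t W^{-1} x \;-\; 2\, m_i^t W^{-1} x \;+\; m_i^t W^{-1} m_i
\end{aligned}
```

Only the second and third terms depend on the phoneme i, and the third term is the quantity m.sub.i.sup.t W.sup.-1 m.sub.i that is stored with each standard pattern.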
Since the first term is constant with respect to phoneme i, similarity
L.sub.i may be simply expressed by:
##EQU11##
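A minimal sketch of the simplified similarity follows. Because the equation images are missing, it assumes that the weighting coefficient is a.sub.ij = 2 .SIGMA..sub.j' .delta..sup.jj' m.sub.ij' and that L.sub.i = .SIGMA..sub.j a.sub.ij x.sub.j - m.sub.i.sup.t W.sup.-1 m.sub.i, which is what the Mahalanobis distance reduces to once the phoneme-independent term is dropped. The inverse matrix, the mean vectors and all function names are hypothetical illustration values.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[k] * vector[k] for k in range(len(vector))) for row in matrix]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phoneme_pattern(w_inv, mean):
    """Precompute the stored values for one phoneme.

    Returns (a_i, d_i), where a_i[j] = 2 * sum_j' w_inv[j][j'] * mean[j']
    (assumed form of the weighting coefficient) and
    d_i = mean^t * w_inv * mean.
    """
    w_inv_m = mat_vec(w_inv, mean)
    weights = [2.0 * value for value in w_inv_m]
    avg_distance = dot(mean, w_inv_m)
    return weights, avg_distance


def similarity(x, pattern):
    """Simplified similarity: larger means closer to the phoneme."""
    weights, avg_distance = pattern
    return dot(weights, x) - avg_distance


def most_similar_phoneme(x, patterns):
    """Pick the phoneme whose standard pattern gives the largest similarity."""
    return max(patterns, key=lambda name: similarity(x, patterns[name]))


# Hypothetical 2nd-order example: inverse covariance and two phoneme means.
W_INV = [[2.0, 0.5], [0.5, 1.0]]
MEANS = {"a": [1.0, 0.0], "i": [0.0, 2.0]}
PATTERNS = {name: phoneme_pattern(W_INV, mean) for name, mean in MEANS.items()}
```

With the input x = [0.9, 0.2], most_similar_phoneme(x, PATTERNS) selects "a". The point of precomputing the weights and the average distance is that each frame then costs only p multiplications per phoneme, instead of a full quadratic form.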
Therefore, similarity can be calculated using Formula (4); a signal
representing the calculation result is fed to main memory 27, and a
phoneme string is produced by main processor 28. Next, a value for a
phoneme position to be learned, on a time base, is fed back from output
portion 30 to learning portion 32 to derive the mean value of the LPC
cepstrum coefficients of the phoneme to be learned. The above steps are
repeated as many times as required for different types of sounds required
to be recognized by the machine. Mean values of respective phonemes, to
which suitable weights are given, are added to the original mean values
(m.sub.ij') that are derived without learning. The resulting sums
represent new mean values for respective p | | |