WikiPatents - Community Patent Review
Method of and apparatus for speech recognition wherein decisions are made based on phonemes    
United States Patent     5131043
Inventor(s)     Fujii; Satoru (Sagamihara, JP); Niyada; Katsuyuki (Sagamihara, JP)
Abstract     Linear prediction coefficients of a speech signal including unknown words are derived for each of successive periodic frame intervals. For every frame over the duration of an individual phoneme of the speech signal, the degree of similarity between stored coefficients of known words and derived coefficients of the unknown words is calculated, so that the degree of similarity over the whole phoneme is available at the end of that phoneme. Phoneme segmentation data are derived in response to the speech signal and combined with the calculated degree of similarity over the individual phoneme to derive phoneme strings of the speech signal. The derived and stored phoneme strings are compared to indicate the words stored in a word dictionary having the greatest similarity with the derived phoneme strings.



Owner/Assignee     Matsushita Electric Industrial Co., Ltd. (Osaka, JP)
Publication Date     July 14, 1992
Application Number     07/441,225
Filing Date     November 20, 1989
US Classification     704/254 704/236
Int'l Classification     G10L 005/04
Examiner     Shaw; Dale M.
Assistant Examiner     Knepper; David D.
Attorney/Law Firm     Lowe, Price, LeBlanc & Becker
Parent Case     This application is a continuation of application Ser. No. 06/647,186, filed Sep. 4, 1984, now abandoned.
Priority Data     Sep 05, 1983 [JP] 58-163537; Jul 27, 1984 [JP] 59-157813; Aug 16, 1984 [JP] 59-170659
USPTO Field of Search     381/41 381/42 381/43 364/513 364/513.5
Claims
 


What is claimed is:

1. A method for recognizing speech comprising:

(a) performing a linear prediction analysis of plural phonemes including the vowels and a nasal sound to calculate p.sup.th order LPC cepstrum coefficients in response to periodic frames derived for plural word utterances by plural speakers;

(b) in response to the calculated LPC cepstrum coefficients calculating a covariance matrix W that is a function of all the phonemes and a mean value m.sub.i for each of the particular phonemes,

where

i represents the particular phoneme;

(c) deriving a weighting coefficient ##EQU25## where j=1,2 . . . p

.delta..sup.jj' =value of element jj' of inverse matrix W.sup.-1 of covariance matrix W;

(d) deriving the values a.sub.ij, .delta..sup.jj', m.sub.ij', and m.sub.i.sup.t W.sup.-1 m.sub.i for each of said phonemes as coefficient values for the phonemes;

(e) in response to known phoneme sounds being uttered by a speaker deriving the value of an LPC cepstrum coefficient for each phoneme;

(f) storing these LPC cepstrum coefficients with the previously stored coefficient values of the corresponding phonemes to derive standard patterns for the phonemes;

(g) during a recognition mode while replicas of unknown words including the phonemes are derived:

(i) performing phoneme segmentation of each unknown word and

(ii) for each segmented phoneme determining the similarity of LPC cepstrum coefficients of each segmented phoneme of the unknown words with the stored coefficient values of the standard patterns for the phonemes in accordance with ##EQU26## where t denotes matrix transposition;

(h) selecting the standard phoneme most similar to the uttered phoneme in response to the value of L.sub.i;

(i) combining the selected standard phonemes to form a phoneme string for an uttered word; and

(j) comparing the formed phoneme string for an uttered word with stored phoneme strings for known words to determine which of the known words is the uttered word.

2. The method of claim 1 wherein the plural speakers are divided into plural groups each including multiple speakers, further including:

calculating the mean value of the LPC cepstrum coefficients for each phoneme of each group,

from the calculated mean values calculating the inverse matrix for each group,

calculating a weighting coefficient as ##EQU27## for the j.sup.th order of each phoneme i of each group (n), where .delta..sup.jj' is the value of element j, j' of inverse matrix W.sup.-1 of covariance matrix W,

calculating an average distance of each phoneme (i) of each group (n) as

d.sub.i.sup.(n) =m.sub.i.sup.(n)t W.sup.-1(n) m.sub.i.sup.(n)

storing the values of

a.sub.ij.sup.(n) and d.sub.i.sup.(n) for each group,

selecting one of the groups prior to the recognition mode by performing for each stored group a similarity calculation with a known uttered word in accordance with ##EQU28##

determining a center frame of each phoneme of each uttered unknown word,

calculating the sum L.sup.(n) of center frame similarity l.sub.i.sup.(n) for each phoneme of group n as ##EQU29## where

K=number of stored phonemes

N=number of center frames in group n;

comparing the values of L.sup.(n) for the different groups to select the group to which the speaker of the unknown uttered word is a member,

during the recognition step comparing the LPC cepstrum coefficients of the speaker of the unknown uttered words only with the LPC cepstrum coefficients of the selected group.

3. The method of claim 2 wherein the center frame of each phoneme is selected from the frame in the center of each phoneme.

4. The method of claim 2 wherein the center frame of each phoneme is selected from the frame having the greatest similarity.

5. The method of claim 1 wherein the plural speakers are divided into plural groups each including multiple speakers, further including:

calculating the mean value of the LPC cepstrum coefficients for each phoneme of each group,

from the calculated mean values of all of groups n calculating a covariance matrix R common to all of the uttered known phonemes of the n groups,

deriving a weighting coefficient with respect to the j.sup.th order of the LPC cepstrum coefficients for each phoneme i of group n as ##EQU30## where .nu..sup.jj' is the value of element j, j' of inverse matrix R.sup.-1 of covariance matrix R,

deriving an average distance to phoneme i of group n as

d.sub.i.sup.(n) =m.sub.i.sup.(n)t R.sup.-1 m.sub.i.sup.(n)

where t is a matrix transpose,

storing the values of a.sub.ij.sup.(n) and d.sub.i.sup.(n) for each of the n groups,

selecting one of the groups prior to the recognition mode by performing for each stored group a similarity calculation with a known uttered word in accordance with ##EQU31##

determining a center frame of each phoneme of each uttered unknown word,

calculating the sum L.sup.(n) of center frame similarity l.sub.i.sup.(n) for each phoneme of group n as ##EQU32## where N=number of center frames in group n,

selecting the two groups having the largest value of L, whereby the groups r and s having the largest and next largest values of L respectively have values of L.sup.(r) and L.sup.(s),

deriving a numerical indication of the relative values of L.sup.(r) and L.sup.(s),

in response to the numerical indication having values in first and second ranges selecting groups r and s respectively,

during the recognition step comparing the LPC cepstrum coefficients of the speaker of the unknown uttered words only with the LPC cepstrum coefficients of the selected group.

6. The method of claim 5 wherein the numerical indication is derived as R.sub.e =L.sup.(r) -L.sup.(s),

selecting group r in response to R.sub.e being positive and in excess of a predetermined value,

selecting group s in response to R.sub.e being negative and in excess of the predetermined value,

selecting groups r and s for LPC cepstrum coefficient similarity in response to R.sub.e being less in absolute value than the predetermined value.

7. The method of claim 5 wherein a pair of the numerical indications are derived as R.sub.e.sup.(r) and R.sub.e.sup.(s), where ##EQU33##

selecting group r in response to R.sub.e.sup.(r) exceeding a predetermined threshold,

selecting group s in response to R.sub.e.sup.(s) exceeding the predetermined threshold and

in response to neither R.sub.e.sup.(r) nor R.sub.e.sup.(s) exceeding the threshold determining which of L.sup.(r) or L.sup.(s) is greater, and

selecting the group (r or s) having the greater value of L.sup.(r) or L.sup.(s).
Description
 


BACKGROUND OF THE INVENTION

This invention relates generally to speech recognition apparatus and methods, and more particularly to a speech recognition apparatus and method using phoneme recognition.

Apparatus for and methods of speech recognition wherein spoken words are automatically recognized are extremely useful for supplying computers and other devices with data and instructions. In the prior art, pattern matching is frequently used for word recognition. In the pattern-matching method, standard patterns for all words to be recognized are prepared and prestored in a memory. The degree of similarity between an unknown input pattern and each standard pattern is computed to determine the stored pattern having the greatest similarity to the input pattern. This method requires standard patterns for all words to be recognized. Hence, new standard patterns must be supplied to and stored by the apparatus whenever it is to recognize words spoken by a different person. If several hundred words are to be recognized, registering all of these words for each speaker is time-consuming and troublesome. Furthermore, the memory used for storing the spoken words must have an extremely large capacity. Moreover, when this method is used for a large number of words, a long time is required to match an input pattern against the standard patterns.

Another method of obtaining the similarity to words prestored in a word dictionary uses phonemes. Input sounds are recognized as a combination of phonemes. In phoneme matching, the capacity of the memory used as the word dictionary is small, the time required for pattern-matching comparison is short, and the contents of the word dictionary can be readily changed. For instance, since the sound "AKAI" can be expressed in the simple form "a k a i" by combining three different phonemes /a/, /k/ and /i/, a large number of words spoken by unspecific speakers are easily handled.
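As a small illustration of why a phoneme-based dictionary is compact and easy to change, the sketch below stores each word as a phoneme string. Only the entry "AKAI" comes from the text; the other words and the names `WORD_DICTIONARY` and `phoneme_inventory` are hypothetical.

```python
# Hypothetical word dictionary: each word is stored as a short phoneme string
# instead of a full acoustic pattern. Only "AKAI" is taken from the text.
WORD_DICTIONARY = {
    "AKAI": ["a", "k", "a", "i"],
    "AOI": ["a", "o", "i"],
    "KAO": ["k", "a", "o"],
}


def phoneme_inventory(dictionary):
    """Return the set of distinct phonemes needed to spell every word."""
    return {p for phonemes in dictionary.values() for p in phonemes}


# Changing the vocabulary only means adding or removing a phoneme string;
# no new acoustic pattern has to be recorded for the word itself.
WORD_DICTIONARY["IKA"] = ["i", "k", "a"]
```

In this sketch four words are covered by four phoneme patterns, and the number of phoneme patterns would stay small as more words are added.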

In speech recognition for unspecific speakers, the characteristics of sounds drastically change depending on sex distinction and age difference. A problem with prior art phoneme devices is how to generalize various sound characteristics so as to recognize words spoken by unspecific persons.

In the case of recognition with a phoneme unit, phoneme standard patterns are subjected to a large dispersion due to sex distinction and age difference; for instance, in the case of a vowel /a/, there is a great difference in the shape of spectrum patterns in a spectrum diagram between male and female speakers.

In prior art devices this problem is solved by preparing plural standard patterns for each phoneme; each pattern corresponds to the phoneme for plural speakers. A calculation is performed for all the standard patterns and an input sound to determine which standard pattern is most similar to the input sound. However, this conventional technique suffers from the following drawbacks:

(1) The speech recognition apparatus must be expensive, because it has to perform a large number of similarity calculations at high speed.

(2) The recognition rate is somewhat low, since the most similar phoneme is found by comparison with all the standard patterns; the number of similar phonemes is therefore large, which increases confusion between phonemes.

(3) The recognition rate is very low if a speaker utters sounds which do not correspond to any of the prepared standard patterns.

SUMMARY OF THE INVENTION

The present invention has been developed to remove the above-described drawbacks of conventional speech recognition apparatus.

It is, therefore, an object of the present invention to provide a new and improved speech recognition apparatus which is capable of handling words spoken by unspecific speakers, wherein the apparatus is not adversely influenced by changes in the speakers or acoustic environment so that high recognition rate is obtained in a stable manner.

Another object of the present invention is to provide a speech recognition apparatus which is capable of selecting a most suitable standard pattern group using unknown input sounds so that there is a high word recognition rate from unspecific speakers wherein the number of similarity calculations is remarkably reduced, leading to fast processing.

A further object of the present invention is to provide speech recognition apparatus capable of recognizing sounds from unspecific speakers with high recognition rate even if utterances from a speaker are not in prepared standard patterns.

According to a feature of the present invention, standard patterns are divided into several groups, one of which is automatically selected by analyzing some spoken words. Then the standard patterns of a selected group are automatically corrected.

In accordance with the present invention, a method of recognizing speech comprises: performing a linear prediction analysis of plural phonemes including the vowels and a nasal sound to calculate p.sup.th order LPC cepstrum coefficients in response to periodic frames derived for plural word utterances by plural speakers. In response to the calculated LPC cepstrum coefficients there is calculated a covariance matrix W that is a function of all the phonemes and a mean value m.sub.i for each of the particular phonemes, where i represents the particular phoneme. A weighting coefficient is derived in accordance with ##EQU1## where j=1,2 . . . p

.delta..sup.jj' =value of element jj' of inverse matrix W.sup.-1 of covariance matrix W.

The values a.sub.ij, .delta..sup.jj', m.sub.ij', and m.sub.i.sup.t W.sup.-1 m.sub.i for each of said phonemes are derived as coefficient values for the phonemes. In response to known phoneme sounds being uttered by a speaker, the value of an LPC cepstrum coefficient for each phoneme is derived. These LPC cepstrum coefficients are stored with the previously stored coefficient values of the corresponding phonemes to derive standard patterns for the phonemes. During a recognition mode while replicas of unknown words including the phonemes are derived: (i) phoneme segmentation of each unknown word is performed and (ii) for each segmented phoneme the similarity of LPC cepstrum coefficients of each segmented phoneme of the unknown words with the stored coefficient values of the standard patterns for the phonemes is determined in accordance with ##EQU2## where t denotes matrix transposition. The standard phoneme most similar to the uttered phoneme is selected in response to the value of L.sub.i. The selected standard phonemes are combined to form a phoneme string for an uttered word. The formed phoneme string for an uttered word is compared with stored phoneme strings for known words to determine which of the known words is the uttered word.

In a preferred embodiment, the plural speakers are divided into plural groups each including multiple speakers and the mean value of the LPC cepstrum coefficients for each phoneme of each group is calculated. From the calculated mean values the inverse matrix for each group is calculated. A weighting coefficient is calculated as ##EQU3## for the j.sup.th order of each phoneme (i) of each group (n), where .delta..sup.jj' is the value of element j, j' of inverse matrix W.sup.-1 of covariance matrix W. An average distance of each phoneme i of each group (n) is calculated as

d.sub.i.sup.(n) =m.sub.i.sup.(n)t W.sup.-1(n) m.sub.i.sup.(n).

The values of a.sub.ij.sup.(n) and d.sub.i.sup.(n) are stored for each group.

One of the groups prior to the recognition mode is selected by performing for each stored group a similarity calculation with a known uttered word in accordance with ##EQU4## A center frame of each phoneme of each uttered unknown word is determined. The sum L.sup.(n) of center frame similarity l.sub.i.sup.(n) for each phoneme of group n is calculated as ##EQU5## where K=number of stored phonemes and N=number of center frames in group n.

The values of L.sup.(n) for the different groups are compared to select the group to which the speaker of the unknown uttered word is a member. During the recognition step the LPC cepstrum coefficients of the speaker of the unknown uttered words are compared only with the LPC cepstrum coefficients of the selected group.
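The group selection described above can be sketched as follows. The equation images for the per-frame similarity and the group sum are not reproduced in this text, so the sketch assumes a linear score of the form a·x − d for each center frame and averages it over the N center frames; both choices, and the names `group_score` and `select_group`, are assumptions made for illustration.

```python
import numpy as np


def group_score(center_frames, group_patterns):
    """Average center-frame similarity of one speaker group.

    center_frames: list of (phoneme, cepstrum_vector) pairs taken from a
        known uttered word, so the phoneme label of each frame is known.
    group_patterns: dict phoneme -> (a, d) holding the stored weighting
        vector and average distance of that group.
    The linear score a.x - d and the division by N are assumed forms.
    """
    total = 0.0
    for phoneme, x in center_frames:
        a, d = group_patterns[phoneme]
        total += float(np.dot(a, x) - d)
    return total / len(center_frames)


def select_group(center_frames, groups):
    """Return the best-scoring group name and all group scores."""
    scores = {name: group_score(center_frames, patterns)
              for name, patterns in groups.items()}
    return max(scores, key=scores.get), scores
```

After selection, only the standard patterns of the chosen group need to be evaluated during recognition, which is where the reduction in similarity calculations comes from.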

In one embodiment, the center frame of each phoneme is selected from the frame in the center of each phoneme. In another embodiment, the center frame of each phoneme is selected from the frame having the greatest similarity.
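The two ways of choosing a center frame can be sketched as below. The function name `center_frame_index` and the `mode` argument are hypothetical, and the similarity values are assumed to be one number per frame of the phoneme.

```python
def center_frame_index(frame_similarities, mode="middle"):
    """Pick the center frame of one phoneme.

    frame_similarities: per-frame similarity values for the phoneme.
    mode="middle": the frame in the temporal center of the phoneme.
    mode="best":   the frame having the greatest similarity.
    """
    n = len(frame_similarities)
    if n == 0:
        raise ValueError("phoneme has no frames")
    if mode == "middle":
        return n // 2
    if mode == "best":
        return max(range(n), key=lambda i: frame_similarities[i])
    raise ValueError(f"unknown mode: {mode}")
```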

In a further embodiment, the plural speakers are divided into plural groups each including multiple speakers. In this case, the mean value of the LPC cepstrum coefficients for each phoneme of each group is calculated. From the calculated mean values of all of groups n, a covariance matrix R common to all of the uttered known phonemes of the n groups is calculated. A weighting coefficient with respect to the j.sup.th order of the LPC cepstrum coefficients for each phoneme i of group n is derived as ##EQU6## where .nu..sup.jj' is the value of element j, j' of inverse matrix R.sup.-1 of covariance matrix R. An average distance to phoneme i of group n is derived as

d.sub.i.sup.(n) =m.sub.i.sup.(n)t R.sup.-1 m.sub.i.sup.(n)

where t is a matrix transpose.

The values of a.sub.ij.sup.(n) and d.sub.i.sup.(n) are stored for each of the n groups. One of the groups is selected prior to the recognition mode by performing for each stored group a similarity calculation with a known uttered word in accordance with ##EQU7##

A center frame of each phoneme of each uttered unknown word is determined. The sum L.sup.(n) of center frame similarity l.sub.i.sup.(n) for each phoneme of group n is calculated as ##EQU8## where N=number of center frames in group n. The two groups having the largest value L are selected whereby the groups r and s having the largest and next largest values of L respectively have values of L.sup.(r) and L.sup.(s). A numerical indication of the relative values of L.sup.(r) and L.sup.(s) is derived. In response to the numerical indication having values in first and second ranges, groups r and s are respectively selected. During the recognition step the LPC cepstrum coefficients of the speaker of the unknown uttered words are compared only with the LPC cepstrum coefficients of the selected group.

In one embodiment the numerical indication is derived as

R.sub.e =L.sup.(r) -L.sup.(s).

Group r is selected in response to R.sub.e being positive and in excess of a predetermined value. Group s is selected in response to R.sub.e being negative and in excess of the predetermined value. Groups r and s for LPC cepstrum coefficient similarity are selected in response to R.sub.e being less in absolute value than the predetermined value.
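The three-way decision on R.sub.e can be sketched as below. The function name `choose_groups` and its arguments are hypothetical, and "in excess of the predetermined value" is read here as a comparison of magnitude. Note that if r is always the top-scoring group, R.sub.e cannot be negative and the second branch would never fire; the sketch simply follows the rule as worded.

```python
def choose_groups(l_r, l_s, threshold):
    """Decide which standard pattern group(s) to use from two group scores.

    l_r, l_s: summed center-frame similarities L(r) and L(s).
    threshold: the predetermined value (assumed non-negative).
    Returns a tuple naming the selected group or groups.
    """
    r_e = l_r - l_s
    if r_e > threshold:        # clearly in favor of group r
        return ("r",)
    if r_e < -threshold:       # clearly in favor of group s
        return ("s",)
    return ("r", "s")          # too close to call: keep both groups
```

Keeping both groups when the scores are close trades some extra similarity calculations for a lower risk of committing to the wrong group.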

BRIEF DESCRIPTION OF THE DRAWINGS

The objects and features of the present invention will become more readily apparent from the following detailed description of the preferred embodiments taken in conjunction with the accompanying drawings in which:

FIG. 1 is a block diagram of a conventional speech recognition apparatus of the phoneme recognition type;

FIG. 2 is a schematic block diagram of a first embodiment of the speech recognition apparatus according to the present invention;

FIG. 3 is an explanatory graph of recognition rate for different speakers, obtained according to the present invention;

FIG. 4 is an explanatory graph of speech recognition results according to the present invention, as a function of standard deviation;

FIG. 5 is a schematic block diagram of a second embodiment of the speech recognition apparatus according to the present invention;

FIG. 6 is an automatic selection flowchart of standard pattern groups in an embodiment of the present invention;

FIG. 7 is an automatic correction flowchart of standard pattern groups in another embodiment of the present invention;

FIG. 8 is a speech recognition flowchart according to the present invention;

FIG. 9 is a graph wherein phoneme recognition rate according to the present invention is compared with that of a conventional example;

FIG. 10 is a schematic block diagram of a third speech recognition apparatus embodiment according to the present invention; and

FIG. 11 is a speech recognition flowchart for the embodiment illustrated in FIG. 10.

The same or corresponding elements and parts are designated by like reference numerals throughout the drawings.

DETAILED DESCRIPTION OF THE INVENTION

Prior to describing the embodiments of the present invention and to provide a better understanding thereof, an example of a conventional phoneme recognition type speech recognition apparatus is described with reference to FIG. 1.

A standard pattern storage 11 stores groups of phoneme or syllable standard patterns. The standard patterns are produced by dividing sound data from plural speakers by a cluster analysis or the like. For simplicity of description, it is assumed that standard pattern group 1 includes male data, while standard pattern group 2 includes female data, such that six standard patterns are provided for each group.

A speech signal transduced by microphone 1 is A/D converted by A/D converter 2; the A/D converted data are fed to signal processing circuit 3 and to segmentation portion 5. In signal processing circuit 3, necessary pre-emphasis is performed and window calculation is executed; and the result of the calculation is fed to linear prediction analysis processor 4. In segmentation portion 5, the A/D converted data are band pass filtered, calculations are performed thereon, sound periods are detected, voiced and unvoiced features are determined and consonants are segmented. The results of these operations are transmitted from portion 5 to main memory 7 where they are stored. Similarity calculating portion 6 calculates the degree of similarity between standard patterns for groups 1, 2 etc. stored in memory 11 and LPC parameters derived by linear prediction analysis processor 4. Standard patterns of standard pattern group 1 stored in the memory 11 are transmitted to the similarity calculating portion 6 so that similarity calculation is executed for respective frames; the similarity calculation results are stored in main memory 7. The similarity calculation is then performed between the standard patterns of group 2 and the LPC parameters. Main processor 8 determines the phoneme or syllable in memory 7 having the greatest similarity to a phoneme or syllable in memory 11. From the determined greatest similarity and the result from segmentation portion 5, processor 8 then produces a phoneme or syllable string. Then the produced string is compared with the contents of word dictionary 12 to derive a recognized word that is fed to output portion 9.

As described at the beginning of the specification, this conventional technique requires a large number of standard patterns to be prepared in advance, which leads to a low recognition rate; the prior art method and apparatus also require an extremely large amount of calculation.

Reference is now made to FIG. 2, a schematic functional block diagram of a first embodiment according to the present invention. A sound or speech signal transduced by microphone 31 is A/D converted by A/D converter 21 into 12-bit digital data using 12 kHz sampling pulses. The digital data from A/D converter 21 are subjected to pre-emphasis and a Hamming window of 20 msec in signal processing circuit 22, and then a linear prediction analysis processor 23 calculates LPC cepstrum coefficients every 10 msec. The LPC cepstrum coefficients obtained by the linear prediction analysis processor 23 are fed to a similarity calculation portion 24 where the degree of similarity to respective phonemes is calculated for every frame; the results of the similarity calculations are stored in main memory 27. Coefficient memory 25 stores for respective phonemes weighting coefficients that are compared in calculator portion 24 with the LPC cepstrum coefficients derived from processor 23.
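A minimal software sketch of this front end is given below, assuming NumPy. The 12 kHz rate, 20 msec Hamming window and 10 msec frame period come from the text; the pre-emphasis coefficient 0.97, the autocorrelation method with the Levinson-Durbin recursion, and all function names are assumptions, since the text does not say how processor 23 is implemented.

```python
import numpy as np

SAMPLE_RATE_HZ = 12_000                    # sampling rate given in the text
FRAME_LEN = SAMPLE_RATE_HZ * 20 // 1000    # 20 ms Hamming window -> 240 samples
FRAME_HOP = SAMPLE_RATE_HZ * 10 // 1000    # one frame every 10 ms -> 120 samples


def frame_signal(samples, pre_emphasis=0.97):
    """Pre-emphasize and cut into Hamming-windowed frames (0.97 is assumed)."""
    x = np.asarray(samples, dtype=float)
    if len(x) < FRAME_LEN:
        raise ValueError("signal is shorter than one frame")
    emphasized = np.append(x[0], x[1:] - pre_emphasis * x[:-1])
    n_frames = 1 + (len(emphasized) - FRAME_LEN) // FRAME_HOP
    window = np.hamming(FRAME_LEN)
    return np.stack([
        emphasized[i * FRAME_HOP:i * FRAME_HOP + FRAME_LEN] * window
        for i in range(n_frames)
    ])


def levinson_durbin(r):
    """Predictor coefficients a_1..a_p from autocorrelation lags r[0..p]."""
    r = np.asarray(r, dtype=float)
    p = len(r) - 1
    a = np.zeros(p + 1)                    # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        if err <= 0:                       # silent or degenerate frame
            break
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] - k * prev[i - 1:0:-1]
        err *= 1.0 - k * k
    return a[1:]


def lpc_to_cepstrum(a):
    """Cepstral recursion c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    p = len(a)
    c = np.zeros(p)
    for n in range(1, p + 1):
        acc = a[n - 1]
        for k in range(1, n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c


def lpc_cepstrum(frame, order):
    """LPC cepstrum c_1..c_order of one windowed frame."""
    frame = np.asarray(frame, dtype=float)
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    return lpc_to_cepstrum(levinson_durbin(r))
```

With these constants consecutive frames overlap by half a window, so each sample away from the signal edges contributes to two frames.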

Band-pass filter 26 responds to digital data from A/D converter 21 to calculate band level of three or more channels and overall range power level; the data derived by filter 26 are stored in main memory 27 as segmentation data. Main processor 28 detects sound periods and segments each phoneme in response to data fed from similarity calculating portion 24 and band-pass filter 26 to main memory 27. Processor 28 responds to data read out of memory 27 to derive a phoneme string by determining the phoneme derived from processor 23 having the greatest similarity in LPC cepstrum coefficient during every phoneme period with the LPC cepstrum coefficients stored in memory 25. The duration of the phoneme period is determined by processor 28 in response to the output of filter 26, as stored in memory 27. The degree of LPC cepstrum coefficient similarity for each phoneme is determined by processor 28 by comparing the signals stored in memory 27 resulting from the outputs of similarity calculating portion 24. The phoneme string produced by the main processor 28 is then compared with words stored in word dictionary memory 29 where words are expressed in terms of phoneme strings. As a result of the comparison, the word in dictionary 29 having the greatest similarity with the phoneme string derived by main processor 28 is determined and fed to output portion 30.
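The two decisions made by main processor 28 can be sketched as below. The text does not state how per-frame similarities are combined over a phoneme period or how phoneme strings are compared with dictionary entries; summing the frame scores and using Levenshtein edit distance are stand-ins chosen for illustration, and all names are hypothetical.

```python
def phoneme_string(frame_scores, segments):
    """Pick one phoneme per segment from per-frame similarity scores.

    frame_scores: list with one dict (phoneme -> similarity) per frame.
    segments: list of (start, end) frame ranges, end exclusive, as would be
        produced by the segmentation step.
    Summing over the segment is an assumed way of combining frames.
    """
    result = []
    for start, end in segments:
        totals = {}
        for scores in frame_scores[start:end]:
            for phoneme, value in scores.items():
                totals[phoneme] = totals.get(phoneme, 0.0) + value
        if totals:                          # skip empty segments
            result.append(max(totals, key=totals.get))
    return result


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def recognize_word(phonemes, dictionary):
    """Return the dictionary word whose phoneme string is closest."""
    return min(dictionary, key=lambda w: edit_distance(phonemes, dictionary[w]))
```

Because the dictionary lookup tolerates a wrong phoneme, a word can still be recognized when one segment is misclassified.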

Although it is possible to recognize words spoken by unspecific persons with only the above-described structure, the contents of the coefficient memory 25 corresponding to the standard patterns are then fixed, so apparatus having only the above-described structure is apt to suffer from a low recognition rate. To solve this problem and in accordance with the invention, learning portion 32 is provided. Learning portion 32 produces learning data in response to LPC cepstrum coefficients derived from linear prediction analysis processor 23 and the recognition result derived from output portion 30, i.e. the word in dictionary 29 recognized by processor 28 as being closest to the phoneme string calculated by the processor in response to the output of memory 27. More specifically, learning portion 32 calculates, for each phoneme, discriminating coefficients (the weighting coefficients) that are most suitable for the present speaker on the basis of variance and covariance obtained in advance, and feeds the calculated weighting coefficients to coefficient memory 25.

The operation of the speech recognition apparatus according to the present invention is further described in detail with reference to FIG. 2. Prior to performing speech recognition some data are prepared as follows: A number of words spoken by a number of speakers are transduced by microphone 31 so that vowels /a/, /o/, /u/, /i/, /e/ and a nasal sound are derived from A/D converter 21. Then a linear prediction analysis is performed every 10 msec by linear prediction analysis processor 23 using the obtained sound data to calculate p.sup.th order LPC cepstrum coefficients. Using the LPC cepstrum coefficients, a covariance matrix W that is a function of all the phonemes and a mean value m.sub.i for each phoneme (where i represents the phoneme type) are derived by processor 23. With this result, a weighting coefficient a.sub.ij (j=1, 2, . . . , p) is derived as: ##EQU9## where element (j, j') of inverse matrix W.sup.-1 of covariance matrix W is expressed by .delta..sup.jj'.

Then the values of a.sub.ij, m.sub.ij', .delta..sup.jj', m.sub.i.sup.t W.sup.-1 m.sub.i, described infra, are derived for each phoneme as standard patterns to be stored in coefficient memory 25.
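The preparation of these stored values can be sketched as below, assuming NumPy. The equation image for the weighting coefficient is not reproduced here, so the sketch assumes the vector form a_i = 2 W^-1 m_i, and it reads "a covariance matrix W that is a function of all the phonemes" as a pooled within-phoneme covariance; both are assumptions, as is the function name `train_standard_patterns`.

```python
import numpy as np


def train_standard_patterns(frames_by_phoneme):
    """Derive per-phoneme coefficient values from training cepstra.

    frames_by_phoneme: dict phoneme -> array of shape (n_frames, p) holding
        p-th order LPC cepstrum vectors from many speakers.
    Returns (W_inv, means, patterns) where patterns[phoneme] = (a_i, d_i).
    """
    means = {ph: np.mean(np.asarray(x, dtype=float), axis=0)
             for ph, x in frames_by_phoneme.items()}
    # Pooled within-phoneme covariance: every frame is centered on the mean
    # of its own phoneme before the covariance is accumulated (assumed form).
    centered = np.vstack([np.asarray(x, dtype=float) - means[ph]
                          for ph, x in frames_by_phoneme.items()])
    W = centered.T @ centered / len(centered)
    W_inv = np.linalg.inv(W)
    patterns = {}
    for ph, m in means.items():
        a = 2.0 * W_inv @ m          # assumed weighting coefficients a_ij
        d = float(m @ W_inv @ m)     # m_i^t W^-1 m_i
        patterns[ph] = (a, d)
    return W_inv, means, patterns
```

Storing a_i and the scalar m_i^t W^-1 m_i means that no matrix operation is needed at recognition time, only one dot product per phoneme and frame.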

Then in response to a speaker uttering known sounds such as /a/, /i/, /u/, /e/, /o/, during a learning mode, LPC cepstrum coefficients are derived by linear prediction analysis processor 23 for the known sounds. Signals representing the LPC cepstrum coefficients as derived from processor 23 are fed to learning portion 32 which controls loading of memory 25. On the other hand, during a recognition mode, similarity calculating portion 24 determines the similarity of the LPC cepstrum coefficients derived from processor 23 with standard patterns prestored in coefficient memory 25. Similarity calculating portion 24 determines the similarity between the output of processor 23 and signals stored in memory 25 as a function of Mahalanobis' distance D.sub.i.sup.2, which is expressed as: ##EQU10## wherein t denotes matrix transposition; and

x represents the LPC cepstrum coefficients of the input signal as derived from processor 23.
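The equation image for the Mahalanobis distance is not reproduced in this text. Under the standard definition, and assuming x and m_i are p-dimensional column vectors and W is the covariance matrix common to all phonemes, the distance would expand into three terms as follows; the ordering of the terms is an assumption.

```latex
\begin{aligned}
D_i^2 &= (x - m_i)^t \, W^{-1} \, (x - m_i) \\
      &= x^t W^{-1} x \;-\; 2\, m_i^t W^{-1} x \;+\; m_i^t W^{-1} m_i .
\end{aligned}
```

In this form only the first term is independent of the phoneme index i. If the weighting coefficients are taken as $a_{ij} = 2\sum_{j'=1}^{p} \delta^{jj'} m_{ij'}$, the second term equals $-\sum_{j=1}^{p} a_{ij} x_j$ and the third term is the stored value $m_i^t W^{-1} m_i$.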

Since the first term is constant with respect to phoneme i, similarity L.sub.i may be simply expressed by: ##EQU11##
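The effect of dropping the constant term can be checked numerically. The sketch below assumes the simplified similarity has the linear form L_i = a_i · x − d_i with a_i = 2 W^-1 m_i and d_i = m_i^t W^-1 m_i, which is a reconstruction since the equation image is missing; the function names are hypothetical.

```python
import numpy as np


def linear_similarity(x, a_i, d_i):
    """Assumed simplified similarity L_i = a_i . x - d_i (larger is closer)."""
    return float(np.dot(a_i, x) - d_i)


def mahalanobis_sq(x, m_i, W_inv):
    """Squared Mahalanobis distance (x - m_i)^t W^-1 (x - m_i)."""
    diff = np.asarray(x, dtype=float) - m_i
    return float(diff @ W_inv @ diff)


def most_similar_phoneme(x, patterns):
    """Return the phoneme whose stored (a_i, d_i) gives the largest L_i."""
    return max(patterns, key=lambda ph: linear_similarity(x, *patterns[ph]))
```

Under these assumptions D_i^2 = x^t W^-1 x − L_i, so the phoneme with the largest L_i is also the one with the smallest Mahalanobis distance, while the per-frame cost drops to one dot product per phoneme.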

Therefore, similarity can be calculated using Formula (4); a signal representing the calculation result is fed to main memory 27, and a phoneme string is produced by main processor 28. Next, a value for a phoneme position to be learned, on a time base, is fed back from output portion 30 to learning portion 32 to derive the mean value of the LPC cepstrum coefficients of the phoneme to be learned. The above steps are repeated as many times as required for the different types of sounds that the machine is required to recognize. Mean values of respective phonemes, to which suitable weights are given, are added to the original mean values (m.sub.ij') that are derived without learning. The resulting sums represent new mean values for respective p