WikiPatents - Community Patent Review
Information processing apparatus for integrating a plurality of feature parameters    
United States Patent 6718299
Link to this page: http://www.wikipatents.com/6718299.html
Inventor(s): Kondo; Tetsujiro (Tokyo, JP); Yoshiwara; Norifumi (Tokyo, JP)
Abstract: An information processing apparatus includes a feature parameter detector for detecting feature parameters based on a plurality of input data, a normalizer for normalizing the feature parameters detected by the feature parameter detector while maintaining their feature components, and an integration unit for integrating the feature parameters normalized by the normalizer. In the information processing apparatus, feature parameters from a plurality of input data are normalized based on learning normalization coefficients, and the distance from each of the normalized feature parameters to a normal parameter is calculated. Based on the calculated distances, time-series normalization coefficients for performing speech recognition are determined for the feature parameters. Therefore, optimal normalization coefficients for recognizing the feature parameters at each point of time can be obtained.
   














Information processing apparatus for integrating a plurality of feature parameters
Inventor     Kondo; Tetsujiro (Tokyo, JP); Yoshiwara; Norifumi (Tokyo, JP)
Owner/Assignee     Sony Corporation (Tokyo, JP)
Publication Date     April 6, 2004
Application Number     09/478,061
Filing Date     January 5, 2000
US Classification     704/224; 704/243
Int'l Classification     G01L 019/14
Examiner     To; Doris H.
Assistant Examiner     Opsasnick; Michael N.
Attorney/Law Firm     Frommer Lawrence & Haug LLP; Frommer; William S.; Kessler; Gordon
Priority Data     Jan 07, 1999 [JP] 11-001789
USPTO Field of Search     704/224; 704/243; 704/244; 704/276; 704/201

 References

 U.S. References

Patent No.   Inventor      US Class   Date
6125345      Modi                     Sep. 2000
6006175      Holzrichter   704/208    Dec. 1999
5839103      Mammone       704/232    Nov. 1998
5729694      Holzrichter   705/17     Mar. 1998
5412738      Brunelli      382/115    May 1995
 Claims
 


What is claimed is:

1. An information processing apparatus comprising:

a feature parameter detector for detecting feature parameters based on a plurality of different types of input data, each of said different types of input data being obtained independently, at least one detected feature parameter being associated with each of said different types of input data;

a storage unit for storing normalization information;

a normalizer for normalizing the feature parameters detected by said feature parameter detector associated with said different types of input data, using the normalization information stored in said storage unit, and

an integration unit for integrating the feature parameters normalized by the normalizer,

wherein said normalization information in said storage unit is obtained by detecting distances between a normal parameter and each of normalized feature parameters associated with the different types of learning data; and

comparating the detected distances associated with said different types of learning data.

2. An information processing apparatus according to claim 1, further comprising a recognition unit for recognizing that at least one of said plurality of different types of input data corresponds or does not correspond to an object to be recognized.

3. An information processing apparatus according to claim 2, wherein said recognition unit comprises:

a vector quantizer for performing time-series vector-quantization on outputs from said integration unit;

a distance-transition-model storage unit for storing a plurality of distance-transition models; and

a matching unit for performing matching based on distances between the outputs from said integration unit and each distance-transition model.

4. An information processing apparatus according to claim 1, wherein said feature parameter detector detects time-series feature parameters.

5. An information processing apparatus according to claim 4, further comprising a normalization-information storage unit for storing time-series normalization information corresponding to the feature parameters, wherein said normalizer normalizes the feature parameters, based on the normalization information.

6. An information processing apparatus according to claim 5, wherein the normalization information is generated based on the feature parameters by performing learning beforehand.

7. An information processing apparatus according to claim 1, wherein said normalizer comprises:

a first normalizer for performing time-domain normalization on the feature parameters; and

a second normalizer for normalizing the feature parameters normalized by said first normalizer while maintaining the characteristics of the feature parameters.

8. An information processing apparatus according to claim 1, wherein said normalizer normalizes each of the feature parameters, based on time-series normalization information preset based on relationships among the feature parameters.

9. A learning apparatus comprising:

a normalizer for normalizing, based on first normalization information preset for a plurality of different types of time-series input data, each of said different types of input data being obtained independently, feature parameters of the different types of input data, at least one feature parameter being associated with each of said different types of input data;

a detector for detecting a distance between a normal parameter and each of the normalized feature parameters associated with said different types of input data;

a comparator for comparating the detected distances associated with said different types of input data and for outputting the result of the comparation, and

a normalization information generator for generating, based on the result of the comparation, second normalization information for each of the feature parameters.

10. A learning apparatus according to claim 9, wherein the first normalization information is set to be a value for setting the distance between one of the feature parameters and the normal parameter to be equal to the distance between another one of the feature parameters and the normal parameter.

11. A learning apparatus according to claim 10, wherein said normalization information generator generates the second time-series normalization information by providing a larger weight to one parameter having a shorter distance at each point of time.

12. A learning apparatus comprising:

a feature parameter detector for detecting feature parameters based on a plurality of different types of input data, each of said different types of input data being obtained independently, at least one detected feature parameter being associated with each of said different types of input data;

a first normalizer for normalizing the feature parameters detected by the feature parameter detector associated with said different types of input data among the feature parameters;

a second normalizer for normalizing the feature parameters normalized by the first normalizer based on the order thereof;

a matching unit for detecting distances between a normal parameter and each of the normalized feature parameters, normalized by said second normalizer, associated with said different types of input data;

a comparator for comparating the detected distances associated with said different types of input data and for outputting the result of the comparison, and

a normalization-information generator for generating normalization information based on the result of the comparison.

13. A learning apparatus according to claim 12, wherein said first normalizer performs normalization based on normalization information preset for the feature parameters among the feature parameters.

14. An information processing method comprising the steps of:

detecting feature parameters based on a plurality of different types of input data, each of said different types of input data being obtained independently, at least one detected feature parameter being associated with each of said different types of input data;

normalizing the detected feature parameters using normalization information; and

integrating the normalized feature parameters

wherein said normalization information is previously obtained by detecting distances between a normal parameter and each of normalized feature parameters associated with the different types of learning data; and comparating the detected distances associated with said different types of learning data.

15. An information processing method according to claim 14, further comprising the step of recognizing, based on outputs from the integrating step, that at least one of the input data corresponds or does not correspond to an object be recognized.

16. A learning method comprising the steps of:

normalizing, based on first normalization information preset for a plurality of time-series input data, feature parameters of the different types of input data, each of said different types of input data being obtained independently, at least one feature parameter being associated with each of said different types of input data;

detecting a distance between a normal parameter and each of the normalized feature parameters associated with said different types of input data;

comparating the detected distances associated with said different types of input data and for outputting the result of the comparison; and

generating, based on the result of the comparison, second normalization information for each of the feature parameters.

17. A learning method comprising the steps of:

detecting feature parameters based on a plurality of, different types of input data, each of said different types of input data being obtained independently, at least one detected feature parameter being associated with each of said different types of input data;

normalizing the detected feature parameters among the feature parameters associated with said different types of input data;

further normalizing the normalized feature parameters normalized based on the order thereof;

detecting distances between a normal parameter and each of the normalized feature parameters, normalized by said second normalizer, associated with said different types of input data;

comparating the detected distances associated with said different types of input data, performing a matching process on each of the further normalized feature parameters; and

generating, based on the result of the comparating, normalization information for each of the feature parameters.
 Description
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to information processing apparatuses that integrate a plurality of feature parameters. In particular, it relates to an information processing apparatus that, when performing speech recognition based on both speech and an image of the lips observed when the speech was made, increases speech recognition performance by integrating audio and image feature parameters so that the parameters can be processed in optimal form.

2. Description of the Related Art

By way of example, speech is recognized by extracting feature parameters from the speech, and comparing the feature parameters with normal parameters (normal patterns) used as a reference.

When speech recognition is based on speech alone, there is a limit to how far the recognition rate can be increased. Accordingly, speech recognition may be performed based not only on the speech but also on a captured image of the lips of the speaker.

In this case, it is also possible to integrate feature parameters extracted from the speech and feature parameters extracted from the lip image to form so-called "integrated parameters" and to use the integrated parameters to perform speech recognition. The assignee of the present patent application has proposed Japanese Patent Application No. 10-288038 (which was not open to the public when the present patent application was filed) as a type of speech recognition that generates integrated parameters by integrating feature parameters extracted from speech and feature parameters extracted from a lip image and that uses the integrated parameters to perform speech recognition.

With reference to FIGS. 1 to 16, Japanese Patent Application No. 10-288038 is described below.

FIG. 1 shows an example of a speech recognition apparatus that performs speech recognition based on integrated parameters obtained by integrating feature parameters based on a plurality of input data.

In addition to the speech data to be recognized (speech from a user), other data useful in recognizing the user's speech are sequentially input in time series to the speech recognition apparatus. These include image data obtained by capturing an image of the user's lips when the user spoke, noise data on noise in the environment where the user spoke, and, in the case where the speech recognition apparatus is provided with an input unit for inputting the place where the user speaks, a signal in accordance with the operation of that input unit. The speech recognition apparatus takes these types of data into consideration, as required, when performing speech recognition.

Specifically, the speech data, the lip-image data, the noise data, and other data, which are in digital form, are input to a parameter unit 1. The parameter unit 1 includes signal processors 11.sub.1 to 11.sub.N (where N represents the number of data signals input to the parameter unit 1). The speech data, the lip-image data, the noise data, and other data are processed by the signal processors 11.sub.1 to 11.sub.N corresponding thereto, whereby extraction of feature parameters representing each type of data, etc., is performed. The feature parameters extracted by the parameter unit 1 are supplied to an integrated parameter generating unit 2.

In the parameter unit 1 shown in FIG. 1, the signal processor (lip-signal processor) 11.sub.1 processes the lip-image data, the signal processors (audio-signal processors) 11.sub.2 to 11.sub.N-1 process the speech data, and the signal processor (audio-signal processor) 11.sub.N processes the noise data, etc. The feature parameters of the sound data, such as the speech data and the noise data, include, for example, linear prediction coefficients, cepstrum coefficients, power, line spectrum pairs, and zero crossings. The feature parameters of the lip-image data include, for example, parameters (e.g., the longer diameter and shorter diameter of an ellipse) defining an ellipse approximating the shape of the lips.

The integrated parameter generating unit 2 includes an intermedia normalizer 21 and an integrated parameter generator 22, and generates integrated parameters by integrating the feature parameters of the signals from the parameter unit 1.

In other words, the intermedia normalizer 21 normalizes the feature parameters of the signals from the parameter unit 1 so that they can be processed with the same weight, and outputs the normalized parameters to the integrated parameter generator 22. The integrated parameter generator 22 integrates (combines) the normalized feature parameters of the signals from the intermedia normalizer 21, thereby generating integrated parameters, and outputs the integrated parameters to the matching unit 3.
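The normalize-then-integrate step described above can be sketched as follows. This is a minimal illustration, assuming that normalization multiplies each medium's feature vector by a single coefficient and that integration is plain concatenation; the function name `integrate_features` and the sample values are hypothetical and do not come from the patent.

```python
from typing import List, Sequence


def integrate_features(features: Sequence[Sequence[float]],
                       coeffs: Sequence[float]) -> List[float]:
    """Scale each medium's feature vector by its coefficient, then concatenate.

    Assumption: one scalar normalization coefficient per medium.
    """
    if len(features) != len(coeffs):
        raise ValueError("one coefficient per medium is required")
    integrated: List[float] = []
    for vec, coeff in zip(features, coeffs):
        # Intermedia normalization: multiply every element by the coefficient.
        integrated.extend(coeff * x for x in vec)
    return integrated


# Hypothetical values: two lip-shape parameters and three speech parameters.
lip = [2.0, 1.0]
speech = [0.5, -0.25, 4.0]
print(integrate_features([lip, speech], [1.0, 0.5]))
```

Here the image parameters serve as the reference (coefficient 1) and only the speech parameters are rescaled, which matches the convention the learning apparatus of FIG. 3 uses.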

The matching unit 3 compares the integrated feature parameters and normal patterns (a model to be recognized), and outputs the matching results to a determining unit 4. In other words, the matching unit 3 includes a distance-transition matching unit 31 and a spatial distribution matching unit 32. The distance-transition matching unit 31 uses a distance-transition model (described below) to perform the matching of the integrated feature parameters by using a distance-transition method (described below), and outputs the matching results to the determining unit 4. The spatial distribution matching unit 32 performs the matching of the integrated feature parameters by using a spatial distribution method (described below), and outputs the matching results to the determining unit 4.

The determining unit 4 recognizes the user's speech (sound), based on outputs from the matching unit 3, i.e., the matching results from the distance-transition matching unit 31 and the spatial distribution matching unit 32, and outputs the result of recognition, e.g., a word. In this example, therefore, the unit of recognition in the determining unit 4 is a word. Other units, such as phonemes, can also be recognized.

With reference to the flowchart shown in FIG. 2, processing by the speech recognition apparatus (shown in FIG. 1) is described below.

When the speech data, the lip-image data, the noise data, etc., are input to the speech recognition apparatus, they are supplied to the parameter unit 1.

In step S1, the parameter unit 1 extracts feature parameters from the supplied data, and outputs them to the integrated parameter generating unit 2.

In step S2, the intermedia normalizer 21 (in the integrated parameter generating unit 2) normalizes the feature parameters from the parameter unit 1, and outputs the normalized feature parameters to the integrated parameter generator 22.

In step S3, the integrated parameter generator 22 generates integrated feature parameters by integrating the normalized feature parameters from the intermedia normalizer 21. The integrated feature parameters are supplied to the distance-transition matching unit 31 and the spatial distribution matching unit 32 in the matching unit 3.

In step S4, the distance-transition matching unit 31 performs the matching of the integrated feature parameters by using the distance-transition method, and the spatial distribution matching unit 32 performs the matching of the integrated feature parameters by using the spatial distribution method. Both matching results are supplied to the determining unit 4.

In step S5, based on the matching results from the matching unit 3, the determining unit 4 recognizes the speech data (the user's speech). After outputting the result of (speech) recognition, the determining unit 4 terminates its process.
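Steps S1 to S5 above can be read as a simple chain of functions. The sketch below is only an illustration of that data flow under stated assumptions: feature extraction is a pass-through, and matching is a plain Euclidean distance to one stored pattern per word, which stands in for (and is not) the distance-transition or spatial distribution method. All function names and values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def extract(raw: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    # S1: stand-in for the parameter unit 1 (inputs are used as features as-is).
    return {name: [float(x) for x in values] for name, values in raw.items()}


def normalize(feats: Dict[str, List[float]],
              coeffs: Dict[str, float]) -> Dict[str, List[float]]:
    # S2: multiply each medium's features by its normalization coefficient.
    return {name: [coeffs[name] * x for x in vec] for name, vec in feats.items()}


def integrate(feats: Dict[str, List[float]], order: Sequence[str]) -> List[float]:
    # S3: concatenate the normalized features in a fixed medium order.
    return [x for name in order for x in feats[name]]


def match(vec: Sequence[float],
          patterns: Dict[str, Sequence[float]]) -> Dict[str, float]:
    # S4: assumed stand-in matcher, Euclidean distance to each normal pattern.
    return {word: math.dist(vec, pattern) for word, pattern in patterns.items()}


def decide(distances: Dict[str, float]) -> str:
    # S5: output the word whose normal pattern is closest.
    return min(distances, key=lambda word: distances[word])


def recognize(raw, coeffs, order, patterns) -> str:
    feats = normalize(extract(raw), coeffs)
    return decide(match(integrate(feats, order), patterns))


raw = {"lips": [2.0, 1.0], "speech": [4.0]}
coeffs = {"lips": 1.0, "speech": 0.5}
patterns = {"yes": [2.0, 1.0, 2.0], "no": [0.0, 0.0, 0.0]}
print(recognize(raw, coeffs, ["lips", "speech"], patterns))
```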

As described above, the intermedia normalizer 21 (shown in FIG. 1) normalizes the feature parameters of the signals from the parameter unit 1 so that they can be processed having the same weight. The normalization is performed by multiplying each feature parameter by a normalization coefficient. This normalization coefficient is found by performing learning (normalization-coefficient learning process). FIG. 3 shows an example of a learning apparatus for performing the learning.

For brevity of description, a type of learning is described below that finds normalization coefficients for setting the feature parameters of the speech and the image as two different media (e.g., feature parameters of speech and feature parameters of lips observed when the speech was made) to have the same weight.

In FIG. 3, image feature parameter P.sub.i,j and speech feature parameter V.sub.i,j, which are code-vector learning parameters (codebook-creating data) for creating a codebook for use in vector quantization, are supplied to a tentative normalizer 51. The tentative normalizer 51 tentatively normalizes image feature parameter P.sub.i,j and speech feature parameter V.sub.i,j by using normalization coefficients from a normalization coefficient controller 55, and supplies the normalized feature parameters to a codebook creator 52. In other words, in order to use the weight of image feature parameter P.sub.i,j as a reference and to set the weight of speech feature parameter V.sub.i,j to equal the reference, speech feature parameter V.sub.i,j is multiplied by normalization coefficient .alpha. from the normalization coefficient controller 55. Accordingly, it can be considered that image feature parameter P.sub.i,j is multiplied by 1 as its normalization coefficient.

In FIG. 3, suffix "i" indicating the row of feature parameter P.sub.i,j or V.sub.i,j represents a time (frame) at which the feature parameter P.sub.i,j or V.sub.i,j was extracted, and suffix "j" indicating the column of feature parameter P.sub.i,j or V.sub.i,j represents the order (dimension) of the feature parameter P.sub.i,j or V.sub.i,j. Therefore, (P.sub.i,1, P.sub.i,2, . . . , P.sub.i,L, V.sub.i,1, V.sub.i,2, . . . , V.sub.i,M) are feature parameters (feature vectors) at time i. Expression P.sup.(k).sub.i,j formed by adding a suffix in parentheses to image feature parameter P.sub.i,j represents a feature parameter generated from different learning data if "k" differs. This also applies to the suffix (k) of expression V.sup.(k).sub.i,j.

The codebook creator 52 creates a codebook for use in vector quantization by a vector quantizer 54, using code-vector learning parameters P.sub.i,j and V.sub.i,j, and supplies it to the vector quantizer 54.

In the codebook creator 52, the codebook is created in accordance with, e.g., the LBG (Linde, Buzo, Gray) algorithm. However, another type of algorithm other than the LBG algorithm may be employed.

The LBG algorithm is a so-called "batch learning algorithm". Starting from suitable initial code vectors, it repeatedly performs two operations: Voronoi division, which optimally divides the feature parameter space in accordance with the distance between each feature parameter serving as a learning sample (learning data) and each code vector, and updating of the code vectors to the centroids of the partial regions of the feature parameter space obtained by the Voronoi division. By this repetition, the code vectors (representative vectors) constituting the codebook converge locally to optimal positions.

Here, when the learning samples are represented by x.sub.j (j=0, 1, . . . , J-1), and a set of code vectors is represented by Y={y.sub.0, y.sub.1, . . . , y.sub.N-1 }, the set of learning samples x.sub.j is divided into N subsets S.sub.i (i=0, 1, . . . , N-1) by code-vector set Y in the Voronoi division. In other words, when the distance between learning sample x.sub.j and code vector y.sub.i is represented by d (x.sub.j, y.sub.i), and the following expression holds with respect to all of t (t=0, 1, . . . , N-1) that does not equal i,

d(x.sub.j,y.sub.i)<d(x.sub.j,y.sub.t) (1)

it is determined that learning sample x.sub.j belongs to subset S.sub.i (x.sub.j ∈ S.sub.i).

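In addition, when centroids C (v.sub.0, v.sub.1, . . . , v.sub.M-1) with respect to vectors v.sub.0, v.sub.1, . . . , v.sub.M-1 are defined by the following expression:

$$C(v_0, v_1, \ldots, v_{M-1}) = \operatorname*{argmin}_{v} \left\{ \frac{1}{M} \sum_{m=0}^{M-1} d(v, v_m) \right\} \qquad (2)$$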

code vector y.sub.i is updated in accordance with the following expression

y.sub.i =C({S.sub.i }) (3)

In expression (2), "argmin { }" on the right side means the vector v that minimizes the value in { }. The clustering technique using expression (3) is called "k-means clustering". The details of the LBG algorithm are described in, for example, "Speech and Image Engineering" written by Kazuo Nakata and Satoshi Minami, published by Shokodo in 1987, pp. 29-31.
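The alternation of Voronoi division (expression (1)) and centroid update (expression (3)) can be sketched as below. This is a simplified illustration, not the full LBG procedure: it assumes squared Euclidean distance (for which the centroid is the arithmetic mean), takes the initial code vectors as given, runs a fixed number of iterations, and omits the codebook-splitting stage. The names `train_codebook`, `nearest`, and `centroid` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed distance d(x, y): squared Euclidean distance.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(x: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    # Voronoi division, expression (1): index of the closest code vector.
    # Ties go to the lowest index.
    return min(range(len(codebook)), key=lambda i: sq_dist(x, codebook[i]))


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    # For squared Euclidean distance the minimizing vector is the mean.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def train_codebook(samples: Sequence[Sequence[float]],
                   initial: Sequence[Sequence[float]],
                   iterations: int = 20) -> List[Vector]:
    codebook = [list(y) for y in initial]
    for _ in range(iterations):
        # Partition the learning samples into subsets S_i.
        subsets: List[List[Sequence[float]]] = [[] for _ in codebook]
        for x in samples:
            subsets[nearest(x, codebook)].append(x)
        # Expression (3): move each code vector to the centroid of its subset;
        # an empty subset leaves its code vector unchanged.
        codebook = [centroid(s) if s else y for s, y in zip(subsets, codebook)]
    return codebook


samples = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
print(train_codebook(samples, initial=[[1.0, 1.0], [9.0, 1.0]]))
```

A production implementation would normally stop when the total distortion no longer decreases rather than after a fixed iteration count.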

In the learning apparatus shown in FIG. 3, suffix i,j of elements S.sub.i,j and T.sub.i,j in the codebook output by the codebook creator 52 indicates the j-th element of the code vector corresponding to code #i. Thus, expression (S.sub.i,1, S.sub.i,2, . . . , S.sub.i,L, T.sub.i,1, T.sub.i,2, . . . , T.sub.i,M) represents a code vector corresponding to code #i. Element S.sub.i,j of the code vector corresponds to the image, and element T.sub.i,j corresponds to the speech.

A tentative normalizer 53 is supplied with image feature parameter P.sub.i,j and speech feature parameter V.sub.i,j (in this example it is assumed that both types of parameters are obtained from an image and speech different from those for the code-vector learning parameters) as normalization-coefficient learning parameters for learning normalization coefficient .alpha.. Similarly to the tentative normalizer 51, the tentative normalizer 53 tentatively normalizes image feature parameter P.sub.i,j and speech feature parameter V.sub.i,j by using the normalization coefficients from the normalization coefficient controller 55, and supplies the normalized parameters to the vector quantizer 54. In other words, of image feature parameter P.sub.i,j and speech feature parameter V.sub.i,j serving as normalization-coefficient learning parameters, the tentative normalizer 53 multiplies speech feature parameter V.sub.i,j by normalization coefficient .alpha. from the normalization coefficient controller 55, and outputs the product to the vector quantizer 54.

The tentative normalizer 53 is supplied with a plurality of sets of normalization-coefficient learning parameters. The tentative normalizer 53 performs normalization with respect to each of the normalization-coefficient learning parameters.

The vector quantizer 54 performs vector quantization on the normalized normalization-coefficient learning parameters supplied from the tentative normalizer 53, using the latest codebook supplied from the codebook creator 52, and supplies quantization errors caused by the vector quantization to the normalization coefficient controller 55.

In other words, the vector quantizer 54 calculates, for the image and speech, a distance between each code vector of the codebook and each normalized normalization-coefficient learning parameter, and supplies the calculated shortest distance as a quantization error to the normalization coefficient controller 55. Specifically, the distance between image feature parameter P.sub.i,j among the normalized normalization-coefficient learning parameters, and image-related element S.sub.i,j of the code vector, is calculated, and the calculated shortest distance is supplied as an image-related quantization error to the normalization coefficient controller 55. At the same time, the distance between speech feature parameter .alpha.V.sub.i,j among the normalized normalization-coefficient learning parameters, and speech-related element T.sub.i,j of the code vector, is calculated, and the calculated shortest distance is supplied as a speech-related quantization error to the normalization coefficient controller 55.
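The per-medium quantization errors described above can be sketched as follows. This is one possible reading, assuming Euclidean distance and assuming that the shortest distance is searched separately for the image-related elements and the speech-related elements of the code vectors; the function name and the layout of a code vector as a flat list (first `image_dims` elements for the image, the rest for speech) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def quantization_errors(p: Sequence[float],
                        v: Sequence[float],
                        alpha: float,
                        codebook: Sequence[Sequence[float]],
                        image_dims: int) -> Tuple[float, float]:
    """Return (image-related error, speech-related error) for one frame."""
    # Tentative normalization: only the speech parameters are scaled by alpha.
    scaled_v: List[float] = [alpha * x for x in v]
    # Shortest distance between P and the image-related elements S of any code vector.
    image_error = min(math.dist(p, code[:image_dims]) for code in codebook)
    # Shortest distance between alpha*V and the speech-related elements T.
    speech_error = min(math.dist(scaled_v, code[image_dims:]) for code in codebook)
    return image_error, speech_error


# Hypothetical codebook with L=2 image elements and M=1 speech element per code.
codebook = [[0.0, 0.0, 1.0], [3.0, 4.0, 5.0]]
print(quantization_errors([3.0, 0.0], [2.0], 2.0, codebook, image_dims=2))
```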

The normalization coefficient controller 55 accumulates, with respect to all the normalization-coefficient learning parameters, image- and speech-related quantization errors supplied from the vector quantizer 54, and changes normalization coefficient .alpha. to be supplied to the tentative normalizers 51 and 53 so that both accumulated values are equal.

With reference to the flowchart shown in FIG. 4, a normalization-coefficient learning process performed by the learning apparatus shown in FIG. 3 is described below.

In the learning apparatus shown in FIG. 3, at first, code-vector learning parameters are supplied to the tentative normalizer 51, and normalization-coefficient learning parameters are supplied to the tentative normalizer 53. In addition, initial normalization coefficient .alpha. is supplied from the normalization coefficient controller 55 to the tentative normalizers 51 and 53.

In step S21, the tentative normalizer 51 tentatively normalizes the code-vector learning parameters by multiplying speech feature parameter V.sub.i,j among the code-vector learning parameters by normalization coefficient .alpha. from the normalization coefficient controller 55, and supplies the tentatively normalized parameters to the codebook creator 52.

When receiving the normalized code-vector learning parameters from the tentative normalizer 51, the codebook creator 52 uses the received parameters in step S22 to create, based on the LBG algorithm, a codebook used when the vector quantizer 54 performs vector quantization. The codebook creator 52 supplies the created codebook to the vector quantizer 54.

In step S23, the tentative normalizer 53 tentatively normalizes the normalization-coefficient learning parameters by multiplying speech feature parameter V.sub.i,j among the normalization-coefficient learning parameters by normalization coefficient .alpha. from the normalization coefficient controller 55, and supplies the tentatively normalized parameters to the vector quantizer 54.

When receiving the latest codebook from the codebook creator 52, and receiving the latest normalized normalization-coefficient learning parameters from the tentative normalizer 53, the vector quantizer 54 uses the codebook from the codebook creator 52 in step S24 to perform vector quantization for the image and the speech. The vector quantizer 54 supplies the image- and speech-related quantization errors to the normalization coefficient controller 55.

In other words, in step S24, the vector quantizer 54 calculates a distance between image feature parameter P.sub.i,j (among the normalized normalization-coefficient learning parameters) and image-related element S.sub.i,j of the code vector, and supplies the calculated shortest distance as an image-related quantization error to the normalization coefficient controller 55. The vector quantizer 54 also calculates a distance between speech feature parameter .alpha.V.sub.i,j (among the normalized normalization-coefficient learning parameters) and speech-related element T.sub.i,j of the code vector, and supplies the calculated shortest distance as a speech-related quantization error to the normalization coefficient controller 55.

As described, the tentative normalizer 53 is supplied with a plurality of sets of normalization-coefficient learning parameters. Thus, the vector quantizer 54 is also supplied with a plurality of sets of normalized normalization-coefficient learning parameters. The vector quantizer 54 successively finds, for each of the normalized normalization-coefficient learning parameters, the above-described image- and speech-related quantization errors, and supplies them to the normalization coefficient controller 55.

In step S24, the normalization coefficient controller 55 accumulates, for all the normalization-coefficient learning parameters, the image- and speech-related quantization errors supplied from the vector quantizer 54, thereby finding image-related quantization-error-accumulated value D.sub.P and speech-related quantization-error-accumulated value D.sub.V. The obtained image-related quantization-error-accumulated value D.sub.P and speech-related quantization-error-accumulated value D.sub.V are stored in the normalization coefficient controller 55.

In step S25, the normalization coefficient controller 55 determines whether image-related quantization-error-accumulated value D.sub.P and speech-related quantization-error-accumulated value D.sub.V have been obtained with respect to all the values of normalization coefficient .alpha.. In this example, accumulated values D.sub.P and D.sub.V are found by initially setting normalization coefficient .alpha. at 0.001, and changing (in this example, increasing) normalization coefficient .alpha. by 0.001 between 0.001 and 2.000. In step S25, the normalization coefficient controller 55 therefore determines, for the image and the speech, whether quantization-error-accumulated values D.sub.P and D.sub.V have been found for every value of normalization coefficient .alpha. in that range.

If the normalization coefficient controller 55 has determined in step S25 that quantization-error-accumulated values D.sub.P and D.sub.V have not been found for all the values of normalization coefficient .alpha., the normalization coefficient controller 55 changes normalization coefficient .alpha. in step S26, as described above, and supplies it to the tentative normalizers 51 and 53. After that, the process returns to step S21, and the same processing is repeated using the changed value of normalization coefficient .alpha..

If the normalization coefficient controller 55 has determined in step S25 that quantization-error-accumulated values D.sub.P and D.sub.V have been found for all the values of normalization coefficient .alpha., it proceeds to step S27, and calculates, for each value of normalization coefficient .alpha., the absolute value .vertline.D.sub.P -D.sub.V.vertline. of the difference between image-related quantization-error-accumulated value D.sub.P and speech-related quantization-error-accumulated value D.sub.V (stored in step S24). The normalization coefficient controller 55 also detects the value of normalization coefficient .alpha. that gives the minimum value of difference absolute value .vertline.D.sub.P -D.sub.V.vertline.. In other words, the normalization coefficient controller 55 detects the normalization coefficient .alpha. for which, ideally, image-related quantization-error-accumulated value D.sub.P and speech-related quantization-error-accumulated value D.sub.V are identical. The normalization coefficient controller 55 then proceeds to step S28, outputs the normalization coefficient .alpha. giving the minimum value of absolute value .vertline.D.sub.P -D.sub.V.vertline., and terminates the process. The output normalization coefficient .alpha. is used to perform normalization so that image feature parameter P.sub.i,j and speech feature parameter V.sub.i,j can be treated as having the same weight.
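
The selection made in steps S27 and S28 can be summarized in one expression. This is only a restatement of the text above; writing D.sub.P and D.sub.V as functions of .alpha. and the symbol \hat{\alpha} for the output coefficient are notational assumptions, and the grid is the example range given for step S25.

```latex
\hat{\alpha} \;=\; \operatorname*{arg\,min}_{\alpha \,\in\, \{0.001,\ 0.002,\ \ldots,\ 2.000\}}
\bigl|\, D_P(\alpha) - D_V(\alpha) \,\bigr|
```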

As described above, the code-vector learning parameters, which are integrated parameters composed of image and speech feature parameters, are normalized, and a codebook is created by using the normalized code-vector learning parameters. Meanwhile, the normalization-coefficient learning parameters, which are also integrated parameters composed of image and speech feature parameters, are tentatively normalized; accumulated values of image- and speech-related quantization errors (minimum values of distances to the code vectors) are found by using the created codebook to perform vector quantization on each of the image and speech feature parameters among the normalized normalization-coefficient learning parameters; and the normalization coefficient is changed so that the image- and speech-related accumulated values are equal. Thereby, normalization coefficients for performing normalization so that feature parameters of different media such as image and speech can be treated as having the same weight can be found.
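
The learning loop of steps S21 to S28 might be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: every function name is hypothetical, each integrated parameter is assumed to be a flat sequence whose first `n_image` elements are image features and whose remaining elements are speech features, Euclidean distance is assumed, and a few plain Lloyd iterations stand in for the LBG algorithm named in step S22.

```python
import math

def dist(a, b):
    """Euclidean distance (an assumption; the text only says 'distance')."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def make_codebook(vectors, size, iterations=10):
    """Lloyd iterations seeded with the first `size` vectors (stand-in for LBG)."""
    codebook = [list(v) for v in vectors[:size]]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)), key=lambda k: dist(v, codebook[k]))
            cells[nearest].append(v)
        for k, cell in enumerate(cells):
            if cell:  # an empty cell keeps its previous code vector
                codebook[k] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook

def accumulated_errors(alpha, codebook_data, coeff_data, n_image, size):
    """Steps S21 to S24 for one alpha: return (D_P, D_V)."""
    def scale(v):
        # Image part unchanged, speech part multiplied by alpha.
        return list(v[:n_image]) + [alpha * x for x in v[n_image:]]
    codebook = make_codebook([scale(v) for v in codebook_data], size)
    d_p = d_v = 0.0
    for v in map(scale, coeff_data):
        # Shortest image-part and speech-part distances, found separately.
        d_p += min(dist(v[:n_image], c[:n_image]) for c in codebook)
        d_v += min(dist(v[n_image:], c[n_image:]) for c in codebook)
    return d_p, d_v

def learn_alpha(codebook_data, coeff_data, n_image, size, alphas):
    """Steps S25 to S28: return the alpha minimizing |D_P - D_V|."""
    best_gap, best_alpha = None, None
    for alpha in alphas:
        d_p, d_v = accumulated_errors(alpha, codebook_data, coeff_data, n_image, size)
        gap = abs(d_p - d_v)
        if best_gap is None or gap < best_gap:
            best_gap, best_alpha = gap, alpha
    return best_alpha
```

The sweep described for step S25 would correspond to passing `alphas = [k / 1000 for k in range(1, 2001)]`. Because the codebook is rebuilt for every candidate value, the cost grows linearly with the number of candidates.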

As a result, when speech recognition is performed by using the normalization coefficients to normalize feature parameters extracted from speech and feature parameters extracted from an image of the lips of the speaker, integrating the normalized feature parameters, and using the integrated parameters, the recognition is prevented from being greatly affected by either the speech or the image alone. This can prevent an increase in the recognition factor from being hindered.

In addition, effects of the feature parameters (of the media) which constitute the integrated parameters, on the recognition factor, can be easily verified.

In the above-described case, the weights of the image feature parameters are used as a reference (set to be 1), and normalization coefficient .alpha. for setting the weights of the speech feature parameters to be identical to those of the image feature parameters is found. Therefore, the intermedia normalizer 21 (shown in FIG. 1) outputs the image feature parameters without performing any processing, while it normalizes the speech feature parameters by multiplying the speech feature parameters by the normalization coefficient .alpha. found as described above, and outputs the normalized speech feature parameters.
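
As a small illustration of this normalization, the sketch below passes the image feature parameters through unchanged and scales the speech feature parameters by the learned coefficient. The function name is hypothetical, and joining the two parts by simple concatenation is an assumption about how the integrated parameter is laid out.

```python
def intermedia_normalize(image_params, speech_params, alpha):
    """Return the image part as-is followed by the speech part scaled by alpha."""
    return list(image_params) + [alpha * v for v in speech_params]
```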

Although the learning that finds normalization coefficient .alpha. for setting the weights of the feature parameters of two types (image and speech) to be equal has been described with reference to FIG. 3, a type of learning can be performed that finds normalization coefficients for equalizing the weights of feature parameters of three or more types or the weights of feature parameters of media other than the image and the speech.

The above-described normalization coefficient learning can be applied regardless of the type and order of feature parameters because it is not dependent on the type and order of feature parameters.

FIG. 5 shows an example of the distance-transition matching unit 31 shown in FIG. 1.

Integrated parameters generated when, for example, a word was pronounced are supplied in time series from the integrated parameter generating unit 2 (shown in FIG. 1) to a time-domain normalizer 61. The time-domain normalizer 61 performs time-domain normalization on the supplied integrated parameters.

When a speech time in which a word was pronounced is represented by t, a time change of an element among integrated parameters generated when the word was pronounced is as shown in, for example, FIG. 6A. Speech time t in FIG. 6A varies depending on each speech, even if the same person pronounced the same word. Accordingly, the time-domain normalizer 61 performs time-domain normalization so that speech time t is uniformly set to be time T.sub.C, as shown in FIG. 6B. Assuming that the speech recognition apparatus (shown in FIG. 1) performs word recognition, time T.sub.C is set to be sufficiently longer than a general speech time required when a word to be recognized is pronounced. Thus, the time-domain normalizer 61 changes the integrated parameter shown in FIG. 6A so that it is so-called "extended" in the time-domain direction. The technique of the time-domain normalization is not limited to that shown in FIGS. 6A and 6B.
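
One possible form of this time-domain normalization is sketched below. The text does not fix a technique, so linear interpolation is purely an assumption, as is the function name; the output has T.sub.C + 1 samples so that it can be indexed by t = 0, 1, . . . , T.sub.C, and `t_c` is assumed to be at least 1.

```python
def time_normalize(series, t_c):
    """Resample `series` (a list of equal-length vectors) to t_c + 1 samples."""
    if len(series) == 1:
        return [list(series[0]) for _ in range(t_c + 1)]
    last = len(series) - 1
    out = []
    for n in range(t_c + 1):
        pos = n * last / t_c            # fractional index into the original series
        lo = min(int(pos), last - 1)    # left neighbour, clamped at the final step
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(series[lo], series[lo + 1])])
    return out
```

For instance, a two-sample series stretched with `t_c = 4` yields five evenly interpolated samples whose first and last values equal the original endpoints.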

The time-domain-normalized parameters are supplied from the time-domain normalizer 61 to a vector quantizer 62. The vector quantizer 62 sequentially performs vector quantization on the time-domain-normalized integrated parameters, using a codebook stored in a codebook storage unit 63, and sequentially supplies a distance calculator 64 with codes as the vector-quantization results, that is, codes corresponding to the code vectors nearest to the integrated parameters.

The codebook storage unit 63 stores the codebook, which is used when the vector quantizer 62 performs vector quantization on the integrated parameters.
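
The operation of the vector quantizer 62 on a time series can be sketched as a nearest-neighbour search over the stored codebook. The function names are hypothetical, and squared Euclidean distance is an assumption, since the text only speaks of the "nearest" code vector.

```python
def quantize(vector, codebook):
    """Return the code (index j) of the code vector C_j nearest to `vector`."""
    def sq_dist(c):
        return sum((x - y) ** 2 for x, y in zip(vector, c))
    return min(range(len(codebook)), key=lambda j: sq_dist(codebook[j]))

def quantize_series(series, codebook):
    """Map each time-domain-normalized integrated parameter to its code."""
    return [quantize(v, codebook) for v in series]
```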

The distance calculator 64 accumulates, in units of time, each distance between a distance-transition model of the word to be recognized and a code vector obtained when a code series output by the vector quantizer 62 is observed, and supplies the accumulated value to a sorter 66.

A distance-transition-model storage unit 65 stores a distance-transition model representing distances between time-series integrated parameters (normal series) of the word to be recognized, which are as shown in FIG. 7, and the code vectors of the codebook stored in the codebook storage unit 63. In other words, the distance-transition-model storage unit 65 stores a distance-transition model (as shown in FIG. 7) that is obtained by learning (described below) for each word to be recognized.

In the example shown in FIG. 7, the codebook stored in the codebook storage unit 63 has J+1 code vectors C.sub.0 to C.sub.J.

The sorter 66 selects upper Nb values (where Nb represents a natural number) in increasing order among distance-accumulated values on the distance-transition model of each word to be recognized, and outputs them, as a result of matching between the integrated parameters and the distance-transition model, to the determining unit 4.

The above-described, distance-transition matching unit 31 performs matching based on a distance-transition method. A matching process based on this distance-transition method is described below with reference to the flowchart shown in FIG. 8.

When receiving time-series integrated parameters corresponding to the pronunciation of a word from the integrated parameter generating unit 2 (shown in FIG. 1), the time-domain normalizer 61 performs time-domain normalization on the integrated parameters in step S31, and outputs the time-domain-normalized parameters to the vector quantizer 62. In step S32, the vector quantizer 62 sequentially performs vector quantization on the time-domain-normalized parameters supplied from the time-domain normalizer 61 by referring to the codebook stored in the codebook storage unit 63, and sequentially outputs a code series corresponding to code vectors having the shortest distances with the integrated parameters, as the vector-quantization results to the distance calculator 64.

In step S33, the distance calculator 64 accumulates each distance between the distance-transition model of the word to be recognized and each code vector obtained when the code series output by the vector quantizer 62 is observed.

In other words, when, among the code series output by the vector quantizer 62, the code at time t is represented by s.sub.t (t=0, 1, . . . , T.sub.C), the distance calculator 64 finds, by referring to the distance-transition model, the distance at time #0 to code vector C.sub.j (j=0, 1, . . . , J) corresponding to code s.sub.0 initially output by the vector quantizer 62. Specifically, when code s.sub.0 corresponds to, for example, code vector C.sub.0, the distance at time #0, which is on the curve indicating the distance transition from code vector C.sub.0, is found in FIG. 7.

The distance calculator 64 then finds, by referring to the distance-transition model, the distance at time #1 to code vector C.sub.j corresponding to code s.sub.1 output second by the vector quantizer 62. Similarly, the distance calculator 64 sequentially finds distances up to the distance at time #T.sub.C to code vector C.sub.j corresponding to code s.sub.TC finally output by the vector quantizer 62, and calculates an accumulated value of the distances.

After calculating accumulated values of distances for all distance-transition models stored in the distance-transition-model storage unit 65, the distance calculator 64 outputs the accumulated values to the sorter 66, and proceeds to step S34.

In step S34, the sorter 66 selects upper Nb values in increasing order among the accumulated values of distances on the distance-transition models of words to be recognized, and proceeds to step S35. In step S35, the sorter 66 outputs, to the determining unit 4, the selected values as a result of matching between the integrated parameters and the distance-transition models.
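
Steps S33 to S35 might be sketched as below. The representation is an assumption: each word's distance-transition model is taken to be a table `model[j][t]` holding the modelled distance to code vector C.sub.j at time t, the function names are hypothetical, and "upper Nb values in increasing order" is read as the Nb smallest accumulated distances.

```python
def accumulated_distance(codes, model):
    """Sum model[s_t][t] over the observed code series s_0 .. s_TC (step S33)."""
    return sum(model[s][t] for t, s in enumerate(codes))

def match(codes, models, nb):
    """Return the nb (word, accumulated distance) pairs with the smallest totals."""
    scores = [(word, accumulated_distance(codes, m)) for word, m in models.items()]
    return sorted(scores, key=lambda pair: pair[1])[:nb]
```

Here `models` is a hypothetical mapping from each word to be recognized to its table, and the returned list plays the role of the matching result passed to the determining unit 4.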

With reference to FIG. 9, a learning apparatus for performing learning that finds the distance-transition models to be stored in the distance-transition-model storage unit 65 (shown in FIG. 5) is described below.

A time-domain normalizer 71 is supplied with time-series, learning integrated parameters. The time-domain normalizer 71 performs time-domain normalization on the learning integrated parameters, similarly to the time-domain normalizer 61 (shown in FIG. 5), and supplies the normalized parameters to a distance calculator 72.

In other words, the time-domain normalizer 71 is supplied with, for example, a plurality of sets of time-series, learning integrated parameters for finding a distance-transition model of a word. The time-domain normalizer 71 performs time-domain normalization on each of the learning integrated parameters, and processes the normalized parameters to generate one set of learning integrated parameters. Specifically, a plurality of learning integrated parameters for a word (Nc learning integrated parameters in FIG. 10) that do not always have the same duration are supplied to the time-domain normalizer 71, as shown in column (A), FIG. 10. The time-domain normalizer 71 performs time-domain normalization on the supplied parameters so that each of their durations is set to be time T.sub.C, as shown in column (B), FIG. 10. The time-domain normalizer 71 calculates, for example, the mean of values sampled at the same time from the time-domain-normalized parameters, as shown by the graph (C) in FIG. 10, and generates one learning