A speech recognition method for detecting and recognizing one or more keywords in a continuous audio signal is disclosed. Each keyword is represented by a keyword template representing one or more target patterns, and each target pattern comprises statistics of each of at least one spectrum selected from plural short-term spectra generated according to a predetermined system for processing of the incoming audio. The incoming audio spectra are compared with the target patterns of the keyword templates and candidate keywords are selected according to a predetermined decision process. For post-decision processing, concatenation techniques based upon a likelihood ratio test are disclosed for rejecting false alarms. Post-decision processing can also include a prosodic test to enhance the effectiveness of the recognition apparatus.
In a speech encoder a Fourier transform of the speech is provided. The Fourier transform is equalized by normalizing the spectrum coefficients to a curve which approximates the shape of the spectrum. Both the curve and the equalized spectrum are encoded. Preferably, only a baseband of the normalized spectrum is encoded and that baseband is repeated in the decoder. The spectrum is normalized by scaling different regions (subbands) of the spectrum differently to flatten the spectrum.
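The subband normalization described above can be illustrated with a minimal Python sketch. It is not the encoder of the disclosure: the function names are hypothetical, the per-band peak is assumed as the scaling curve, and the spectrum is assumed to be a list of non-negative magnitudes. Quantization and encoding are omitted.

```python
def flatten_spectrum(magnitudes, n_bands):
    """Scale each subband by its own peak so the spectrum is roughly flat.

    Assumption: the 'curve' is a piecewise-constant per-band peak value.
    Returns (scales, flattened); both would be encoded.
    """
    band_size = len(magnitudes) // n_bands
    scales, flattened = [], []
    for b in range(n_bands):
        start = b * band_size
        # The last band absorbs any leftover coefficients.
        end = len(magnitudes) if b == n_bands - 1 else start + band_size
        band = magnitudes[start:end]
        scale = max(band) or 1.0  # avoid dividing by zero in a silent band
        scales.append(scale)
        flattened.extend(m / scale for m in band)
    return scales, flattened


def restore_spectrum(scales, flattened):
    """Decoder side: multiply each coefficient by its band's scale."""
    n_bands = len(scales)
    band_size = len(flattened) // n_bands
    restored = []
    for i, value in enumerate(flattened):
        b = min(i // band_size, n_bands - 1)
        restored.append(value * scales[b])
    return restored


def repeat_baseband(baseband, total_len):
    """Decoder side: tile a transmitted baseband to fill the full spectrum."""
    return [baseband[i % len(baseband)] for i in range(total_len)]
```

In this sketch a decoder that received only the baseband would first call `repeat_baseband` to rebuild a full-length flattened spectrum and then `restore_spectrum` to reapply the per-band scales.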
The present invention provides a speech recognition apparatus for recognizing speech by converting a voice signal into digital data. The apparatus performs a two-stage Dynamic Programming (DP) operation comprising a pre-stage and a post-stage operation. The pre-stage DP operation is conducted with regard to the inputted digital voice data selected along a time axis and standard pattern data formed in advance in correspondence with voice and stored in a memory, thereby selecting several standard pattern data as candidates. The post-stage DP operation is then conducted with regard to the inputted digital voice signals and the selected candidates of standard pattern data. Thus, the DP operation time of the present apparatus is shortened and the required work area of the memory is reduced.
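A two-stage DP match of this kind might be sketched as follows. This is an illustration under stated assumptions, not the apparatus itself: features are single numbers per frame, the DP operation is plain dynamic time warping with an absolute-difference cost, the pre-stage simply keeps every `step`-th frame, and the names `dtw_distance`, `preselect`, and `recognize` are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(b)]


def preselect(signal, templates, n_candidates=2, step=2):
    """Pre-stage: coarse DP on time-subsampled data against all templates."""
    coarse = signal[::step]
    ranked = sorted(
        templates,
        key=lambda name: dtw_distance(coarse, templates[name][::step]),
    )
    return ranked[:n_candidates]


def recognize(signal, templates, n_candidates=2, step=2):
    """Post-stage: full-resolution DP restricted to the pre-stage candidates."""
    candidates = preselect(signal, templates, n_candidates, step)
    return min(candidates, key=lambda name: dtw_distance(signal, templates[name]))
```

The saving comes from running the expensive full-resolution comparison on only `n_candidates` templates; only two DP rows are held in memory at a time.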
A model-training module generates mixture Gaussian density models from speech training data for continuous or isolated-word HMM-based speech recognition systems. Speech feature sequences are labeled into segments of states of speech units using a Viterbi-decoding-based optimized segmentation algorithm. Each segment is modeled by a Gaussian density, and the parameters are estimated by sample mean and sample covariance. A mixture Gaussian density is generated for each state of each speech unit by merging the Gaussian densities of all the segments with the same corresponding label. The resulting number of mixture components is proportional to the dispersion and sample size of the training data. A single, fully merged Gaussian density is also generated for each state of each speech unit. The covariance matrices of the mixture components are selectively smoothed by a measure of relative sharpness of the Gaussian density. The weights of the mixture components are initially set uniformly and are reestimated using a segmental-average procedure. The weighting coefficients, together with the Gaussian densities, then become the models of speech units for use in speech recognition.
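The initial model construction can be sketched in Python as below. This is a simplified illustration: features are one-dimensional, so the covariance reduces to a variance, each segment contributes one mixture component without any merging of similar components, and the smoothing and weight reestimation steps are left out. The function name `build_models` and the data layout are assumptions.

```python
from collections import defaultdict
from statistics import fmean, variance


def build_models(labeled_segments):
    """Build per-label mixtures from (label, samples) segments.

    Each segment needs at least two samples so that the sample
    variance is defined.
    """
    by_label = defaultdict(list)
    for label, samples in labeled_segments:
        by_label[label].append(samples)

    models = {}
    for label, segments in by_label.items():
        weight = 1.0 / len(segments)  # uniform initial weights
        # One Gaussian per segment: (weight, sample mean, sample variance).
        mixture = [(weight, fmean(s), variance(s)) for s in segments]
        # Single fully merged Gaussian over all samples with this label.
        pooled = [x for s in segments for x in s]
        models[label] = {
            "mixture": mixture,
            "merged": (fmean(pooled), variance(pooled)),
        }
    return models
```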
The disclosure relates to the recognition of sequences of multidimensional images and, notably, of image signals. The disclosed device includes, for each of said sequences to be recognized, a first circuit for the correlation of vectors representing the signal with a masking vector determined from the vectors representing the sequence to be recognized, producing a series of values corresponding to the degree of similarity of the two correlated vectors, a second circuit for the correlation of a sequence of the series of values with a reference sequence determined from the vectors forming said sequence to be recognized, producing values that correspond to the degree of similarity of the two correlated sequences, and a circuit for deciding on the validity of the recognition, by comparison of the values corresponding to the degree of similarity of the two correlated sequences with a threshold value.
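The two correlation stages and the threshold decision can be mimicked in software as in the sketch below. The disclosure describes circuits; this code only illustrates the data flow, and it assumes cosine similarity as the correlation measure. The names `cosine` and `detect` are hypothetical.

```python
import math


def cosine(u, v):
    """Normalized correlation of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect(frames, mask, reference, threshold):
    """Return start indices where the sequence is judged to be recognized."""
    # Stage 1: correlate each signal vector with the masking vector.
    stage1 = [cosine(frame, mask) for frame in frames]
    # Stage 2: correlate a sliding window of stage-1 values with the reference.
    n = len(reference)
    hits = []
    for start in range(len(stage1) - n + 1):
        score = cosine(stage1[start:start + n], reference)
        # Decision: compare the stage-2 similarity with the threshold.
        if score >= threshold:
            hits.append(start)
    return hits
```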
In a feature data processing apparatus, one of two designated reference density vectors di and dj, to which a feature vector x corresponding to a feature element is closer, is determined from dij of the equation: This value is calculated for all the combinations of two reference feature vectors di and dj selected from a reference feature vector group dk (k=0 to n-1), thereby obtaining classification data. The classification data is determined, using a logical formula or a reference table. An average value of the components is calculated from the classification data and the components of all the feature vectors. The above operation is repeated until the calculated average and the components of the reference density vector converge within a predetermined allowance, thereby precisely classifying feature data.
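The iterative procedure can be sketched as follows. The abstract's equation for dij is not reproduced in the text, so this sketch substitutes a comparison of squared Euclidean distances as a stand-in; that choice, the vote-counting rule used to turn pairwise results into a class, and the names `classify` and `refine` are all assumptions.

```python
from itertools import combinations


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def classify(x, refs):
    """Compare x against every pair (di, dj) and pick the reference with most wins."""
    wins = [0] * len(refs)
    for i, j in combinations(range(len(refs)), 2):
        # Stand-in for the sign of d_ij: which of the two references is closer to x.
        winner = i if sq_dist(x, refs[i]) <= sq_dist(x, refs[j]) else j
        wins[winner] += 1
    return wins.index(max(wins))


def refine(features, refs, tol=1e-6, max_iter=100):
    """Repeat classify-then-average until the references stop moving."""
    refs = [list(r) for r in refs]
    for _ in range(max_iter):
        groups = [[] for _ in refs]
        for x in features:
            groups[classify(x, refs)].append(x)
        # New reference = component-wise average of its class (unchanged if empty).
        new_refs = [
            [sum(col) / len(g) for col in zip(*g)] if g else r
            for g, r in zip(groups, refs)
        ]
        shift = max(abs(a - b) for n, r in zip(new_refs, refs) for a, b in zip(n, r))
        refs = new_refs
        if shift <= tol:  # converged within the allowance
            break
    labels = [classify(x, refs) for x in features]
    return refs, labels
```

With squared Euclidean distance as the pairwise test, this loop behaves like k-means clustering; a different dij would change only the `winner` line.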