A speech recognition method for detecting and recognizing one or more keywords in a continuous audio signal is disclosed. Each keyword is represented by a keyword template representing a plurality of target patterns, and each target pattern comprises statistics of each of a plurality of spectra selected from plural short-term spectra generated according to a predetermined system for processing of the incoming audio. The spectra are processed to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into multi-frame spectral patterns and are compared by means of likelihood statistics with the target patterns of the keyword templates. A concatenation technique employing a loosely set detection threshold makes it very unlikely that a correct pattern will be rejected.
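The comparison step in the preceding abstract can be illustrated with a minimal sketch. The abstract does not specify the likelihood statistic, so a diagonal-Gaussian log-likelihood is assumed here, and the names `log_likelihood`, `candidate_targets`, `target_a`, `target_b` and the threshold value are hypothetical. The loose threshold keeps any target pattern whose score is not clearly poor.

```python
import math

def log_likelihood(pattern, means, variances):
    """Diagonal-Gaussian log-likelihood of one multi-frame pattern (assumed statistic)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        for x, mean, var in zip(pattern, means, variances)
    )

def candidate_targets(pattern, targets, threshold):
    """Return the names of target patterns whose score clears a loose threshold."""
    return [
        name
        for name, (means, variances) in targets.items()
        if log_likelihood(pattern, means, variances) >= threshold
    ]

# Hypothetical target statistics: (means, variances) per target pattern.
targets = {
    "target_a": ([0.0, 1.0], [1.0, 1.0]),
    "target_b": ([5.0, 5.0], [1.0, 1.0]),
}
pattern = [0.1, 0.9]
matches = candidate_targets(pattern, targets, threshold=-5.0)
```

Setting the threshold low trades more false candidates for fewer missed ones, which matches the abstract's stated aim of rarely rejecting a correct pattern.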
A speech encoder is disclosed in which, of the DCT coefficients obtained by discrete cosine transformation, those that have a large absolute value and exert great influence on the tone quality are selected and encoded, and zeros are substituted for the unselected coefficients. This selective encoding does not seriously deteriorate the tone quality even when the coding rate is 8 kbps or below. In another arrangement, about three to 16 different selection patterns (vector patterns per frame) are used for the selective coding, and the pattern which minimizes the coding error is selected and encoded to ensure optimum coding.
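The coefficient selection in the preceding abstract can be sketched as follows. This is an illustration under assumptions, not the disclosed encoder: an orthonormal DCT-II is assumed, the number of retained coefficients (`keep`) is a hypothetical parameter, and quantization and the selection-pattern search are omitted.

```python
import math

def dct_ii(frame):
    """Orthonormal DCT-II of a list of samples (assumed transform variant)."""
    n = len(frame)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(frame))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def select_largest(coeffs, keep):
    """Keep the `keep` largest-magnitude coefficients and zero the others."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]

# A frame made of a single DCT basis function (index 3), so one coefficient dominates.
frame = [math.cos(math.pi * (i + 0.5) * 3 / 16) for i in range(16)]
coeffs = dct_ii(frame)
sparse = select_largest(coeffs, keep=2)
```

Only the positions and values of the retained coefficients would need to be transmitted; the decoder can fill the remaining positions with zeros before the inverse transform.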
In a speech recognition method and apparatus according to the present invention, feature vectors produced by an analysing unit of a speech recognition device are modified to compensate for the effects of noise. The feature vectors are normalized using a sliding normalization buffer (31). The method improves the performance of the speech recognition device in situations where the training phase of the device has been carried out in a noise environment that differs from the noise environment of the actual speech recognition phase.
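A sliding normalization buffer of the kind named in the preceding abstract might look like the sketch below. The abstract does not say which statistics are used, so per-dimension mean and standard deviation over the most recent frames are assumed; the class name `SlidingNormalizer` and the buffer length are hypothetical.

```python
from collections import deque
import math

class SlidingNormalizer:
    """Normalize each feature vector with statistics of the last `buffer_len` frames."""

    def __init__(self, buffer_len):
        # deque with maxlen drops the oldest frame automatically.
        self.buffer = deque(maxlen=buffer_len)

    def push(self, vec):
        """Add one feature vector and return its normalized version."""
        self.buffer.append(list(vec))
        n = len(self.buffer)
        out = []
        for d in range(len(vec)):
            column = [frame[d] for frame in self.buffer]
            mean = sum(column) / n
            std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
            # Guard against a zero spread (for example, the very first frame).
            out.append((vec[d] - mean) / std if std > 1e-12 else 0.0)
        return out

normalizer = SlidingNormalizer(buffer_len=3)
first = normalizer.push([1.0])
second = normalizer.push([3.0])
```

Because the statistics follow the recent input, a slowly changing noise offset is removed from the features whether or not it was present during training.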
A process for analyzing a two-dimensional image is disclosed, wherein the structural identity of stored reference patterns with image contents or portions is determined irrespective of the position of said image content or portion in the image to be analyzed. The image is subjected to a two-dimensional Fourier transformation, and the separated amplitude distribution or power distribution is compared with the amplitude or power distributions of the reference patterns in the Fourier domain, while determining the respective probability of identity, the twist angle and the enlargement factor between the reference pattern and the image content or portion. Storage and processing of the image and the reference patterns, or of their Fourier transforms, are effected in digital form. In order to locate an image content or portion in the original image which is identical with a reference pattern, the respective reference pattern or its Fourier transform is matched to said image content or portion in size and orientation by inverse rotary extension, using the ascertained twist angle and enlargement factor. Finally, the position or positions at which the converted reference pattern has maximum identity with a section of the image are established.
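The position independence in the preceding abstract rests on a standard property: shifting an image changes only the phase of its Fourier transform, not the amplitude. The sketch below checks this for a circular shift of a tiny image using a direct two-dimensional DFT. The function names are hypothetical, and the determination of twist angle and enlargement factor is not shown.

```python
import cmath

def dft2_magnitude(img):
    """Amplitude of the 2-D discrete Fourier transform, computed directly."""
    rows, cols = len(img), len(img[0])
    mag = []
    for u in range(rows):
        row = []
        for v in range(cols):
            acc = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2.0 * cmath.pi * (u * y / rows + v * x / cols)
                    acc += img[y][x] * cmath.exp(1j * angle)
            row.append(abs(acc))
        mag.append(row)
    return mag

def circular_shift(img, dy, dx):
    """Shift the image by (dy, dx) with wrap-around at the borders."""
    rows, cols = len(img), len(img[0])
    return [[img[(y - dy) % rows][(x - dx) % cols] for x in range(cols)]
            for y in range(rows)]

image = [[0, 0, 0, 0],
         [0, 3, 1, 0],
         [0, 2, 0, 0],
         [0, 0, 0, 0]]
shifted = circular_shift(image, 1, 2)
mag_a = dft2_magnitude(image)
mag_b = dft2_magnitude(shifted)
max_diff = max(abs(a - b) for ra, rb in zip(mag_a, mag_b) for a, b in zip(ra, rb))
```

With wrap-around the two amplitude distributions agree to rounding error. For a real image, where content enters or leaves the frame when shifted, the agreement is only approximate.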
A method and apparatus for speech processing in a distributed speech recognition system having a front-end and a back-end. The speech processing steps in the front-end are as follows: extracting speech features from a speech signal and normalizing the speech features in order to alter the power of the noise component in the modulation spectrum in relation to the power of the signal component, especially with frequencies above 10 Hz. A low-pass filter is then used to filter the normalized modulation spectrum in order to improve the signal-to-noise ratio (SNR) in the speech signal. The combination of feature vector normalization and low-pass filtering is effective in noise removal, especially in a low SNR environment.
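The two front-end steps in the preceding abstract, normalization followed by low-pass filtering of each feature trajectory over time, can be sketched as below. Mean and variance normalization and a moving-average filter are assumed stand-ins; the abstract does not give the normalization formula, the filter design, or its cutoff, and the function names are hypothetical.

```python
import math

def normalize(track):
    """Zero-mean, unit-variance normalization of one feature trajectory (assumed form)."""
    n = len(track)
    mean = sum(track) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in track) / n) or 1.0
    return [(x - mean) / std for x in track]

def moving_average(track, width):
    """Simple low-pass filter: average over a centred window, shortened at the edges."""
    half = width // 2
    out = []
    for t in range(len(track)):
        lo = max(0, t - half)
        hi = min(len(track), t + half + 1)
        out.append(sum(track[lo:hi]) / (hi - lo))
    return out

# A trajectory that flips sign every frame, i.e. the fastest possible modulation.
track = [1.0, -1.0] * 4
normed = normalize(track)
smoothed = moving_average(normed, width=3)
```

The fast frame-to-frame alternation is strongly attenuated by the filter, which is the effect the abstract attributes to low-pass filtering the normalized modulation spectrum.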
A speech recognition method and apparatus for detecting and recognizing one or more keywords in a continuous audio signal are disclosed. Each keyword is represented by a keyword template which corresponds to a sequence of plural target patterns, and each target pattern comprises statistics representing each of a plurality of spectra selected from plural short-term spectra generated according to a predetermined system for processing the incoming audio. The target patterns also have associated minimum and maximum dwell times. The dwell time is the time interval during which a given target pattern can be said to match incoming frame patterns. The spectra are processed to enhance the separation between the spectral pattern classes during later analysis. The processed audio spectra are grouped into multi-frame spectral patterns, and each multi-frame spectral pattern is compared by means of likelihood statistics with the target patterns of the keyword templates. Each formed multi-frame pattern is then forced to contribute to the total word score for each keyword as represented by the keyword template. Thus the keyword recognition method requires all input patterns to contribute to the word score of a keyword candidate and uses the minimum and maximum dwell times to test whether a target pattern can still match an input pattern; the frame rate of the audio spectra must be less than one-half the minimum dwell time of a target pattern. A concatenation technique employing a loosely set detection threshold makes it very unlikely that a correct pattern will be rejected. A method for forming the target patterns is also described.
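The dwell-time constraint in the preceding abstract can be illustrated with a small dynamic-programming sketch. It assumes that dwell times are counted in frames and that a per-frame cost (for example, a negative log-likelihood) is already available for each target pattern; the function name `best_word_score` and the cost values are hypothetical. Every input frame is assigned to exactly one target pattern, so every frame contributes to the word score.

```python
def best_word_score(costs, min_dwell, max_dwell):
    """Lowest total cost of matching all frames to the target sequence in order.

    costs[t][j] is the cost of input frame t against target pattern j.
    Target j must cover between min_dwell[j] and max_dwell[j] consecutive frames.
    """
    n_frames, n_targets = len(costs), len(min_dwell)
    inf = float("inf")
    # best[j][t]: lowest cost of explaining the first t frames with the first j targets.
    best = [[inf] * (n_frames + 1) for _ in range(n_targets + 1)]
    best[0][0] = 0.0
    for j in range(1, n_targets + 1):
        for t in range(1, n_frames + 1):
            segment = 0.0
            for dwell in range(1, max_dwell[j - 1] + 1):
                if dwell > t:
                    break
                # Extend the segment for target j-1 backwards by one frame.
                segment += costs[t - dwell][j - 1]
                if dwell >= min_dwell[j - 1] and best[j - 1][t - dwell] < inf:
                    best[j][t] = min(best[j][t], best[j - 1][t - dwell] + segment)
    return best[n_targets][n_frames]

# Four frames, two target patterns: frames 0-1 fit target 0, frames 2-3 fit target 1.
costs = [[0, 1], [0, 1], [1, 0], [1, 0]]
free_score = best_word_score(costs, min_dwell=[1, 1], max_dwell=[3, 3])
forced_score = best_word_score(costs, min_dwell=[3, 1], max_dwell=[3, 3])
too_short = best_word_score(costs, min_dwell=[1, 1], max_dwell=[1, 1])
```

Raising the minimum dwell of the first target forces it to absorb a poorly matching frame, and a maximum dwell that is too short leaves no valid alignment at all, which yields an infinite cost.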