A time-domain speech recognition system is disclosed wherein a speech signal is infinitely clipped in order to derive its zero crossover pattern. A pitch pulse detector generates standardized marker pulses in synchronism with the glottal pressure pulses occurring during voiced sounds. Using these marker pulses as trigger signals, a sampling gate samples the infinitely clipped speech signal in synchronism with the glottal pulses. In the absence of a voiced signal, sampling is performed at a pseudo-random rate. The zero crossing samples obtained are normalized with respect to voice pitch and are classified as belonging to a particular set of speech sounds, called phonemes, in accordance with a number of parameters, including the number and length of the zero crossover intervals, the length of the pitch pulse interval, and the relative number of changes in the duration of the crossover intervals within a pitch pulse interval.
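The clipping and pitch normalization steps described above might be sketched as follows. This is a minimal illustration, not the disclosed circuit: the function names are hypothetical, the pitch marker positions are assumed to be supplied by some external pitch detector, and normalization is assumed to mean dividing each interval length by the pitch period length.

```python
def infinite_clip(samples):
    """Keep only the sign of each sample: +1 if non-negative, -1 otherwise."""
    return [1 if s >= 0 else -1 for s in samples]


def zero_crossing_intervals(clipped):
    """Lengths (in samples) of the runs that end at a sign change.

    The first run is measured from the start of the input, and the
    trailing run (which has no closing crossing) is dropped.
    """
    intervals = []
    run = 1
    for prev, cur in zip(clipped, clipped[1:]):
        if cur == prev:
            run += 1
        else:
            intervals.append(run)
            run = 1
    return intervals


def normalized_intervals(samples, markers):
    """Interval lengths per pitch period, divided by the period length.

    `markers` is an assumed list of increasing sample indices at which
    pitch marker pulses occur.
    """
    clipped = infinite_clip(samples)
    result = []
    for start, end in zip(markers, markers[1:]):
        period = end - start
        runs = zero_crossing_intervals(clipped[start:end])
        result.append([r / period for r in runs])
    return result
```

Dividing by the period length makes the interval pattern comparable across speakers with different pitch, which is one plausible reading of "normalized with respect to voice pitch".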
A method and apparatus are disclosed for recognizing spoken commands uttered by a user and for generating responsive control signals once the command is recognized. In accordance with this disclosure, the audio signal is converted into a series of count bytes representing the time between the audio signal zero crossings. All the count bytes of the full command are then segmented into equal temporal groups, each spanning one segment period, and sorted within each segment into a set of frequency class intervals, which are based on a computation of substantially equal byte activity in all the words comprising the command lexicon. In this manner, lower and higher frequency groups are selected for equal significance. The uttered words are then compared against stored words similarly transformed according to segment and frequency interval. If the comparison conditions are satisfied, the command is executed; if not, an indication is provided to the user to repeat the command.
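The segmenting and sorting steps can be illustrated with a short sketch. Everything named here is an assumption made for illustration: the class boundaries are taken as quantiles of all counts in the lexicon (one way to obtain "substantially equal byte activity"), the comparison is a sum of absolute differences, and `max_distance` is a hypothetical acceptance threshold.

```python
def equal_activity_edges(lexicon_counts, n_classes):
    """Class boundaries chosen so each class holds a similar number of counts."""
    ordered = sorted(lexicon_counts)
    return [ordered[(len(ordered) * k) // n_classes] for k in range(1, n_classes)]


def classify(count, edges):
    """Index of the class interval that contains `count`."""
    return sum(1 for e in edges if count >= e)


def segment_histograms(counts, n_segments, edges):
    """Histogram the counts of one command per equal-duration segment.

    Each count is a time between zero crossings, so the running sum of
    counts is elapsed time; a count is assigned to the segment in which
    it starts.
    """
    total = sum(counts)
    hist = [[0] * (len(edges) + 1) for _ in range(n_segments)]
    elapsed = 0
    for c in counts:
        seg = min(elapsed * n_segments // total, n_segments - 1)
        hist[seg][classify(c, edges)] += 1
        elapsed += c
    return hist


def histogram_distance(a, b):
    """Sum of absolute differences over all segments and classes."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def recognize(hist, templates, max_distance):
    """Return the closest stored word, or None to request a repeat."""
    best = min(templates, key=lambda w: histogram_distance(hist, templates[w]))
    if histogram_distance(hist, templates[best]) <= max_distance:
        return best
    return None
```

Returning `None` stands in for the indication that asks the user to repeat the command.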
Speech likeness, or a degree of speech, is determined with a simple configuration or a small amount of processing, and speech parts are separated from an input sound signal. The input sound signal is subjected to a waveform slicing process in frame units. For each frame, two rates are computed: the zero-cross rate and the increase and decrease rate of the half wavelength. The latter is computed by determining the rate of the portion where the upward half-wavelength or the downward half-wavelength of the input waveform changes to increase and decrease alternately, or to decrease and increase alternately. The degree of speech is determined using both rates. Speech processing for separating, or accentuating or attenuating, speech and background noise in accordance with the degree of speech is then performed on the sound signal for each frame.
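One possible reading of the two per-frame rates is sketched below. The function names are hypothetical, and the interpretation of the increase and decrease rate (the fraction of consecutive half-wavelength changes that flip direction) is an assumption. How the two rates are combined into a degree of speech is not stated in the abstract and is left out.

```python
def zero_cross_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)


def half_wave_lengths(frame):
    """Return (upward, downward) half-wavelengths in samples.

    The first run is measured from the start of the frame, and the
    trailing run without a closing crossing is dropped.
    """
    up, down = [], []
    run, sign = 0, None
    for s in frame:
        cur = s >= 0
        if sign is None or cur == sign:
            run += 1
        else:
            (up if sign else down).append(run)
            run = 1
        sign = cur
    return up, down


def alternation_rate(lengths):
    """Fraction of consecutive length changes that reverse direction.

    A value near 1 means the half-wavelength alternately grows and
    shrinks; a value near 0 means it drifts steadily one way.
    """
    diffs = [b - a for a, b in zip(lengths, lengths[1:])]
    pairs = list(zip(diffs, diffs[1:]))
    if not pairs:
        return 0.0
    alternating = sum(1 for d1, d2 in pairs if d1 * d2 < 0)
    return alternating / len(pairs)
```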
A method and apparatus are disclosed for recognizing spoken commands uttered by a user and for generating responsive control signals once the command is recognized. In accordance with this disclosure, the audio signal is converted into a series of count bytes representing the time between the audio signal zero crossings. All the count bytes of the full command are then segmented into equal temporal groups and sorted within each segment into a set of frequency class intervals, which are based on a computation of substantially equal byte activity in all the words comprising the command lexicon. In this manner, lower and higher frequency groups are selected for equal significance. The uttered words are then compared against stored words similarly transformed according to segment and frequency interval. If the comparison conditions are satisfied, the command is executed; if not, an indication is provided to the user to repeat the command.
There is provided an automatic voice recognition system which utilizes time encoded speech. Through the determination of zero crossing information and waveform parameters of an input voice signal, a stream of time encoded speech symbols is obtained. The stream of time encoded speech symbols is then converted into a matrix format for comparison with reference matrices formatted from time encoded symbols of reference words thereby to provide an output signal indicative of the content of the input voice signal.
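The conversion of a symbol stream into a matrix and its comparison with reference matrices could look like the sketch below. The abstract does not say how the matrix is formed; counting transitions between successive symbols is an assumption here, as are the function names and the absolute-difference comparison. Symbols are assumed to be integers in the range 0 to `alphabet_size - 1`.

```python
def transition_matrix(symbols, alphabet_size):
    """Count how often symbol a is immediately followed by symbol b."""
    matrix = [[0] * alphabet_size for _ in range(alphabet_size)]
    for a, b in zip(symbols, symbols[1:]):
        matrix[a][b] += 1
    return matrix


def matrix_distance(m1, m2):
    """Sum of absolute element-wise differences between two matrices."""
    return sum(abs(x - y) for r1, r2 in zip(m1, m2) for x, y in zip(r1, r2))


def best_match(matrix, references):
    """Label of the reference matrix closest to `matrix`.

    `references` is an assumed dict mapping a word label to its matrix.
    """
    return min(references, key=lambda w: matrix_distance(matrix, references[w]))
```

A fixed-size matrix makes utterances of different lengths directly comparable, which is a plausible reason for the matrix format.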
An unknown segment such as a spoken digit in a continuous speech signal is recognized as a previously identified speech segment by deriving a set of test linear prediction characteristic signals from the voiced interval of the unknown segment. The test signals are time aligned to the average voiced interval of repetitions of each of a plurality of identified speech segments for which average reference voiced interval linear prediction characteristic signals were previously generated. The correspondence between the aligned test signals and the reference signals is determined. The unknown speech segment is identified as the reference segment having the closest correspondence with the unknown segment. Features of the invention include: voiced-region parameter signals, and classification and consistency detection; and determining means and variances of voiced-region parameters from a plurality of speakers utilized for the correspondence arrangements.
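The alignment and closest-correspondence steps can be sketched as follows, with the linear prediction analysis itself omitted. The sketch assumes that each frame is already a list of feature values, that time alignment is a simple linear resampling to the reference length, and that correspondence is a variance-weighted squared distance. None of these choices are confirmed by the abstract, and all names are hypothetical.

```python
def align(frames, target_len):
    """Linearly resample a frame sequence to `target_len` frames."""
    n = len(frames)
    return [frames[min(i * n // target_len, n - 1)] for i in range(target_len)]


def weighted_distance(test, means, variances):
    """Average per-frame squared deviation, scaled by reference variance."""
    total = 0.0
    for t, m, v in zip(test, means, variances):
        for x, mu, var in zip(t, m, v):
            total += (x - mu) ** 2 / var
    return total / len(means)


def identify(frames, references):
    """Label of the reference whose statistics best match the test frames.

    `references` maps a label to (means, variances), each a list of
    per-frame feature lists with nonzero variances, assumed to be
    gathered from several speakers.
    """
    def score(label):
        means, variances = references[label]
        return weighted_distance(align(frames, len(means)), means, variances)

    return min(references, key=score)
```

Scaling by the variance down-weights features that differ widely between speakers, so consistent features dominate the decision.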