BACKGROUND OF THE INVENTION
Most existing speech recognition systems pre-process input speech prior to
actual processing needed for speech recognition without using knowledge of
the speaker. The prior art systems create a spectrum by a number of
techniques, such as linear predictive coding, bandpass filtering,
transforms (particularly Fast Fourier Transforms), and time domain
analysis, such as zero crossing counts. These technologies have varying
disadvantages, but are done in a way that does not include any information
about the speech characteristics of the speaker and therefore use no
speaker-specific parameters which are estimated from an independent body
of speech.
Bandpass filters use fixed frequency bands. For example, Lokerson (U.S.
Pat. No. 4,039,754) uses three bandpass filters of ranges 336-742 Hz,
574-2226 Hz, and 1750-3710 Hz to correspond to typical ranges of the
first, second, and third formants of speech. Thus for example, the second
filter in a set of bandpass filters will have a different meaning for a
speaker who has a high first formant than for a speaker who has a lower
first formant. Since the formants are energy peaks of the speech and
depend upon the physical makeup of the speaker, the locations of these
energy peaks will vary from speaker to speaker and will therefore
appear in different bands from one speaker to another.
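The band-assignment problem described above can be illustrated with a short sketch. The band edges are the Lokerson values quoted in the text; the two first-formant frequencies are hypothetical values chosen only for illustration, and the function name is likewise an assumption.

```python
# Fixed band edges in Hz, taken from the Lokerson example quoted in the text.
LOKERSON_BANDS = [(336, 742), (574, 2226), (1750, 3710)]


def bands_containing(frequency_hz, bands=LOKERSON_BANDS):
    """Return the 1-based indices of every fixed band that passes the frequency."""
    return [i + 1 for i, (lo, hi) in enumerate(bands) if lo <= frequency_hz <= hi]


# Hypothetical first-formant frequencies for two speakers (illustrative only).
low_f1_speaker_hz = 400.0
high_f1_speaker_hz = 650.0

print(bands_containing(low_f1_speaker_hz))   # [1]
print(bands_containing(high_f1_speaker_hz))  # [1, 2]
```

For the second hypothetical speaker the first formant also excites the second filter, so the output of that filter no longer reflects the second formant alone.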
Further, a set of fixed bandpass filters must have a fixed range of
coverage. Therefore, the set must have a minimum band which covers the
lowest frequency range that it expects to be able to treat and a maximum
band which covers the highest frequency range which it expects to treat.
Because this range of values is determined without reference to a specific
speaker, some bands will be of minimal, if any, value for any single
speaker. This adds noise to the analysis, since these bands are not
meaningful for the particular speaker, and wastes system resources.
Linear Predictive Coding (LPC) is a method of approximating the spectrum of
a signal by fitting that spectrum with a representation characterized by a
fixed number of parameters. For example, a tenth-order LPC implementation
might be used in a typical speech processing application, allowing ten
parameters to fit to the spectrum over every time interval. A difficulty
in utilizing LPC when the recognition technique is based upon typical
pattern recognition technology is that a given LPC coefficient does not
have the same meaning from speaker to speaker or even from speech frame to
speech frame of the same speaker. For example, the second LPC coefficient
may at one time fit one portion of the spectrum and at another time
another portion of the spectrum. Thus, it is very difficult to interpret
an LPC coefficient as having a specific meaning even when utilized with a
single speaker. The variation in LPC coefficients from speaker to speaker
is even greater.
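As a minimal sketch of what fitting a fixed number of parameters means, the following code estimates LPC predictor coefficients for one frame by the autocorrelation method with the Levinson-Durbin recursion. This is a textbook formulation and not a method taken from the cited prior art; the function names and the synthetic test frame are assumptions.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Levinson-Durbin recursion.

    Returns (a, err) where x[n] is predicted as sum(a[k-1] * x[n-k])
    for k = 1..order, and err is the residual prediction error energy.
    """
    r = autocorrelation(signal, order)
    if r[0] == 0.0:
        return [0.0] * order, 0.0  # silent frame: nothing to fit
    a = [0.0] * (order + 1)        # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break                  # frame is already perfectly predicted
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err              # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= (1.0 - k * k)
    return a[1:], err


# Synthetic frame (an assumption): a decaying exponential, which a
# first-order predictor with coefficient 0.9 describes almost exactly.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = lpc_coefficients(frame, 2)
print(coeffs)  # first coefficient close to 0.9, second close to 0
```

Each coefficient is determined jointly with the others by the whole spectrum of the frame, which is one way to see why an individual coefficient carries no fixed meaning from frame to frame.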
Transforms such as Fast Fourier Transforms or Hadamard transforms can be
viewed as a series of equally spaced and narrow bandpass filters. The
disadvantages of using such transforms are similar to those of bandpass
filters, but to some degree magnified because there are more such filters.
Pitch tracking is used in some speech processing systems. Pitch tracking
detects the pitch period information that can be used in speech
recognition as has been proposed by Lea, Trends in Speech Recognition,
Prentice Hall, 1980, pp. 166-205. Pitch information can also be used to
smooth some of the data by removing the modulation of those parameters by
the pitch frequency. Pitch tracking can further be used to
"pitch-synchronize" the data so that the data that is utilized in a speech
recognition system is a set of parameters for each pitch period rather
than for an arbitrary time period.
Pitch tracking for creating pitch-synchronous data is motivated in part by
the following logic. The pitch period of a speaker is determined by the
characteristics of the speaker's vocal cords. For a given speaker, the
pitch period can vary by a factor of four from the lowest to highest pitch
period depending upon the sound being spoken, the stress placed upon the
word, and the position in the sentence of the word. From speaker to
speaker, the average pitch also varies greatly. For example, it is well
known that females on the average have a shorter pitch period than males.
This variability in pitch makes it impossible to pick a single time period
for analyzing the spectrum of the data which always includes exactly one
pitch period. Spectral analysis in equal time intervals creates distortion
in the spectrum and in some cases averages out information that is
important. Further, the amount of data created by a fixed sampling period
will be unrelated to the information in the signal. For a low pitch, the
spectrum can be calculated less frequently and yet contain all the
relevant information in the signal. For a high pitch, the information must
typically be sampled more frequently to contain all the relevant
information in the signal. This accounts in part for some recognition
systems having more difficulty with female voices than with male voices.
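The contrast between fixed-interval and pitch-synchronous analysis can be sketched as follows. The sketch assumes that pitch-period boundaries (epochs) are already known as sample indices, and it uses a per-period mean as a stand-in for whatever parameter is actually computed; the function names and the toy data are assumptions.

```python
def pitch_synchronous_means(track, epochs):
    """One value per pitch period: the mean of `track` between consecutive epochs."""
    values = []
    for start, end in zip(epochs, epochs[1:]):
        if end <= start:
            raise ValueError("epochs must be strictly increasing")
        period = track[start:end]
        values.append(sum(period) / len(period))
    return values


def fixed_frame_means(track, frame_length):
    """One value per fixed-length frame, ignoring pitch boundaries."""
    n_frames = len(track) // frame_length
    return [sum(track[i * frame_length:(i + 1) * frame_length]) / frame_length
            for i in range(n_frames)]


# Toy parameter track (an assumption): two periods of 4 samples and one of 2,
# each carrying a ripple imposed by the pitch.
track = [1.0, 3.0, 1.0, 3.0, 2.0, 6.0, 2.0, 6.0, 4.0, 4.0]
epochs = [0, 4, 8, 10]

print(pitch_synchronous_means(track, epochs))  # [2.0, 4.0, 4.0]
print(fixed_frame_means(track, 3))             # frames straddle period boundaries
```

With pitch-synchronous framing the ripple within each period averages out and the number of output values follows the number of pitch periods; the fixed three-sample frames mix samples from adjacent periods.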
Approaches to pitch tracking have varied greatly, but they all suffer from
one major defect that seriously reduces their effectiveness. Because they
assume no knowledge of the speaker, they must be adaptive or highly
general in order to cover the wide range of pitch that can and will be
encountered. In attempting to maintain such generality, they are typically
either less accurate or more computationally demanding, hence slower, than
is acceptable.
SUMMARY OF THE INVENTION
The invention disclosed herein is a speech processing method and apparatus
which improves the suitability of the parameters of speech derived for
speech recognition. While parts of the approach utilized have
applicability to other aspects of speech processing, such as speech
compression, the purpose of the present invention is to provide a more
accurate and cost-effective speech recognition system.
The present invention processes an independent body of speech during an
enrollment process and creates a set of parameters representing the
speaker's pitch, the frequency spectrum of the speech as a function of
time, and certain measurements of the speech signal in the time-domain. A
particular objective of the invention is to make these parameters have the
same meaning from speaker to speaker. Thus after the pre-processing
performed by this invention, the parameters would look much the same for
the same word independent of the speaker. In this manner, variations in
the speech signal caused by the physical makeup of a speaker's throat,
mouth, lips, teeth, and nasal cavity would be, at least in part, reduced
by the preprocessing.
The advantage of the speaker normalization created by the pre-processing
accrues in several areas. One of these areas is that the parameters have a
more consistent interpretation. For example, one of the parameters may be
the energy in the first formant (energy peak) of the speech irrespective
of the speaker. This consistency of parameters is an important requirement
to enable the use of many pattern recognition techniques to their full
effectiveness (see Meisel, Computer-Oriented Approaches to Pattern
Recognition, Academic Press, 1972, p. 26).
Secondly, consistent interpretation of the features allows a more natural
use of expert knowledge in discriminating speech sounds. Since a meaning
can be attributed to the parameters, an expert can translate his
conceptual criteria ("a loud first formant") more directly into a criterion
on the parameters ("the first parameter must have a large value").
A third advantage of speaker-normalized parameters is that variations from
speaker to speaker are reduced. Thus many people who have the same speech
characteristics (e.g., the same physical makeup) will appear nearly
identical in the characteristics of their speech after pre-processing into
appropriately normalized parameters. In this manner, the number of
differences from speaker to speaker that must be handled by a recognition
process using normalized parameters is smaller. One can then create a
small number of recognizers as a set of tables that are stored in the
recognition system, and not have to represent certain types of speaker
variation in those tables. This makes it practical to analyze typical
speakers "off-line" on a larger computer system and have the results be
useful for a large number of speakers whose data was not analyzed,
allowing speech recognition with short enrollment periods.
The parameter table which best matches the speaker can be selected during a
one-time enrollment process based upon a small amount of data. The number
of tables required so that at least one table works well for a given user
will depend on the degree of speaker normalization achieved. Since it is
the intent of the parameter normalization to remove the variability from
speaker to speaker caused by physical differences and speaking habits, the
commercial practicability of some types of speech recognition applications
is enhanced. Yet a fourth advantage of speaker-normalization is that, by
removing speaker-dependent information, the data rate required for the
speech representation is reduced, thereby decreasing the computational
load on the speech recognizer.
An important class of applications comprises those which require a large
vocabulary, where enrollment of all words in the vocabulary is not
practical or advantageous. These applications include general
speech-to-text transcription, data base access, and a speech interface to
a natural-language or artificial-intelligence program which is
text-intensive or highly interactive. The subject invention assumes that a
small amount of enrollment speech can be gathered; this will then be
extrapolated to the larger vocabulary.
The present invention produces these speech parameters in a form where the
amount of data is reduced to that meaningfully required for analysis and
where certain artifacts introduced by the pitch modulation of the signal
are reduced. In particular, the invention utilizes a pitch smoothing
approach which removes the modulation of the spectral and temporal
parameters created by the pitch period of the speaker. Further, the
invention creates pitch synchronous data, in which there is a set of
parameters created for each pitch period rather than for an arbitrary
period of time; an arbitrary sampling period could be either too short or
too long for a specific speaker.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing a complete system in which the invented
preprocessor is utilized.
FIG. 2 is a block overview diagram showing the present invention.
FIG. 3 is a detailed block diagram of the data acquisition processing
performed by the present invention.
FIG. 4 is a block diagram of the spectral analyzer of the invention.
FIG. 5 is a block diagram of the temporal analyzer of the invention.
FIG. 6 is a block diagram of the pitch analyzer of the invention.
FIG. 7 is a block diagram of the pitch synchronizer of the present
invention.
FIG. 8 is a block diagram of the gain enrollment processing performed by
the present invention.
FIG. 9 is a block diagram of the spectral and pitch enrollment performed by
the present invention.
FIG. 10 is a block diagram of the peak normalization enrollment performed
by the present invention.
FIG. 11 is a block diagram of the pre-emphasis enrollment performed by the
present invention.
DETAILED DESCRIPTION OF THE INVENTION
Introduction
A pre-processing system is disclosed which takes a speech signal and
produces spectral parameters, temporal parameters, and pitch estimates
which are input to a speech recognition module which utilizes the
parameters to perform recognition of what was said in the speech. The
recognition can be done by template matching or by a number of alternative
approaches. By performing the pre-processing as disclosed herein, the
recognition problem is made easier for most technologies which might be
used in the speech recognition module.
The present invention discloses a pre-processor which provides parameters
suitable for accurate speech recognition. In particular, these parameters
are produced such that they have the same meaning from speaker to speaker.
The subject invention accomplishes this by modifying the pre-processing
based upon the specific characteristics of the speaker determined in an
enrollment process. Other technologies use fixed pre-processing without
regard to speaker. The rest of the speech recognition algorithm must then
make up for the irregularities in the interpretation of the parameters.
Enrollment is a process in which a small amount of speech data from the
user is used in a one-time (or repeated) analysis to extract user
parameters. The analysis can be done off-line; that is, it need not be
done in real-time or in the recognizer itself. The data is gathered and
then analyzed to create an optimal set of parameters to use in
pre-processing for the particular speaker.
A general approach which can be used if the system which does the analysis
has a means of buffering the raw analog or digitized speech data is to
simply collect and store the raw data. The user parameters can then be
extracted by analysis of this data to find the optimal parameters in ways
that will be discussed in more detail herein.
In the case where the raw analog data cannot be buffered, it is
pre-processed by the system pre-processor. This case requires a different
approach. The pre-processor is initially set for a nominal set of values,
perhaps characterized by "typical" male or "typical" female. The speech
data collected using those nominal values is analyzed in order to extract
the correct user-specific parameters. The preferred embodiment uses the
second approach.
The speech signal can be normalized to make maximum use of the dynamic
range of the system by a variable gain. The gain control can be a simple
automatic gain control circuit such as is found in many electronic devices
such as portable tape recorders; such a circuit provides short-term
dynamic adjustment of the gain. Such gain control algorithms adjust the
speech amplitude based upon very short term (less than one-second) time
constants; they can distort the speech waveform because of the short-term
transients they create. The gain can advantageously be controlled by a
more sophisticated algorithm in which a longer term analysis of the speech
signal (specifically over an entire sentence) is performed in order to set
the gain. This approach is the preferred embodiment. If this latter
approach is used, a problem arises with the gain setting for the first few
sentences spoken.
An advantage of a speaker enrollment procedure for preprocessing is that
initial gain can be chosen during enrollment. Knowledge of the speaker's
typical speaking volume minimizes the likelihood of gain-induced errors in
the first few sentences spoken.
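One way such an initial gain might be derived is sketched below. This is only an illustration under stated assumptions: the peak level of an enrollment sentence is mapped to a gain that brings the peak to a target fraction of full scale, and the result is rounded down to the nearest available step. The 0-40 dB range and the 1.5 dB and 3 dB step sizes are those given for the gain controls later in this description; the target level, the rounding rule, and the function names are hypothetical.

```python
import math

GAIN_MIN_DB = 0.0   # gain range stated for the gain controls in this description
GAIN_MAX_DB = 40.0


def quantize_down(gain_db, step_db):
    """Round down to the nearest available gain step (assumed rule: avoid clipping)."""
    return math.floor(gain_db / step_db) * step_db


def sentence_gain_db(sentence_peak, target_peak=0.5, step_db=1.5):
    """Gain in dB that brings the measured sentence peak toward target_peak.

    sentence_peak and target_peak are fractions of full scale; the
    target of 0.5 is a hypothetical value chosen for illustration.
    """
    if sentence_peak <= 0.0:
        wanted = GAIN_MAX_DB
    else:
        wanted = 20.0 * math.log10(target_peak / sentence_peak)
    clamped = min(GAIN_MAX_DB, max(GAIN_MIN_DB, wanted))
    return quantize_down(clamped, step_db)


print(sentence_gain_db(0.1))               # 13.5
print(sentence_gain_db(0.1, step_db=3.0))  # 12.0
print(sentence_gain_db(1.0))               # 0.0
```

Because the gain is computed once per sentence rather than continuously, it introduces no short-term transients into the waveform.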
In the present invention, pitch detection is used in order to remove the
pitch modulation. Because the component in the speech signal which
represents the pitch is dominant, even with relatively sharp bandpass
filters or other frequency analyzers, one will see the pitch component in
these signals, sometimes dramatically. Furthermore, due to the radically
different resonance characteristics of the vocal tract between the
open-glottis and closed-glottis phases of the pitch period, in ordinary
frequency analyzers the spectrum shifts up and down each pitch period. By
smoothing those signals in a manner so as to take out the amplitude
modulation and the frequency shifting imposed by pitch, much more reliable
spectral parameters can be estimated.
The spectral parameters can also be sampled using the pitch signal, so as
to give information at the most relevant rate for the specific individual.
This is relatively uncommon in speech recognition, but pitch-synchronous
spectrum analysis is a well known approach to speech analysis. See, for
example, Rabiner and Schafer, Digital Processing of Speech Signals,
Prentice-Hall, 1978, pp. 319-323.
Time domain analysis and spectral analysis are usually not utilized within
the same system, but in the present invention, such analyses are combined
to produce a positive benefit. Typical time domain analyses are zero
crossings and amplitude envelope detection, but much more subtle
time-domain analyses may be performed. See, for example, Baker, "A New
Time-Domain Analysis of Human Speech . . . ", Ph.D Dissertation,
Carnegie-Mellon, 1975. Again, pitch information may be used to both smooth
and sample the time features.
The sampling of the temporal and spectral parameters allows a reduced data
representation of the speech which is geared to the specific speaker. By
the specific methods disclosed herein, advantages in spectral analysis,
pitch extraction, time domain analysis, gain control, and signal smoothing
and sampling are possible.
To perform spectral analysis, as will be discussed in detail below, this
invention uses a bank of bandpass filters, realized by digital filters in
the preferred embodiment, designed to approximate the spectral
characteristics of an auditory critical band, and spaced roughly equally
on a critical band scale. (cf. Zwicker & Terhardt, "Analytical
Expressions For Critical-Band Rate And Critical Bandwidth As A Function Of
Frequency", Journal of the Acoustical Society of America 68(5) pp.
1523-1525). The bandpass filters are implemented as resonators, a
recursive form modelling the short-term fading memory of the human
auditory system.
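A minimal sketch of such a filterbank layout follows. It uses the analytical expressions for critical-band rate and critical bandwidth from the Zwicker and Terhardt paper cited above, places center frequencies at equal steps on that scale, and derives coefficients for a standard two-pole digital resonator. The frequency limits, the number of bands, the sample rate, and all function names are assumptions made for illustration, not values from this description.

```python
import math


def hz_to_bark(f_hz):
    """Critical-band rate in Bark (Zwicker & Terhardt analytical expression)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def critical_bandwidth_hz(f_hz):
    """Critical bandwidth in Hz (Zwicker & Terhardt analytical expression)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


def bark_to_hz(z, lo=0.0, hi=20000.0):
    """Invert hz_to_bark by bisection; hz_to_bark is monotonically increasing."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if hz_to_bark(mid) < z:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def filterbank_layout(f_low_hz, f_high_hz, n_bands):
    """(center frequency, bandwidth) pairs spaced equally on the Bark scale."""
    if n_bands < 2:
        raise ValueError("need at least two bands")
    z_lo, z_hi = hz_to_bark(f_low_hz), hz_to_bark(f_high_hz)
    step = (z_hi - z_lo) / (n_bands - 1)
    centers = [bark_to_hz(z_lo + i * step) for i in range(n_bands)]
    return [(fc, critical_bandwidth_hz(fc)) for fc in centers]


def resonator_coefficients(fc_hz, bw_hz, fs_hz):
    """Feedback coefficients (b1, b2) of y[n] = x[n] + b1*y[n-1] + b2*y[n-2]."""
    r = math.exp(-math.pi * bw_hz / fs_hz)   # pole radius sets the bandwidth
    theta = 2.0 * math.pi * fc_hz / fs_hz    # pole angle sets the center frequency
    return 2.0 * r * math.cos(theta), -r * r


# Hypothetical untuned layout: 16 bands from 200 Hz to 4 kHz at a 10 kHz sample rate.
for fc, bw in filterbank_layout(200.0, 4000.0, 16):
    b1, b2 = resonator_coefficients(fc, bw, 10000.0)
    print(round(fc), round(bw), round(b1, 4), round(b2, 4))
```

Tuning to a particular speaker would then amount to changing the lower and upper limits passed to the layout function, with the spacing rule left unchanged.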
A standard practice in speech spectral analysis is to boost the
high-frequency components of the signal with a first-order filter in order
to compensate for the natural long-term spectral slope of speech. While
this invention also pre-emphasizes the speech signal, its design is novel
in using a higher-order filter to whiten the signal, thereby compensating
for more detailed deviations from a flat spectrum in the long term speech
spectrum. Since the average spectrum varies considerably from speaker to
speaker, this pre-emphasis filter is adjusted for a specific speaker
during enrollment.
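The relationship between conventional first-order pre-emphasis and a higher-order whitening filter can be sketched as follows. Both are written here as an inverse (prediction-error) filter; the first-order case is simply a predictor with one coefficient. How the speaker-specific coefficients are obtained during enrollment is not covered by this sketch, so the predictor is treated as given. The default coefficient of 0.95 is a common textbook value and, like the function names, is an assumption.

```python
def inverse_filter(signal, predictor):
    """Prediction-error filter: e[n] = x[n] - sum(predictor[k-1] * x[n-k])."""
    out = []
    for n, x in enumerate(signal):
        prediction = sum(a * signal[n - k]
                         for k, a in enumerate(predictor, start=1)
                         if n - k >= 0)
        out.append(x - prediction)
    return out


def pre_emphasize(signal, coefficient=0.95):
    """Conventional first-order pre-emphasis as a one-coefficient special case."""
    return inverse_filter(signal, [coefficient])


# A signal with a steep spectral slope (a decaying exponential) is flattened
# to a single impulse when the predictor matches it.
sloped = [0.9 ** n for n in range(20)]
residual = inverse_filter(sloped, [0.9])
print(residual[0], max(abs(v) for v in residual[1:]))
```

A longer predictor list gives the filter more degrees of freedom, which is what allows it to compensate for deviations from a flat long-term spectrum that a single coefficient cannot capture.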
The instantaneous magnitude or energy of the output of a recursive filter
is traditionally computed by a nonlinear, rectifying filter followed by a
lowpass smoothing filter. The lowpass filter has the undesirable effect of
temporally smearing the filter output with a resultant significant loss in
temporal resolution. A novel feature of this invention is in the use of a
special rectification process, total energy computation, which yields a
smooth but unsmeared magnitude measure and eliminates the need for lowpass
filtering. The remaining temporal variations in total magnitude during a
pitch period are due entirely to the bandwidth of that frequency band in
the signal and to changes in vocal tract configuration.
A fundamental feature of this invention is the automatic tuning of the
filterbank to the formant ranges of a particular speaker during
enrollment, based on statistical analysis of the output of an untuned
filterbank with uniform critical-band spacing, in order to normalize the
spectrum, thus reducing the speaker-dependence of the subsequent
filterbank output. Finally, the output of the filterbank is parametrized
by pitch-synchronous sampling.
To perform pitch detection, as will be discussed in detail below, this
invention uses a filter, but one which is unusual in not being a
conventional lowpass or bandpass filter designed to approximate a
rectangular filter. Since the fundamental frequency range of a single
speaker in ordinary speech varies by two octaves, no lowpass or bandpass
filter, no matter how rectangular, can reject all the higher harmonics for
the highest fundamental in the speaker's range without also admitting at
least the second, third, and fourth harmonics for the lowest fundamental
in the speaker's range. However, if the fundamental component is only
being extracted for period estimation by interpeak interval measurement,
the complete suppression of all higher harmonics is unnecessary; the
higher components must merely be attenuated to the extent that they are
unable to contribute independent peaks to the extracted fundamental
signal. In this connection, this invention utilizes a specially designed
filter which, rather than attempting to reject all higher harmonics,
merely attenuates them to an amplitude relative to the fundamental at
which they are incapable of contributing peaks to the signal. This is very
advantageous, since a rudimentary peak-picking algorithm then suffices to
give an accurate measure of the fundamental period, requiring none of the
usual preprocessing by thresholding and correlation, nor any of the
postprocessing cleanup familiar to those skilled in the art.
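The principle can be illustrated with a small simulation. The sketch below does not reproduce the filter of the invention, whose design is not given in this passage; it substitutes a cascade of four one-pole lowpass sections as a stand-in for a gently sloping filter, and applies it to a synthetic signal with strong second and third harmonics. The cutoff, number of stages, sample rate, and harmonic amplitudes are all assumptions.

```python
import math


def one_pole_lowpass(signal, cutoff_hz, fs_hz):
    """Single recursive lowpass section, roughly -6 dB/octave above cutoff."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs_hz)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def sloping_filter(signal, cutoff_hz, fs_hz, stages=4):
    """Cascade of identical sections: attenuates harmonics without a sharp edge."""
    for _ in range(stages):
        signal = one_pole_lowpass(signal, cutoff_hz, fs_hz)
    return signal


def pick_peaks(signal):
    """Rudimentary peak picker: indices of local maxima."""
    return [n for n in range(1, len(signal) - 1)
            if signal[n - 1] < signal[n] >= signal[n + 1]]


fs = 8000.0   # sample rate in Hz (assumption)
f0 = 100.0    # true fundamental of the synthetic signal (assumption)
amplitudes = (1.0, 0.8, 0.6)  # fundamental, second and third harmonic
signal = [sum(a * math.sin(2.0 * math.pi * (k + 1) * f0 * n / fs)
              for k, a in enumerate(amplitudes))
          for n in range(8000)]

# Cutoff placed near the bottom of a hypothetical speaker's pitch range.
filtered = sloping_filter(signal, cutoff_hz=60.0, fs_hz=fs)

raw_peaks = pick_peaks(signal)
peaks = [p for p in pick_peaks(filtered) if p >= 800]  # skip start-up transient
intervals = [b - a for a, b in zip(peaks, peaks[1:])]
estimated_f0 = fs / (sum(intervals) / len(intervals))

print(len(raw_peaks), len(peaks))  # raw signal has several peaks per period
print(estimated_f0)                # expected to be close to f0
```

The harmonics are still present in the filtered signal, only reduced enough that each pitch period yields a single maximum, so the interpeak interval can be read off directly.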
When implemented digitally with a finite word length, this special sloping
filter has a frequency range limited by the number of bits in the data
word. When this range is too small for the filter to function
satisfactorily for the full range of fundamental frequency ranges found
across different speakers, it is desirable to have the slope begin at the
bottom of the speaker's range. It is likewise desirable to be able to
adjust a highpass filter to the bottom of the speaker's range in order to
reject sub-pitch phenomena. Hence a novel feature of the invention is the
automatic tuning of the filter to the particular speaker, based on
automatic statistical analysis during enrollment of the speaker's
fundamental frequency as measured with an untuned filter.
A digital implementation also depends critically upon the absolute
amplitude range of the signal. If the signal is too high, it will be
clipped, causing harmonic distortion which further increases the range in
relative amplitude between the fundamental and the higher components; if the
signal is too low, it can be lost altogether. This problem is solved in
this invention by the use of an automatic gain control system which uses
feedback from the digital stage to the analog stage to maintain an optimal
signal level before digitization.
Any measure of the local fundamental period, in addition to its intrinsic
value as a speech parameter, can also serve an important secondary
function as a timing signal for pitch-synchronous parametrization and
smoothing of other acoustic variables. For this purpose it is advantageous
to know the exact epoch of each pitch period, to prevent blurring the
acoustic characteristics of adjacent periods. This invention accomplishes
this by using the output of the peak-picker on the pitch-filtered signal
to control a pitch-epoch detector on the original (broadband) waveform.
Temporal thresholds are employed in the epoch-detector which are, like the
pitch filter, automatically tuned to the pitch range of the speaker by
statistical analysis of the speech signal during enrollment, thus again
minimizing the need for a post-processor to correct the output.
By creating data for recognition which is somewhat speaker independent,
better use of system resources is made. Because the algorithms can be
tuned to a specific speaker, they can be made more efficient and more
accurate at the same time because they do not have to operate in such a
way as to be invulnerable to all conditions they might encounter with
varying speakers. For example, the pitch algorithm is both more accurate
and more efficient because it utilizes knowledge of the pitch range of the
specific speaker. The bands created for spectral analysis are all relevant
because they are adjusted to the range of the speaker; therefore, bands do
not exist which are outside the relevant range of interest for a specific
speaker. As a result, in the present invention, system resources used in
creating the bandpass parameters are efficiently utilized.
Because of the accuracy of the pitch algorithm, the data rate can be
adapted to be pitch synchronous with some confidence. The data is thus
adjusted to the specific speaker and at an optimal data rate for that
speaker. Since doing this effectively without creating problems requires
an accurate pitch algorithm, this is most effectively achieved by the
invention disclosed herein whereby the pitch algorithm is adjusted to the
specific speaker. Similarly the spectrum and certain time domain features
can be smoothed using the pitch information as long as it is accurate. In
the sense that the present invention provides a more accurate pitch
estimate, this data reduction is made into a practical reality rather than
simply a theoretical one.
Preferred Embodiment
Referring to FIG. 1, the invented preprocessing system is shown within
shadow lines 11 as part of a speech recognition system including a source
of speech (acoustic transducers) 13, a data acquisition section 15, an
acoustic analysis section 17 (the data acquisition and acoustic analysis
section comprising the invented preprocessor), a phonetic processor
section 19, a lexicosyntactic processing section 21 and a text output
section 23.
The specific elements and the processing performed by the invented
preprocessing system may better be explained by reference to FIG. 2
wherein the speech from acoustic transducers 13 is input to data
acquisition section 15 which produces digitized speech signals including
an oral energy component 24, nasal energy component 25 and oral amplitude
component 26. The oral amplitude component of the digitized speech signal
is then input into spectral analyzer 27, temporal analyzer 28 and pitch
analyzer 29. The spectral analyzer outputs spectral parameters 35.
Temporal analyzer 28 outputs temporal parameters 36. Similarly, pitch
analyzer 29 outputs pitch parameter 37 and a pitch-epoch timing signal 39.
The nasal and oral energy components along with spectral parameters 35,
temporal parameters 36, pitch parameter 37, and pitch epoch timing signal
39 are input to a pitch synchronizer 41 as will be described hereinbelow.
Data Acquisition Processing
The details of the data acquisition section 15 will now be described with
reference to FIG. 3. Speech which is to be processed by the speech
recognition system is converted into an electrical signal by oral
microphone 61 and nasal microphone 63 as described more fully in
co-pending U.S. application Ser. No. 698,577 filed Feb. 6, 1985 now
abandoned, which is the parent of U.S. application Ser. No. 928,643 filed
Nov. 5, 1986, now U.S. Pat. No. 4,718,096 which issued Jan. 5, 1988. The
electrical signals produced by oral microphone 61 and nasal microphone 63
are input to gain controls 64, 65 and 67 respectively which provide a
digitally controlled gain of the input signal providing a resolution of
three dB steps for gain controls 65 and 67 and 1.5 dB steps for gain
control in a range of 0-40 dB according to the following algorithm:
(a) While taking data (in enrollment or recognition), track is kept of the
highest signal level over a pitch period. This level is compared to three
thresholds: low, middle, and maximum. Over an utterance, three values are
accumulated: the number of times the level was over the maximum threshold,
the number of times the level was over the middle threshold but under the