RELATED APPLICATIONS
This application is related to a co-pending Patent Cooperation Treaty application number PCT/US03/39593, entitled "System and Method for Speech Processing Using Improved Independent Component Analysis", filed Dec. 11, 2003, which claims priority
to U.S. patent application Nos. 60/432,691 and 60/502,253, all of which are incorporated herein by reference.
FIELD OF THE INVENTION
The present invention relates to a system and process for separating an information signal from a noisy acoustic environment. More particularly, one example of the present invention processes noisy signals from a set of microphones to generate a
speech signal.
BACKGROUND
An acoustic environment is often noisy, making it difficult to reliably detect and react to a desired informational signal. In one particular example, a speech signal is generated in a noisy environment, and speech processing methods are used to
separate the speech signal from the environmental noise. Such speech signal processing is important in many areas of everyday communication, since noise is almost always present in real-world conditions. Noise is defined as the combination of all
signals interfering with or degrading the speech signal of interest. The real world abounds with multiple noise sources, including single point noise sources, which often transgress into multiple sounds resulting in reverberation. Unless separated and
isolated from background noise, it is difficult to make reliable and efficient use of the desired speech signal. Background noise may include numerous noise signals generated by the general environment, signals generated by background conversations of
other people, as well as reflections and reverberation generated from each of the signals. In communication where users often talk in noisy environments, it is desirable to separate the user's speech signals from background noise. Speech communication
mediums, such as cell phones, speakerphones, headsets, cordless telephones, teleconferences, CB radios, walkie-talkies, computer telephony applications, computer and automobile voice command applications and other hands-free applications, intercoms,
microphone systems and so forth, can take advantage of speech signal processing to separate the desired speech signals from background noise.
Many methods have been created to separate desired sound signals from background noise signals, including simple filtering processes. Prior art noise filters identify signals with predetermined characteristics as white noise signals, and
subtract such signals from the input signals. These methods, while simple and fast enough for real time processing of sound signals, are not easily adaptable to different sound environments, and can result in substantial degradation of the speech signal
sought to be resolved. The predetermined assumptions of noise characteristics can be over-inclusive or under-inclusive. As a result, portions of a person's speech may be considered "noise" by these methods and therefore removed from the output speech
signals, while portions of background noise such as music or conversation may be considered non-noise by these methods and therefore included in the output speech signals.
In signal processing applications, typically one or more input signals are acquired using a transducer sensor, such as a microphone. The signals provided by the sensors are mixtures of many sources. Generally, the signal sources as well as
their mixture characteristics are unknown. Without knowledge of the signal sources other than the general statistical assumption of source independence, this signal processing problem is known in the art as the "blind source separation (BSS) problem".
The blind separation problem is encountered in many familiar forms. For instance, it is well known that a human can focus attention on a single source of sound even in an environment that contains many such sources, a phenomenon commonly referred to as
the "cocktail-party effect." Each of the source signals is delayed and attenuated in some time varying manner during transmission from source to microphone, where it is then mixed with other independently delayed and attenuated source signals, including
multipath versions of itself (reverberation), which are delayed versions arriving from different directions. A person receiving all these acoustic signals may be able to listen to a particular set of sound sources while filtering out or ignoring other
interfering sources, including multi-path signals.
Considerable effort has been devoted in the prior art to solving the cocktail-party problem, both in physical devices and in computational simulations of such devices. Various noise mitigation techniques are currently employed, ranging from simple
elimination of a signal prior to analysis to schemes for adaptive estimation of the noise spectrum that depend on a correct discrimination between speech and non-speech signals. A description of these techniques is generally characterized in U.S. Pat.
No. 6,002,776 (herein incorporated by reference). In particular, U.S. Pat. No. 6,002,776 describes a scheme to separate source signals where two or more microphones are mounted in an environment that contains an equal or lesser number of distinct
sound sources. Using direction-of-arrival information, a first module attempts to extract the original source signals while any residual crosstalk between the channels is removed by a second module. Such an arrangement may be effective in separating
spatially localized point sources with clearly defined direction-of-arrival but fails to separate out a speech signal in a real-world spatially distributed noise environment for which no particular direction-of-arrival can be determined.
Methods, such as Independent Component Analysis ("ICA"), provide relatively accurate and flexible means for the separation of speech signals from noise sources. ICA is a technique for separating mixed source signals (components) which are
presumably independent from each other. In its simplified form, independent component analysis operates an "un-mixing" matrix of weights on the mixed signals, for example multiplying the matrix with the mixed signals, to produce separated signals. The
weights are assigned initial values, and then adjusted to maximize joint entropy of the signals in order to minimize information redundancy. This weight-adjusting and entropy-increasing process is repeated until the information redundancy of the signals
is reduced to a minimum. Because this technique does not require information on the source of each signal, it is known as a "blind source separation" method. Blind separation problems refer to the idea of separating mixed signals that come from
multiple independent sources.
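By way of illustration only, the un-mixing operation described above may be sketched in Python for the simplified case of two instantaneously mixed channels. The sketch assumes a batch natural-gradient update with a tanh nonlinearity, which is one common variant and not necessarily the form used in any cited work. The function names, the mixing matrix, the step size and the synthetic heavy-tailed sources are all hypothetical.

```python
import math
import random


def apply_matrix(m, signals):
    """Multiply a 2x2 matrix with two signal channels, sample by sample."""
    n = len(signals[0])
    return [[m[i][0] * signals[0][t] + m[i][1] * signals[1][t] for t in range(n)]
            for i in range(2)]


def infomax_ica(x, iterations=200, lr=0.1):
    """Adapt a 2x2 un-mixing matrix W with a batch natural-gradient rule."""
    n = len(x[0])
    w = [[1.0, 0.0], [0.0, 1.0]]  # initial weight values
    for _ in range(iterations):
        y = apply_matrix(w, x)
        g = [[math.tanh(v) for v in ch] for ch in y]  # bounded nonlinearity
        # c[i][j]: time average of g(y_i) * y_j
        c = [[sum(a * b for a, b in zip(g[i], y[j])) / n for j in range(2)]
             for i in range(2)]
        # Natural-gradient step: W <- W + lr * (I - C) W
        d = [[(1.0 if i == j else 0.0) - c[i][j] for j in range(2)]
             for i in range(2)]
        w = [[w[i][j] + lr * (d[i][0] * w[0][j] + d[i][1] * w[1][j])
              for j in range(2)] for i in range(2)]
    return w


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


rng = random.Random(0)
# Two hypothetical independent, heavy-tailed (speech-like) sources.
sources = [[rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(1000)]
           for _ in range(2)]
mixing = [[1.0, 0.6], [0.5, 1.0]]  # assumed mixing, unknown to the algorithm
mixed = apply_matrix(mixing, sources)
w = infomax_ica(mixed)
separated = apply_matrix(w, mixed)
```

After adaptation, each row of `separated` is expected to correlate strongly with one of the two sources, up to an arbitrary scale and ordering, which `correlation` can be used to check.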
Many popular ICA algorithms have been developed to optimize their performance, including a number which have evolved by significant modifications of those which only existed a decade ago. For example, the work described in A. J. Bell and T. J.
Sejnowski, Neural Computation 7:1129-1159 (1995), and Bell, A. J. U.S. Pat. No. 5,706,402, is usually not used in its patented form. Instead, in order to optimize its performance, this algorithm has gone through several recharacterizations by a number
of different entities. One such change includes the use of the "natural gradient", described in Amari, Cichocki, Yang (1996). Other popular ICA algorithms include methods that compute higher-order statistics such as cumulants (Cardoso, 1992; Comon,
1994; Hyvaerinen and Oja, 1997).
However, many known ICA algorithms are not able to effectively separate signals that have been recorded in a real environment which inherently include acoustic echoes, such as those due to room architecture related reflections. It is emphasized
that the methods mentioned so far are restricted to the separation of signals resulting from a linear stationary mixture of source signals. The phenomenon resulting from the summing of direct path signals and their echoic counterparts is termed
reverberation and poses a major issue in artificial speech enhancement and recognition systems. ICA algorithms may require long filters which can separate those time-delayed and echoed signals, thus precluding effective real time use.
Known ICA signal separation systems typically use a network of filters, acting as a neural network, to resolve individual signals from any number of mixed signals input into the filter network. That is, the ICA network is used to separate a set
of sound signals into a more ordered set of signals, where each signal represents a particular sound source. For example, if an ICA network receives a sound signal comprising piano music and a person speaking, a two port ICA network will separate the
sound into two signals: one signal having mostly piano music, and another signal having mostly speech.
Another prior technique is to separate sound based on auditory scene analysis. In this analysis, vigorous use is made of assumptions regarding the nature of the sources present. It is assumed that a sound can be decomposed into small elements
such as tones and bursts, which in turn can be grouped according to attributes such as harmonicity and continuity in time. Auditory scene analysis can be performed using information from a single microphone or from several microphones. The field of
auditory scene analysis has gained more attention due to the availability of computational machine learning approaches leading to computational auditory scene analysis, or CASA. Although CASA is scientifically interesting because it involves the understanding of
human auditory processing, its model assumptions and computational techniques are still in their infancy and cannot yet solve a realistic cocktail party scenario.
Other techniques for separating sounds operate by exploiting the spatial separation of their sources. Devices based on this principle vary in complexity. The simplest such devices are microphones that have highly selective, but fixed patterns
of sensitivity. A directional microphone, for example, is designed to have maximum sensitivity to sounds emanating from a particular direction, and can therefore be used to enhance one audio source relative to others. Similarly, a close-talking
microphone mounted near a speaker's mouth may reject some distant sources. Microphone-array processing techniques are then used to separate sources by exploiting perceived spatial separation. These techniques are not practical because sufficient
suppression of a competing sound source cannot be achieved due to their assumption that at least one microphone contains only the desired signal, which is not practical in an acoustic environment.
A widely known technique for linear microphone-array processing is often referred to as "beamforming". In this method the time difference between signals due to spatial difference of microphones is used to enhance the signal. More particularly,
it is likely that one of the microphones will "look" more directly at the speech source, whereas the other microphone may generate a signal that is relatively attenuated. Although some attenuation can be achieved, the beamformer cannot provide relative
attenuation of frequency components whose wavelengths are larger than the array. These techniques are methods for spatial filtering to steer a beam towards a sound source and therefore putting a null at the other directions. Beamforming techniques make
no assumption on the sound source but assume that the geometry between source and sensors or the sound signal itself is known for the purpose of dereverberating the signal or localizing the sound source.
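The time-alignment idea behind such beamforming may be illustrated with a minimal delay-and-sum sketch in Python. The whole-sample delays, the two-microphone geometry and the test signals are assumptions chosen for illustration. The interferer frequency is deliberately picked so that it falls exactly in a null of the steered array.

```python
import math


def delay(signal, samples):
    """Delay a signal by a whole number of samples, padding with zeros."""
    return [0.0] * samples + list(signal[:len(signal) - samples])


def delay_and_sum(channels, steering_delays):
    """Apply one steering delay per channel, then average the aligned channels."""
    aligned = [delay(ch, d) for ch, d in zip(channels, steering_delays)]
    return [sum(column) / len(aligned) for column in zip(*aligned)]


n = 200
target = [math.sin(0.2 * t) for t in range(n)]
# Period of 14 samples, so a 7-sample misalignment inverts the interferer.
interferer = [math.sin(2.0 * math.pi * t / 14.0) for t in range(n)]

# Assumed geometry: the target reaches microphone 2 three samples after
# microphone 1; the interferer reaches microphone 1 four samples after
# microphone 2.
mic1 = [a + b for a, b in zip(target, delay(interferer, 4))]
mic2 = [a + b for a, b in zip(delay(target, 3), interferer)]

# Steer toward the target: delay microphone 1 by three samples so that both
# copies of the target line up before averaging.
output = delay_and_sum([mic1, mic2], [3, 0])
```

In this constructed case the two interferer copies end up seven samples apart and cancel, so `output` matches the target delayed by three samples once the initial zero padding has passed. An interferer at another frequency or direction would only be partly attenuated.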
Another known technique is a class of active-cancellation algorithms, which is related to sound separation. However, this technique requires a "reference signal," i.e., a signal derived from only one of the sources. Active noise-cancellation
and echo cancellation techniques make extensive use of this technique: they reduce the contribution of noise to a mixture by filtering a known signal that contains only the noise, and subtracting it from the mixture. This
method assumes that one of the measured signals consists of one and only one source, an assumption which is not realistic in many real life settings.
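A minimal Python sketch of such reference-based cancellation follows. It assumes a single-tap least-mean-squares (LMS) adaptive filter, a noise-only reference and synthetic sinusoidal stand-ins for speech and noise. The names and parameter values are hypothetical, and the sketch is not drawn from any cited system.

```python
import math


def lms_cancel(primary, reference, mu=0.01):
    """Single-tap LMS: subtract an adaptively scaled copy of the reference."""
    w = 0.0
    out = []
    for p, r in zip(primary, reference):
        e = p - w * r      # output = primary minus current noise estimate
        w += mu * e * r    # LMS weight update driven by the output
        out.append(e)
    return out, w


n = 5000
speech = [0.3 * math.sin(0.05 * t) for t in range(n)]   # stand-in for speech
noise_ref = [math.sin(0.3 * t) for t in range(n)]       # noise-only reference
# Assumed mixture: the noise reaches the primary microphone with gain 0.5.
primary = [s + 0.5 * r for s, r in zip(speech, noise_ref)]

cleaned, weight = lms_cancel(primary, noise_ref)
```

The adapted `weight` settles near the assumed gain of 0.5, and the tail of `cleaned` approaches the speech stand-in. The sketch relies on the reference containing no speech, which is the assumption the text identifies as unrealistic in many settings.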
Techniques for active cancellation that do not require a reference signal are called "blind" and are of primary interest in this application. They are now classified, based on the degree of realism of the underlying assumptions regarding the
acoustic processes by which the unwanted signals reach the microphones. One class of blind active-cancellation techniques may be called "gain-based", also known as "instantaneous mixing": it is presumed that the waveform produced by each source is
received by the microphones simultaneously, but with varying relative gains. (Directional microphones are most often used to produce the required differences in gain.) Thus, a gain-based system attempts to cancel copies of an undesired source in
different microphone signals by applying relative gains to the microphone signals and subtracting, but not applying time delays or other filtering. Numerous gain-based methods for blind active cancellation have been proposed; see Herault and Jutten
(1986), Tong et al. (1991), and Molgedey and Schuster (1994). The gain-based or instantaneous mixing assumption is violated when microphones are separated in space as in most acoustic applications. A simple extension of this method is to include a time
delay factor but without any other filtering, which will work under anechoic conditions. However, this simple model of acoustic propagation from the sources to the microphones is of limited use when echoes and reverberation are present. The most
realistic active-cancellation techniques currently known are "convolutive": the effect of acoustic propagation from each source to each microphone is modeled as a convolutive filter. These techniques are more realistic than gain-based and delay-based
techniques because they explicitly accommodate the effects of inter-microphone separation, echoes and reverberation. They are also more general since, in principle, gains and delays are special cases of convolutive filtering.
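The gain-based case, and its breakdown once an inter-microphone delay is present, may be illustrated with a short Python sketch. The gains, the two-sample delay and the sinusoidal stand-in signals are assumptions chosen for illustration. The cancelling gain is taken as known here, whereas a blind method would have to estimate it.

```python
import math


def gain_cancel(x1, x2, gain):
    """Subtract a scaled copy of x2 from x1; no delays or filtering are applied."""
    return [a - gain * b for a, b in zip(x1, x2)]


n = 400
speech = [math.sin(0.07 * t) for t in range(n)]
noise = [math.sin(0.9 * t + 1.0) for t in range(n)]

# Instantaneous mixing: both sources reach both microphones at the same time.
mic1 = [s + 0.8 * v for s, v in zip(speech, noise)]
mic2 = [0.2 * s + v for s, v in zip(speech, noise)]
clean = gain_cancel(mic1, mic2, 0.8)       # noise cancels, 0.84 * speech remains

# Same gains, but the noise now reaches microphone 2 two samples late.
late_noise = [0.0, 0.0] + noise[:-2]
mic2_late = [0.2 * s + v for s, v in zip(speech, late_noise)]
leaky = gain_cancel(mic1, mic2_late, 0.8)  # noise no longer cancels
```

Under instantaneous mixing, `clean` equals the speech scaled by 0.84. With the delay, `leaky` retains a large noise residual, which is why delay-based and convolutive models are needed for spaced microphones.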
Convolutive blind cancellation techniques have been described by many researchers, including Jutten et al. (1992), Van Compernolle and Van Gerven (1992), Platt and Faggin (1992), Bell and Sejnowski (1995), Torkkola (1996), Lee (1998) and
Parra et al. (2000). For multiple channel observations through an array of microphones, the predominantly used multiple source model can be formulated as follows:
$$x_i(t) = \sum_{l=0}^{L} \sum_{j=1}^{m} a_{ijl}(t)\, s_j(t-l) + n_i(t)$$
where x(t) denotes the observed data, s(t) is the hidden source signal, n(t) is the additive sensory noise signal and a(t) is the mixing filter. The parameter m is
the number of sources, L is the convolution order and depends on the environment acoustics and t indicates the time index. The first summation is due to filtering of the sources in the environment and the second summation is due to the mixing of the
different sources. Most of the work on ICA has been centered on algorithms for instantaneous mixing scenarios, in which the first summation is removed and the task is simplified to inverting a mixing matrix a. A slight modification arises when no
reverberation is assumed: signals originating from point sources can then be viewed as identical when recorded at different microphone locations except for an amplitude factor and a delay. The problem as described in the above equation is known as the multichannel
blind deconvolution problem. Representative work in adaptive signal processing includes Yellin and Weinstein (1996) where higher order statistical information is used to approximate the mutual information among sensory input signals. Extensions of ICA
and BSS work to convolutive mixtures include Lambert (1996), Torkkola (1997), Lee et al. (1997) and Parra et al. (2000).
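The convolutive mixing model described above may be made concrete with a short Python sketch. For simplicity the sketch assumes time-invariant mixing filters, indexed as filters[i][j][l] for sensor i, source j and tap l. This indexing, the function name and the toy values are assumptions introduced for illustration.

```python
def convolutive_mix(sources, filters, sensor_noise):
    """Compute x_i(t) = sum over taps l and sources j of a_ijl * s_j(t - l) + n_i(t)."""
    n = len(sources[0])
    mixed = []
    for i, row in enumerate(filters):
        channel = []
        for t in range(n):
            acc = sensor_noise[i][t]              # additive sensor noise n_i(t)
            for j, taps in enumerate(row):        # mixing over the m sources
                for l, a in enumerate(taps):      # filtering over the taps
                    if t - l >= 0:
                        acc += a * sources[j][t - l]
            channel.append(acc)
        mixed.append(channel)
    return mixed


# Two toy sources, two sensors, two taps per filter, no sensor noise.
sources = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 2.0, 0.0, 0.0]]
filters = [[[1.0, 0.5], [0.0, 0.25]],   # sensor 0: direct path plus one echo
           [[0.3, 0.0], [1.0, 0.0]]]    # sensor 1: gains only, no echo
silence = [[0.0] * 4, [0.0] * 4]

observed = convolutive_mix(sources, filters, silence)
```

With single-tap filters the same function reduces to the instantaneous mixing case, in which only a mixing matrix remains to be inverted.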
ICA and BSS based algorithms for solving the multichannel blind deconvolution problem have become increasingly popular due to their potential to solve the separation of acoustically mixed sources. However, there are still strong assumptions made
in those algorithms that limit their applicability to realistic scenarios. One of the most incompatible assumptions is the requirement of having at least as many sensors as sources to be separated. Mathematically, this assumption makes sense. However,
practically speaking, the number of sources is typically changing dynamically and the sensor number needs to be fixed. In addition, having a large number of sensors is not practical in many applications. In most algorithms a statistical source signal
model is adapted to ensure proper density estimation and therefore separation of a wide variety of source signals. This requirement is computationally burdensome since the adaptation of the source model needs to be done online in addition to the
adaptation of the filters. Assuming statistical independence among sources is a fairly realistic assumption, but the computation of mutual information is intensive and difficult. Good approximations are required for practical systems. Furthermore,
sensor noise is usually not taken into account, which is a valid assumption when high-end microphones are used. However, simple microphones exhibit sensor noise that must be handled for the algorithms to achieve reasonable performance.
Finally, most ICA formulations implicitly assume that the underlying source signals essentially originate from spatially localized point sources, albeit with their respective echoes and reflections. This assumption is usually not valid for strongly
diffuse or spatially distributed noise sources like wind noise emanating from many directions at comparable sound pressure levels. For these types of distributed noise scenarios, the separation achievable with ICA approaches alone is insufficient.
What is desired is a simplified speech processing method that can separate speech signals from background noise in near real-time and that does not require substantial computing power, but still produces relatively accurate results and can adapt
flexibly to different environments.
SUMMARY OF THE INVENTION
Briefly, the present invention provides a process for generating an acoustically distinct information signal based on recordings in a noisy acoustic environment. The process uses a set of at least two spaced-apart transducers to capture noise and
information components. The transducer signals, which have both a noise and information component, are received into a separation process. The separation process generates one channel that is dominated by noise, and another channel that is a
combination of noise and information. An identification process is used to identify which channel has the information component. The noise-dominant signal is then used to set process characteristics that are applied to the combination signal to
efficiently reduce or eliminate the noise component. In this way, the noise is effectively removed from the combination signal to generate a good quality information signal. The information signal may be, for example, a speech signal, a seismic signal,
a sonar signal, or other acoustic signal.
In a more specific example, the separation process uses two microphones to distinguish a speaker's voice from the environmental noise component. When properly positioned, the microphones receive in different magnitudes both the speaker's voice
as well as environmental noise components. The microphones may be adapted to enhance separation results by modulating the input of the two types of components, namely the desired voice and the environmental noise components, such as modulation of the
gain, direction, location, and the like. The signals from the microphones are simultaneously or subsequently received in a separation process, which generates one channel that is noise dominant, and generates a second channel that is a combination of
noise and speech components. The identification process is used to determine which signal is the combination signal and which has stronger speech components. The combination signal is filtered using a noise-reduction filter to identify, reduce or
remove noise components. Since the noise signal is used to adapt and set the filter's coefficients, the filter is enabled to efficiently pass a particularly good quality speech signal which is audibly distinct from the noise component.
Advantageously, the present separation process enables nearly real-time signal separation using only a reasonable level of computing power, while providing a high quality information signal. Further, the separation process may be flexibly
implemented in analog or digital devices, such as communication devices, and may use alternative processing algorithms and filtering topologies. In this way, the separation process is adaptable to a wide variety of devices, processes, and applications.
For example, the separation process may be used in a variety of communication devices such as mobile wireless devices, portable handsets, headsets, walkie-talkies, commercial radios, car kits, and voice activated devices.
Other aspects and embodiments are illustrated in drawings, described below in the "Detailed Description" section, or defined by the scope of the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram illustrating a separation process in accordance with the present invention;
FIG. 2 is a block diagram illustrating a separation process in accordance with the present invention;
FIG. 3 is a flowchart of a separation process in accordance with the present invention;
FIG. 4 is a flowchart of a separation process in accordance with the present invention;
FIG. 5 is a block diagram of a wireless mobile device using a separation process in accordance with the present invention;
FIG. 6 is a block diagram of one embodiment of an improved ICA processing sub-module in accordance with the present invention;
FIG. 7 is a block diagram of one embodiment of an improved ICA speech separation process in accordance with the present invention; and
FIG. 8 is a block diagram of a de-noising process in accordance with the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring now to FIG. 1, a process for separating an acoustic signal is illustrated. More particularly, separation process 10 is useful for separating or extracting a speech signal in a noisy environment. Although separation process 10 is
discussed with reference to a speech information signal, it will be appreciated that other acoustic information signals may be used, for example, mechanical vibrations, seismic waves or sonar waves. Separation process 10 may be operated on a processor
device, such as a microprocessor, programmable logic device, gate array, or other computing device. It will be appreciated that separation process 10 may also be implemented in one or more integrated circuit devices, or may incorporate more discrete
components. It will also be understood that portions of process 10 may be implemented as software or firmware cooperating with a hardware processing device.
Separation process 10 has a set of transducers 18 arranged to respond to environmental acoustic sources 12. In one application, each transducer, for example a microphone, is positioned to capture sound produced by a speech source 14 and noise
sources 13 and 15. Typically, the speech source will be a human speaking voice, while the noise sources will represent unwanted sounds, reverberations, echoes, or other sound signals, including combinations thereof. Although FIG. 1 shows only two noise
sources, it is likely that many more noise sources will exist in a real acoustic environment. In this regard, it would not be unusual for the noise sources to be louder than the speech source, thereby "burying" the speech signal in the noise. In one
example, a set of microphones is mounted on a portable wireless device, such as a mobile handset, and the speech source is a person speaking into the handset. Such a mobile handset may be operated in very noisy environments, where it would be highly
desirable to limit the noise component transmitted to the receiving party. In this regard, the separation process 10 provides the mobile handset with a cleaner, more usable speech signal. In another example, separation process 10 is operated on a
voice-activated device. In this case, one of the significant noise sources may be the operational noise of the device itself.
As defined herein, transducers are signal detection devices, and may be in the form of sound-detection devices such as microphones. Specific examples of microphones for use with embodiments of the invention include electromagnetic,
electrostatic, and piezo-electric devices. The sound-detection devices may process sounds in analog form. The sounds may be converted into digital format for the processor using an analog-to-digital converter. In one example, the separation process
enables a diverse range of applications in addition to speech separation, such as locating specific acoustic events using waves that are emitted when those events occur. The waves (such as sound) from the events of interest are used to determine the
range of the source position from a designated point. In turn, the source position of the event of interest may be determined.
Separation process 10 uses a set of at least two spaced-apart microphones, such as microphones 19 and 20. To improve separation, it is desirable that the microphones have a direct path to the speaker's voice. In such a direct path, the
speaker's voice travels directly to each microphone, without any intervening physical obstruction. The separation process 10 may have more than two microphones 21 and 22 for applications requiring more robust separation, or where placement constraints
cause more microphones to be useful. For example, in some applications it may be possible that a speaker may be placed in a position where the speaker is shielded from one or more microphones. In this case, additional microphones would be used to
increase the likelihood that at least two microphones would have a direct path to the speaker's voice. Each of the microphones receives acoustic energy from the speech source 14 as well as from the noise sources 13 and 15, and generates a composite
signal having both speech components and noise components. Since each of the microphones is separated from every other microphone, each microphone will generate a somewhat different composite signal. For example, the relative content of noise and
speech may vary, as well as the timing and delay for each sound source.
Separation process 10 may use a set of at least two spaced-apart microphones with directivity characteristics. In certain applications, it is desirable to use directional microphones where the directivity pattern can be generated in many
different embodiments. In one example the directivity is due to the physical characteristic of the microphone (e.g. cardioid or noise canceling microphone). Another implementation uses the combination and processing of multiple microphones (e.g.
processing of two omnidirectional microphones yields one directional microphone). In another use, the placement and physical occlusion of microphones can lead to a directivity characteristic of the microphone. The use of directivity patterns in the
microphones may facilitate the separation process or obviate the separation process (e.g. ICA process), thus shifting the focus to the post-processing stage.
The composite signal generated at each microphone is received by a separation process 26. The separation process 26 processes the received composite signals and generates a first channel 27 and a second channel 28. In one example, the
separation process 26 uses an independent component analysis (ICA) process for generating the two channels 27 and 28. The ICA process filters the received composite signals using cross filters, which are preferably infinite impulse response filters with
nonlinear bounded functions. The nonlinear bounded functions are nonlinear functions with pre-determined maximum and minimum values that can be computed quickly, for example a sign function that returns as output either a positive or a negative value
based on the input value. Following repeated feedback of signals, two channels of output signals are produced, with one channel dominated with noise so that it consists substantially of noise components, while the other channel contains a combination of
noise and speech. It will be understood that other ICA filter functions and processes may be used consistent with this disclosure. Alternatively, the present invention contemplates employing other source separation techniques. For example, the
separation process could use a blind source separation (BSS) process, or an application specific adaptive filter process using some degree of a priori knowledge about the acoustic environment to accomplish substantially similar signal separation.
The separation process 26 is thereby tuned to generate a signal that is noise-dominant, and another signal that is a combination of noise and speech. In order to enable further processing, the channels 27 or 28 are identified according to
whether each respective channel has the noise-dominant signal or the composite or combination signal. To do so, the separation process 10 uses an identification process 30. The identification process 30 may apply an algorithmic function to one or both
of the channels to identify the channels. For example, the identification process 30 may measure a distinct characteristic of the channels, such as the energy or signal-to-noise ratio (SNR), or another distinctive characteristic, and based on
expected criteria, may determine which channel is noise-dominant and which is noise plus speech (combination). In another example, the identification process 30 may evaluate the zero-crossing rate characteristics of one or both channels, and based on
expected criteria, may determine which channel is noise-dominant and which is the combination channel. In these examples, the identification process evaluates the characteristics of the channel signal(s) to identify the channels.
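A minimal Python sketch of such a characteristic-based identification follows. The text leaves the expected criteria unspecified, so the criteria below are assumptions made for illustration only: the combination channel is taken to carry more energy, and, for hiss-like noise, to show a lower zero-crossing rate. The function names and the synthetic channels are hypothetical, and the assignment of the test signals to channels 27 and 28 is arbitrary.

```python
import math


def energy(x):
    """Mean squared amplitude of a channel."""
    return sum(v * v for v in x) / len(x)


def zero_crossing_rate(x):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(x) - 1)


def identify_channels(channel_a, channel_b, feature, speech_raises_feature):
    """Label two channels using one feature and an assumed expected criterion."""
    a_is_higher = feature(channel_a) > feature(channel_b)
    if a_is_higher == speech_raises_feature:
        return ("combination", "noise-dominant")
    return ("noise-dominant", "combination")


n = 1000
speech = [math.sin(0.1 * t) for t in range(n)]        # slow, speech-like stand-in
noise = [0.3 * math.sin(2.5 * t) for t in range(n)]   # fast, hiss-like stand-in
channel_27 = [v + 0.05 * s for v, s in zip(noise, speech)]   # mostly noise
channel_28 = [s + v / 3.0 for s, v in zip(speech, noise)]    # speech plus noise

# Assumed criteria: speech raises energy and lowers the zero-crossing rate.
by_energy = identify_channels(channel_27, channel_28, energy, True)
by_zcr = identify_channels(channel_27, channel_28, zero_crossing_rate, False)
```

Both criteria label the second channel as the combination channel for these synthetic signals. With other noise types the direction of either criterion could reverse, which is why it is exposed as a parameter.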
As used herein, the term "noise-dominant" refers to the channel having lesser magnitudes or amounts of the speech signal or alternatively, greater magnitudes or amounts of the noise signal, as compared to the noise+speech combination channel.
Correspondingly, the term "noise+speech" or "combination" channel refers to the channel having greater magnitudes or amounts of the speech signal than in the noise-dominant channel. Such language should not be construed as literally referring to a
channel devoid of the other signal, i.e., speech or noise. Rather, it is to be understood that both channels 27 and 28 will have overlapping noise and speech signals, with one containing greater speech characteristics and the other containing
greater noise characteristics.
The identification process 30 may also use one or more multi-dimensional characteristics to assist in the identification process. For example, a voice recognition engine may be receiving the signal generated by the separation process 10. The
identification process 30 may monitor the speech recognition accuracy that the engine achieves, and if higher recognition accuracy is measured when using one of the channels as the combination channel, then it is likely that that channel is the combination
channel. Conversely, if low speech recognition accuracy is found when using one of the channels as the combination channel, then it is likely that the channels have been mis-identified, and the other channel is actually the combination channel. In another
example, a voice activity detection (VAD) module may be receiving the signal generated by the separation process 10. The identification module monitors the resulting voice activity when each channel is used as the combination channel in the separation
process 10. The channel that produces the most voice activity is likely the combination channel, while the channel with less voice activity is the noise-dominant channel.
In another application of the identification process 30, the identification process 30 uses a-priori information to initially identify the channels. For example, in some microphone arrangements, one of the microphones is very likely to be the
closest to the speaker, while all the other microphones will be further away. Using this pre-defined position information, the identification process can pre-determine which of the channels (27 or 28) will be the combination signal, and which will be
the noise-dominant signal. Using this approach has the advantage of being able to identify which is the combination channel and which is the noise-dominant channel without first having to significantly process the signals. Accordingly, this method is
efficient and allows for fast channel identification, but uses a more defined microphone arrangement, so is less flexible.