Claims
The invention claimed is:
1. In a telecommunications network carrying incoming signals having both speech and noise energy, a method including iterative estimations using an LPC speech model for
processing said signals at a selected point in said network to reduce said noise energy, comprising the steps of:
converting said incoming signals to a time-series of spectral energy data frames;
reducing in an initial stage comprising said LPC speech model the noise energy in each frame, thereby creating noise-reduced data frames having residual low-amplitude non-stationary noise appearing randomly in data frames as false formant
pickets or other noise components of varying amplitudes;
selecting pickets in each said data frame which in accordance with a first criterion are likely to represent false formant pickets;
minimizing the variation in amplitudes of said selected pickets with the pickets in corresponding positions in adjacent said data frames, thereby creating noise-minimized frames; and
combining said noise-minimized frames with said noise-reduced frames and transmitting the combined signal through said network.
2. The method of claim 1, comprising the further step of:
minimizing the within-frame variations in amplitude of further said pickets identified in accordance with a second predetermined criterion as likely to represent said other noise components.
3. The method of claims 1 or 2, comprising the further steps of:
making an estimate of the noise power spectrum for each said data frame; and
spectrally subtracting from said converted signal prior to its entering said initial stage a predetermined fraction of said noise power spectrum estimate.
4. In a telecommunications network carrying incoming signals having both speech and noise energy, a method including iterative estimations using an LPC speech model for processing said signals at a selected point in said network to reduce said
noise energy, comprising the steps of:
converting said incoming signals to a time-series of spectral energy data frames;
reducing in an initial stage comprising said LPC speech model the noise energy in each said frame, thereby creating noise-reduced data frames having residual low-amplitude non-stationary noise appearing randomly in data frames as false formant
pickets or other noise components of varying amplitudes;
making an estimate of the noise power spectrum for each said data frame;
identifying in successive said data frames any said pickets at or below a threshold amplitude set at a defined distance above said noise power spectrum estimate for each said data frame;
comparing the amplitude of each identified picket in a given data frame to the amplitudes of corresponding said pickets in time-adjacent data frames to determine which has the minimum amplitude; combining said noise-minimized frames with said
noise-reduced frames; and substituting said minimum value for the amplitude of each said identified picket; and
transmitting the combined signal through said network.
5. The method of claim 4, wherein said successive time-adjacent frames are the frames immediately adjacent to said given frame.
6. The method of claim 5, wherein said defined distance according to claim 4 is not substantially greater than one power spectrum standard deviation above said noise power spectrum estimate established for each said frame.
7. The method of claims 4, 5, or 6, comprising the further steps of
identifying in successive ones of said data frames one or more spectral pickets having amplitudes within a range which according to a predetermined criterion are likely to represent non-stationary noise energy;
comparing within the same frame the amplitude of each said identified spectral picket with the amplitudes of nearby pickets of different frequency values to determine if any of said nearby pickets have amplitudes falling below said range, and if
any are found;
adjusting downward the amplitude of each identified said picket; and
transmitting the frames with said downward-adjusted picket amplitudes through said network.
8. The method of claim 7, wherein said nearby pickets are the pair located two picket positions on either side of said identified picket.
9. The method of claim 8, wherein said downward adjustment step comprises:
averaging the values of said pair of pickets; and
substituting the result for the amplitude values of said identified picket and of the pickets next-adjacent to said identified picket.
10. The method of claims 4, 5, or 6 comprising the further steps of
spectrally subtracting from said converted signal a predetermined fraction of said noise power spectrum estimate; and
applying the resultant said spectrally reduced signal to said initial stage.
11. The method of claim 10, wherein said noise power spectrum estimate is generated in said initial stage.
12. The method of claim 10, wherein said predetermined fraction is in a range of up to 50 percent of said noise power spectrum estimate.
13. The method of claim 9, comprising the further steps of:
spectrally subtracting from said converted signal a predetermined fraction of said noise power spectrum estimate; and
applying the resultant said spectrally reduced signal to said initial stage.
14. The method of claim 13, wherein said noise power spectrum estimate is generated in said first stage.
15. The method of claim 14, wherein said predetermined fraction is in a range of up to 50 percent of said noise power spectrum estimate.
16. The method of claim 3, wherein said selected point in said network is a toll switch.
17. The method of claim 7, wherein said selected point in said network is a toll switch.
18. The method of claim 10, wherein said selected point in said network is a toll switch.
19. The method of claim 15, wherein said selected point in said network is a toll switch.
20. In a telecommunications network carrying incoming signals having both speech and noise energy, a method for processing said signals at a selected point in said network to reduce said noise energy, comprising the steps of:
converting said incoming signals to a time-series of spectral energy data frames;
using an iterated filter based on an LPC speech model with filter-order adjustments, iteratively processing said data frames to create a new filter for each iteration of a current frame, a noise power spectra estimate for each frame and a
noise-reduced output speech signal for each iteration, thereby also creating residual low-amplitude non-stationary noise appearing randomly in data frame pickets at false formant or other noise components;
detecting noise-only frames and in response thereto updating said noise power spectrum estimate only during said noise-only frames;
identifying in successive said data frames from the output speech signal of said iterative filter any said pickets at or below a threshold amplitude set at a defined distance above said noise power spectrum estimate for each said data frame;
comparing the amplitude of each identified picket in a given data frame to the amplitudes of corresponding said pickets in time-adjacent data frames to determine which has the minimum amplitude;
substituting said minimum value for the amplitude of each said identified picket;
combining said noise-minimized frames with said noise-reduced frames; and
transmitting the combined signal through said network.
21. The method of claim 20, comprising the further step of:
attenuating said incoming signal in further response to said detection of noise-only frames; and
passing said attenuated incoming signal to said network.
22. The method of claim 21, comprising the further step of:
identifying in successive said data frames any said pickets at or below a threshold amplitude set at a defined distance above said noise power spectrum estimate for each said data frame;
comparing the amplitude of each identified picket in a given data frame to the amplitude of corresponding said pickets in time-adjacent data frames to determine which has the minimum amplitude;
substituting said minimum value for the amplitude of each said identified picket to further minimize noise components therein;
combining said noise-minimized frames with said noise-reduced frames; and
transmitting the combined signal to said network.
23. The method of claim 22, comprising the further step of:
detecting voiced speech and in response thereto adjusting said LPC model filter order to substantially the 10th order.
24. The method of claim 23, comprising the further step of
detecting unvoiced speech and in response thereto adjusting the LPC model filter order to a range of from 4th to 6th order.
25. The method of claim 24, wherein said successive time-adjacent frames are the frames immediately adjacent to said given frame.
26. The method of claim 25, wherein said iterative filter operations further comprise:
selecting from said time-series of data frames a short sequence of consecutive frames including the current frame;
iterating the current frame over only said frame sequence; and
performing iterations numbering from three to seven on said current frame, thereby to create a smoothed estimate of the speech power spectrum for said frame.
27. The method of claim 26, wherein said iterative processing further comprises creating a new Wiener Filter for each iteration of a current frame by combining the output of the previous Wiener Filter iteration with the unfiltered said incoming
signal in a preselected weighting ratio, thereby to control through said weighting the final noise content of said output signal vs. the degree of high-frequency filtering in said iterative filtering.
28. The method of claim 27, wherein said short sequence of frames includes from 1 to 2 future frames and from 1 to 4 past frames.
29. The method of claim 28, wherein said data frames during said iterations are overlapped.
30. The method of claims 20, 21, 22, 23, 24, 25, 26, 27, 28, or 29, wherein said defined distance according to claim 20 is not substantially greater than one power spectrum standard deviation above said noise power spectrum estimate for each
said frame.
31. The method of claim 30, comprising the further steps of:
identifying in successive ones of said data frames one or more spectral pickets having amplitudes within a range which according to a predetermined criterion are likely to represent non-stationary noise energy;
comparing within the same frame the amplitude of each said identified spectral picket with the amplitudes of nearby pickets of different frequency values to determine if any of said nearby pickets have amplitudes falling below said range; and if
any are found,
adjusting downward the amplitude of each identified said picket; and
transmitting the frames with downward-adjusted picket amplitudes through said network.
32. The method of claim 31, wherein said nearby pickets are the pair located two picket positions on either side of said identified picket.
33. The method of claim 32, wherein said downward adjustment step comprises:
averaging the values of said pair of pickets; and
substituting said average for the amplitude values of said identified picket and of the pickets next-adjacent to said identified picket.
34. The method of claims 20, 21, 22, 23, 24, 25, 26, 27, 28, or 29, comprising the further steps of:
spectrally subtracting from said converted signal a predetermined fraction of said noise power spectrum estimate; and
applying the resultant said spectrally reduced signal to said iterative filter step.
35. The method of claim 34, wherein said predetermined fraction is in a range of up to 50 percent of said noise power spectrum estimate.
36. The method of claim 35, wherein said data frames substantially overlap one another by substantially 50 percent.
37. The method of claims 20, 21, 22, 23, 24, 25, 26, 27, 28, or 29, wherein said selected point in said network is a toll switch.
38. The method of claim 30, wherein said selected point in said network is a toll switch.
Description
FIELD OF THE INVENTION
This invention relates to enhancing the quality of speech in a noisy telecommunications channel or network and, particularly, to apparatus which enhances the speech by removing residual noise content following an initial noise reduction
operation.
BACKGROUND OF THE INVENTION
In all forms of voice communications systems, noise from a variety of causes can interfere with the user's communications. Corrupting noise can occur with speech at the input of a system, in the transmission path(s), and at the receiving end.
The presence of noise is annoying or distracting to users, can adversely affect speech quality, and can reduce the performance of speech coding and speech recognition apparatus.
Speech enhancement technology is important to cellular radio telephone systems which are subjected to car noise and channel noise, to pay phones located in noisy environments, to long-distance communications over noisy radio links or other poor
paths and connections, to teleconferencing systems with noise at the speech source, and air-ground communication systems where loud cockpit noise corrupts pilot speech and is both wearing and dangerous. Further, as in the case of a speech recognition
system for automatic dialing, recognition accuracy can deteriorate in the noisy environment if the recognizer algorithm is based on a statistical model of clean speech.
Noise in the transmission path is particularly difficult to overcome, one reason being that the noise signal is not ascertainable from its source. Therefore, suppressing it cannot be accomplished by generating an "error" signal from a direct
measurement of the noise and then cancelling out the error signal by phase inversion.
Various approaches to enhancing a noisy speech signal when the noise component is not directly observable have been attempted. A review of these techniques is found in "Enhancement and Bandwidth Compression of Noisy Speech," by J. S. Lim and A.
V. Oppenheim, Proceedings of the IEEE, Vol. 67, No. 12, Dec. 1979, Section V, pages 1586-1604. These include spectral subtraction of the estimated noise amplitude spectrum from the whole spectrum computed for the available noisy signal, and an
iterative model-based filter proposed by Lim and Oppenheim which attempts to find the best all-pole model of the speech component given the total noisy signal and an estimate of the noise power spectrum. The model-based approach was used in "Constrained
Iterative Speech Enhancement with Application to Speech Recognition," by J. H. L. Hansen and M. A. Clements, IEEE Transactions On Signal Processing, Vol. 39, No. 4, Apr. 1991, pages 795-805, to develop a non-real-time speech smoother, where additional
constraints were imposed on the method of Lim/Oppenheim during the iterations to limit the model to maintain characteristics of speech.
The effects of the earlier methods in the Lim/Oppenheim reference are to improve the signal-to-noise ratio via the processing, but with poor speech quality improvement due to the introduction of non-stationary noise in the filtered outputs. Even
very low level non-stationary noise of the type observed after filtering can be objectionable to human hearing. The advantage of smoothing across time frames in Hansen's non-real-time smoother is to further reduce the level of the non-stationary noise
that remains. Hansen's smoothing approach provides considerable speech quality enhancement compared with the methods in Lim/Oppenheim, but this smoothing technique cannot be operated in real-time since it processes all data frames, past and future, at
each iteration stage. Thus, the smoothing process cannot work effectively in a telecommunications environment.
SUMMARY OF THE INVENTION
The invention is a signal processing method for a communication network, which filters out noise using iterative estimation of the LPC speech model with the addition of real-time operation, continuous estimation of the noise power spectrum,
modification of the signal refiltered each iteration, and time constraints on the number of poles and their movements across time frames. The noise-corrupted input speech signal is applied to a special iterated linear Wiener Filter, the purpose of which
is to output in real-time an estimate of the speech which then is transmitted into the network.
The filter requires an estimate of the current noise power spectral density function. This is obtained from spectral estimation of the input in noise gaps that are typical in speech. The detection of these noise-only frames is accomplished by a
Voice Activity Detector (VAD). When noise-only is detected in the VAD, the output is an attenuated original input so that the full noise power is not propagated onto the network. The purpose of transmitting an attenuated input in noise-only frames is
to present to the receiver end a "comfort" noise.
When speech plus noise is detected in the time frame under consideration by the filter, an estimate is made as to whether the speech is voiced or unvoiced. The order of the LPC model assumed in the iterated filter is modified according to the
speech type detected. As a rule, the LPC model order is set at 10 for voiced speech and 6 for unvoiced speech in a time frame where the speech bandwidth is 4 KHz. This dynamic adaptation of model order is used to suppress unused model poles that can
produce time-dependent modulated tonelike noise "chirps" in the filtered speech.
In accordance with another aspect of the invention, a tracking of changes in the noise spectrum is provided by updating with new noise-only frames to a degree that depends on a "distance" between the new and old noise spectrum estimates.
Parameters may be set on the minimum number of contiguous new noise frames that must be detected before a new noise spectrum update is estimated and on the weight the new noise spectrum update is given.
By including a spectral subtraction step prior to the iterative filtering to create an initial signal that is noise-reduced, the overall noise reduction process can operate at lower S/N ratios.
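The spectral subtraction pre-filtering step can be illustrated with a minimal sketch. It assumes per-bin power spectra held in plain Python lists; the function name, the clamping of negative results to zero, and the rejection of fractions above 0.5 are assumptions of this sketch, the last chosen only to mirror the "up to 50 percent" range recited in the claims.

```python
def subtract_noise_fraction(signal_power, noise_power, fraction=0.5):
    """Subtract a fixed fraction of the noise power estimate from each bin.

    signal_power, noise_power: per-bin power spectra of equal length.
    fraction: share of the noise estimate to remove (the claims recite a
              range of up to 50 percent).
    Negative results are clamped to zero; that floor is an assumption of
    this sketch, not something the source specifies.
    """
    if not 0.0 <= fraction <= 0.5:
        raise ValueError("fraction is expected to lie in [0, 0.5]")
    if len(signal_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    return [max(s - fraction * d, 0.0)
            for s, d in zip(signal_power, noise_power)]
```

The partially cleaned spectrum returned here would then be passed to the iterative filter as its initial, noise-reduced input.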
In accordance with another aspect of the invention, some of the lower-level undesirable noise-induced false formant regions generated by the internal operation of the iterative filter are removed by adding a further step of spectral smoothing
across adjacent frames. This additional step may then advantageously be followed by a process for eliminating low-level noise components from a filtered frame after this interframe smoothing step has been applied.
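The two post-filter smoothing steps can be sketched as follows, using the picket rules recited in the claims: low-level pickets are replaced by the minimum over the immediately adjacent frames, and a picket whose neighbors two positions away fall below a chosen range is flattened to their average. Frames are assumed to be lists of picket amplitudes; the function names, the per-bin `margin` argument (for example one standard deviation of the noise power spectrum), and the `low`/`high` range are hypothetical parameters of this sketch.

```python
def interframe_min_smooth(prev_frame, cur_frame, next_frame, noise_est, margin):
    """Replace low-level pickets by the minimum over time-adjacent frames."""
    out = list(cur_frame)
    for k, amp in enumerate(cur_frame):
        # Threshold sits a defined distance above the noise estimate.
        threshold = noise_est[k] + margin[k]
        if amp <= threshold:
            out[k] = min(prev_frame[k], amp, next_frame[k])
    return out


def intraframe_smooth(frame, low, high):
    """Flatten pickets in [low, high] whose neighbors two bins away are below it."""
    out = list(frame)
    for k in range(2, len(frame) - 2):
        if low <= frame[k] <= high:
            left, right = frame[k - 2], frame[k + 2]
            if left < low or right < low:
                avg = 0.5 * (left + right)
                # Only ever adjust downward (an assumption of this sketch).
                if avg < frame[k]:
                    out[k - 1] = out[k] = out[k + 1] = avg
    return out
```

In this sketch the interframe step would run first on each filtered frame, with the intraframe step applied to its result.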
These and further inventive improvements to the art of using iterative estimation of a filter that incorporates an adaptive speech model and noise spectral estimation with updates to suppress noise of the type which cannot be directly measured,
are hereinafter detailed in the description to follow of a specific novel embodiment of the invention used in a telecommunication network.
DESCRIPTION OF THE DRAWINGS
FIG. 1A is a diagram of an illustrative telecommunications network containing the invention;
FIG. 1B is a signal processing resource;
FIG. 1C is a diagram of a further illustrative network containing the invention;
FIG. 2 is a diagram of smoothing and iterative operations practiced in the invention;
FIG. 3 is a flowchart showing the framework for speech enhancement;
FIG. 4 is a diagram of apparatus which generates the iteration sequence for constrained speech filtering;
FIGS. 5A, 5B, and 5C are diagrams depicting the interframe smoothing operation for LPC roots of the speech model; and the intraframe LPC autocorrelation matrix relaxation from iteration to iteration;
FIG. 6A is a diagram showing frames used in the prior art to update each iteration of the current frame in the non-real-time smoothing method;
FIG. 6B is a diagram showing the improved method used for updating each iteration of the current frame;
FIGS. 7A and 7B are tables of smoothing weights for the LSP position roots to smooth across seven speech frames around the current frame;
FIGS. 8 and 9 are signal traces showing aspects of the noise estimator;
FIG. 10 is a flowchart of the steps used to update the required noise spectrum used in the iterative filter;
FIG. 11 is a high-level block diagram of the basic process of the invention further enhanced by a spectral subtraction step prior to the filtering, and interframe and intraframe smoothing steps included after the initial stage of filtering;
FIG. 12 is a flow chart showing the combined processing stages and sequences of FIGS. 4 and 11;
FIG. 13 is a frequency/spectral amplitude diagram of four successive frames illustrating the interframe smoothing to remove false formant pickets created by low level internally generated noise from the initial stage of filtering;
FIG. 14 is a frequency/amplitude picket diagram illustrating the removal of chirp sounds using intra-frame information from a filtered signal; and
FIG. 15 is a frequency/amplitude graph illustrating the spectral subtraction process pre-filtering step in FIG. 11.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE EMBODIMENT
The invention is essentially an enhancement process for filtering in-channel speech-plus-noise when no separate noise reference is available and which operates in real time. The invention will be described in connection with a telecommunications
network, although it is understood that the principles of the invention are applicable to many situations where noise on an electronic speech transmission medium must be reduced. An exemplary telecommunications network is shown in FIG. 1A, consisting of
a remotely located switch 10 to which numerous communications terminals such as telephone 11 are connected over local lines such as 12 which might be twisted pair. Outgoing channels such as path 13 emanate from remote office 10. The path 13 may cross
over an international border 14. The path 13 continues to a U.S. based central office 15 with a switch 16 which might be a No. 4ESS toll switch serving numerous incoming paths denoted 17 including path 13. Alternatively, as in FIG. 1C, the
communications path in which the invention is situated may comprise two toll switches 15 connected by trunks 9, each in turn connecting to local central offices 10 via trunks 8. The noise reduction circuitry of the present invention may, in this
arrangement, be located as processors 21 in the toll switches 15.
Returning to FIG. 1A, switch 16 sets up an internal path such as path 18 which, in the example, links an incoming call from channel 13 to an eventual outgoing transmission channel 19, which is one of a group of outgoing channels. The incoming
call from channel 13 is assumed to contain noise generated in any of the segments 10, 11, 12, 13 of the linkage. Noise is present in combination with the desired signal; but the noise source cannot be directly measured.
In accordance with the invention, a determination is made in logic unit 20 whether noise above a certain predetermined threshold is present in the switch output from channel 13. Logic unit 20 also determines whether the call is voice, by ruling
out fax, modem and other possibilities. Further, logic unit 20 determines whether the originating number is a customer of the transmitted noise reduction service. If logic unit 20 makes all three determinations, the call is routed to processor unit 21
by switch 22; otherwise the call is passed directly through to channel 19. While only one processing unit 21 is shown, all of the channels such as path 18 outgoing from switch 16 are connectable to other processors 21 or to the same time-shared
processor (not shown).
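The routing decision made by logic unit 20 amounts to a conjunction of three tests, which can be sketched as below. The `CallInfo` record, its field names, and the numeric noise measure are hypothetical; the source does not say how the noise level or the call type is measured.

```python
from dataclasses import dataclass


@dataclass
class CallInfo:
    noise_level: float    # measured noise on the incoming channel (hypothetical unit)
    is_voice: bool        # False for fax, modem, and other non-voice calls
    is_subscriber: bool   # originating number subscribes to the noise reduction service


def route_to_processor(call: CallInfo, noise_threshold: float) -> bool:
    """Return True when the call should be switched to processor unit 21."""
    return (call.noise_level > noise_threshold
            and call.is_voice
            and call.is_subscriber)
```

A call failing any one of the three tests would be passed straight through to the outgoing channel.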
The incoming signal from noisy channel 13 may be pre-processed to advantage by an analog filter (not shown) which has a frequency response restricted to that of the baseband telephone signal.
In the system discussed here, the noisy speech is digitized in processor 21 at an 8 KHz rate, to create a time series of sample frames in the frequency domain. Advantageously, the frame size used is 160 samples (20 msec.) and a 50 percent
overlap is imposed on these blocks to ensure continuity of the reconstructed filtered speech.
Referring now to FIG. 1B, processor 21 comprises a model-based iterative signal estimator 23. The signal spectrum used in signal estimator 23 is determined by assuming an all-pole LPC model and iterating each frame to estimate the unknown
parameters. The purpose of signal estimator 23 is to operate on incoming speech to obtain the approximate speech content. The call also is routed via bypass 24 to Voice Activity Detector (VAD) 25, which continuously detects noise or speech-plus-noise
frames and determines if a speech frame is voiced or unvoiced. The required noise spectrum to be used in the signal estimator 23 is estimated from noise-only frames detected by VAD 25.
When a processed frame is detected as noise only, the process in signal estimator 23 is not implemented; and VAD 25 signals a circuit 26 to switch in a suppressor 27 to pass attenuated original input to output channel 19. In this mode, the
noise-only input to signal estimator 23 is attenuated substantially before it goes to the outgoing path 19 and to the far-end listener by gate 28. Additionally, when a noise-only frame is detected, VAD 25 signals a noise weight update function 29
associated with signal estimator 23 to make a new noise spectral estimate based on the current noise frames, and to combine this with the previous noise spectral estimate.
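The handling of a noise-only frame can be sketched as two actions: pass an attenuated copy of the input to the output, and blend the new noise spectral estimate into the previous one. The attenuation gain of 0.1 and the blending weight of 0.25 are placeholder values assumed for illustration, as is the simple convex combination; the source states only that the new estimate is combined with the previous one.

```python
def handle_noise_only_frame(frame, frame_noise_power, prev_noise_power,
                            attenuation=0.1, new_weight=0.25):
    """Process one frame that the VAD has marked as noise only.

    frame:             time-domain samples of the current frame
    frame_noise_power: per-bin power spectrum measured on this frame
    prev_noise_power:  per-bin noise power estimate carried from earlier frames
    attenuation:       gain applied to the passed-through input (assumed value)
    new_weight:        weight given to the new spectral estimate (assumed value)
    """
    # Attenuated input is what the far-end listener hears as comfort noise.
    attenuated = [attenuation * x for x in frame]
    # Convex combination of old and new noise spectra.
    updated = [(1.0 - new_weight) * old + new_weight * new
               for old, new in zip(prev_noise_power, frame_noise_power)]
    return attenuated, updated
```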
When speech is detected by VAD 25, input to circuit 26 is switched to signal estimator 23 such that the filtered speech is passed to the outgoing line 19. In addition, processor 21 sets the order of the LPC speech model for the signal estimator
23 at 10th order if voiced speech is detected and at 4th to 6th order for an unvoiced speech frame. The motivation for this adaptive order of speech model is that the iterative search for the LPC poles can result in false formants in the
frequency band where the ratio of signal power spectrum to noise power spectrum is low. False formants result in noise of tonal nature with random frequency and duration in the filtered output that can be objectionable to the human ear, even though they
are usually very low level relative to the average speech signal amplitude. Hence, since the LPC order typically needed for unvoiced speech is only about half that of voiced speech for the bandwidth of interest, and since unvoiced speech is usually
weaker than voiced speech, it is important to modulate the LPC order such that the speech model is not over-specified for unvoiced speech frames.
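The order adaptation described above reduces to a small lookup on the VAD decision. In this sketch the `FrameType` enumeration and the function name are hypothetical, and the unvoiced order is fixed at 6, one choice within the 4th to 6th order range given in the text.

```python
from enum import Enum


class FrameType(Enum):
    NOISE_ONLY = "noise_only"
    VOICED = "voiced"
    UNVOICED = "unvoiced"


VOICED_ORDER = 10    # order stated for voiced speech
UNVOICED_ORDER = 6   # one choice within the stated 4th to 6th order range


def lpc_order_for(frame_type):
    """Return the LPC model order for a frame, or None if no model is fitted."""
    if frame_type is FrameType.VOICED:
        return VOICED_ORDER
    if frame_type is FrameType.UNVOICED:
        return UNVOICED_ORDER
    # Noise-only frames bypass the signal estimator, so no order applies.
    return None
```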
The processes practiced in the filter 23 are adaptations of the available filter approach in the Lim/Oppenheim reference and of the interframe and intraframe smoothing applied by J. H. L. Hansen to improve the iterative convergence for his
non-real-time AUTO-LSP Smoother discussed in the Hansen/Clements reference. Variations realized by the present invention provide heretofore unrealized real-time filter improvements in noise reduction for telecommunications. The filter operation will
now be described.
SIGNAL-MODEL SMOOTHING ACROSS ADJACENT TIME FRAMES
If the speech is not already in digital form, processor 21 will routinely effect an incoming signal analog-to-digital conversion, which generates frame blocks of sampled input. Frame size of 160 samples, or 20 msec., is a time duration
sufficient for speech sampled at 8 KHz rate to be approximated as a statistically stationary process for LPC modeling purposes. The iterated Wiener Filter and the LPC model of the speech process used as one component of speech estimator 23 are based on
a stationary process assumption wherein the short-term speech is relatively stationary in amplitude and frequency. Hence, it is significant that the frames are processed in these short time blocks.
Referring now to FIG. 2, the input signal plus noise may be expressed by y[n]=s[n]+d[n], where y is the available input sample, and s and d are the signal and noise parts. The samples are blocked into frames which overlap substantially, for
example, by 50 percent. The data blocks are each weighted by a time window, such as the Hanning window, so that the sum of the overlapped windowed frames correctly spaced in time will add to give the original input time series. The use of a window
reduces the variance in the LPC model estimated for a data frame, and frame overlap provides a continuity in the reconstructed filtered signal output to channel 19 in FIG. 1B.
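The framing and reconstruction property described above can be checked with a short sketch. It assumes the periodic form of the Hanning window, for which copies shifted by half a frame sum exactly to one; the function names are illustrative, and no filtering is applied between splitting and reassembly.

```python
import math

FRAME_LEN = 160        # 20 msec. at an 8 KHz sampling rate
HOP = FRAME_LEN // 2   # 50 percent overlap


def hann_periodic(n_len):
    """Periodic Hanning window; half-frame-shifted copies sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_len) for n in range(n_len)]


def split_frames(y, frame_len=FRAME_LEN, hop=HOP):
    """Block y[n] into overlapping, windowed frames."""
    w = hann_periodic(frame_len)
    return [[w[n] * y[start + n] for n in range(frame_len)]
            for start in range(0, len(y) - frame_len + 1, hop)]


def overlap_add(frames, hop=HOP):
    """Sum the frames back at their original time offsets."""
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        for n, value in enumerate(frame):
            out[i * hop + n] += value
    return out
```

Away from the first and last half-frame, which are covered by only one window, `overlap_add(split_frames(y))` returns the input samples to within rounding error.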
As in the iterative AUTO-LSP smoother in the Hansen/Clements reference, there are two types of constraints for the present invention that are applied at each iteration of the signal estimator 23 during the processing of the current frame of input
data. These are the LPC Autocorrelation matrix relaxation constraint applied at each intraframe iteration of the current frame, and the interframe smoothing of the current frame's LPC speech model pole positions across the LPC pole positions realized at
the iteration in process for adjacent past and future frames. The LPC pole constraints are not applied directly since these occur as complex numbers in the Z-plane, and the proper association to make of the complex pole positions for interframe
smoothing is not clear. An indirect approach is possible by using an equivalent representation of the LPC poles called the Line Spectral Pair (LSP), the details of which are discussed in the Hansen/Clements reference and in Digital Speech Processing,
Synthesis, and Recognition, by S. Furui, Marcel Dekker, Inc., New York, N.Y., 1989, Chapter V. The Nth-order LPC model pole positions are equivalently represented by a set of N/2 LSP "position" roots and N/2 LSP "difference" roots that lie on the
Unit Circle in the complex Z-plane. The utility of this equivalent LSP representation of the LPC poles is that lightly damped formant locations in the signal's LPC model spectrum are highly correlated with the LSP position roots, and the bandwidths of
the LPC spectrum at these formants are highly correlated with the LSP difference roots. For a stable LPC model, the two kinds of LSP roots will lie exactly on the Unit Circle and will alternate around this circle. The ordering in position of LSP roots
is obvious, and their smoothing across time frames is much simpler to implement than is the smoothing of complex LPC roots. In summary, the LPC poles at each iteration of the current frame being filtered are smoothed across LPC poles at the same
iteration in adjacent frames by smoothing the equivalent LSP position roots and by applying a lower bound on the minimum distance of a "difference" root to its adjacent "position" root. The latter bounding restrains the sharpness of any LPC model's formants
to be speech-like during the iterative search.
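The two LSP constraints summarized above can be sketched on root angles expressed in radians. The per-frame weights are hypothetical stand-ins for the smoothing tables of FIGS. 7A and 7B, and the rule used to push a difference root away from its nearest position root is an assumption of this sketch; the default bound of 0.086 radians is the value the description reports for telecommunications tests.

```python
D_MIN = 0.086  # radians; value reported in the description for telecommunications tests


def smooth_position_roots(roots_by_frame, weights):
    """Weighted average of each LSP position root across neighboring frames.

    roots_by_frame: one sorted list of position-root angles per frame
    weights:        one nonnegative weight per frame (hypothetical values)
    """
    total = sum(weights)
    n_roots = len(roots_by_frame[0])
    return [sum(w * frame[i] for w, frame in zip(weights, roots_by_frame)) / total
            for i in range(n_roots)]


def constrain_difference_roots(position_roots, difference_roots, d_min=D_MIN):
    """Keep every difference root at least d_min from its closest position root."""
    constrained = []
    for q in difference_roots:
        p = min(position_roots, key=lambda r: abs(r - q))
        if abs(q - p) < d_min:
            # Move q out to the bound on the side it already occupies.
            q = p + d_min if q >= p else p - d_min
        constrained.append(q)
    return constrained
```

A fuller implementation would also need to confirm that the adjusted roots still alternate around the Unit Circle, which this sketch does not check.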
The invention calls for performing the LSP position smoothing across nearby contiguous time frames, but in the filter implemented for real-time application in a communication network, only a few frames ahead of the current frame being filtered
can be available. For 20 msec. frames with 50 percent overlap, the minimum delay imposed by using two future frames as indicated in FIG. 2 is 30 msec. Even this small delay may be significant in some communication networks. The filter discussed here
assumes four past frames and two future frames for smoothing. Although the entire past frames are available, only those correlated with the current frame should be used.
ITERATION PROCESS
The constrained iterative steps performed for the current frame K are shown in FIG. 3 with the iteration 1, ..., J details indicated in FIG. 4. The Wiener Filter-LSP cycle is initiated by filtering the input block y[n] in the frequency
domain, by the Wiener Filter 33 (hereinafter "WF") where the signal and noise power spectral estimates used are $C \cdot S_y(f)$ and $S_d(f)$. That is, the initial WF's signal spectrum component is the total input spectrum scaled by C to have
the expected signal's power $P_{\mathrm{signal}} = P_{\mathrm{total}} - P_{\mathrm{noise}}$. After initialization, the loop in FIG. 3 performs the following steps for the current frame K:
(1) Start the iteration loop by estimating the LPC parameters of the WF output signal where the LPC autocorrelation calculation is subject to a relaxation over autocorrelation values of previous iterations for the frame. This relaxation step
attempts to further stabilize the iterative search for the best speech LPC model. This is discussed below in conjunction with FIG. 5.
(2) From the LPC model found in (1) at iteration j for frame K, solve for the LSP position roots P.sub.j and difference roots Q.sub.j. This requires the real-root solution of two polynomials each of one-half the LPC order.
(3) Smooth the LSP position roots P.sub.j for the current frame K across adjacent frames as indicated in FIG. 2 and FIG. 5C, and constrain the LSP difference roots Q.sub.j to fall a minimum distance away from the smoothed P.sub.j roots. Each
difference root Q.sub.j is constrained to be more than a minimum distance D.sub.min away from its closest smoothed P.sub.j root. This prevents the smoothed LPC pole positions from being driven to the Unit Circle of the complex Z-plane. This
"divergence" was a problem in the Lim/Oppenheim iterative filter of the Lim/Oppenheim reference that was addressed in the smoother in the Hansen/Clements reference. The constraint is desirable so that the search iterates to a realistic speech estimate. The value
D.sub.min =0.086 radians has been used in telecommunications tests of the method.
(4) Convert the smoothed LSP roots to smoothed LPC parameters, and compute the LPC signal model's power spectrum S.sub.s (f).sub.j, scaled such that the average power equals the current K.sub.-- th frame's estimated signal power:
(5) Use the smoothed LPC model signal spectrum S.sub.s (f).sub.j and the current noise power spectrum estimate S.sub.d (f) to construct the next iteration's Wiener Filter H.sub.j (f) as shown in FIG. 3 and FIG. 4. The term Wiener Filter is used
loosely here since this filter is the usual non-causal WF raised to a power pow, and, more generally, S.sub.d (f) may be scaled by some .theta..noteq.1. Values for pow between 0.6 and 1.0 have been used in telecommunications tests of the method. The
larger pow is, the greater the change that occurs with each iteration, but with smaller pow values the iterative search for the speech signal component should be more stable. In the applications tested, the S.sub.d (f) term in WF was not scaled, i.e.,
.theta.=1 was used.
(6) As shown in FIG. 4, filter a combination of the previous iteration's WF time-series output s.sub.j-1 [n] and the original input data y[n] with the current H.sub.j (f) to get the next iteration of signal estimate s.sub.j [n]. The linear
combination used is (1-B).y[n]+B.s.sub.j-1 [n], where 0.ltoreq.B.ltoreq.1. If B=0, the filter operates on the initial input as in the Lim/Oppenheim iterative filter, and if B=1 the input to the next WF is the previous WF output as done in the Hansen
AUTO-LSP smoother in the Hansen/Clements reference. Values of B between 0.80 and 0.95 have been used in most of the experiments on this filter. With these values of B, some desirable features of both the Lim/Oppenheim filter and Hansen smoother were
combined. This weighting concept is new in the present invention. It gives additional control of the amount of final noise content vs. the degree of high-frequency filtering observed in the iterated filtered speech.
The combining of features of the two previous signal-modeled iterative algorithms in the Lim/Oppenheim and Hansen/Clements references, specifically the weighted combination of Wiener Filter inputs each iteration, has been found subjectively to
result in a less muffled sounding speech estimate, with a trade-off of slightly increased residual noise in the output. Combining is shown in FIG. 4, where it is seen that the input signal to the FILTER at the j.sub.-- th iteration is the TOTAL INPUT
y[n] and the Wiener Filter OUTPUT s[n].sub.j-1 from the (j-1).sub.-- th iteration.
(7) In the present implementation of the method, the number of iterations intra is an input parameter determined by experiment. For the results obtained in experiments, 4 to 7 intraframe iterations were used in combinations [intra,
pow] such as [7, 0.65], [5, 0.8], and [4, 1.0] where values of the feedback factor B were between 0.80 and 0.95. The best values depend on the noise class and speech type. For broad band flat noise, intra=6 may be typical; while only 4 or 5 iterations
may suffice when the noise power spectrum is heavily biased below 1 KHz of the [0, 4 KHz] voice-band spectrum.
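Steps (5) and (6) above can be sketched as follows. This is an illustrative reading, not a definitive implementation: it assumes the "usual" non-causal Wiener gain is S.sub.s (f)/(S.sub.s (f)+.theta.S.sub.d (f)) evaluated per frequency bin, the function names are hypothetical, and the exponent is called `power` only to avoid shadowing Python's built-in `pow`. Default values fall within the ranges reported in the description.

```python
def wiener_gain(s_s, s_d, power=0.8, theta=1.0):
    """Per-bin gain H_j(f) = (S_s / (S_s + theta * S_d)) ** power.

    s_s : smoothed LPC model signal power spectrum, one value per bin
    s_d : noise power spectrum estimate, one value per bin
    """
    gains = []
    for ss, sd in zip(s_s, s_d):
        denom = ss + theta * sd
        # Guard against an empty bin (assumption: gain 0 when no power at all).
        gains.append((ss / denom) ** power if denom > 0 else 0.0)
    return gains


def mix_input(y, s_prev, b=0.9):
    """Filter input for iteration j: (1 - B) * y[n] + B * s_{j-1}[n]."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("B must satisfy 0 <= B <= 1")
    return [(1.0 - b) * yn + b * sn for yn, sn in zip(y, s_prev)]
```

With `b=0.0` the filter sees only the original input, as in the Lim/Oppenheim filter; with `b=1.0` it sees only the previous output, as in the Hansen smoother. Intermediate values blend the two.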
FRAME PROCESSING
The method of processing the frames to achieve real-time operation of filter 23 is shown in FIG. 6B. The K.sub.-- th frame is assumed to be the present time reference point with frames K-4,K-3,K-2,K-1 the previously processed and archived frames
while frames K+1 and K+2 are the available future frames. As in the smoothing approach in the Hansen/Clements reference, filter 23 averages the LSP roots of the K.sub.-- th frame speech model with those of the past and future frames at each K.sub.-- th
frame iteration by using the past frame LSP histories found at the iteration cycle in process. However, unlike the non-real-time smoother in the Hansen/Clements reference illustrated in FIG. 6A, the present invention uses only two future frames and also
stores the required past-frame LSP histories during the iterations done for each frame so that it accumulates these histories for the previous four frames to be smoothed with the current frame during each intraframe iteration. Similar to the method of
Hansen/Clements reference, the weights are tapered across the frames as shown in FIG. 2 and the taper used for each LSP root depends on the current frame's SNR as well as the SNR history up to this K.sub.-- th frame.
Another improvement in the invention is the use of table lookup for the frame LSP weights to be applied across frames. Weight tables applied in the invention are of the type shown in FIGS. 7A-7B, whereas the weights required in Hansen/Clements
reference are obtained by time-consuming formula computations which are undesirable in a real-time application. The values applied in the tables in FIGS. 7A-7B can be easily and independently adjusted, unlike the constraints imposed by the formula used
in Hansen/Clements reference. The current frame SNR thresholds at which a weight vector applied to a particular LSP root number is switched from one table to another, are selected independently. The general strategy in constructing smoothing vectors is
to apply more smoothing to the higher order LSP positions (i.e. higher formant frequencies) as indicated reading left to right in the tables. This is due to the greater influence of noise at a given SNR observed on the higher order LSP speech positions.
Another trend imposed on the table values is that smoothing is broad and uniform when the frame SNR is low, and the smoothing domain is decreased as SNR increases, to the point where no smoothing is applied at high SNR. This trend is due to the decreasing
effect of noise on the filtered speech as frame SNR is improved. The frame SNR thresholds used to switch from one table of weight vectors to another are selected as multiples of the current estimate Npow of the noise power estimated from noise-only
frames detected by the VAD. The increasing thresholds used are Th1=2.Npow for change from table Win1 to Win2, Th2=3.Npow from table Win2 to Win3, Th3=7.Npow from table Win3 to Win4, Th4=11.Npow from table Win4 to Win5, with Win0 imposed if a
sufficiently long run of very low SNR frames occurs. The latter window condition is then applied to noise frame runs.
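The table-switching rule above can be sketched as a simple lookup. This is a hedged illustration: the function name is hypothetical, the quantity compared against the thresholds is assumed to be the current frame's power, the comparisons are assumed to be strict, and the criterion for a "sufficiently long run of very low SNR frames" is not specified here, so it is passed in as a flag by the caller.

```python
# Threshold multipliers of Npow and the table used below each threshold,
# taken from Th1..Th4 in the description.
THRESHOLDS = [(2.0, "Win1"), (3.0, "Win2"), (7.0, "Win3"), (11.0, "Win4")]


def select_weight_table(frame_power, npow, long_low_snr_run=False):
    """Return the name of the weight table to use for the current frame.

    frame_power      : power of the current frame (assumed comparison quantity)
    npow             : current noise power estimate from noise-only frames
    long_low_snr_run : True when the caller has detected a long run of very
                       low SNR frames (detection rule is an assumption)
    """
    if long_low_snr_run:
        return "Win0"
    for multiplier, table in THRESHOLDS:
        if frame_power < multiplier * npow:
            return table
    return "Win5"
```

Keeping the multipliers in a small table mirrors the point made above that the thresholds and weights can be adjusted independently, without recomputing a formula per frame.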
USE OF VOICE ACTIVITY DETECTION
An important aspect of the invention is the multiple application of a VAD to both detect noise-only frames from noisy speech and to determine the best model order to apply in each frame by detecting voiced or unvoiced speech if speech is present.
As noted before, the best order for an LPC speech model differs for voiced and unvoiced speech frames. Also, as noted earlier, the noise spectrum is updated only when no voice signal is detected in a sufficient number of contiguous frames. During a time
interval when only noise is detected, suppressor 27 in circuit 26 is activated to attenuate the outgoing original signal, and signal estimator 23 is then inactive. If, however, speech is detected, then circuit 26 switches the output of 23 to the output
channel 19. Further, the class of speech, voiced or unvoiced, conditions the order of the LPC speech model to be used in the signal estimator 23. Also, the detection of change between the three possible states (noise-frame, voiced-frame, and
unvoiced-frame) causes the LSP history for past frames K-4, K-3, K-2, and K-1 to be reinitialized before application of smoothing to the current K.sub.-- th frame. This is both necessary and logical for best speech filtering since the purpose of
smoothing across past time frames is to average disparate noise by making use of the short-term stationarity of speech across the frames averaged.
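The history reset on a state change can be sketched as below. This is an illustration under assumptions: the class and method names are hypothetical, "reinitialized" is taken to mean that the stored past-frame LSP roots are discarded, and the state labels are arbitrary strings supplied by the caller.

```python
from collections import deque


class LspHistory:
    """Holds LSP roots for up to four past frames (K-4 .. K-1)."""

    def __init__(self, depth=4):
        self.frames = deque(maxlen=depth)
        self.state = None  # e.g. "noise", "voiced", "unvoiced"

    def push(self, state, lsp_roots):
        """Record the current frame and return the past frames that may be
        smoothed with it. The history is cleared first if the state changed."""
        if state != self.state:
            # Assumption: reinitializing means dropping the old history.
            self.frames.clear()
            self.state = state
        past = list(self.frames)
        self.frames.append(lsp_roots)
        return past
```

After a transition, the first frame of the new state has no past frames to average with, so roots from a different signal class are not mixed into its smoothing.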
Estimating the noise power spectral density S.sub.d (f) from noise-only frames using a voice activity detector (VAD), in accordance with the invention, provides an advantage. The filter process outlined in FIG. 3 is based on the assumption that
the noise present during speech has the same average power spectrum as the estimated S.sub.d (f). If the noise is statistically wide-sense stationary, noise estimates would not need to be updated. However, for the speech enhancement applications
illustrated herein, and also for many other transmitted noise reduction applications, the noise energy is only approximately stationary. In these cases, a running estimate of S.sub.d (f) is needed. Accordingly, the VAD in FIG. 1B, selected to have good
immunity to noise at the operating SNR, is used to identify when speech is not present. Noise-only frames detected between speech segments are used to update the noise power spectrum estimate, as shown in FIG. 10. One suitable VAD for use in the FIG.
1B application is obtained from the GSM 06.32 VAD Standard discussed in "The Voice Activity Detector for the PAN-EUROPEAN Digital Cellular Mobile Telephone Service," by D. K. Freeman et al., in IEEE Conf. ICASSP. 1989, Section S7.6, pages 369-372.
The pre-filtered and post-filtered speech examples shown in FIGS. 8 and 9 indicate how voice activity detection is used to trigger attenuation of the outgoing signal when no voice is detected. As discussed in the Freeman et al. reference, the
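A running noise estimate driven by VAD decisions can be sketched as follows. The update rule is an assumption: the description states only that the estimate is updated from noise-only frames once enough contiguous frames without speech have been detected, so the exponential averaging, the `alpha` smoothing factor, the `min_run` count, and the class name are all hypothetical choices made for illustration.

```python
class NoiseSpectrumTracker:
    """Maintains a running estimate of the noise power spectrum S_d(f)."""

    def __init__(self, initial, min_run=3, alpha=0.9):
        self.s_d = list(initial)   # current estimate, one value per bin
        self.min_run = min_run     # contiguous noise frames required (assumed)
        self.alpha = alpha         # weight on the old estimate (assumed)
        self.run = 0               # contiguous noise-only frames seen so far

    def update(self, frame_spectrum, is_speech):
        """Feed one frame's power spectrum and its VAD decision;
        return the current noise estimate."""
        if is_speech:
            self.run = 0           # speech interrupts the noise run
            return self.s_d
        self.run += 1
        if self.run >= self.min_run:
            self.s_d = [
                self.alpha * old + (1.0 - self.alpha) * new
                for old, new in zip(self.s_d, frame_spectrum)
            ]
        return self.s_d
```

Requiring a run of noise-only frames before updating is one way to keep an isolated VAD miss from folding speech energy into the noise estimate.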
activation of the VAD on a noise frame may be a convoluted balance of detected input level and repeated frame decisions of "no speech" properties.
IMPROVED OUTPUT USING SPEECH CLASSIFIER
Advantageously, a VAD speech classifier decision may be incorporated in the front end of the LPC model step as shown in FIG. 3. This is because the parameter settings such as LPC or | | |