FIELD OF INVENTION
The present invention is directed to the analysis and resynthesis of
signals, such as speech or other sounds, and more particularly to a system
for analyzing the component parts of a sound, modifying at least some of
those component parts to effect a desired result, and resynthesizing the
modified components into a signal that accomplishes the desired result.
This signal can be converted into an audible sound or used as an input
signal for further processing, such as automatic speech recognition.
BACKGROUND OF THE INVENTION
There exist a number of fields in which it is desirable to modify the
characteristics of a signal, particularly speech or other sound signals, in
order to achieve a desired result. For example, in the coding of speech
for transmission purposes, it is desirable to compress the speech to
thereby reduce the amount of data that is to be transmitted. At the
receiving end of the transmission, the compressed speech is expanded to
reproduce the original sounds. The time scale modification of speech is
also useful in the playback of recorded information. For example, a
secretary who is transcribing recorded dictation may desire to speed up or
slow down the playback rate, so that the words are reproduced at a rate
that matches the typing speed. Of course, when the playback speed differs
from the original recording speed, the pitch of the reproduced sound is
altered, so that it does not sound natural. Consequently, it is desirable
to modify the pitch of the recorded sound in conjunction with the time
scale modification, so that the reproduction will sound more natural.
Another area in which the modification of sounds is useful is in
sound-source separation. For example, when two people are speaking
simultaneously, it is desirable to be able to separate the sounds from the
two speakers and reproduce them individually. Similarly, when a person is
speaking in a noisy environment, it is desirable to be able to separate
the speaker's voice from the background noises.
In each of these areas, as well as others, the signal to be acted upon is
first analyzed, to determine its component parts. Some of these component
parts can then be modified, to produce a particular result, e.g.
separation of the component parts into two groups to separate the voices
of two speakers. Each group of component parts can then be separately
resynthesized, to audibly reproduce the voices of the individual speakers
or otherwise process them individually.
In the past, the analysis of sound, particularly speech, has been typically
carried out with respect to the spectral content of the sound, i.e. its
component frequencies. The various types of analysis which use this
approach rely upon linear models of the human auditory system. In fact,
however, the auditory system is nonlinear in nature. Of particular
interest in this regard is the cochlea, i.e. that portion of the inner ear
which transforms the pressure waves of a sound into electrical impulses,
or neuron firings, that are transmitted to the brain. The cochlea
essentially functions as a bank of filters, whose bandwidths change at
different sound levels. Similarly, neurons change their sensitivity as
they adapt to sound, and the inner hair cells produce nonlinear rectified
versions of the sound. This ability of the ear to adapt to changes in
sound makes it difficult to describe auditory perception in terms of
linear concepts, such as the spectrum or Fourier transform of a sound.
Therefore, a different, and perhaps more useful, approach to the analysis
of sound is from the standpoint of its temporal content. More
particularly, an auditory signal has characteristic periodicity
information that remains undisturbed by most nonlinear transformations.
Even if the bandwidth, amplitude and phase characteristics of a signal are
changing, its repetitive characteristics do not. Furthermore, sounds with
the same periodicity typically come from the same source. Thus, the
auditory system operates under the assumption that sound fragments with a
consistent periodicity can be combined and assigned to a single source.
Along these lines, an analytical tool has been developed which provides a
visual representation of the temporal content of a signal. This tool,
which is called a correlogram, represents the signal as a
three-dimensional function of time, frequency and periodicity. To generate
a correlogram, a one-dimensional acoustic pressure is processed in a
cochlear model. This model produces a two-dimensional map of neural firing
rate as a function of time and distance along the basilar membrane of the
cochlea. Then, by measuring the periodicities of the output signals from
the cochlear model, a third dimension is added to produce the correlogram.
The information contained in the correlogram can be used in a variety of
ways. In addition to sound visualization, it can be used for pitch
detection and modification, as well as sound separation. For further
information regarding the correlogram and its applications, see Slaney et
al., "On The Importance of Time--A Temporal Representation of Sound,"
published in Visual Representation of Speech Signals, edited by Martin
Cooke, Steve Beet and Malcolm Crawford, 1993, John Wiley & Sons Ltd., the
disclosure of which is incorporated herein by reference.
Heretofore, there has been no known technique for resynthesizing the
information in a correlogram into a waveform that can be used to produce
an audible sound or be otherwise processed. Part of the difficulty lies in
the fact that, as a result of the signal processing that takes place to
produce the correlogram, information regarding the phase content of the
original signal is suppressed. Thus it is not possible to simply reverse
the signal processing in order to reproduce the original sound. Rather,
additional steps must be carried out to recover the suppressed phase
information. This problem is further exacerbated if the correlogram is
modified prior to resynthesis, since the modification may result in the
loss of additional information.
Accordingly, it is the general objective of the present invention to
provide a system and process for analyzing a signal, such as sound, with
respect to its component features and reconstructing the signal from those
features. Although not limited thereto, the present invention is
particularly directed to a process which enables information in a
correlogram to be inverted to produce a waveform that can be used to
produce an audible sound or otherwise processed, for example in an
automatic speech recognition system.
BRIEF STATEMENT OF THE INVENTION
In accordance with the foregoing objective, the present invention provides
a signal resynthesis system which is based upon the recognition that each
individual row, or channel, of the correlogram, which is a short-time
autocorrelation function, is equivalent to the magnitude of the short-time
Fourier transform of a signal. By estimating a signal on the basis of its
Short-Time Fourier Transform Magnitude, each channel of information from
the cochlear model can be reconstructed. Once this information is
retrieved, a sound waveform can be resynthesized through approximate
inversion of the cochlear filters, and can be used to generate an audible
sound or otherwise be processed.
The process for reconstructing the cochlear model data can be optimized
with the use of techniques for improving the initial estimate of the
signal from the magnitude of its short-time Fourier transform, and by
employing information that is known a priori about the signal during the
estimation process.
This same approach to sound reconstruction is applicable to other types of
sound analysis systems as well.
The foregoing features of the invention, as well as other aspects thereof,
are explained in greater detail hereinafter with reference to a preferred
embodiment that is illustrated in the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a general block diagram of a sound analysis and resynthesis
system of a type in which the present invention can be employed;
FIG. 2 is a more detailed block diagram of one embodiment of the sound
analysis system;
FIG. 3 is a schematic diagram of the automatic gain control circuit in one
channel of the cochlear model;
FIG. 4 is a detailed block diagram of another embodiment of the cochlear
model;
FIG. 5 is an example of one frame of a correlogram;
FIG. 6 is a pictorial representation of the structure for performing the
short-time autocorrelation;
FIG. 7 is a more detailed schematic representation of the autocorrelation
structure for one channel;
FIG. 8 is a flow chart of the iterative procedure for estimating a signal
from its correlogram;
FIG. 9 is a signal diagram illustrating the overlap and add procedure;
FIG. 10 is a chart comparing the results of signal estimations with and
without synchronization;
FIG. 11 is a flowchart of the correlogram inversion process;
FIG. 12 is a schematic diagram of the AGC conversion circuit;
FIG. 13 is a flow chart of the process for inversion of the half-wave
rectification of the filtered signal;
FIG. 14 is a block diagram of the inverse cochlear filter; and
FIG. 15 is a block diagram of a closed-loop implementation of the sound
analysis and resynthesis system.
DETAILED DESCRIPTION
To facilitate an understanding of the present invention and its
applications, it is described hereinafter with specific reference to its
implementation in a speech analysis and modification system that employs a
cochlear model and correlograms. It will be appreciated, however, that the
practical applications of the invention are not limited to this particular
embodiment.
A speech analysis system, of the type in which the present invention can be
utilized, is illustrated in block diagram form in FIG. 1. Referring
thereto, a speech signal from a source 10, such as a microphone or a
recording, is provided to a sound analysis system 12. The sound analysis
system produces a parametric representation of the original speech signal,
which can then be modified to produce a desired result. For example, the
parametric representation can be time-compressed for transmission purposes
or faster playback, and/or the pitch can be altered. Alternatively, sound
source separation can be carried out, to separate the voice of a speaker
from a noisy background or the like. The particular form of modification
that is carried out at the second stage 14 of the process will depend upon
the result to be produced, and can be any suitable technique for modifying
parametric signals to achieve a desired result. The details of the
particular modification that is employed do not form a part of the
invention, and therefore will not be described herein.
After the appropriate processing to achieve a desired result, the modified
parametric representation undergoes a sound resynthesis process 16. This
process is a pseudo-inverse of the original sound analysis, to produce a
sound which is as close as possible to the original sound, with the
desired modifications, e.g. the original speaker's voice without the
background noise. The result of the sound resynthesis process is a
waveform in the form of an electrical signal which can be applied to an
output device 18 that is appropriate for any particular use of the
waveform. For example, the output device could be a speaker to generate
the modified sound, a recorder to store it for later use, a transmitter, a
speech recognition device that converts the spoken words to text, or the
like.
A more detailed representation of the sound analysis system 12 is
illustrated in block diagram form in FIG. 2. A portion of the sound
analysis system comprises a model 19 of the cochlea in the inner ear. The
cochlea converts pressure changes in the ear canal into neural firing
rates that are transmitted through the auditory nerve. Sound pressure
waves cause motion of the tympanic membrane which in turn transmits motion
through the three ossicles (malleus, incus, and stapes) to the oval window
of the cochlea. These vibrations are transmitted as motion of the basilar
membrane in the cochlea. The membrane has decreasing stiffness from its
base to its apex, which causes its mechanical response to change as a
function of place. The net effect of this physiological arrangement is
that the basilar membrane acts like a set of band-pass filters whose
center frequencies vary with distance along the membrane. Accordingly, the
first portion of the cochlear model 19 comprises a bank 20 of cascaded
filters. The output signals from the early stages of the filter bank
represent the response of the basilar membrane at the base of the cochlea,
and subsequent stages produce outputs that are obtained closer to the
apex. The center frequencies and bandwidths of the filters decrease
approximately exponentially in a direction from base to apex. The output
signal from each filter is referred to as a channel of information, and
represents the signal at a point along the basilar membrane.
Within the cochlea, inner hair cells attached to the basilar membrane are
stimulated by its movement, increasing the neural firing rate of the
connected neurons. Since these hair cells respond best to motion in one
direction, the signal for each channel is half-wave or otherwise
nonlinearly rectified in a second stage 22 of the model.
Another characteristic of the cochlea is the fact that the sensitivity and
the impulse responses of the membrane vary as a function of the sound
level and its recent history. This feature is implemented in the cochlear
model by means of an automatic gain control 24 that modifies the gain of
each channel. As the level of the signal, e.g. its power, increases in a
given frequency region, the gain is correspondingly reduced.
A more detailed diagram of an automatic gain control circuit for one
channel is shown in FIG. 3. Referring thereto, the half-wave rectified
signal x from the filter is multiplied by a gain value G in a multiplier
25 to produce an output signal y. The circuit monitors the level of the
output signal y to set the gain to an appropriate value that maintains the
signal level within a suitable range. The AGC circuit 24 also functions to
model the coupling that occurs between locations along the basilar
membrane. To this end, the circuit receives inputs regarding the gain
factor in the adjacent channels, at a summer 26. These inputs, together
with the level of the signal y, are modified by two filter parameters, e
and t, to generate a state variable. The parameter e represents the time
constant for the filter, and t is a target value for the gain. To prevent
instability, the state variable for the AGC filter can be limited to a
maximum value of 1 in a limiting circuit 27. Furthermore, to ensure that
the gain is never zero, the state variable can be limited to a value which
is less than one by a small amount epsilon (eps). The state variable is
subtracted from the value unity in a summer 28, to determine the gain
amount G which is multiplied with the input signal x. The state variable
is also supplied to the adjacent left and right channels to provide for
the coupling between channels.
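The behaviour of one AGC stage can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the text does not
give the exact filter equation, so the state update below (a first-order
smoothing of the scaled output level and the three neighbouring states) is
an assumed form, and the names agc_step and limit are hypothetical.

```python
def agc_step(x, state, e, t, eps=1e-3):
    """Advance one AGC stage by one sample across all channels.

    x     -- half-wave rectified inputs, one value per channel
    state -- state variables from the previous sample, one per channel
    e, t  -- filter parameters (time-constant coefficient and target value)
    Returns (y, new_state).
    """
    limit = 1.0 - eps                      # limiting circuit 27: keeps gain above zero
    n = len(x)
    # summer 28 and multiplier 25: gain G = 1 - state, output y = G * x
    y = [x[c] * (1.0 - state[c]) for c in range(n)]
    new_state = []
    for c in range(n):
        # summer 26: couple with the adjacent channels (edge channels reuse their own state)
        left = state[c - 1] if c > 0 else state[c]
        right = state[c + 1] if c < n - 1 else state[c]
        # ASSUMED update rule: smooth the scaled output level with the coupled states
        s = e * y[c] / t + (1.0 - e) * (left + state[c] + right) / 3.0
        new_state.append(min(s, limit))
    return y, new_state
```

With a constant input, this assumed rule settles at a state of x/(x+t), so
louder inputs drive the gain toward, but never to, zero.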
Preferably, the AGC circuit for each channel is made up of multiple AGC
stages of the type shown in FIG. 3, e.g. four, which are cascaded
together. Each of the filters has a different time constant e and output
target value t, with the first filter in the series having the largest
time constant (smallest e value) and largest target value.
An alternative embodiment of a cochlear model is shown in FIG. 4. In this
embodiment, the AGC circuits 24 do not directly modify the level of the
half-wave rectified signals from the filters 20. Rather, an adaptive AGC
configuration is employed to modify the parameters of the filters
themselves.
The output signals which are obtained from the cochlear model 19 provide a
parametric representation of the input signal. This representation, which
is referred to as a cochleagram, comprises a time-frequency
representation, that can be used to analyze and display sound signals. A
more useful representation of the original signal is provided, however,
when its temporal structure is considered. To this end, the short-time
autocorrelation of each channel in the cochleagram is measured in a
subsequent stage 30 (FIG. 2), as a function of cochlear place, i.e. best
frequency, versus time. The autocorrelation operation is a function of a
third variable. Consequently, the resulting output data is a
three-dimensional function of frequency, time and autocorrelation delay.
All autocorrelations which end at the same time can be assembled into a
frame of data. By displaying successive frames at a rate that is
synchronized with the sound, a moving image of the sound can be provided.
This moving image, or the data that it represents, is referred to as a
correlogram. An example of one frame of a correlogram is shown in FIG. 5.
The short-time autocorrelator can be implemented by means of a group of
tapped delay lines with multiplication, such as a CCD array. Referring to
FIG. 6, each channel of data from the cochlear model 19 is fed to one row
of a CCD array 32. Each stage of the array provides a delayed version of
the input signal. The instantaneous value of the signal is compared with
each of the delayed versions, for example by multiplying and integrating
the signals as shown in FIG. 7. The pattern of autocorrelation versus
delay time characterizes the periodicity of the original sound.
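A software analogue of the tapped-delay-line correlator can be sketched as
follows. This is a minimal illustration, not the CCD implementation; it
assumes a rectangular analysis window, and the function names are
hypothetical.

```python
def short_time_autocorrelation(channel, end, window_len, max_lag):
    """Autocorrelation of the window_len samples of one channel ending at index `end`."""
    start = max(0, end - window_len + 1)
    seg = channel[start:end + 1]
    # Multiply the signal by each delayed copy of itself and integrate (sum)
    return [sum(seg[n] * seg[n - lag] for n in range(lag, len(seg)))
            for lag in range(max_lag + 1)]


def correlogram_frame(cochleagram, end, window_len, max_lag):
    """One correlogram frame: one autocorrelation row per cochlear channel."""
    return [short_time_autocorrelation(ch, end, window_len, max_lag)
            for ch in cochleagram]
```

For a channel that repeats every P samples, the row shows a peak at lag P,
which is the periodicity cue described above.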
The circuits for the cochlear model and the autocorrelator can be
implemented on a single chip. For further information regarding such an
implementation, as well as a more detailed explanation of the individual
circuits, see Lyon, "CCD Correlators for Auditory Models", Proceedings of
the Twenty-Fifth Asilomar Conference on Signals, Systems and Computers,
IEEE 785-789, Nov. 4-6, 1991, the disclosure of which is incorporated
herein by reference.
As noted above, the correlogram is a useful tool for analyzing and
processing speech signals. For example, if different portions of the
correlogram represent signals that have different periodicity, these
portions can be identified as emanating from different sources. These
portions can then be separated from one another, to thereby separate the
sound sources. Once the sound sources have been separated, their
correlograms can be inverted to reproduce the waveforms that were used to
produce them. These waveforms can then be processed as desired, or further
inverted to resynthesize the original sounds. To resynthesize the sound,
each channel of the correlogram must first be inverted to reconstruct the
cochleagram. The reconstructed cochleagram must then be inverted to arrive
at the original sound signal.
The inversion of the correlogram is based upon the recognition that the
autocorrelation function is related to the square of the magnitude of the
Fourier transform of a signal. Thus, the correlogram provides information
pertaining to the magnitude of the Fourier transform of the signal that
was autocorrelated.
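The relationship between the autocorrelation and the squared Fourier
magnitude can be checked numerically. The sketch below uses a circular
autocorrelation and a direct DFT so that the identity holds exactly; a
linear autocorrelation would need zero padding, which is omitted here for
brevity. The helper names are illustrative.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform of a short sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def circular_autocorrelation(x):
    """r[k] = sum over n of x[n] * x[(n + k) mod N]."""
    n = len(x)
    return [sum(x[i] * x[(i + k) % n] for i in range(n)) for k in range(n)]


x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
power_from_autocorr = dft(circular_autocorrelation(x))
power_from_spectrum = [abs(c) ** 2 for c in dft(x)]
# The two lists agree to rounding error; the phase of dft(x) is not recoverable.
```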
To facilitate an understanding of the correlogram inversion process, a
brief description of some of the principles relating to Fourier analysis
is set forth herein. More complete analyses of these principles are
contained in the publications that are referenced in the following
description.
If $x(n)$ denotes a real sequence, for example the samples of a sound
waveform or a cochlear model channel output, its Short-Time Fourier
Transform (STFT) is given as $X_w(mS, \omega)$. The analysis window
used to calculate the STFT, $w(n)$, is defined to be real and non-zero for
$0 \le n \le L-1$. Applying the window to the sequence creates a
windowed portion of the sequence ending at a time index $mS$:
$$x_w(mS, n) = x(n)\,w(mS - n) \qquad (1)$$
The variable S sets the amount of shift between windows and the index, m,
is the window number. For each sequence of data so defined, the STFT is
calculated to be
##EQU1##
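As an illustration of the windowing in Equation 1, the following sketch
builds the windowed sequence and evaluates its Fourier transform at a
single frequency. The transform is written in the conventional form (a sum
of the windowed samples times a complex exponential), which is an
assumption made here because the equation image is not reproduced in this
text; the function names are hypothetical.

```python
import cmath


def windowed_sequence(x, w, m, S):
    """Return {n: x(n) * w(mS - n)} for the n where the window is non-zero."""
    L = len(w)
    out = {}
    for n in range(len(x)):
        k = m * S - n                 # window argument mS - n
        if 0 <= k < L:                # w is defined only for 0 <= k <= L-1
            out[n] = x[n] * w[k]
    return out


def stft_at(x, w, m, S, omega):
    """Assumed STFT form: sum over n of x_w(mS, n) * exp(-j * omega * n)."""
    return sum(v * cmath.exp(-1j * omega * n)
               for n, v in windowed_sequence(x, w, m, S).items())
```

For example, with x = [1, 2, 3, 4, 5, 6], w = [1.0, 0.5], S = 2 and m = 2,
the window ends at index 4 and covers samples 3 and 4 only.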
The STFTs created from a signal are unique and consistent, so that given
the STFTs at a sufficient number of window locations, the signal can be
reconstructed exactly. However, an arbitrary set of STFTs might not
correspond to a signal. A procedure has been developed to estimate the
best signal $x(n)$, given a set of STFTs, $Y_w(mS, \omega)$. See Griffin
and Lim, "Signal Estimation From Modified Short-Time Fourier Transform,"
IEEE Transactions on Acoustics, Speech and Signal Processing, April 1984,
pp. 236-243. This procedure can be employed in the practice of the present
invention.
The signal estimation problem using a row of the correlogram, however,
starts with the short-time auto-correlation function. The short-time
auto-correlation function, $R_x(mS, \omega)$, can be calculated from
the STFT, using the Fourier transform, and is written
##EQU2##
where * indicates complex conjugation. The short-time auto correlation
function provides information about the magnitude of the STFT, but not the
phase. The magnitude squared of the STFT is given by
##EQU3##
Therefore, an approach using only the magnitude of the STFT, i.e.,
$|Y_w(mS, \omega)|$, must be employed to find the
best estimate, $\hat{x}(n)$, of the original signal, $x(n)$. An iterative procedure
to arrive at the best estimate was developed by Griffin and Lim, and is
described in the publication identified above.
In the application of that procedure to the present invention, the
magnitude of the STFT, $|Y_w(mS, \omega)|$, is given,
and an initial guess is made for the phase. One readily apparent guess is
to assume zero phase, which leads to a maximally peaky signal that looks
roughly speech-like. This initial STFT, $|Y_0(mS, \omega)|$,
will not necessarily be a valid STFT, however. The
following iterations can be carried out to improve the estimate.
A new estimate for the signal, $x_i(n)$, is calculated from
$|Y_{i-1}(mS, \omega)|$ based on the following
procedure known as overlap-and-add:
##EQU4##
where the index $i$ represents the number of iterations that have occurred
and $y_{i-1}(mS, n)$ is the inverse Fourier transform of
$Y_{i-1}(mS, \omega)$, which is equal to $y'_{i-1}(mS-n)$, where $y'_{i-1}$ has
zero phase when the difference between $mS$ and $n$ is zero. At this point an
estimate for the time-domain signal has been obtained. The phases of
individual STFTs are forced to be consistent by adding the overlapping
windows together.
The next step in the iteration procedure is to calculate the STFT of
$x_i(n)$:
##EQU5##
The phase of this new STFT is kept, the magnitude is replaced with the
known value, $|Y_w(mS, \omega)|$, and this new
modified STFT is used in the next iteration of the procedure.
This process of determining an estimated signal and finding its Fourier
transform, substituting the known magnitude information into the
transform, and calculating a new estimate can be repeated in an iterative
manner until the results begin to converge to a best estimate $\hat{x}(n)$. The
phase information for each STFT is calculated from the most recent
estimate of the signal, while the magnitude is always set back to that
which was originally supplied. This iterative procedure is illustrated in
Steps 31 and 33 of the flow chart shown in FIG. 8.
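The iteration can be sketched in plain Python as shown below. This is a
minimal illustration in the spirit of the Griffin and Lim procedure, not a
reproduction of the equations of this specification: it assumes frames
that start at multiples of S, a DFT whose size equals the window length,
and a least-squares overlap-and-add. All function names are illustrative.

```python
import cmath


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def stft(x, w, S):
    """Windowed DFT of each frame; frames start at 0, S, 2S, ..."""
    L = len(w)
    return [dft([x[s + n] * w[n] for n in range(L)])
            for s in range(0, len(x) - L + 1, S)]


def overlap_add(frames, w, S):
    """Least-squares signal estimate from a set of (possibly invalid) STFT frames."""
    L = len(w)
    length = (len(frames) - 1) * S + L
    num = [0.0] * length
    den = [0.0] * length
    for m, F in enumerate(frames):
        y = idft(F)
        for n in range(L):
            num[m * S + n] += w[n] * y[n]
            den[m * S + n] += w[n] ** 2
    return [a / d if d > 1e-12 else 0.0 for a, d in zip(num, den)]


def spectral_error(x, mags, w, S):
    """Squared distance between the STFT magnitudes of x and the given magnitudes."""
    return sum((abs(c) - g) ** 2
               for F, M in zip(stft(x, w, S), mags) for c, g in zip(F, M))


def estimate_from_magnitude(mags, w, S, iterations=50):
    # Initial guess: the known magnitudes with zero phase
    frames = [[complex(g) for g in M] for M in mags]
    x = overlap_add(frames, w, S)
    for _ in range(iterations):
        # Keep the phase of the current estimate, restore the known magnitude
        frames = [[g * (c / abs(c)) if abs(c) > 1e-12 else complex(g)
                   for c, g in zip(F, M)]
                  for F, M in zip(stft(x, w, S), mags)]
        x = overlap_add(frames, w, S)
    return x
```

The spectral_error helper is included only so that convergence can be
monitored; with this form of overlap-and-add the error should not increase
from one iteration to the next.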
In essence, therefore, the best estimate for the original signal x(n) is
obtained by overlapping and adding the windowed time series obtained from
the Short-Time Fourier Transform. Each window of information is obtained
from the inverse Fourier transform of the STFT magnitude corresponding to
the correlogram. Preferably, the length L of the window is restricted to
be a multiple of four times the amount of window shift S. With this
approach, computational requirements can be reduced because the
denominator of the foregoing equation will be unity when a sinusoidal
window as defined by the following is used:
##EQU6##
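The unity-denominator property can be verified numerically. The window
equation itself is not reproduced in this text, so the sketch below
assumes a scaled (modified) Hamming window of the kind used by Griffin and
Lim; the exact form and the function names are assumptions made for
illustration.

```python
import math


def modified_hamming(L, S, a=0.54, b=-0.46):
    """ASSUMED sinusoidal window, scaled so that overlapped squares sum to one."""
    scale = 2.0 * math.sqrt(S) / math.sqrt((4 * a * a + 2 * b * b) * L)
    return [scale * (a + b * math.cos(2 * math.pi * n / L + math.pi / L))
            for n in range(L)]


def overlap_energy(w, S, n):
    """Sum of w^2 over every shifted window that covers sample offset n (0 <= n < S)."""
    return sum(w[n + k * S] ** 2 for k in range(len(w) // S))


w = modified_hamming(32, 8)                      # L = 4S
energies = [overlap_energy(w, 8, n) for n in range(8)]
# Each entry equals 1 to rounding error, so the division in overlap-and-add can be skipped.
```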
As successive iterations of the process illustrated in FIG. 8 are carried
out, the results converge to a locally optimum solution $\hat{x}(n)$. The number
of iterations that are required to develop this set of points will be
largely dependent upon the accuracy of the initial estimate $x_0(n)$.
In the above-referenced publication by Griffin and Lim, they suggest that
25-100 iterations may be required. However, if the accuracy of the initial
guess can be improved, the number of required iterations can be
significantly reduced.
A speech waveform is characterized by a large number of peaks and troughs.
In a straightforward application of the overlap and add technique that is
used to obtain the initial estimate of a speech signal, prior knowledge of
the peaky nature of the signal provides a motivation to overlap each
successive window of information on the series with zero phase shift. In
other words, with reference to FIG. 9, when the information from window m
is added to the series, it is placed at a location that is displaced from
the information of the previous window by an amount equal to S. However,
the accuracy of the initial estimate can be significantly increased if the
relative locations of the window m and the previously developed data are
shifted so that they are synchronized with one another. The amount of the
shift is obtained by maximizing the cross-correlation of the information
in window m with the remainder of the estimated signal up to window m-1.
One procedure for determining the initial estimate in this manner is
described in Roucos et al., "High Quality Time-Scale Modification for
Speech," Proceedings of the 1985 IEEE Conference on Acoustics, Speech and
Signal Processing, 1985, pp. 493-496, the disclosure of which is
incorporated herein by reference.
To briefly illustrate the application of such a procedure to the present
invention, let $x^{(m)}(n)$ represent the state of the signal estimate
after the first $m$ windows of data have been overlapped and added. An
initial value $x^{(0)}(n)$ for the signal estimate is defined as follows:
$$x^{(0)}(n) = w(n)\,y_w(0, n) \qquad (8)$$
Thereafter, the information from the next window, $y_w(m, n)$, is shifted
and added to the initial estimate. The amount of overlap is defined so
that the cross-correlation of the original estimate and the newly added
window of information is at a maximum. This cross-correlation,
$R_{xy_w}$, is defined as follows:
##EQU7##
The magnitude of the shift, $k$, is limited to one quarter of the window
length. Once $k_{max}$ (the value of $k$ with the largest coefficient) is found, it is
used to overlap and add the $m$-th window in the following manner:
$$x^{(m)}(n) = x^{(m-1)}(n) + w(n)\,y_w(mS, n + k_{max}) \qquad (10)$$
This process is repeated until all the windows have been added to the
estimate, and x(n) is then divided by the denominator of Equation 5. The
result of this process provides the initial estimate for the signal
$x_0(n)$ in the procedure of FIG. 8.
In the frequency domain, this procedure is approximately equal to adding a
linear phase to each window of data that is overlapped-and-added to form
$x_0(n)$. To be perfectly proper, the shifts in Equations 9 and 10
should be circular but they are well approximated by a conventional linear
shift.
The synchronized overlap-and-add procedure represented by Equations 9 and
10 essentially involves a process in which a window m of data is located
at a position indicated by mS, and the phase of the underlying signal
$x^{(m-1)}(n)$ is shifted until a maximum correlation is obtained.
Alternatively, it is possible to shift both the data and the window $m$ by
the amount $k$. In this alternative approach, the initial estimate $x^{(0)}(n)$
is again defined as set forth in Equation 8, and the denominator of
Equation 5 is defined as $c(n)$, where
$$c^{(0)}(n) = w^2(n) \qquad (11)$$
Once the value for $k_{max}$ is found according to Equation 9, the $m$-th
window is added to the signal estimate in the following manner:
$$x^{(m)}(n) = x^{(m-1)}(n) + w(mS - k_{max} - n)\,y_w(mS, n + k_{max}) \qquad (12)$$
In addition, the value for $c(n)$ is updated as follows:
$$c^{(m)}(n) = c^{(m-1)}(n) + w^2(mS - k_{max} - n) \qquad (13)$$
Once all of the windows have been added in this manner, the value for $x(n)$
is then divided by $c(n)$, to obtain $x_0(n)$.
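The alternative approach, in which the data and the window move together
and a running denominator is kept, can be sketched as follows. The
cross-correlation equation is not reproduced in this text, so a normalized
cross-correlation is assumed here, and frame m is assumed to sit nominally
at sample mS. The function names are hypothetical.

```python
import math


def best_shift(estimate, frame, pos, max_shift):
    """Shift k (|k| <= max_shift) that best aligns `frame` with `estimate` near `pos`."""
    best_k, best_score = 0, float("-inf")
    for k in range(-max_shift, max_shift + 1):
        start = pos + k
        if start < 0 or start + len(frame) > len(estimate):
            continue
        seg = estimate[start:start + len(frame)]
        num = sum(a * b for a, b in zip(seg, frame))
        den = math.sqrt(sum(a * a for a in seg) * sum(b * b for b in frame))
        score = num / den if den > 0 else 0.0   # ASSUMED: normalized cross-correlation
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def synchronized_overlap_add(frames, w, S):
    """Add each windowed frame at its best-aligned position, then divide by c(n)."""
    L = len(w)
    max_shift = L // 4                          # shift limited to a quarter window
    length = (len(frames) - 1) * S + L + max_shift
    x = [0.0] * length                          # running numerator
    c = [0.0] * length                          # running sum of squared windows
    for m, y in enumerate(frames):
        wy = [w[n] * y[n] for n in range(L)]
        k = 0 if m == 0 else best_shift(x, wy, m * S, max_shift)
        for n in range(L):
            x[m * S + k + n] += wy[n]
            c[m * S + k + n] += w[n] ** 2
    return [a / d if d > 1e-12 else 0.0 for a, d in zip(x, c)]
```

The result would serve as the initial estimate for the iterative
procedure; samples that no window covers are simply left at zero.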
It has been found that this approach, in which each window of information
is synchronized with the previously developed signal, significantly
improves the process of estimating a signal from a set of STFT magnitudes.
FIG. 10 illustrates an example in which a 300 Hz sinusoidal signal, which
is modulated at 60 Hz, is reconstructed from its STFT magnitudes, for the
two cases in which the initial estimate is obtained with and without the
synchronizing approach described above. As can be seen therefrom, the
initial error is reduced by about half when the synchronized approach is
employed. In addition, the error is smaller for the same number of
iterations when the windows are synchronized. Thus, fewer iterations of
the inversion process are needed, thereby reducing the required
computational resources.
In fact, the initial estimate x(n) may be sufficiently accurate that no
iterations of the procedure shown in FIG. 8 would be necessary. In a
further simplification of the initial signal estimation process, the
windowed correlograms can be directly employed, rather than transforming them
into the power spectrum domain, taking the square root of the spectrum to
obtain the magnitude, and then transforming the result back to the time
domain. This approach to the estimation of the signal from the
autocorrelation function, although much simpler, is practical because the
temporal structure of the original signal is preserved in the
autocorrelation function, and the amplitude for a channel is also
reflected in the amplitude of each autocorrelation function, in a squared
form.
To further improve the correlogram inversion process, information that is
known about the original signals can be employed to create a better
estimate and further reduce the computational load. More particularly, it
is known that the signals are half-wave rectified in the cochlear model.
Accordingly, after each iteration of the overlap and add procedure, the
signal estimate is preferably half-wave rectified.
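Applied to a signal estimate held as a list of samples, this constraint is
a one-line projection. The sketch below assumes ideal half-wave
rectification (negative samples set to zero); the function name is
illustrative.

```python
def half_wave_rectify(x):
    """Clamp negative samples to zero, leaving non-negative samples unchanged."""
    return [max(0.0, v) for v in x]


estimate = [-1.0, 0.5, 0.0, -0.2, 2.0]
constrained = half_wave_rectify(estimate)   # applied after each overlap-and-add pass
```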
It is also known that, prior to half-wave rectification, the signals in
each channel of the correlogram are linearly delayed relative to one
another by the stages of the cochlear filter. This information can be
employed to predict the phase of successive channels after the first
channel's signal is inverted by means of the overlap and add procedure.
If a channel is labelled as $\lambda_1$, its signal is identified as
$x(\lambda_1, n)$. From the signal estimated for channel $\lambda_1$,
a set of STFTs for that signal, i.e., $X_w(\lambda_1, mS, \omega)$,
can be calculated using the procedures illustrated in FIGS. 8 and 9, and
the phase information retained. The phase for each window of the next
channel $\lambda_2$ is given by the phase of the $\lambda_1$
channel, or
##EQU8##
where the operator $\angle$ represents phase as a unit magnitude complex
vector. It is possible to employ this previously derived phase information
for later channel calculations because the channels share a lot of
information. With knowledge of the fact that the cochlear filter
introduces a phase delay between channels, the anticipated phase change
between channels $\lambda_1$ and $\lambda_2$ can also be included in
the estimate. If the two channels are not adjacent, the phase change
across the appropriate number of stages in the cochlear filter should be
included. In this case, the estimated phase is changed to
##EQU9##
The STFT magnitudes (STFTMs) and their estimated phase functions are combined to create a
set of estimated STFTs
$$X_w(\lambda_2, mS, \omega) = Y_w(\lambda_2, mS, \omega)\,\angle X_w(\lambda_2, mS, \omega) \qquad (16)$$
which are used to create the windows of data
##EQU10##
Finally, these sequences are combined in the synchronized overlap and add
method to create the initial estimate of the signal for channel
.lambda..sub.2,
##EQU11##
which is used to initialize the correlogram inversion process described
previously.
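By way of illustration only, the phase-seeding step described above can be sketched as follows. The sketch assumes that the STFTs are held as complex NumPy arrays of shape (windows, frequency bins). The function names and the `phase_shift` argument are hypothetical; `phase_shift` merely stands in for the anticipated phase change across the intervening cochlear filter stages, whose exact form is given by the equations above and is not reproduced here.

```python
import numpy as np

def unit_phase(stft, eps=1e-12):
    """Phase of each STFT bin as a unit-magnitude complex number."""
    return stft / np.maximum(np.abs(stft), eps)

def seed_next_channel(stft_prev, mag_next, phase_shift=None):
    """Combine the STFT magnitudes of the next channel with the phase
    retained from the previously inverted channel.

    stft_prev   : complex STFT of the channel already inverted
    mag_next    : STFT magnitudes (STFTM) of the channel to be inverted
    phase_shift : optional phase change in radians (hypothetical stand-in
                  for the inter-channel delay of the cochlear filter)
    """
    phase = unit_phase(np.asarray(stft_prev, dtype=complex))
    if phase_shift is not None:
        phase = phase * np.exp(1j * np.asarray(phase_shift))
    return np.asarray(mag_next) * phase
```

The array returned by `seed_next_channel` would then be converted to windows of data and overlap-added to form the initial estimate for the next channel.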
The foregoing procedures invert the information in the correlogram to
reconstruct a waveform corresponding to the cochleagram that was used to
produce the correlogram. The process for inverting the correlogram can be
carried out in a computer that is suitably programmed in accordance with
the foregoing procedures and equations. The overall operation of the
computer to carry out the process is summarized in the flowchart of FIG.
11. As shown therein, Steps 31 and 33 are iteratively repeated until the
signal estimates converge. Alternatively, it is possible to carry out a
fixed number of iterations. The appropriate number of iterations to use
can be empirically determined to assure reasonable convergence in most
cases.
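The iteration and its two stopping rules may be illustrated by the following sketch. It is not the procedure of FIG. 11 itself, but a generic magnitude-only overlap-and-add estimate in the style of the well-known Griffin-Lim iteration, with the half-wave rectification constraint applied after each pass. All function names, the window, the hop size, the tolerance and the random starting estimate are assumptions made for the example.

```python
import numpy as np

def stft(x, win, hop):
    """Windowed short-time Fourier transform, one row per window."""
    n = len(win)
    starts = range(0, len(x) - n + 1, hop)
    return np.array([np.fft.rfft(x[s:s + n] * win) for s in starts])

def overlap_add(spectra, win, hop, length):
    """Least-squares overlap-and-add of the inverse-transformed windows."""
    n = len(win)
    y = np.zeros(length)
    norm = np.zeros(length)
    for k, spec in enumerate(spectra):
        s = k * hop
        y[s:s + n] += np.fft.irfft(spec, n=n) * win
        norm[s:s + n] += win ** 2
    return y / np.maximum(norm, 1e-12)

def estimate_channel(mag, win, hop, length, max_iter=100, tol=1e-5, seed=0):
    """Estimate a half-wave rectified signal from its STFT magnitudes.

    Stops when the relative magnitude error improves by less than `tol`
    (convergence) or after `max_iter` passes (fixed iteration count).
    Each entry of `errors` is measured at the start of a pass.
    """
    rng = np.random.default_rng(seed)
    x = np.maximum(rng.standard_normal(length), 0.0)  # assumed starting point
    errors = []
    for _ in range(max_iter):
        spec = stft(x, win, hop)
        err = np.linalg.norm(np.abs(spec) - mag) / np.linalg.norm(mag)
        converged = bool(errors) and errors[-1] - err < tol
        errors.append(err)
        if converged:
            break
        # Keep the current phase, impose the known magnitudes, resynthesize,
        # then apply the known half-wave rectification constraint.
        phase = spec / np.maximum(np.abs(spec), 1e-12)
        x = np.maximum(overlap_add(mag * phase, win, hop, length), 0.0)
    return x, errors
```

Setting `tol` to zero or a negative value approximates the fixed-iteration alternative, since the loop then runs for `max_iter` passes unless the error stops decreasing.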
Of course, where the correlogram has been modified, the reconstructed
cochleagram that is obtained with the foregoing procedure will be modified
in a similar manner. For example, if the correlogram is modified to
isolate the sounds from a particular source, the information in the
reconstructed cochleagram will pertain only to the isolated sound.
The reconstructed waveform that is obtained through the correlogram
inversion process can be directly applied to some utilization devices.
More particularly, the waveform corresponding to the reconstructed
cochleagram is a time-frequency representation of the original signal,
which can be directly input to a speech recognition unit, for example, to
convert the speech information into text. Alternatively, it may be
desirable to further process the reconstructed cochleagram to resynthesize
the original sound. To obtain the original (or modified) sound, the
reconstructed cochleagram must be inverted. This inversion can involve
three steps: AGC inversion, inversion of the half-wave rectification, and
inversion of the cochlear filters.
Each channel in the cochleagram is scaled by a time varying function
calculated by the AGC filter. In order to invert this operation, it is
necessary to determine the scaling function at each instant in time. Upon
examination of the circuit of FIG. 3, it is evident that the loop gain is
dependent only on the AGC output, which can be approximated from the
inverted correlogram. Thus, by swapping the input and output points, and
dividing instead of multiplying by the loop gain, the AGC is inverted. The
restructured filter to perform the inversion is shown in FIG. 12. As can
be seen, it is similar to the circuit of FIG. 3, except that the input
signal y is divided by the gain value to produce an output signal x. If
the AGC for each channel consists of multiple stages, the AGC inversion
will also require multiple stages, in reverse order.
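The principle of the AGC inversion can be illustrated with the following sketch. It does not reproduce the circuit of FIG. 3 or FIG. 12. It assumes a hypothetical single-stage AGC in which the gain is one minus a smoothed state variable driven only by the AGC output, and the parameters `eps` and `target` are illustrative. Because the state depends only on the output, the inverse can recompute the same gain from the output and divide by it.

```python
import numpy as np

def _next_state(state, out, eps, target):
    """Update the loop state from the AGC output only (assumed model)."""
    state += eps * (abs(out) / target - state)
    return min(max(state, 0.0), 0.99)  # keep the gain within (0.01, 1]

def agc_forward(x, eps=0.05, target=1.0):
    """Forward AGC: multiply the input by the loop gain."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    state = 0.0
    for n, xn in enumerate(x):
        y[n] = xn * (1.0 - state)
        state = _next_state(state, y[n], eps, target)
    return y

def agc_inverse(y, eps=0.05, target=1.0):
    """Inverse AGC: input and output swapped, divide instead of multiply."""
    y = np.asarray(y, dtype=float)
    x = np.empty_like(y)
    state = 0.0
    for n, yn in enumerate(y):
        x[n] = yn / (1.0 - state)
        state = _next_state(state, yn, eps, target)
    return x
```

For an AGC made up of several such stages, the inverse stages would be applied in the reverse order of the forward stages.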
To prevent the AGC inversion process from becoming unstable, it may be
necessary to limit the level of the input signal to the cochlear model. If
the original input signal to the model is too large, the forward gain is
small. During the inversion process, the input signal is divided by the
small gain. If there are any errors in the reconstructed cochleagram, they
become magnified and could create instability. However, by limiting the
level of the input signal, this potential problem is avoided. The actual
limit is best determined empirically, by performing inversion for signals
with different amplitudes.
The inversion of the half-wave rectification is based upon the method of
convex projections, given the known properties of the signal. It is known
that the signals which form the cochleagram are half-wave rectified and
band limited in the cochlear model. It has been previously shown that a
band-limited signal and its half-wave rectified representation create
closed convex sets, where a convex set is defined as a set in which, given
any two points in the set, their midpoint is also a member of the set.
See, for example, Yang et al., "Auditory Representations of Acoustic
Signals," IEEE Transactions on Information Theory, Vol. 38, No. 2, March
1992, pp. 824-839, the disclosure of which is incorporated herein by
reference. Thus, by applying the method of convex projections as described
in the Yang et al. publication to the signals obtained from the circuit of
FIG. 12, the half-wave rectification can be inverted.
To illustrate, the positive values in the time domain of the originally
filtered signals are known from the inverted correlogram, as well as the
fact that these signals are band limited. By bandpass filtering each
signal in the frequency domain, a new signal is formed which includes
negative values. These negative values can be combined with the known
positive values, and the resulting signal can again be bandpass filtered.
By iterating between these two domains in this manner, the results
converge to an approximation of the original signal from each channel of
the cochlear model. This process is illustrated in the flowchart of FIG.
13, and can be implemented in a computer or in an analogous hardware
circuit.
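The alternation between the two domains can be sketched as follows. This is a minimal example under stated assumptions and not the flowchart of FIG. 13: the pass band is represented by a hypothetical 0/1 mask over the real-FFT bins, the bandpass filtering is performed by zeroing the out-of-band bins, and samples at which the rectified signal is zero are constrained to be non-positive.

```python
import numpy as np

def bandlimit(x, band_mask):
    """Project onto the band-limited signals by masking FFT bins."""
    return np.fft.irfft(np.fft.rfft(x) * band_mask, n=len(x))

def invert_hwr(rectified, band_mask, n_iter=200):
    """Estimate a band-limited signal from its half-wave rectified version.

    rectified : half-wave rectified channel signal (known positive values)
    band_mask : array of 0/1 weights, one per bin of np.fft.rfft (assumed)
    """
    rectified = np.asarray(rectified, dtype=float)
    known = rectified > 0.0
    x = rectified.copy()
    for _ in range(n_iter):
        # Frequency domain: bandpass filtering introduces negative values.
        x = bandlimit(x, band_mask)
        # Time domain: restore the known positive values and keep only
        # non-positive values at the remaining samples.
        x = np.where(known, rectified, np.minimum(x, 0.0))
    return bandlimit(x, band_mask)
```

Both steps are projections onto convex sets that contain the original signal, so the distance between the estimate and the original signal cannot increase from one step to the next.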
Finally, the inversion of the cochlear filter involves a reversal of the
structure of the filter, coupled with a time reversal of both the output
signal of each channel and the final result. The structure of the inverse
cochlear filter is shown in FIG. 14. Note that the data y.sub.n from each
channel of the cochleagram is fed into the structure at the appropriate
point in a time-reversed manner, i.e., backwards. A spectral tilt
correction can be applied to the time-reversed signal to adjust the gain
of any frequencies where the combination of the forward and the inverse
cochlear filters has a gain that is not equal to unity. Finally, the
ultimate result is reversed to obtain the original waveform, which can
then be applied to an appropriate output device, for example a speaker to
produce the desired sound, a recorder, or the like.
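The effect of the time reversal and of the spectral tilt correction can be illustrated with a deliberately simplified sketch. It does not implement the structure of FIG. 14. A single one-pole low-pass filter with an assumed coefficient `a` stands in for the cochlear filter, and all function names are hypothetical. Passing the time-reversed output through the same filter and reversing the result cancels the phase of the forward filter, leaving a combined gain equal to the squared magnitude response, which the tilt correction then divides out. Because the correction is a real-valued gain, it can be applied before or after the final reversal with the same result.

```python
import numpy as np

def one_pole(x, a):
    """Stand-in forward filter: y[n] = a*y[n-1] + (1-a)*x[n]."""
    y = np.empty(len(x))
    prev = 0.0
    for n, xn in enumerate(x):
        prev = a * prev + (1.0 - a) * xn
        y[n] = prev
    return y

def reversed_pass(y, a):
    """Feed the signal in backwards, filter, and reverse the result."""
    return one_pole(y[::-1], a)[::-1]

def tilt_correct(z, a):
    """Divide out the combined gain |H|^2 of the forward and reversed passes."""
    w = 2.0 * np.pi * np.fft.rfftfreq(len(z))
    h = (1.0 - a) / (1.0 - a * np.exp(-1j * w))
    return np.fft.irfft(np.fft.rfft(z) / np.abs(h) ** 2, n=len(z))
```

In this simplified setting, an impulse passed through `one_pole`, `reversed_pass` and `tilt_correct` in turn is recovered closely, provided the filter response has decayed before reaching the ends of the buffer.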
Many of these disclosed steps are optional, depending upon the desired
result and available resources. If the AGC inversion is not performed, for
example, some computational effort is saved and the output will be
compressed in a perceptually relevant manner. The cochlear filter is
basically a bank of bandpass filters, and therefore the HWR inversion