Claims
We claim:
1. A method of discriminating noise and voice energy in a communication
signal, comprising the steps of:
for a plurality of block periods:
sampling said signal a number of times to obtain sample values;
calculating a block energy value for said signal by summing the squares of
said sample values from said number of samples; and
for an update period equal to a sum of said plurality of block periods:
assigning a maximum block energy value calculated during said update period
to a variable E.sub.max ;
assigning a minimum block energy value calculated during said update period
to a variable E.sub.min ;
calculating a noise energy threshold value based on the relative values of
E.sub.max and E.sub.min, wherein between a first upper bound and a first
lower bound said noise energy threshold may assume a continuum of values;
calculating a voice energy threshold value based on the relative values of
E.sub.max and E.sub.min, wherein between a second upper bound and a second
lower bound said voice energy threshold may assume a continuum of values;
and
updating said noise energy threshold and said voice energy threshold in
accordance with said calculations for their respective values;
said voice energy estimation value E.sub.voice is updated according to the
formula:
E.sub.voice, n =(1-.alpha..sub.voice)*E.sub.voice,n-1 +.alpha..sub.voice
*E.sub.n, where E.sub.voice, n
is said voice energy estimation value for said current block period,
.alpha..sub.voice is a voice time constant, E.sub.voice, n-1 is said voice
energy estimation value for an immediately preceding voice block period,
and E.sub.n is said current block energy; and
said noise energy estimation value E.sub.noise is updated according to the
formula:
E.sub.noise, n =(1-.alpha..sub.noise)*E.sub.noise,n-1 +.alpha..sub.noise
*E.sub.n, where E.sub.noise,n
is said noise energy estimation value for said current block period,
.alpha..sub.noise is a noise time constant, E.sub.noise, n-1 is said noise
energy estimation value for an immediately preceding noise block period,
E.sub.n is said current block energy.
2. The method of claim 1, further comprising the steps of:
performing the steps of claim 1 for a plurality of said update periods; and
calculating an adaptive discrimination threshold, used to discriminate said
block periods containing voice energy from those containing noise energy,
based on the relative values of either E.sub.max and E.sub.min or a noise
energy estimation variable, E.sub.noise, and a voice energy estimation
variable, E.sub.voice, wherein between certain bounds said discrimination
threshold may assume a continuum of values.
3. The method of claim 2, further comprising the step of:
selecting one of three algorithms for calculating said discrimination
threshold based upon a number of characteristics of said signal, wherein
a first algorithm, associated with a first state, is used to calculate said
discrimination threshold when a noise energy margin and a voice energy
margin are distinguishably detected in said signal;
a second algorithm, associated with a second state, is used to calculate
said discrimination threshold when a tone or stationary noise is detected
in said signal; and
a third algorithm, associated with a third state, is used to calculate said
discrimination threshold when neither said noise and voice energy margins
are distinguishably detected nor said tone or stationary noise is detected
in said signal.
4. The method of claim 3, wherein:
for said first algorithm, said discrimination threshold is assigned a value
given by a product of said noise energy estimation variable E.sub.noise
and a continuous function of the ratio of said voice energy estimation
variable E.sub.voice to said variable E.sub.noise ;
for said second algorithm, said discrimination threshold is assigned a
value of either a constant or a multiple of said variable value of
E.sub.max ; and
for said third algorithm, said discrimination threshold is assigned a value
given by a product of said variable E.sub.min and a continuous function of
the ratio of said variable E.sub.max to said variable E.sub.min.
5. The method of claim 4, further comprising the steps of:
smoothing said third state discrimination threshold value for a current
update period, of said plurality of update periods, using the equation
expressed as: T'.sub.m+1 =0.5*T.sub.m +0.5*T.sub.m+1, where T'.sub.m+1 is
said smoothed third state discrimination threshold value for said current
update period, T.sub.m+1 is said third state discrimination threshold
value for said current update period, and T.sub.m is said smoothed third
state discrimination threshold value for a last previous update period, of
said plurality of update periods, of said third state; and
assigning said smoothed third state discrimination threshold value,
T'.sub.m+1,
for said current update period to said third state discrimination
threshold value, T.sub.m+1, for said current update period, wherein said
smoothing reduces the instantaneous variability of said third state
discrimination threshold.
6. The method of claim 5, further comprising the steps of:
calculating a value of said variable E.sub.noise using geometric averaging;
and
calculating a value of said variable E.sub.voice using geometric averaging.
7. The method of claim 6, further comprising the steps of:
ascribing said current block period as containing voice if said current
block energy value exceeds said current state discrimination threshold
value; and
ascribing said current block period as containing noise if said current
block energy value is less than said current state discrimination
threshold value.
8. The method of claim 7, further comprising the steps of:
updating said voice energy estimation value E.sub.voice when said current
block energy exceeds said voice energy threshold value; and
updating said noise energy estimation value E.sub.noise when said current
block energy is less than said noise energy threshold value.
9. The method of claim 7, further comprising the steps of:
calculating a zero cross rate of said signal for each of said plurality of
block periods; and
ascribing said current block period as containing voice if said zero cross
rate of a block period immediately preceding said current block period
exceeds or equals a zero cross rate threshold value.
10. The method of claim 9, wherein:
said zero cross rate, ZCR, is calculated according to the equation:
##EQU12##
where L is the number of samples in said current block and x(l) is said
sample value for an l.sup.th sample of said number of samples.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The invention relates to methods for conservation of bandwidth in a packet
network. More specifically, the invention relates to methods for reducing
the bandwidth consumption in voice-over packet networks by improved
detection of active signals, background noise, and silence.
2. Description of the Background Art
A system for bandwidth savings, known as time assignment speech
interpolation (TASI), was introduced to increase the capacity of submarine
telephone cables used in analog telephony. TASI was subsequently replaced
with a similar digital system. Such schemes are commonly known as digital
speech interpolation (DSI) systems. As multimode and variable-rate speech
coding techniques have improved, several promising silence compression
standards have been developed and issued to address the bandwidth saving
problem. The algorithm standardized by the GSM for use in the Pan-European
digital Cellular Mobile Telephone Service is an example of a voice
activity detection (VAD) technique designed for the mobile environment.
Another VAD algorithm for wireless applications is provided in the
TIA/EIA/IS-127 Enhanced Variable Rate Codec standard. There are two
silence compression standards from ITU: G.723.1 Annex A, and G.729 Annex
B.
Although these standards for bandwidth savings are very effective, their
complexity is very high. The complexity of these methods derives from the
fact that they rely upon processing the spectral features of a signal,
which requires an analysis of the frequency and/or spectrum of the signal
to identify the characteristics of speech, voice, or other distinct
signals. These methods require adaptive algorithms to reduce noise, band
pass filters to isolate speech, and the like to identify accurately
characteristics of the signal to detect voice from other sounds, signals,
or noise.
Complex standards require complex algorithms and therefore require
significant processing capabilities. The method of the present invention
significantly reduces complexity and therefore can be implemented in high
channel density wired telephony applications. The present invention is
simple in terms of processing and memory requirements and results in
excellent performance.
SUMMARY OF THE INVENTION
In voice-over packet applications, the speech signal is transmitted using
data packets. The general telephone network limits the bandwidth of the
speech signal to the 300 to 3,400 Hz range. In most speech codecs, the
signal is sampled at 8 kHz, resulting in a maximum signal bandwidth of
4 kHz. Each sample is represented with 16 bits, resulting in a 128 kbps
bit rate.
To save on bandwidth, PCM and ADPCM codecs are widely used in telephony
applications and are important in high channel density implementation of
voice-over packet applications. For the purpose of bandwidth savings with
PCM and ADPCM codecs, voice activity detection is used to distinguish
silence from active signal. Silence packets are not transmitted during
nonspeech intervals, which effectively increases the number of channels. In
voice-over packet applications, the input speech level can vary from
-50 dBm0 to 0 dBm0, the facsimile signal level can vary from -48 dBm0 to
0 dBm0, and the noise properties may change considerably during a
conversation.
To detect signal activity accurately under different signal input and noise
conditions, the energy threshold is adapted to the input signal and noise
levels. Because of its adaptive function, the corresponding signal
activity detection algorithm herein provides bandwidth savings with low
complexity and low delay and performs well for a wide range of signal
energy input levels and background noise environments as well as signal
energy level changes. Because the bandwidth savings may change based on
packet network traffic load, the algorithm is dynamically configurable to
adjust the bandwidth savings percentages.
In development of voice-over packet network applications, a reliable
bandwidth saving method is crucial to achieve a desirable balance between
acceptable perceived sound quality and reduction in bandwidth
requirements. The variety of working conditions imposes a number of
challenges upon such a method. The bandwidth savings need to be
accomplished with both low delay and low complexity. The method must
perform well for a wide range of input signal levels, must work in a
variety of background noise environments, and must be robust in the
presence of active signal and/or background noise level changes. Since the
bandwidth requirements may change based on network factors such as load or
traffic conditions or because of changing performance needs, the present
invention is dynamically configurable to perform well under different
requirements. The noise environment commonly changes in real time, and the
present invention monitors such changes and adjusts dynamically to
accomplish bandwidth savings and to perform well under a wide variety of
conditions.
The present invention accomplishes efficient savings in bandwidth through a
system for active signal (e.g., voice, facsimile, dialtone) and background
noise detection and discrimination which utilizes block energy threshold
adaptation, adaptive marginal signal/noise discrimination, state control
logic, and active signal smoothing. The system distinguishes active signal
(e.g., voice, speech, etc.) from background noise to allow for the
compression or elimination of periods of silence or background noise. The
system includes a state machine for logic control in establishing a
dynamic adaptive threshold, below which the signal is identified as
silence or background noise, and above which the signal is identified as
active signal. The threshold is established by factors, including an
active signal estimation technique from discrimination of noise below a
first threshold and active signal above a second threshold. Signal between
the thresholds cannot be discriminated and is therefore not used in the
estimation to avoid loss of voice through misidentification as noise or
silence. The system is efficient in detection of active signals and
elimination of noise, while maintaining a safety margin to avoid
degradation of voice quality by misidentification of low voice signals as
background or silence.
The state machine, FIG. 2, includes the flow logic, FIG. 3, for updating
the adaptive block energy threshold used for threshold detection, FIG. 1.
There are three states in the state machine: learning state, converged
state, and constant envelope state. Learning state is the initial and
default state, where the system does not have any reliable estimates of
noise or active signal energy levels. The state control logic 6 is in
converged state when the current energy level threshold is acceptable and
the noise and signal level estimations are reliable. When the input signal
has an approximately constant envelope, the state machine is in the constant
envelope state to distinguish facsimile from background noise in order to
identify facsimile as active signal, not noise.
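The relationship between the three states and the threshold forms recited in claims 3 and 4 can be sketched as follows. This is an illustrative reading, not the disclosed implementation: the mapping of the converged, constant envelope, and learning states onto the first, second, and third algorithms is inferred from the text, the coefficient function placeholder_k is a hypothetical stand-in for the curves of FIG. 4 and FIG. 5 (which are not reproduced here), and envelope_factor is an assumed multiple of E.sub.max. All energies are assumed to be strictly positive.

```python
import math
from enum import Enum


class State(Enum):
    LEARNING = "learning"                    # initial and default state
    CONVERGED = "converged"                  # level estimates are reliable
    CONSTANT_ENVELOPE = "constant envelope"  # tone, facsimile, stationary noise


def placeholder_k(ratio):
    """Hypothetical continuous coefficient; NOT the curve of FIG. 4 or FIG. 5."""
    return math.sqrt(ratio)


def discrimination_threshold(state, e_min, e_max, e_noise, e_voice,
                             k=placeholder_k, envelope_factor=0.5):
    """Return the block energy threshold for the given state (claim 4 forms)."""
    if state is State.CONVERGED:
        # Product of E_noise and a continuous function of E_voice / E_noise.
        return e_noise * k(e_voice / e_noise)
    if state is State.CONSTANT_ENVELOPE:
        # A multiple of E_max (a constant is the other option in claim 4).
        return envelope_factor * e_max
    # Learning state: product of E_min and a continuous function of E_max / E_min.
    return e_min * k(e_max / e_min)


print(discrimination_threshold(State.LEARNING, 4.0, 100.0, 1.0, 1.0))
```

With the square-root placeholder, each ratio-based threshold is the geometric mean of the two energies involved, which keeps it between them. Any other continuous coefficient function could be passed through the k parameter.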
The system utilizes signal energy detection to establish and adjust the
adaptive lower and upper thresholds. The signal is divided into blocks of
a desired length, and signal features relating to the signal energy level
are extracted for analysis to determine signal feature characteristics
used to establish noise and active signal predictive thresholds. These
established thresholds are used to discriminate the signal.
A signal from a source is first processed to determine the energy E.sub.(n)
of the signal. The energy level is processed into energy vectors
corresponding to discrete time intervals, for analysis. Each block is
first processed by comparison with an initial set of thresholds within a
marginal signal and noise discriminator, to discriminate initially between
noise and signal. If below a first noise threshold, the block is
classified as noise. If above a second voice threshold, the block is
classified as active signal. Once discriminated, blocks below the noise
threshold are used in noise level estimation, and blocks above the active
signal threshold are used in active signal level estimation. Blocks
between the thresholds are not used in level estimation. In this manner
the present invention creates a clear separation between signal and noise.
These processed signal blocks are then used to create active estimates of
the noise level and of the active signal level. The estimation is a
continuous processing activity updated as further signal blocks are
discriminated and made available to the estimator. In the exemplary
embodiment, estimation is performed using a combination RMS/geometric
averaging of block energies under the control of the marginal signal and
noise discriminator. However, either RMS or geometric averaging alone
could be used, as could other power estimation techniques, sample based or
block based averaging. The method of both sampling and averaging can be
varied through a change of factors such as time constants, frame size for
block energy threshold detection, changing noise and/or signal thresholds,
elimination of a discrimination gap between noise and signal, estimate
noise/voice division, etc., still within the scope of the invention as
herein taught.
The estimates of noise level and active signal level are later used in
establishing the adaptive thresholds with which the threshold detector
processes the current signal block to determine whether it is noise or
voice. That determination is used in establishing an output decision for
use in compression for bandwidth savings.
The determined energy level E.sub.(n) of the signal is also supplied to a
threshold detector to make the detection between noise and active signals.
The current values of the adaptive thresholds within the detector, as
established from the active estimates of noise signal and active signal
level based upon the control of the state control logic, are used to
classify an input block as "active signal" or "noise" by comparing the
corresponding block energy E.sub.(n) with the adaptive threshold. The
threshold adaptation is performed based upon a current one of several
available algorithms selected by a state control logic based upon the
dynamics of the signal estimation processing. Different threshold
functions are applied to the detection based upon the reliability of these
estimates and the consistency of the signal envelope.
Weak active signals, which may present intermittent low signal levels, can
be misclassified as noise. In order to reduce misclassification, the
output of the threshold detector is smoothed. By smoothing, short term
active signal drops are not classified as noise and subsequently
improperly compressed. The smoothed output of the threshold detector is
used as the output decision of the system method. The smoothing mechanism
is influenced by the traffic load configuration. In the exemplary
embodiment, a hang-over period smoothing method is implemented.
Alternative delay methods or smoothing algorithms can be implemented.
However, the computational processing power needed to perform signal
smoothing processing must be considered in implementing the present
invention, which relies upon simplification for effective implementation.
The output decision is then used by the voice-over packet network
communication system to implement the desired processing of the current
packet for bandwidth savings by appropriate compression based upon the
simplified active signal/noise discrimination of the present invention.
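A hang-over period smoothing of the threshold detector output can be sketched as follows. This is a minimal illustration under stated assumptions: the function name hangover_smooth and the parameter hangover_blocks are hypothetical, and the length of the hang-over period, which the text says is influenced by the traffic load configuration, is not specified in this section.

```python
def hangover_smooth(raw_decisions, hangover_blocks):
    """Hold the 'active' decision for hangover_blocks blocks after the last active block.

    raw_decisions: iterable of booleans from the threshold detector
                   (True = active signal, False = noise).
    """
    smoothed = []
    remaining = 0  # blocks of hang-over still available
    for active in raw_decisions:
        if active:
            remaining = hangover_blocks  # re-arm the hang-over counter
            smoothed.append(True)
        elif remaining > 0:
            remaining -= 1               # short drop: keep reporting active
            smoothed.append(True)
        else:
            smoothed.append(False)       # sustained low energy: report noise
    return smoothed


raw = [True, False, True, False, False, False, False]
print(hangover_smooth(raw, 2))
```

A longer hang-over protects more weak speech at the cost of bandwidth savings, which is one way the smoothing could be tied to the traffic load.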
In energy-based signal activity detection, one of the difficulties is that
a simple energy measure cannot distinguish low-level speech sounds (weak
active signal) from background noise if the signal-to-noise ratio is not
high enough. In the implementation of the preferred embodiment of the
present invention as described below, the following assumptions have been
made. However, these values can be adjusted to process signals according
to desired design parameters while remaining within the inventive concept
taught herein:
during natural conversation, within a long enough period of time, there
will exist at least one silence frame (i.e., a signal frame that does not
contain speech sounds) of a minimum duration;
during natural conversation, weak speech sounds should normally last only
for short periods of time;
the short-term statistics (up to 1.5 seconds) of a noise are stationary or
pseudo-stationary;
the block energy threshold should be a function of noise level, active
signal level, and signal-to-noise ratio.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is an overall block diagram for the signal processing and threshold
detection system of the present invention.
FIG. 2 is a block diagram illustrating the interaction of the states of the
state control logic of the present invention.
FIG. 3 is a logic flow chart illustrating the threshold update process of
the state control logic of the present invention.
FIG. 4 is a graph illustrating the coefficient K(E.sub.max /E.sub.min) for
the learning state of the state control logic of the present invention.
FIG. 5 is a graph illustrating the coefficient K(E.sub.voice /E.sub.noise)
for the converged state of the state control logic of the present
invention.
DETAILED DESCRIPTION OF PREFERRED EXEMPLARY EMBODIMENTS
FIG. 1 is a block diagram illustrating an exemplary embodiment of the
overall logic flow of the present invention. The signal from a source in a
packet network passes through splitter 9 and is inputted into block 1
where the signal energy is calculated.
The signal energy is calculated using a block energy calculation technique
where the input signal is partitioned into non-overlapping 2.5 ms blocks.
The 2.5 ms exemplary block size results in 20 samples/block, when an 8 kHz
sampling rate is used. The block energy is calculated as a sum of squared
samples or by a root-mean-square algorithm. The calculation can be performed
according to a standard signal energy algorithm such as:
##EQU1##
for example, where: N=20 if 2.5 ms blocks are used and N=40 if 5 ms blocks
are used.
Table I illustrates a typical result of the block energy calculation. In
the algorithm as implemented in an exemplary embodiment, the block length
is N=40 samples (5 ms), the threshold update period is L=256 blocks
(1.28 sec), the update subperiod is S=32 blocks (160 ms), and the dimension
of the minimum/maximum energy vectors is D=L/S=8 (eight subperiods within a
period). In the following example, shortened for the sake of illustration,
N=5, L=12, and S=4, and therefore D=3.
TABLE I
Block    Samples               Energy Value
1        -1,  3,  3,  1,  3    29
2         1, -2, -3, -2,  0    18
3         2, -2,  3,  0, -2    21
4         2,  0, -1,  1,  1     7
5         2,  4,  0,  3, -4    45
6         4, -3, -3,  3,  2    47
7        -4, -5,  3, -4, -3    75
8         1, -3, -1, -5,  4    52
9         0, -1,  0, -2, -1     6
10       -3,  0,  2,  0,  1    14
11       -3, -2,  2,  1, -1    19
12        0,  2, -5,  1, -5    55
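The Energy Value column of Table I can be checked with a short sketch, assuming the sum-of-squares form of the block energy and the shortened block length N=5. The function name block_energies and the variable names are illustrative and do not come from the disclosure.

```python
def block_energies(samples, block_len):
    """Sum of squared samples for each non-overlapping block of block_len samples."""
    n_blocks = len(samples) // block_len  # a trailing partial block is ignored
    return [
        sum(x * x for x in samples[b * block_len:(b + 1) * block_len])
        for b in range(n_blocks)
    ]


# Sample values of Table I: twelve blocks of N=5 samples each.
samples = [
    -1, 3, 3, 1, 3,     1, -2, -3, -2, 0,   2, -2, 3, 0, -2,
    2, 0, -1, 1, 1,     2, 4, 0, 3, -4,     4, -3, -3, 3, 2,
    -4, -5, 3, -4, -3,  1, -3, -1, -5, 4,   0, -1, 0, -2, -1,
    -3, 0, 2, 0, 1,     -3, -2, 2, 1, -1,   0, 2, -5, 1, -5,
]
energies = block_energies(samples, 5)
print(energies)  # [29, 18, 21, 7, 45, 47, 75, 52, 6, 14, 19, 55]
```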
The calculated block energies are used to extract features from the input
signal at block 2 of FIG. 1. Using the calculated block energies, the
following features are extracted every 1.28 seconds:
1. Minimum energy vector.
2. Maximum energy vector.
3. Minimum energy.
4. Maximum energy.
The minimum and maximum energy vectors are obtained by
partitioning a 1.28-second period into eight parts. For each part the
minimum and maximum block energies are determined. The minimum and maximum
energies are determined from the minimum and maximum energy vectors,
respectively. In an exemplary embodiment, 5 ms block energy features are
extracted for each threshold update period (1.28 seconds). Other block
size and update periods can be used as appropriate for the signal, the
desired compression, active signal quality and bandwidth savings. The
threshold update period is partitioned into eight non-overlapping subperiod
intervals j of 160 ms (each of length N=32 blocks of 5 ms, i.e., the update
subperiod S above). Minimum and maximum energy vectors E.sub.vct--min and
E.sub.vct--max are extracted as follows:
E.sub.vct--min (j)=min{E(n)} and E.sub.vct--max (j)=max{E(n)}
where E(n) is the 5 ms block energy, j=0, 1, 2, . . . , 7, and
n ∈ [jN, (j+1)N-1].
The minimum energy and maximum energy are the minimum or maximum 5 ms block
energy during the whole threshold update period, i.e., Emin=min{Evct--min}
and Emax=max{Evct--max}. The 2.5 ms block threshold block energy E(1) is
extracted for the threshold detector 5 while the 2.5 ms block-based zero
crossing rate is considered as an optional feature which can be extracted
for consideration in threshold determination by the state control logic 6.
Because zero crossing rate is strongly affected by dc offset, a highpass
filter should be used if the input signal has dc components. Block-based
zero crossing rate can be extracted as follows:
##EQU2##
where L=20 is the block length.
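Because the equation above is not reproduced in this text, the following sketch assumes a common definition of the block-based zero crossing rate: the number of sign changes between adjacent samples, with a zero sample treated as positive. Whether the original equation normalizes by the block length is not known. The remove_dc helper is a crude, hypothetical stand-in for the highpass filter mentioned above, and simply subtracts the block mean.

```python
def zero_crossing_rate(block):
    """Count sign changes between adjacent samples (assumed definition)."""
    sgn = [1 if x >= 0 else -1 for x in block]  # zero is treated as positive
    # Each sign change contributes |sgn(l) - sgn(l-1)| = 2, hence the halving.
    return sum(abs(a - b) for a, b in zip(sgn[1:], sgn[:-1])) // 2


def remove_dc(block):
    """Stand-in for a highpass filter: subtract the block mean (assumption)."""
    mean = sum(block) / len(block)
    return [x - mean for x in block]


offset_block = [5, 3, 5, 3]  # alternating signal riding on a dc offset of 4
print(zero_crossing_rate(offset_block))             # 0: the offset hides every crossing
print(zero_crossing_rate(remove_dc(offset_block)))  # 3 once the offset is removed
```

The two printed values illustrate why the text recommends removing dc components before the zero crossing rate is extracted.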
Table II illustrates an exemplary feature extraction from the exemplary
block energies illustrated in Table I.
TABLE II
Block #  Block Energy  Emin Vector  Emax Vector  Min Energy  Max Energy
1        29
2        18
3        21
4         7             7           29
5        45
6        47
7        75
8        52            45           75
9         6
10       14
11       19
12       55             6           55            6          75
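The feature extraction of Table II can be reproduced with a short sketch, assuming the shortened parameters of the example (L=12 blocks per update period, S=4 blocks per subperiod, D=3). The function name extract_features is illustrative only.

```python
def extract_features(energies, subperiod_len):
    """Return (min_vector, max_vector, e_min, e_max) for one threshold update period."""
    parts = [
        energies[i:i + subperiod_len]
        for i in range(0, len(energies), subperiod_len)
    ]
    min_vector = [min(p) for p in parts]  # one minimum per subperiod
    max_vector = [max(p) for p in parts]  # one maximum per subperiod
    # The overall minimum and maximum are taken from the vectors themselves.
    return min_vector, max_vector, min(min_vector), max(max_vector)


# Block energies of Table I.
energies = [29, 18, 21, 7, 45, 47, 75, 52, 6, 14, 19, 55]
min_vec, max_vec, e_min, e_max = extract_features(energies, 4)
print(min_vec, max_vec, e_min, e_max)  # [7, 45, 6] [29, 75, 55] 6 75
```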
Marginal Signal/Noise Discriminator.
The purpose of the marginal signal and noise discriminator, block 3, is to
keep a distance or gap between noise level and active signal level, so
that overlapped parts of active signal and noise block energies can be
eliminated before the subsequent noise and active signal energy
estimations. The noise energy level estimate and the active signal energy
level estimate are used by state control logic 6 during threshold
establishment in the "converged state." Establishing a region between a
maximum noise level and a minimum active signal level is accomplished by
maintaining two energy margins: one for noise, and the other for active
signal. When block energy is below the noise margin, it is considered
noise and used in noise level estimation. Similarly, when block energy is
above the active signal margin, it is considered active signal and used in
active signal level estimation. Otherwise, the block energy is not used in
level estimation. The output of estimator 4 is used by state control logic
6 to select the current state based upon the signal envelope consistency
and reliability. Therefore, the estimates of noise and active signal
energy are independent of the output results of the bandwidth savings
algorithm, and divergence due to misclassification can be avoided.
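The gating performed by the two energy margins can be sketched as follows, using the recursive update recited in claim 1 for the level estimates. The function name update_estimates, the default time constants alpha_noise and alpha_voice, and the margin values in the usage lines are assumptions chosen for illustration, not values taken from the disclosure. The sketch assumes the noise margin lies below the active signal margin.

```python
def update_estimates(e_n, e_noise, e_voice, noise_margin, voice_margin,
                     alpha_noise=0.1, alpha_voice=0.1):
    """Update (e_noise, e_voice) from the current block energy e_n.

    Blocks whose energy falls between the two margins are not used in
    either level estimate.
    """
    if e_n < noise_margin:
        # Below the noise margin: treat as noise and update the noise level.
        e_noise = (1 - alpha_noise) * e_noise + alpha_noise * e_n
    elif e_n > voice_margin:
        # Above the active signal margin: update the active signal level.
        e_voice = (1 - alpha_voice) * e_voice + alpha_voice * e_n
    return e_noise, e_voice


# Hypothetical starting estimates and margins, fed with the Table I energies.
e_noise, e_voice = 10.0, 60.0
for e_n in [29, 18, 21, 7, 45, 47, 75, 52, 6, 14, 19, 55]:
    e_noise, e_voice = update_estimates(e_n, e_noise, e_voice, 20, 40)
print(e_noise, e_voice)
```

Because the update depends only on the block energy and the margins, the estimates do not feed on the final active signal/noise decision.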
Signal/Noise Level Estimation.
The signal and noise level estimation 4 is performed using the geometric
averaging of block energies under the control of the marginal signal and
noise discriminator. The outputs are active signal level and noise level.
These outputs represent an ongoing adaptive estimate of the a | | |