FIELD OF THE INVENTION
The present invention generally relates to a method of encoding a signal
containing speech and more particularly to a method employing a linear
predictor to encode a signal.
DESCRIPTION OF THE RELATED ART
A modern communication technique employs a Codebook Excited Linear
Prediction (CELP) coder. The codebook is essentially a table containing
excitation vectors for processing by a linear predictive filter. The
technique involves partitioning an input signal into multiple portions
and, for each portion, searching the codebook for the vector that produces
a filter output signal that is closest to the input signal.
The typical CELP technique may distort portions of the input signal
dominated by noise because the codebook and the linear predictive filter
that may be optimum for speech may be inappropriate for noise.
OBJECT AND SUMMARY OF THE INVENTION
It is an object of the present invention to provide a method of encoding a
signal containing both speech and noise while avoiding some of the
distortions introduced by typical CELP encoding techniques.
Additional objectives and advantages of the invention will be set forth in
the description that follows and in part will be obvious from the
description, or may be learned by practice of the invention. The objects
and advantages of the invention may be realized and attained by means of
the instrumentalities and combinations particularly pointed out in the
appended claims.
To achieve the objects and in accordance with the purpose of the invention,
as embodied and broadly described herein, a method of processing a signal
having a speech component, the signal being organized as a plurality of
frames, is used. The method comprises the steps, performed for each frame,
of determining whether the frame corresponds to a first mode, depending on
whether the speech component is substantially absent from the frame;
generating an encoded frame in accordance with one of a first coding
scheme, when the frame corresponds to the first mode, and a second coding
scheme when the frame does not correspond to the first mode; and decoding
the encoded frame in accordance with one of the first coding scheme, when
the frame corresponds to the first mode, and the second coding scheme when
the frame does not correspond to the first mode.
BRIEF DESCRIPTION OF THE DRAWINGS
The foregoing and other objects, aspects and advantages will be better
understood from the following detailed description of a preferred
embodiment of the invention with reference to the drawings, in which:
FIG. 1 is a block diagram of a transmitter in a wireless communication
system according to a preferred embodiment of the invention;
FIG. 2 is a block diagram of a receiver in a wireless communication system
according to the preferred embodiment of the invention;
FIG. 3 is a block diagram of the encoder in the transmitter shown in FIG. 1;
FIG. 4 is a block diagram of the decoder in the receiver shown in FIG. 2;
FIG. 5A is a timing diagram showing the alignment of linear prediction
analysis windows in the encoder shown in FIG. 3;
FIG. 5B is a timing diagram showing the alignment of pitch prediction
analysis windows for open loop pitch prediction in the encoder shown in
FIG. 3;
FIGS. 6A and 6B show a flowchart illustrating the 26-bit line spectral
frequency vector quantization process performed by the encoder of FIG. 3;
FIG. 7 is a flowchart illustrating the operation of a pitch tracking
algorithm;
FIG. 8 is a block diagram showing in more detail the open loop pitch
estimation of the encoder shown in FIG. 3;
FIG. 9 is a flowchart illustrating the operation of the modified pitch
tracking algorithm implemented by the open loop pitch estimation shown in
FIG. 8;
FIG. 10 is a flowchart showing the processing performed by the mode
determination module shown in FIG. 3;
FIG. 11 is a dataflow diagram showing a part of the processing of a step of
determining spectral stationarity values shown in FIG. 10;
FIG. 12 is a dataflow diagram showing another part of the processing of the
step of determining spectral stationarity values;
FIG. 13 is a dataflow diagram showing another part of the processing of the
step of determining spectral stationarity values;
FIG. 14 is a dataflow diagram showing the processing of the step of
determining pitch stationarity values shown in FIG. 10;
FIG. 15 is a dataflow diagram showing the processing of the step of
generating zero crossing rate values shown in FIG. 10;
FIGS. 16A, 16B and 16C illustrate a dataflow diagram showing the processing
of the step of determining level gradient values in FIG. 10;
FIG. 17 is a dataflow diagram showing the processing of the step of
determining short-term energy values shown in FIG. 10;
FIGS. 18A, 18B and 18C are a flowchart of determining the mode based on the
generated values as shown in FIG. 10;
FIG. 19 is a block diagram showing in more detail the implementation of the
excitation modeling circuitry of the encoder shown in FIG. 3;
FIG. 20 is a diagram illustrating a processing of the encoder shown in FIG.
3;
FIG. 21 is a timing diagram showing an alternative alignment of linear
prediction analysis windows;
FIGS. 22A and 22B show a chart of speech coder parameters for mode A;
FIG. 23 is a chart of speech coder parameters for mode A;
FIG. 24 is a chart of speech coder parameters for mode A; and
FIG. 25 is a block diagram illustrating a processing of the speech decoder
shown in FIG. 4.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION
FIG. 1 shows the transmitter of the preferred communication system.
Analog-to-digital (A/D) converter 11 samples analog speech from a
telephone handset at an 8 kHz rate, converts the samples to digital values
and supplies the digital values to the speech encoder 12. Channel encoder 13
further encodes the signal, as may be required in a digital cellular
communications system, and supplies a resulting encoded bit stream to a
modulator 14. Digital-to-analog (D/A) converter 15 converts the output of
the modulator 14 to Phase Shift Keying (PSK) signals. Radio frequency (RF)
up converter 16 amplifies and frequency multiplies the PSK signals and
supplies the amplified signals to antenna 17.
A low-pass, antialiasing, filter (not shown) filters the analog speech
signal input to A/D converter 11. A high-pass, second order biquad, filter
(not shown) filters the digitized samples from A/D converter 11. The
transfer function is:
##EQU1##
The high-pass filter attenuates D.C. or hum contamination that may occur in
the incoming speech signal.
FIG. 2 shows the receiver of the preferred communication system. RF down
converter 22 receives a signal from antenna 21 and heterodynes the signal
to an intermediate frequency (IF). A/D converter 23 converts the IF signal
to a digital bit stream, and demodulator 24 demodulates the resulting bit
stream. At this point the reverse of the encoding process in the
transmitter takes place. Channel decoder 25 and speech decoder 26 perform
decoding. D/A converter 27 synthesizes analog speech from the output of
the speech decoder.
Much of the processing described in this specification is performed by a
general purpose signal processor executing program statements. To
facilitate a description of the preferred communication system, however,
the preferred communication system is illustrated in terms of block and
circuit diagrams. One of ordinary skill in the art could readily
transcribe these diagrams into program statements for a processor.
FIG. 3 shows the encoder 12 of FIG. 1 in more detail, including an audio
preprocessor 31, linear predictive (LP) analysis and quantization module
32, and open loop pitch estimation module 33. Module 34 analyzes each
frame of the signal to determine whether the frame is mode A, mode B, or
mode C, as described in more detail below. Module 35 performs excitation
modelling depending on the mode determined by module 34. Processor 36
compacts compressed speech bits.
FIG. 4 shows the decoder 26 of FIG. 2, including a processor 41 for
unpacking of compressed speech bits, module 42 for excitation signal
reconstruction, filter 43, speech synthesis filter 44, and global post
filter 45.
FIG. 5A shows linear prediction analysis windows. The preferred
communication system employs 40 ms. speech frames. For each frame, module
32 performs LP (linear prediction) analysis on two 30 ms. windows that are
spaced apart by 20 ms. The first LP window is centered at the middle, and
the second LP window is centered at the leading edge of the speech frame
such that the second LP window extends 15 ms. into the next frame. In
other words, module 32 analyzes a first part of the frame (LP window 1) to
generate a first set of filter coefficients and analyzes a second part of
the frame and a part of a next frame (LP window 2) to generate a second
set of filter coefficients.
FIG. 5B shows pitch analysis windows. For each frame, module 32 performs
pitch analysis on two 37.625 ms. windows. The first pitch analysis window
is centered at the middle, and the second pitch analysis window is
centered at the leading edge of the speech frame such that the second
pitch analysis window extends 18.8125 ms. into the next frame. In other
words, module 32 analyzes a third part of the frame (pitch analysis window
1) to generate a first pitch estimate and analyzes a fourth part of the
frame and a part of the next frame (pitch analysis window 2) to generate a
second pitch estimate.
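The window alignment of FIGS. 5A and 5B can be checked with a short calculation. The following is only a sketch: it assumes that the second window of each pair is centered on the boundary between the current frame and the next frame (which is consistent with the stated 15 ms and 18.8125 ms extensions), and the names window_span, lp_overhang, and pitch_overhang are illustrative rather than taken from the specification.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 40.0


def window_span(center_ms, length_ms, rate_hz=SAMPLE_RATE_HZ):
    """Return (start, end) of a window, in samples relative to the frame start."""
    half = length_ms / 2.0
    per_ms = rate_hz / 1000.0
    return ((center_ms - half) * per_ms, (center_ms + half) * per_ms)


frame_len = FRAME_MS * SAMPLE_RATE_HZ / 1000.0       # samples per 40 ms frame

# LP analysis windows: 30 ms long, centers 20 ms apart.
lp_window_1 = window_span(FRAME_MS / 2.0, 30.0)       # centered mid-frame
lp_window_2 = window_span(FRAME_MS, 30.0)             # assumed centered at frame end

# Pitch analysis windows: 37.625 ms long, same centers (assumption).
pitch_window_1 = window_span(FRAME_MS / 2.0, 37.625)
pitch_window_2 = window_span(FRAME_MS, 37.625)

# How far each second window reaches into the next frame, in samples.
lp_overhang = lp_window_2[1] - frame_len
pitch_overhang = pitch_window_2[1] - frame_len
```

Under these assumptions the overhangs come out to 120 samples (15 ms) for the LP window and 150.5 samples (18.8125 ms) for the pitch window, matching the figures in the text.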
Module 32 employs multiplication by a Hamming window followed by a tenth
order autocorrelation method of LP analysis. With this method of LP
analysis, module 32 obtains optimal filter coefficients and optimal
reflection coefficients. In addition, the residual energy after LP
analysis is also readily obtained and, when expressed as a fraction of the
speech energy of the windowed LP analysis buffer, is denoted as
.alpha..sub.1 for the first LP window and .alpha..sub.2 for the second LP
window. These outputs of the LP analysis are used subsequently in the mode
selection algorithm as measures of spectral stationarity, as described in
more detail below.
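The LP analysis described above can be sketched as follows. This is a minimal illustration, assuming the conventional Levinson-Durbin recursion for the autocorrelation method and a predictor polynomial of the form A(z) = 1 + a.sub.1 z.sup.-1 + . . . ; the function names and the synthetic test signal are hypothetical and are not part of the specification.

```python
import math
import random


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def lp_analysis(samples, order=10):
    """Return (filter coefficients, reflection coefficients, alpha).

    alpha is the residual energy as a fraction of the energy of the
    windowed analysis buffer.
    """
    x = [s * w for s, w in zip(samples, hamming(len(samples)))]
    r = [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
         for lag in range(order + 1)]
    if r[0] <= 0.0:
        raise ValueError("analysis window contains no energy")
    a = [0.0] * (order + 1)          # a[0] is unused
    reflection = []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err               # m-th reflection coefficient
        reflection.append(k)
        updated = a[:]
        updated[m] = k
        for i in range(1, m):
            updated[i] = a[i] + k * a[m - i]
        a = updated
        err *= 1.0 - k * k           # residual energy after order m
    return a[1:], reflection, err / r[0]


# Hypothetical 30 ms buffer (240 samples at 8 kHz): a tone plus a little noise.
rng = random.Random(0)
example = [math.sin(0.3 * n) + 0.1 * rng.uniform(-1.0, 1.0) for n in range(240)]
coeffs, refl, alpha = lp_analysis(example)
```

The value alpha returned here plays the role of .alpha..sub.1 or .alpha..sub.2, depending on which LP window supplied the buffer.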
After LP analysis, module 32 bandwidth broadens the filter coefficients for
the first LP window, and for the second LP window, by 25 Hz, converts the
coefficients to ten line spectral frequencies (LSF), and quantizes these
ten line spectral frequencies with a 26-bit LSF vector quantization (VQ),
as described below.
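The specification does not say how the 25 Hz bandwidth broadening is carried out. One common technique, shown here purely as an assumption, scales the i-th filter coefficient by gamma raised to the power i, where gamma = exp(-pi B / f.sub.s) for a broadening of B Hz at sampling rate f.sub.s. The function name bandwidth_expand is hypothetical.

```python
import math


def bandwidth_expand(coeffs, bandwidth_hz=25.0, rate_hz=8000.0):
    """Scale coefficient a_i by gamma**i (assumed technique, see lead-in)."""
    gamma = math.exp(-math.pi * bandwidth_hz / rate_hz)
    return [a * gamma ** (i + 1) for i, a in enumerate(coeffs)]


# With all-ones input the output shows the scale factor applied at each order.
scale_factors = bandwidth_expand([1.0] * 10)
```

For 25 Hz at 8 kHz the per-order factor gamma is roughly 0.990, so the higher-order coefficients are attenuated only slightly.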
Module 32 employs a 26-bit vector quantization (VQ) for each set of ten
LSFs. This VQ provides good and robust performance across a wide range of
handsets and speakers. Separate VQ codebooks are designed for "IRS
filtered" and "flat unfiltered" ("non-IRS-filtered") speech material. The
unquantized LSF vector is quantized by the "IRS filtered" VQ tables as
well as the "flat unfiltered" VQ tables. The optimum classification is
selected on the basis of the cepstral distortion measure. Within each
classification, the vector quantization is carried out. Multiple
candidates for each split vector are chosen on the basis of energy
weighted mean square error, and an overall optimal selection is made
within each classification on the basis of the cepstral distortion measure
among all combinations of candidates. After the optimum classification is
chosen, the quantized line spectral frequencies are converted to filter
coefficients.
More specifically, module 32 quantizes the ten line spectral frequencies
for both sets with a 26-bit multi-codebook split vector quantizer that
classifies the unquantized line spectral frequency vector as a "voiced
IRS-filtered," "unvoiced IRS-filtered," "voiced non-IRS-filtered," or
"unvoiced non-IRS-filtered" vector, where "IRS" refers to the intermediate
reference system filter as specified by CCITT, Blue Book, Rec. P.48.
FIGS. 6A and 6B show an outline of the LSF vector quantization process.
Module 32 employs a split vector quantizer for each classification,
including a 3-4-3 split vector quantizer for the "voiced IRS-filtered" and
the "voiced non-IRS-filtered" categories 51 and 53. The first three LSFs
use an 8-bit codebook in function modules 55 and 57, the next four LSFs
use a 10-bit codebook in function modules 59 and 61, and the last three
LSFs use a 6-bit codebook in function modules 63 and 65. For the "unvoiced
IRS-filtered" and the "unvoiced non-IRS-filtered" categories 52 and 54, a
3-3-4 split vector quantizer is used. The first three LSFs use a 7-bit
codebook in function modules 56 and 58, the next three LSFs use an 8-bit
vector codebook in function modules 60 and 62, and the last four LSFs use
a 9-bit codebook in function modules 64 and 66. From each split vector
codebook, the three best candidates are selected in function modules 67,
68, 69, and 70 using the energy weighted mean square error criterion. The
energy weighting reflects the power level of the spectral envelope at each
line spectral frequency. The three best candidates for each of the three
split vectors result in a total of twenty-seven combinations for each
category. The search is constrained so that at least one combination would
result in an ordered set of LSFs. This is usually a very mild constraint
imposed on the search. The optimum combination of these twenty-seven
combinations is selected in function module 71 depending on the cepstral
distortion measure. Finally, the optimal category or classification is
determined also on the basis of the cepstral distortion measure. The
quantized LSFs are converted to filter coefficients and then to
autocorrelation lags for interpolation purposes.
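The candidate search within one category can be sketched as follows. This is an illustration under stated assumptions: the codebooks are toy data, a plain squared error stands in for the cepstral distortion measure, and combinations that are not strictly increasing are simply skipped, which is one possible way of enforcing the ordering constraint. The names weighted_mse and split_vq_search are hypothetical.

```python
from itertools import product


def weighted_mse(target, entry, weights):
    """Energy weighted squared error between a split vector and a codebook entry."""
    return sum(w * (t - e) ** 2 for t, e, w in zip(target, entry, weights))


def split_vq_search(lsf, weights, codebooks, n_best=3):
    """Return (distortion, index tuple, quantized LSFs), or None if no ordered combination exists."""
    shortlists = []
    start = 0
    for book in codebooks:
        width = len(book[0])
        tgt = lsf[start:start + width]
        wts = weights[start:start + width]
        ranked = sorted(range(len(book)),
                        key=lambda j: weighted_mse(tgt, book[j], wts))
        shortlists.append(ranked[:n_best])   # best candidates for this split
        start += width
    best = None
    for combo in product(*shortlists):       # up to n_best ** len(codebooks) combinations
        cand = [v for book, j in zip(codebooks, combo) for v in book[j]]
        if any(b <= a for a, b in zip(cand, cand[1:])):
            continue                         # not an ordered set of LSFs
        dist = sum((t - c) ** 2 for t, c in zip(lsf, cand))  # stand-in measure
        if best is None or dist < best[0]:
            best = (dist, combo, cand)
    return best


# Toy two-split example; real codebooks would use the 3-4-3 or 3-3-4 split.
toy_books = [
    [(0.10, 0.20), (0.15, 0.25), (0.30, 0.40)],
    [(0.50, 0.60), (0.35, 0.70), (0.20, 0.90)],
]
toy_result = split_vq_search([0.16, 0.26, 0.36, 0.69], [1.0] * 4, toy_books)
```

With three splits and three candidates per split, the loop over product(...) visits the twenty-seven combinations mentioned in the text.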
The resulting LSF vector quantizer scheme is not only effective across
speakers but also across varying degrees of IRS filtering which models the
influence of the handset transducer. The codebooks of the vector
quantizers are trained from a sixty talker speech database using flat as
well as IRS frequency shaping. This is designed to provide consistent and
good performance across several speakers and across various handsets. The
average log spectral distortion across the entire TIA half rate database
is approximately 1.2 dB for IRS filtered speech data and approximately 1.3
dB for non-IRS filtered speech data.
Two estimates of the pitch are determined per frame at intervals of 20 msec.
These open loop pitch estimates are used in mode selection and to encode
the closed loop pitch analysis if the selected mode is a predominantly
voiced mode.
Module 33 determines the two pitch estimates from the two pitch analysis
windows described above in connection with FIG. 5B, using a modified form
of the pitch tracking algorithm shown in FIG. 7. This pitch estimation
algorithm makes an initial pitch estimate in function module 73 using an
error function calculated for all values in the set {22.0, 22.5, . . . ,
114.5}, followed by pitch tracking to yield an overall optimum pitch
value. Function module 74 employs look-back pitch tracking using the error
functions and pitch estimates of the previous two pitch analysis windows.
Function module 75 employs look-ahead pitch tracking using the error
functions of the two future pitch analysis windows. Decision module 76
compares pitch estimates depending on look-back and look-ahead pitch
tracking to yield an overall optimum pitch value at output 77. The pitch
estimation algorithm shown in FIG. 7 requires the error functions of two
future pitch analysis windows for its look-ahead pitch tracking and thus
introduces a delay of 40 ms. In order to avoid this penalty, the preferred
communication system employs a modification of the pitch estimation
algorithm of FIG. 7.
FIG. 8 shows the open loop pitch estimation 33 of FIG. 3 in more detail.
Pitch analysis windows one and two are input to respective compute error
functions 331 and 332. The outputs of these error function computations
are input to a refinement of past pitch estimates 333, and the refined
pitch estimates are sent to both look back and look ahead pitch tracking
334 and 335 for pitch window one. The outputs of the pitch tracking
circuits are input to selector 336 which selects the open loop pitch one
as the first output. The selected open loop pitch one is also input to a
look back pitch tracking circuit for pitch window two which outputs the
open loop pitch two.
FIG. 9 shows the modified pitch tracking algorithm implemented by the pitch
estimation circuitry of FIG. 8. The modified pitch estimation algorithm
employs the same error function as in the FIG. 7 algorithm in each pitch
analysis window, but the pitch tracking scheme is altered. Prior to pitch
tracking for either the first or second pitch analysis window, the
previous two pitch estimates of the two previous pitch analysis windows
are refined in function modules 81 and 82, respectively, with both
look-back pitch tracking and look-ahead pitch tracking using the error
functions of the current two pitch analysis windows. This is followed by
look-back pitch tracking in function module 83 for the first pitch
analysis window using the refined pitch estimates and error functions of
the two previous pitch analysis windows. Look-ahead pitch tracking for the
first pitch analysis window in function module 84 is limited to using the
error function of the second pitch analysis window. The two estimates are
compared in decision module 85 to yield an overall best pitch estimate for
the first pitch analysis window. For the second pitch analysis window,
look-back pitch tracking is carried out in function module 86 using, in
addition, the pitch estimate of the first pitch analysis window and its
error function. No look-ahead pitch tracking is used for this second pitch
analysis window, with the result that the look-back pitch estimate is taken
to be the overall best pitch estimate at output 87.
FIG. 10 shows the mode determination processing performed by mode selector
34. Depending on spectral stationarity, pitch stationarity, short term
energy, short term level gradient, and zero crossing rate of each 40 ms.
frame, mode selector 34 classifies each frame into one of three modes:
voiced and stationary mode (Mode A), unvoiced or transient mode (Mode B),
and background noise mode (Mode C). More specifically, mode selector 34
generates two logical values, each indicating spectral stationarity or
similarity of spectral content between the currently processed frame and
the previous frame (Step 1010). Mode selector 34 generates two logical
values indicating pitch stationarity, similarity of fundamental
frequencies, between the currently processed frame and the previous frame
(Step 1020). Mode selector 34 generates two logical values indicating the
zero crossing rate of the currently processed frame (Step 1030), a rate
influenced by the higher frequency components of the frame relative to the
lower frequency components of the frame. Mode selector 34 generates two
logical values indicating level gradients within the currently processed
frame (Step 1040). Mode selector 34 generates five logical values
indicating short-term energy of the currently processed frame (Step 1050).
Subsequently, mode selector 34 determines the mode of the frame to be mode
A, mode B, or mode C, depending on the values generated in Steps 1010-1050
(Step 1060).
FIG. 11 is a block diagram showing a processing of Step 1010 of FIG. 10 in
more detail. The processing of FIG. 11 determines a cepstral distortion in
dB. Module 1110 converts the quantized filter coefficients of window 2 of
the current frame into the lag domain, and module 1120 converts the
quantized filter coefficients of window 2 of the previous frame into the
lag domain. Module 1130 interpolates the outputs of modules 1110 and 1120,
and module 1140 converts the output of module 1130 back into filter
coefficients. Module 1150 converts the output from module 1140 into the
cepstral domain, and module 1160 converts the unquantized filter
coefficients from window 1 of the current frame into the cepstral domain.
Module 1170 generates the cepstral distortion d.sub.c from the outputs of
1150 and 1160.
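The conversion to the cepstral domain and the distortion computation can be sketched as follows. The specification does not give the formula for d.sub.c, so this sketch assumes the conventional LPC-to-cepstrum recursion for 1/A(z), with A(z) = 1 + a.sub.1 z.sup.-1 + . . . , and the conventional cepstral distance in dB; the scaling actually used in module 1170 may differ. The function names and the truncation to 20 cepstral terms are hypothetical.

```python
import math


def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n of 1/A(z); a[i-1] holds a_i."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        acc = sum(k * c[k - 1] * a[n - k - 1]
                  for k in range(1, n) if n - k <= p)
        c.append(-a_n - acc / n)
    return c


def cepstral_distortion_db(a_ref, a_test, n_ceps=20):
    """Conventional cepstral distance, in dB, between two LP filters."""
    c_ref = lpc_to_cepstrum(a_ref, n_ceps)
    c_test = lpc_to_cepstrum(a_test, n_ceps)
    squared = sum((x - y) ** 2 for x, y in zip(c_ref, c_test))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared)


# Single-pole check: for A(z) = 1 - 0.5 z^-1 the cepstrum is 0.5**n / n.
single_pole = lpc_to_cepstrum([-0.5], 3)
```

In terms of FIG. 11, a_ref would correspond to the unquantized coefficients of window 1 and a_test to the interpolated quantized coefficients output by module 1140.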
FIG. 12 shows generation of spectral stationarity value LPCFLAG1, which is
a relatively strong indicator of spectral stationarity for the frame. Mode
selector 34 generates LPCFLAG1 using a combination of two techniques for
measuring spectral stationarity. The first technique compares the cepstral
distortion d.sub.c using comparators 1210 and 1220. In FIG. 12, the
d.sub.t1 threshold input to comparator 1210 is -8.0 and the d.sub.t2
threshold input to comparator 1220 is -6.0.
The second technique is based on the residual energy after LPC analysis,
expressed as a fraction of the LPC analysis speech buffer spectral energy.
This residual energy is a by-product of LPC analysis, as described above.
The .alpha.1 input to comparator 1230 is the residual energy for the
filter coefficients of window 1 and the .alpha.2 input to comparator 1240
is the residual energy of the filter coefficients of window 2. The
.alpha.t1 input to comparators 1230 and 1240 is a threshold equal to 0.25.
FIG. 13 shows dataflow within mode selector 34 for generation of spectral
stationarity value flag LPCFLAG2, which is a relatively weak indicator of
spectral stationarity. The processing shown in FIG. 13 is similar to that
shown in FIG. 12, except that LPCFLAG2 is based on a relatively relaxed
set of thresholds. The d.sub.t2 input to comparator 1310 is -6.0, the
d.sub.t3 input to comparator 1320 is -4.0, the d.sub.t4 input to
comparator 1350 is -2.0, the .alpha.t1 input to comparators 1330 and 1340
is a threshold 0.25, and the .alpha.t2 to comparators 1360 and 1370 is
0.15.
FIG. 14 illustrates the process by which mode selector 34 measures pitch
stationarity using both the open loop pitch values of the current frame,
denoted as P.sub.1 for pitch window 1 and P.sub.2 for pitch window 2, and
the open loop pitch value of window 2 of the previous frame denoted by
P.sub.-1. A lower range of pitch values (P.sub.L1, P.sub.U1) and an upper
range of pitch values (P.sub.L2, P.sub.U2) are:
P.sub.L1 =MIN (P.sub.-1, P.sub.2)-P.sub.t
P.sub.U1 =MIN (P.sub.-1, P.sub.2)+P.sub.t
P.sub.L2 =MAX (P.sub.-1, P.sub.2)-P.sub.t
P.sub.U2 =MAX (P.sub.-1, P.sub.2)+P.sub.t,
where P.sub.t is 8.0. If the two ranges are non-overlapping, i.e., P.sub.L2
>P.sub.U1, then only a weak indicator of pitch stationarity, denoted by
PITCHFLAG2, is possible, and PITCHFLAG2 is set if P.sub.1 lies within
either the lower range (P.sub.L1, P.sub.U1) or upper range (P.sub.L2,
P.sub.U2). If the two ranges are overlapping, i.e., P.sub.L2
.ltoreq.P.sub.U1, a strong indicator of pitch stationarity, denoted by
PITCHFLAG1, is possible and is set if P.sub.1 lies within the range
(P.sub.L, P.sub.U), where
P.sub.L =(P.sub.-1 +P.sub.2)/2-2P.sub.t
P.sub.U =(P.sub.-1 +P.sub.2)/2+2P.sub.t
FIG. 14 shows a dataflow for generating PITCHFLAG1 and PITCHFLAG2 within
mode selector 34. Module 14005 generates an output equal to the input
having the largest value, and module 14010 generates an output equal to
the input having the smallest value. Module 14020 generates an output that
is an average of the values of the two inputs. Modules 14030, 14035,
14040, 14045, 14050 and 14055 are adders. Modules 14080, 14025 and 14090
are AND gates. Module 14087 is an inverter. Modules 14065, 14070, and
14075 are each logic blocks generating a true output when (C>=B)&(C<=A).
The circuit of FIG. 14 also processes reliability values V.sub.-1, V.sub.1,
and V.sub.2, each indicating whether the values P.sub.-1, P.sub.1, and
P.sub.2, respectively, are reliable. Typically, these reliability values
are a by-product of the pitch calculation algorithm. The circuit shown in
FIG. 14 generates false values for PITCHFLAG1 and PITCHFLAG2 if any of
these flags V.sub.-1, V.sub.1, V.sub.2, are false. Processing of these
reliability values is optional.
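The pitch stationarity test can be sketched as follows. The range formulas and the inclusive comparisons follow the text above; the choice to leave PITCHFLAG2 false in the overlapping case is an assumption, since the text only states which flag is "possible" in each case. The function name pitch_flags and its argument names are hypothetical.

```python
def pitch_flags(p_prev, p1, p2, p_t=8.0, reliable=(True, True, True)):
    """Return (PITCHFLAG1, PITCHFLAG2).

    p_prev is the open loop pitch of window 2 of the previous frame; p1 and
    p2 are the open loop pitches of the current frame. reliable holds the
    optional reliability values for p_prev, p1, and p2.
    """
    if not all(reliable):
        return False, False
    low, high = min(p_prev, p2), max(p_prev, p2)
    p_l1, p_u1 = low - p_t, low + p_t        # lower range
    p_l2, p_u2 = high - p_t, high + p_t      # upper range
    if p_l2 > p_u1:
        # Non-overlapping ranges: only the weak indicator is possible.
        weak = (p_l1 <= p1 <= p_u1) or (p_l2 <= p1 <= p_u2)
        return False, weak
    # Overlapping ranges: the strong indicator is possible.
    mid = (p_prev + p2) / 2.0
    strong = (mid - 2.0 * p_t) <= p1 <= (mid + 2.0 * p_t)
    return strong, False                     # PITCHFLAG2 left false (assumption)
```

For example, pitches of 40, 42, and 44 give overlapping ranges and a strong indication, whereas pitches of 30, 32, and 80 give non-overlapping ranges and at most a weak indication.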
FIG. 15 shows dataflow within mode selector 34 for generating two logical
values indicating a zero crossing rate for the frame. Modules 15002,
15004, 15006, 15008, 15010, 15012, 15014 and 15016 each count the number
of zero crossings in a respective 5 millisecond subframe of the frame
currently being processed. For example, module 15006 counts the number of
zero crossings of the signal occurring from the time 10 ms from the
beginning of the frame to the time 15 ms from the beginning of the
frame. Comparators 15018, 15020, 15022, 15024, 15026, 15028, 15030, and
15032, in combination with adder 15035, generate a value indicating the
number of 5 ms subframes having at least 15 zero crossings.
Comparator 15040 sets the flag ZC_LOW when the number of such
subframes is less than 2, and comparator 15037 sets the flag ZC_HIGH
when the number of such subframes is greater than 5. The value
ZC.sub.t input to comparators 15018-15032 is 15, the value Z.sub.t1 input
to comparator 15040 is 2, and the value Z.sub.t2 input to comparator 15037
is 5.
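The zero crossing processing can be sketched as follows, using the thresholds given above. The definition of a zero crossing as a sign change between consecutive samples within a subframe is an assumption, as is the function name zero_crossing_flags.

```python
def zero_crossing_flags(frame, subframe_len=40, zc_t=15, z_t1=2, z_t2=5):
    """Return (ZC_LOW, ZC_HIGH) for one frame of samples.

    subframe_len=40 corresponds to 5 ms at an 8 kHz sampling rate.
    """
    busy_subframes = 0
    for start in range(0, len(frame), subframe_len):
        sub = frame[start:start + subframe_len]
        # Count sign changes between consecutive samples (assumed definition).
        crossings = sum(1 for a, b in zip(sub, sub[1:]) if (a >= 0) != (b >= 0))
        if crossings >= zc_t:
            busy_subframes += 1
    return busy_subframes < z_t1, busy_subframes > z_t2


# A sign-alternating frame has many crossings; a constant frame has none.
alternating = [1 if i % 2 == 0 else -1 for i in range(320)]
constant = [1] * 320
```

Because the two thresholds differ, a frame with between two and five busy subframes sets neither flag.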
FIGS. 16A, 16B, and 16C show a data flow for generating two logical values
indicative of short term level gradient. Mode selector 34 measures short
term level gradient, an indication of transients within a frame, using a
low-pass filtered version of the companded input signal amplitude. Module
16005 generates the absolute value of the input signal s(n), module 16010
compands its input signal, and low-pass filter 16015 generates a signal
A.sub.L (n) that, at time instant n, is expressed by:
A.sub.L (n)=(63/64)A.sub.L (n-1)+(1/64)C(.vertline.s(n).vertline.)
where the companding function C(.) is the .mu.-law function described in
CCITT G.711. Delay 16025 generates an output that is a 10 ms-delayed
version of its input, and subtractor 16027 generates a difference between
A.sub.L (n) and A.sub.L (n-80). Module 16030 generates a signal that is an
absolute value of its input.
Every 5 ms, mode selector 34 compares A.sub.L (n) with that of 10 ms ago
and, if the difference .vertline.A.sub.L (n)-A.sub.L (n-80).vertline.
exceeds a fixed relaxed threshold, increments a counter. (In the preceding
expression, 80 corresponds to 8 samples per ms times 10 ms.) As shown in
FIG. 16C, if this difference does not exceed a relatively stringent
threshold (L.sub.t2 =32) for any subframe, mode selector 34 sets LVLFLAG2,
weakly indicating an absence of transients. As shown in FIG. 16B, if this
difference exceeds a more relaxed threshold (L.sub.t1 =10) for no more
than one subframe (L.sub.t3 =2) mode selector 34 sets LVLFLAG1, strongly
indicating an absence of transients.
More specifically, FIG. 16B shows delay circuits 16032-16046 that each
generate a 5 ms delayed version of its input. Each of latches 16048-16062
saves a signal on its input. Latches 16048-16062 are strobed at a common
time, near the end of each 40 ms speech frame, so that each latch saves a
portion of the frame separated by 5 ms from the portion saved by an
adjacent latch. Comparators 16064-16078 each compare the output of a
respective latch to the threshold L.sub.t1 and adder 16080 sums the
comparator outputs and sends the sum to comparator 16082 for comparison to
the threshold L.sub.t3.
FIG. 16C shows a circuit for generating LVLFLAG2. In FIG. 16C, delays
16132-16146 are similar to the delays shown in FIG. 16B and latches
16148-16162 are similar to the latches shown in FIG. 16B. Comparators
16164-16178 each compare an output of a respective latch to the threshold
L.sub.t2 =2. Thus, OR gate 16180 generates a true output if any of the
latched signals originating from module 16030 exceeds the threshold
L.sub.t2. Inverter 16182 inverts the output of OR gate 16180.
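The level gradient processing can be sketched as follows. Several points are assumptions: a continuous .mu.-law curve stands in for the segmented G.711 companding, the output scale of that curve (out_max) is arbitrary, and the thresholds are passed as parameters because the text gives more than one value for L.sub.t2. The function names are hypothetical.

```python
import math

MU = 255.0


def mu_law(x, x_max=32768.0, out_max=128.0):
    """Continuous mu-law curve; a stand-in for the G.711 companding function C(.)."""
    magnitude = min(abs(x), x_max)
    return out_max * math.log1p(MU * magnitude / x_max) / math.log1p(MU)


def smoothed_level(samples, state=0.0):
    """A_L(n) = (63/64) A_L(n-1) + (1/64) C(|s(n)|) for each sample."""
    out = []
    for s in samples:
        state = (63.0 / 64.0) * state + (1.0 / 64.0) * mu_law(s)
        out.append(state)
    return out


def level_flags(diffs, l_t1=10.0, l_t2=32.0, l_t3=2):
    """Return (LVLFLAG1, LVLFLAG2) from the |A_L(n) - A_L(n-80)| values
    latched every 5 ms within one frame."""
    lvlflag1 = sum(1 for d in diffs if d > l_t1) < l_t3   # at most one exceedance
    lvlflag2 = not any(d > l_t2 for d in diffs)           # no exceedance at all
    return lvlflag1, lvlflag2
```

In this sketch, level_flags takes the eight latched differences for a 40 ms frame; computing them from smoothed_level requires 10 ms of history before the frame start.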
FIG. 17 shows a data flow for generating parameters indicative of short
term energy. Short term energy is measured as the mean square energy
(average energy per sample) on a frame basis as well as on a 5 ms basis.
The short term energy is determined relative to a background energy
E.sub.bn. E.sub.bn is initially set to a constant E.sub.0
=(100.times.(12).sup.1/2).sup.2. Subsequently, when a frame is determined
to be mode C, E.sub.bn is set equal to (7/8)E.sub.bn +(1/8)E.sub.0. Thus,
some of the thresholds employed in the circuit of FIG. 17 are adaptive. In
FIG. 17, E.sub.t0 =0.707 E.sub.bn, E.sub.t1 =5, E.sub.t2 =2.5
E.sub.bn, E.sub.t3 =1.8 E.sub.bn, E.sub.t4 =E.sub.bn, E.sub.t5 =0.707
E.sub.bn, and E.sub.t6 =16.0.
The short term energy on a 5 ms basis provides an indication of presence of
speech throughout the frame using a single flag EFLAG1, which is generated
by testing the short term energy on a 5 ms basis against a threshold,
incrementing a counter whenever the threshold is exceeded, and testing the
counter's final value against a fixed threshold. Comparing the short term
energy on a frame basis to various thresholds provides indication of
absence of speech throughout the frame in the form of several flags with
varying degrees of confidence. These flags are denoted as EFLAG2, EFLAG3,
EFLAG4, and EFLAG5.
FIG. 17 shows dataflow within mode selector 34 for generating these flags.
Modules 17002, 17004, 17006, 17008, 17010, 17015, 17020, and 17022 each
measure the energy in a respective 5 ms subframe of the frame currently
being processed. Comparators 17030, 17032, 17034, 17036, 17038, 17040,
17042, and 17044, in combination with adder 17050, count the number of
subframes having an energy exceeding E.sub.t0 =0.707 E.sub.bn.
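The subframe energy count can be sketched as follows. The sketch stops at the count, because the direction of the final comparison that yields EFLAG1 is not spelled out in the text. The function names are hypothetical, and the default background energy is simply the initial constant E.sub.0.

```python
E0 = (100.0 * 12.0 ** 0.5) ** 2      # initial background energy, about 120000


def mean_square(samples):
    """Mean square energy (average energy per sample)."""
    return sum(s * s for s in samples) / len(samples)


def count_active_subframes(frame, e_bn=E0, subframe_len=40, factor=0.707):
    """Count the 5 ms subframes whose energy exceeds factor * e_bn."""
    threshold = factor * e_bn
    return sum(
        1
        for start in range(0, len(frame), subframe_len)
        if mean_square(frame[start:start + subframe_len]) > threshold
    )


# Hypothetical frames: loud throughout, quiet throughout, and half and half.
loud = [1000] * 320
quiet = [10] * 320
half = [1000] * 160 + [10] * 160
```

The frame-level flags EFLAG2 through EFLAG5 would compare mean_square(frame) with the adaptive thresholds listed above in the same manner.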
FIGS. 18A, 18B, and 18C show the processing of step 1060. Mode selector 34
first classifies the frame as background noise (mode C) or speech (modes A
or B). Mode C tends to be characterized by low energy, relatively high
spectral stationarity between the current frame and the previous frame, a
relative absence of pitch stationarity between the current frame and the
previous frame, and a high zero crossing rate. Background noise (mode C)
is declared either on the basis of the short term energy flag EFLAG5 alone
or by combining short term energy flags EFLAG4, EFLAG3, and EFLAG2 with
other flags indicating high zero crossing rate, absence of pitch, absence
of transients, etc.
More specifically, if the mode of the previous frame was A or if EFLAG2 is
not true, processing proceeds to step 18045 (step 18005). Step 18005
ensures that the current frame will not be mode C if the previous frame
was mode A. The current frame is mode C if (LPCFLAG1 and EFLAG3) is true
or (LPCFLAG2 and EFLAG4) is true or EFLAG5 is true (steps 18010, 18015,
and 18020). The current frame is mode C if ((not PITCHFLAG1) and LPCFLAG1
and ZC_HIGH) is true (step 18025) or ((not PITCHFLAG1) and (not
PITCHFLAG2) and LPCFLAG2 and ZC_HIGH) is true (step 18030). Thus,
the processing shown in FIG. 18A determines whether the frame corresponds
to a first mode (Mode C), depending on whether a speech component is
substantially absent from the frame.
In step 18045, a score is calculated depending on the mode of the previous
frame. If the mode of the previous frame was mode A, the score is
1+LVLFLAG1+EFLAG1+ZC_LOW. If the previous mode was mode B, the score
is 0+LVLFLAG1+EFLAG1+ZC_LOW. If the mode of the previous frame was
mode C, the score is 2+LVLFLAG1+EFLAG1+ZC_LOW.
If the mode of the previous frame was mode C or not LVLFLAG2, the mode of
the current frame is mode B (step 18050). The current frame is mode A if
(LPCFLAG1 & PITCHFLAG1) is true, provided the score is not less than 2
(steps 18060 and 18055). The current frame is mode A if (LPCFLAG1 and
PITCHFLAG2) is true or (LPCFLAG2 and PITCHFLAG1) is true, provided score
is not less than 3 (steps 18070, 18075, and 18080).
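The decision logic of step 1060 can be sketched as follows. The conditions follow the text above; the final fallback to mode B, when none of the mode A conditions is met, is an assumption, since the text does not state the default. The function name select_mode and the dictionary-of-flags interface are hypothetical.

```python
def select_mode(prev_mode, flags):
    """Return 'A', 'B', or 'C' given the previous mode and a dict of flags.

    Missing flags are treated as false.
    """
    def f(name):
        return bool(flags.get(name, False))

    # Step 18005: skip the mode C tests if the previous frame was mode A
    # or if EFLAG2 is not true.
    if prev_mode != "A" and f("EFLAG2"):
        if (f("LPCFLAG1") and f("EFLAG3")) or (f("LPCFLAG2") and f("EFLAG4")) or f("EFLAG5"):
            return "C"
        if not f("PITCHFLAG1") and f("LPCFLAG1") and f("ZC_HIGH"):
            return "C"
        if (not f("PITCHFLAG1") and not f("PITCHFLAG2")
                and f("LPCFLAG2") and f("ZC_HIGH")):
            return "C"

    # Step 18045: score depends on the mode of the previous frame.
    base = {"A": 1, "B": 0, "C": 2}[prev_mode]
    score = base + int(f("LVLFLAG1")) + int(f("EFLAG1")) + int(f("ZC_LOW"))

    # Step 18050.
    if prev_mode == "C" or not f("LVLFLAG2"):
        return "B"
    if f("LPCFLAG1") and f("PITCHFLAG1") and score >= 2:
        return "A"
    if ((f("LPCFLAG1") and f("PITCHFLAG2"))
            or (f("LPCFLAG2") and f("PITCHFLAG1"))) and score >= 3:
        return "A"
    return "B"   # assumed default when no mode A condition holds
```

Note that, as written, a frame following a mode C frame can only be classified as mode C or mode B.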
Subsequently, speech encoder 12 generates an encoded frame in accordance
with one of a first coding scheme (a coding scheme for mode C), when the
frame corresponds to the first mode, and an alternative coding scheme (a
coding scheme for modes A or B), when the frame does not correspond to the
first mode, as described in more detail below.
For mode A, only the second set of line spectral frequency vector
quantization indices need to be transmitted because the first set can be
inferred at the receiver due to the slowly varying nature of the vocal
tract shape. In addition, the first and second open loop pitch estimates
are quantized and transmitted because they are used to encode the closed
loop pitch estimates in each subframe. The quantization of the second open
loop pitch estimate is accomplished using a non-uniform 4-bit quantizer
while the quantization of the first open loop pitch estimate is
accomplished using a differential non-uniform 3-bit quantizer. Since the
vector quantization indices of the LSF's for the first linear prediction
analysis window are neither transmitted nor used in mode selection, they
need not be calculated in mode A. This reduces the complexity of the short
term predictor section of the encoder in this mode. This reduced
complexity as well as the lower bit rate of the short term predictor
parameters in mode A is offset by faster update of all the excitation
model parameters.
For mode B, both sets of line spectral frequency vector quantization must
be transmitted because of potential spectral nonstationarity. However, for
the first set of line spectral frequencies we need search only 2 of the 4
classifications or categories. This is because the IRS vs. non-IRS
selection varies very slowly with time. If the second set of line spectral
frequencies were chosen from the "voiced IRS-filtered" category, then the
first set can be expected to be from either the "voiced IRS-filtered" or
"unvoiced IRS-filtered" categories. If the second set of line spectral
frequencies were chosen from the "unvoiced IRS-filtered" category, then
again the first set can be expected to be from either the "voiced
IRS-filtered" or "unvoiced IRS-filtered" categories. If the second set of
line spectral frequencies were chosen from the "voiced non-IRS-filtered"
category, then the first set can be expected to be from either the "voiced
non-IRS-filtered" or "unvoiced non-IRS filtered" categories. Finally, if
the second set of line spectral frequencies were chosen from the "unvoiced
non-IRS-filtered" category, then again the first set can be expected to be
from either the "voiced non-IRS-filtered" or "unvoiced non-IRS-filtered"
categories. As a result only two categories of LSF codebooks need be
searched for the quantization of the first set of line spectral
frequencies. Furthermore, only 25 bits are needed to encode these
quantization indices instead of the 26 needed for the second set of LSF's,
since the optimal category for the first set can be coded using just 1
bit. For mode B, neither of the two open loop pitch estimates are
transmitted since they are not used in guiding the closed loop pitch
estimates. The higher complexity involved in encoding as well as the
higher bit rate of the short term predictor parameters in mode B is
compensated by a slower update of all the excitation model parameters.
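The category restriction and the resulting bit counts can be checked with a short calculation. This sketch assumes that the 26-bit total consists of the split codebook indices plus a fixed-length category index, which is consistent with the codebook sizes given earlier (8+10+6 and 7+8+9 bits) but is not stated explicitly. The names SPLIT_BITS, first_set_candidates, and lsf_bits are hypothetical.

```python
SPLIT_BITS = {
    "voiced IRS-filtered": (8, 10, 6),
    "unvoiced IRS-filtered": (7, 8, 9),
    "voiced non-IRS-filtered": (8, 10, 6),
    "unvoiced non-IRS-filtered": (7, 8, 9),
}


def first_set_candidates(second_set_category):
    """Categories searched for the first LSF set, given the second set's category."""
    family = "non-IRS-filtered" if "non-IRS" in second_set_category else "IRS-filtered"
    return ["voiced " + family, "unvoiced " + family]


def lsf_bits(category, n_categories):
    """Bits for one LSF set: category index plus the three split codebook indices."""
    category_bits = (n_categories - 1).bit_length()
    return category_bits + sum(SPLIT_BITS[category])


second_set_bits = {c: lsf_bits(c, 4) for c in SPLIT_BITS}   # four categories
first_set_bits = {c: lsf_bits(c, 2) for c in SPLIT_BITS}    # two categories
```

Under this assumption every category uses 24 bits of codebook indices, so selecting among four categories gives 26 bits and selecting among two gives 25 bits.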
For mode C, only the second set of line spectral frequency vector
quantization indices need to be transmitted because the human ear is not
as sensitive to rapid variations in spectral shape for noisy
inputs. Further, such rapid spectral shape variations are atypical for
many kinds of b