Claims
What is claimed is:
1. An improvement in the method for compressing digitally encoded input
speech or audio vectors at a transmitter by using a scaling unit
controlled by a quantized residual gain factor QG, a synthesis filter
controlled by a set of quantized linear predictive coefficient parameters
QLPC, a pitch predictor controlled by pitch and pitch predictor parameters
QP and QPP, a weighting filter controlled by a set of perceptual weighting
parameters W, and a permanent indexed codebook containing a predetermined
number M of codebook vectors, each having an assigned codebook index, to
find an index which identifies the best match between an input speech or
audio vector s.sub.n that is to be coded and a synthesized vector s.sub.n
generated from a stored vector in said indexed codebook, wherein each of
said digitally encoded input vectors consists of a predetermined number K
of digitally coded samples, comprising the steps of
buffering and grouping said input speech or audio vectors into frames of
vectors with a predetermined number N of vectors in each frame,
performing an initial analysis for each successive frame, said analysis
including the computation of a residual gain factor G, a set of perceptual
weighting parameters W, a pitch parameter P, a pitch predictor parameter
PP, and a set of said linear predictive coefficient parameters LPC, and
the computation of quantized values QG, QP, QPP and QLPC of parameters G,
P, PP and LPC using one or more indexed quantizing tables for the
computation of each quantized parameter or set of parameters,
for each frame transmitting indices of said quantized parameters QG, QP,
QPP and QLPC determined in the initial analysis step as side information
about vectors analyzed for later use in looking up in one or more
identical tables said quantized parameters QG, QP, QPP and QLPC while
reconstructing speech and audio vectors from encoded vectors in a frame,
where each index for a quantized parameter points to a location in one or
more of said identical tables where said quantized parameter may be found,
computing a zero-state response vector from the vector output of a
zero-input response filter comprising a scaling unit, synthesis filter and
weighting filter identical in operation to said scaling unit, synthesis
filter and weighting filter used for encoding said input vectors, said
zero-state response vector being computed for each vector in said
permanent codebook by first setting to zero the initial condition of said
zero-input response filter so that the response computed is not influenced
by a preceding one of said codebook vectors processed by said zero-input
response filter, and then using said quantized values of said residual
gain factor, set of linear predictive coefficient parameters, and said set
of perceptual weighting parameters computed in said initial analysis step
by processing each vector in said permanent codebook through said
zero-input response filter to compute a zero-state response vector, and
storing each zero-state response vector computed in a zero-state response
codebook at or together with an index corresponding to the index of said
vector in said permanent codebook used for this zero-state response
computation step, and
after thus performing an initial analysis of and computing a zero-state
response codebook for each successive frame of input speech or audio
vectors, encode each input vector s.sub.n of a frame in sequence by
transmitting the codebook index of the vector in said permanent codebook
which corresponds to the index of a zero-state response vector in said
zero-state response codebook that best matches a vector v.sub.n obtained
from an input vector s.sub.n by
subtracting a long term pitch prediction vector s.sub.n from the input
vector s.sub.n to produce a difference vector d.sub.n and filtering said
difference vector d.sub.n by said perceptual weighting filter to produce a
final input vector f.sub.n, where said long term pitch prediction s.sub.n
is computed by taking a vector from said permanent codebook at the address
specified by the preceding particular index transmitted as a compressed
vector code and performing gain scaling of this vector using said
quantized gain factor QG, then synthesis filtering the vector obtained
from said scaling using said quantized values QLPC of said set of linear
predictive coefficient parameters to obtain a vector d.sub.n and from
vector d.sub.n producing a long term pitch predicted vector s.sub.n of the
next input vector s.sub.n through a pitch synthesis filter using said
quantized values of pitch predictor parameters QP and QPP, said long term
prediction vector s.sub.n being a prediction of the next input vector
s.sub.n, and
producing said vector v.sub.n by subtracting from said final input vector
f.sub.n the vector output of said zero-input response filter generated in
response to a permanent codebook vector at the codebook address of the
last transmitted index code, said vector output being generated by
processing through said zero input response filter, said permanent
codebook vector located at said last transmitted index code where the
output of said zero input response filter is discarded while said
permanent codebook vector located at said last transmitted index code is
being processed sample by sample in sequence into said zero input response
filter until all samples of said codebook vector have been entered, and
where the input of said zero input response filter is interrupted after
all samples of said codebook vector have been entered and then the desired
vector output from said zero-input response filter is processed out sample
by sample for subtraction from said final vector v.sub.n, and
for each input vector s.sub.n in a frame, finding the vector stored in said
zero-state response codebook which best matches the vector v.sub.n,
thereby finding the best match of a codebook vector with an input vector,
using an estimate vector s.sub.n produced from the best match codebook
vector found for the preceding input vector,
having found the best match of said vector v.sub.n with a zero-state
response vector in said zero-state response codebook for an input speech
or audio vector s.sub.n, transmit the zero-state response codebook index
of the current best-match zero-state response vector as a compressed
vector code of the current input vector, and also use said index of the
current best-match zero-state response vector to select a vector from said
permanent codebook for computing said long term pitch predicted input
vector s.sub.n to be subtracted from the next input vector s.sub.n of the
frame.
2. An improvement as defined in claim 1, including a method for
reconstructing said input speech or audio vectors from index coded vectors
at a receiver, comprised of decoding said side information transmitted for
each frame of index coded vectors, using the indices received to address a
permanent codebook identical to said permanent codebook in said
transmitter to successively obtain decoded vectors, scaling said decoded
vectors by said quantized gain factor QG, and performing synthesis
filtering using said set of linear predictive coefficient parameters and
pitch prediction filtering using said quantized pitch parameters QP and
QPP to produce approximation vectors s.sub.n of the original signal
vectors s.sub.n.
3. An improvement as defined in claim 2 wherein said receiver includes
postfiltering of said approximation vectors s.sub.n by long-delay
postfiltering and short-delay postfiltering in cascade, said quantized
pitch and quantized pitch predictor parameters controlling said long-term
postfiltering and said quantized linear predictive coefficient parameters
controlling said short-term postfiltering, whereby adaptive postfiltered
digitally encoded speech or audio vectors are provided.
4. An improvement as defined in claim 3 including automatic gain control of
the adaptive postfiltered digitally encoded speech or audio signal is
provided by estimating the square root of the power of said postfiltered
speech or audio signal to obtain a value .sigma..sub.2 (n) of said
postfiltered speech or audio signal and estimating the square root of the
power of a postfiltering speech or audio signal input to obtain a value
.sigma..sub.1 (n) of decoded input speech or audio vectors before
postfiltering, and controlling the gain of the postfiltered speech or
audio output signal by a scaling factor that is a ratio of .sigma..sub.1
(n) to .sigma..sub.2 (n).
5. An improvement as defined in claim 4 wherein said quantized gain factor,
quantized pitch and quantized pitch predictor parameters, and quantized
linear predictive coefficient parameters are derived from said side
information transmitted to said receiver.
6. An improvement as defined in claim 3 wherein postfiltering is
accomplished by using a transfer function for said long-delay postfilter
of the form
##EQU8##
where C.sub.g is an adaptive scaling factor, p is the quantized value QP
of the pitch parameter P, and the factors .gamma. and .lambda. are
determined according to the following formulas
.gamma.=C.sub.z f(x), .lambda.=C.sub.p f(x), 0<C.sub.z, C.sub.p <1
where C.sub.z and C.sub.p are fixed scaling factors,
##EQU9##
U.sub.th is an unvoiced threshold value, and x is a voicing indicator
parameter that is a function of coefficients b.sub.1, b.sub.2 and b.sub.3,
where b.sub.1, b.sub.2, b.sub.3 are coefficients of said quantized pitch
predictor QPP given by P.sub.1 (z)=1-b.sub.1 z.sup.-p+1 -b.sub.2 z.sup.-p
-b.sub.3 z.sup.-p-1 where z is the inverse of the unit delay operator
z.sup.-1 used in the z transform representation of transfer functions.
7. An improvement as defined in claim 6 wherein postfiltering is
accomplished by using a transfer function for said short-delay postfilter
of the form
##EQU10##
where .alpha. and .beta. are bandwidth expansion coefficients.
8. An improvement as defined in claim 7 wherein postfiltering further
includes in cascade first-order filtering with a transfer function
1-.mu.z.sup.-1, .mu.<1
where .mu. is a coefficient.
9. A postfiltering method for enhancing digitally processed speech or audio
signals comprising the steps
of buffering said speech or audio signals into frames of vectors, each
vector having K successive samples,
performing analysis of said buffered frames of speech or audio signals in
predetermined blocks to compute linear predictive coefficients, pitch and
pitch predictor parameters, and
filtering each vector with long-delay and short-delay postfiltering in
cascade, said long-delay postfiltering being controlled by said pitch and
pitch predictor parameters and said short-delay postfiltering being
controlled by said linear predictive coefficient parameters, wherein
postfiltering is accomplished by using a transfer function for said
short-delay postfilter of the form
##EQU11##
where z is the inverse of the unit delay operator z.sup.-1 used in the z
transform representation of transfer functions, and .alpha. and .beta. are
fixed scaling factors.
10. A postfiltering method as defined in claim 9 including automatic gain
control of the postfiltered digitally encoded speech or audio signal
provided by estimating the square root of the power of said postfiltered
digitally encoded speech or audio signal to obtain a value .sigma..sub.2
(n) of said postfiltered speech signal and estimating the square root of
the power of a postfiltering input speech or audio signal to obtain a
value .sigma..sub.1 (n) of decoded input speech or audio signal before
postfiltering, and controlling the gain of the postfiltered speech or
audio signal by a scaling factor that is a ratio of .sigma..sub.1 (n) to
.sigma..sub.2 (n).
11. A postfiltering method as defined in claim 10 wherein postfiltering is
accomplished by using a transfer function for said long-delay postfilter
of the form
##EQU12##
where C.sub.g is an adaptive scaling factor, p is the quantized value of
the pitch parameter QP and the factors .gamma. and .lambda. are adaptive
bandwidth expansion parameters determined according to the following
formulas
.gamma.=C.sub.z f(x), .lambda.=C.sub.p f(x), 0<C.sub.z, C.sub.p <1
where C.sub.z and C.sub.p are fixed scaling factors and
##EQU13##
U.sub.th is an unvoiced threshold value, and x is a voicing indicator that
is a function of coefficients b.sub.1, b.sub.2, b.sub.3 where b.sub.1,
b.sub.2, b.sub.3 are coefficients of said quantized pitch predictor QPP
given by P.sub.1 (z)=1-b.sub.1 z.sup.-p+1 -b.sub.2 z.sup.-p -b.sub.3
z.sup.-p-1 where z is the inverse of the unit delay operator z.sup.-1
used in the z transform representation of transfer functions.
12. A postfiltering method as defined in claim 11 wherein postfiltering
further includes in cascade first-order filtering with a transfer function
1-.mu.z.sup.-1, .mu.<1
where .mu. is a coefficient.
Description
BACKGROUND OF THE INVENTION
This invention relates to a real-time coder for compression of digitally
encoded speech or audio signals for transmission or storage, and more
particularly to a real-time vector adaptive predictive coding system.
In the past few years, most research in speech coding has focused on bit
rates from 16 kb/s down to 150 bits/s. At the high end of this range, it
is generally accepted that toll quality can be achieved at 16 kb/s by
sophisticated waveform coders which are based on scalar quantization. N.
S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall Inc.,
Englewood Cliffs, N.J., 1984. At the other end, coders (such as
linear-predictive coders) operating at 2400 bits/s or below only give
synthetic-quality speech. For bit rates between these two extremes,
particularly between 4.8 kb/s and 9.6 kb/s, neither type of coder can
achieve high-quality speech. Part of the reason is that scalar
quantization tends to break down at a bit rate of 1 bit/sample. Vector
quantization (VQ), through its theoretical optimality and its capability
of operating at a fraction of one bit per sample, offers the potential of
achieving high-quality speech at 9.6 kb/s or even at 4.8 kb/s. J.
Makhoul, S. Roucos, and H. Gish, "Vector Quantization in Speech Coding,"
Proc. IEEE, Vol. 73, No. 11, November 1985.
Vector quantization (VQ) can achieve a performance arbitrarily close to the
ultimate rate-distortion bound if the vector dimension is large enough. T.
Berger, Rate Distortion Theory, Prentice-Hall Inc., Englewood Cliffs,
N.J., 1971. However, only small vector dimensions can be used in practical
systems due to complexity considerations, and unfortunately, direct
waveform VQ using small dimensions does not give adequate performance. One
possible way to improve the performance is to combine VQ with other data
compression techniques which have been used successfully in scalar coding
schemes.
In speech coding below 16 kb/s, one of the most successful scalar coding
schemes is Adaptive Predictive Coding (APC) developed by Atal and
Schroeder [B. S. Atal and M. R. Schroeder, "Adaptive Predictive Coding of
Speech Signals," Bell Syst. Tech. J., Vol. 49, pp. 1973-1986, October
1970; B. S. Atal and M. R. Schroeder, "Predictive Coding of Speech Signals
and Subjective Error Criteria," IEEE Trans. Acoust., Speech, Signal Proc.,
Vol. ASSP-27, No. 3, June 1979; and B. S. Atal, "Predictive Coding of
Speech at Low Bit Rates," IEEE Trans. Comm., Vol. COM-30, No. 4, April
1982]. It is the combined power of VQ and APC that led to the development
of the present invention, a Vector Adaptive Predictive Coder (VAPC). Such
a combination of VQ and APC will provide high-quality speech at bit rates
between 4.8 and 9.6 kb/s, thus bridging the gap between scalar coders and
VQ coders.
The basic idea of APC is to first remove the redundancy in speech waveforms
using adaptive linear predictors, and then quantize the prediction
residual using a scalar quantizer. In VAPC, the scalar quantizer in APC is
replaced by a vector quantizer VQ. The motivation for using VQ is
two-fold. First, although linear dependency between adjacent speech samples
is essentially removed by linear prediction, adjacent prediction residual
samples may still have nonlinear dependency which can be exploited by VQ.
Secondly, VQ can operate at rates below one bit per sample. This is not
achievable by scalar quantization, but it is essential for speech coding
at low bit rates.
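As a rough illustration of a rate below one bit per sample, the following sketch computes the number of index bits spent per speech sample by a vector quantizer. The example figures (a codebook of 128 vectors and 20 samples per vector) are the ones used as examples later in this description; the function name is hypothetical, and the side information sent once per frame is deliberately left out of the count.

```python
import math


def bits_per_sample(codebook_size, vector_dim):
    """Index bits spent per sample when one codebook index codes a whole vector."""
    return math.log2(codebook_size) / vector_dim


# 128 codebook vectors -> a 7-bit index; 20 samples per vector -> 0.35 bit/sample.
rate = bits_per_sample(128, 20)
print(rate)
```

A scalar quantizer needs at least one bit for every sample it codes, which is why fractional rates of this kind call for coding several samples with a single index.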
The vector adaptive predictive coder (VAPC) has evolved from APC and the
vector predictive coder introduced by V. Cuperman and A. Gersho, "Vector
Predictive Coding of Speech at 16 kb/s," IEEE Trans. Comm., Vol. COM-33,
pp. 685-696, July 1985. VAPC contains some features that are somewhat
similar to the Code-Excited Linear Prediction (CELP) coder by M. R.
Schroeder, B. S. Atal, "Code-Excited Linear Prediction (CELP):
High-Quality Speech at Very Low Bit Rates," Proc. Int'l. Conf. Acoustics,
Speech, Signal Proc., Tampa, March 1985, but with much less computational
complexity.
In computer simulations, VAPC gives very good speech quality at 9.6 kb/s,
achieving 18 dB of signal-to-noise ratio (SNR) and 16 dB of segmental SNR.
At 4.8 kb/s, VAPC also achieves reasonably good speech quality, and the
SNR and segmental SNR are about 13 dB and 11.5 dB, respectively. The
computations required to achieve these results are only in the order of 2
to 4 million flops per second (one flop, a floating point operation, is
defined as one multiplication, one addition, plus the associated
indexing), well within the capability of today's advanced digital
signal processor chips. VAPC may become a low-complexity alternative to
CELP, which is known to have achieved excellent speech quality at an
expected bit rate around 4.8 kb/s but is not presently capable of being
implemented in real-time due to its astronomical complexity. It requires
over 400 million flops per second to implement the coder. In terms of the
CPU time of a supercomputer CRAY-1, CELP requires 125 seconds of CPU time
to encode one second of speech. There is currently a great need for a
real-time, high-quality speech coder operating at encoding rates ranging
from 4.8 to 9.6 kb/s. In this range of encoding rates, the two coders
mentioned above (APC and CELP) are either unable to achieve high quality
or too complex to implement. In contrast, the present invention, which
combines Vector Quantization (VQ) with the advantages of both APC and
CELP, is able to achieve high-quality speech with sufficiently low
complexity for real-time coding.
OBJECTS AND SUMMARY OF THE INVENTION
An object of this invention is to encode in real time analog speech or
audio waveforms into a compressed bit stream for storage and/or
transmission, and subsequent reconstruction of the waveform for
reproduction.
Another object is to provide adaptive post-filtering of a speech or audio
signal that has been corrupted by noise resulting from a coding system or
other sources of degradation so as to enhance the perceived quality of
said speech or audio signal.
The objects of this invention are achieved by a system which approximates
each vector of K speech samples by using each of M fixed vectors stored in
a VQ codebook to excite a time-varying synthesis filter and picking the
best synthesized vector that minimizes a perceptually meaningful
distortion measure. The original sampled speech is first buffered and
partitioned into vectors and frames of vectors, where each frame is
partitioned into N vectors, each vector having K speech samples.
Predictive analysis of pitch-filtering parameters (P), linear-predictive
coefficient filtering parameters (LPC), perceptual weighting filter
parameters (W) and residual gain scaling factor (G) for each of successive
frames of speech is then performed. The parameters determined in the
analyses are quantized and reset every frame for processing each input
vector s.sub.n in the frame, except the perceptual weighting parameter. A
perceptual weighting filter responsive to the parameters W is used to help
select the VQ vector that minimizes the perceptual distortion between the
coded speech and the original speech. Although not quantized, the
perceptual weighting filter parameters are also reset every frame.
After each frame is buffered and the above analysis is completed at the
beginning of each frame, M zero-state response vectors are computed and
stored in a zero-state response codebook. These M zero-state response
vectors are obtained by first setting to zero the memory of an LPC
synthesis filter and a perceptual weighting filter in cascade with a
scaling unit controlled by the factor G, and then controlling the
respective filters with the quantized LPC filter parameters and the
unquantized perceptual weighting filter parameters, and exciting the
cascaded filters using one predetermined and fixed vector quantization
(VQ) codebook vector at a time. The output vector of the cascaded filters
for each VQ codebook vector is then stored in a temporary zero-state
codebook at the corresponding address, i.e., is assigned the same index of
a temporary zero-state response codebook as the index of the exciting
vector out of the VQ codebook. In encoding each input vector s.sub.n
within a frame, a pitch-predicted vector s.sub.n of the vector s.sub.n is
determined by processing the last vector encoded as an index code through
a scaling unit, LPC synthesis filter and pitch predictor filter controlled
by the parameters QG, QLPC, QP and QPP for the frame. In addition, the
zero-input response of the cascaded filters (the ringing from excitation
of a previous vector) is first set in a zero-input response filter. Once
the pitch-predicted vector s.sub.n is subtracted from the input signal
vector s.sub.n, and a difference vector d.sub.n is passed through the
perceptual weighting filter to produce a filtered difference vector
f.sub.n, the zero-input response vector in the aforesaid zero-input
response filter is subtracted from the output of the perceptual weighting
filter, namely the difference vector f.sub.n, and the resulting vector
v.sub.n is compared with each of the M stored zero-state response vectors
in search of the one having a minimum difference .DELTA. or distortion.
The index (address) of the zero-state response vector that produces the
smallest distortion, i.e., that is closest to v.sub.n, identifies the best
vector in the permanent VQ codebook. Its index (address) is transmitted as
the vector compressed code for the vector s.sub.n, and used by a receiver
which has an identical VQ codebook as the transmitter to find the
best-match vector. In the transmitter, that best-match vector is used at
the time of transmission of its index to excite the LPC synthesis filter
and pitch prediction filter to generate an estimate s.sub.n of the next
speech vector. The best-match vector is also used to excite the zero-input
response filter to set it for the next input vector s.sub.n to be
processed as described above. The indices of the best-match vectors for a
frame of vectors are combined in a multiplexer with the frame analysis
information hereinafter referred to as "side information," comprised of
the indices of quantized parameters which control pitch, pitch predictor
and LPC predictor filtering and the gain used in the coding process, in
order that it be used by the receiver in decoding the vector indices of a
frame into vectors using a codebook identical to the permanent VQ codebook
at the transmitter. This side information is preferably transmitted
through the multiplexer first, once for each frame of VQ indices that
follow, but it would be possible to first transmit a frame of vector
indices, and then transmit the side information since the frames of vector
indices will require some buffering in either case; the difference is only
in some initial delay at the beginning of speech or audio frames
transmitted in succession. The resulting stream of multiplexed indices is
transmitted over a communication channel to a decoder, or stored for later
decoding.
In the decoder, the bit stream is first demultiplexed to separate the side
information from the encoded vector indices that follow. Each encoded
vector index is used at the receiver to extract the corresponding vector
from the duplicate VQ codebook. The extracted vector is first scaled by
the gain parameter, using a table to convert the quantized gain index to
the appropriate scaling factor, and then used to excite cascaded LPC
synthesis and pitch synthesis filters controlled by the same side
information used in selecting the best-match index utilizing the
zero-state response codebook in the transmitter. The output of the pitch
synthesis filter is the coded speech, which is perceptually close to the
original speech. All of the side information, except the gain information,
is used in an adaptive postfilter to enhance the quality of the speech
synthesized. This postfiltering technique may be used to enhance any voice
or audio signal. All that would be required is an analysis section to
produce the parameters used to make the postfilter adaptive.
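The decoding chain just described (codebook lookup, gain scaling, LPC synthesis, pitch synthesis) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function and argument names are hypothetical, the LPC synthesis filter is assumed to be 1/[1-P(z)], and the pitch synthesis filter is assumed to be the inverse of the three-tap pitch predictor form given in the claims, i.e. s(n) = d(n) + b1 s(n-p+1) + b2 s(n-p) + b3 s(n-p-1), with pitch p of at least 2. Postfiltering is omitted.

```python
def decode_frame(indices, codebook, qg, qlpc, pitch, taps,
                 lpc_hist=None, pitch_hist=None):
    """Decode one frame of VQ indices into output samples.

    lpc_hist[j] and pitch_hist[j] hold the filter output j+1 samples ago
    (most recent first); both default to all-zero memory.
    """
    lpc_hist = list(lpc_hist or [0.0] * len(qlpc))
    pitch_hist = list(pitch_hist or [0.0] * (pitch + 1))
    b1, b2, b3 = taps
    out = []
    for idx in indices:
        for v in codebook[idx]:          # vector fetched from the duplicate codebook
            e = qg * v                   # gain scaling
            # LPC synthesis: d(n) = e(n) + sum_i a_i d(n-i)
            d = e + sum(a * h for a, h in zip(qlpc, lpc_hist))
            lpc_hist = ([d] + lpc_hist)[:len(qlpc)]
            # Pitch synthesis with taps at lags p-1, p and p+1
            s = (d + b1 * pitch_hist[pitch - 2]
                   + b2 * pitch_hist[pitch - 1]
                   + b3 * pitch_hist[pitch])
            pitch_hist = ([s] + pitch_hist)[:pitch + 1]
            out.append(s)
    return out


decoded = decode_frame([0], [[1.0, 0.0, 0.0, 0.0]], qg=2.0, qlpc=[0.5],
                       pitch=2, taps=(0.0, 0.5, 0.0))
print(decoded)
```

In a running decoder the two history lists would be carried from one frame to the next rather than reset, since both filters have memory that spans frame boundaries.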
Other modifications and variations to this invention may occur to those
skilled in the art, such as variable-frame-rate coding, fast codebook
searching, reversal of the order of pitch prediction and LPC prediction,
and use of alternative perceptual weighting techniques. Consequently, the
claims which define the present invention are intended to encompass such
modifications and variations.
Although the purpose of this invention is to encode for transmission and/or
storage of analog speech or audio waveforms for subsequent reconstruction
of the waveforms upon reproduction of the speech or audio program,
reference is made hereinafter only to speech, but the invention described
and claimed is applicable to audio waveforms or to sub-band filtered
speech or audio waveforms.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1a is a block diagram of a Vector Adaptive Predictive Coding (VAPC)
processor embodying the present invention, and
FIG. 1b is a block diagram of a receiver for the encoded speech transmitted
by the system of FIG. 1a.
FIG. 2 is a schematic diagram that illustrates the adaptive computation of
vectors for a zero-state response codebook in the system of FIG. 1a.
FIG. 3 is a block diagram of an analysis processor in the system of FIG.
1a.
FIG. 4 is a block diagram of an adaptive post filter of FIG. 1b.
FIG. 5 illustrates the LPC spectrum and the corresponding frequency
response of an all-pole post-filter 1/[1-P(z/.alpha.)] for different
values of .alpha.. The offset between adjacent plots is 20 dB.
FIG. 6 illustrates the frequency responses of the postfilter
[1-.mu.z.sup.-1 ][1-P(z/.beta.)]/[1-P(z/.alpha.)] corresponding to the
LPC spectrum shown in FIG. 5. In both plots, .alpha.=0.8 and .beta.=0.5.
The offset between the two plots is 20 dB.
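The terms P(z/.alpha.) and P(z/.beta.) in the postfilter expressions of FIGS. 5 and 6 follow directly from the predictor coefficients. Assuming the predictor is written P(z) = a.sub.1 z.sup.-1 + ... + a.sub.p z.sup.-p (an assumption about notation; the coefficient names here are hypothetical), replacing z by z/.alpha. multiplies the i-th coefficient by .alpha. raised to the power i. A minimal sketch:

```python
def bandwidth_expand(lpc, factor):
    """Coefficients of P(z/factor), given the coefficients a_1..a_p of P(z).

    Since (z/factor)^-i = factor^i * z^-i, coefficient a_i becomes a_i * factor^i.
    """
    return [a * factor ** (i + 1) for i, a in enumerate(lpc)]


# Example with a made-up second-order predictor and a factor of 0.5.
expanded = bandwidth_expand([1.0, -0.5], 0.5)
print(expanded)
```

With a factor below 1 the higher-order coefficients shrink fastest, which pulls the filter's poles or zeros toward the origin and flattens its spectral peaks.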
DESCRIPTION OF PREFERRED EMBODIMENTS
The preferred mode of implementation contemplates using programmable
digital signal processing chips, such as one or two AT&T DSP32 chips, and
auxiliary chips for the necessary memory and controllers for such
equipments as input sampling, buffering and multiplexing. Since the system
is digital, it is synchronized throughout with the samples. For simplicity
of illustration and explanation, the synchronizing logic is not shown in
the drawings. Also for simplification, at each point where a signal vector
is subtracted from another, the subtraction function is symbolically
indicated by an adder represented by a plus sign within a circle. The
vector being subtracted is on the input labeled with a minus sign. In
practice, the two's complement of the subtrahend is formed and added to
the minuend. However, although the preferred implementation contemplates
programmable digital signal processors, it would be possible to design and
fabricate special integrated circuits using VLSI techniques to implement
the present invention as a special purpose, dedicated digital signal
processor once the quantities needed would justify the initial cost of
design.
Referring to FIG. 1a, original speech samples in digital form from sampling
analog-to-digital converter 10 are received by an analysis processor 11
which partitions them into vectors s.sub.n of K samples per vector, and
into frames of N vectors per frame. The analysis processor stores the
samples in a dual buffer memory which has the capacity for storing more
than one frame of vectors, for example two frames of 8 vectors per frame,
each vector consisting of 20 samples, so that the analysis processor may
compute parameters used for coding the stored frame. As each frame is
being processed out of one buffer, a new frame coming in is stored in the
other buffer so that when processing of a frame has been completed, there
is a new frame buffered and ready to be processed.
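The grouping of samples into vectors and frames can be sketched as below. The function name is hypothetical, and dropping a trailing incomplete frame is a simplification of this sketch; a real-time coder would instead wait in its second buffer until the frame fills.

```python
def partition(samples, k, n):
    """Split samples into frames of n vectors, each vector holding k samples.

    Trailing samples that do not fill a whole frame are dropped in this sketch.
    """
    frame_len = k * n
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = [samples[start + j * k: start + (j + 1) * k] for j in range(n)]
        frames.append(frame)
    return frames


# With the example figures above (K = 20, N = 8) one frame spans 160 samples.
frames = partition(list(range(330)), k=20, n=8)
print(len(frames), len(frames[0]), len(frames[0][0]))
```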
The analysis processor 11 determines the parameters of filters employed in
the Vector Adaptive Predictive Coding (VAPC) technique that is the subject
of this invention. These parameters are transmitted through a multiplexer
12 as side information just ahead of the frame of vector codes generated
with the use of a permanent vector quantized (VQ) codebook 13 and a
zero-state response (ZSR) codebook 14. The side information conditions the
receiver to properly filter decoded vectors of the frame. The analysis
processor 11 also computes other parameters used in the encoding process.
The latter are represented in FIG. 1a by labeled lines, and consist of
sets of parameters which are designated W for a perceptual weighting
filter 18, a quantized LPC predictor QLPC for an LPC synthesis filter 15,
and quantized pitch QP and pitch predictor QPP for a pitch synthesis
filter 16. Also computed by the analysis processor is a scaling factor G
that is quantized to QG for control of a scaling unit 17. The four
quantized parameters transmitted as side information are encoded for
transmission using a quantizing table as the quantized pitch index, pitch
predictor index, LPC predictor index and gain index. The manner in which
the analysis processor computes all of these parameters will be described
with reference to FIG. 3.
The multiplexer 12 preferably transmits the side information as soon as it
is available, although it could follow the frame of encoded input vectors,
and while that is being done, M zero-state response vectors are computed
for the zero-state response (ZSR) codebook 14 in a manner illustrated in
FIG. 2, which is to process each vector in the VQ codebook 13, e.g., 128
vectors, through a gain scaling unit 17', an LPC synthesis filter 15', and
a perceptual weighting filter 18' corresponding to the gain scaling unit
17, the LPC synthesis filter 15, and perceptual weighting filter 18 in the
transmitter (FIG. 1a). Ganged commutating switches S.sub.1 and S.sub.2 are
shown to signify that each fixed VQ vector processed is stored in memory
locations of the same index (address) in the ZSR codebook.
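The per-frame construction of the ZSR codebook can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: all names are hypothetical, the LPC synthesis filter is taken to be 1/[1-P(z)], and the perceptual weighting filter is represented by a generic pole-zero pair of coefficient lists (w_num, w_den), since its exact form is not given at this point in the text.

```python
def filter_zero_state(num, den, x):
    """Filter x through (1 - sum num[i] z^-(i+1)) / (1 - sum den[i] z^-(i+1)),
    starting from all-zero memory (no ringing from earlier inputs)."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i, c in enumerate(num):
            if n - 1 - i >= 0:
                acc -= c * x[n - 1 - i]
        for i, c in enumerate(den):
            if n - 1 - i >= 0:
                acc += c * y[n - 1 - i]
        y.append(acc)
    return y


def build_zsr_codebook(vq_codebook, qg, qlpc, w_num, w_den):
    """Return one zero-state response per VQ vector, stored at the same index."""
    zsr = []
    for code_vector in vq_codebook:
        scaled = [qg * v for v in code_vector]              # scaling unit
        synth = filter_zero_state([], qlpc, scaled)         # LPC synthesis filter
        zsr.append(filter_zero_state(w_num, w_den, synth))  # weighting filter
    return zsr


# Toy example: two 3-sample code vectors, identity weighting (empty lists).
zsr = build_zsr_codebook([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                         qg=2.0, qlpc=[0.5], w_num=[], w_den=[])
print(zsr)
```

Because each call starts from zero memory, the stored vectors depend only on the frame's parameters, so the table is computed once per frame and reused for every input vector in that frame.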
At the beginning of each codebook vector processing, the initial conditions
of the cascaded filters 15' and 18' are set to zero. This simulates what
the cascaded filters 15' and 18' will do with no previous vector present
from its corresponding VQ codebook. Thus, if the output of a zero-input
response filter 19 in the transmitter (FIG. 1a) is held or stored at
each step of computing the VQ code index (to transmit for each vector of a
frame), it is possible to simplify encoding the speech vectors by
subtracting the zero-input response output from the vector f.sub.n. In
other words, assuming M=128, there are 128 different vectors permanently
stored in the VQ codebook to use in coding the original speech vectors
s.sub.n. Then every one of the 128 VQ vectors is read out in sequence, fed
through the scaling unit 17', the LPC synthesis filter 15', and the
perceptual weighting filter 18' shown in FIG. 2 without any history of
previous vector inputs (i.e., without any ringing due to excitation by a
preceding vector) by resetting those filters at each step. The resulting
filter output vector is then stored in a corresponding location in the
zero-state response codebook 14. Later, while encoding input signal
vectors s.sub.n by finding the best match between a vector v.sub.n and all
of the zero-state response vector codes, it is necessary to subtract from
a vector f.sub.n derived from the perceptual weighting filter a value that
corresponds to the effect of the previously selected VQ vector. That is
done through the zero-input response filter 19. The index (address) of the
best match is used as the compressed vector code transmitted for the
vector s.sub.n. Of the 128 zero-state response vectors, there will be only
one that provides the best match, i.e., least distortion. Assume it is in
location 38 of the zero-state response codebook as determined by a
computer 20 labeled "compute norm." An address register 20a will store the
index 38. It is that index that is then transmitted as a VQ index to the
receiver shown in FIG. 1b.
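
The construction of the ZSR codebook described above can be sketched as
follows. This is a minimal illustration, not the patented implementation:
the class PoleZeroFilter, the function build_zsr_codebook, and all
coefficient values are hypothetical names and numbers chosen for the
example, and the filters are modeled as generic direct-form recursions.

```python
class PoleZeroFilter:
    """Generic filter: y[n] = sum_i b[i]*x[n-i] + sum_j a[j]*y[n-1-j].

    Hypothetical stand-in for the LPC synthesis filter 15' or the
    perceptual weighting filter 18'; the coefficients are assumptions.
    """

    def __init__(self, b, a):
        self.b, self.a = list(b), list(a)
        self.reset()

    def reset(self):
        # Zero initial conditions: no history of previous vectors.
        self.x_mem = [0.0] * (len(self.b) - 1)  # past inputs, newest first
        self.y_mem = [0.0] * len(self.a)        # past outputs, newest first

    def step(self, x):
        y = self.b[0] * x
        y += sum(bi * xm for bi, xm in zip(self.b[1:], self.x_mem))
        y += sum(aj * ym for aj, ym in zip(self.a, self.y_mem))
        if self.x_mem:
            self.x_mem = [x] + self.x_mem[:-1]
        if self.y_mem:
            self.y_mem = [y] + self.y_mem[:-1]
        return y


def build_zsr_codebook(vq_codebook, gain, synth, weight):
    """Return one zero-state response vector per VQ codebook vector.

    Entry i of the result corresponds to entry i of vq_codebook, so the
    same index addresses both codebooks.
    """
    zsr = []
    for codevector in vq_codebook:
        synth.reset()   # reset both filters before each codebook vector
        weight.reset()
        zsr.append([weight.step(synth.step(gain * s)) for s in codevector])
    return zsr
```

In this sketch the codebook would be rebuilt once per frame, after the
quantized gain and filter parameters for that frame are known.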
In the receiver, a demultiplexer 21 separates the side information, which
conditions the receiver with the same parameters as the corresponding
filters and scaling unit of the transmitter. The receiver uses a decoder 22 to
translate the parameter indices to parameter values. The VQ index for each
successive vector in the frame addresses a VQ codebook 23 which is
identical to the fixed VQ codebook 13 of the transmitter. The LPC
synthesis filter 24, pitch synthesis filter 25, and scaling unit 26 are
conditioned by the same parameters which were used in computing the
zero-state codebook values, and which were in turn used in the process of
selecting the encoding index for each input vector. At each step of
finding and transmitting an encoding index, the zero-input response filter
19 computes from the VQ vector at the location of the index transmitted a
value to be subtracted from the input vector f.sub.n to present a
zero-input response to be used in the best-match search.
There are various procedures that may be used to determine the best match
for an input vector s.sub.n. The simplest is to store the resulting
distortion between each zero-state response vector code output and the
vector v.sub.n with the index of that zero-state response vector code.
Assuming there are 128 vector codes stored in the codebook 14, there would
then be 128 resulting distortions stored in a computer 20. Then, after all
have been stored, a search is made in the computer 20 for the lowest
distortion value. The index (address) of that lowest distortion value is
then stored in a register 20a and transmitted to the receiver as an
encoded vector via the multiplexer 12, and to the VQ codebook for reading
the corresponding VQ vector to be used in the processing of the next input
vector s.sub.n.
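
The simplest search procedure just described can be sketched as follows.
The function name best_match_index is hypothetical, and the squared
Euclidean distance is an assumption made for the example; the text above
states only that a distortion is computed by the "compute norm" block.

```python
def best_match_index(v, zsr_codebook):
    """Store one distortion per ZSR vector, then pick the lowest.

    v            -- target vector v_n (list of K samples)
    zsr_codebook -- list of zero-state response vectors, each of length K
    Returns (index, distortion) of the best match.
    """
    distortions = []
    for index, z in enumerate(zsr_codebook):
        # Assumed distortion measure: squared Euclidean distance.
        d = sum((vi - zi) ** 2 for vi, zi in zip(v, z))
        distortions.append((d, index))
    # After all distortions are stored, search for the lowest one.
    # Ties resolve to the lower index because tuples compare in order.
    best_distortion, best_index = min(distortions)
    return best_index, best_distortion
```

The returned index plays the role of the value held in the register 20a;
the same index would address the VQ codebook to fetch the vector used in
processing the next input vector.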
In summary, it should be noted that the VQ codebook is used (accessed) in
two different steps: first, to compute vector codes for the zero-state
response codebook at the beginning of each frame, using the LPC synthesis
and perceptual weighting filter parameters determined for the frame; and
second, to excite the filters 15 and 16 through the scaling unit 17 while
searching for the index of the best-match vector, during which the estimate
s.sub.n thus produced is subtracted from the input vector s.sub.n. The
difference d.sub.n is used in the best-match search.
As the best match for each input vector s.sub.n is found, the corresponding
predetermined and fixed vector from the VQ codebook is used to reset the
zero-input response filter 19 for the next vector of the frame. The
function of the zero-input response filter 19 is thus to find the residual
response of the gain scaling unit 17' and filters 15' and 18' to
previously selected vectors from the VQ codebook. Thus, the selected
vector itself is not transmitted; only its index is transmitted, and that
index is used to read out the selected vector from a VQ codebook 23
identical to the VQ codebook 13 in the transmitter.
The zero-input response filter 19 is the same filtering operation that is
used to generate the ZSR codebook 14, namely the combination of a gain G,
an LPC synthesis filter and a weighting filter, as shown in FIG. 2. Once a
best codebook vector match is determined, the best-match vector is applied
as an input to this filter (sample by sample, sequentially). An input
switch s.sub.in is closed and an output switch s.sub.out is open during
this time so that the first K output samples are ignored (K is the
dimension of the vector, and a typical value of K is 20). As soon as all K
samples have been applied as inputs to the filter, the filter input switch
s.sub.in is opened and the output switch s.sub.out is closed. The next K
samples of the vector f.sub.n, the output of the perceptual weighting
filter, begin to arrive, and the K samples then emerging from the
zero-input response filter are subtracted from them. The difference so
generated is a set of K samples forming the
vector v.sub.n which is stored in a static register for use in the ZSR
codebook search procedure. In the ZSR codebook search procedure, the
vector v.sub.n is subtracted from each vector stored in the ZSR codebook,
and the difference vector A is fed to the computer 20 together with the
index (or stored in the same order, thereby to imply the index of the
vector out of the ZSR codebook). The computer 20 then determines which
difference is the smallest, i.e., which is the best match between the
vector v.sub.n and each vector stored temporarily (for one frame of input
vectors s.sub.n). The index of that best-match vector is stored in a
register 20a. That index is transmitted as a vector code and used to
address the VQ codebook to read the vector stored there into the scaling
unit 17, as noted above. This search process is repeated for each vector
in the ZSR codebook, each time using the same vector v.sub.n. Then the
best vector is determined.
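
The two-phase operation of the zero-input response filter described above
can be sketched as follows. This is an illustration under stated
assumptions: the cascade of LPC synthesis and weighting filters is replaced
by a single hypothetical all-pole recursion with coefficients a, the
function names are invented for the example, and carrying the filter memory
from the end of the driven phase into the next vector is a choice made
here, not something the text specifies.

```python
def zero_input_response(selected_vector, gain, a, state):
    """Return (ringing, driven_state) for one selected VQ vector.

    a     -- coefficients of an assumed all-pole recursion
             y[n] = x[n] + sum_k a[k] * y[n-1-k]
    state -- past outputs, newest first, same length as a
    """
    mem = list(state)
    # Phase 1: s_in closed, s_out open. Drive the filter with the K
    # samples of the selected vector and ignore the K outputs.
    for s in selected_vector:
        out = gain * s + sum(ak * mk for ak, mk in zip(a, mem))
        mem = [out] + mem[:-1]
    driven_state = list(mem)  # assumed memory to carry to the next vector
    # Phase 2: s_in open, s_out closed. With zero input, collect the next
    # K output samples, i.e. the ringing left by the selected vector.
    ringing = []
    for _ in selected_vector:
        out = sum(ak * mk for ak, mk in zip(a, mem))
        ringing.append(out)
        mem = [out] + mem[:-1]
    return ringing, driven_state


def target_vector(f, ringing):
    """v_n: weighted input f_n minus the zero-input response."""
    return [fi - ri for fi, ri in zip(f, ringing)]
```

The vector returned by target_vector is the one that would then be compared
against every entry of the ZSR codebook in the search.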
Referring now to FIG. 1b, it should be noted that the output of the VQ
codebook 23, which precisely duplicates the VQ codebook 13 of the
transmitter, is identical to the vector extracted from the best-match
index applied as an address to the VQ codebook 13; the gain unit 26 is
identical to the gain unit 17 in the transmitter, and filters 24 and 25
exactly duplicate the filters 15 and 16, respectively, except that at the
receiver, the approximation s.sub.n rather than the prediction s.sub.n is
taken as the output of the pitch synthesis filter 25. The result, after
converting from digital to analog form, is synthesized speech that
reproduces the original speech with very good quality.
It has been found that by applying an adaptive postfilter 30 to the
synthesized speech before converting it from digital to analog form, the
perceived coding noise may be greatly reduced without introducing
significant distortion in the filtered speech. FIG. 4 illustrates the
organization of the adaptive postfilter as a long-delay filter 31 and a
short-delay filter 32. Both filters are adaptive in that the parameters
used in them are those received as side information from the transmitter,
except for the gain parameter, G. The basic idea of adaptive
post-filtering is to attenuate the frequency components of the coded
speech in spectral valley regions. At low bit rates, a considerable amount
of perceived coding noise comes from spectral valley regions where there
are no strong resonances to mask the noise. The postfilter attenuates the
noise components in spectral valley regions to make the coding noise less
perceivable. However, such a filtering operation inevitably introduces some
distortion to the shape of the speech spectrum. Fortunately, our ears are
not very sensitive to distortion in spectral valley regions; therefore,
adaptive postfiltering only introduces very slight distortion in perceived
speech, but it significantly reduces the perceived noise level. The
adaptive postfilter will be described in greater detail after first
describing in more detail the analysis of a frame of vectors to determine
the side information.
Referring now to FIG. 3, the organization of the initial analysis of
block 11 in FIG. 1a is shown. The input speech samples s.sub.n are first stored
in a buffer 40 capable of storing, for example, more than one frame of 8
vectors, each vector ha