|
Claims  |
|
|
What is claimed is:
1. An improvement in the method for compressing digitally encoded speech or
audio signal by using a permanent indexed codebook of N predetermined
excitation vectors of dimension k, each having an assigned codebook index
j to find indices which identify the best match between an input speech
vector s.sub.n that is to be coded and a vector c.sub.j from a codebook,
where the subscript j is an index which uniquely identifies a codevector
in said codebook, and the index of which is to be associated with the
vector code, comprising the steps of
buffering and grouping said vectors into frames of L samples, with L/k
vectors for each frame,
performing initial analyses for each successive frame to determine a set of
parameters for specifying long-term synthesis filtering, short-term
synthesis filtering, and perceptual weighting,
computing a zero-input response of a long-term synthesis filter, short-term
synthesis filter, and perceptual weighting filter,
perceptually weighting each input vector s.sub.n of a frame and subtracting
from each input vector s.sub.n said zero input response to produce a
vector z.sub.n,
obtaining each codevector c.sub.j from said codebook one at a time and
processing each codevector c.sub.j through a scaling unit, said unit being
controlled by a gain factor G.sub.j, and further processing each scaled
codevector c.sub.j through a long-term synthesis filter, short-term
synthesis filter and perceptual weighting filter in cascade, said cascaded
filters being controlled by said set of parameters to produce a set of
estimates z.sub.j of said vector z.sub.n, one estimate for each codevector
c.sub.j,
finding the estimate z.sub.j which best matches the vector z.sub.n,
computing a quantized value of said gain factor G.sub.j using said vector
z.sub.n and the estimate z.sub.j which best matches z.sub.n,
pairing together the index j of the estimate z.sub.j which best matches
z.sub.n and said quantized value of said gain factor G.sub.j as index-gain
pairs for later reconstruction of said digitally encoded speech or audio
signal,
associating with each frame said index-gain pairs from said frame along
with the quantized values of said parameters otained by initial analysis
for use in specifying long-term synthesis filtering and short-term
synthesis filtering in said reconstruction of said digitally encoded
speech or audio signal, and
during said reconstruction, reading out of a codebook a codevector c.sub.j
that is identical to the codevector c.sub.j used for finding said best
estimate by processing said reconstruction codevector c.sub.j through said
scalar and said cascaded long-term and short-term synthesis filters.
2. An improvement in the method for compressing digitally encoded speech as
defined in claim 1 wherein said codebooks are made sparse by extracting
vectors from an initial arbitrary codebook, one at a time, and setting all
but a selected number of samples of highest amplitude values in each
vector to zero amplitude values, thereby generating a sparse vector with
the same number of samples as the initial vector, but with only said
selected number of samples having nonzero values.
3. An improvement in the method for compressing digitally encoded speech as
defined in claim 1 by use of a codebook to store vectors c.sub.j, where
the subscript j is an index for each vector stored, a method for designing
an optimum codebook using an initial arbitrary codebook and a set of m
speech training vectors sn by producing for each vector s.sub.n in
sequence said perceptually weighted vector z.sub.n, clustering said m
vectors z.sub.n, calculating N centroid vectors from said m clustered
vectors, where N<m, update said codebook by replacing N vectors c.sub.j
with vector s.sub.n used to produce vector z.sub.n found to be a best
match with said vector z.sub.j at index location j, and testing for
convergence between the updated codebook and said set of m speech training
vectors s.sub.n, and if convergence has not been achieved, repeating the
process using the updated codebook until convergence is achieved.
4. An improvement as defined in claim 3, including a final step of center
clipping vectors in the last updated codebook vector by setting to zero
all but a selected number of samples of lowest amplitude in each vector
c.sub.j, and leaving in each vector c.sub.j only said selected number of
samples of highest amplitude by extracting the vectors of said last
updated codebook, one at a time, and setting all but a selected number of
samples of highest amplitude values in each vector to amplitude values of
zero, thereby generating a sparse vector with the same number of samples
as the last updated vector, but with only said selected number of samples
having nonzero values.
5. An improvement as defined in claim 1 comprising a two-step fast search
method wherein the first step is to classify a current speech frame prior
to compressing by selecting one of a plurality of classes to which the
current speech frame belongs, and the seocnd step is to use a selected one
of a plurality of reduced sets of codevectors to find the best match been
each input vector z.sub.i and one of the codevectors of said selected
reduced set of codevectors having a unique correspondence between every
codevector in the set and particular vectors in said permanent indexed
codebook, whereby a reduced exhaustive search is achieved for processing
each input vector z.sub.i of a frame by first classifying the frame and
then using a reduced codevector set selected from the permanent index
codebook for every input vector of the frame.
6. An improvement as defined in claim 5 wherein classification of each
frame is carried out by examining the spectral envelope parameters of the
current frame and comparing said spectral envelope parameters with stored
vector parameters for all classes in order to select one of said plurality
of reduced sets of codevectors.
7. An improvement as defined in claim 1, wherein the step of computing said
quantized value of said gain factor G.sub.j and the estimate that best
matches z.sub.n is carried out by calculating the cross-correlation
between the estimate z.sub.j and said vector z.sub.n, and dividing the
cross-correlation product of said vector z.sub.n and said estima z.sub.j
in accordance with the following equation:
##EQU11##
where k is the number of samples in a vector.
8. An improvement in the method for compressing digitally encoded speech or
audio signal by using a permanent indexed codebook of N predetermined
excitation vectors of dimension k, each having an assigned codebook index
j to find indices which identify the best match between an input speech
vector s.sub.n that is to be coded and a vector c.sub.j from a codebook,
where the subscript j is an index which uniquely identifies a codevector
in said codebook, and the index of which is to be associated with the
vector code, comprising the steps of
designing said codebook to have sparse vectors by extracting vectors from
an initial arbitrary codebook, one at a time, and setting to zero value
all but a selected number of samples of highest amplitude values in each
vector, thereby generating a sparse vector with the same number of samples
as the initial vector, but with only said selected number of samples
having nonzero values,
buffering and grouping said vectors into frames of L samples, with L/k
vectors for each frame,
performing initial analyzes for each successive frame to determine a set of
parameters for specifying long-term synthesis filtering, short-term
synthesis filtering, and perceptual weighting,
computing a zero-input response of a long-term synthesis filter, short-term
synthesis fiIter, and perceptual weighting filter,
perceptually weighting each input vector s.sub.n of a frame and subtracting
from each input vector s.sub.n said zero input response to produce a
vector z.sub.n,
obtaining each codevector c.sub.j from said codebook one at a time and
processing each codevector c.sub.j through a scaling unit, said unit being
controlled by a gain factor G.sub.j, and further processing each scaled
codevector c.sub.j through a long-term synthesis filter, short-term
synthesis filter, said cascaded filters being controlled by said set of
parameters to produce a set of estimates z.sub.j of said vector z.sub.n,
one estimate for each codevector c.sub.j,
finding the estimate z.sub.j which best matches the vector z.sub.n,
computing a quantized value of said gain factor G.sub.j using said vector
z.sub.n and the estimate z.sub.j which best matches z.sub.n
pairing together the index j of the estimate z.sub.j which best matches
z.sub.n and said quantized value of said gain factor G.sub.j for later
reconstruction of said digitally encoded speech or audio signal,
associating with each frame said index-gain pairs from said frame along
with the quantized values of said parameters obtained by initial analysis
for use in specifying long-term synthesis filtering and short-term
synthesis filtering in said reconstruction of said digitally encoded
speech or audio signal, and
during said reconstruction, reading out of a codebook a codevector c.sub.j
that is identical codevector c.sub.j used for finding said best estimate
by processing said reconstruction codevector c.sub.j through said scalar
and said cascaded long-term and short-term synthesis filters.
9. An improvement in the method for compressing digitally encoded speech as
defined in claim 8 by use of a codebook to store vectors c.sub.j, where
the subscript j is an index for each vector stored, a method for designing
an optimum codebook using an initial arbitrary codebook and a set of m
speech training vectors s.sub.n by producing for each vector s.sub.n in
sequence said perceptually weighted vector z.sub.n, clustering said m
vectors z.sub.n, calculating N centroid vectors from said m clustered
vectors, where N<m, update said codebook by replacing N vectors c.sub.j
with vector s.sub.n used to produce vector z.sub.n found to be a best
match with said vector z.sub.j at index location j, and testing for
convergence between the updated codebook and said set of m speech training
vectors s.sub.n, and if convergence has not been achieved, repeating the
process using the updated codebook until convergence is achieved.
10. An improvement as defined in claim 9, including a final s of extracting
the last updated vectors, one at a time, and setting to zero value all but
a selected number of samples of highest amplitude values in each vector,
thereby generating a sparse vector with the same number of samples as the
last updated vetor, but with only said selected number of samples with
nonzero values.
11. An improvement as defined in claim 8 comprising a fast search method
using said codebook to select a number N.sub.c of good excitation vectors
c.sub.j, where N.sub.c is much smaller than N, and using said vectors
N.sub.c for an exhaustive search to find the best match between said
vector z.sub.n and estimate vector z.sub.j produced from a codevector
c.sub.j included in said N.sub.c codebook vectors by precomputing N
vectors z.sub.j, comparing an input vector z.sub.n with vectors z.sub.j,
and producing a codebook of N.sub.c codevectors for use in an exhaustive
search of the best match between said input vector z.sub.n and a vector
z.sub.j from a codebook of N.sub.c vectors.
12. An improvement as defined in claim 11 wherein said N.sub.c codebook is
produced by making rough classification of the gain-normalized spectral
shape of a current speech frame into one of M.sub.s spectral shape
classes, and selecting one of M.sub.s shaped codebooks for encoding an
input vector z.sub.n by comparing said input vector with the z.sub.j
vectors stored in the selected one of the M.sub.s shaped codebooks, and
then taking the N.sub.c condevectors which produce the N.sub.c smallest
errors for use in said N.sub.c codebook. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
This invention relates to a vector excitation coder which efficiently
compresses vectors of digital voice or audio for transmission or for
storage, such as on magnetic tape or disc.
In recent developments of digital transmission of voice, it has become
common practice to sample at 8 kHz and to group the samples into blocks of
samples. Each block is commonlY referred to as a "vector" for a type of
coding processing called Vector Excitation Coding (VXC). It is a powerful
new technique for encoding analog speech or audio into a digital
representation. Decoding and reconstruction of the original analog signal
permits quality reproduction of the original signal.
Briefly, the prior art VXC is based on a new and general source-filter
modeling technique in which the excitation signal for a speech production
model is encoded at very low bit rates using vector quantization. Various
architectures for speech coders which fall into this class have recently
been shown to reproduce speech with very high perceptual quality.
In a generic VXC coder, a vocal-tract model is used in conjunction with a
set of excitation vectors (codevectors) and a perceptually-based error
criterion to synthesize natural-sounding speech. One example of such a
coder is Code Excited Linear Prediction (CELP), which uses Gaussian random
variables for the codevector components. M. R. Schroeder and B. S. Atal,
"Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low
Bit Rates," Proceedings Int'l. Conference on Acoustics, Speech, and Signal
Processing, Tampa, March, 1985 and M. Copperi and D. Sereno, "CELP Coding
for High-Quality Speech at 8 kbits/s," Proceedings Int'l. Conference on
Acoustics, Speech, and Signal Processing, Tokyo, April, 1986. CELP
achieves very high reconstructed speech quality, but at the cost of
astronomic computational complexity (around 440 million multiply/add
operations per second for real-time selection of the optimal codevector
for each speech block).
In the present invention, VXC is employed with a sparse vector excitation
to achieve the same high reconstructed speech quality as comparable
schemes, but with significantly less computation. This new coder is
denoted Pulse Vector Excitation Coding (PVXC). A variety of novel
complexity reduction methods have been developed and combined, reducing
optimal codevector selection computation to only 0.55 million
multiply/adds per second, which is well within the capabilities of present
data processors. This important characteristic makes the hardware
implementation of a real-time PVXC coder possible using only one
programmable digital signal processor chip, such as the AT&T DSP32.
Implementation of similar speech coding algorithms using either
programmable processors or high-speed, special-purpose devices is feasible
but very impractical due to the large hardware complexity required.
Although PVXC of the present invention employs some characteristics of
multipulse linear predictive coding (MPLPC) where excitation pulse
amplitudes and locations are determined from the input speech, and some
characteristics of CELP, where Gaussian excitation vectors are selected
from a fixed codebook, there are several important differences between
them. PVXC is distinguished from other excitation coders by the use of a
precomputed and stored set of pulse-like (sparse) codevectors. This form
of vocal-tract model excitation is used together with an efficient error
minimization scheme in the Sparse Vector Fast Search (SVFS) and Enhanced
SVFS complexity reduction methods. Finally, PVXC incorporates an
excitation codebook which has been optimized to minimize the
perceptually-weighted error between original and reconstructed speech
waveforms. The optimization procedure is based on a centroid derivation.
In addition, a complexity reduction scheme called Spectral Classification
(SPC) is disclosed for excitation coders using a conventional codebook
(fully-populated codevector components). There is currently a high demand
for speech coding techniques which produce high-quality reconstructed
speech at rates around 4.8 kb/s Such coders are needed to close the gap
which exists between vocoders with an "electronic-accent" operating at 2.4
kb/s and newer, more sophisticated hybrid techniques which produce near
toll-quality speech at 9.6 kb/s.
For real-time implementations, the promise of VXC has been thwarted
somewhat by the associated high computational complexity. Recent research
has shown that the dominant computation (excitation codebook search) can
be reduced to around 40 M Flops without compromising speech quality
However, this operation count is still too high to implement a practical
real-time version using only a few current-generation DSP chips. The PVXC
coder described herein produces natural-sounding speech at 4.8 kb/s and
requires a total computation of only 1.2 M Flops.
OBJECTS AND SUMMARY OF THE INVENTION
The main object of this invention is to reduce the complexity of VXC speech
coding techniques without sacrificing the perceptual quality of the
reconstructed speech signal in the ways just mentioned.
A further object is to provide techniques for real-time vector excitation
coding of speech at a rate below the midrate between 2.4 kb/s and 9.6
kb/s.
In the present invention, a fully-quantized PVXC produces natural-sounding
speech at a rate well below the midrate between 2.4 kb/s and 9.6 kb/s.
Near toll-quality reconstructed speech is achieved at these low rates
primarily by exploiting codevector sparsity, by reformulating the search
procedure in a mathematically less complex (but essentially equivalent)
manner, and by precomputing intermediate quantities which are used for
multiple input vectors in one speech frame. The coder incorporates a pulse
excitation codebook which is designed using a novel perceptually-based
clustering algorithm. Speech or audio samples are converted to digital
form, partitioned into frames of L samples, and further partitioned into
groups of k samples to form vectors with a dimension of k samples. The
input vector s.sub.n is preprocessed to generate a perceptual weighted
vector z.sub.n, which is then subtracted from each member of a set of N
weighted synthetic speech vectors {z.sub.j }, j.epsilon. {1, . . . , N},
where N is the number of excitation vectors in the codebook. The set
{z.sub.j } is generated by filtering pulse excitation (PE) codevectors
c.sub.j with two time-varying, cascaded LPC synthesis filters H.sub.l (z)
and H.sub.s (z). In synthesizing {z.sub.j }, each PE code-vector is scaled
by a variable gain G.sub.j (determined by minimizing the mean-squared
error between the weighted synthetic speech signal z.sub.j and the
weighted input speech vector z.sub.n), filtered with cascaded long-term
and short-term LPC synthesis filters, and then weighted by a perceptual
weighting filter. The reason for perceptually weighting the input vector
z.sub.n and the synthetic speech vector with the same weighting filter is
to shape the spectuum of the error signal so that it is similar to the
spectrum of s.sub.n, thereby masking distortion which would otherwise be
perceived by the human ear.
In the paragraph above, and in all the text that follows, a tilde (.about.)
over a letter signifies the incorporation of a perceptual weighting
factor, and a circumflex ( / ) signifies an estimate.
An exhaustive search over N vectors is performed for every input vector
s.sub.n to determine the excitation vector c.sub.j which minimizes the
squared Euclidean distortion .parallel.e.sub.j .parallel..sup.2 between
z.sub.n and z.sub.j. Once the optimal c.sub.j is selected, a codebook
index which identifies it is transmitted to the decoder together with its
associated gain. The parameters of H.sub.l (z) and H.sub.s (z) transmitted
as side information once per input speech frame (after every (L/k)th
s.sub.n vector).
A very useful linear systems representation of the synthesis filters and
H.sub.s (z) and H.sub.l (z) is employed. Codebook search complexity is
reduced by removing the effect of the deterministic component of speech
(produced by synthesis filter memory from the previous vector--the zero
input response) on the selection of the optimal codevector for the current
input vector s.sub.n. This is performed in the encoder only by first
finding the zero-input response of the cascaded synthesis and weighting
filters. The difference z.sub.n between a weighted input speech vector
r.sub.n and this zero-input response is the input vector to the codebook
search. The vector r.sub.n is produced by filtering s.sub.n with W(z), the
perceptual weighting filter. With the effect of the deterministic
component removed, the initial memory values in H.sub.s (z) and H.sub.l
(z) can be set to zero when synthesizing {z.sub.j } without affecting the
choice of the optimal codevector. Once the optimal codevector is
determined, filter memory from the previous encoded vector can be updated
for use in encoding the subsequent vector. Not only does this filter
representation allow further reduction in the computation necessary by
efficiently expressing the speech synthesis operation as a matrix-vector
product, but it also leads to a centroid calculation for use in optimal
codebook design routines
The novel features that are considered characteristic of this invention are
set forth with particularity in the appended claims. The invention will
best be understood from the following description when read in conjunction
with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of a VXC speech encoder embodying some of the
improvements of this invention.
FIG. 1a is a graph of segmented SNR (SNR.sub.seg) and overall codebook
search complexity versus number of pulses per vector, N.sub.p.
FIG. 1b is a graph of segmented SNR (SNR.sub.seg) and overall codebook
search complexity versus number of good candidate vectors, N.sub.c, in the
two-step fast-search operation of FIG. 4a and FIG. 4b.
FIG. 2 is a block diagram of a PVXC speech encoder embodying the present
invention.
FIG. 3 illustrates in a functional block diagram the codebook search
operation for the system of FIG. 2 suitable for implementation using
programmable signal processors.
FIG. 4a is a functional block diagram which illustrates Spectral
Classification, a two-step fast-search operation.
FIG. 4b is a block diagram which expands a functional block 40 in FIG. 4a.
FIG. 5 is a schematic diagram disclosing a preferred embodiment of the
architecture for the PVXC speech encoder of FIG. 2.
FIG. 6 is a flow chart for the preparation and use of an excitation
codebook in the PVXC speech encoder of FIG. 2.
DESCRIPTION OF PREFERRED EMBODIMENTS
Before describing preferred embodiments of PVXC, the present invention, a
VXC structure will first be described with reference to FIG. 1 to
introduce some inventive concepts and show that they can be incorporated
in any VXC-type system. The original speech signal s.sub.n is a vector
with a dimension of k samples. This vector is weighted by a time-varying
perceptual weighting filter 10 to produce z.sub.n, which is then
subtracted from each member of a set of N weighted synthetic speech
vectors {z.sub.j }, j.epsilon. {1, . . . , N} in an adder 11. The set
{z.sub.j } is generated by filtering excitation codevectors c.sub.j
(originating in a codebook 12) with cascaded long-term synthesizer
(synthesis filter) filter 13 a short-term synthesizer (synthesis filter)
14a and a perceptual weighting filter 14b. Each codevector c.sub.j is
scaled in an amplifier 15 by a gain factor G.sub.j (computed in a block
16) which is determined by minimizing the mean-squared error e.sub.j
between z.sub.j and the perceptually weighted speech vector z.sub.n. In an
exhaustive search VXC coder of this type, an excitation vector c.sub.j is
selected in block 15a which minimizes the squared Euclidean error
.parallel.e.sub.j .parallel..sup.2 resulting from a comparison of vectors
z.sub.n and every member of the set {z.sub.j }. An index I.sub.n having
log.sub.2 N bits which identifies the optimal c.sub.j is transmitted for
each input vector s.sub.n, along with G.sub.j and the synthesis filter
parameters {a.sub.i }, {b.sub.i }, and P associated with the current input
frame.
The transfer functions W(z), H.sub.l (z), and H.sub.s (z) of the
time-varying recursive filters 10, 13 and 14a,b are given by
##EQU1##
the a.sub.i are predictor coefficients obtained by a suitable LPC (linear
predictive coding) analysis method of order p, the b.sub.i are predictor
coefficients of a long-term LPC analysis of order q=2J+1, and the integer
lag term P can roughly be described as the sample delay corresponding to
one pitch period. The parameter .gamma. (0.ltoreq..gamma..ltoreq.1)
determines the amount of perceptual weighting applied to the error signal.
The parameters {a.sub.i } are determined by a short-term LPC analysis 17
of a block of vectors, such as a frame of four vectors, each vector
comprising 40 samples. The block of vectors is stored in an input buffer
(not shown) during this analysis, and then processed to encode the vectors
by selecting the best match between a preprocessed input vector z.sub.n
and a synthetic vector z.sub.j, and transmitting only the index of the
optimal excitation c.sub.j. After computing a set of parameters {a.sub.i }
(e.g., twelve of them), inverse filtering of the input vector s.sub.n is
performed using a short-term inverse filter 18 to produce a residual
vector d.sub.n. The inverse filter has a transfer function equal to P(z).
Pitch predictive analysis (long-term LPC analysis) 19 is then performed
using the vector d.sub.n, where d.sub.n represents a succession of
residual vectors corresponding to every vector s.sub.n of the block or
frame.
The perceptual weighting filter W(z) has been moved from its conventional
location at the output of the error subtraction operation (adder 11) to
both of its input branches. In this case, s.sub.n will be weighted once by
W(z) (prior to the start of an excitation codebook search). In the second
branch, the weighting function W(z) is incorporated into the short-term
synthesizer channel now labeled short-term weighted synthesizer 14. This
configuration is mathematically equivalent to the conventional design, but
requires less computation. A desirable effect of moving W(z) is that its
zeros exactly cancel the poles of the conventional short-term synthesizer
14a (LPC filter) 1/P(z), producing the pth order weighted synthesis
filter.
##EQU2##
This arrangement requires a factor of 3 less computations per codevector
than the conventional approach since only k(p+q) multiply/adds are
required for filtering a codevector instead of k(3p+q) when W(z) weights
the error signal directly. The structure of FIG. 1 is otherwise the same
as conventional prior art VXC coders.
Computation can be further reduced by removing the effect of the memory in
the filters 13 and 14 (having the transfer functions H.sub.l (z) and
H.sub.s (z)) on the selection of an optimal excitation for the current
vector of input speech. This is accomplished using a very low-complexity
technique to preprocess the weighted input speech vector once prior to the
subsequent codebook search, as described in the last section. The result
of this procedure is that the initial memory in these filters can be set
to zero when synthesizing {z.sub.j } without affecting the choice of the
optimal codevector. Once the optimal cod-evector is determined, filter
memory from the previous vector can be updated for encoding the subsequent
vector. This approach also allows the speech synthesis operation to be
efficiently expressed as a matrix-vector product, as will now be
described.
For this method, called Sparse Vector Fast Search (SVFS), a new formulation
of the LPC synthesis and weighting filters 13 and 14 is required. The
following shows how a suitable algebraic manipulation and an appropriate
but modest constraint on the Gaussian-like codevectors leads to an overall
reduction in codebook search complexity by a factor of approximately ten.
The complexity reduction factor can be increased by varying a parameter of
the codebook construction process. The result is that the performance
versus complexity characteristic exhibits a threshold effect that allows a
substantial complexity saving before any perceptual degradation in quality
is incurred. A side benefit of this technique is that memory storage for
the excitation vectors is reduced by a factor of seven or more.
Furthermore, codebook search computation is virtually independent of LPC
filter order, making the use of high-order synthesis filters more
attractive.
It was noted above that memory terms in the infinite impulse response
filters H.sub.l (z) and H.sub.s (z) can be set to zero prior to
synthesizing {z.sub.j }. This implies that the output of the filters 13
and 14 can be expressed as a convolution of two finite sequences of length
k, scaled by a gain:
z.sub.j (m)=G.sub.j (h(m)* c.sub.j (m)), (2)
z.sub.j (m) is a sequence of weighted synthetic speech samples, h(m) is the
impulse response of the combined short-term, long-term, and weighting
filters, and c.sub.j (m) is a sequence of samples for the jth excitation
vector.
A matrix representation of the convolution in equation (2) may be given as:
z.sub.j =G.sub.j Hc.sub.j, (3)
where H is a k by lower triangular matrix whose elements are from h(m):
##EQU3##
Now the weighted distortion from the jth codevector can be expressed simply
as
.parallel.e.sub.j .parallel..sup.2 =.parallel.z.sub.n -z.sub.j
.parallel..sup.2 =.parallel.z.sub.n -Hc.sub.j .parallel..sup.2 (5)
In general, the matrix computation to calculate z.sub.j requires k(k+1)/2
operations of multiplication and addition versus k(p+q) for the
conventional linear recursive filter realization For the chosen set of
filter parameters (k=40, p+q=19), it would be slightly more expensive for
an arbitrary excitation vector c.sub.j to compute .parallel.e.sub.j
.parallel. using the matrix formulation since (k+1)/2>p+q. However, if
each c.sub.j is suitably chosen to have only N.sub.p pulses per vector
(the other components are zero), then equation (5) can be computed very
efficiently. Typically, N.sub.p /k is 0.1. More specifically, if the
matrix-vector product Hc.sub.j is calculated using:
For m=0 to k-1
If c.sub.j (m)=0, then
Next m
otherwise
For i=m to k-1
z.sub.j (i)=z.sub.j (i)+c.sub.j (m) h(k).
Then the average computation for Hc.sub.j is N.sub.p (k+1)/2 multiply/adds,
which is less than k(p+q) if N.sub.p <37 (for the k, p, and q given
previously).
A very straightforward pulse codebook construction procedure exists which
uses an initial set of vectors whose components are all nonzero to
construct a set of sparse excitation codevectors. This procedure, called
center-clipping, is described in a later section. The complexity reduction
factor of this SVFS is adjusted by varying N.sub.p, a parameter of the
codebook design process.
zeroing of selected codevector components is consistent with results
obtained in Multi-Pulse LPC (MPLPC) [B. S. Atal and J. R. Remde "A New
Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit
Rates" Proc. Int'l. Conf. on Acoustics, Speech, and Signal Processing,
Paris, May 1982], since it has been shown that only about 8 pulses are
required per pitch period (one pitch prriod is typically 5 ms for a female
speaker) to synthesize natural-sounding speech. See S. Singhal and B. S.
Atal, "Improving Performance of Multi-Pulse LPC Coders at Low Bit Rates,"
Proc. Int'l. Conf. on Acoustics, Speech and Signal Processing, San Diego,
March 1984. Even more encouraging, simulation results of the present
invention indicate that reconstructed speech quality does not start to
deteriorate until the number of pulses per vector drops to 2 or 3 out of
40. Since, with the matrix formulation, computation decreases as the
number of zero components increases, significant savings can be realized
by using only 4 pulses per vector. In fact, when N.sub.p =4 and k=40,
filtering complexity reduction by a factor of ten is achieved.
FIG. 1a shows plots of segmental SNR (SNR.sub.seg) and overall codebook
search complexity versus number of pulse per vector, N.sub.p. It is noted
that as N.sub.p decreases, SNR.sub.seg does not start to drop until
N.sub.p reaches 3. In fact, informal listening tests show that the
perceptual quality of the reconstructed speech signal actually improves
slightly as N.sub.p is reduced from 40 to 4 and at the same time, the
filtering computation complexity drops significantly.
It should also be noted that the required amount of codebook memory can be
greatly reduced by storing only N.sub.p pulse amplitudes and their
associated positions instead of k amplitudes (most of which are zero in
this scheme). For example, memory storage reduction by a factor of 7.3 is
achieved when k=40, N.sub.p =4, and each codevector component is
represented by a 16-bit word.
The second simplification (improvement), Spectral Classification, also
reduces overall codebook search effort by a factor of approximately ten.
It is based on the premise that it is possible to perform a precomputation
of simple to moderate complexity using the input speech to eliminate a
large percentage of excitation codevectors from consideration before an
exhaustive search is performed.
It has been shown by other researchers that for a given speech frame, the
number of excitation vectors from a codebook of size 1024 which produce
acceptably low distortion is small (approximately 5). The goal in this
fast-search scheme, is to use a quick but approximate procedure to find a
number N.sub.c of "good" candidate excitation vectors (N.sub.c <N) for
subsequent use in a reduced exhaustive search of N.sub.c codevectors. This
two-step operation is presented in FIG. 4a.
In Step 1, the input vector z.sub.n is compared with z.sub.j to screen
codevectors in block 40 and produce a set of N.sub.c candidate vectors to
use in a reduced codevector search. Refer to FIG. 4b for an expanded view
of block 40. The N.sub.c surviving codevectors are selected by making a
rough classification of the gain-normalized spectral shape of the current
speech frame into one of M.sub.s classes. One of M.sub.s corresponding
codebooks (selected by the classification operation) is then used in a
simplified speech synthesis procedure to generate z.sub.j. The excitation
vectors N.sub.c producing the lowest distortions are selected in block 40
for use in Step 2, the reduced exhaustive search using the scalar 30,
long-term synthesizer 26, and short-term weighted synthesizer 25 (filters
25a and 25b in cascade as before). The only thing different is a reduced
codevector set, such as 30 codevectors reduced from 1024. This is where
computational savings are achieved.
Spectral classification of the current speech frame in block 40 is
performed by quantizing its short-term predictor coefficients using a
vector quantizer 42 shown in FIG. 4b with M.sub.s spectral shape
codevectors (typically M.sub.s= 4 to 8). This classification technique is
very low in complexity (it comprises less than 0.2% of the total codebook
search effort). The vector quantizer output (an index) selects one of
M.sub.s corresponding codebooks to use in the speech synthesis procedure
(one codebook for each spectral class). To construct each shaped cookbook,
Gaussian-like codevectors from a pulse excitation codebook 20 are input to
an LPC synthesis filter 25a representing the codebook's spectral class.
The "shaped" codevectors are precomputed off-line and stored in the
codebooks 1, 2 . . . M.sub.s. By calculating the short-term filtered
excitation off-line, this computational expense is saved in the encoder.
Now the candidate excitation vectors from the original Gaussian-like
codebook can be selected simply by filtering the shaped vectors from the
selected class codebook with H.sub.l (z), and retaining only those N.sub.c
vectors which produce the lowest weighted distortion. In Step 2 of
Spectral Classification, a final exhaustive search over these N.sub.c
vectors (to determine the optimal one) is conducted using quantized values
of the predictor coefficients determined by LPC analysis of the current
speech frame.
Computer simulation results show that with M.sub.s =4, N.sub.c can be as
low as 30 with no loss in perceptual quality of the reconstructed speech,
and when N.sub.c =10, only a very slight degradation is noticeable. FIG.
1b summarizes the results of these simulations by showing how SNR.sub.seg
and overall codebook search complexity change with N.sub.c. Note that the
drop in SNR.sub.seg as N.sub.c is reduced does not occur until after the
knee of the complexity versus N.sub.c curve is passed.
The sparse-vector and spectral classification fast codebook search
techniques for VXC have each been shown to reduce complexity by an order
of magnitude without incurring a loss in subjective quality of the
reconstructed speech signal. In the sparse-vector method, a matrix
formulation of the LPC synthesis filters is presented which possesses
distinct advantages over conventional all-pole recursive filter
structures. In spectral classification, approximately 97% of the
excitation codevectors are eliminated from the codebook search by using a
crude identification of the spectral shape of the current frame. These two
methods can be combined together or with other compatible fast-search
schemes to achieve even greater reduction.
These techniques for reducing the complexity of Vector Excitation Coding
(VXC) discussed above nn general will now be described with reference to a
particular embodiment called PVXC utilizing a pulse excitation (PE)
codebook in which codevectors have been designed as just described with
zeroing of selected codevector components to leave, for example, only four
pulses, i.e., nonzero samples, for a vector of 40 samples. It is this
pulse characteristic of PE codevectors that suggest the name "pulse vector
excitation coder" referred to as PVXC.
PVXC is a hybrid speech coder which combines an analysis-by-synthesis
approach with conventional waveform compression techniques. The basic
structure of PVXC is presented in FIG. 2. The encoder consists of an
LPC-based speech production model and an error weighting function W(z).
The production model contains two time-varying, cascaded LPC synthesis
filters H.sub.s (z) and H.sub.l (z) describing the vocal tract, a codebook
20 of N pulse-like excitation vectors c.sub.j, and a gain term G.sub.j. As
before, H.sub.s (z) describes the spectral envelope of the original speech
signal s.sub.n, and H.sub.l (z) is a long-term synthesizer which
reproduces the spectral fine structure (pitch). The transfer functions of
H.sub.s (z) and H.sub.l (z) are given by H.sub.s (z)=1/P.sub.s (z) and
H.sub.l (z)=1/P.sub.l (z) where
##EQU4##
Here, a.sub.i and b.sub.i are the quantized short and long-term predictor
coefficients, respectively, P is the "pitch" term derived from the
short-term LPC residual signal (20.ltoreq.P.ltoreq.147), and p and q
(=2J+1) are the short and long-term predictor orders, respectively. Tenth
order short-term LPC analysis is performed on frames of length L=160
samples (20 ms for an 8 kHz sampling rate). P.sub.l (z) contains a 3-tap
predictor (J=1) which is updated once per frame. The weighting filter has
a transfer function W(z)=P.sub.s (z)/P.sub.s (z/.gamma.), where P.sub.s
(z) contains the unquantized predictor parameters and
0.ltoreq..gamma..ltoreq.1. The purpose of the perceptual weighting filter
W(z) is the same as before.
Referring to FIG. 2, the basic structure of a PVXC system (encoder and
decoder) is shown with the encoder (transmitter) in the upper part
connected to a decoder (receiver) by a channel 21 over which a pulse
excitation (PE) codevector index and gain is transmitted for each input
vector s.sub.n after encoding in accordance with this invention. Side
information, consisting of the parameters Q{a.sub.i }, Q{b.sub.i },
QG.sub.j and P, are transmitted to the decoder once per frame (every L
input samples). The original speech input samples s, converted to digital
form in an analog-to-digital converter 22, are partitioned into a frame of
L/k vectors, with each vector having a group of k successive samples. More
than one frame is stored in a buffer 23, which thus stores more than 160
samples at a time, such as 320 samples.
For each frame, an analysis section 24 performs short-term LPC analysis and
long-term LPC analysis to determine the parameters {a.sub.i }, {b.sub.i }
and P from the original speech contained in the frame. These parameters
are used in a short-term synthesizer 25a comprised of a digital filter
specified by the parameters {a.sub.i }, and a perceptual weighting filter
25b, and in a long-term synthesizer 26 comprised of a digital filter
specified by four parameters {b.sub.i } and P. These parameters are coded
using quantizing tables and only their indices Q{a.sub.i } and Q{b.sub.i }
are sent as side information to the decoder which uses them to specify the
filters of long-term and short-term synthesizers 27 and 28, respectively,
in reconstructing the speech. The channel 21 includes at its encoder
output a multiplexer to first transmit the side information, and then the
codevector indices and gains, i. e., the encoded vectors of a frame,
together with a quantized gain factor QG.sub.j computed for each vector.
The channel then includes at its output a demultiplexer to send the side
information to the long-term and short-term synthesizers in the decoder.
The quantized gain factor QG.sub.j of each vector is sent to a scaler 29
(corresponding to a scaler 30 in the encoder) with the decoded codevector.
After the LPC analysis has been competed for a frame, the encoder is ready
to select an appropriate pulse excitation from the codebook 20 for each of
the original speech vectors in the buffer 23. The first step is to
retrieve one input vector from the buffer 23 and filter it with the
perceptual weighting filter 33. The next step is to find | | |