|
Claims  |
|
|
I claim:
1. Speech processing apparatus comprising
speech analyzer means responsive to input speech signals from a source of
input speech signals for generating a plurality of parameter signals
representing said input speech signals in accordance with a speech model,
at least one of said parameter signals being quantized as one of p levels,
channel encoder means comprising
encoder memory means for storing an encoding table defining a mapping from
each of said p levels to a unique one of p, m-bit label signals, where
p<k=2.sup.m, and
means responsive to said speech analyzer means for transmitting, over a
channel to a destination, the one of said p label signals that is
associated with said one of said p levels in said encoding table,
channel decoder means comprising
decoder memory means for storing a decoding table defining the inverse of
said encoding table mapping and
means, responsive to a label signal received at said destination from said
channel, for decoding said received label signal as the one of said p
levels associated with said received label signal in said decoding table
inverse mapping when said received label signal is one of said p label
signals, and for decoding said received label signal in accordance with an
error routine when said received label signal is one of the k-p, m-bit
label signals other than said p label signals, and
speech synthesizer means responsive to said decoding means for synthesizing
speech based at least in part on said decoded label signal,
wherein said encoding table mapping stored by said encoder memory means and
said decoding table inverse mapping stored by said decoder memory means
are obtained to minimize the effect of channel errors and are obtained
using simulated annealing based on a probability distribution of said p
levels for said one parameter signal.
2. Speech processing apparatus in accordance with claim 1 wherein said
encoder memory means further stores other mappings and said decoder memory
means further stores other inverse mappings, said other mappings and said
other inverse mappings being for use in communication of other parameter
signals from said source over said channel to said destination, said other
mappings and said other inverse mappings being obtained, concurrently with
said encoding table mapping and said decoding table inverse mapping, using
said simulated annealing.
3. Speech processing apparatus in accordance with claim 2 wherein said
simulated annealing minimizes an overall error measure for said one
parameter signal and said other parameter signals.
4. Speech processing apparatus in accordance with claim 1 wherein
said decoding table stored by said decoder memory means defines an
additional mapping from each of said k-p label signals,
said decoding means decodes said received label signal in accordance with
said additional mapping when said received label signal is one of said k-p
label signals, and
said decoding table additional mapping is also obtained using said
simulated annealing to minimize the effect of channel errors based on said
probability distribution.
5. Speech processing apparatus in accordance with claim 4 wherein said
inverse and additional mappings are obtained concurrently using said
simulated annealing.
6. Speech processing apparatus in accordance with claim 1 wherein
said decoding means decodes said received label signal as a default level
when said received label signal is one of said k-p label signals.
7. Speech processing apparatus in accordance with claim 1 wherein
said decoding means decodes said received label signal based on information
received over said channel other than said received label signal when said
received label signal is one of said k-p label signals.
8. Speech processing apparatus in accordance with claim 1 wherein
said decoding means decodes said received label signal as the same level
that was obtained from a previous communication of said one parameter
signal over said channel when said received label signal is one of said
k-p label signals.
9. Speech processing apparatus in accordance with claim 1
said decoding table stored by said decoder memory means defines an
additional mapping from each of certain ones of said k-p label signals,
said decoding means decodes said received label signal in accordance with
said additional mapping when said received label signal is one of said
certain ones of said k-p label signals, and
said decoding means decodes said received label signal as a default level
when said received label signal is one of said k-p label signals other
said certain ones.
10. Speech processing apparatus in accordance with claim 1
said decoding table stored by said decoder memory means defines an
additional mapping from each of certain ones of said k-p label signals,
said decoding means decodes said received label signal in accordance with
said additional mapping when said received label signal is one of said
certain ones of said k-p label signals, and
said decoding means decodes said received label signal based on information
received over said channel other than said received label signal when said
received label signal is one of said k-p label signals other said certain
ones.
11. Speech processing apparatus in accordance with claim 1
said decoding table stored by said decoder memory means defines an
additional mapping from each of certain ones of said k-p label signals,
said decoding means decodes said received label signal in accordance with
said additional mapping when said received label signal is one of said
certain ones of said k-p label signals, and
said decoding means decodes said received label signal as the same level
that was obtained from a previous communication of said one parameter
signal over said channel when said received label signal is one of said
k-p label signals other said certain ones.
12. Speech processing apparatus in accordance with claim 1 wherein said p
label signals are selected from k, m-bit label signals as a result of said
simulated annealing.
13. Speech processing apparatus in accordance with claim 1 wherein said
model is a code excited linear prediction model.
14. Speech processing apparatus in accordance with claim 1 wherein said
encoding table mapping and said decoding table mapping are obtained to
minimize distortion in said synthesized speech.
15. Speech processing apparatus in accordance with claim 1 wherein
p>2.sup.m-1.
16. Speech processing apparatus comprising
speech analyzer means responsive to input speech signals from a source of
input speech signals for generating a plurality of parameter signals
representing said input speech signals in accordance with a speech model,
at least one of said parameter signals being quantized as one of p levels,
channel encoder means comprising
encoder memory means for storing an encoding table defining a mapping from
each of said p levels to a unique one of p, m-bit label signals, where
p<k=2.sup.m, and
means responsive to said speech analyzer means for transmitting, over a
channel to a destination, the one of said p label signals that is
associated with said one of said p levels in said encoding table,
channel decoder means comprising
decoder memory means for storing a decoding table defining the inverse of
said encoding table mapping and defining an additional mapping from each
of certain ones of the k-p, m-bit label signals other than said p label
signals and
means, responsive to a label signal received at said destination from said
channel, for decoding said received label signal as the one of said p
levels associated with said received label signal in said decoding table
inverse mapping when said received label signal is one of said p label
signals, and for decoding said received label signal as defined by said
additional mapping when said received label signal is one of said certain
ones of said k-p label signals, and
speech synthesizer means responsive to said decoding means for synthesizing
speech based at least in part on said decoded label signal,
wherein said encoding table mapping stored by said encoder memory means and
said decoding table inverse mapping stored by said decoder memory means
are obtained to minimize the effect of channel errors and are obtained
based on a probability distribution of said p levels for said one
parameter signal,
wherein said inverse and additional mappings are such that at least one of
said p label signals differs in b bits, 1<=b<m, from a label signal which
maps into the same level as said at least one of said p label signals and
which also differs in b bits from a label signal which maps into a level
other than said same level.
17. Speech processing apparatus in accordance with claim 16 wherein
said decoding means decodes said received label signal as a default level
when said received label signal is one of said k-p label signals other
than said certain ones.
18. Speech processing apparatus in accordance with claim 16 wherein
said decoding means decodes said received label signal based on information
received over said channel other than said received label signal when said
received label signal is one of said k-p label signals other than said
certain ones.
19. Speech processing apparatus in accordance with claim 16 wherein
said decoding means decodes said received label signal as the same level
that was obtained from a previous communication of said one parameter
signal over said channel when said received label signal is one of said
k-p label signals other than said certain ones.
20. Speech processing apparatus in accordance with claim 16 where
p>2.sup.m-1.
21. Speech processing apparatus in accordance with claim 16 wherein said
model is a code excited linear prediction model.
22. Speech processing apparatus in accordance with claim 16 wherein said
encoding table mapping and said decoding table mapping are obtained to
minimize distortion in said synthesized speech.
23. Speech processing apparatus comprising
speech analyzer means responsive to input speech signals from a source of
input speech signals for generating a plurality of parameter signals
representing said input speech signals in accordance with a speech model,
at least one of said parameter signals being quantized as one of p levels,
channel encoder means comprising
encoder memory means for storing an encoding table defining a mapping from
each of said p levels to a unique one of p, m-bit label signals, where
p<=k=2.sup.m, and
means responsive to said speech analyzer means for transmitting, over a
channel to a destination, the one of said p label signals that is
associated with said one of said p levels in said encoding table,
channel decoder means comprising
decoder memory means for storing a decoding table defining the inverse of
said encoding table mapping and
means, responsive to a label signal received at said destination from said
channel, for decoding said received label signal as the one of said p
levels associated with said received label signal in said decoding table
inverse mapping when said received label signal is one of said p label
signals, and
speech synthesizer means responsive to said decoding means for synthesizing
speech based at least in part on said decoded label signal,
wherein said encoding table mapping stored by said encoder memory means and
said decoding table inverse mapping stored by said decoder memory means
are obtained to minimize the effect of channel errors and are obtained
using simulated annealing based on a probability distribution of said p
levels for said one parameter signal.
24. Speech processing apparatus in accordance with claim 23 wherein said
model is a code excited linear prediction model.
25. Speech processing apparatus in accordance with claim 23 wherein said
encoding table mapping and said decoding table mapping are obtained to
minimize distortion in said synthesized speech. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
TECHNICAL FIELD
This invention relates to information processing and communication.
BACKGROUND AND PROBLEM
The code excited linear predictive (CELP) speech compression procedure has
been shown to provide excellent speech quality at low bit rates. Since its
original introduction in 1984, much effort has been spent to make the
procedure feasible for commercial applications. Thus, while the original
procedure was computationally extremely expensive, many different
techniques are now available to reduce the computational effort. Its
current level of maturity makes the CELP procedure desirable for many
applications where bandwidth is at a premium, such as voice mail/storage,
secure telephony and mobile telephony.
In some applications the CELP procedure will encounter channel errors.
Efforts to minimize the effect of channel errors on speech compression
procedures can be divided into methods which change the robustness of the
source coder, by taking advantage of redundancies in the transmitted
information, and methods which add error correction and/or error detection
by means of a separate channel coder. Conventional implementations of the
latter approach add a channel coder which maps selected bits of the
quantization indices of a compression procedure into generic
error-correction/detection codes which do not depend on the source. That
this procedure is not optimal is suggested by the fact that the bits to be
protected by the error correcting codes are hand picked, based on a
judgement of their sensitivity. The separation between source and channel
coders is justified if an arbitrarily complex coder-decoder design is
optimized for a channel of a particular capacity (usually a worst case
channel). Then the source coder rate can be matched to the capacity of
this channel, resulting in suboptimal performance for channels of higher
or lower capacity (or equal capacity, but with different characteristics).
Speech coders usually encounter a variety of error conditions, and in many
cases low error rates are prevalent. It is desirable to have a speech
coder which exploits maximally the prevalent channels and decreases
minimally in performance with diminishing channel capacity. To obtain this
behavior, the source distortion must be considered in the design of the
channel coder.
As an illustration that the source distortion should be considered in
optimizing a channel code which is used in channels of various error
rates, consider the example of Table 1. A four level scalar quantizer, of
which each level has identical a-priori probability (no redundancy in the
transmitted bit stream), is encoded with three different encoding schemes.
Assume that virtually all channels are without errors, except a few in
which a significant random error rate occurs. Table 1 shows the well known
L1 and L2 error criteria for single bit errors (two bit errors per code
word are exceedingly unlikely at low error rates) per codeword per error
for the three encoding schemes. All codes are optimized for channels with
zero error rate and have zero redundancy, but code 1 will result in the
lowest L2 distortion, and code 1 and code 2 result in the lowest L1
distortion for noisy channels.
TABLE 1
______________________________________
Four-Level Quantizer Example
code 1 code 2 code 3
______________________________________
quantizer level
0.0 00 01 10
1.0 01 00 01
4.0 10 10 00
9.0 11 11 11
error criterion
L1 4.5 4.5 6.0
L2 26.5 29.0 42.5
______________________________________
This example makes clear that a coder optimized for a certain channel (a
channel with no bit errors in this case) can be further optimized to
enhance performance for channels of lower quality by considering the
source quality.
A technique known as pseudo-Gray coding, described in J-H. Chen, G.
Davidson, A. Bersho, and K. Zeger, "Speech Coding for the Mobile Satellite
Experiment", Proc. IEEE Int. Conf. on Communications, 756-763, (June
1987), is used to optimize the arrangement of a codebook to protect
against the effects of channel errors. The Chen procedure takes as input a
codebook and yields a rearrangement of the codevectors that minimizes the
expected time average bit-error distortion. The utility of the Chen
procedure is somewhat limited however because it does not include the
effects of redundancy in the optimization. This is a serious limitation
since in most applications where channel errors are at all significant,
some redundancy is desirable despite the typically low bit rates, e.g.,
4.8 kilobits per second. Furthermore, the Chen procedure uses a gradient
optimization technique which involves iteratively switching the positions
of codevectors to reduce the expected value of the bit-error distortion
until a locally optimal state is reached. However, since the function
being optimized typically has more than one local minimum, the Chen
procedure will frequently result in sub-optimum performance.
In view of the foregoing, a recognized need exists in the art for an
optimized, source dependent channel coder where the error protective
effects of redundancy are included in the optimization and where the
resulting code is more than locally optimal.
SOLUTION
This need is met and a technical advance is achieved in accordance with the
principles of the invention in a parameter communication arrangement where
a parameter that is transmitted over a channel using m-bit codewords or
labels is quantized before transmission as one of only p levels, where,
significantly, p<k=2.sup.m. Since only p labels are needed to transmit the
p levels, the unused k-p labels are advantageously available to provide
redundancy. The receiver decodes the redundant labels in accordance with
an error routine. An encoding table mapping from the p levels to p labels
and a decoding table inverse mapping from the p labels to p levels are
obtained using an optimization procedure to minimize the effect of channel
errors. The optimization is based on the probability distribution for the
p levels such that a relatively high proportion of the error protection
made available by having redundant labels inures to the benefit of
parameter levels which are more likely to be transmitted. The optimization
procedure is a well known technique referred to as simulated annealing
which is for the first time applied to source dependent channel coding and
which provides a degree of randomness in the perturbation of labels which
is gradually reduced to obtain a code which is globally optimum rather
than only locally optimum. Since low bit rates are desirable in many
applications, a degree of redundancy is afforded by having the number of
quantized levels, p, between 2.sup.m-1 and 2.sup.m in illustrative
embodiments herein. The expense of such an arrangement in terms of
transmitted bits is less than that of simple parity error detection.
A method in accordance with the invention is used to communicate a
parameter from a source over a channel to a destination. The parameter is
quantized at the source as one of p levels. The term quantization level as
used herein refers to either a scalar quantization value, described by a
single number, or a quantized vector value, described by an ordered set of
numbers. The label that is transmitted over the channel is the one of p,
m-bit labels that is associated with the quantized level in an encoding
table defining a mapping from each of the p levels to a unique one of the
p labels, where p<k=2.sup.m. When the m-bit label received at the
destination is one of the p labels, it is decoded as the level associated
with that label in a decoding table defining the inverse of the encoding
table mapping. When the received label is one of the k-p labels other than
the p labels, it is decoded in accordance with an error routine. The
mapping of the encoding table and the inverse mapping of the decoding
table are obtained to minimize the effect of channel errors and are
obtained using simulated annealing based on a probability distribution of
the p levels for the parameter.
In one illustrative embodiment, the error routine comprises error
correction and the received label is decoded as defined by an additional
mapping of the decoding table from each of the k-p redundant labels. The
encoding table mapping and the decoding table inverse and additional
mappings are obtained concurrently as the result of a single, simulated
annealing optimization.
In other illustrative embodiments, the error routine involves error
detection and the substitution of another level, for example, a default
level or a level based on information received over the channel other than
the received label, e.g., the same level obtained from a previous
communication of the parameter.
In a further illustrative embodiment, the error routine is a combination of
the above error correction and error detection and substitution methods.
Certain of the redundant labels are decoded using an additional mapping of
the encoding table and the other redundant labels are decoded as
substitute levels. The selections of which redundant labels result in
error correction and which ones result in error detection and substitution
are obtained as a result of the single, simulated annealing optimization.
In the exemplary embodiments herein, the parameter is obtained at the
source by analyzing input speech in accordance with a code excited linear
prediction (CELP) model; the result obtained by decoding the received
label is used at the destination to generate synthetic speech also in
accordance with the CELP model. Example parameters are the gain factors
and indices for the adaptive and stochastic codebooks used in an
illustrative CELP speech processing arrangement. The encoding table and
decoding table mappings are obtained to minimize distortion in the
synthetic speech generated at the destination.
Another alternative embodiment uses a single, simulated annealing procedure
to obtain optimized encoding and decoding tables for each of a number of
parameters, where the error measure used in the optimization is an overall
error measure.
In accordance with another aspect of the invention, a parameter is
quantized at the source as one of p levels. The label that is transmitted
over the channel is the one of p, m-bit labels that is associated with the
quantized level in an encoding table defining a mapping from each of the p
levels to a unique one of the p labels, where p<k=2.sup.m. When the m-bit
label received at the destination is one of the p labels, it is decoded as
the level associated with that label in a decoding table defining the
inverse of the encoding table mapping. When the received label is one of
the k-p labels other than the p labels, it is decoded in accordance with
an error routine. When the received label is one of at least certain ones
of the k-p other labels, it is decoded as defined by an additional mapping
of the decoding table. The mapping of the encoding table and the inverse
mapping of the decoding table are obtained to minimize the effect of
channel errors and are obtained based on a probability distribution of the
p levels for the parameter. The inverse and additional mappings are such
that at least one of the p labels differs in b bits, 1<=b<m, from a label
which maps into the same level as the one of the p labels and which also
differs in b bits from a label which maps into a level other than that
same level.
In accordance with still another aspect of the invention, a parameter is
quantized at the source as one of p levels. The label that is transmitted
over the channel is the one of p, m-bit labels that is associated with the
quantized level in an encoding table defining a mapping from each of the p
levels to a unique one of the p labels, where p<=k=2.sup.m. When the m-bit
label received at the destination is one of the p labels, it is decoded as
the level associated with that label in a decoding table defining the
inverse of the encoding table mapping. The mapping of the encoding table
and the inverse mapping of the decoding table are obtained to minimize the
effect of channel errors using simulated annealing based on a probability
distribution of the p levels for the parameter.
DRAWING DESCRIPTION
FIG. 1 is a block diagram of an exemplary speech coding arrangement using
the channel coding method of the present invention;
FIG. 2 illustrates the quantization of an arbitrary parameter X of the type
generated by the speech analyzer of FIG. 1;
FIG. 3 is a probability distribution for the parameter X;
FIG. 4 is an encoding table mapping for parameter X as obtained from a
simulated annealing optimization procedure and used in the channel encoder
of FIG. 1;
FIG. 5 is a decoding table inverse mapping for parameter X as obtained from
the simulated annealing procedure and used in the channel decoder of FIG.
1;
FIG. 6 is a decoding table additional mapping for parameter X as obtained
from the simulated annealing procedure and used in the channel decoder of
FIG. 1 for the case where error correction is performed on redundant
labels;
FIGS. 7 and 8 are diagrams depicting the inputs, outputs, and associated
error routines for simulated annealing procedures for a single parameter
and multiple parameters respectively, which procedures are described in
detail with reference to Tables 2-4 herein, and
FIGS. 9 through 15 are data curves used in describing the performance of
channel codes illustrating the present invention.
DETAILED DESCRIPTION
1. Introduction
An illustrative speech processing arrangement in accordance with the
invention is shown in block diagram form in FIG. 1. Incoming analog speech
signals are converted to digitized speech samples by an A/D converter 50.
The digitized speech samples from converter 50 are processed by speech
analyzer 100, which in the present example uses the CELP speech model for
analysis. The results obtained by analyzer 100 are a number of parameters
which are transmitted to a channel encoder 200 for encoding and
transmission over a channel 300. Advantageously, channel 300 may be a
communication transmission path or may be storage media so that voice
synthesis may be provided for various applications at a later point in
time. A channel decoder 400 receives the quantized parameters from channel
300, decodes them, and transmits the decoded parameters to a speech
synthesizer 500. Synthesizer 500 processes the parameters using the CELP
speech model to generate digital, synthetic speech samples which are in
turn processed by a D/A converter 550 to reproduce the incoming analog
speech signals. The present invention focuses on the channel encoding and
decoding functions. An encoding table 210 within encoder 200 and a
decoding table 410 within decoder 400 are obtained as the result of an
optimization procedure referred to as simulated annealing to minimize the
effect of channel errors in a manner described in detail herein.
In the present example, speech analyzer 100 and speech synthesizer 500
implement a particular CELP procedure referred to as stochastically
excited linear prediction (SELP) as described in W. B. Kleijn, D. J.
Krasinski, and R. H. Ketchum, "An Efficient Stochastically Excited Linear
Predictive Coding Algorithm for High Quality Low Bit Rate Transmission of
Speech", Speech Communication, Vol. VII, 305-316, 1988. The SELP procedure
for speech coding offers good performance at bit rates as low as 4.8
kbit/s. Linear predictive coding (LPC) techniques remove the short-term
correlation from the speech. A pitch loop removes long-term correlation,
producing a noise-like residual, which is vector quantized. Parameters
describing the LPC filter coefficients, the long-term predictor, and the
vector quantization are obtained by analyzer 100. Several improvements to
the SELP procedure are implemented which result in better speech quality
and higher computational efficiency. In its closed-loop form, the pitch
loop can be interpreted as a vector quantization of the desired excitation
signal with an adaptive codebook populated by previous excitation
sequences. To better model the non-stationarity of speech, the adaptive
codebook is extended with a special set of candidate vectors which are
transforms of other codebook entries. The second stage vector quantization
is performed using a fixed stochastic codebook. In its original form, the
SELP procedure requires a large computational effort. A recursive
procedure is employed which performs a very fast search through the
adaptive codebook. In this method, the error criterion is modified and th
resulting symmetries are exploited. The same fast vector quantization
procedure is applied to the stochastic codebook.
As mentioned previously, this invention relates to optimized channel
encoding and decoding of parameters such as the codebook indices and gain
factors obtained by speech analyzer 100. FIG. 2 illustrates the
quantization of an arbitrary parameter, X, as one of six levels 0, 1, 2,
3, 4, and 5. At time t.sub.1, for example, X is quantized as level 5, at
time t.sub.2 as level 2, and at time t.sub.3 as level 4. Since X is to be
transmitted using a three-bit label and since only six of the possible
eight labels are needed to transmit the six levels, two labels are
available to provide redundancy. The probability distribution for
parameter X is given in FIG. 3, where the levels 0, 1, 2, 3, 4, and 5 have
finite probabilities, P(0), P(1), P(2), P(3), P(4), and P(5) and levels 6
and 7 each have zero probability. In a first exemplary embodiment, the
redundant labels are used to provide error correction. As functionally
depicted in FIG. 7, the probability distribution of parameter X is
provided as input to a simulated annealing procedure described in detail
herein. The simulated annealing procedure produces as its output the
mappings given for example by FIGS. 4, 5, and 6. FIG. 4 illustrates a
particular mapping for parameter X in encoding table 210 where the levels
0, 1, 2, 3, 4, and 5 are mapped into the three-bit lables 010, 110, 111,
001, 000, and 100 respectively. FIG. 5 illustrates the inverse mapping for
parameter X in decoding table 410 where the labels 010, 110, 111, 001,
000, and 100 are mapped back into the levels 0, 2, 3, 4, and 5
respectively. Since the error routine used in this first exemplary
embodiment is error correction, an additional mapping as given by FIG. 6
is included in decoding table 410 for parameter X. When the labels 011 and
101 are received, channel decoder 400 knows that a channel error was made
since the labels 011 and 101 are redundant and are not transmitted by
encoder 200. The result of the simulated annealing procedure in this
embodiment is that the label 011 is mapped into level 0 and the label 101
is mapped in level 5.
In a second exemplary embodiment, the error routine comprises error
detection and the substitution of the level obtained during the previous
communication of parameter X. In a third exemplary embodiment, the error
routine comprises error detection and the substitution of a default level,
e.g., level 0. In both the second and third embodiments, no additional
mapping for parameter X is required in decoding table 410 like that of
FIG. 6 when the error routine was error correction. However, the error
routine operation when a redundant label is received is used in
determining the error measure which is minimized by the simulated
annealing optimization.
In a fourth exemplary embodiment, the error routine is a combination of the
above error correction and error detection and substitution methods.
Certain of the redundant labels, for example label 011 in the simple,
three-bit label case described, are decoded using an additional mapping of
the encoding table and the other redundant labels, label 101 in the
example, are decoded as substitute levels. The selections of which
redundant labels result in error correction and which ones result in error
detection and substitution are obtained as a result of the single,
simulated annealing optimization.
In a fifth exemplary embodiment, a single, simulated annealing procedure is
used to obtain optimized encoding and decoding tables for each of a number
of parameters as functionally depicted in FIG. 8, where the error measure
used in the optimization is an overall error measure.
The next section describes in detail the method of measuring the immediate
effect of decoding errors in the excitation function of CELP caused by
channel errors. For reference purposes this section includes a brief
description of the CELP procedure used. In section 3 a description of
simulated annealing for the optimization of a source-dependent channel
encoding is provided. The presented simulated annealing procedures are
applicable to the coding of parameters of many (speech) compression
procedures, but the focus here is their application to the CELP speech
coding procedure. Section 4 studies the error sensitivity of the codebook
gains. It applies the simulated annealing procedures to the channel coding
of these parameters. In section 5 the focus shifts to the channel encoding
of the codebook indices, to which the simulated annealing procedure is
applied. Included with the discussion of the codebook indices is an
example of the effect that the probability distribution has on the
performance of the annealing procedures. This is followed by a conclusion
section. Finally, several appendices with tables containing optimal codes
for some of the CELP parameters are provided.
2. An Error Criterion for the Effect of Channel Errors on CELP
2.1. Description of the CELP Procedure
The CELP procedure used here is identical to that described in W. B.
Kleijn, D. J. Krasinski, and R. H. Ketchum, "An Efficient Stochastically
Excited Linear Predictive Coding Algorithm for High Quality Low Bit Rate
Transmission of Speech", Speech Communication, Vol. VII, 305-316, 1988. It
efficiently encodes a digitized (usually sampled at a rate of 8000 Hz)
speech signal on a frame by frame basis. Synthetic speech is generated by
filtering an excitation signal. The filter adds the short term correlation
to the signal, roughly modeling the effect of the vocal tract and the
mouth. It is determined from a linear predictive (LP) analysis of the
original speech signal. For transmission, the filter coefficients, are
quantized with 35 bits using absolute line spectral frequencies (this
method exhibits low sensitivity to channel errors). The ideal excitation
signal segment which renders synthetic speech identical to the original
speech for the present frame is vector quantized to facilitate
transmission. The LP-analysis window length and the update intervals are
240 samples while a frame length of 60 samples is used for the vector
quantization of the ideal excitation vector.
The target (or ideal) excitation vector for a frame, which results in a
perfect match of the original speech (when it is filtered through the
inverse LPC filter) is vector quantized, using a shape-gain vector
quantizer, in two sequential stages. The candidate vectors of the two
codebooks are selected to minimize a squared error criterion on the
synthetic speech. Because of the finite size of a frame, the impulse
response of the inverse LPC filter can be truncated and described by a
finite impulse response (FIR) filter. The FIR filtering operation can be
written as a matrix multiplication of a Toeplitz matrix H, which describes
the filter, and a vector describing the excitation. If t is the target
excitation vector, s the candidate excitation vector, and .lambda. the
scaling of the excitation vector, then the mismatch of speech and
synthetic speech is the vector H(.lambda.s-t). Thus, the error criterion
to be minimized can be written as (.lambda.s-t).sup.T H.sup.T
H(.lambda.s-t). The square matrix H.sup.T H is referred to as the spectral
weighting matrix.
For the first stage an "adaptive" codebook is used which contains synthetic
excitation functions of the recent past. It uses 4 bits for the gain and 7
or 8 bits for the index. The adaptive codebook is updated after each
frame, and allows the excitation to become periodic in nature,
facilitating the description of voiced speech. The second stage consists
of a search through a fixed codebook, which further refines the excitation
function resulting from the search through the adaptive codebook. The
stochastic codebook consists of overlapping entries, with adjacent
candidates separated by a shift of two samples. Its samples have a
Gaussian distribution, center clipped at 1.5 standard deviation. Four bits
are used for the gain and 8 bits for the indices. To improve the coding
efficiency, the dynamic range of the stochastic codebook gain is reduced
by multiplying the stochastic codebook by a scale factor prior to
calculating the gain factor. The scale factor is based on energy of the
contribution of the adaptive codebook to the present frame.
Thus, without error protection bits, the procedure used in the following
sections requires 4233 or 4366 bits/second for a 7 or 8 bit adaptive
codebook, respectively.
2.2. Definition of the Error Criterion
In this description, the effect of channel errors on the parameters
describing the CELP excitation function and methods for performance
improvement are discussed. To evaluate the performance of a CELP procedure
under such conditions, an appropriate error criterion must be defined. A
natural method would be to compare the signal to noise ratios (using the
original speech as reference) of the synthetic speech with and without
channel errors, or to compute the signal to noise ratio of the synthetic
speech with channel errors using the synthetic speech without channel
errors as reference. To evaluate the effect of channel errors on
particular parameters, those parameters could be perturbed in a systematic
way, while the other parameters are left untouched.
A difficulty with measuring the channel errors on synthetic speech which is
corrupted by channel errors is that the result is dependent both on the
size of the decoding error, and on the attenuation rate of the resulting
distortion over the following frames. However, it may be argured that this
method does provide a good measure of the overall channel error
performance of a CELP procedure, provided that the errors are introduced
such that they do not interfere with each other. Thus, one can evaluate
the channel error performance of a particular encoding bit by
systematically disturbing that bit in frames which are sufficiently far
apart. (Note that at a 1% error rate the probability that the same
parameter will be disturbed in adjacent frames is negligibly small. This
can be expected to provide a better measure of the performance than
perturbing the same bit in every successive frame, which may result in
interaction of the errors in successive frames.
However, although systematically disturbing particular bits in frames
sufficiently far apart may provide a satisfactory error criterion, it
results in very laborious evaluations. This makes this error criterion
unsuitable for the combinatorial optimization required to find a good
channel code. Instead of using an evaluation criterion which operates on
the output speech, including distortion over following frames, a criterion
is introduced which maintains the features of the distance measure used in
the CELP procedure to select the best candidate from the codebooks. By
evaluating synthetic speech quality on a per frame basis the effect of the
persistence of the distortion over time is eliminated.
The focus here will be on errors at low channel error rates (i.e. 1% or
less); thus, it can safely be assumed that multiple bit errors are highly
improbable in the description of a single parameter describing the CELP
excitation. However, the following procedures are easily generalized to
include more bit errors per parameter.
Let us define c.sup.(i) to be the channel code of a particular excitation
parameter (codebook index or codebook gain), with quantization index i.
The value or vector associated with the quantization index i is denoted as
r.sub.i,i. (The meaning of the double subscript will become clear later.)
The probability that channel code c.sup.(i) occurs is denoted as
P(c.sup.(i)). Note that, in the case of no redundant codes, P(c.sup.(i))
is the probability that index i provides the parameter with its best fit.
The target (optimal) parameter or vector is denoted as t. At the analyzer
t is matched by the quantized parameter r.sub.i,i. In general, assume that
r.sub.i,i can be the result of multiple quantizations; for example a
vector may have a shape index as well as a gain index. If the quantization
index is changed from the transmitted value i to j, then denote the
resulting parameter or vector at the receive end by r.sub.i,j. Thus, in
the parameter r.sub.i,j the index i indicates that the other quantizations
describing r.sub.i,j were associated with the index i and not with the
received index j.
In this description the target t is always the target excitation vector for
the particular codebook search. If the codebook candidate index is
considered, then r.sub.i,j is the excitation shape vector of index j but
with the gain properly quantized for the excitation vector i. If the
codebook gain is considered, r.sub.i,j denotes the gain indexed j with the
codebook index obtained from the search. Let us denote by D.sub.i
(r.sub.i,j) the mean distance between the parameter r.sub.i,j and the
target (optimal) value of the parameter t. Note that D.sub.i (r.sub.i,j)
describes a function of j. That is, D.sub.i (r.sub.i,j) is a penalty
function for changing the transmission index from i to j under the
constraint that the quantization index i is associated with the parameter
level or codebook entry of best fit. Further, N is the number of entries
or levels, and M denotes the number of bits of a transmission label. An
appropriate error criterion, which describes the sensitivity of the
parameter encoding to single bit errors is now:
##EQU1##
where f(c.sup.(i),k) is the index of the parameter associated with the
code word obtained by flipping the k'th bit of code word c.sup.(i). During
optimization of the encodings {c}, one compares the criterion .epsilon.
for various encoding configuration. Thus, the reference level D.sub.i
(r.sub.i,i) is of no significance, and can be omitted from criterion (1).
The error criterion used in the codebook selection process of the CELP
procedure cannot be used directly for the penalty function D.sub.i () of
equation (1) because the latter is a statistical average of the
performance, while the former is evaluated for individual frames only.
However, the CELP error criterion can be used as a starting point in the
selection of a proper penalty function, which can be evaluated quickly.
The selection process of the CELP procedure uses a least-squares error
distance measure. The selection process will give identical results if the
least-squares criterion is replaced by a signal to noise ratio, or its
logarithm. This is important since the least-squares error criterion is
not appropriate for averaging over a large number of frames; it weighs
frames with large absolute error unduly heavily. To eliminate this
problem, the mean logarithmic segmental signal to noise ratio is commonly
used to evaluate the objective performance of the CELP and multi-pulse
procedures. Thus, it is reasonable to choose D.sub.i (r.sub.i,j) to be the
mean logarithmic segmental signal to noise ratio of the distorted speech
signal generated with the parameter value or vector r.sub.i,j.
The criterion used for the vector quantization of CELP is commonly modified
to better model the perceived error. Due to masking, errors in spectral
regions with high | | |