| United States Patent | 5602961 |
| Inventor(s) | Kolesnik; Victor D. (St. Petersburg, RU);
Trofimov; Andrey N. (St. Petersburg, RU);
Bocharova; Irina E. (St. Petersburg, RU);
Krachkovsky; Victor Y. (St. Petersburg, RU);
Kudryashov; Boris D. (St. Petersburg, RU);
Ovsjannikov; Eugeny P. (St. Petersburg, RU);
Trojanovsky; Boris K. (St. Petersburg, RU);
Kovalov; Sergei I. (St. Petersburg, RU) |
| Abstract | An apparatus and method of coding speech. The apparatus includes a first
circuit being coupled to receive a first signal, the first signal
corresponding to the speech signal. The first circuit is for generating a
first set of parameters corresponding to the first frame. The apparatus
includes a second circuit, being coupled to receive a second signal and
the first set of parameters, the second signal corresponding to the speech
signal, and the second circuit is for generating a third signal. The
apparatus further includes a pulse train analyzer, being coupled to the
second circuit, for generating a third match value, a third set of
parameters, and a third excitation value. The apparatus further includes
a fourth circuit, being coupled to the second circuit, for generating a
fourth match value, a fourth set of parameters, and a fourth excitation
value. The apparatus further includes a fifth circuit, being coupled to
the third circuit and the fourth circuit, for selecting a mode
corresponding to a match value. The apparatus further includes a sixth
circuit, being coupled to the fifth circuit, for selecting a selected set
of parameters and a selected excitation corresponding to the mode. The
apparatus further includes a seventh circuit, being coupled to the first
circuit and the sixth circuit, for generating an encoded signal responsive
to the selected set of parameters and the mode. |
Title Information  |
Method and apparatus for speech compression using multi-mode code
excited linear predictive coding |
| Publication Date |
February 11, 1997 |
Claims  |
What is claimed is:
1. An apparatus for processing an input signal, said input signal including
a frame, said apparatus comprising:
a first circuit coupled to receive a first signal, said first signal
corresponding to said input signal, said first circuit for generating a
first set of parameters corresponding to said frame;
a second circuit coupled to receive said first signal and said first set of
parameters, said second circuit for generating a second signal;
a pulse train analyzer, coupled to said second circuit, said pulse train
analyzer for generating a first match value, a second set of parameters,
and a first excitation value;
a fourth circuit, coupled to said second circuit, said fourth circuit for
generating a second match value, a third set of parameters, and a second
excitation value, said fourth circuit including an adaptive codebook and
an adaptive codebook analyzer, said adaptive codebook being coupled to
said adaptive codebook analyzer;
a fifth circuit, coupled to said pulse train analyzer and said fourth
circuit, for determining a set of admissible excitation search modes based
upon a prior excitation search mode, and said fifth circuit further for
selecting an excitation search mode from said set of admissible excitation
search modes;
a sixth circuit, coupled to said fifth circuit, for selecting a selected
set of parameters and a selected excitation corresponding to said
excitation search mode, and
a seventh circuit, coupled to said first circuit and said sixth circuit,
for generating an encoded signal responsive to said selected set of
parameters and said excitation search mode.
2. The apparatus of claim 1 further comprising:
an eighth circuit, coupled to said second circuit, said eighth circuit for
generating a third match value, a fourth set of parameters, and a third
excitation value, and
wherein, said fifth circuit is coupled to said eighth circuit.
3. The apparatus of claim 2 wherein said eighth circuit further includes a
stochastic codebook analyzer for generating said fourth set of parameters.
4. The apparatus of claim 2 wherein said eighth circuit includes a trellis
codebook analyzer for generating said fourth set of parameters.
5. The apparatus of claim 2 wherein said first set of parameters includes
linear prediction coefficients (LPCs) corresponding to said frame, and
wherein said second circuit is coupled to receive said LPCs and is for
performing ringing removal and perceptual weighting of said first signal
to generate said second signal.
6. The apparatus of claim 3 wherein each of said second, third, and fourth
set of parameters includes an index parameter and a gain parameter.
7. The apparatus of claim 4 wherein said frame includes a subframe, and
wherein said second set of parameters corresponds to said subframe.
8. The apparatus of claim 7 wherein said second set of parameters includes a
pitch parameter, an index parameter, and a phase parameter, and wherein
the index parameter includes an index to a shape pulse.
9. The apparatus of claim 7 wherein an index parameter of said third set of
parameters includes an index to said adaptive codebook.
10. The apparatus of claim 7 wherein said eighth circuit includes a short
adaptive codebook.
11. The apparatus of claim 7 wherein said fifth circuit is for weighting
said first, second and third match values prior to selecting said
excitation search mode.
12. The apparatus of claim 11 wherein said first match value is weighted by
an amount between 0.7-0.9, wherein said second match value is weighted by
an amount between 1.1-1.3, and wherein said third match value is weighted
by an amount between 0.8-1.0.
13. The apparatus of claim 7 wherein said input signal includes a previous
subframe, said previous subframe having said previous excitation search
mode, and said fifth circuit is for selecting said excitation search mode
responsive to said previous subframe.
14. The apparatus of claim 7 wherein said input signal includes digitized
speech.
15. The apparatus of claim 7 further comprising a filter circuit coupled to
receive said input signal and for generating said first signal.
16. The apparatus of claim 7 further comprising a line spectrum pair
circuit, being coupled to said first circuit and said seventh circuit, for
generating line spectrum pair parameters from said first set of
parameters, wherein said seventh circuit includes a multiplexing circuit,
and wherein said seventh circuit is for multiplexing said line spectrum
pair parameters with said selected set of parameters and said selected
excitation.
17. The apparatus of claim 2 wherein said fifth circuit is further
configured to select said excitation search mode corresponding to one of
said set of admissible excitation search modes requiring the least number
of bits and complying with a predetermined error threshold.
18. A multi-mode linear predictive coder for processing digital speech
signals, said digital speech signals being partitioned into frames of a
first predetermined length, where each frame is partitioned into subframes
of a second predetermined length, said coder comprising:
a short-term prediction analyzer responsive to said digital speech signals,
said short-term prediction analyzer for generating linear prediction
parameters and line spectrum parameters;
a variable rate encoder, coupled to said short-term prediction analyzer,
for coding differences of said line spectrum parameters by a predetermined
variable rate code;
a ringing removal and perceptual weighting circuit for ringing removal and
perceptual weighting said digital speech signals to produce predistorted
speech vectors for successive subframes;
a multi-mode excitation analyzer, coupled to said ringing removal and
perceptual weighting circuit, for generating a set of excitations, a set
of match values, and a set of parameters, each excitation in said set of
excitations corresponding to a maximal value of a match function in said
set of match values;
a pause analyzer, responsive to said digital speech signals, for pause
detecting and producing a pause mode signal;
a comparator and controller, coupled to said multi-mode excitation analyzer
and said pause analyzer, for weighting and comparing said match function
values for each of a plurality of excitation search modes, and for
generating a current excitation search mode corresponding to one of said
plurality of excitation search modes with a maximal weighted match
function value;
a selector of parameters, coupled to said multi-mode excitation analyzer,
for generating selected parameters from said set of parameters
corresponding to said current excitation search mode; and
a selector of excitations, coupled to said multi-mode excitation analyzer,
for selecting a current excitation from said set of excitations
corresponding to said current excitation search mode.
19. The multi-mode linear predictive coder as recited in claim 18, wherein
said multi-mode excitation analyzer further comprises:
an adaptive codebook (ACB) analyzer, coupled to said ringing removal and
perceptual weighting circuit, for generating an ACB excitation, an ACB
match function and ACB parameters for each subframe in said frame;
a pulse train analyzer, coupled to said ringing removal and perceptual
weighting circuit, for generating a pulse excitation, a pulse match
function and pulse parameters;
a shortened adaptive codebook (SACB) analyzer, coupled to said ringing
removal and perceptual weighting circuit, for generating a SACB codebook
excitation and SACB parameters; and
a stochastic analyzer, coupled to said ringing removal and perceptual
weighting circuit, said stochastic analyzer for generating a stochastic
gain, a stochastic codeword index, a stochastic excitation, and a
stochastic match function, said stochastic excitation corresponding to
said SACB excitation.
20. The multi-mode linear predictive coder of claim 19 wherein said
stochastic analyzer is a trellis analyzer, and wherein said stochastic
gain is a trellis gain, said stochastic codeword index is a trellis
codeword index, said stochastic excitation is a trellis excitation, and
said stochastic match function is a trellis match function.
21. A method of selecting encoding parameters, said method for use in a
speech synthesizer to improve the subjective speech quality, said method
comprising the steps of:
constructing a pulse based upon the time inversion of a pulse response of a
response filter;
generating an excitation vector in the form of multiple pitch spaced pulses
using a set of pitch values, a set of phase values, and said pulse, said
set of pitch values and said set of phase values derived from a
perceptually weighted speech signal;
computing energy values and correlation values, said energy values
determined using a filtered vector, said correlation values representing
the correlation between said filtered vector and said perceptually
weighted speech signal, said filtered vector corresponding to said
excitation vector; and
selecting the pulse excitation from said excitation vector corresponding to
correlation values and energy values that maximize a pulse mode match
function.
22. The method of claim 21 wherein said method further comprises the step
of receiving a set of linear prediction coefficients (LPCs), said LPCs
defining a linear prediction (LP) analysis filter of order m, and said
step of constructing a pulse uses the following equations:
$A(z) = 1 - a_1 z^{-1} - a_2 z^{-2} - \dots - a_m z^{-m}$;
$U(z) = (1 - \delta z^{-1})/A(\alpha z)$;
$V_{0,n-1}(z) = z^{n-1} U_{0,n-1}(z^{-1})$;
$W(z) = (V_{n-m,n-1}(z) + z^{-n} U_{0,d}(z)) A(\beta z)$; and
$V_{n,M-1}(z) = W_{n,M-1}(z)$; where $X_{i,j}(z)$ represents the
polynomial $X_{i,j}(z) = x_i z^{-i} + x_{i+1} z^{-(i+1)} + \dots + x_j z^{-j}$,
$j > i$, where $A(z)$ denotes the Z-transform for the LP
analysis filter, where $a_i$ represents one linear prediction
coefficient of said set of LPCs, where samples of said pulse are
represented by $V_i(z)$, where $n < M$, where $\alpha$ and $\delta$ are
empirically chosen constants, $0 \le \alpha, \delta \le 1$, where
$\beta$ is an empirically chosen constant, $0 \le \beta \le 1$, and
where $d$, $d \ge 0$, is a fixed constant.
23. The method of claim 22 wherein $\alpha$ is in the range 0.9 to 0.98,
$\delta$ is in the range 0.55 to 0.75, and $\beta$ is in the range 0.6 to
0.8.
24. A pulse train analyzer for use in a speech synthesizer comprising:
a pulse generator coupled to receive a set of pitch values, a set of phase
values, and a set of linear prediction coefficients (LPCs), said set of
pitch values and said set of phase values derived from a perceptually
weighted speech signal, said set of LPCs derived from an input speech
signal, said pulse generator producing an excitation vector based upon
said set of pitch values, said set of phase values, and said set of LPCs;
a correlation circuit coupled to said pulse generator and further coupled
to receive said perceptually weighted speech signal, said correlation
circuit using a pulse mode match function to determine a set of match
values, said set of match values based upon said excitation vector and
said perceptually weighted speech signal; and
a pulse train selector coupled to receive said set of match values, said
pulse train selector selecting the excitation from said excitation vector
that corresponds to the maximal value in said set of match values as a
selected pulse excitation.
25. The pulse train analyzer of claim 24, said correlation circuit further
comprising:
a response filter coupled to said pulse generator producing a pulse
response corresponding to said excitation vector;
a correlator coupled to receive said perceptually weighted speech signal
and coupled to said response filter, said correlator computing correlation
values between said pulse response and said perceptually weighted speech
signal;
an energy calculator coupled to said response filter computing energy
values using said pulse response; and
a match function calculator coupled to said correlator and said energy
calculator to produce said set of match values using said pulse mode match
function, said set of match values based upon applying said pulse mode
match function to said correlation values and said energy values.
26. The pulse train analyzer of claim 25, said pulse generator further
comprising:
a pulse train generator coupled to receive said set of pitch values and
said set of phase values, said set of pitch values and said set of phase
values derived from said perceptually weighted speech signal, said pulse
train generator producing said excitation vector in the form of multiple
pitch spaced pulses based upon said set of pitch values, said set of phase
values, and a pulse; and
a pulse shape generator coupled to said pulse train generator, said pulse
shape generator producing a pulse using a formula corresponding to the
time inversion of the pulse response. |
Description  |
BACKGROUND OF THE INVENTION
1. Field of Invention
The present invention generally relates to speech coding at low bit rates
(in the range of 2.4-4.8 kb/s). In particular, the present invention relates
to improved excitation generation and linear prediction coefficient coding,
directed at reducing the number of data bits for coded speech.
2. Description of Related Art
Digital speech communication systems including voice storage and voice
response facilities utilize signal compression to reduce the bit rate
needed for storage and/or transmission. As is well known in the art, a
speech pattern contains redundancies that are not essential to its
apparent quality. Removal of redundant components of the speech pattern
significantly lowers the number of bits required to synthesize the speech
signal. A goal of effective digital speech coding is to provide an
acceptable subjective quality of synthesized speech at low bit rates.
However, the coding must also be fast enough to allow for real time
implementation.
One method used to partially achieve these goals is based on the standard
Linear Prediction (LP) technique. The characteristic features of this
technique are the following. The sampled and quantized speech signal is
partitioned into successive intervals (frames), then a set of parameters
representative of the interval speech is generated. The parameter set
includes linear prediction coefficients (LPCs) which determine an LP
filter, and the best excitation signal. The best LPCs and excitation are
then used to produce a synthesized signal close to the original speech
signal. This is done on a per frame basis.
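The per-frame LPC computation can be sketched as follows. This is a minimal illustration using the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain LPCs; the text above does not prescribe a particular method, and the function names `autocorrelation` and `levinson_durbin` are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - k] for t in range(k, n))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations by the Levinson-Durbin recursion.

    Assumed convention: sample s[t] is predicted as sum_i a[i] * s[t - i],
    i.e. A(z) = 1 - a[1] z^-1 - ... - a[order] z^-order.
    Returns (a, err); a[0] is unused and left at 0.0.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate frame (for example, all zeros)
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k  # remaining prediction error energy
    return a, err


# Illustrative frame: a decaying exponential, s[t] = 0.9 * s[t - 1].
frame = [0.9 ** t for t in range(200)]
a, err = levinson_durbin(autocorrelation(frame, 2), 2)
```

For this illustrative frame the first coefficient comes out close to 0.9 and the second close to zero, as expected for a first-order signal.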
The best excitation is typically found through a look-up in a table, or
codebook. The codebook includes vectors whose components are consecutive
excitation samples. Each vector contains the same number of excitation
samples as there are speech samples in a frame.
One of the most effective approaches of this type is the Code Excited
Linear Prediction (CELP) method which was disclosed in "Predictive Coding
of Speech at Low Bit Rates", Atal B.S., IEEE Transactions on
Communications, vol. COM-30, No. 4, (April, 1982), 600-614.
FIG. 1 illustrates how a CELP implementation generates the best excitation
for an LP filter such that the output of the filter closely approximates
input speech.
In each frame the input speech signal is pre-filtered by a fixed digital
pre-filter 100. Next, the pre-filtered speech is processed by linear
prediction analyzer 101 to estimate the linear predictive filter $A(z)$ of a
prescribed order. Each frame is broken into a predetermined number of
subframes. This allows excitations to be generated for each subframe. Each
speech vector, for a given subframe, is passed through the ringing removal
and perceptual weighting module 102. The speech signal is perceptually
predistorted by a linear filter with the transfer function
$W(z) = A(z)/A(\gamma z)$ for some $\gamma$. The output $w$ of module 102 is
analyzed by the long-term prediction analyzer 103 to obtain a periodic
(pitch) component $p$ relating to the excitation. The best pitch excitation
is found by searching for the index (code word number) $I_A$ in an adaptive
codebook (ACB) and computing the optimal gain factor $g_A$. These
jointly minimize the squared norm $\|d\|^2$ of the vector
$d = w - b g_A$, where $b$ denotes the response of the synthesis filter
$1/A(\gamma z)$ 104 excited by $p$. For this purpose, an exhaustive search in
the ACB is performed to find the maximal value of the match function:
$M = (w,b)^2/(b,b)$.
The optimal gain value is determined as follows:
$g_A = (w,b)/(b,b)$.
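The exhaustive search described above can be sketched in a few lines. For a fixed response b, the squared error of w minus the scaled response is smallest at the gain (w,b)/(b,b), where it equals (w,w) - (w,b)^2/(b,b); maximizing the match function therefore minimizes the error. The sketch assumes the filtered responses of all codebook entries are already available as plain lists; `dot` and `search_codebook` are hypothetical names.

```python
def dot(x, y):
    """Inner product (x, y) of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def search_codebook(w, responses):
    """Exhaustively search for the entry maximizing M = (w,b)^2 / (b,b).

    w         -- perceptually weighted target vector for the subframe
    responses -- responses[i] is the filtered response b of entry i
                 (assumed precomputed by the caller)
    Returns (best_index, best_gain, best_match); best_index is None
    if every response has zero energy.
    """
    best_index, best_gain, best_match = None, 0.0, -1.0
    for index, b in enumerate(responses):
        energy = dot(b, b)
        if energy == 0.0:
            continue  # an all-zero response cannot match anything
        corr = dot(w, b)
        match = corr * corr / energy
        if match > best_match:
            best_index, best_gain, best_match = index, corr / energy, match
    return best_index, best_gain, best_match


# Illustrative data: entry 1 is exactly half of the target.
w = [1.0, 2.0, 3.0]
responses = [[1.0, 0.0, 0.0], [0.5, 1.0, 1.5], [0.0, 0.0, 1.0]]
index, gain, match = search_codebook(w, responses)
```

With this illustrative data the search selects entry 1 with a gain of 2, since that entry scaled by 2 reproduces the target exactly.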
The residual vector $u = w - b g_A$ from the output of adder 105 enters the
stochastic codebook analyzer 108. Here the best residual excitation index
$I_S$ and the optimal gain factor $g_s$ are found. These jointly
minimize the squared norm $\|d\|^2$
of the error vector $d = u - r g_s$, where $r$ denotes the response of the
synthesis filter of stochastic codebook analyzer 108 excited by the code
word $c$ from the precomputed stochastic codebook 109. Using the multiplier
106, multiplier 110, and adder 107, we obtain the resulting excitation
vector $e$ for a given subframe as the following sum:
$e = p g_A + c g_s$.
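The two-stage search can be sketched as below: the adaptive codebook is searched against the weighted target, the scaled response is subtracted to form the residual, and the stochastic codebook is searched against that residual. This is a sketch under simplifying assumptions, not the coder's actual implementation: the caller supplies both the raw code vectors and their filtered responses, at least one response per codebook has nonzero energy, and all function names are hypothetical.

```python
def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def best_entry(target, responses):
    """Return (index, gain) maximizing (target,b)^2 / (b,b).

    Assumes at least one response has nonzero energy.
    """
    best_index, best_gain, best_match = None, 0.0, -1.0
    for index, b in enumerate(responses):
        energy = dot(b, b)
        if energy == 0.0:
            continue
        corr = dot(target, b)
        match = corr * corr / energy
        if match > best_match:
            best_index, best_gain, best_match = index, corr / energy, match
    return best_index, best_gain


def two_stage_excitation(w, acb, acb_resp, scb, scb_resp):
    """Build the subframe excitation as pitch part plus stochastic part.

    acb / scb           -- raw adaptive / stochastic code vectors (p and c)
    acb_resp / scb_resp -- their filtered responses (b and r), precomputed
    """
    i_a, g_a = best_entry(w, acb_resp)
    # Residual target after removing the scaled pitch contribution.
    u = [wi - g_a * bi for wi, bi in zip(w, acb_resp[i_a])]
    i_s, g_s = best_entry(u, scb_resp)
    # Excitation: scaled pitch vector plus scaled stochastic code word.
    e = [g_a * p + g_s * c for p, c in zip(acb[i_a], scb[i_s])]
    return i_a, g_a, i_s, g_s, e


# Illustrative data with an identity filter (responses equal the vectors).
acb = [[1.0, 0.0], [0.0, 1.0]]
scb = [[0.0, 1.0], [1.0, 1.0]]
result = two_stage_excitation([2.0, 1.0], acb, acb, scb, scb)
```

Searching the two codebooks sequentially rather than jointly keeps the cost proportional to the sum of the codebook sizes instead of their product, at the price of a possibly suboptimal pair.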
For the CELP speech coding technique, the synthesized speech quality
rapidly degrades as data rates are reduced. For example, at 4.8 kb/s, a
10-bit codebook is generally used. However, at 2.4 kb/s, the codebook
index must be reduced to 5 bits. Since a 5-bit codebook is too small to
cover many types of speech signals, the speech quality degrades abruptly
at bit rates below 4.8 kb/s.
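The codebook sizes behind these figures follow from the index width, since a $k$-bit index addresses $2^k$ code words. As a worked comparison:

```latex
\[
2^{10} = 1024 \ \text{code words at 4.8 kb/s}, \qquad
2^{5} = 32 \ \text{code words at 2.4 kb/s}
\]
```

Halving the index width therefore shrinks the set of candidate excitations by a factor of 32.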
Various improvements of the CELP technique exist. These techniques attempt
to provide acceptable speech compression at data rates below 4.8 kb/s.
Such techniques are reported in the following references:
Zinser R. L., Koch S. R. "CELP coding at 4.0 kb/sec and below: improvements
to FS-1016." Proceedings of the 1992 IEEE International Conference on
Acoustics, Speech, and Signal Processing, pp. I-313 through I-316, March
1992;
Wang S., Gersho A. "Improved phonetically-segmented vector excitation
coding at 3.4 kb/s." Proceedings of the 1992 IEEE International Conference
on Acoustics, Speech, and Signal Processing, pp. I-349 through I-352,
March 1992;
J. Ha | | |