|
Description  |
|
|
MICROFICHE APPENDIX
Included in this application is Microfiche Appendix A. The total number of
microfiche is 1 sheet and the total number of frames is 37.
Cross-Reference to Related Application
The following application was filed concurrently with this application and
is assigned to the same assignees as this application:
R. H. Ketchum, et al, "Code Excited Linear Predictive Vocoder Using Virtual
Searching," Ser. No. 067,650.
TECHNICAL FIELD
This invention relates to low bit rate coding and decoding of speech and in
particular to an improved code excited linear predictive vocoder.
BACKGROUND OF THE INVENTION
Code excited linear predictive coding (CELP) is a well-known technique.
This coding technique synthesizes speech by utilizing encoded excitation
information to excite a linear predictive (LPC) filter. This excitation is
found by searching through a table of candidate excitation vectors on a
frame-by-frame basis.
LPC analysis is performed on the input speech to determine the LPC filter.
The analysis proceeds by comparing the outputs of the LPC filter when it
is excited by the various candidate vectors from the table or codebook.
The best candidate is chosen based on how well its corresponding
synthesized output matches the input speech. After the best match has been
found, information specifying the best codebook entry and the filter are
transmitted to the synthesizer. The synthesizer has a similar codebook and
accesses the appropriate entry in that codebook, using it to excite the
same LPC filter.
The codebook is made up of vectors whose components are consecutive
excitation samples. Each vector contains the same number of excitation
samples as there are speech samples in a frame. The vectors can be
constructed in one of two ways. In the first method, disjoint sets of
samples are used to define the vectors. In the second method, the
overlapping codebook, the vectors are defined by shifting a window along a
linear array of excitation samples.
The excitation samples used in the vectors in the CELP codebook can come
from a number of possible sources. One particular example is
Stochastically Excited Linear Prediction (SELP) method, which uses white
noise, or random numbers, as the samples. Another method is to use an
adaptive codebook. In such a scheme, the synthetic excitation determined
for the present frame is used to update the codebook for future frames.
This procedure allows the excitation codebook to adapt to the speech.
A problem with the CELP techniques for coding speech is that each
excitation set of information in the codebook must be used to excite the
LPC filter and then the excitation results must be compared utilizing an
error criterion. Normally, the error criterion used is to determine the
sum of the squared difference between the original and the synthesized
speech samples resulting from the excitation information for each set of
information. These calculations involve the convolution of each set of
excitation information stored in the codebook with the LPC filter. The
calculations are performed by using vector and matrix operations of the
excitation information and the LPC filter. The problem is the large number
of calculations, approximately 500 million multiply-add operations per
second for a 4.8 Kbps vocoder, that must be performed.
SUMMARY OF THE INVENTION
The following problem is solved and a technical advance is achieved by a
vocoder that utilizes a highly efficient CELP computational unit. The
computational unit utilizes a finite impulse response LPC filter and an
overlapping codebook to perform the calculations for the CELP operations
in a recursive manner. For each excitation vector accessed from the
overlapping codebook, only two sample points of the accessed vector must
have arithmetic operations performed on them to evaluate the new vector
rather than all of the samples of the accessed excitation vector in prior
art methods.
A method in accordance with this invention comprises the steps of: forming
a target set of excitation information in response to the present speech
frame, determining a set of filter coefficients in response to the same
speech frame, calculating a finite impulse response filter model in
response to the filter coefficients, recursively calculating error values
by sequentially applying each of a plurality of candidate sets of
excitation information stored in a table to the finite impulse response
filter to determine the error value between the response of the finite
impulse response filter to each of the excitation candidate sets and the
target excitation set, and communicating the filter coefficients and
information representing the location of the selected candidate set in the
table that had the smallest error value for reproduction of the speech
frame.
Advantageously, the method further comprises the steps of forming another
target excitation set by subtracting the original target excitation set by
the selected candidate excitation set, recursively calculating another
error value for each of another plurality of candidate excitation sets
stored in another table in response to the finite impulse response filter
and each of the other candidate sets and the other target excitation set,
selecting one of the other candidate sets having the smallest error value,
and communicating information representing the location in the other table
of the selected other candidate set for reproduction of speech for the
present frame.
Advantageously, the candidate excitation sets are stored in the table in an
overlapping manner whereby each candidate set differs from the previous
candidate set by only a first and a second subset of excitation
information and the step of recursively calculating comprises the steps of
removing the effects of the first subset of excitation information from
the error value of the previous candidate set to form a temporary error
value and adding in the effects of the second subset of excitation
information to the temporary error value to form the error value for the
present candidate excitation set under calculation.
Also, the step of forming a target excitation set comprises the steps of
calculating a ringing set of information for the previous frame,
subtracting that ringing set from the speech for the present frame to
generate an intermediate set, and whitening filtering based on the filter
coefficients for the present frame the intermediate set.
In addition, the step of calculating the ringing set comprises the step of
adding the selected candidate excitation set from each of the tables
together to form a synthesis excitation set; filtering based on the filter
coefficients the synthesis excitation set; and zero-impulse response
filtering based on the filter coefficients and the filtered synthesis
excitation set from the previous frame. Also, the method further comprises
the step of adding the synthesis excitation set into the first table in
order to update that table.
Advantageously, an apparatus in accordance with this invention has a
calculator that forms a target excitation set from the present frame, an
analyzer that determines a set of filter coefficients in response to the
present frame, a calculator that calculates finite impulse response filter
information from the filter coefficients, a recursive calculator that
calculates an error value for each of a plurality of candidate excitation
sets stored in a table in response to the finite impulse response filter
information and each of the stored candidate excitation sets and the
target excitation set, and an encoder that transfers the filter
coefficients and the location of the selected candidate excitation set in
the table that had the smallest value for reproduction by a decoder.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 illustrates, in block diagram form, analyzer and synthesizer
sections of a vocoder which is the subject of this invention;
FIG. 2 illustrates, in graphic form, the formation of excitation vectors
from codebook 104 using the virtual search technique;
FIGS. 3 through 6 illustrate, in graphic form, the vector and matrix
operation which are the subject of this invention;
FIG. 7 illustrates, in greater detail, adaptive searcher 106 of FIG. 1;
FIG. 8 illustrates, in greater detail, virtual search control 708 of FIG.
7; and
FIG. 9 illustrates, in greater detail, energy calculator 709 of FIG. 7.
DETAILED DESCRIPTION
FIG. 1 illustrates, in block diagram form, a vocoder which is the subject
of this invention. Elements 101 through 112 represent the analyzer portion
of the vocoder, whereas, elements 151 through 157 represent the
synthesizer portion of the vocoder. The analyzer portion of FIG. 1 is
responsive to incoming speech received on path 120 to digitally sample the
analog speech into digital samples and to group those digital samples into
frames using well-known techniques. For each frame, the analyzer portion
calculates the LPC coefficients representing the formant characteristics
of the vocal tract and searches for entries from both the stochastic
codebook 105 and adaptive codebook 104 that best approximate the speech
for that frame along with scaling factors. The latter entries and scaling
information define excitation information as determined by the analyzer
portion. This excitation and coefficient information is then transmitted
by encoder 109 via path 145 to the synthesizer portion of the vocoder
illustrated in FIG. 1. Stochastic generator 153 and adaptive generator 154
are responsive to the codebook entries and scaling factors to reproduce
the excitation information calculated in the analyzer portion of the
vocoder and to utilize this excitation information to excite the LPC
filter that is determined by the LPC coefficients received from the
analyzer portion to reproduce the speech.
Consider now in greater detail the functions of the analyzer portion of
FIG. 1. LPC analyzer 101 is responsive to the incoming speech to determine
LPC coefficients using well-known techniques. These LPC coefficients are
transmitted to target excitation calculator 102, spectral weighting
calculator 103, encoder 109, LPC filter 110, and zero-input response
filter 111. Encoder 109 is responsive to the LPC coefficients to transmit
the latter coefficients via path 145 to decoder 151. Spectral weighting
calculator 103 is responsive to the coefficients to calculate spectral
weighting information in the form of a matrix that emphasizes those
portions of speech that are known to have important speech content. This
spectral weighting information is based on a finite impulse response LPC
filter. The utilization of a finite impulse response filter will be shown
to greatly reduce the number of calculations necessary for performing the
computations performed in searchers 106 and 107. This spectral weighting
information is utilized by the searchers in order to determine the best
candidate for the excitation information from the codebooks 104 and 105.
Target excitation calculator 102 calculates the target excitation which
searchers 106 and 107 attempt to approximate. This target excitation is
calculated by convolving a whitening filter based on the LPC coefficients
calculated by analyzer 101 with the incoming speech minus the effects of
the excitation and LPC filter for the previous frame. The latter effects
for the previous frames are calculated by filters 110 and 111. The reason
that the excitation and LPC filter for the previous frame must be
considered is that these factors produce a signal component in the present
frame which is often referred to as the ringing of the LPC filter. As will
be described later, filters 110 and 111 are responsive to the LPC
coefficients and calculated excitation from the previous frame to
determine this ringing signal and to transmit it via path 144 to
subtracter 112. Subtracter 112 is responsive to the latter signal and the
present speech to calculate a remainder signal representing the present
speech minus the ringing signal. Calculator 102 is responsive to the
remainder signal to calculate the target excitation information and to
transmit the latter information via path 123 to searcher 106 and 107.
The latter searchers work sequentially to determine the calculated
excitation also referred to as synthesis excitation which is transmitted
in the form of codebook indices and scaling factors via encoder 109 and
path 145 to the synthesizer portion of FIG. 1. Each searcher calculates a
portion of the calculated excitation. First, adaptive searcher 106
calculates excitation information and transmits this via path 127 to
stochastic searcher 107. Searcher 107 is responsive to the target
excitation received via path 123 and the excitation information from
adaptive searcher 106 to calculate the remaining portion of the calculated
excitation that best approximates the target excitation calculated by
calculator 102. Searcher 107 determines the remaining excitation to be
calculated by subtracting the excitation determined by searcher 106 from
the target excitation. The calculated or synthetic excitation determined
by searchers 106 and 107 is transmitted via paths 127 and 126,
respectively, to adder 108. Adder 108 adds the two excitation components
together to arrive at the synthetic excitation for the present frame. The
synthetic excitation is used by the synthesizer to produce the synthesized
speech.
The output of adder 108 is also transmitted via path 128 to LPC filter 110
and adaptive codebook 104. The excitation information transmitted via path
128 is utilized to update adaptive codebook 104. The codebook indices and
scaling factors are transmitted from searchers 106 and 107 to encoder 109
via paths 125 and 124, respectively.
Searcher 106 functions by accessing sets of excitation information stored
in adaptive codebook 104 and utilizing each set of information to minimize
an error criterion between the target excitation received via path 123 and
the accessed set of excitation from codebook 104. A scaling factor is also
calculated for each accessed set of information since the information
stored in adaptive codebook 104 does not allow for the changes in dynamic
range of human speech.
The error criterion used is the square of the difference between the
original and synthetic speech. The synthetic speech is that which will be
reproduced in the synthesizer portion of FIG. 1 on the output of LPC
filter 117. The synthetic speech is calculated in terms of the synthetic
excitation information obtained from codebook 104 and the ringing signal;
and the speech signal is calculated from the target excitation and the
ringing signal. The excitation information for synthetic speech is
utilized by performing a convolution of the LPC filter as determined by
analyzer 102 utilizing the weighting information from calculator 103
expressed as a matrix. The error criterion is evaluated for each set of
information obtained from codebook 104, and the set of excitation
information giving the lowest error value is the set of information
utilized for the present frame.
After searcher 106 has determined the set of excitation information to be
utilized along with the scaling factor, the index into the codebook and
the scaling factor are transmitted to encoder 109 via path 125, and the
excitation information is also transmitted via path 127 to stochastic
searcher 107. Stochastic searcher 107 subtracts the excitation information
from adaptive searcher 106 from the target excitation received via path
123. Stochastic searcher 107 then performs operations similar to those
performed by adaptive searcher 106.
The excitation information in adaptive codebook 104 is excitation
information from previous frames. For each frame, the excitation
information consists of the same number of samples as the sampled original
speech. Advantageously, the excitation information may consist of 55
samples for a 4.8 Kbps transmission rate. The codebook is organized as a
push down list so that the new set of samples are simply pushed into the
codebook replacing the earliest samples presently in the codebook. When
utilizing sets of excitation information out of codebook 104, searcher 106
does not treat these sets of information as disjoint sets of samples but
rather treats the samples in the codebook as a linear array of excitation
samples. For example, searcher 106 will form the first candidate set of
information by utilizing sample 1 through sample 55 from codebook 104, and
the second set of candidate information by using sample 2 through sample
56 from the codebook. This type of searching a codebook is often referred
to as an overlapping codebook.
As this linear searching technique approaches the end of the samples in the
codebook there is no longer a full set of information to be utilized. A
set of information is also referred to as an excitation vector. At that
point, the searcher performs a virtual search. A virtual search involves
repeating accessed information from the table into a later portion of the
set for which there are no samples in the table. This virtual search
technique allows the adaptive searcher 106 to more quickly react to
transitions from an unvoiced region of speech to a voiced region of
speech. The reason is that in unvoiced speech regions the excitation is
similar to white noise whereas in the voiced regions there is a
fundamental frequency. Once a portion of the fundamental frequency has
been identified from the codebooks, it is repeated.
FIG. 2 illustrates a portion of excitation samples such as would be stored
in codebook 104 but where it is assumed for the sake of illustration that
there are only 10 samples per excitation set. Line 201 illustrates that
the contents of the codebook and lines 202, 203 and 204 illustrate
excitation sets which have been formed utilizing the virtual search
technique. The excitation set illustrated in line 202 is formed by
searching the codebook starting at sample 205 on line 201. Starting at
sample 205, there are only 9 samples in the table, hence, sample 208 is
repeated as sample 209 to form the tenth sample of the excitation set
illustrated in line 202. Sample 208 of line 202 corresponds to sample 205
of line 201. Line 203 illustrates the excitation set following that
illustrated in line 202 which is formed by starting at sample 206 on line
201. Starting at sample 206 there are only 8 samples in the code book,
hence, the first 2 samples of line 203 which are grouped as samples 210
are repeated at the end of the excitation set illustrated in line 203 as
samples 211. It can be observed by one skilled in the art that if the
significant peak illustrated in line 203 was a pitch peak then this pitch
has been repeated in samples 210 and 211. Line 204 illustrates the third
excitation set formed starting at sample 207 in the codebook. As can be
seen, the 3 samples indicated as 212 are repeated at the end of the
excitation set illustrated on line 204 as samples 213. It is important to
realize that the initial pitch peak which is labeled as 207 in line 201 is
a cumulation of the searches performed by searchers 106 and 107 from the
previous frame since the contents of codebook 104 are updated at the end
of each frame. The statistical searcher 107 would normally arrive first at
a pitch peak such as 207 upon entering a voiced region from an unvoiced
region.
Stochastic searcher 107 functions in a similar manner as adaptive searcher
106 with the exception that it uses as a target excitation the difference
between the target excitation from target excitation calculator 102 and
excitation representing the best match found by searcher 106. In addition,
search 107 does not perform a virtual search.
A detailed explanation is now given of the analyzer portion of FIG. 1. This
explanation is based on matrix and vector mathematics. Target excitation
calculator 102 calculates a target excitation vector, t, in the following
manner. A speech vector s can be expressed as
s=Ht+z.
The H matrix is the matrix representation of the all-pole LPC synthesis
filter as defined by the LPC coefficients received from LPC analyzer 101
via path 121. The structure of the filter represented by H is described in
greater detail later in this section and is part of the subject of this
invention. The vector z represents the ringing of the all-pole filter from
the excitation received during the previous frame. As was described
earlier, vector z is derived from LPC filter 110 and zero-input response
filter 111. Calculator 102 and subtracter 112 obtain the vector t
representing the target excitation by subtracting vector z from vector s
and processing the resulting signal vector through the all-zero LPC
analysis filter also derived from the LPC coefficients generated by LPC
analyzer 101 and transmitted via path 121. The target excitation vector t
is obtained by performing a convolution operation of the all-zero LPC
analysis filter, also referred to as a whitening filter, and the
difference signal found by subtracting the ringing from the original
speech. This convolution is performed using well-known signal processing
techniques.
Adaptive searcher 106 searches adaptive codebook 104 to find a candidate
excitation vector r that best matches the target excitation vector t.
Vector r is also referred to as a set of excitation information. The error
criterion used to determine the best match is the square of the difference
between the original speech and the synthetic speech. The original speech
is given by vector s and the synthetic speech is given by the vector y
which is calculated by the following equation:
y=HL.sub.i r.sub.i +z,
where L.sub.i is a scaling factor.
The error criterion can be written in the following form:
e=(Ht+z-HL.sub.i r.sub.i -z).sup.T (Ht+z-HL.sub.i r.sub.i -z). (1)
In the error criterion, the H matrix is modified to emphasize those
sections of the spectrum which are perceptually important. This is
accomplished through well known pole-bandwidth widing technique. Equation
1 can be rewritten in the following form:
e=(t-L.sub.i r.sub.i).sup.T H.sup.T H(t-L.sub.i r.sub.i). (2)
Equation 2 can be further reduced as illustrated in the following:
e=t.sup.T H.sup.T Ht+L.sub.i r.sub.i.sup.T H.sup.T HL.sub.i r.sub.i
-2L.sub.i r.sub.i.sup.T H.sup.T Ht. (3)
The first term of equation 3 is a constant with respect to any given frame
and is dropped from the calculation of the error in determining which
r.sub.i vector is to be utilized from codebook 104. For each of the
r.sub.i excitation vectors in codebook 104, equation 3 must be solved and
the error criterion, e, must be determined so as to chose the r.sub.i
vector which has the lowest value of e. Before equation 3 can be solved,
the scaling factor, L.sub.i must be determined. This is performed in a
straight forward manner by taking the partial derivative with respect to
L.sub.i and setting it equal to zero, which yields the following equation:
##EQU1##
The numerator of equation 4 is normally referred to as the
cross-correlation term and the denominator is referred to as the energy
term. The energy term requires more computation than the cross-correlation
term. The reason is that in the cross-correlation term the product of the
last three elements needs only to be calculated once per frame yielding a
vector; and then for each new candidate vector, r.sub.i, it is simply
necessary to take the dot product between the candidate vector transposed
and the constant vector resulting from the computation of the last three
elements of the cross-correlation term.
The energy term involves first calculating Hr.sub.i then taking the
transpose of this and then taking the inner product between the transpose
of Hr.sub.i and Hr.sub.i. This results in a large number of matrix and
vector operations requiring a large number of calculations. The present
invention is directed towards reducing the number of calculations and
enhancing the resulting synthetic speech.
In part, the present invention realizes this goal by utilizing a finite
impulse response LPC filter rather than an infinite impulse response LPC
filter as utilized in the prior art. The utilization of a finite impulse
response filter having a constant response length results in the H matrix
having a different symmetry than in the prior art. The H matrix represents
the operation of the finite impulse response filter in terms of matrix
notation. Since the filter is a finite impulse response filter, the
convolution of this filter and the excitation information represented by
each candidate vector, r.sub.i, results in each sample of the vector
r.sub.i generating a finite number of response samples which are
designated as R number of samples. When the matrix vector operation of
calculating Hr.sub.i is performed which is a convolution operation, all of
the R response points resulting from each sample in the candidate vector,
r.sub.i, are summed together to form a frame of synthetic speech.
The H matrix representing the finite response filter is an N+R by N matrix,
where N is the frame length in samples, and R is the length of the
truncated impulse response in number of samples. Using this form of the H
matrix, the response vector Hr has a length of N+R. This form of H matrix
is illustrated in the following equation 5:
##EQU2##
Consider the product of the transpose of the H matrix and the H matrix
itself as in equation 6:
A=H.sup.T H. (6)
Equation 6 results in a matrix A which is N by N square, symmetric, and
Toeplitz as illustrated in the following equation 7.
##EQU3##
Equation 7 illustrates the A matrix which results from H.sup.T H operation
when N is five. One skilled in the art would observe from equation 5 that
depending on the value of R that certain of the elements in matrix A would
be 0. For example, if R=2 then elements A.sub.2, A.sub.3 and A.sub.4 would
be 0.
FIG. 3 illustrates what the energy term would be for the first candidate
vector r.sub.1 assuming that this vector contains 5 samples which means
that N equals 5. The samples X.sub.0 through X.sub.4 are the first 5
samples in adaptive codebook 104. The calculation of the energy term of
equation 4 for the second candidate vector r.sub.2 is illustrated in FIG.
4. The latter figure illustrates that only the candidate vector has
changed and that it has only changed by the deletion of the X.sub.0 sample
and the addition of the X.sub.5 sample.
The calculation of the energy term illustrated in FIG. 3 results in a
scalar value. This scalar value for r.sub.1 differs from that for
candidate vector r.sub.2 as illustrated in FIG. 4 only by the addition of
the X.sub.5 sample and the deletion of the X.sub.0 sample. Because of the
symmetry and Toeplitz nature introduced into the A matrix due to the
utilization of a finite impulse response filter, the scalar value for FIG.
4 can be easily calculated in the following manner. First, the
contribution due to the X.sub.0 sample is eliminated by realizing that its
contribution is easily determinable as illustrated in FIG. 5. This
contribution can be removed since it is simply based on the multiplication
and summation operations involving term 501 with terms 502 and the
operations involving terms 504 with term 503. Similarly, FIG. 6
illustrates that the addition of term X.sub.5 can be added into the scalar
value by realizing that its contribution is due to the operations
involving term 601 with terms 602 and the operations involving terms 604
with the terms 603. By subtracting the contribution of the terms indicated
in FIG. 5 and adding the effect of the terms illustrated in FIG. 6, the
energy term for FIG. 4 can be recursively calculated from the energy term
of FIG. 3. It would be obvious to one skilled in the art that this method
of recursive calculation is independent of the size of the vector r.sub.i
or the A matrix. These recursive calculations allow the candidate vectors
contained within adaptive codebook 104 or codebook 105 to be compared with
each other but only requiring the additional operations illustrated by
FIGS. 5 and 6 as each new excitation vector is taken from the codebook.
In general terms, these recursive calculations can be mathematically
expressed in the following manner. First, a set of masking matrices is
defined as I.sub.k where the last one appears in the kth row.
##EQU4##
In addition, the unity matrix is defined as I as follows:
##EQU5##
Further, a shifting matrix is defined as follows:
##EQU6##
For Toeplitz matrices, the following well known theorem holds:
S.sup.T AS=(I-I.sub.1)A(I-I.sub.1). (11)
Since A or H.sup.T H is Toeplitz, the recursive calculation for the energy
term can be expressed using the following nomenclature. First, define the
energy term associated with the r.sub.j+1 vector as E.sub.j+1 as follows:
E.sub.j+1 =r.sub.j+1.sup.T H.sup.T Hr.sub.j+1. (12)
In addition, vector r.sub.j+1 can be expressed as a shifted version of
r.sub.j combined with a vector containing the new sample of r.sub.j+1 as
follows:
r.sub.j+1 =Sr.sub.j +(I-I.sub.N-1)r.sub.j+1. (13)
Utilizing the theorem of equation 11 to eliminate the shift matrix S allows
equation 12 to be rewritten in the following form:
E.sub.j+1 =E.sub | | |