Claims
What is claimed is:
1. Apparatus for encoding speech comprising
means (330) for storing a set of signals each representative of a random
code and a set of index signals each identifying one of the random codes;
means (203 through 247 except 225 and 245) for partitioning the speech into
successive time frame interval portions and for forming a time-domain
signal representative of the portion of speech in each successive time
frame interval;
means (225, 245, 250) for generating at least one transform domain signal
from each such time-domain signal;
means (305) responsive to each random code signal for generating a
transform domain code signal corresponding thereto, via the same type of
transformation as in the aforesaid means for generating a transform domain
signal;
means (315 and 320, or 501 through 520 and 320) for cross-correlating
transform domain signals for each time frame interval with each of said
transform domain code signals to select one of the transform domain code
signals as yielding minimum error or maximum similarity as a
representative of the speech portion in the time-frame interval; and
means (325) for outputting the index signal corresponding to the random
code signal corresponding to the selected transform domain code signal.
2. Apparatus for encoding speech of the type claimed in claim 1 in which
the means for forming a time domain signal comprises means for forming
said signal as representative of the predictive parameters of the portion
of speech in each successive time frame interval;
the means for generating at least one transform domain signal comprises
means for generating a transform domain signal representative of the
predictive parameters from said time domain signal representative of the
predictive parameters; and
the means for generating at least one transform domain signal further
comprises means (225, 245) for generating a transform domain signal
representative of predictive characteristics for said portion of speech;
the means for cross-correlating includes means responsive to the predictive
characteristics representative signal for forming a signal (γ)
representative of the relative scaling of the transform domain code signal
with respect to a transform domain signal representative of the predictive
parameters for each time frame interval; and
the outputting means comprises means for outputting the relative scaling
signal and the signal representative of the predictive parameters.
3. Apparatus for encoding speech of the type claimed in claim 2, in which
the means for forming a time domain signal as representative of the portion
of speech in each successive time frame interval comprises
means (209, 213, 215) for generating a set of signals representative of the
predictive parameters of the speech in each successive time frame
interval;
means (207, 211) for forming a signal representative of the predictive
residual for the speech in each successive time frame interval; and
means (217, 227, 222, 235, 240, 247) responsive to the predictive residual
generating means and to the predictive parameter signal generating means
for removing the contribution attributable to speech from the previous
time frame.
4. Apparatus for encoding speech of the type claimed in claim 3, in which
the means for partitioning and forming a time domain signal, further
includes
means (220, 230), responsive to the predictive residual generating means,
for producing pitch predictive parameters including contributions of
previous frames; and
the combining means of the outputting means is responsive to said means for
producing pitch predictive parameters.
5. Apparatus for encoding speech of the type claimed in either of claims 2
or 3 in which the cross-correlating means comprises
means (501) for cross-correlating all three of said
predictive-parameter-representative transform domain signal, said
transform domain signal representative of the relative scaling for the
portion of speech, and said transform domain code signal;
means (505, 510, 515, 520) responsive to the output of the means for
cross-correlating specifically and to one or more of the three signals for
producing the relative scaling signal (γ) and for producing a
cross-correlation error signal (E(k)).
6. Apparatus for encoding speech comprising
means (330) for storing a set of signals each representative of a random
code and a set of index signals each identifying one of the random codes;
means (203 through 247 except 225 and 245) for partitioning the speech into
successive time frame interval portions and for forming a time-domain
signal representative of the portion of speech in each successive time
frame interval;
means (225, 245, 250) for generating at least one transform domain signal
from each such time-domain signal;
means (305) responsive to each random code signal for generating a
transform domain code signal corresponding thereto, via the same type of
transformation as in the aforesaid means for generating a transform domain
signal;
means (315 and 320 or 501 through 520 and 320) for responding in a
comparative fashion to transform domain signals for each time frame
interval and, for each such signal, to each of said transform domain code
signals to select one of the transform domain code signals as yielding
minimum error or maximum similarity as a representative of the speech
portion in the time frame interval; and
means (325) for outputting the index signal corresponding to the random
code signal corresponding to the selected transform domain code signal.
7. A method for encoding speech comprising the steps of
storing a set of signals each representative of a random code and a set of
index signals each identifying one of the random codes;
partitioning the speech into successive time frame interval portions;
forming a time-domain signal representative of the portion of speech in
each successive time frame interval;
generating at least one transform domain signal from each such time-domain
signal;
generating a transform domain code signal responsive to each random code
signal, via the same type of transformation as in the aforesaid steps of
generating a transform domain signal;
cross-correlating transform domain signals for each time frame interval
with each of said transform domain code signals to select one of the
transform domain code signals as yielding minimum error or maximum
similarity as a representative of the speech portion in the time-frame
interval; and
outputting the index signal corresponding to the random code signal
corresponding to the selected transform domain code signal.
8. A method for encoding speech of the type claimed in claim 7 in which the
step of forming a time domain signal comprises the step of forming said
signal as representative of the predictive parameters of the portion of
speech in each successive time frame interval;
the step of generating at least one transform domain signal comprises
generating a transform domain signal representative of the predictive
parameter from said time domain signal representative of the predictive
parameters; and
the step of generating at least one transform domain signal further
comprises the step of generating a transform domain signal representative of
predictive characteristics for said portion of speech;
the step of cross-correlating includes the step of forming a signal
(γ) representative of the relative scaling of the transform domain
code signal with respect to a transform domain signal representative of
the predictive parameters for each time frame interval in response to the
representative signal representative of the energy predictive
characteristics; and
the outputting means comprises means for outputting the relative scaling
signal and the signal representative of the predictive parameters.
9. A method for encoding speech of the type claimed in claim 8, in which
the step of forming a time domain signal as representative of the pattern
of the portion of speech in each successive time frame interval comprises
generating a set of signals representative of the predictive parameters of
the speech in each successive time frame interval;
forming a signal representative of the predictive residual for the speech
in each successive time frame interval; and
removing the contribution attributable to speech from the previous time
frame in response to the predictive residual generating means and to the
predictive parameter signal generating means.
10. A method for encoding speech of the type claimed in claim 9, in which
the partitioning step and the step of forming a time domain signal
includes
producing pitch predictive parameters including contributions of previous
frames in response to the predictive residual representative signal; and
the combining step also combines said pitch predictive parameters.
11. A method for encoding speech of the type claimed in either of claims 8
or 9 in which the cross-correlating step comprises
specifically cross-correlating all three of said
predictive-parameter-representative transform domain signal, said
transform domain signal representative of the relative scaling for the
portion of speech, and said transform domain code signal;
applying the output of the specifically cross-correlating step and one or
more of the three signals
to produce the relative scaling signal (γ) and
a cross-correlation error signal (E(k)).
12. A method for encoding speech comprising
storing a set of signals each representative of a random code and a set of
index signals each identifying one of the random codes;
partitioning the speech into successive time frame interval portions;
forming a time-domain signal representative of the portion of speech in
each successive time frame interval;
generating at least one transform domain signal from each such time-domain
signal;
generating a transform domain code signal responsive to each random code
signal via the same type of transformation as in the aforesaid step of
generating a transform domain signal;
responding in a comparative fashion to transform domain signals for each
time frame interval and, for each such signal, to each of said transform
domain code signals to select one of the transform domain code signals as
yielding minimum error or maximum similarity as a representative of the
speech portion in the time frame interval; and
outputting the index signal corresponding to the random code signal
corresponding to the selected transform.
13. Apparatus for producing a speech message comprising
means for receiving a sequence of speech message signals for the successive
time intervals of the speech message, each time interval speech message
signal including a set of transform-domain-coded signals representative of
the time interval portion of the speech message, at least a portion of
which are index signals corresponding to a known set of random codes
means for storing said known set of random codes in one-for-one association
with the corresponding index signals
means for generating said random codes for each of the set of index
signals,
and means for controlling speech wave generation for said time interval in
response to said generated random codes.
14. Apparatus of the type claimed in claim 13
in which the storing means comprises means for storing the random codes
sequentially so that a first portion of each succeeding one is derived
from the latter portion of the preceding one.
15. A method for producing a speech message comprising
receiving a sequence of speech message signals for the successive time
intervals of the speech message, each time interval speech message signal
including a set of transform-domain-coded signals representative of the
time interval portion of the speech message, at least a portion of which
are index signals corresponding to a known set of random codes;
storing said known set of random codes in one-for-one association with the
corresponding index signals;
generating said codes sequentially for each of the set of index signals;
and controlling speech wave generation for said time interval in response
to said sequentially generated random codes.
Description
Our invention relates to speech processing and more particularly to digital
speech coding arrangements.
Digital speech communication systems including voice storage and voice
response facilities utilize signal compression to reduce the bit rate
needed for storage and/or transmission. As is well known in the art, a
speech pattern contains redundancies that are not essential to its
apparent quality. Removal of redundant components of the speech pattern
significantly lowers the number of digital codes required to construct a
replica of the speech. The subjective quality of the speech replica,
however, is dependent on the compression and coding techniques.
One well known digital speech coding system such as disclosed in U.S. Pat.
No. 3,624,302 issued Nov. 30, 1971 includes linear prediction analysis of
an input speech signal. The speech signal is partitioned into successive
intervals of 5 to 20 milliseconds duration and a set of parameters
representative of the interval speech is generated. The parameter set
includes linear prediction coefficient signals representative of the
spectral envelope of the speech in the interval, and pitch and voicing
signals corresponding to the speech excitation. These parameter signals
may be encoded at a much lower bit rate than the speech signal waveform
itself. A replica of the input speech signal is formed from the parameter
signal codes by synthesis. The synthesizer arrangement generally comprises
a model of the vocal tract in which the excitation pulses of each
successive interval are modified by the interval spectral envelope
representative prediction coefficients in an all pole predictive filter.
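The synthesizer just described can be sketched in a few lines of Python. This is a minimal illustration, not the circuit of the cited patent: the function name, the sign convention of the predictor (output equals excitation plus a weighted sum of past outputs), and the first-order coefficient are all assumptions chosen for the example.

```python
def all_pole_synthesis(excitation, a):
    """All-pole filter: s[n] = e[n] + sum_{k=1..p} a[k-1] * s[n-k] (assumed sign convention)."""
    p = len(a)
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, p + 1):
            if n - k >= 0:          # samples before the start are taken as zero
                acc += a[k - 1] * out[n - k]
        out.append(acc)
    return out

# One excitation pulse every 4 samples through a hypothetical first-order predictor.
excitation = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
speech = all_pole_synthesis(excitation, [0.5])
```

Each pulse rings through the filter and decays, so the spectral envelope is carried entirely by the coefficients while the excitation supplies only timing and energy.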
The foregoing pitch excited linear predictive coding is very efficient and
reduces the coded bit rate, e.g., from 64 kb/s to 2.4 kb/s. The produced
speech replica, however, exhibits a synthetic quality that makes speech
difficult to understand. In general, the low speech quality results from
the lack of correspondence between the speech pattern and the linear
prediction model used. Errors in the pitch code or errors in determining
whether a speech interval is voiced or unvoiced cause the speech replica
to sound disturbed or unnatural. Similar problems are also evident in
formant coding of speech. Alternative coding arrangements in which the
speech excitation is obtained from the residual after prediction, e.g.,
APC, provide a marked improvement because the excitation is not dependent
upon an inexact model. The excitation bit rate of these systems, however,
is at least an order of magnitude higher than the linear predictive model.
Attempts to lower the excitation bit rate in the residual type systems
have generally resulted in a substantial loss in quality.
The article "Stochastic Coding of Speech Signals at Very Low Bit Rates" by
Bishnu S. Atal and Manfred Schroeder appearing in the Proceedings of the
International Conference on Communications-ICC'84, May 1984, pp.
1610-1613, discloses a stochastic model for generating speech excitation
signals in which a speech waveform is represented as a zero mean Gaussian
stochastic process with slowly-varying power spectrum. The optimum
Gaussian innovation sequence is obtained by comparing a speech waveform
segment, typically 5 ms. in duration, to synthetic speech waveforms
derived from a plurality of random Gaussian innovation sequences. The
innovation sequence that minimizes a perceptual error criterion is
selected to represent the segment speech waveform. While the stochastic
model described in this article results in low bit rate coding of the
speech waveform excitation signal, a large number of innovation sequences
are needed to provide an adequate selection. The signal processing
required to select the best innovation sequence involves exhaustive search
procedures to encode the innovation signals, but such search arrangements
for code bit rates corresponding to 4.8 Kbit/sec code generation are very
time consuming even when processed on large, high speed scientific
computers. It is an object of the invention to provide improved speech
coding and synthesis of high quality at lower bit rates utilizing
arbitrary codes.
SUMMARY OF THE INVENTION
The foregoing object is realized by replacing the exhaustive search of
innovation sequence stochastic or other arbitrary codes of a speech
analyzer with an arrangement that converts the stochastic codes into
transform domain code signals and generates a set of transform domain
patterns from the transform codes for each time frame interval. The
transform domain code patterns are compared to the transform of the time
interval speech pattern obtained from the input speech to select the best
matching stochastic code and an index signal corresponding to the best
matching stochastic code is output to represent the time frame interval
speech. Transform domain processing reduces the complexity and the time
required for code selection.
The index signal is applied to a decoder in which it is used to select a
stochastic code stored therein. In a predictive speech synthesizer, the
stochastic codes may represent the time frame speech pattern excitation
signal whereby the code bit rate is reduced to that required for the index
signals and the prediction parameters of the time frame. The stochastic
codes may be predetermined overlapping segments of a string of stochastic
numbers to reduce storage requirements.
The invention is directed to an arrangement for processing a speech message
in which a set of arbitrary value code signals such as random numbers
together with index signals identifying the arbitrary value code signals
and signals representative of transforms of the arbitrary valued codes are
formed. The speech message is partitioned into time frame interval speech
patterns and a first signal representative of the speech pattern of each
successive time frame interval is formed responsive to the partitioned
speech. A plurality of second signals representative of time frame
interval patterns formed from the transform domain code signals are
generated. One of said arbitrary code signals is selected for each time
frame interval jointly responsive to the first signal and the second
signals of the time frame interval and the index signal corresponding to
said selected transform signal is output.
According to one aspect of the invention, forming of the first signal
includes generating a third signal that is a transform domain signal
corresponding to the current time frame interval speech pattern and the
generation of each second signal includes producing a fourth signal that
is a transform domain signal corresponding to a time frame interval
pattern responsive to said transform domain code signals. Arbitrary code
selection comprises generating a signal representative of the similarities
between said third and fourth signals and determining the index signal
corresponding to the fourth signal having the maximum similarities signal.
According to another aspect of the invention, the transform domain code
signals are frequency domain transform codes derived from the arbitrary
codes.
According to yet another aspect of the invention, the transform domain code
signals are Fourier transforms of the arbitrary codes.
According to yet another aspect of the invention, a speech message is
formed from the arbitrary codes by receiving a sequence of said outputted
index signals, each identifying a predetermined arbitrary code. Each index
signal corresponds to a time frame interval speech pattern. The arbitrary
codes are concatenated responsive to the sequence of said received index
signals and the speech message is formed responsive to the concatenated
codes.
According to yet another aspect of the invention, a speech message is
formed using a string of arbitrary value coded signals having
predetermined segments thereof identified by index signals. A sequence of
signals identifying predetermined segments of said string are received.
Each of said signals of the sequence corresponds to speech patterns of
successive time frame intervals. The predetermined segments of said
arbitrary valued code string are selected responsive to the sequence of
received identifying signals and the selected arbitrary codes are
concatenated to generate a replica of the speech message.
According to yet another aspect of the invention, the arbitrary value
signal sequences of the string are overlapping sequences.
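The decoder-side use of an overlapping code string can be sketched as below. The sizes are illustrative assumptions: 1024 codes of 40 samples are taken from the FIG. 1 discussion, while the shift of 2 samples between successive codes and the helper name `decode_excitation` are hypothetical. Only the concatenation step is shown; scaling and synthesis filtering are left out.

```python
import random

def decode_excitation(string, indices, length, shift):
    """Concatenate the overlapping segments of `string` named by each received index."""
    excitation = []
    for k in indices:
        start = k * shift                       # segment k begins `shift` samples after segment k-1
        excitation.extend(string[start:start + length])
    return excitation

rng = random.Random(0)
LENGTH, SHIFT, COUNT = 40, 2, 1024              # SHIFT is an assumption, not a value from the text
string = [rng.gauss(0.0, 1.0) for _ in range((COUNT - 1) * SHIFT + LENGTH)]
excitation = decode_excitation(string, [3, 4, 1023], LENGTH, SHIFT)
```

With these numbers the single string holds 2086 values, against 40960 values for 1024 independently stored codes, which is the storage saving that overlapping segments are meant to provide.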
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 depicts a speech encoder utilizing a prior art stochastic coding
arrangement;
FIGS. 2 and 3 depict a general block diagram of a digital speech encoder
using arbitrary codes and transform domain processing that is illustrative
of the invention;
FIG. 4 depicts a detailed block diagram of digital speech encoding signal
processing arrangement that performs the functions of the circuit shown in
FIGS. 2 and 3;
FIG. 5 shows a block diagram of an error and scale factor generating
circuit useful in the arrangement of FIG. 3;
FIGS. 6-11 show flow chart diagrams that illustrate the operation of the
circuit of FIG. 4; and
FIG. 12 shows a block diagram of a speech decoder circuit illustrative of
the invention in which a string of random number codes form an overlapping
sequence of stochastic codes.
GENERAL DESCRIPTION
FIG. 1 shows a prior art digital speech coder arranged to use stochastic
codes for excitation signals. Referring to FIG. 1, a speech pattern applied
to microphone 101 is converted therein to a speech signal which is band
pass filtered and sampled in filter and sampler 105 as is well known in
the art. The resulting samples are converted into digital codes by
analog-to-digital converter 110 to produce digitally coded speech signal
s(n). Signal s(n) is processed in LPC and pitch predictive analyzer 115.
The processing includes dividing the coded samples into successive speech
frame intervals and producing a set of parameter signals corresponding to
the signal s(n) in each successive frame. Parameter signals a(1), a(2), ...,
a(p) represent the short delay correlation or spectral related
features of the interval speech pattern, and parameter signals β(1),
β(2), β(3), and m represent long delay correlation or pitch
related features of the speech pattern. In this type of coder, the speech
signal is partitioned in frames or blocks, e.g., 5 msec or 40 samples in
duration. For such blocks, stochastic code store 120 may contain 1024
random white Gaussian codeword sequences, each sequence comprising a
series of 40 random numbers. Each codeword is scaled in scaler 125, prior
to filtering, by a factor γ that is constant for the 5 msec block.
The speech adaptation is done in recursive filters 135 and 145.
Filter 135 uses a predictor with large memory (2 to 15 msec) to introduce
voice periodicity and filter 145 uses a predictor with short memory (less
than 2 msec) to introduce the spectral envelope in the synthetic speech
signal. Such filters are described in the article "Predictive coding of
speech at low bit rates" by B. S. Atal appearing in the IEEE Transactions
on Communications, Vol. COM-30, pp. 600-614, April 1982. The error
representing the difference between the original speech signal s(n)
applied to differencer 150 and synthetic speech signal ŝ(n) applied from
filter 145 is further processed by linear filter 155 to attenuate those
frequency components where the error is perceptually less important and
amplify those frequency components where the error is perceptually more
important. The stochastic code sequence from store 120 which produces the
minimum mean-squared subjective error signal E(k) and the corresponding
optimum scale factor γ are selected by peak picker 170 only after
processing of all 1024 code word sequences in store 120.
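The exhaustive search of FIG. 1 can be sketched as follows. The sketch makes several simplifying assumptions: the cascade of filters 135, 145 and 155 is collapsed into a single impulse response `f`, the filter starts from zero state, the scale factor is computed in closed form as the text derives next, and only 64 codewords are searched to keep the example fast. All function and variable names are hypothetical.

```python
import random

def filter_codeword(f, c):
    """Zero-state output of the combined filter: y[n] = sum_{i<=n} f[n-i] * c[i]."""
    return [sum(f[n - i] * c[i] for i in range(n + 1) if n - i < len(f))
            for n in range(len(c))]

def exhaustive_search(x, f, codebook):
    """Return (index, gamma, error) of the codeword giving the least squared error."""
    energy = sum(v * v for v in x)
    best = None
    for k, c in enumerate(codebook):
        y = filter_codeword(f, c)               # one full filtering pass per codeword
        corr = sum(xv * yv for xv, yv in zip(x, y))
        power = sum(yv * yv for yv in y)
        if power == 0.0:
            continue
        gamma = corr / power                    # optimum scale factor for this codeword
        error = energy - corr * corr / power    # squared error at the optimum scale
        if best is None or error < best[2]:
            best = (k, gamma, error)
    return best

rng = random.Random(1)
N = 40
codebook = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(64)]
f = [0.9 ** n for n in range(N)]                # hypothetical impulse response
x = [3.0 * v for v in filter_codeword(f, codebook[17])]   # target built from codeword 17
index, gamma, error = exhaustive_search(x, f, codebook)
```

The cost is dominated by `filter_codeword`, which is repeated for every codeword in every frame; this is the step that the transform domain processing is meant to avoid.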
For purposes of analyzing the codeword processing of the circuit of FIG. 1,
synthesis filters 135 and 145 and perceptual weighting filter 155 can be
combined into one linear filter. The impulse response of this equivalent
filter may be represented by the sequence f(n). Only a part of the
equivalent filter output is determined by its input in the current 5 msec
frame since, as is well known in the art, a portion of the filter output
corresponds to signals carried over from preceding frames. The filter
memory from the previous frames plays no role in the search for the
optimum innovation sequence in the present frame. The contributions of the
previous memory to the filter output in the present frame can thus be
subtracted from the speech signal in determining the optimum code word
from stochastic code store 120. The residual after subtracting the
contributions of the filter memory carried over from the previous frames
may be represented by the signal x(n). The filter output contributed by
the kth codeword from store 120 in the present frame is
##EQU1##
where $c^{(k)}(i)$ is the ith sample of the kth codeword. One can rewrite
equation 1 in matrix notation as
$$x(k) = \gamma(k) F c(k), \qquad (2)$$
where F is an $N \times N$ matrix with the term in the nth row and the ith
column given by f(n-i). The total squared error E(k), representing the
difference between x(n) and $x^{(k)}(n)$, is given by
$$E(k) = \lVert x - \gamma(k) F c(k) \rVert^{2}, \qquad (3)$$
where the vector x represents the signal x(n) in vector notation, and
$\lVert \cdot \rVert^{2}$ indicates the sum of the
squares of the vector components. The optimum scale factor $\gamma(k)$ that
minimizes the error E(k) can easily be determined by setting
$\partial E(k)/\partial \gamma(k) = 0$ and this leads to
##EQU2##
The optimum codeword is obtained by finding the minimum of E(k) or the
maximum of the second term on the right side in equation 5.
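Written out from the definitions above, equations 1, 4 and 5 would take the following form. This is a reconstruction from the surrounding text (the matrix F with entries f(n-i), the error of equation 3, and the condition that its derivative with respect to the scale factor vanishes), offered as a sketch rather than as the patent's exact typography; the summation limits in equation 1 assume a causal filter.

```latex
\begin{align}
x^{(k)}(n) &= \gamma(k) \sum_{i=1}^{n} f(n-i)\, c^{(k)}(i), \qquad 1 \le n \le N, \tag{1} \\
\gamma(k) &= \frac{x^{t} F c(k)}{\lVert F c(k) \rVert^{2}}, \tag{4} \\
E(k) &= \lVert x \rVert^{2} - \frac{\bigl( x^{t} F c(k) \bigr)^{2}}{\lVert F c(k) \rVert^{2}}. \tag{5}
\end{align}
```

Equation 4 follows by expanding equation 3 as a quadratic in the scale factor and setting its derivative to zero; substituting the result back gives equation 5, whose first term does not depend on k, so minimizing E(k) is the same as maximizing the second term.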
While the signal processing described with respect to FIG. 1 is relatively
straightforward, the generation of the 1024 error signals E(k) of
equation 5 is a time consuming operation that cannot be accomplished in
real time in currently known high speed, large scale computers. The
complexity of the search processing in FIG. 1 is due to the presence of
the convolution operation represented by the matrix F in the error. The
complexity is substantially reduced if the matrix F is replaced by a
diagonal matrix. This is accomplished by representing the matrix F in the
orthogonal form using singular-value decomposition as described in
"Introduction to Matrix Computations" by G. W. Stewart, Academic Press,
pp. 317-320, 1973. Assume that
$$F = U D V^{t}, \qquad (6)$$
where U and V are orthogonal matrices, D is a diagonal matrix with positive
elements and $V^{t}$ indicates the transpose of V. Because of the
orthogonality of U, equation 3 can be written as
$$E(k) = \lVert U^{t} (x - \gamma(k) F c(k)) \rVert^{2}. \qquad (7)$$
If we now replace F by its orthogonal form as expressed in equation 6, we
obtain
$$E(k) = \lVert U^{t} x - \gamma(k) D V^{t} c(k) \rVert^{2}. \qquad (8)$$
On substituting
$$z = U^{t} x$$
and
$$b(k) = V^{t} c(k), \qquad (9)$$
in equation 8, we obtain
##EQU3##
As before, the optimum $\gamma(k)$ that minimizes E(k) can be determined by
setting $\partial E(k)/\partial \gamma(k) = 0$ and equation 10
simplifies to
##EQU4##
The error signal expressed in equation 11 can be processed much faster
than the expression in equation 5. If Fc(k) is processed in a recursive
filter of order p (typically 20), processing according to equation 11 can
substantially reduce the processing time requirements for stochastic
coding.
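Substituting equation 9 into equation 8 and then eliminating the scale factor would give equations 10 and 11 in the following form. This is a reconstruction from the surrounding definitions, not a quotation; d(n) is assumed to denote the nth diagonal element of D, and $z(n)$ and $b^{(k)}(n)$ the nth components of z and b(k).

```latex
\begin{align}
E(k) &= \lVert z - \gamma(k) D\, b(k) \rVert^{2}
      = \sum_{n=1}^{N} \bigl[ z(n) - \gamma(k)\, d(n)\, b^{(k)}(n) \bigr]^{2}, \tag{10} \\
E(k) &= \sum_{n=1}^{N} z(n)^{2}
      - \frac{\Bigl[ \sum_{n=1}^{N} z(n)\, d(n)\, b^{(k)}(n) \Bigr]^{2}}
             {\sum_{n=1}^{N} \bigl[ d(n)\, b^{(k)}(n) \bigr]^{2}}. \tag{11}
\end{align}
```

Because D is diagonal, each codeword costs only a few sums of N products, and the transformed codewords b(k) do not depend on the input speech, so they can be computed once for the whole codebook.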
Alternatively, the reduced processing time may also be obtained by
extending the operations of equation 5 from the time domain to a transform
domain such as the frequency domain. If the combined impulse response of
the synthesis filter with the long-delay prediction excluded and the
perceptual weighting filter is represented by the sequence h(n), the
filter output contributed by the kth codeword in the present frame can be
expressed as a convolution between its input $\gamma(k) c^{(k)}(n)$ and
the impulse response h(n). The filter output is given by
$$x^{(k)}(n) = \gamma(k)\, h(n) * c^{(k)}(n). \qquad (12)$$
The filter output can be expressed in the frequency domain as
$$X^{(k)}(i) = \gamma(k) H(i) C^{(k)}(i), \qquad (13)$$
where $X^{(k)}(i)$, H(i) and $C^{(k)}(i)$ are discrete Fourier transforms
(DFTs) of $x^{(k)}(n)$, h(n) and $c^{(k)}(n)$, respectively. In practice,
the duration of the filter output can be considered to be limited to a 10
msec time interval and zero outside. Thus a DFT with 80 points is
sufficiently accurate for expressing equation 13. The total squared error
E(k) is expressed in frequency-domain notations as
##EQU5##
where X(i) is the DFT of x(n). If we now express
$$H(i) = d(i) e^{j\phi_i}, \qquad (15)$$
and
$$\xi_i = X(i) e^{-j\phi_i}, \qquad (16)$$
equation 14 is then transformed to
##EQU6##
Again, the scale factor $\gamma(k)$ can be eliminated from equation 17 and
the total error can be expressed as
##EQU7##
where $\xi(i)^{*}$ is the complex conjugate of $\xi(i)$. The frequency-domain search
has the advantage that the singular-value decomposition of the matrix F is
replaced by discrete fast Fourier transforms whereby the overall
processing complexity is significantly reduced. In the transform domain
using either the singular value decomposition or the discrete Fourier
transform processing, further savings in the computational load can be
achieved by restricting the search to a subset of frequencies (or
eigenvectors) corresponding to large values of d(i) (or b(i)). According
to the invention, the processing is substantially reduced whereby real
time operation with microprocessor integrated circuits is realizable. This
is accomplished by replacing the time domain processing involved in the
generation of the error between the synthetic speech signal formed
responsive to the innovation code and the input speech signal of FIG. 1
with transform domain processing as described hereinbefore.
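The frequency-domain search can be sketched as below, assuming the error is the sum over DFT bins of the squared magnitude of X(i) minus the scaled product H(i)C(i), with the scale factor taken as the real value that minimizes it. Any constant normalization of the error is omitted since it does not change which codeword wins. The direct DFT, the 16-entry codebook, the impulse response, and all names are illustrative assumptions; a practical coder would use a fast Fourier transform.

```python
import cmath
import random

def dft(x, size):
    """Direct discrete Fourier transform of x, zero-padded to `size` points."""
    padded = list(x) + [0.0] * (size - len(x))
    return [sum(v * cmath.exp(-2j * cmath.pi * i * n / size)
                for n, v in enumerate(padded))
            for i in range(size)]

def convolve(h, c):
    """Full linear convolution, used here only to build a test target."""
    out = [0.0] * (len(h) + len(c) - 1)
    for n, hv in enumerate(h):
        for i, cv in enumerate(c):
            out[n + i] += hv * cv
    return out

def frequency_domain_search(X, H, code_spectra):
    """Return (index, gamma, error) minimising sum_i |X(i) - gamma*H(i)*C_k(i)|^2."""
    d = [abs(h) for h in H]                         # d(i), magnitude of H(i)
    # xi(i) = X(i) * exp(-j*phi_i): the target with the phase of H(i) removed
    xi = [x * h.conjugate() / m if m > 0.0 else x for x, h, m in zip(X, H, d)]
    energy = sum(abs(v) ** 2 for v in xi)
    best = None
    for k, C in enumerate(code_spectra):
        corr = sum((v.conjugate() * m * c).real for v, m, c in zip(xi, d, C))
        power = sum((m * abs(c)) ** 2 for m, c in zip(d, C))
        if power == 0.0:
            continue
        gamma = corr / power                        # real-valued optimum scale factor
        error = energy - corr * corr / power
        if best is None or error < best[2]:
            best = (k, gamma, error)
    return best

rng = random.Random(2)
N, SIZE = 40, 80                                    # 40-sample frame, 80-point DFT
codebook = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(16)]
h = [0.8 ** n for n in range(N)]                    # hypothetical impulse response
x = [2.5 * v for v in convolve(h, codebook[5])]     # target built from codeword 5
code_spectra = [dft(c, SIZE) for c in codebook]     # computed once, reused for every frame
index, gamma, error = frequency_domain_search(dft(x, SIZE), dft(h, SIZE), code_spectra)
```

Inside the loop there is no filtering at all, only per-bin products, and the sums could be restricted to the bins with large d(i) to cut the work further.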
DETAILED DESCRIPTION
A transform domain digital speech encoder using arbitrary codes for
excitation signals, illustrative of the invention, is shown
in FIGS. 2 and 3. The arbitrary codes may take the form of random number
sequences or may, for example, be varied sequences of +1 and -1 in any
order. Any arrangement of varied sequences may be used with the broad
restriction that the overall average of the sequences is small. Referring
to FIG. 2, a speech pattern such as a spoken message received by
microphone transducer 201 is bandlimited and converted into a sequence of
pulse samples in filter and sampler circuit 203 and supplied to linear
prediction coefficient (LPC) analyzer 209 via analog-to-digital converter
205. The filtering may be arranged to remove frequency components of the
speech signal above 4.0 KHz, and the sampling may be at an 8.0 KHz rate as
is well known in the art. Each sample from circuit 203 is transformed into
an amplitude representative digital code in the analog-to-digital
converter. The sequence of digitally coded speech samples is supplied to
LPC analyzer 209 which is operative, as is well known in the art, to
partition the speech signals into 5 to 20 ms time frame intervals and to
generate a set of linear prediction coefficient signals a(k), k=1, 2, ...,
p representative of the predicted short time spectrum of the speech
samples of each frame. The analyzer also forms a set of perceptually
weighted linear predictive coefficient signals
b(k)=ka(k),
k=1, 2, . . . , p, (19)
where p is the number of the prediction coefficients.
The speech samples from A/D converter 205 are delayed in delay 207 to allow
time for the formation of speech parameter signals a(k) and the delayed
samples are supplied to the input of prediction residual generator 211.
The prediction residual generator, as is well known in the art, is
responsive to the delayed speech samples s(n) and the prediction
parameters a(k) to form a signal ∂(n) corresponding to the
differences between speech samples and their predicted values. The
formation of the predictive parameters and the prediction residual signal
for each frame in predictive analyzer 209 may be performed according to
the arrangement disclosed in U.S. Pat. No. 3,740,476 issued to B. S. Atal,
June 19, 1973, and assigned to the same assignee, or in other arrangements
well known in the art.
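The operation of the prediction residual generator can be sketched as below, in the usual linear-prediction form where each sample is predicted as a weighted sum of the p preceding samples. The function name is hypothetical, and samples before the start of the buffer are taken as zero for simplicity; an encoder running frame by frame would use the tail of the previous frame instead.

```python
def prediction_residual(s, a):
    """Residual: s[n] minus its prediction sum_{k=1..p} a[k-1] * s[n-k]."""
    p = len(a)
    return [s[n] - sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
            for n in range(len(s))]

# A signal that obeys s[n] = 0.5 * s[n-1] exactly leaves only its first sample as residual.
s = [1.0, 0.5, 0.25, 0.125]
residual = prediction_residual(s, [0.5])
```

The closer the predictor matches the signal, the less energy remains in the residual, which is why the residual carries only what the short-term spectral model cannot explain.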
Prediction residual signal generator 211 is operative to subtract the
predictable portion of the frame signal from the sample signals s(n) to
form signal ∂(n) in accordance with
##EQU8##
where p, the number of the predictive coefficients, may be 12, N the
number of samples in a speech frame, may be 40, and a(k) are the
predictive coefficients of the frame. Predictive residual signal
∂(n) corresponds to the speech signal of the frame with the
short term redundancies removed. Longer term redundancy of the order of
several speech frames in the predictive residual signal remains and
predictive parameters β(1), β(2), β(3) and m corresponding
to such longer term redundancy are generated in predictive pitch analyzer
220 such that m is an integer that maximizes
##EQU9##
as described in U.S. Pat. No. 4,354,057 issued to B. S. Atal et al on Jan.
9, 1979. As is well known, digital speech encoders may be formed by
encoding the predictive parameters of each successive frame, and the frame
predictive residual for transmission to decoder apparatus or for storage
for later retrieval. While the bit rate for encoding the predictive
parameters is relatively low, the non-redundant nature of the residual
requires a very high bit rate. According to the invention, an optimum
arbitrary code
##EQU10##
is selected to represent the frame excitation, and a signal K* that
indexes the selected arbitrary excitation code is transmitted. In this
way, the speech code bit rate is minimized without adversely affecting
intelligibility. The arbitrary code is selected in the transform domain to
reduce the selection processing so that it can be performed in real time
with microprocessor components.
Selection of the arbitrary code for excitation includes combining the
predictive residual with the perceptually weighted linear predictive
parameters of the frame to generate a signal y(n). Speech pattern signal
y(n) corresponding to the perceptually weighted speech signal contains a
component ŷ(n) due to the preceding frames. This preceding frame component
ŷ(n) is removed prior to the selection processing so that the stored
arbitrary codes are in effect compared to only the current frame
excitation. Signal y(n) is formed in predictive filter 217 responsive to
the perceptually weighted predictive parameter and the predictive residual
signals of the frame as per the relation
##EQU11##
and is stored in y(n) store 227.
The preceding frame speech contribution signal ŷ(n) is generated in
preceding frame contribution signal generator 222 from the perceptually
weighted predictive parameter signal b(k) of the current frame, t