Digital speech processor using arbitrary excitation coding
United States Patent 4827517
Inventor(s)     Atal; Bishnu S. (New Providence, NJ); Trancoso; Isabel M. M. (Oeiras, PT)
Abstract
An arrangement for processing a speech message uses arbitrary value codes to form time frame excitation signals. The arbitrary value codes, e.g., random numbers, are stored together with signals indexing the codes, and transform domain signals corresponding to the arbitrary codes are generated. The speech message is partitioned into time frame interval speech patterns, and a first signal representative of the transform domain speech pattern of each successive time frame interval is formed responsive to the partitioned speech message. A plurality of second signals representative of time frame interval patterns corresponding to the transform code signals are generated responsive to said set of transform signals. One of the arbitrary code signals is selected jointly responsive to the first and second signals of each successive time interval to represent the time frame speech signal excitation, and the index signal corresponding to the selected arbitrary code signal is output. A replica of the speech message is formed from the arbitrary codes by concatenating a sequence of the arbitrary codes identified by the output index signals.
Owner/Assignee     American Telephone and Telegraph Company, AT&T Bell Laboratories (Murray Hill, NJ)
Publication Date     May 2, 1989
Application Number     06/810,920
Filing Date     December 26, 1985
US Classification     704/218 704/219 704/221
Int'l Classification     G10L 001/00
Examiner     Shoop Jr.; William M.
Assistant Examiner     Ip; Paul
Attorney/Law Firm     Cubert; Jack S., Wisner; Wilford L.
USPTO Field of Search     381/31 381/32 381/34 381/35 381/36 381/40 381/41 381/42 381/43 381/46 381/49 381/51 364/513.5 364/715 364/723
U.S. References
Patent     Inventor     U.S. Class     Date
3588460
4701954     Atal     704/216     Oct. 1987
4472832     Atal     704/221     Sep. 1984
4354057     Atal     704/219     Oct. 1982
4184049     Crochiere     704/229     Jan. 1980
4133976     Atal     704/226     Jan. 1979
4092493     Rabiner     704/237     May 1978
4022974     Kohut     704/262     May 1977
3982070     Flanagan     704/265     Sep. 1976
3740476     Atal     704/207     Jun. 1973
3624302     Atal     324/76.48     Nov. 1971
Claims
 


What is claimed is:

1. Apparatus for encoding speech comprising

means (330) for storing a set of signals each representative of a random code and a set of index signals each identifying one of the random codes;

means (203 through 247 except 225 and 245) for partitioning the speech into successive time frame interval portions and for forming a time-domain signal representative of the portion of speech in each successive time frame interval;

means (225, 245, 250) for generating at least one transform domain signal from each such time-domain signal;

means (305) responsive to each random code signal for generating a transform domain code signal corresponding thereto, via the same type of transformation as in the aforesaid means for generating a transform domain signal;

means (315 and 320, or 501 through 520 and 320) for cross-correlating transform domain signals for each time frame interval with each of said transform domain code signals to select one of the transform domain code signals as yielding minimum error or maximum similarity as a representative of the speech portion in the time-frame interval; and

means (325) for outputting the index signal corresponding to the random code signal corresponding to the selected transform domain code signal.

2. Apparatus for encoding speech of the type claimed in claim 1 in which the means for forming a time domain signal comprises means for forming said signal as representative of the predictive parameters of the portion of speech in each successive time frame interval;

the means for generating at least one transform domain signal comprises means for generating a transform domain signal representative of the predictive parameters from said time domain signal representative of the predictive parameters; and

the means for generating at least one transform domain signal further comprises means (225, 245) for generating a transform domain signal representative of predictive characteristics for said portion of speech;

the means for cross-correlating includes means responsive to the predictive characteristics representative signal for forming a signal (γ) representative of the relative scaling of the transform domain code signal with respect to a transform domain signal representative of the predictive parameters for each time frame interval; and

the outputting means comprises means for outputting the relative scaling signal and the signal representative of the predictive parameters.

3. Apparatus for encoding speech of the type claimed in claim 2, in which

the means for forming a time domain signal as representative of the portion of speech in each successive time frame interval comprises

means (209, 213, 215) for generating a set of signals representative of the predictive parameters of the speech in each successive time frame interval;

means (207, 211) for forming a signal representative of the predictive residual for the speech in each successive time frame interval; and

means (217, 227, 222, 235, 240, 247) responsive to the predictive residual generating means and to the predictive parameter signal generating means for removing the contribution attributable to speech from the previous time frame.

4. Apparatus for encoding speech of the type claimed in claim 3, in which the means for partitioning and forming a time domain signal, further includes

means (220, 230), responsive to the predictive residual generating means, for producing pitch predictive parameters including contributions of previous frames; and

the combining means of the outputting means is responsive to said means for producing pitch predictive parameters.

5. Apparatus for encoding speech of the type claimed in either of claims 2 or 3 in which the cross-correlating means comprises

means (501) for cross-correlating all three of said predictive-parameter-representative transform domain signal, said transform domain signal representative of the relative scaling for the portion of speech, and said transform domain code signal;

means (505, 510, 515, 520) responsive to the output of the means for cross-correlating specifically and to one or more of the three signals for producing the relative scaling signal (γ) and for producing a cross-correlation error signal (E(k)).

6. Apparatus for encoding speech comprising

means (330) for storing a set of signals each representative of a random code and a set of index signals each identifying one of the random codes;

means (203 through 247 except 225 and 245) for partitioning the speech into successive time frame interval portions and for forming a time-domain signal representative of the portion of speech in each successive time frame interval;

means (225, 245, 250) for generating at least one transform domain signal from each such time-domain signal;

means (305) responsive to each random code signal for generating a transform domain code signal corresponding thereto, via the same type of transformation as in the aforesaid means for generating a transform domain signal;

means (315 and 320 or 501 through 520 and 320) for responding in a comparative fashion to transform domain signals for each time frame interval and, for each such signal, to each of said transform domain code signals to select one of the transform domain code signals as yielding minimum error or maximum similarity as a representative of the speech portion in the time frame interval; and

means (325) for outputting the index signal corresponding to the random code signal corresponding to the selected transform domain code signal.

7. A method for encoding speech comprising the steps of

storing a set of signals each representative of a random code and a set of index signals each identifying one of the random codes;

partitioning the speech into successive time frame interval portions;

forming a time-domain signal representative of the portion of speech in each successive time frame interval;

generating at least one transform domain signal from each such time-domain signal;

generating a transform domain code signal responsive to each random code signal, via the same type of transformation as in the aforesaid steps of generating a transform domain signal;

cross-correlating transform domain signals for each time frame interval with each of said transform domain code signals to select one of the transform domain code signals as yielding minimum error or maximum similarity as a representative of the speech portion in the time-frame interval; and

outputting the index signal corresponding to the random code signal corresponding to the selected transform domain code signal.

8. A method for encoding speech of the type claimed in claim 7 in which the step of forming a time domain signal comprises the step of forming said signal as representative of the predictive parameters of the portion of speech in each successive time frame interval;

the step of generating at least one transform domain signal comprises generating a transform domain signal representative of the predictive parameters from said time domain signal representative of the predictive parameters; and

the step of generating at least one transform domain signal further comprises the step of generating a transform domain signal representative of predictive characteristics for said portion of speech;

the step of cross-correlating includes the step of forming a signal (γ) representative of the relative scaling of the transform domain code signal with respect to a transform domain signal representative of the predictive parameters for each time frame interval in response to the signal representative of the energy predictive characteristics; and

the outputting step comprises outputting the relative scaling signal and the signal representative of the predictive parameters.

9. A method for encoding speech of the type claimed in claim 8, in which

the step of forming a time domain signal as representative of the pattern of the portion of speech in each successive time frame interval comprises

generating a set of signals representative of the predictive parameters of the speech in each successive time frame interval;

forming a signal representative of the predictive residual for the speech in each successive time frame interval; and

removing the contribution attributable to speech from the previous time frame in response to the predictive residual generating means and to the predictive parameter signal generating means.

10. A method for encoding speech of the type claimed in claim 9, in which the partitioning step and the step of forming a time domain signal includes

producing pitch predictive parameters including contributions of previous frames in response to the predictive residual representative signal; and

the combining step also combines said pitch predictive parameters.

11. A method for encoding speech of the type claimed in either of claims 8 or 9 in which the cross-correlating step comprises

specifically cross-correlating all three of said predictive-parameter-representative transform domain signal, said transform domain signal representative of the relative scaling for the portion of speech, and said transform domain code signal;

applying the output of the specifically cross-correlating step and one or more of the three signals to produce the relative scaling signal (γ) and a cross-correlation error signal (E(k)).

12. A method for encoding speech comprising

storing a set of signals each representative of a random code and a set of index signals each identifying one of the random codes;

partitioning the speech into successive time frame interval portions;

forming a time-domain signal representative of the portion of speech in each successive time frame interval;

generating at least one transform domain signal from each such time-domain signal;

generating a transform domain code signal responsive to each random code signal via the same type of transformation as in the aforesaid step of generating a transform domain signal;

responding in a comparative fashion to transform domain signals for each time frame interval and, for each such signal, to each of said transform domain code signals to select one of the transform domain code signals as yielding minimum error or maximum similarity as a representative of the speech portion in the time frame interval; and

outputting the index signal corresponding to the random code signal corresponding to the selected transform domain code signal.

13. Apparatus for producing a speech message comprising

means for receiving a sequence of speech message signals for the successive time intervals of the speech message, each time interval speech message signal including a set of transform-domain-coded signals representative of the time interval portion of the speech message, at least a portion of which are index signals corresponding to a known set of random codes;

means for storing said known set of random codes in one-for-one association with the corresponding index signals;

means for generating said random codes for each of the set of index signals,

and means for controlling speech wave generation for said time interval in response to said generated random codes.

14. Apparatus of the type claimed in claim 13

in which the storing means comprises means for storing the random codes sequentially so that a first portion of each succeeding one is derived from the latter portion of the preceding one.

15. A method for producing a speech message comprising

receiving a sequence of speech message signals for the successive time intervals of the speech message, each time interval speech message signal including a set of transform-domain-coded signals representative of the time interval portion of the speech message, at least a portion of which are index signals corresponding to a known set of random codes;

storing said known set of random codes in one-for-one association with the corresponding index signals;

generating said codes sequentially for each of the set of index signals;

and controlling speech wave generation for said time interval in response to said sequentially generated random codes.
Description
 


Our invention relates to speech processing and more particularly to digital speech coding arrangements.

Digital speech communication systems including voice storage and voice response facilities utilize signal compression to reduce the bit rate needed for storage and/or transmission. As is well known in the art, a speech pattern contains redundancies that are not essential to its apparent quality. Removal of redundant components of the speech pattern significantly lowers the number of digital codes required to construct a replica of the speech. The subjective quality of the speech replica, however, is dependent on the compression and coding techniques.

One well-known digital speech coding system, such as that disclosed in U.S. Pat. No. 3,624,302 issued Nov. 30, 1971, includes linear prediction analysis of an input speech signal. The speech signal is partitioned into successive intervals of 5 to 20 milliseconds duration, and a set of parameters representative of the interval speech is generated. The parameter set includes linear prediction coefficient signals representative of the spectral envelope of the speech in the interval, and pitch and voicing signals corresponding to the speech excitation. These parameter signals may be encoded at a much lower bit rate than the speech signal waveform itself. A replica of the input speech signal is formed from the parameter signal codes by synthesis. The synthesizer arrangement generally comprises a model of the vocal tract in which the excitation pulses of each successive interval are modified by the interval's spectral-envelope prediction coefficients in an all-pole predictive filter.
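The all-pole synthesis step can be sketched as follows. This is a minimal illustration rather than the circuit of the cited patent: it assumes the common sign convention s(n) = e(n) + sum of a(j)s(n-j) for j = 1..p, and the function name `all_pole_synthesis` and the example coefficients are hypothetical. Passing the last p outputs of one interval as `memory` for the next keeps the filter state continuous across interval boundaries.

```python
def all_pole_synthesis(excitation, a, memory=None):
    """Run excitation through an all-pole filter 1 / (1 - sum_j a(j) z^-j).

    Assumed sign convention: s(n) = e(n) + sum_{j=1..p} a(j) * s(n - j).
    `memory` holds the last p outputs of the previous interval, oldest first,
    so consecutive intervals can be filtered without a discontinuity.
    """
    p = len(a)
    history = list(memory) if memory is not None else [0.0] * p
    if len(history) < p:
        raise ValueError("memory must contain at least p past outputs")
    out = []
    for e in excitation:
        # history[-j] is s(n - j), the output j samples in the past
        s = e + sum(a[j - 1] * history[-j] for j in range(1, p + 1))
        out.append(s)
        history.append(s)
    return out


# Hypothetical second-order coefficients, chosen so the filter is stable
a = [0.5, 0.25]
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
whole = all_pole_synthesis(impulse, a)
first = all_pole_synthesis(impulse[:3], a)
second = all_pole_synthesis(impulse[3:], a, memory=first[-2:])
```

Filtering the six-sample impulse in one call, or as two three-sample intervals with the memory carried over, gives identical output. The FIG. 1 analysis later in this description relies on the same property when it separates the filter memory from the current-frame input.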

The foregoing pitch-excited linear predictive coding is very efficient and reduces the coded bit rate, e.g., from 64 kb/s to 2.4 kb/s. The resulting speech replica, however, exhibits a synthetic quality that makes the speech difficult to understand. In general, the low speech quality results from the lack of correspondence between the speech pattern and the linear prediction model used. Errors in the pitch code, or errors in determining whether a speech interval is voiced or unvoiced, cause the speech replica to sound disturbed or unnatural. Similar problems are also evident in formant coding of speech. Alternative coding arrangements in which the speech excitation is obtained from the residual after prediction, e.g., adaptive predictive coding (APC), provide a marked improvement because the excitation is not dependent upon an inexact model. The excitation bit rate of these systems, however, is at least an order of magnitude higher than that of the pitch-excited linear predictive model. Attempts to lower the excitation bit rate in residual-type systems have generally resulted in a substantial loss in quality.
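As a worked check of the rates quoted above, assuming the 64 kb/s reference is telephone-band PCM at 8000 samples/s and 8 bits/sample (an assumption; the text does not say so), and taking a 20 msec interval from the 5 to 20 msec range mentioned earlier:

```latex
8000~\frac{\text{samples}}{\text{s}} \times 8~\frac{\text{bits}}{\text{sample}} = 64~\text{kb/s},
\qquad
\frac{64~\text{kb/s}}{2.4~\text{kb/s}} \approx 26.7,
\qquad
2400~\frac{\text{bits}}{\text{s}} \times 0.020~\text{s} = 48~\text{bits per interval}
```

At 2.4 kb/s, then, every parameter of a 20 msec interval (prediction coefficients, pitch, and voicing) must fit in roughly 48 bits.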

The article "Stochastic Coding of Speech Signals at Very Low Bit Rates" by Bishnu S. Atal and Manfred Schroeder, appearing in the Proceedings of the International Conference on Communications (ICC'84), May 1984, pp. 1610-1613, discloses a stochastic model for generating speech excitation signals in which a speech waveform is represented as a zero-mean Gaussian stochastic process with a slowly varying power spectrum. The optimum Gaussian innovation sequence is obtained by comparing a speech waveform segment, typically 5 ms in duration, to synthetic speech waveforms derived from a plurality of random Gaussian innovation sequences. The innovation sequence that minimizes a perceptual error criterion is selected to represent the segment speech waveform. While the stochastic model described in this article results in low bit rate coding of the speech waveform excitation signal, a large number of innovation sequences is needed to provide an adequate selection. Selecting the best innovation sequence requires an exhaustive search, and at code bit rates corresponding to 4.8 kbit/s such a search is very time-consuming even on large, high-speed scientific computers. It is an object of the invention to provide improved speech coding and synthesis of high quality at lower bit rates utilizing arbitrary codes.

SUMMARY OF THE INVENTION

The foregoing object is realized by replacing the exhaustive search of innovation sequence stochastic or other arbitrary codes of a speech analyzer with an arrangement that converts the stochastic codes into transform domain code signals and generates a set of transform domain patterns from the transform codes for each time frame interval. The transform domain code patterns are compared to the transform of the time interval speech pattern obtained from the input speech to select the best matching stochastic code, and an index signal corresponding to the best matching stochastic code is output to represent the time frame interval speech. Transform domain processing reduces the complexity and the time required for code selection.

The index signal is applied to a decoder in which it is used to select a stochastic code stored therein. In a predictive speech synthesizer, the stochastic codes may represent the time frame speech pattern excitation signal, whereby the code bit rate is reduced to that required for the index signals and the prediction parameters of the time frame. The stochastic codes may be predetermined overlapping segments of a string of stochastic numbers to reduce storage requirements.
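A minimal sketch of the storage saving from overlapping segments, assuming codeword k begins k times SHIFT samples into a single stored string. The names `overlapping_codebook` and `SHIFT`, and the value SHIFT = 2, are hypothetical choices for illustration; this text does not give the offset. The codebook size (1024) and codeword length (40) follow the FIG. 1 discussion below.

```python
import random


def overlapping_codebook(string, length, shift):
    """Codeword k is string[k*shift : k*shift + length] (hypothetical layout)."""
    count = (len(string) - length) // shift + 1
    return [string[k * shift : k * shift + length] for k in range(count)]


CODEWORDS = 1024  # codebook size used in the FIG. 1 discussion
LENGTH = 40       # samples per codeword (one 5 msec frame)
SHIFT = 2         # hypothetical offset between consecutive codewords

rng = random.Random(0)
string = [rng.gauss(0.0, 1.0) for _ in range(SHIFT * (CODEWORDS - 1) + LENGTH)]
book = overlapping_codebook(string, LENGTH, SHIFT)

separate_storage = CODEWORDS * LENGTH  # numbers stored if codewords are independent
shared_storage = len(string)           # numbers stored for the single shared string
```

With these values the shared string holds 2086 numbers instead of 40,960, and each codeword reuses 38 samples of its predecessor.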

The invention is directed to an arrangement for processing a speech message in which a set of arbitrary value code signals, such as random numbers, is formed together with index signals identifying the arbitrary value code signals and signals representative of transforms of the arbitrary value codes. The speech message is partitioned into time frame interval speech patterns, and a first signal representative of the speech pattern of each successive time frame interval is formed responsive to the partitioned speech. A plurality of second signals representative of time frame interval patterns formed from the transform domain code signals are generated. One of said arbitrary code signals is selected for each time frame interval jointly responsive to the first signal and the second signals of the time frame interval, and the index signal corresponding to said selected transform signal is output.

According to one aspect of the invention, forming the first signal includes generating a third signal that is a transform domain signal corresponding to the current time frame interval speech pattern, and generating each second signal includes producing a fourth signal that is a transform domain signal corresponding to a time frame interval pattern responsive to said transform domain code signals. Arbitrary code selection comprises generating a signal representative of the similarities between said third and fourth signals and determining the index signal corresponding to the fourth signal having the maximum similarity signal.

According to another aspect of the invention, the transform domain code signals are frequency domain transform codes derived from the arbitrary codes.

According to yet another aspect of the invention, the transform domain code signals are Fourier transforms of the arbitrary codes.

According to yet another aspect of the invention, a speech message is formed from the arbitrary codes by receiving a sequence of said outputted index signals, each identifying a predetermined arbitrary code. Each index signal corresponds to a time frame interval speech pattern. The arbitrary codes are concatenated responsive to the sequence of said received index signals and the speech message is formed responsive to the concatenated codes.

According to yet another aspect of the invention, a speech message is formed using a string of arbitrary value coded signals having predetermined segments thereof identified by index signals. A sequence of signals identifying predetermined segments of said string are received. Each of said signals of the sequence corresponds to speech patterns of successive time frame intervals. The predetermined segments of said arbitrary valued code string are selected responsive to the sequence of received identifying signals and the selected arbitrary codes are concatenated to generate a replica of the speech message.

According to yet another aspect of the invention, the arbitrary value signal sequences of the string are overlapping sequences.
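On the decoding side, the concatenation described above can be sketched as a lookup followed by an append. This is a hypothetical illustration: `decode_excitation` and the toy codebook are not from the patent, and the per-frame gains stand in for the relative scaling signal γ that the encoder outputs alongside each index (claim 2). In a predictive synthesizer the concatenated excitation would then drive the pitch and LPC synthesis filters, which are not shown.

```python
def decode_excitation(indices, gains, codebook):
    """Concatenate gain-scaled codewords, one per received index (one per frame)."""
    if len(indices) != len(gains):
        raise ValueError("expected one gain per index")
    excitation = []
    for k, g in zip(indices, gains):
        # Look up the codeword named by the index and scale it by the frame gain
        excitation.extend(g * v for v in codebook[k])
    return excitation


# Toy 3-entry codebook with 2-sample codewords (hypothetical values)
codebook = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]
signal = decode_excitation([2, 0, 2], [1.0, 2.0, 0.5], codebook)
```

The same index may recur in later frames; only the index and gain travel over the channel, never the codeword samples themselves.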

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 depicts a speech encoder utilizing a prior art stochastic coding arrangement;

FIGS. 2 and 3 depict a general block diagram of a digital speech encoder using arbitrary codes and transform domain processing that is illustrative of the invention;

FIG. 4 depicts a detailed block diagram of a digital speech encoding signal processing arrangement that performs the functions of the circuit shown in FIGS. 2 and 3;

FIG. 5 shows a block diagram of an error and scale factor generating circuit useful in the arrangement of FIG. 3;

FIGS. 6-11 show flow chart diagrams that illustrate the operation of the circuit of FIG. 4; and

FIG. 12 shows a block diagram of a speech decoder circuit illustrative of the invention in which a string of random number codes form an overlapping sequence of stochastic codes.

GENERAL DESCRIPTION

FIG. 1 shows a prior art digital speech coder arranged to use stochastic codes for excitation signals. Referring to FIG. 1, a speech pattern applied to microphone 101 is converted therein to a speech signal, which is band-pass filtered and sampled in filter and sampler 105 as is well known in the art. The resulting samples are converted into digital codes by analog-to-digital converter 110 to produce the digitally coded speech signal s(n). Signal s(n) is processed in LPC and pitch predictive analyzer 115. The processing includes dividing the coded samples into successive speech frame intervals and producing a set of parameter signals corresponding to the signal s(n) in each successive frame. Parameter signals a(1), a(2), . . . , a(p) represent the short-delay correlation or spectral-related features of the interval speech pattern, and parameter signals β(1), β(2), β(3), and m represent the long-delay correlation or pitch-related features of the speech pattern. In this type of coder, the speech signal is partitioned into frames or blocks, e.g., 5 msec or 40 samples in duration. For such blocks, stochastic code store 120 may contain 1024 random white Gaussian codeword sequences, each comprising a series of 40 random numbers. Each codeword is scaled in scaler 125, prior to filtering, by a factor γ that is constant for the 5 msec block. The speech adaptation is done in recursive filters 135 and 145.
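The codebook dimensions in this paragraph translate directly into a bit budget. A minimal sketch, assuming Python's `random.gauss` as a stand-in for whatever generator filled store 120 (the seed and the function names are hypothetical): it builds 1024 codewords of 40 Gaussian samples, applies the frame gain as scaler 125 does, and computes the index rate.

```python
import math
import random

FRAME_MS = 5.0        # frame duration given in the text
FRAME_LEN = 40        # samples per frame (40 samples in 5 msec implies 8 kHz sampling)
CODEBOOK_SIZE = 1024  # number of stored codeword sequences


def make_gaussian_codebook(size, length, seed=0):
    """Zero-mean, unit-variance white Gaussian codewords; the seed is arbitrary."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]


def scale(codeword, gamma):
    """Scaler 125: multiply every sample of the codeword by the frame gain gamma."""
    return [gamma * v for v in codeword]


codebook = make_gaussian_codebook(CODEBOOK_SIZE, FRAME_LEN)
index_bits = math.ceil(math.log2(CODEBOOK_SIZE))  # bits needed to name one codeword
index_rate = index_bits * 1000.0 / FRAME_MS       # index bits per second
```

Ten bits per 5 msec frame gives 2000 b/s for the indices alone; the gain γ and the predictor parameters are sent in addition.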

Filter 135 uses a predictor with large memory (2 to 15 msec) to introduce voice periodicity, and filter 145 uses a predictor with short memory (less than 2 msec) to introduce the spectral envelope into the synthetic speech signal. Such filters are described in the article "Predictive coding of speech at low bit rates" by B. S. Atal appearing in the IEEE Transactions on Communications, Vol. COM-30, pp. 600-614, April 1982. The error representing the difference between the original speech signal s(n) applied to differencer 150 and the synthetic speech signal ŝ(n) applied from filter 145 is further processed by linear filter 155 to attenuate those frequency components where the error is perceptually less important and amplify those frequency components where the error is perceptually more important. The stochastic code sequence from store 120 that produces the minimum mean-squared subjective error signal E(k), and the corresponding optimum scale factor γ, are selected by peak picker 170 only after all 1024 codeword sequences in store 120 have been processed.
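Filter 135 is driven by the long-delay parameters β(1), β(2), β(3), and m. The sketch below assumes a three-tap form centered on the delay m, y(n) = u(n) + β(1)y(n-m+1) + β(2)y(n-m) + β(3)y(n-m-1); the tap arrangement and the function name `pitch_synthesis` are assumptions for illustration, not taken from this text.

```python
def pitch_synthesis(excitation, betas, m, memory):
    """Long-delay (pitch) synthesis filter in an assumed three-tap form:

        y(n) = u(n) + b1*y(n-m+1) + b2*y(n-m) + b3*y(n-m-1)

    `memory` holds past outputs, oldest first; it needs at least m + 1 samples
    so that the oldest tap of the first new sample is available.
    """
    if m < 2:
        raise ValueError("m must be at least 2 so every tap refers to a past output")
    if len(memory) < m + 1:
        raise ValueError("memory must hold at least m + 1 past outputs")
    b1, b2, b3 = betas
    y = list(memory)
    start = len(y)
    for u in excitation:
        n = len(y)  # index of the sample being produced
        y.append(u + b1 * y[n - m + 1] + b2 * y[n - m] + b3 * y[n - m - 1])
    return y[start:]


# Hypothetical example: one past pulse, only the center tap active, period m = 4
out = pitch_synthesis([0.0] * 8, (0.0, 1.0, 0.0), 4, [0.0, 0.0, 0.0, 0.0, 1.0])
```

With only the center tap set to 1, a single pulse in the filter memory reappears every m = 4 samples, which is how such a filter imposes voice periodicity on the excitation.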

For purposes of analyzing the codeword processing of the circuit of FIG. 1, synthesis filters 135 and 145 and perceptual weighting filter 155 can be combined into one linear filter. The impulse response of this equivalent filter may be represented by the sequence f(n). Only a part of the equivalent filter output is determined by its input in the current 5 msec frame since, as is well known in the art, a portion of the filter output corresponds to signals carried over from preceding frames. The filter memory from the previous frames plays no role in the search for the optimum innovation sequence in the present frame. The contributions of the previous memory to the filter output in the present frame can thus be subtracted from the speech signal in determining the optimum codeword from stochastic code store 120. The residual after subtracting the contributions of the filter memory carried over from the previous frames may be represented by the signal x(n). The filter output contributed by the kth codeword from store 120 in the present frame is

$$x^{(k)}(n) = \gamma(k) \sum_{i=1}^{n} f(n-i)\, c^{(k)}(i), \qquad n = 1, 2, \ldots, N, \qquad (1)$$

where $c^{(k)}(i)$ is the ith sample of the kth codeword and N is the number of samples in the frame. Equation 1 can be rewritten in matrix notation as

$$x^{(k)} = \gamma(k)\, F c^{(k)}, \qquad (2)$$

where F is an N×N matrix whose term in the nth row and ith column is f(n-i), taken as zero for i > n. The total squared error E(k), representing the difference between x(n) and $x^{(k)}(n)$, is given by

$$E(k) = \lVert x - \gamma(k)\, F c^{(k)} \rVert^2, \qquad (3)$$

where the vector x represents the signal x(n) in vector notation and $\lVert \cdot \rVert^2$ denotes the sum of the squares of the vector components. The optimum scale factor γ(k) that minimizes E(k) is found by setting $\partial E(k) / \partial \gamma(k) = 0$, which gives

$$\gamma(k) = \frac{x^{t} F c^{(k)}}{\lVert F c^{(k)} \rVert^2} \qquad (4)$$

and

$$E(k) = \lVert x \rVert^2 - \frac{\left( x^{t} F c^{(k)} \right)^2}{\lVert F c^{(k)} \rVert^2}. \qquad (5)$$

The optimum codeword is obtained by finding the minimum of E(k) or, equivalently, the maximum of the second term on the right side of equation 5.
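The exhaustive search performed by the FIG. 1 coder can be written out directly in the time domain. The following minimal sketch builds the lower-triangular matrix F from f(n), filters every codeword, and keeps the one with the smallest error using the closed forms for the optimum gain and residual error. The helper names, the decaying impulse response f, and the 16-entry codebook are hypothetical test data, not values from the patent.

```python
import random


def filter_matrix(f, n):
    """N x N lower-triangular matrix with entry f(r - c) in row r, column c."""
    return [[f[r - c] if r >= c else 0.0 for c in range(n)] for r in range(n)]


def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def exhaustive_search(x, f, codebook):
    """Return (k, gamma, error) for the codeword minimizing ||x - gamma F c||^2."""
    F = filter_matrix(f, len(x))
    energy_x = dot(x, x)
    best = None
    for k, c in enumerate(codebook):
        y = matvec(F, c)      # F c(k): the filtered codeword
        cross = dot(x, y)     # x^t F c(k)
        energy_y = dot(y, y)  # ||F c(k)||^2
        if energy_y == 0.0:
            continue          # an all-zero filtered codeword cannot match
        gamma = cross / energy_y                     # optimum gain for this codeword
        error = energy_x - cross * cross / energy_y  # residual squared error
        if best is None or error < best[2]:
            best = (k, gamma, error)
    return best


# Hypothetical data: decaying impulse response and a 16-entry Gaussian codebook
N = 40
rng = random.Random(1)
f = [0.9 ** i for i in range(N)]
codebook = [[rng.gauss(0.0, 1.0) for _ in range(N)] for _ in range(16)]
x = [2.0 * v for v in matvec(filter_matrix(f, N), codebook[5])]  # target from codeword 5
k, gamma, error = exhaustive_search(x, f, codebook)
```

Forming each F c(k) with a lower-triangular F takes N(N+1)/2 = 820 multiply-adds for N = 40, so a 1024-entry codebook needs about 840,000 multiply-adds per 5 msec frame (roughly 1.7 × 10^8 per second) before the correlations are counted. That cost is what the following paragraphs set out to reduce.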

While the signal processing described with respect to FIG. 1 is relatively straight forward, the generation of the 1024 error signals E(k) of equation 5 is a time consuming operation that cannot be accomplished in real time in currently known high speed, large scale computers. The complexity of the search processing in FIG. 1 is due to the presence of the convolution operation represented by the matrix F in the error. The complexity is substantially reduced if the matrix F is replaced by a diagonal matrix. This is accomplished by representing the matrix F in the orthogonal form using singular-value decomposition as described in "Introduction to Matrix Computations" by G. W. Stewart, Academic Press, pp. 317-320, 1973. Assume that

$$F = U D V^t, \qquad (6)$$

where $U$ and $V$ are orthogonal matrices, $D$ is a diagonal matrix with positive elements, and $V^t$ indicates the transpose of $V$. Because $U$ is orthogonal, multiplying a vector by $U^t$ does not change its norm, so equation 3 can be written as

$$E(k) = \left\| U^t \left( \mathbf{x} - \gamma(k)\, F\, \mathbf{c}^{(k)} \right) \right\|^2. \qquad (7)$$

If we now replace F by its orthogonal form as expressed in equation 6, we obtain

$$E(k) = \left\| U^t \mathbf{x} - \gamma(k)\, D V^t \mathbf{c}^{(k)} \right\|^2. \qquad (8)$$

On substituting

$$\mathbf{z} = U^t \mathbf{x} \quad \text{and} \quad \mathbf{b}^{(k)} = V^t \mathbf{c}^{(k)} \qquad (9)$$

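in equation 8, we obtain

$$E(k) = \sum_{i=1}^{N} \left( z(i) - \gamma(k)\, d(i)\, b^{(k)}(i) \right)^2, \qquad (10)$$

where $d(i)$ is the $i$th diagonal element of $D$, and $z(i)$ and $b^{(k)}(i)$ are the $i$th components of $\mathbf{z}$ and $\mathbf{b}^{(k)}$. As before, the optimum $\gamma(k)$ that minimizes $E(k)$ can be determined by setting $\partial E(k)/\partial \gamma(k) = 0$, and equation 10 simplifies to

$$E(k) = \sum_{i=1}^{N} z^2(i) - \frac{\left( \sum_{i=1}^{N} z(i)\, d(i)\, b^{(k)}(i) \right)^2}{\sum_{i=1}^{N} d^2(i) \left( b^{(k)}(i) \right)^2}. \qquad (11)$$

The error signal expressed in equation 11 can be computed much faster than the expression in equation 5, since the convolution matrix $F$ has been replaced by the diagonal matrix $D$. If $F\mathbf{c}^{(k)}$ is processed in a recursive filter of order $p$ (typically 20), processing according to equation 11 can substantially reduce the processing time requirements for stochastic coding.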
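A quick way to convince oneself that equation 11 selects the same codeword as equation 5 is to evaluate both numerically. The NumPy sketch below does that; `numpy.linalg.svd` returns U, the singular values d(i), and V^t directly. The function names and the random test data are hypothetical, and because the sketch still forms V^t c(k) with a full matrix product, it illustrates the algebra of equations 6 to 11 rather than the operation count of a real-time implementation.

```python
import numpy as np

def convolution_matrix(f, N):
    # F[n, i] = f(n - i) for i <= n, else 0
    return np.array([[f[n - i] if n >= i else 0.0 for i in range(N)] for n in range(N)])

def search_svd(x, F, codebook):
    """Evaluate equation 11 for every row c(k) of codebook; return (k, gamma, E)."""
    U, d, Vt = np.linalg.svd(F)              # F = U diag(d) V^t (equation 6)
    z = U.T @ x                              # z = U^t x (equation 9)
    B = codebook @ Vt.T                      # row k holds b(k) = V^t c(k) (equation 9)
    dB = B * d                               # d(i) b_i(k), broadcast over all codewords
    num = dB @ z                             # sum_i z(i) d(i) b_i(k)
    den = np.sum(dB * dB, axis=1)            # sum_i d(i)^2 b_i(k)^2
    err = z @ z - num ** 2 / den             # equation 11, for all k at once
    k = int(np.argmin(err))
    return k, num[k] / den[k], err[k]        # num / den is the optimum gamma(k)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N = 40
    F = convolution_matrix(0.9 ** np.arange(N), N)   # toy impulse response (assumption)
    codebook = rng.standard_normal((128, N))
    x = 0.5 * F @ codebook[33] + 0.01 * rng.standard_normal(N)
    print(search_svd(x, F, codebook)[:2])
```

The term z^t z equals x^t x because U is orthogonal, so it is the same constant for every codeword and only the second term of equation 11 affects the ranking.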

Alternatively, the processing time may be reduced by carrying the operations of equation 5 from the time domain into a transform domain such as the frequency domain. If the combined impulse response of the synthesis filter (with the long-delay prediction excluded) and the perceptual weighting filter is represented by the sequence $h(n)$, the filter output contributed by the $k$th codeword in the present frame can be expressed as a convolution between its input $\gamma(k)c^{(k)}(n)$ and the impulse response $h(n)$. The filter output is given by

$$x^{(k)}(n) = \gamma(k)\, h(n) * c^{(k)}(n). \qquad (12)$$

The filter output can be expressed in the frequency domain as

$$X^{(k)}(i) = \gamma(k)\, H(i)\, C^{(k)}(i), \qquad (13)$$

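where $X^{(k)}(i)$, $H(i)$ and $C^{(k)}(i)$ are the discrete Fourier transforms (DFTs) of $x^{(k)}(n)$, $h(n)$ and $c^{(k)}(n)$, respectively. In practice, the filter output can be considered to be limited to a 10 msec time interval and zero outside it. Thus an 80-point DFT (10 msec at the 8 kHz sampling rate) is sufficiently accurate for expressing equation 13. The total squared error $E(k)$ is expressed in frequency-domain notation as

$$E(k) = \sum_{i} \left| X(i) - \gamma(k)\, H(i)\, C^{(k)}(i) \right|^2, \qquad (14)$$

where $X(i)$ is the DFT of $x(n)$ and the sum runs over the DFT points; by Parseval's relation this differs from the time-domain error only by a constant factor, which does not affect the choice of codeword. If we now express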

$$H(i) = d(i)\, e^{j\phi_i} \qquad (15)$$

and

$$\xi(i) = X(i)\, e^{-j\phi_i}, \qquad (16)$$

equation 14 is then transformed to

$$E(k) = \sum_{i} \left| \xi(i) - \gamma(k)\, d(i)\, C^{(k)}(i) \right|^2. \qquad (17)$$

Again, the scale factor $\gamma(k)$ can be eliminated from equation 17, and the total error can be expressed as

$$E(k) = \sum_{i} \left| \xi(i) \right|^2 - \frac{\left( \sum_{i} \xi^*(i)\, d(i)\, C^{(k)}(i) \right)^2}{\sum_{i} d^2(i) \left| C^{(k)}(i) \right|^2}, \qquad (18)$$

where $\xi^*(i)$ is the complex conjugate of $\xi(i)$; because $x(n)$, $h(n)$ and $c^{(k)}(n)$ are real, the sum in the numerator is real when it is taken over all DFT points. The frequency-domain search has the advantage that the singular-value decomposition of the matrix $F$ is replaced by fast Fourier transforms, whereby the overall processing complexity is significantly reduced. In the transform domain, using either singular-value decomposition or discrete Fourier transform processing, further savings in the computational load can be achieved by restricting the search to a subset of frequencies (or eigenvectors) corresponding to large values of $d(i)$ (or $b(i)$). According to the invention, the processing is reduced enough that real-time operation with microprocessor integrated circuits becomes realizable. This is accomplished by replacing the time-domain processing of FIG. 1, which generates the error between the input speech signal and the synthetic speech signal formed from the innovation code, with the transform-domain processing described above.
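The frequency-domain search of equations 13 to 18 can be sketched with NumPy's `numpy.fft.fft`. This is a minimal illustration under stated assumptions: x, h and the codewords are real arrays; the DFT length M = 80 is large enough that circular convolution equals the linear convolution of equation 12 (len(h) + N - 1 <= M); and the error is scaled by 1/M so that, by Parseval's relation, it equals the time-domain squared error over the 80-sample window. The `n_bins` parameter is a hypothetical illustration of restricting the search to the frequencies with the largest d(i); with it, the returned error is only an approximation used for ranking.

```python
import numpy as np

def search_dft(x, h, codebook, M=80, n_bins=None):
    """Evaluate equation 18 for every row c(k) of codebook; return (k, gamma, E)."""
    X = np.fft.fft(x, M)                     # X(i), x zero-padded to M points
    H = np.fft.fft(h, M)                     # H(i) = d(i) exp(j phi_i) (equation 15)
    d = np.abs(H)                            # d(i)
    xi = X * np.exp(-1j * np.angle(H))       # xi(i) = X(i) exp(-j phi_i) (equation 16)
    C = np.fft.fft(codebook, M, axis=1)      # C(k)(i), one row per codeword
    if n_bins is not None:                   # optional: keep only the bins with largest d(i)
        keep = np.argsort(d)[-n_bins:]
        d, xi, C = d[keep], xi[keep], C[:, keep]
    dC = C * d                               # d(i) C(k)(i)
    num = np.real(dC @ np.conj(xi))          # sum_i xi*(i) d(i) C(k)(i)
    den = np.sum(np.abs(dC) ** 2, axis=1)    # sum_i d(i)^2 |C(k)(i)|^2
    err = (np.sum(np.abs(xi) ** 2) - num ** 2 / den) / M   # equation 18, scaled by 1/M
    k = int(np.argmin(err))
    return k, num[k] / den[k], err[k]        # num / den is the optimum gamma(k)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    N = 40                                   # 5 msec of excitation at 8 kHz
    h = 0.85 ** np.arange(N)                 # toy impulse response (assumption)
    codebook = rng.standard_normal((64, N))
    x = 1.3 * np.convolve(h, codebook[5]) + 0.01 * rng.standard_normal(2 * N - 1)
    print(search_dft(x, h, codebook)[:2])
```

Here the transforms of the codewords do not depend on the frame and could be computed once in advance, so per frame only x and h need new DFTs; the per-codeword work is then a weighted inner product over the DFT bins.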

DETAILED DESCRIPTION

A transform-domain digital speech encoder illustrative of the invention, which uses arbitrary codes as excitation signals, is shown in FIGS. 2 and 3. The arbitrary codes may take the form of random number sequences or may, for example, be varied sequences of +1 and -1 in any order. Any arrangement of varied sequences may be used, with the broad restriction that the overall average of the sequences is small. Referring to FIG. 2, a speech pattern such as a spoken message received by microphone transducer 201 is bandlimited and converted into a sequence of pulse samples in filter and sampler circuit 203 and supplied to linear prediction coefficient (LPC) analyzer 209 via analog-to-digital converter 205. The filtering may be arranged to remove frequency components of the speech signal above 4.0 kHz, and the sampling may be at an 8.0 kHz rate, as is well known in the art. Each sample from circuit 203 is transformed into an amplitude-representative digital code in the analog-to-digital converter. The sequence of digitally coded speech samples is supplied to LPC analyzer 209, which is operative, as is well known in the art, to partition the speech signals into 5 to 20 ms time frame intervals and to generate a set of linear prediction coefficient signals $a(k)$, $k = 1, 2, \ldots, p$, representative of the predicted short-time spectrum of the speech samples of each frame. The analyzer also forms a set of perceptually weighted linear predictive coefficient signals
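The text leaves the LPC analysis to methods "well known in the art". One common choice, assumed here rather than taken from the patent, is the autocorrelation method solved with the Levinson-Durbin recursion. The pure-Python sketch below partitions the samples into frames and computes predictor coefficients a(1), ..., a(p) under the convention that s(n) is predicted by the sum of a(k) s(n-k); windowing and the toy AR(2) test signal are omitted or assumed for brevity.

```python
import random

def split_frames(samples, frame_len):
    """Partition samples into consecutive, non-overlapping frames; a short tail is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def lpc_autocorrelation(frame, p):
    """Return a(1), ..., a(p) so that s(n) is predicted by sum_k a(k) s(n-k).

    Autocorrelation method with the Levinson-Durbin recursion (an assumed,
    common choice; the text only says the analysis is well known in the art).
    """
    N = len(frame)
    r = [sum(frame[n] * frame[n - k] for n in range(k, N)) for k in range(p + 1)]
    a = [0.0] * (p + 1)                    # a[k] multiplies s(n-k); a[0] is unused
    err = r[0]                             # prediction error energy of order 0
    for i in range(1, p + 1):
        if err <= 0.0:                     # silent or degenerate frame: stop early
            break
        refl = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = refl                        # reflection coefficient of order i
        for j in range(1, i):
            a[j] = prev[j] - refl * prev[i - j]
        err *= 1.0 - refl * refl           # error energy of order i
    return a[1:]

if __name__ == "__main__":
    rng = random.Random(0)
    s = [0.0, 0.0]
    for _ in range(8000):                  # toy AR(2) signal standing in for speech
        s.append(1.3 * s[-1] - 0.6 * s[-2] + rng.gauss(0.0, 1.0))
    frames = split_frames(s[2:], 40)       # 40 samples = 5 ms at 8 kHz
    print(len(frames), lpc_autocorrelation(s[2:], 2))
```

The recursion solves the p normal equations in O(p^2) operations per frame, which is why it is a usual companion of the autocorrelation method.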

b(k)=ka(k),

k=1, 2, . . . , p, (19)

where p is the number of the prediction coefficients.

The speech samples from A/D converter 205 are delayed in delay 207 to allow time for the formation of speech parameter signals $a(k)$, and the delayed samples are supplied to the input of prediction residual generator 211. The prediction residual generator, as is well known in the art, is responsive to the delayed speech samples $s(n)$ and the prediction parameters $a(k)$ to form a signal $\partial(n)$ corresponding to the differences between the speech samples and their predicted values. The formation of the predictive parameters and the prediction residual signal for each frame in predictive analyzer 209 may be performed according to the arrangement disclosed in U.S. Pat. No. 3,740,476 issued to B. S. Atal, June 19, 1973, and assigned to the same assignee, or in other arrangements well known in the art.

Prediction residual signal generator 211 is operative to subtract the predictable portion of the frame signal from the sample signals $s(n)$ to form signal $\partial(n)$ in accordance with

$$\partial(n) = s(n) - \sum_{k=1}^{p} a(k)\, s(n-k), \qquad n = 1, 2, \ldots, N, \qquad (20)$$

where $p$, the number of predictive coefficients, may be 12; $N$, the number of samples in a speech frame, may be 40; and $a(k)$ are the predictive coefficients of the frame. Predictive residual signal $\partial(n)$ corresponds to the speech signal of the frame with the short-term redundancies removed. Longer-term redundancy, on the order of several speech frames, remains in the predictive residual signal, and predictive parameters $\beta(1)$, $\beta(2)$, $\beta(3)$ and $m$ corresponding to this longer-term redundancy are generated in predictive pitch analyzer 220 such that $m$ is an integer that maximizes ##EQU9## as described in U.S. Pat. No. 4,354,057 issued to B. S. Atal et al on Jan. 9, 1979. As is well known, digital speech encoders may be formed by encoding the predictive parameters of each successive frame and the frame predictive residual for transmission to decoder apparatus or for storage for later retrieval. While the bit rate for encoding the predictive parameters is relatively low, the non-redundant nature of the residual requires a very high bit rate. According to the invention, an optimum arbitrary code ##EQU10## is selected to represent the frame excitation, and a signal $K^*$ that indexes the selected arbitrary excitation code is transmitted. In this way, the speech code bit rate is minimized without adversely affecting intelligibility. The arbitrary code is selected in the transform domain to reduce the selection processing so that it can be performed in real time with microprocessor components.
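Equation 20 is a short FIR filtering of the frame. The pure-Python sketch below computes the residual (written ∂(n) in the text) for one frame; the `history` parameter is a hypothetical convenience that carries the p samples preceding the frame, so that frame-by-frame processing gives the same result as filtering the whole signal at once.

```python
def prediction_residual(s, a, history=None):
    """Equation 20: residual(n) = s(n) - sum_{k=1}^{p} a(k) s(n-k) for one frame.

    history holds the p samples preceding the frame, oldest first (zeros if omitted).
    """
    p = len(a)
    if history is None:
        history = [0.0] * p
    if len(history) != p:
        raise ValueError("history must contain exactly p samples")
    buf = list(history) + list(s)          # buf[p + n] is s(n)
    return [buf[p + n] - sum(a[k - 1] * buf[p + n - k] for k in range(1, p + 1))
            for n in range(len(s))]

if __name__ == "__main__":
    print(prediction_residual([1.0, 2.0, 3.0], [0.5]))   # [1.0, 1.5, 2.0]
```

With p = 12 and N = 40 as suggested in the text, each frame costs 480 multiply-adds for this step.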

Selection of the arbitrary code for excitation includes combining the predictive residual with the perceptually weighted linear predictive parameters of the frame to generate a signal $y(n)$. Speech pattern signal $y(n)$, which corresponds to the perceptually weighted speech signal, contains a component $y(n)$ due to the preceding frames. This preceding-frame component is removed prior to the selection processing so that the stored arbitrary codes are in effect compared only to the current frame excitation. Signal $y(n)$ is formed in predictive filter 217 responsive to the perceptually weighted predictive parameter and predictive residual signals of the frame as per the relation ##EQU11## and is stored in y(n) store 227.
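Because the weighting filter is linear, the preceding-frame component is its zero-input response: the output produced by the filter memory alone when the current input is zero. The pure-Python sketch below assumes, since relation ##EQU11## is not recoverable here, an all-pole recursion y(n) = ∂(n) + Σ b(k) y(n-k) with the perceptually weighted coefficients b(k). The helper names are hypothetical. Subtracting the zero-input response leaves exactly the output a zero-state filter would produce from the current frame alone.

```python
def all_pole(x, b, memory):
    """y(n) = x(n) + sum_{k=1}^{p} b(k) y(n-k); memory holds the last p outputs, oldest first."""
    p = len(b)
    if len(memory) != p:
        raise ValueError("memory must contain exactly p past outputs")
    hist = list(memory)
    out = []
    for v in x:
        y = v + sum(b[k - 1] * hist[-k] for k in range(1, p + 1))   # hist[-k] is y(n-k)
        hist.append(y)
        out.append(y)
    return out, hist[len(hist) - p:]       # outputs and the memory for the next frame

def current_frame_component(residual, b, memory):
    """Weighted signal minus the preceding-frame (zero-input) contribution."""
    y, _ = all_pole(residual, b, memory)                     # full weighted signal y(n)
    y_prev, _ = all_pole([0.0] * len(residual), b, memory)   # memory contribution only
    return [u - v for u, v in zip(y, y_prev)]

if __name__ == "__main__":
    b = [0.6, -0.2]                        # toy weighted coefficients (assumption)
    print(current_frame_component([1.0, 0.0, -0.5], b, [0.3, -1.0]))
```

The same subtraction of the memory contribution is what produces the target x(n) in the analysis of FIG. 1.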

The preceding frame speech contribution signal y(n) is generated in preceding frame contribution signal generator 222 from the perceptually weighted predictive parameter signal b(k) of the current frame, t