WikiPatents - Community Patent Review
Mode-specific method and apparatus for encoding signals containing speech    
United States Patent 5596676
Link to this page: http://www.wikipatents.com/5596676.html
Inventor(s): Swaminathan; Kumar (Gaithersburg, MD); Ganesan; Kalyan (Germantown, MD); Gupta; Prabhat K. (Germantown, MD)
Abstract: A method for encoding a signal that includes a speech component is described. First and second linear prediction windows of a frame are analyzed to generate sets of filter coefficients. First and second pitch analysis windows of the frame are analyzed to generate pitch estimates. The frame is classified in one of at least two modes, e.g. voiced, unvoiced and noise modes, based, for example, on pitch stationarity, short-term level gradient or zero crossing rate. Then the frame is encoded using the filter coefficients and pitch estimates in a particular manner depending upon the mode determination for the frame, preferably employing CELP based encoding algorithms.
   














Inventor     Swaminathan; Kumar (Gaithersburg, MD); Ganesan; Kalyan (Germantown, MD); Gupta; Prabhat K. (Germantown, MD)
Owner/Assignee     Hughes Electronics (Los Angeles, CA)
Publication Date     January 21, 1997
Application Number     08/540,637
Filing Date     October 11, 1995
US Classification     704/208 704/210 704/219 704/262 704/268
Int'l Classification     G10L 009/12 G10L 009/14
Examiner     MacDonald; Allen R.
Assistant Examiner     Grover; John Michael
Attorney/Law Firm     Lindeen, III; Gordon R.; Denson-Low; Wanda K.
Parent Case     BACKGROUND OF THE INVENTION This is a division of application Ser. No. 08/229,271 filed Apr. 18, 1994, which is a continuation-in-part of prior application Ser. No. 08/227,881 filed Apr. 15, 1994, of Kumar Swaminathan, Kalyan Ganesan, and Prabhat K. Gupta for METHOD OF ENCODING A SIGNAL CONTAINING SPEECH, which is a continuation-in-part of prior application Ser. No. 07/905,992, filed Jun. 25, 1992, of Kumar Swaminathan for HIGH QUALITY LOW BIT RATE CELP-BASED SPEECH CODEC, issued as U.S. Pat. No. 5,495,555, which is a continuation-in-part application under 37 C.F.R. § 1.162 of prior application Ser. No. 07/891,596, filed Jun. 1, 1992, of Kumar Swaminathan for CELP EXCITATION ANALYSIS FOR VOICED SPEECH (abandoned). The contents of patent application Ser. No. 07/905,992 entitled "HIGH QUALITY LOW BIT RATE CELP-BASED SPEECH CODEC" are hereby incorporated by reference.
Priority Data    
USPTO Field of Search     395/2.17 395/2.19 395/2.28 395/2.32 395/2.71 395/2.77
References
U.S. References
Patent No.   Inventor      Classification   Date
5495555      Swaminathan   704/207          Feb. 1996
5459814      Gupta         704/233          Oct. 1995
4771465      Bronson       704/207          Sep. 1988
4058676      Wilkes        704/220          Nov. 1977
Claims
 


What is claimed is:

1. A method of encoding a signal having a speech component, the signal being organized as a plurality of frames, the method comprising the steps, performed for each frame, of:

analyzing a first linear prediction window to generate a first set of filter coefficients for a frame;

analyzing a second linear prediction window to generate a second set of filter coefficients for the frame;

analyzing a first pitch analysis window to generate a first pitch estimate for the frame;

analyzing a second pitch analysis window to generate a second pitch estimate for the frame;

determining whether the frame is one of a first mode, a second mode and a third mode, depending on measures of energy content of the frame and spectral content of the frame;

encoding the frame, depending on the second set of filter coefficients and the first and the second pitch estimates, independently of the first set of filter coefficients, when the frame is determined to be the third mode;

encoding the frame, depending on the first and the second sets of filter coefficients, independently of the first and the second pitch estimates, when the frame is determined to be the second mode; and

encoding the frame, depending on the second set of filter coefficients, independently of the first set of filter coefficients and the first and the second pitch estimates, when the frame is determined to be the first mode.

2. The method of claim 1, wherein the determining step includes the substep of:

determining a mode depending on a determined mode of a previous frame.

3. The method of claim 1 wherein the determining step includes the substep of:

determining the mode to be the first mode only when the determined mode of a previous frame is either the first mode or the second mode.

4. The method of claim 1, wherein the determining step includes the substep of:

determining the mode to be the third mode only when the determined mode of a previous frame is either the third mode or the second mode.

5. The method of claim 1 wherein the determining step further depends on measures of pitch stationarity between the frame and a previous frame.

6. The method of claim 1 wherein the determining step further depends on measures of short-term level gradient within the frame.

7. The method of claim 1 wherein the determining step further depends on measures of a zero-crossing rate within the frame.

8. The encoding method of claim 1, wherein the first linear prediction window is contained within the frame and the second linear prediction window begins during the frame and extends into the next frame.

9. The encoding method of claim 1, wherein the first pitch estimate window is contained within the frame and the second pitch estimate window begins during the frame and extends into the next frame.

10. The encoding method of claim 1, wherein a frame determined to be of a third mode contains a signal with a speech component composed of primarily voiced speech.

11. The encoding method of claim 1, wherein a frame determined to be of a second mode contains a signal with a speech component composed of primarily unvoiced speech.

12. The encoding method of claim 1, wherein a frame determined to be of a first mode contains a signal with a low speech component.

13. An encoder for encoding a signal having a speech component, the signal being organized as a plurality of frames, comprising:

a filter coefficient generator for analyzing a first linear prediction window to generate a first set of filter coefficients for a frame and for analyzing a second linear prediction window to generate a second set of filter coefficients for the frame;

a pitch estimator for analyzing a first pitch analysis window to generate a first pitch estimate for the frame and analyzing a second pitch analysis window to generate a second pitch estimate for the frame;

a mode determinator for determining whether the frame is one of a first mode, a second mode and a third mode, depending on measures of energy content of the frame and spectral content of the frame; and

a frame encoder for encoding the frame depending on the determined mode of the frame, wherein

a frame determined to be of a third mode is encoded depending on the second set of filter coefficients and the first and the second pitch estimates, independently of the first set of filter coefficients,

a frame determined to be of a second mode is encoded depending on the first and the second sets of filter coefficients, independently of the first and the second pitch estimates, and

a frame determined to be of a first mode is encoded depending on the second set of filter coefficients, independently of the first set of filter coefficients and the first and the second pitch estimates.

14. The encoder of claim 13, wherein the mode determinator determines the mode depending on a determined mode of a previous frame.

15. The encoder of claim 13, wherein the mode determinator determines the frame to be of the first mode only when the determined mode of a previous frame is either the first mode or the second mode.

16. The encoder of claim 13, wherein the mode determinator determines the frame to be of the third mode only when the determined mode of a previous frame is either the third mode or the second mode.

17. The encoder of claim 13 wherein the mode determinator further depends on measures of pitch stationarity between the frame and a previous frame.

18. The encoder of claim 13 wherein the mode determinator further depends on measures of short-term level gradient within the frame.

19. The encoder of claim 13 wherein the mode determinator further depends on measures of a zero-crossing rate within the frame.

20. The encoder of claim 13, wherein the first linear prediction window is contained within the frame and the second linear prediction window begins during the frame and extends into the next frame.

21. The encoder of claim 13, wherein the first pitch estimate window is contained within the frame and the second pitch estimate window begins during the frame and extends into the next frame.

22. The encoder of claim 13, wherein a frame determined to be of a third mode contains a signal with a speech component composed of primarily voiced speech.

23. The encoder of claim 13, wherein a frame determined to be of a second mode contains a signal with a speech component composed of primarily unvoiced speech.

24. The encoder of claim 13, wherein a frame determined to be of a first mode contains a signal with a low speech component.
Description
 


FIELD OF THE INVENTION

The present invention generally relates to a method of encoding a signal containing speech and more particularly to a method employing a linear predictor to encode a signal.

DESCRIPTION OF THE RELATED ART

A modern communication technique employs a Codebook Excited Linear Prediction (CELP) coder. The codebook is essentially a table containing excitation vectors for processing by a linear predictive filter. The technique involves partitioning an input signal into multiple portions and, for each portion, searching the codebook for the vector that produces a filter output signal that is closest to the input signal.
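As an illustration only, the codebook search described above can be sketched as follows. This is a minimal sketch, not the coder of the preferred embodiment: the function names, the zero initial filter state, the per-vector gain, and the plain squared-error criterion are all assumptions made for the example (practical CELP coders typically add perceptual weighting, which is omitted here).

```python
def synthesize(excitation, lp_coeffs):
    """All-pole synthesis, y[n] = x[n] - sum_k a[k] * y[n-k], zero initial state.

    lp_coeffs holds a[1..p] of an assumed predictor A(z) = 1 + sum_k a[k] z^-k.
    """
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(lp_coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, lp_coeffs):
    """Return (index, gain, error) for the excitation vector whose filtered,
    gain-scaled output is closest to the target in squared error."""
    best = None
    for index, vector in enumerate(codebook):
        y = synthesize(vector, lp_coeffs)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue  # an all-zero output cannot match anything
        corr = sum(t * v for t, v in zip(target, y))
        gain = corr / energy  # least-squares optimal gain (an assumption)
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best
```

Each candidate vector is passed through the synthesis filter and compared with the input portion; the index of the closest match is what would be transmitted.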

The typical CELP technique may distort portions of the input signal dominated by noise because the codebook and the linear predictive filter that may be optimum for speech may be inappropriate for noise.

OBJECT AND SUMMARY OF THE INVENTION

It is an object of the present invention to provide a method of encoding a signal containing both speech and noise while avoiding some of the distortions introduced by typical CELP encoding techniques.

Additional objectives and advantages of the invention will be set forth in the description that follows and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.

To achieve the objects and in accordance with the purpose of the invention, as embodied and broadly described herein, a method of processing a signal having a speech component, the signal being organized as a plurality of frames, is used. The method comprises the steps, performed for each frame, of determining whether the frame corresponds to a first mode, depending on whether the speech component is substantially absent from the frame; generating an encoded frame in accordance with one of a first coding scheme, when the frame corresponds to the first mode, and a second coding scheme when the frame does not correspond to the first mode; and decoding the encoded frame in accordance with one of the first coding scheme, when the frame corresponds to the first mode, and the second coding scheme when the frame does not correspond to the first mode.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:

FIG. 1 is a block diagram of a transmitter in a wireless communication system according to a preferred embodiment of the invention;

FIG. 2 is a block diagram of a receiver in a wireless communication system according to the preferred embodiment of the invention;

FIG. 3 is a block diagram of the encoder in the transmitter shown in FIG. 1;

FIG. 4 is a block diagram of the decoder in the receiver shown in FIG. 2;

FIG. 5A is a timing diagram showing the alignment of linear prediction analysis windows in the encoder shown in FIG. 3;

FIG. 5B is a timing diagram showing the alignment of pitch prediction analysis windows for open loop pitch prediction in the encoder shown in FIG. 3;

FIGS. 6A and 6B show a flowchart illustrating the 26-bit line spectral frequency vector quantization process performed by the encoder of FIG. 3;

FIG. 7 is a flowchart illustrating the operation of a pitch tracking algorithm;

FIG. 8 is a block diagram showing in more detail the open loop pitch estimation of the encoder shown in FIG. 3;

FIG. 9 is a flowchart illustrating the operation of the modified pitch tracking algorithm implemented by the open loop pitch estimation shown in FIG. 8;

FIG. 10 is a flowchart showing the processing performed by the mode determination module shown in FIG. 3;

FIG. 11 is a dataflow diagram showing a part of the processing of a step of determining spectral stationarity values shown in FIG. 10;

FIG. 12 is a dataflow diagram showing another part of the processing of the step of determining spectral stationarity values;

FIG. 13 is a dataflow diagram showing another part of the processing of the step of determining spectral stationarity values;

FIG. 14 is a dataflow diagram showing the processing of the step of determining pitch stationarity values shown in FIG. 10;

FIG. 15 is a dataflow diagram showing the processing of the step of generating zero crossing rate values shown in FIG. 10;

FIGS. 16A, 16B and 16C illustrate a dataflow diagram showing the processing of the step of determining level gradient values in FIG. 10;

FIG. 17 is a dataflow diagram showing the processing of the step of determining short-term energy values shown in FIG. 10;

FIGS. 18A, 18B and 18C are a flowchart of determining the mode based on the generated values as shown in FIG. 10;

FIG. 19 is a block diagram showing in more detail the implementation of the excitation modeling circuitry of the encoder shown in FIG. 3;

FIG. 20 is a diagram illustrating a processing of the encoder shown in FIG. 3;

FIGS. 22A and 22B show a chart of speech coder parameters for mode A;

FIG. 23 is a chart of speech coder parameters for mode A;

FIG. 24 is a chart of speech coder parameters for mode A;

FIG. 25 is a block diagram illustrating a processing of the speech decoder shown in FIG. 4; and

FIG. 21 is a timing diagram showing an alternative alignment of linear prediction analysis windows.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

FIG. 1 shows the transmitter of the preferred communication system. Analog-to-digital (A/D) converter 11 samples analog speech from a telephone handset at an 8 kHz rate, converts it to digital values and supplies the digital values to the speech encoder 12. Channel encoder 13 further encodes the signal, as may be required in a digital cellular communications system, and supplies a resulting encoded bit stream to a modulator 14. Digital-to-analog (D/A) converter 15 converts the output of the modulator 14 to Phase Shift Keying (PSK) signals. Radio frequency (RF) up converter 16 amplifies and frequency multiplies the PSK signals and supplies the amplified signals to antenna 17.

A low-pass antialiasing filter (not shown) filters the analog speech signal input to A/D converter 11. A high-pass, second-order biquad filter (not shown) filters the digitized samples from A/D converter 11. The transfer function is: ##EQU1##

The high-pass filter attenuates any D.C. or hum contamination that may occur in the incoming speech signal.

FIG. 2 shows the receiver of the preferred communication system. RF down converter 22 receives a signal from antenna 21 and heterodynes the signal to an intermediate frequency (IF). A/D converter 23 converts the IF signal to a digital bit stream, and demodulator 24 demodulates the resulting bit stream. At this point the reverse of the encoding process in the transmitter takes place. Channel decoder 25 and speech decoder 26 perform decoding. D/A converter 27 synthesizes analog speech from the output of the speech decoder.

Much of the processing described in this specification is performed by a general purpose signal processor executing program statements. To facilitate a description of the preferred communication system, however, the preferred communication system is illustrated in terms of block and circuit diagrams. One of ordinary skill in the art could readily transcribe these diagrams into program statements for a processor.

FIG. 3 shows the encoder 12 of FIG. 1 in more detail, including an audio preprocessor 31, linear predictive (LP) analysis and quantization module 32, and open loop pitch estimation module 33. Module 34 analyzes each frame of the signal to determine whether the frame is mode A, mode B, or mode C, as described in more detail below. Module 35 performs excitation modelling depending on the mode determined by module 34. Processor 36 packs the compressed speech bits.

FIG. 4 shows the decoder 26 of FIG. 2, including a processor 41 for unpacking of compressed speech bits, module 42 for excitation signal reconstruction, filter 43, speech synthesis filter 44, and global post filter 45.

FIG. 5A shows linear prediction analysis windows. The preferred communication system employs 40 ms. speech frames. For each frame, module 32 performs LP (linear prediction) analysis on two 30 ms. windows that are spaced apart by 20 ms. The first LP window is centered at the middle, and the second LP window is centered at the leading edge of the speech frame such that the second LP window extends 15 ms. into the next frame. In other words, module 32 analyzes a first part of the frame (LP window 1) to generate a first set of filter coefficients and analyzes a second part of the frame and a part of a next frame (LP window 2) to generate a second set of filter coefficients.

FIG. 5B shows pitch analysis windows. For each frame, module 33 performs pitch analysis on two 37.625 ms. windows. The first pitch analysis window is centered at the middle, and the second pitch analysis window is centered at the leading edge of the speech frame such that the second pitch analysis window extends 18.8125 ms. into the next frame. In other words, module 33 analyzes a third part of the frame (pitch analysis window 1) to generate a first pitch estimate and analyzes a fourth part of the frame and a part of the next frame (pitch analysis window 2) to generate a second pitch estimate.
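The window alignments of FIGS. 5A and 5B can be checked with a short calculation. The sketch below assumes that time is measured in milliseconds from the start of the current frame and that the "leading edge" is the boundary with the next frame (40 ms), an interpretation chosen because it reproduces the stated 15 ms and 18.8125 ms overhangs. The function and constant names are hypothetical.

```python
SAMPLE_RATE_HZ = 8000   # A/D sampling rate given for FIG. 1
FRAME_MS = 40.0         # speech frame length


def window_bounds_ms(center_ms, length_ms):
    """Start and end of a window of the given length around its center."""
    half = length_ms / 2.0
    return (center_ms - half, center_ms + half)


def ms_to_samples(ms):
    """Convert a duration in milliseconds to a sample count at 8 kHz."""
    return ms * SAMPLE_RATE_HZ / 1000.0


def analysis_windows(length_ms):
    """Window 1 is centered mid-frame; window 2 is centered on the boundary
    with the next frame (assumed meaning of 'leading edge')."""
    first = window_bounds_ms(FRAME_MS / 2.0, length_ms)
    second = window_bounds_ms(FRAME_MS, length_ms)
    return first, second


lp_win1, lp_win2 = analysis_windows(30.0)          # LP windows, FIG. 5A
pitch_win1, pitch_win2 = analysis_windows(37.625)  # pitch windows, FIG. 5B
```

Under these assumptions the LP windows span 5 to 35 ms and 25 to 55 ms, so their centers are 20 ms apart, and a 37.625 ms pitch window corresponds to 301 samples at 8 kHz.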

Module 32 employs multiplication by a Hamming window followed by a tenth order autocorrelation method of LP analysis. With this method of LP analysis, module 32 obtains optimal filter coefficients and optimal reflection coefficients. In addition, the residual energy after LP analysis is also readily obtained and, when expressed as a fraction of the speech energy of the windowed LP analysis buffer, is denoted as α_1 for the first LP window and α_2 for the second LP window. These outputs of the LP analysis are used subsequently in the mode selection algorithm as measures of spectral stationarity, as described in more detail below.
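For readers unfamiliar with the autocorrelation method, the following is a minimal sketch of Hamming windowing followed by a Levinson-Durbin recursion, which yields filter coefficients, reflection coefficients, and the normalized residual energy (the quantity called α above) in one pass. The function names, the sign convention A(z) = 1 + sum a[k] z^-k, and the exact Hamming window formula are assumptions of the example; the specification does not give these details.

```python
import math


def hamming(n):
    """Symmetric Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(x, max_lag):
    """Autocorrelation lags r[0..max_lag] of a finite sequence."""
    return [sum(x[i] * x[i + lag] for i in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations; returns (a[1..p], reflection coeffs, residual energy)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    refl = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        k_m = -acc / err
        refl.append(k_m)
        new_a = a[:]
        for k in range(1, m):
            new_a[k] = a[k] + k_m * a[m - k]
        new_a[m] = k_m
        a = new_a
        err *= 1.0 - k_m * k_m  # residual energy shrinks at every order
    return a[1:], refl, err


def lp_analysis(samples, order=10):
    """Tenth order LP analysis; alpha is residual energy over windowed energy."""
    window = hamming(len(samples))
    windowed = [s * w for s, w in zip(samples, window)]
    r = autocorrelation(windowed, order)
    coeffs, refl, err = levinson_durbin(r, order)
    alpha = err / r[0]  # assumes a non-silent buffer, i.e. r[0] > 0
    return coeffs, refl, alpha
```

A value of α near 1 means the predictor removes little energy (noise-like input), while a small α indicates strongly predictable input.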

After LP analysis, module 32 bandwidth broadens the filter coefficients for the first LP window, and for the second LP window, by 25 Hz, converts the coefficients to ten line spectral frequencies (LSF), and quantizes these ten line spectral frequencies with a 26-bit LSF vector quantization (VQ), as described below.

Module 32 employs a 26-bit vector quantization (VQ) for each set of ten LSFs. This VQ provides good and robust performance across a wide range of handsets and speakers. Separate VQ codebooks are designed for "IRS filtered" and "flat unfiltered" ("non-IRS-filtered") speech material. The unquantized LSF vector is quantized by the "IRS filtered" VQ tables as well as the "flat unfiltered" VQ tables. The optimum classification is selected on the basis of the cepstral distortion measure. Within each classification, the vector quantization is carried out. Multiple candidates for each split vector are chosen on the basis of energy weighted mean square error, and an overall optimal selection is made within each classification on the basis of the cepstral distortion measure among all combinations of candidates. After the optimum classification is chosen, the quantized line spectral frequencies are converted to filter coefficients.

More specifically, module 32 quantizes the ten line spectral frequencies for both sets with a 26-bit multi-codebook split vector quantizer that classifies the unquantized line spectral frequency vector as a "voiced IRS-filtered," "unvoiced IRS-filtered," "voiced non-IRS-filtered," or "unvoiced non-IRS-filtered" vector, where "IRS" refers to the intermediate reference system filter as specified by CCITT, Blue Book, Rec. P.48.

FIGS. 6A and 6B show an outline of the LSF vector quantization process. Module 32 employs a split vector quantizer for each classification, including a 3-4-3 split vector quantizer for the "voiced IRS-filtered" and the "voiced non-IRS-filtered" categories 51 and 53. The first three LSFs use an 8-bit codebook in function modules 55 and 57, the next four LSFs use a 10-bit codebook in function modules 59 and 61, and the last three LSFs use a 6-bit codebook in function modules 63 and 65. For the "unvoiced IRS-filtered" and the "unvoiced non-IRS-filtered" categories 52 and 54, a 3-3-4 split vector quantizer is used. The first three LSFs use a 7-bit codebook in function modules 56 and 58, the next three LSFs use an 8-bit vector codebook in function modules 60 and 62, and the last four LSFs use a 9-bit codebook in function modules 64 and 66. From each split vector codebook, the three best candidates are selected in function modules 67, 68, 69, and 70 using the energy weighted mean square error criterion. The energy weighting reflects the power level of the spectral envelope at each line spectral frequency. The three best candidates for each of the three split vectors result in a total of twenty-seven combinations for each category. The search is constrained so that at least one combination would result in an ordered set of LSFs. This is usually a very mild constraint imposed on the search. The optimum combination of these twenty-seven combinations is selected in function module 71 depending on the cepstral distortion measure. Finally, the optimal category or classification is determined also on the basis of the cepstral distortion measure. The quantized LSFs are converted to filter coefficients and then to autocorrelation lags for interpolation purposes.
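The candidate selection and combination search within one category can be sketched as below. This is an illustrative sketch under several assumptions: the function names are hypothetical, the distortion measure is supplied by the caller (the specification uses a cepstral distortion measure, which is not reproduced here), and unordered combinations are simply skipped, which is a simplification of the constraint described above.

```python
from itertools import product


def top_candidates(sub_vector, weights, codebook, keep=3):
    """Indices of the `keep` codebook entries with the lowest weighted squared error."""
    def weighted_error(entry):
        return sum(w * (x - e) ** 2 for w, x, e in zip(weights, sub_vector, entry))
    ranked = sorted(range(len(codebook)), key=lambda i: weighted_error(codebook[i]))
    return ranked[:keep]


def is_ordered(lsfs):
    """True when the LSFs are strictly increasing."""
    return all(a < b for a, b in zip(lsfs, lsfs[1:]))


def quantize_split(lsfs, weights, split, codebooks, distortion):
    """Search every combination of per-split candidates (3 * 3 * 3 = 27 for
    three splits) and return (distortion, indices, quantized LSFs), or None
    if no combination is ordered."""
    bounds, start = [], 0
    for size in split:
        bounds.append((start, start + size))
        start += size
    candidates = [top_candidates(lsfs[lo:hi], weights[lo:hi], cb)
                  for (lo, hi), cb in zip(bounds, codebooks)]
    best = None
    for combo in product(*candidates):
        quantized = [v for idx, cb in zip(combo, codebooks) for v in cb[idx]]
        if not is_ordered(quantized):
            continue  # simplification: discard unordered combinations
        d = distortion(lsfs, quantized)
        if best is None or d < best[0]:
            best = (d, combo, quantized)
    return best
```

For the voiced categories `split` would be `(3, 4, 3)` and for the unvoiced categories `(3, 3, 4)`. The stated codebook sizes sum to 24 bits in both layouts (8 + 10 + 6 and 7 + 8 + 9), which would leave 2 of the 26 bits, enough to index the four categories; that accounting is an inference and is not stated in the text.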

The resulting LSF vector quantizer scheme is not only effective across speakers but also across varying degrees of IRS filtering which models the influence of the handset transducer. The codebooks of the vector quantizers are trained from a sixty talker speech database using flat as well as IRS frequency shaping. This is designed to provide consistent and good performance across several speakers and across various handsets. The average log spectral distortion across the entire TIA half rate database is approximately 1.2 dB for IRS filtered speech data and approximately 1.3 dB for non-IRS filtered speech data.

Two estimates of the pitch are determined per frame at intervals of 20 msec. These open loop pitch estimates are used in mode selection and to encode the closed loop pitch analysis if the selected mode is a predominantly voiced mode.

Module 33 determines the two pitch estimates from the two pitch analysis windows described above in connection with FIG. 5B, using a modified form of the pitch tracking algorithm shown in FIG. 7. This pitch estimation algorithm makes an initial pitch estimate in function module 73 using an error function calculated for all values in the set {22.0, 22.5, ..., 114.5}, followed by pitch tracking to yield an overall optimum pitch value. Function module 74 employs look-back pitch tracking using the error functions and pitch estimates of the previous two pitch analysis windows. Function module 75 employs look-ahead pitch tracking using the error functions of the two future pitch analysis windows. Decision module 76 compares pitch estimates depending on look-back and look-ahead pitch tracking to yield an overall optimum pitch value at output 77. The pitch estimation algorithm shown in FIG. 7 requires the error functions of two future pitch analysis windows for its look-ahead pitch tracking and thus introduces a delay of 40 ms. In order to avoid this penalty, the preferred communication system employs a modification of the pitch estimation algorithm of FIG. 7.

FIG. 8 shows the open loop pitch estimation 33 of FIG. 3 in more detail. Pitch analysis windows one and two are input to respective compute error functions 331 and 332. The outputs of these error function computations are input to a refinement of past pitch estimates 333, and the refined pitch estimates are sent to both look back and look ahead pitch tracking 334 and 335 for pitch window one. The outputs of the pitch tracking circuits are input to selector 336 which selects the open loop pitch one as the first output. The selected open loop pitch one is also input to a look back pitch tracking circuit for pitch window two which outputs the open loop pitch two.

FIG. 9 shows the modified pitch tracking algorithm implemented by the pitch estimation circuitry of FIG. 8. The modified pitch estimation algorithm employs the same error function as in the FIG. 7 algorithm in each pitch analysis window, but the pitch tracking scheme is altered. Prior to pitch tracking for either the first or second pitch analysis window, the previous two pitch estimates of the two previous pitch analysis windows are refined in function modules 81 and 82, respectively, with both look-back pitch tracking and look-ahead pitch tracking using the error functions of the current two pitch analysis windows. This is followed by look-back pitch tracking in function module 83 for the first pitch analysis window using the refined pitch estimates and error functions of the two previous pitch analysis windows. Look-ahead pitch tracking for the first pitch analysis window in function module 84 is limited to using the error function of the second pitch analysis window. The two estimates are compared in decision module 85 to yield an overall best pitch estimate for the first pitch analysis window. For the second pitch analysis window, look-back pitch tracking is carried out in function module 86 as well as the pitch estimate of the first pitch analysis window and its error function. No look-ahead pitch tracking is used for this second pitch analysis window with the result that the look-back pitch estimate is taken to be the overall best pitch estimate at output 87.

FIG. 10 shows the mode determination processing performed by mode selector 34. Depending on spectral stationarity, pitch stationarity, short term energy, short term level gradient, and zero crossing rate of each 40 ms. frame, mode selector 34 classifies each frame into one of three modes: voiced and stationary mode (Mode A), unvoiced or transient mode (Mode B), and background noise mode (Mode C). More specifically, mode selector 34 generates two logical values, each indicating spectral stationarity or similarity of spectral content between the currently processed frame and the previous frame (Step 1010). Mode selector 34 generates two logical values indicating pitch stationarity, that is, similarity of fundamental frequencies, between the currently processed frame and the previous frame (Step 1020). Mode selector 34 generates two logical values indicating the zero crossing rate of the currently processed frame (Step 1030), a rate influenced by the higher frequency components of the frame relative to the lower frequency components of the frame. Mode selector 34 generates two logical values indicating level gradients within the currently processed frame (Step 1040). Mode selector 34 generates five logical values indicating short-term energy of the currently processed frame (Step 1050). Subsequently, mode selector 34 determines the mode of the frame to be mode A, mode B, or mode C, depending on the values generated in Steps 1010-1050 (Step 1060).
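Two of the raw measurements that feed the mode decision, zero crossing rate and short-term energy, are simple to compute. The sketch below is illustrative only: the function names, the normalization of the zero crossing rate, the dB scale, and the energy floor are assumptions, and the thresholds that turn these measurements into the logical values of Steps 1030 and 1050 are not given in this part of the text and are therefore not shown.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_term_energy_db(frame, floor=1e-12):
    """Mean-square energy of the frame in dB, floored to avoid log(0)."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, floor))
```

A frame dominated by high-frequency content changes sign often and gives a high zero crossing rate, while a frame dominated by low-frequency content gives a low one, which is why the measurement helps separate unvoiced or noisy frames from voiced ones.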

FIG. 11 is a block diagram showing a processing of Step 1010 of FIG. 10 in more detail. The processing of FIG. 11 determines a cepstral distortion in dB. Module 1110 converts the quantized filter coefficients of window 2 of the current frame into the lag domain, and module 1120 converts the quantized filter coefficients of window 2 of the previous frame into the lag domain. Module 1130 interpolates the outputs of modules 1110 and 1120, and module 1140 converts the output of module 1130 back into filter coefficients. Module 1150 converts the output from module 1140 into the cepstral domain, and module 1160 converts the unquantized filter coefficients from window 1 of the current frame into the cepstral domain. Module 1170 generates the cepstral distortion d_c from the outputs of 1150 and 1160.

FIG. 12 shows generation of spectral stationarity value LPCFLAG1, which is a relatively strong indicator of spectral stationarity for the frame. Mode selector 34 generates LPCFLAG1 using a combination of two techniques for measuring spectral stationarity. The first technique compares the cepstral distortion d.sub.c using comparators 1210 and 1220. In FIG. 12, the d.sub.t1 threshold input to comparator 1210 is -8.0 and the d.sub.t2 threshold input to comparator 1220 is -6.0.

The second technique is based on the residual energy after LPC analysis, expressed as a fraction of the LPC analysis speech buffer spectral energy. This residual energy is a by-product of LPC analysis, as described above. The .alpha.1 input to comparator 1230 is the residual energy for the filter coefficients of window 1 and the .alpha.2 input to comparator 1240 is the residual energy for the filter coefficients of window 2. The .alpha.t1 input to comparators 1230 and 1240 is a threshold equal to 0.25.

FIG. 13 shows dataflow within mode selector 34 for generating spectral stationarity flag LPCFLAG2, which is a relatively weak indicator of spectral stationarity. The processing shown in FIG. 13 is similar to that shown in FIG. 12, except that LPCFLAG2 is based on a relatively relaxed set of thresholds. The d.sub.t2 input to comparator 1310 is -6.0, the d.sub.t3 input to comparator 1320 is -4.0, the d.sub.t4 input to comparator 1350 is -2.0, the .alpha.t1 input to comparators 1330 and 1340 is a threshold equal to 0.25, and the .alpha.t2 input to comparators 1360 and 1370 is 0.15.

FIG. 14 illustrates the process by which mode selector 34 measures pitch stationarity using both the open loop pitch values of the current frame, denoted as P.sub.1 for pitch window 1 and P.sub.2 for pitch window 2, and the open loop pitch value of window 2 of the previous frame denoted by P.sub.-1. A lower range of pitch values (P.sub.L1 P.sub.U1) and an upper range of pitch values (P.sub.L2 P.sub.U2) are:

P.sub.L1 =MIN (P.sub.-1, P.sub.2)-P.sub.t

P.sub.U1 =MIN (P.sub.-1, P.sub.2)+P.sub.t

P.sub.L2 =MAX (P.sub.-1, P.sub.2)-P.sub.t

P.sub.U2 =MAX (P.sub.-1, P.sub.2)+P.sub.t,

where P.sub.t is 8.0. If the two ranges are non-overlapping, i.e., P.sub.L2 >P.sub.U1, then only a weak indicator of pitch stationarity, denoted by PITCHFLAG2, is possible, and PITCHFLAG2 is set if P.sub.1 lies within either the lower range (P.sub.L1, P.sub.U1) or the upper range (P.sub.L2, P.sub.U2). If the two ranges are overlapping, i.e., P.sub.L2 .ltoreq.P.sub.U1, a strong indicator of pitch stationarity, denoted by PITCHFLAG1, is possible and is set if P.sub.1 lies within the range (P.sub.L, P.sub.U), where

P.sub.L =(P.sub.-1 +P.sub.2)/2-2P.sub.t

P.sub.U =(P.sub.-1 +P.sub.2)/2+2P.sub.t

FIG. 14 shows a dataflow for generating PITCHFLAG1 and PITCHFLAG2 within mode selector 34. Module 14005 generates an output equal to the input having the largest value, and module 14010 generates an output equal to the input having the smallest value. Module 14020 generates an output that is an average of the values of the two inputs. Modules 14030, 14035, 14040, 14045, 14050 and 14055 are adders. Modules 14080, 14025 and 14090 are AND gates. Module 14087 is an inverter. Modules 14065, 14070, and 14075 are each logic blocks generating a true output when (C>=B)&(C<=A).

The circuit of FIG. 14 also processes reliability values V.sub.-1, V.sub.1, and V.sub.2, each indicating whether the values P.sub.-1, P.sub.1, and P.sub.2, respectively, are reliable. Typically, these reliability values are a by-product of the pitch calculation algorithm. The circuit shown in FIG. 14 generates false values for PITCHFLAG1 and PITCHFLAG2 if any of the reliability values V.sub.-1, V.sub.1, or V.sub.2 is false. Processing of these reliability values is optional.
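
The pitch stationarity test described above can be sketched in a few lines. The function name `pitch_flags` and its argument names are hypothetical; the inclusive range test follows the (C>=B)&(C<=A) logic blocks, and leaving PITCHFLAG2 false when the ranges overlap is an assumption, since the text only states which flag is "possible" in each case.

```python
def pitch_flags(p_prev, p1, p2, p_t=8.0, reliable=(True, True, True)):
    """Return (PITCHFLAG1, PITCHFLAG2) for open loop pitch values.

    p_prev is P_-1 (window 2 of the previous frame); p1 and p2 are the
    pitch values of windows 1 and 2 of the current frame.
    """
    lo1, hi1 = min(p_prev, p2) - p_t, min(p_prev, p2) + p_t
    lo2, hi2 = max(p_prev, p2) - p_t, max(p_prev, p2) + p_t
    flag1 = flag2 = False
    if lo2 > hi1:
        # Non-overlapping ranges: only the weak indicator is possible.
        flag2 = (lo1 <= p1 <= hi1) or (lo2 <= p1 <= hi2)
    else:
        # Overlapping ranges: test P1 against the widened central range.
        mid = (p_prev + p2) / 2.0
        flag1 = (mid - 2.0 * p_t) <= p1 <= (mid + 2.0 * p_t)
    # Optional reliability gating (V_-1, V_1, V_2).
    if not all(reliable):
        flag1 = flag2 = False
    return flag1, flag2
```

For example, with P.sub.-1 = 50, P.sub.1 = 52 and P.sub.2 = 54 the two ranges overlap and P.sub.1 falls inside (36, 68), so only the strong flag is set.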

FIG. 15 shows dataflow within mode selector 34 for generating two logical values indicating a zero crossing rate for the frame. Modules 15002, 15004, 15006, 15008, 15010, 15012, 15014 and 15016 each count the number of zero crossings in a respective 5 ms subframe of the frame currently being processed. For example, module 15006 counts the number of zero crossings of the signal occurring from the time 10 ms from the beginning of the frame to the time 15 ms from the beginning of the frame. Comparators 15018, 15020, 15022, 15024, 15026, 15028, 15030, and 15032, in combination with adder 15035, generate a value indicating the number of 5 ms subframes having at least 15 zero crossings. Comparator 15040 sets the flag ZC_LOW when the number of such subframes is less than 2, and comparator 15037 sets the flag ZC_HIGH when the number of such subframes is greater than 5. The value ZC.sub.t input to comparators 15018-15032 is 15, the value Z.sub.t1 input to comparator 15040 is 2, and the value Z.sub.t2 input to comparator 15037 is 5.
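
A minimal sketch of the zero crossing test follows. It assumes 8 samples per ms, so that a 5 ms subframe holds 40 samples, and it assumes that a zero crossing is a sign change between consecutive samples within a subframe; the text does not define the crossing detector, and the function name `zero_crossing_flags` is hypothetical.

```python
def zero_crossing_flags(frame, subframe_len=40, zc_t=15, z_t1=2, z_t2=5):
    """Return (ZC_LOW, ZC_HIGH) for one frame of samples."""
    busy = 0  # number of subframes with at least zc_t zero crossings
    for start in range(0, len(frame), subframe_len):
        sub = frame[start:start + subframe_len]
        # Count sign changes between consecutive samples (assumed definition).
        crossings = sum(1 for x, y in zip(sub, sub[1:]) if (x >= 0) != (y >= 0))
        if crossings >= zc_t:
            busy += 1
    return busy < z_t1, busy > z_t2
```

A frame that alternates in sign on every sample sets only ZC_HIGH, while a constant frame sets only ZC_LOW.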

FIGS. 16A, 16B, and 16C show a data flow for generating two logical values indicative of short term level gradient. Mode selector 34 measures short term level gradient, an indication of transients within a frame, using a low-pass filtered version of the companded input signal amplitude. Module 16005 generates the absolute value of the input signal s(n), module 16010 compands its input signal, and low-pass filter 16015 generates a signal A.sub.L (n) that, at time instant n, is expressed by:

A.sub.L (n)=(63/64)A.sub.L (n-1)+(1/64)C(|s(n)|)

where the companding function C(.) is the .mu.-law function described in CCITT G.711. Delay 16025 generates an output that is a 10 ms-delayed version of its input and subtractor 16027 generates a difference between A.sub.L (n) and A.sub.L (n-80). Module 16030 generates a signal that is an absolute value of its input.

Every 5 ms, mode selector 34 compares A.sub.L (n) with its value 10 ms earlier and, if the difference |A.sub.L (n)-A.sub.L (n-80)| exceeds a fixed relaxed threshold, increments a counter. (In the preceding expression, 80 corresponds to 8 samples per ms times 10 ms.) As shown in FIG. 16C, if this difference does not exceed a relatively stringent threshold (L.sub.t2 =32) for any subframe, mode selector 34 sets LVLFLAG2, weakly indicating an absence of transients. As shown in FIG. 16B, if this difference exceeds a more relaxed threshold (L.sub.t1 =10) for no more than one subframe (L.sub.t3 =2), mode selector 34 sets LVLFLAG1, strongly indicating an absence of transients.

More specifically, FIG. 16B shows delay circuits 16032-16046, each of which generates a 5 ms delayed version of its input. Each of latches 16048-16062 saves the signal on its input. Latches 16048-16062 are strobed at a common time, near the end of each 40 ms speech frame, so that each latch saves a portion of the frame separated by 5 ms from the portion saved by an adjacent latch. Comparators 16064-16078 each compare the output of a respective latch to the threshold L.sub.t1, and adder 16080 sums the comparator outputs and sends the sum to comparator 16082 for comparison to the threshold L.sub.t3.

FIG. 16C shows a circuit for generating LVLFLAG2. In FIG. 16C, delays 16132-16146 are similar to the delays shown in FIG. 16B and latches 16148-16162 are similar to the latches shown in FIG. 16B. Comparators 16164-16178 each compare an output of a respective latch to the threshold L.sub.t2 =32. Thus, OR gate 16180 generates a true output if any of the latched signals originating from module 16030 exceeds the threshold L.sub.t2. Inverter 16182 inverts the output of OR gate 16180.
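
The level gradient processing of FIGS. 16A-16C can be sketched as follows. Several details are assumptions rather than statements of the text: the companding function is approximated by the continuous .mu.-law curve scaled to the range 0 to 127 (G.711 itself specifies a segmented 8-bit code), the envelope is sampled at the last sample of each 5 ms subframe, and L.sub.t2 is taken as 32. The names `mu_law`, `envelope` and `level_flags` are hypothetical.

```python
import math

def mu_law(x, mu=255.0, x_max=32768.0, out_max=127.0):
    """Continuous mu-law curve of |x|, scaled to [0, out_max] (assumed scaling)."""
    x = min(abs(x), x_max)
    return out_max * math.log1p(mu * x / x_max) / math.log1p(mu)

def envelope(samples, a_prev=0.0):
    """A_L(n) = (63/64) A_L(n-1) + (1/64) C(|s(n)|) for each sample."""
    out = []
    a = a_prev
    for s in samples:
        a = (63.0 / 64.0) * a + (1.0 / 64.0) * mu_law(s)
        out.append(a)
    return out

def level_flags(env, frame_start, l_t1=10.0, l_t2=32.0, l_t3=2,
                step=40, delay=80, n_sub=8):
    """Return (LVLFLAG1, LVLFLAG2) for the frame starting at frame_start.

    env holds A_L(n) for the whole signal and must include at least
    `delay` samples of history before frame_start.
    """
    if frame_start < delay:
        raise ValueError("need at least `delay` samples of history")
    diffs = []
    for i in range(1, n_sub + 1):
        n = frame_start + i * step - 1  # last sample of subframe i
        diffs.append(abs(env[n] - env[n - delay]))
    lvlflag1 = sum(d > l_t1 for d in diffs) < l_t3   # at most one exceedance
    lvlflag2 = not any(d > l_t2 for d in diffs)      # no exceedance at all
    return lvlflag1, lvlflag2
```

Because the envelope is compared with its value 80 samples earlier, a single level step inside the frame normally shows up in two consecutive 5 ms comparisons, which is enough to clear LVLFLAG1 when the step exceeds L.sub.t1.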

FIG. 17 shows a data flow for generating parameters indicative of short term energy. Short term energy is measured as the mean square energy (average energy per sample) on a frame basis as well as on a 5 ms basis. The short term energy is determined relative to a background energy E.sub.bn. E.sub.bn is initially set to a constant E.sub.0 =(100.times.(12).sup.1/2).sup.2. Subsequently, when a frame is determined to be mode C, E.sub.bn is set equal to (7/8)E.sub.bn +(1/8)E.sub.0. Thus, some of the thresholds employed in the circuit of FIG. 17 are adaptive. In FIG. 17, E.sub.t0 =0.707 E.sub.bn, E.sub.t1 =5, E.sub.t2 =2.5 E.sub.bn, E.sub.t3 =1.8 E.sub.bn, E.sub.t4 =E.sub.bn, E.sub.t5 =0.707 E.sub.bn, and E.sub.t6 =16.0.

The short term energy on a 5 ms basis provides an indication of presence of speech throughout the frame using a single flag EFLAG1, which is generated by testing the short term energy on a 5 ms basis against a threshold, incrementing a counter whenever the threshold is exceeded, and testing the counter's final value against a fixed threshold. Comparing the short term energy on a frame basis to various thresholds provides indication of absence of speech throughout the frame in the form of several flags with varying degrees of confidence. These flags are denoted as EFLAG2, EFLAG3, EFLAG4, and EFLAG5.

FIG. 17 shows dataflow within mode selector 34 for generating these flags. Modules 17002, 17004, 17006, 17008, 17010, 17015, 17020, and 17022 each compute the energy in a respective 5 ms subframe of the frame currently being processed. Comparators 17030, 17032, 17034, 17036, 17038, 17040, 17042, and 17044, in combination with adder 17050, count the number of subframes having an energy exceeding E.sub.t0 =0.707 E.sub.bn.
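
The subframe counting performed by the comparators and adder 17050 can be sketched as below, assuming 40 samples per 5 ms subframe. The sketch stops at the count: how the count is turned into EFLAG1, and how EFLAG2 through EFLAG5 map onto the remaining thresholds, is not spelled out in this passage and is not guessed here. The names `E0` and `subframe_energy_count` are hypothetical.

```python
# Initial background energy E_0 = (100 * sqrt(12))^2, i.e. about 120000.
E0 = (100.0 * 12 ** 0.5) ** 2

def subframe_energy_count(frame, e_bn, subframe_len=40, factor=0.707):
    """Count 5 ms subframes whose mean square energy exceeds factor * e_bn."""
    threshold = factor * e_bn
    count = 0
    for start in range(0, len(frame), subframe_len):
        sub = frame[start:start + subframe_len]
        energy = sum(x * x for x in sub) / len(sub)  # average energy per sample
        if energy > threshold:
            count += 1
    return count
```

With the initial background energy, a frame whose first 20 ms has amplitude 400 and whose second 20 ms is silent yields a count of 4, since 400.sup.2 =160000 exceeds 0.707 E.sub.0.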

FIGS. 18A, 18B, and 18C show the processing of step 1060. Mode selector 34 first classifies the frame as background noise (mode C) or speech (modes A or B). Mode C tends to be characterized by low energy, relatively high spectral stationarity between the current frame and the previous frame, a relative absence of pitch stationarity between the current frame and the previous frame, and a high zero crossing rate. Background noise (mode C) is declared either on the basis of the short term energy flag EFLAG5 alone or by combining short term energy flags EFLAG4, EFLAG3, and EFLAG2 with other flags indicating high zero crossing rate, absence of pitch, absence of transients, etc.

More specifically, if the mode of the previous frame was A or if EFLAG2 is not true, processing proceeds to step 18045 (step 18005). Step 18005 ensures that the current frame will not be mode C if the previous frame was mode A. The current frame is mode C if (LPCFLAG1 and EFLAG3) is true or (LPCFLAG2 and EFLAG4) is true or EFLAG5 is true (steps 18010, 18015, and 18020). The current frame is mode C if ((not PITCHFLAG1) and LPCFLAG1 and ZC_HIGH) is true (step 18025) or ((not PITCHFLAG1) and (not PITCHFLAG2) and LPCFLAG2 and ZC_HIGH) is true (step 18030). Thus, the processing shown in FIG. 18A determines whether the frame corresponds to a first mode (Mode C), depending on whether a speech component is substantially absent from the frame.
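
The background noise decision of FIG. 18A reduces to a short boolean function. The sketch below assumes the flags are passed in a dictionary keyed by the flag names used in the text; the function name `is_mode_c` and the dictionary layout are hypothetical.

```python
def is_mode_c(prev_mode, f):
    """Return True if the frame is classified as background noise (mode C)."""
    # Step 18005: never mode C directly after mode A, or without EFLAG2.
    if prev_mode == "A" or not f["EFLAG2"]:
        return False
    # Steps 18010-18020: energy-based tests.
    if (f["LPCFLAG1"] and f["EFLAG3"]) or (f["LPCFLAG2"] and f["EFLAG4"]) or f["EFLAG5"]:
        return True
    # Step 18025: no strong pitch stationarity, strong spectral stationarity.
    if (not f["PITCHFLAG1"]) and f["LPCFLAG1"] and f["ZC_HIGH"]:
        return True
    # Step 18030: no pitch stationarity at all, weak spectral stationarity.
    if (not f["PITCHFLAG1"]) and (not f["PITCHFLAG2"]) and f["LPCFLAG2"] and f["ZC_HIGH"]:
        return True
    return False
```

If the function returns False, processing continues with the speech classification at step 18045.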

In step 18045, a score is calculated depending on the mode of the previous frame. If the mode of the previous frame was mode A, the score is 1+LVLFLAG1+EFLAG1+ZC_LOW. If the mode of the previous frame was mode B, the score is 0+LVLFLAG1+EFLAG1+ZC_LOW. If the mode of the previous frame was mode C, the score is 2+LVLFLAG1+EFLAG1+ZC_LOW.

If the mode of the previous frame was mode C or LVLFLAG2 is not true, the mode of the current frame is mode B (step 18050). The current frame is mode A if (LPCFLAG1 and PITCHFLAG1) is true, provided the score is not less than 2 (steps 18060 and 18055). The current frame is mode A if (LPCFLAG1 and PITCHFLAG2) is true or (LPCFLAG2 and PITCHFLAG1) is true, provided the score is not less than 3 (steps 18070, 18075, and 18080).
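
The score calculation of step 18045 and the mode A versus mode B decision can be sketched as follows. The function name `speech_mode` and the dictionary of flags are hypothetical, and returning mode B when neither mode A condition holds is an assumption, since the text states only the conditions for mode A.

```python
def speech_mode(prev_mode, f):
    """Classify a frame already known not to be mode C as "A" or "B"."""
    # Step 18045: base score depends on the mode of the previous frame.
    base = {"A": 1, "B": 0, "C": 2}[prev_mode]
    score = base + int(f["LVLFLAG1"]) + int(f["EFLAG1"]) + int(f["ZC_LOW"])
    # Step 18050: transients, or a preceding noise frame, force mode B.
    if prev_mode == "C" or not f["LVLFLAG2"]:
        return "B"
    # Strong spectral and strong pitch stationarity need a score of 2 or more.
    if f["LPCFLAG1"] and f["PITCHFLAG1"] and score >= 2:
        return "A"
    # One strong and one weak indicator need a score of 3 or more.
    mixed = (f["LPCFLAG1"] and f["PITCHFLAG2"]) or (f["LPCFLAG2"] and f["PITCHFLAG1"])
    if mixed and score >= 3:
        return "A"
    return "B"  # assumed default
```

The base score makes it easier to stay in mode A than to enter it from mode B: after a mode A frame, one supporting flag is enough for the strong test, whereas after a mode B frame two are required.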

Subsequently, speech encoder 12 generates an encoded frame in accordance with one of a first coding scheme (a coding scheme for mode C), when the frame corresponds to the first mode, and an alternative coding scheme (a coding scheme for modes A or B), when the frame does not correspond to the first mode, as described in more detail below.

For mode A, only the second set of line spectral frequency vector quantization indices needs to be transmitted because the first set can be inferred at the receiver due to the slowly varying nature of the vocal tract shape. In addition, the first and second open loop pitch estimates are quantized and transmitted because they are used to encode the closed loop pitch estimates in each subframe. The quantization of the second open loop pitch estimate is accomplished using a non-uniform 4-bit quantizer while the quantization of the first open loop pitch estimate is accomplished using a differential non-uniform 3-bit quantizer. Since the vector quantization indices of the LSF's for the first linear prediction analysis window are neither transmitted nor used in mode selection, they need not be calculated in mode A. This reduces the complexity of the short term predictor section of the encoder in this mode. This reduced complexity as well as the lower bit rate of the short term predictor parameters in mode A is offset by faster update of all the excitation model parameters.

For mode B, both sets of line spectral frequency vector quantization indices must be transmitted because of potential spectral nonstationarity. However, for the first set of line spectral frequencies we need search only 2 of the 4 classifications or categories. This is because the IRS vs. non-IRS selection varies very slowly with time. If the second set of line spectral frequencies was chosen from the "voiced IRS-filtered" category, then the first set can be expected to be from either the "voiced IRS-filtered" or "unvoiced IRS-filtered" categories. If the second set of line spectral frequencies was chosen from the "unvoiced IRS-filtered" category, then again the first set can be expected to be from either the "voiced IRS-filtered" or "unvoiced IRS-filtered" categories. If the second set of line spectral frequencies was chosen from the "voiced non-IRS-filtered" category, then the first set can be expected to be from either the "voiced non-IRS-filtered" or "unvoiced non-IRS-filtered" categories. Finally, if the second set of line spectral frequencies was chosen from the "unvoiced non-IRS-filtered" category, then again the first set can be expected to be from either the "voiced non-IRS-filtered" or "unvoiced non-IRS-filtered" categories. As a result, only two categories of LSF codebooks need be searched for the quantization of the first set of line spectral frequencies. Furthermore, only 25 bits are needed to encode these quantization indices instead of the 26 needed for the second set of LSF's, since the optimal category for the first set can be coded using just 1 bit. For mode B, neither of the two open loop pitch estimates is transmitted since they are not used in guiding the closed loop pitch estimates. The higher complexity involved in encoding as well as the higher bit rate of the short term predictor parameters in mode B is compensated by a slower update of all the excitation model parameters.
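
The restricted codebook search for the first set of line spectral frequencies in mode B amounts to keeping the IRS or non-IRS property of the second set's category and leaving only the voiced or unvoiced choice open. A minimal sketch, with the hypothetical names `CATEGORIES` and `first_set_candidates`:

```python
CATEGORIES = (
    "voiced IRS-filtered",
    "unvoiced IRS-filtered",
    "voiced non-IRS-filtered",
    "unvoiced non-IRS-filtered",
)

def first_set_candidates(second_category):
    """Return the two categories searched for the first LSF set in mode B."""
    if second_category not in CATEGORIES:
        raise ValueError("unknown category: %r" % (second_category,))
    is_irs = "non-IRS" not in second_category
    # Keep only the categories that share the IRS / non-IRS property.
    return [c for c in CATEGORIES if ("non-IRS" not in c) == is_irs]
```

Choosing between two candidates takes 1 bit instead of the 2 bits needed to choose among four categories, which is consistent with the 25 bits stated for the first set versus 26 bits for the second.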

For mode C, only the second set of line spectral frequency vector quantization indices needs to be transmitted because the human ear is not as sensitive to rapid changes in spectral shape for noisy inputs. Further, such rapid spectral shape variations are atypical for many kinds of b