Claims
I claim:
1. A method for decoding digital signals representing encoded speech
signals comprising steps of:
providing an input digital signal;
determining whether the input digital signal comprises voiced speech or
unvoiced speech;
synthesizing speech signals using frequency domain techniques when the
input digital signal represents voiced speech; and
synthesizing speech signals using time domain techniques when the input
digital signal represents unvoiced speech, wherein said step of
synthesizing speech signals using frequency domain techniques when the
input digital signal represents voiced speech further comprises steps of:
interpolating phases between transmitted phases to fill an array describing
phase with interpolated phase data;
inverse fast Fourier transforming said interpolated phase data to provide
reconstructed target epochs;
interpolating linear predictive coding (LPC) coefficients to simulate LPC
coefficients elided in a transmitter to provide reconstructed LPC
coefficients;
interpolating between the reconstructed target epochs to provide a
reconstructed voiced excitation function; and
synthesizing speech signals from the reconstructed voiced excitation
function and the reconstructed LPC coefficients with a lattice synthesis
filter to provide reconstructed speech signals.
2. A method as claimed in claim 1, wherein said step of synthesizing speech
signals using time domain techniques when the input digital signal
represents unvoiced speech further comprises steps of:
decoding a series of contiguous root-mean-square (RMS) amplitudes;
interpolating between the contiguous RMS amplitudes to regenerate an
excitation envelope;
modulating a noise generator with the excitation envelope to provide
unvoiced excitation; and
synthesizing unvoiced speech from the unvoiced excitation.
3. A method as claimed in claim 2, wherein said step of modulating a noise
generator with the excitation envelope to provide unvoiced excitation
includes a step of modulating a Gaussian random number generator to
provide unvoiced excitation.
4. A method as claimed in claim 2, wherein said step of synthesizing
unvoiced speech from the unvoiced excitation includes a step of
synthesizing unvoiced speech by a lattice filter from the unvoiced
excitation.
5. A method as claimed in claim 2, wherein:
said step of modulating a noise generator with the excitation envelope to
provide unvoiced excitation includes a step of modulating a Gaussian
random number generator to provide unvoiced excitation; and
said step of synthesizing unvoiced speech from the unvoiced excitation
includes a step of synthesizing unvoiced speech by a lattice filter from
the unvoiced excitation.
6. A method as claimed in claim 1, wherein synthesizing speech signals from
the reconstructed voiced excitation function includes a step of windowing
the reconstructed voiced excitation function.
7. A method as claimed in claim 6, wherein said step of windowing the
reconstructed voiced excitation function includes a step of windowing the
reconstructed voiced excitation function with a trapezoidal window.
8. An apparatus for pitch epoch synchronous decoding of digital signals
representing encoded speech signals comprising:
an input for receiving digital signal;
means for determining voicing of said input digital signal coupled to said
input;
first means for synthesizing speech signals using frequency domain
techniques when said input digital signal represents voiced speech; and
second means for synthesizing speech signals using time domain techniques
when said input digital signal represents unvoiced speech, said first and
second means for synthesizing speech signals each coupled to said means
for determining voicing, wherein said first means for synthesizing speech
signals comprises:
means for interpolating phases between transmitted phases to fill an array
describing phase with interpolated phase data, said interpolating means
coupled to said means for determining voicing;
means for inverse fast Fourier transforming (iFFT) said interpolated phase
data to provide reconstructed target epochs, said iFFT means coupled to
said interpolating means;
linear predictive coding (LPC) coefficient interpolation means coupled to
said iFFT means, said LPC coefficient interpolation means for providing a
reconstructed set of LPC coefficients by interpolation of LPC coefficients
to simulate elided LPC coefficients;
epoch interpolating means coupled to said LPC coefficient interpolation
means, said epoch interpolating means for interpolating between said
reconstructed target epochs to provide a reconstructed voiced excitation
function; and
lattice synthesis filter means coupled to said epoch interpolating means,
said lattice synthesis filter means for synthesizing speech signals from
the reconstructed voiced excitation function and the reconstructed set of
LPC coefficients to provide reconstructed speech signals.
9. An apparatus as claimed in claim 8, wherein said second means for
synthesizing speech signals comprises:
means for decoding a series of contiguous representative amplitudes coupled
to said means for determining voicing;
a noise generator coupled to said means for decoding, said noise generator
providing noise at a level modulated with an envelope derived from the
series of contiguous representative amplitudes to provide reconstructed
unvoiced excitation; and
a lattice synthesis filter for synthesizing unvoiced speech from said
reconstructed unvoiced excitation function.
10. An apparatus as claimed in claim 9, wherein said means for decoding a
series of contiguous representative amplitudes is a means for decoding a
series of contiguous root-mean-square (RMS) amplitudes.
11. An apparatus as claimed in claim 9, wherein said noise generator is a
Gaussian noise generator.
12. An apparatus as claimed in claim 8, wherein said first means for
synthesizing speech signals includes windowing means coupled to said epoch
interpolating means, said windowing means for windowing said reconstructed
voiced excitation function to remove artifacts from said iFFT means, said
windowing means having an output coupled to said lattice synthesis filter
means.
13. An apparatus as claimed in claim 8, wherein said first means for
synthesizing speech signals includes trapezoidal windowing means coupled
to said epoch interpolating means, said trapezoidal windowing means for
windowing said reconstructed voiced excitation function to remove
artifacts from said iFFT means, said trapezoidal windowing means having an
output coupled to said lattice synthesis filter means.
14. A method for pitch epoch synchronous encoding of speech signals and
decoding digital signals representing encoded speech signals, said method
comprising steps of:
inputting an input signal; and, when said input signal comprises an input
speech signal:
processing the input speech signal to characterize qualities including
linear predictive coding coefficients;
determining whether the input speech signal comprises voiced speech or
unvoiced speech;
analyzing input speech signals using frequency domain techniques when input
speech signals comprise voiced speech to provide an excitation function,
wherein said step of analyzing input speech signals using frequency domain
techniques comprises steps of:
determining epoch excitation positions within a frame of speech data;
determining fractional pitch;
determining a group of synchronous linear predictive coding (LPC)
coefficients by performing epoch-synchronous LPC analysis; and
selecting an interpolation excitation target from within a particular epoch
of speech data to provide a target excitation function, wherein the target
excitation function comprises per-epoch speech parameters and wherein said
encoding step includes encoding fractional pitch and synchronous LPC
coefficients; and
encoding the excitation function to provide a digital output signal
representing the input speech signal; and, when said input signal
comprises an input digital signal representing encoded speech signals:
determining voicing of the input digital signal, synthesizing speech
signals using frequency domain techniques when the input digital signal
represents voiced speech; and, when the input digital signal represents
unvoiced speech:
decoding a series of contiguous root-mean-square (RMS) amplitudes;
interpolating between the contiguous RMS amplitudes to regenerate an
excitation envelope;
modulating a noise generator with the excitation envelope to provide
unvoiced excitation; and
synthesizing unvoiced speech from the unvoiced excitation; and, when the
input digital signal represents voiced speech:
interpolating phases between transmitted phases to fill an array describing
phase with interpolated phase data;
inverse fast Fourier transforming said interpolated phase data to provide
reconstructed target epochs;
interpolating linear predictive coding (LPC) coefficients to simulate LPC
coefficients elided in a transmitter to provide reconstructed LPC
coefficients;
interpolating between the reconstructed target epochs to provide a
reconstructed voiced excitation function; and
synthesizing speech signals from the reconstructed voiced excitation
function and the reconstructed LPC coefficients with a lattice synthesis
filter to provide reconstructed speech signals.
15. A method for decoding digital signals representing encoded speech
signals comprising steps of:
providing an input digital signal;
determining whether the input digital signal comprises voiced speech or
unvoiced speech;
synthesizing speech signals using frequency domain techniques when the
input digital signal represents voiced speech; and
synthesizing speech signals using time domain techniques when the input
digital signal represents unvoiced speech, wherein said step of
synthesizing speech signals using time domain techniques when the input
digital signal represents unvoiced speech further comprises steps of:
decoding a series of contiguous root-mean-square (RMS) amplitudes;
interpolating between the contiguous RMS amplitudes to regenerate an
excitation envelope;
modulating a noise generator with the excitation envelope to provide
unvoiced excitation; and
synthesizing unvoiced speech from the unvoiced excitation; and
wherein said step of synthesizing speech signals using frequency domain
techniques when the input digital signal represents voiced speech further
comprises steps of:
interpolating phases between transmitted phases to fill an array describing
phase with interpolated phase data;
inverse fast Fourier transforming said interpolated phase data to provide
reconstructed target epochs;
interpolating linear predictive coding (LPC) coefficients to simulate LPC
coefficients elided in a transmitter to provide reconstructed LPC
coefficients;
interpolating between the reconstructed target epochs to provide a
reconstructed voiced excitation function; and
synthesizing speech signals from the reconstructed voiced excitation
function and the reconstructed LPC coefficients with a lattice synthesis
filter to provide reconstructed speech signals.
16. A method as claimed in claim 15, wherein synthesizing speech signals
from the reconstructed voiced excitation function includes a step of
windowing the reconstructed voiced excitation function with a trapezoidal
window.
17. An apparatus for pitch epoch synchronous decoding of digital signals
representing encoded speech signals comprising:
an input for receiving digital signal;
means for determining voicing of said input digital signal coupled to said
input;
first means for synthesizing speech signals using frequency domain
techniques when said input digital signal represents voiced speech; and
second means for synthesizing speech signals using time domain techniques
when said input digital signal represents unvoiced speech, said first and
second means for synthesizing speech signals each coupled to said means
for determining voicing, wherein said second means for synthesizing speech
signals comprises:
means for decoding a series of contiguous root-mean-square (RMS)
representative amplitudes coupled to said means for determining voicing;
a noise generator coupled to said means for decoding, said noise generator
providing noise at a level modulated with an envelope derived from the
series of contiguous representative amplitudes to provide reconstructed
unvoiced excitation; and
a lattice synthesis filter for synthesizing unvoiced speech from said
reconstructed unvoiced excitation function; and wherein said first means
for synthesizing speech signals comprises:
means for interpolating phases between transmitted phases to fill an array
describing phase with interpolated phase data, said interpolating means
coupled to said means for determining voicing;
means for inverse fast Fourier transforming (iFFT) said interpolated phase
data to provide reconstructed target epochs, said iFFT means coupled to
said interpolating means;
linear predictive coding (LPC) coefficient interpolation means coupled to
said iFFT means, said LPC coefficient interpolation means for providing a
reconstructed set of LPC coefficients by interpolation of LPC coefficients
to simulate elided LPC coefficients;
epoch interpolating means coupled to said LPC coefficient interpolation
means, said epoch interpolating means for interpolating between said
reconstructed target epochs to provide a reconstructed voiced excitation
function; and
lattice synthesis filter means coupled to said epoch interpolating means,
said lattice synthesis filter means for synthesizing speech signals from
the reconstructed voiced excitation function and the reconstructed set of
LPC coefficients to provide reconstructed speech signals.
18. An apparatus as claimed in claim 17, wherein said first means for
synthesizing speech signals includes trapezoidal windowing means coupled
to said epoch interpolating means, said trapezoidal windowing means for
windowing said reconstructed voiced excitation function to remove
artifacts from said iFFT means, said trapezoidal windowing means having an
output coupled to said lattice synthesis filter means.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application is related to co-pending U.S. patent applications Ser. No.
07/732,977, filed on Jul. 19, 1991 and Ser. No. 08/068,918, entitled
"Excitation Synchronous Time Encoding Vocoder And Method", filed on an
even date herewith, which are assigned to the same assignee as the present
application.
FIELD OF THE INVENTION
This invention relates in general to the field of digitally encoded human
speech, in particular to coding and decoding techniques and more
particularly to high fidelity techniques for digitally encoding speech and
transmitting digitally encoded speech using reduced bandwidth in concert
with synthesizing speech signals of increased clarity from digital codes.
BACKGROUND OF THE INVENTION
Digital encoding of speech signals and/or decoding of digital signals to
provide intelligible speech signals are important for many electronic
products providing secure communications capabilities, communications via
digital links or speech output signals derived from computer instructions.
Many digital voice systems suffer from poor perceptual quality in the
synthesized speech. Insufficient characterization of input speech basis
elements, bandwidth limitations and subsequent reconstruction of
synthesized speech signals from encoded digital representations all
contribute to perceptual degradation of synthesized speech quality.
Moreover, some information carrying capacity is lost; the nuances,
intonations and emphases imparted by the speaker carry subtle but
significant messages that are lost to varying degrees through corruption in
encoding and subsequent decoding of speech signals transmitted in digital form.
In particular, auto-regressive linear predictive coding (LPC) techniques
comprise a system transfer function having all poles and no zeroes. These
prior art coding techniques, and especially those utilizing linear predictive
coding analysis, tend to neglect all resonance contributions from the nasal
cavities (which essentially provide the "zeroes" in the transfer function
describing the human speech apparatus) and result in reproduced speech
having an artificially "tinny" or "nasal" quality.
Standard techniques for digitally encoding and decoding speech generally
utilize signal processing analysis techniques which require significant
bandwidth in realizing high quality real-time communication.
What are needed are apparatus and methods for rapidly and accurately
characterizing speech signals in a fashion lending itself to digital
representation thereof as well as synthesis methods and apparatus for
providing speech signals from digital representations which provide high
fidelity and conserve digital bandwidth requirements.
SUMMARY OF THE INVENTION
Briefly stated, there is provided a new and improved apparatus for digital
speech representation and reconstruction and a method therefor.
A method for pitch epoch synchronous encoding of speech signals is provided. The method
includes steps of providing an input speech signal, processing the input
speech signal to characterize qualities including linear predictive coding
coefficients and voicing, characterizing input speech signals using
frequency domain techniques when input speech signals comprise voiced
speech to provide an excitation function, characterizing the input speech
signals using time domain techniques when the input speech signals
comprise unvoiced speech to provide an excitation function and encoding
the excitation function to provide a digital output signal representing
the input speech signal.
In a preferred embodiment, the apparatus comprises an apparatus for pitch
epoch synchronous decoding of digital signals representing encoded speech
signals. The apparatus includes an input for receiving a digital signal, an
apparatus for determining voicing of the input digital signal coupled to
the input, a first apparatus for synthesizing speech signals using
frequency domain techniques when the input digital signal represents
voiced speech and a second apparatus for synthesizing speech signals using
time domain techniques when the input digital signal represents unvoiced
speech. The first and second apparatus for synthesizing speech signals are
each coupled to the apparatus for determining voicing.
An apparatus for pitch epoch synchronous decoding of digital signals
representing encoded speech signals includes an input for receiving
digital signals and an apparatus for determining voicing of the input
digital signals. The apparatus for determining voicing is coupled to the
input. The apparatus also includes a first apparatus for synthesizing
speech signals using frequency domain techniques when the input digital
signal represents voiced speech and a second apparatus for synthesizing
speech signals using time domain techniques when the input digital signal
represents unvoiced speech. The first and second apparatus for
synthesizing speech signals each are coupled to the apparatus for
determining voicing.
An apparatus for pitch epoch synchronous encoding of speech signals
includes an input for receiving input speech signals and an apparatus for
determining voicing of the input speech signals. The apparatus for
determining voicing is coupled to the input. The apparatus further
includes a first device for characterizing the input speech signals using
frequency domain techniques, which is coupled to the apparatus for
determining voicing. The first characterizing device operates when the
input speech signals comprise voiced speech and provides frequency domain
characterized speech as output signals. The apparatus further includes a
second device for characterizing the input speech signals using time
domain techniques, which is also coupled to the apparatus for determining
voicing. The second characterizing device operates when the input speech
signals comprise unvoiced speech and provides characterized speech as
output signals. The apparatus also includes an encoder for encoding the
characterized speech to provide a digital output signal representing the
input speech signal, which encoder is coupled to the first and second
characterizing devices.
BRIEF DESCRIPTION OF THE DRAWING
The invention is pointed out with particularity in the appended claims.
However, a more complete understanding of the present invention may be
derived by referring to the detailed description and claims when
considered in connection with the figures, wherein like reference numbers
refer to similar items throughout the figures, and:
FIG. 1 is a simplified block diagram, in flow chart form, of a speech
digitizer in a transmitter in accordance with the present invention;
FIG. 2 is a simplified block diagram, in flow chart form, of a speech
synthesizer in a receiver for digital data provided by an apparatus such
as the transmitter of FIG. 1; and
FIG. 3 is a highly simplified block diagram of a voice communication
apparatus employing the speech digitizer of FIG. 1 and the speech
synthesizer of FIG. 2 in accordance with the present invention.
The exemplification set out herein illustrates a preferred embodiment of
the invention in one form thereof, and such exemplification is not
intended to be construed as limiting in any manner.
DETAILED DESCRIPTION OF THE DRAWING
As used herein, the terms "excitation", "excitation function", "driving
function" and "excitation waveform" have equivalent meanings and refer to
a waveform provided by linear predictive coding apparatus as one of the
output signals therefrom. As used herein, the terms "target", "excitation
target" and "target epoch" have equivalent meanings and refer to an epoch
selected first for characterization in an encoding apparatus and second
for later interpolation in a decoding apparatus. FIG. 1 is a simplified
block diagram, in flow chart form, of speech digitizer 15 in transmitter
10 in accordance with the present invention.
A primary component of voiced speech (e.g., "oo" in "shoot") is
conveniently represented as a quasi-periodic, impulse-like driving
function or excitation function having slowly varying envelope and period.
This period is referred to as the "pitch period" or epoch, comprising an
individual impulse within the driving function. Conversely, the driving
function associated with unvoiced speech (e.g., "ss" in "hiss") is largely
random in nature and resembles shaped noise, i.e., noise having a
time-varying envelope, where the envelope shape is a primary
information-carrying component.
The composite voiced/unvoiced driving waveform may be thought of as an
input to a system transfer function whose output provides a resultant
speech waveform. The composite driving waveform may be referred to as the
"excitation function" for the human voice. Thorough, efficient
characterization of the excitation function yields a better approximation
to the unique attributes of an individual speaker, which attributes are
poorly represented or ignored altogether in reduced bandwidth voice coding
schemata to date (e.g., LPC10e).
In the arrangement according to the present invention, speech signals are
supplied via input 11 to highpass filter 12. Highpass filter 12 is coupled
to frame based linear predictive coding (LPC) apparatus 14 via link 13.
LPC apparatus 14 provides an excitation function via link 16 to
autocorrelator 17.
Autocorrelator 17 estimates τ, the integer pitch period in samples (or
regions) of the quasi-periodic excitation waveform. The excitation
function and the τ estimate are input via link 18 to pitch loop filter
19, which estimates excitation function structure associated with the
input speech signal. Pitch loop filter 19 is well known in the art (see,
for example, "Pitch Prediction Filters In Speech Coding", by R. P.
Ramachandran and P. Kabal, in IEEE Transactions on Acoustics, Speech and
Signal Processing, vol. 37, no. 4, April 1989). The estimates for LPC
prediction gain (from frame based LPC apparatus 14), pitch loop filter
prediction gain (from pitch loop filter 19) and filter coefficient values
(from pitch loop filter 19) are used in decision block 22 to determine
whether input speech data represent voiced or unvoiced input speech data.
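By way of illustration only, the integer pitch period estimate performed by an autocorrelator may be sketched as follows. The function name `estimate_pitch_period`, the lag search range of 20 to 160 samples and the energy normalization are assumptions made for this sketch and are not taken from the present description.

```python
import math

def estimate_pitch_period(excitation, min_lag=20, max_lag=160):
    """Return (lag, score) for the lag maximizing the normalized autocorrelation.

    min_lag and max_lag are illustrative assumptions, not values from the text.
    """
    n = len(excitation)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        # Correlate the waveform with a copy of itself delayed by `lag` samples.
        num = sum(excitation[i] * excitation[i - lag] for i in range(lag, n))
        e0 = sum(excitation[i] ** 2 for i in range(lag, n))
        e1 = sum(excitation[i - lag] ** 2 for i in range(lag, n))
        denom = math.sqrt(e0 * e1)
        score = num / denom if denom > 0.0 else 0.0
        # Strict comparison keeps the smallest lag among equally good candidates.
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

# Synthetic impulse train with a 57-sample period (hypothetical test signal).
signal = [1.0 if i % 57 == 0 else 0.0 for i in range(240)]
tau, score = estimate_pitch_period(signal)
```

On a perfectly periodic signal, lags that are integer multiples of the true period score equally well; the strict comparison in the sketch retains the smallest such lag.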
Unvoiced excitation data are coupled via link 23 to block 24, where
contiguous RMS levels are computed. Signals representing these RMS levels
are then coupled via link 25 to vector quantizer codebooks 41, whose
general composition and function are well known in the art.
Typically, a 30 millisecond frame of unvoiced excitation comprising 240
samples is divided into 20 contiguous time slots. The excitation signal
occurring during each time slot is analyzed and characterized by a
representative level, conveniently realized as an RMS (root-mean-square)
level. This effective technique for the transmission of unvoiced frame
composition offers a level of computational simplicity not possible with
much more elaborate frequency-domain fast Fourier transform (FFT) methods,
without significant compromise in quality of the reconstructed unvoiced
speech signals.
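The slot-wise characterization described above may be sketched as follows, purely as an illustration. The function name `slot_rms_levels` is hypothetical; the 8 kHz sampling rate noted in the comment is inferred from 240 samples spanning 30 milliseconds.

```python
import math

def slot_rms_levels(frame, num_slots=20):
    """Split a frame into contiguous, equal-length slots; return one RMS level per slot."""
    if len(frame) % num_slots != 0:
        raise ValueError("frame length must be a multiple of num_slots")
    slot_len = len(frame) // num_slots
    levels = []
    for s in range(num_slots):
        slot = frame[s * slot_len:(s + 1) * slot_len]
        levels.append(math.sqrt(sum(x * x for x in slot) / slot_len))
    return levels

# 240-sample frame (30 ms at an inferred 8 kHz rate) -> 20 slots of 12 samples each.
frame = [0.5] * 120 + [-2.0] * 120
levels = slot_rms_levels(frame)
```

Each frame is thereby reduced from 240 samples to 20 representative levels before quantization.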
Voiced excitation data are frequency-domain processed in block 24', where
speech characteristics are analyzed on a "per epoch" basis. These data are
coupled via link 26 to block 27, wherein epoch positions are determined.
Following epoch position determination, data are coupled via link 28 to
block 27', where fractional pitch is determined. Data are then coupled via
link 28' to block 29, wherein excitation synchronous LPC analysis is
performed on the input speech given the epoch positioning data (from block
27), both provided via link 28'.
This process provides revised LPC coefficients and excitation function
which are coupled via link 30 to block 31, wherein a single excitation
epoch is chosen in each frame as an interpolation target. The single epoch
may be chosen randomly or via a closed loop process as is known in the
art. Excitation synchronous LPC coefficients (from LPC apparatus 29)
corresponding to the target excitation function are chosen as coefficient
interpolation targets and are coupled via link 30 to select interpolation
targets 31. Selected interpolation targets (block 31) are coupled via link
32 to correlate interpolation targets 33.
At the receiver, the LPC coefficients are interpolated to regenerate data
elided in the transmitter (discussed in connection with
FIG. 2, infra). As only one set of LPC coefficients and information
corresponding to one excitation epoch are encoded at the transmitter, the
remaining excitation waveform and epoch-synchronous coefficients must be
derived from the chosen "targets" at the receiver. Linear interpolation
between transmitted targets has been used with success to regenerate the
missing information, although other non-linear schemata are also useful.
Thus, only a single excitation epoch (i.e., voiced speech) is frequency
domain analyzed and encoded per frame at the transmitter, with the
intervening epochs filled in by interpolation at receiver 9.
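As a minimal sketch of the linear interpolation between transmitted targets, the following assumes that the two targets have equal length and that the number of intervening epochs is known; both are simplifying assumptions, and the function name `interpolate_epochs` is hypothetical. The same element-wise scheme could be applied to sets of LPC coefficients.

```python
def interpolate_epochs(prev_target, next_target, num_intervening):
    """Linearly interpolate, sample by sample, between two equal-length targets.

    Returns the `num_intervening` epochs lying strictly between the two targets.
    """
    if len(prev_target) != len(next_target):
        raise ValueError("targets must have equal length")
    epochs = []
    for k in range(1, num_intervening + 1):
        w = k / (num_intervening + 1)  # weight grows from near 0 toward 1
        epochs.append([(1.0 - w) * a + w * b
                       for a, b in zip(prev_target, next_target)])
    return epochs

prev_target = [0.0, 4.0, 0.0]
next_target = [0.0, 8.0, 4.0]
filled = interpolate_epochs(prev_target, next_target, 3)
```

With three intervening epochs, the middle sample moves through 5, 6 and 7 on its way from 4 to 8.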
Chosen epochs are coupled via link 32 to block 33, wherein chosen epochs in
adjacent frames (e.g., the chosen epoch in the preceding frame) are
cross-correlated in order to determine an optimum epoch starting index and
enhance the effectiveness of the interpolation process. By correlating the
two targets, the maximum correlation index shift may be introduced as a
positioning offset prior to interpolation. This offset improves on the
standard interpolation scheme by forcing the "phase" of the two targets to
coincide. Failure to perform this correlation procedure prior to
interpolation often leads to significant reconstructed excitation envelope
error at receiver 9 (FIG. 2, infra).
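The correlation step may be illustrated by the following sketch, which searches for the shift that maximizes the correlation between two targets and applies it as a positioning offset. A circular correlation over equal-length epochs is assumed here for simplicity; the function names `best_circular_shift` and `rotate` are hypothetical.

```python
def best_circular_shift(reference, target):
    """Return the circular shift of `target` maximizing its correlation with `reference`."""
    n = len(reference)
    if len(target) != n:
        raise ValueError("epochs must have equal length")
    best_shift, best_corr = 0, float("-inf")
    for shift in range(n):
        corr = sum(reference[i] * target[(i + shift) % n] for i in range(n))
        if corr > best_corr:
            best_shift, best_corr = shift, corr
    return best_shift

def rotate(epoch, shift):
    """Circularly rotate `epoch` left by `shift` samples."""
    shift %= len(epoch)
    return epoch[shift:] + epoch[:shift]

previous_target = [0.0, 1.0, 5.0, 1.0, 0.0, 0.0, 0.0, 0.0]
current_target = [0.0, 0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0]
offset = best_circular_shift(previous_target, current_target)
aligned = rotate(current_target, offset)
```

After rotation the two pulses coincide, so that a subsequent sample-by-sample interpolation blends like features rather than smearing two displaced peaks.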
The correlated target epochs are coupled via link 34 to cyclical shift 36',
wherein data are shifted or "rotated" in the data array. Shifted data are
coupled via link 37' and then fast Fourier transformed (FFT) (block 36").
Transformed data are coupled via link 37" and are then frequency domain
encoded (block 38). In receiver 9 (discussed in connection with FIG. 2,
infra), interpolation is used to regenerate information elided in
transmitter 10. As only one set of LPC coefficients and one excitation
epoch are encoded at the transmitter, the remaining excitation waveform
and epoch-synchronous coefficients must be derived from the chosen
"targets" at the receiver. Linear interpolation between transmitted
targets has been used with success to regenerate the missing information,
although other non-linear schemata are also useful.
Only one excitation epoch is frequency domain characterized (and the result
encoded) per frame of data, and only a small number of characterizing
samples are required to adequately represent the salient features of the
excitation epoch, e.g., four magnitude levels and sixteen phase levels may
be usefully employed. These levels are usefully allowed to vary
continuously, e.g., sixteen real-valued phases, four real-valued
magnitudes.
The frequency domain encoding process (blocks 36', 36", 38) usefully
comprises fast Fourier transforming (FFT) M samples of data
representing a single epoch, typically thirty to eighty samples, which are
desirably cyclically shifted (block 36') in order to reduce phase slope.
These M samples are desirably indexed such that the sample indicating the
epoch peak, designated the N-th sample, is placed in the first
position of the FFT input matrix, the samples preceding the N-th
sample are placed in the last N-1 positions (i.e., positions 2^n - N to
2^n, where 2^n is the frame size) of the FFT input matrix and the
(N+1)-th through M-th samples follow the N-th sample. The sum of
these two cyclical shifts effectively reduces frequency domain phase
slope, improving both coding precision and the interpolation
process within receiver 9 (FIG. 2). The data are "zero filled" by placing
zeroes in the 2^n - M elements of the FFT input matrix not occupied by
input data and the result is fast Fourier transformed, where 2^n
represents the size of the FFT input matrix.
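The indexing and zero filling described above may be sketched as follows. This is an illustration under stated assumptions: the function names `build_fft_input` and `dft` are hypothetical, the peak is identified by a 0-based index, and a naive discrete Fourier transform stands in for the FFT so that the example needs only the standard library.

```python
import cmath

def build_fft_input(epoch, peak_index, fft_size):
    """Put the epoch peak at position 0, wrap earlier samples to the end, zero-fill the rest.

    `peak_index` is the 0-based index of the peak (the N-th sample, N = peak_index + 1).
    """
    m = len(epoch)
    if m > fft_size:
        raise ValueError("epoch longer than FFT size")
    buf = [0.0] * fft_size
    # The peak and the samples after it occupy the first M - N + 1 positions.
    for j, x in enumerate(epoch[peak_index:]):
        buf[j] = x
    # The N - 1 samples before the peak occupy the last N - 1 positions.
    for j, x in enumerate(epoch[:peak_index]):
        buf[fft_size - peak_index + j] = x
    return buf

def dft(x):
    """Naive O(n^2) discrete Fourier transform, standing in for an FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

epoch = [0.0, 1.0, 4.0, 1.0, 0.0]  # peak at 0-based index 2
buf = build_fft_input(epoch, peak_index=2, fft_size=8)
spectrum = dft(buf)
phases = [cmath.phase(c) for c in spectrum]
```

For this symmetric test pulse the shifted input is even about position zero, so the transform is real and positive and every phase is essentially zero, which is the limiting case of a reduced phase slope.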
Amplitude and phase data in the frequency domain are desirably
characterized with relatively few samples. For example, the freque