CROSS-REFERENCE TO RELATED APPLICATIONS AND MATERIALS
The following U.S. patent applications filed concurrently with the present
application and assigned to the assignee of the present application are
related to the present application and each is hereby incorporated herein
as if set forth in its entirety: "A METHOD AND APPARATUS FOR THE
PERCEPTUAL CODING OF AUDIO SIGNALS," by A. Ferreira and J. D. Johnston;
"AN ENTROPY CODER" by J. D. Johnston and J. Reeds; and "RATE LOOP
PROCESSOR FOR PERCEPTUAL ENCODER/DECODER," by J. D. Johnston.
FIELD OF THE INVENTION
The present invention relates to processing of information signals, and
more particularly, to the efficient encoding and decoding of monophonic
and stereophonic audio signals, including signals representative of voice
and music information, for storage or transmission.
BACKGROUND OF THE INVENTION
Consumer, industrial, studio and laboratory products for storing,
processing and communicating high quality audio signals are in great
demand. For example, so-called compact disc ("CD") and digital audio tape
("DAT") recordings for music have largely replaced the long-popular
phonograph record and cassette tape. Likewise, recently available DAT
recordings promise to provide greater flexibility and
high storage density for high quality audio signals. See, also, Tan and
Vermeulen, "Digital audio tape for data storage", IEEE Spectrum, pp. 34-38
(Oct. 1989). A demand is also arising for broadcast applications of
digital technology that offer CD-like quality.
While these emerging digital techniques are capable of producing high
quality signals, such performance is often achieved only at the expense of
considerable data storage capacity or transmission bandwidth. Accordingly,
much work has been done in an attempt to compress high quality audio
signals for storage and transmission.
Most of the prior work directed to compressing signals for transmission and
storage has sought to reduce the redundancies that the source of the
signals places on the signal. Thus, such techniques as ADPCM, sub-band
coding and transform coding described, e.g., in N. S. Jayant and P. Noll,
"Digital Coding of Waveforms," Prentice-Hall, Inc. 1984, have sought to
eliminate redundancies that otherwise would exist in the source signals.
In other approaches, the irrelevant information in source signals is sought
to be eliminated using techniques based on models of the human perceptual
system. Such techniques are described, e.g., in E. F. Schroeder and J. J.
Platte, "`MSC`: Stereo Audio Coding with CD-Quality and 256 kBIT/SEC,"
IEEE Trans. on Consumer Electronics, Vol. CE-33, No. 4, November 1987; and
Johnston, Transform Coding of Audio Signals Using Perceptual Noise
Criteria, Vol. 6, No. 2, IEEE J.S.A.C. (February 1988).
Perceptual coding, as described, e.g., in the Johnston paper, relates to a
technique for lowering required bitrates (or reapportioning available
bits) or total number of bits in representing audio signals. In this form
of coding, a masking threshold for unwanted signals is identified as a
function of frequency of the desired signal. Then, inter alia, the
coarseness of quantizing used to represent a signal component of the
desired signal is selected such that the quantizing noise introduced by
the coding does not rise above the noise threshold, though it may be quite
near this threshold. The introduced noise is therefore masked in the
perception process. While traditional signal-to-noise ratios for such
perceptually coded signals may be relatively low, the quality of these
signals upon decoding, as perceived by a human listener, is nevertheless
high.
Brandenburg et al, U.S. Pat. No. 5,040,217, issued Aug. 13, 1991, describes
a system for efficiently coding and decoding high quality audio signals
using such perceptual considerations. In particular, using a measure of
the "noise-like" or "tone-like" quality of the input signals, the
embodiments described in the latter system provide a very efficient
coding for monophonic audio signals.
It is, of course, important that the coding techniques used to compress
audio signals do not themselves introduce offensive components or
artifacts. This is especially important when coding stereophonic audio
information where coded information corresponding to one stereo channel,
when decoded for reproduction, can interfere or interact with coding
information corresponding to the other stereo channel. Implementation
choices for coding two stereo channels include so-called "dual mono"
coders using two independent coders operating at fixed bit rates. By
contrast, "joint mono" coders use two monophonic coders but share one
combined bit rate, i.e., the bit rate for the two coders is constrained to
be less than or equal to a fixed rate, but trade-offs can be made between
the bit rates for individual coders. "Joint stereo" coders are those that
attempt to use interchannel properties for the stereo pair for realizing
additional coding gain.
It has been found that the independent coding of the two channels of a
stereo pair, especially at low bit-rates, can lead to a number of
undesirable psychoacoustic artifacts. Among them are those related to the
localization of coding noise that does not match the localization of the
dynamically imaged signal. Thus the human stereophonic perception process
appears to add constraints to the encoding process if such mismatched
localization is to be avoided. This finding is consistent with reports on
binaural masking-level differences that appear to exist, at least for low
frequencies, such that noise may be isolated spatially. Such binaural
masking-level differences are considered to unmask a noise component that
would be masked in a monophonic system. See, for example, B. C. J. Moore,
"An Introduction to the Psychology of Hearing, Second Edition," especially
chapter 5, Academic Press, Orlando, Fla., 1982.
One technique for reducing psychoacoustic artifacts in the stereophonic
context employs the ISO-WG11-MPEG-Audio Psychoacoustic II [ISO] Model. In
this model, a second limit of signal-to-noise ratio ("SNR") is applied to
signal-to-noise ratios inside the psychoacoustic model. However, such
additional SNR constraints typically require the expenditure of additional
channel capacity or (in storage applications) the use of additional
storage capacity, at low frequencies, while also degrading the monophonic
performance of the coding.
SUMMARY OF THE INVENTION
Limitations of the prior art are overcome and a technical advance is made
in a method and apparatus for coding a stereo pair of high quality audio
channels in accordance with aspects of the present invention. Interchannel
redundancy and irrelevancy are exploited to achieve lower bit-rates while
maintaining high quality reproduction after decoding. While particularly
appropriate to stereophonic coding and decoding, the advantages of the
present invention may also be realized in conventional dual monophonic
stereo coders.
An illustrative embodiment of the present invention employs a filter bank
architecture using a Modified Discrete Cosine Transform (MDCT). In order
to code the full range of signals that may be presented to the system, the
illustrative embodiment advantageously uses both L/R (Left and Right) and
M/S (Sum/Difference) coding, switched in both frequency and time in a
signal dependent fashion. A new stereophonic noise masking model
advantageously detects and avoids binaural artifacts in the coded
stereophonic signal. Interchannel redundancy is exploited to provide
enhanced compression without degrading audio quality.
The time behavior of both Right and Left audio channels is advantageously
accurately monitored and the results used to control the temporal
resolution of the coding process. Thus, in one aspect, an illustrative
embodiment of the present invention provides processing of input signals
in terms of either a normal MDCT window, or, when signal conditions
indicate, shorter windows. Further, dynamic switching between RIGHT/LEFT
or SUM/DIFFERENCE coding modes is provided both in time and frequency to
control unwanted binaural noise localization, to prevent the need for
overcoding of SUM/DIFFERENCE signals, and to maximize the global coding
gain.
A typical bitstream definition and rate control loop are described which
provide useful flexibility in forming the coder output. Interchannel
irrelevancies are advantageously eliminated and stereophonic noise
masking improved, thereby to achieve improved reproduced audio quality in
jointly coded stereophonic pairs. The rate control method used in an
illustrative embodiment uses an interpolation between absolute thresholds
and masking threshold for signals below the rate-limit of the coder, and a
threshold elevation strategy under rate-limited conditions.
In accordance with an overall coder/decoder system aspect of the present
invention, it proves advantageous to employ an improved Huffman-like
entropy coder/decoder to further reduce the channel bit rate requirements,
or storage capacity for storage applications. The noiseless compression
method illustratively used employs Huffman coding along with a
frequency-partitioning scheme to efficiently code the frequency samples
for L, R, M and S, as may be dictated by the perceptual threshold.
The present invention provides a mechanism for determining the scale
factors to be used in quantizing the audio signal (i.e., the MDCT
coefficients output from the analysis filter bank) by using an approach
different from the prior art, and while avoiding many of the restrictions
and costs of prior quantizer/rate-loops. The audio signals quantized
pursuant to the present invention introduce less noise and encode into
fewer bits than the prior art.
These results are obtained in an illustrative embodiment of the present
invention whereby the utilized scale factor is iteratively derived by
interpolating between a scale factor derived from a calculated threshold
of hearing at the frequency corresponding to the frequency of the
respective spectral coefficient to be quantized and a scale factor derived
from the absolute threshold of hearing at said frequency until the
quantized spectral coefficients can be encoded within permissible limits.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 presents an illustrative prior art audio communication/storage
system of a type in which aspects of the present invention find
application, and provide improvement and extension.
FIG. 2 presents an illustrative perceptual audio coder (PAC) in which the
advances and teachings of the present invention find application, and
provide improvement and extension.
FIG. 3 shows a representation of a useful masking level difference factor
used in threshold calculations.
FIG. 4 presents an illustrative analysis filter bank according to an aspect
of the present invention.
FIGS. 5(a) through 5(e) illustrate the operation of various window
functions.
FIG. 6 is a flow chart illustrating window switching functionality.
FIG. 7 is a block/flow diagram illustrating the overall processing of input
signals to derive the output bitstream.
FIG. 8 illustrates certain threshold variations.
FIG. 9 is a flowchart representation of certain bit allocation
functionality.
FIG. 10 shows bitstream organization.
FIGS. 11a through 11c illustrate certain Huffman coding operations.
FIG. 12 shows operations at a decoder that are complementary to those for
an encoder.
FIG. 13 is a flowchart illustrating certain quantization operations in
accordance with an aspect of the present invention.
FIGS. 14(a) through 14(g) are illustrative windows for use with the filter
bank of FIG. 4.
DETAILED DESCRIPTION
1. Overview
To simplify the present disclosure, the following patents, patent
applications and publications are hereby incorporated by reference in the
present disclosure as if fully set forth herein: U.S. Pat. No. 5,040,217,
issued Aug. 13, 1991 by K. Brandenburg et al, U.S. patent application
Ser. No. 07/292,598, entitled Perceptual Coding of Audio Signals, filed
Dec. 30, 1988; J. D. Johnston, Transform Coding of Audio Signals Using
Perceptual Noise Criteria, IEEE Journal on Selected Areas in
Communications, Vol. 6, No. 2 (February 1988); International Patent
Application (PCT) WO 88/01811, filed Mar. 10, 1988; U.S. patent
application Ser. No. 07/491,373, entitled Hybrid Perceptual Coding, filed
Mar. 9, 1990, Brandenburg et al, Aspec: Adaptive Spectral Entropy Coding
of High Quality Music Signals, AES 90th Convention (1991); Johnston, J.,
Estimation of Perceptual Entropy Using Noise Masking Criteria, ICASSP,
(1988); J. D. Johnston, Perceptual Transform Coding of Wideband Stereo
Signals, ICASSP (1989); E. F. Schroeder and J. J. Platte, "`MSC`: Stereo
Audio Coding with CD-Quality and 256 kBIT/SEC," IEEE Trans. on Consumer
Electronics, Vol. CE-33, No. 4, November 1987; and Johnston, Transform
Coding of Audio Signals Using Perceptual Noise Criteria, Vol. 6, No. 2,
IEEE J.S.A.C. (February 1988).
For clarity of explanation, the illustrative embodiment of the present
invention is presented as comprising individual functional blocks
(including functional blocks labeled as "processors"). The functions these
blocks represent may be provided through the use of either shared or
dedicated hardware, including, but not limited to, hardware capable of
executing software. (Use of the term "processor" should not be construed
to refer exclusively to hardware capable of executing software.)
Illustrative embodiments may comprise digital signal processor (DSP)
hardware, such as the AT&T DSP16 or DSP32C, and software performing the
operations discussed below. Very large scale integration (VLSI) hardware
embodiments of the present invention, as well as hybrid DSP/VLSI
embodiments, may also be provided.
FIG. 1 is an overall block diagram of a system useful for incorporating an
illustrative embodiment of the present invention. At the level shown, the
system of FIG. 1 illustrates systems known in the prior art, but
modifications, and extensions described herein will make clear the
contributions of the present invention. In FIG. 1, an analog audio signal
101 is fed into a preprocessor 102 where it is sampled (typically at 48
kHz) and converted into a digital pulse code modulation ("PCM") signal 103
(typically 16 bits) in standard fashion. The PCM signal 103 is fed into a
perceptual audio coder 104 ("PAC") which compresses the PCM signal and
outputs the compressed PAC signal to a communications channel/storage
medium 105. From the communications channel/storage medium the compressed
PAC signal is fed into a perceptual audio decoder 107 which decompresses
the compressed PAC signal and outputs a PCM signal 108 which is
representative of the compressed PAC signal. From the perceptual audio
decoder, the PCM signal 108 is fed into a post-processor 109 which creates
an analog representation of the PCM signal 108.
An illustrative embodiment of the perceptual audio coder 104 is shown in
block diagram form in FIG. 2. As in the case of the system illustrated in
FIG. 1, the system of FIG. 2, without more, may equally describe certain
prior art systems, e.g., the system disclosed in the Brandenburg, et al
U.S. Pat. No. 5,040,217. However, with the extensions and modifications
described herein, important new results are obtained. The perceptual audio
coder of FIG. 2 may advantageously be viewed as comprising an analysis
filter bank 202, a perceptual model processor 204, a quantizer/rate-loop
processor 206 and an entropy coder 208.
The filter bank 202 in FIG. 2 advantageously transforms an input audio
signal in time/frequency in such manner as to provide both some measure of
signal processing gain (i.e. redundancy extraction) and a mapping of the
filter bank inputs in a way that is meaningful in light of the human
perceptual system. Advantageously, the well-known Modified Discrete Cosine
Transform (MDCT) described, e.g., in J. P. Princen and A. B. Bradley,
"Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing
Cancellation," IEEE Trans. ASSP, Vol. 34, No. 5, October, 1986, may be
adapted to perform such transforming of the input signals.
Features of the MDCT that make it useful in the present context include its
critical sampling characteristic, i.e. for every n samples into the filter
bank, n samples are obtained from the filter bank. Additionally, the MDCT
typically provides half-overlap, i.e. the transform length is exactly
twice the number of samples, n, shifted into the filterbank.
The half-overlap provides a good method of dealing with the control of
noise injected independently into each filter tap as well as providing a
good analysis window frequency response. In addition, in the absence of
quantization, the MDCT provides exact reconstruction of the input samples,
subject only to a delay of an integral number of samples.
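The critical sampling, half-overlap and exact reconstruction properties noted above may be illustrated with the following minimal sketch. It assumes the textbook MDCT definition with a sine window satisfying the Princen-Bradley condition, rather than the particular windows of the illustrative embodiment, and it assumes each frame holds its samples in chronological order. The function names (sine_window, mdct, imdct, analyze_synthesize) are hypothetical and chosen only for illustration.

```python
import math


def sine_window(n):
    """Sine window of length 2n; satisfies w[t]**2 + w[t + n]**2 == 1 (assumed window)."""
    return [math.sin(math.pi * (t + 0.5) / (2 * n)) for t in range(2 * n)]


def mdct(frame, window):
    """Forward MDCT: 2n windowed time samples in, n coefficients out."""
    n = len(frame) // 2
    n0 = (n + 1) / 2.0  # phase offset needed for time domain aliasing cancellation
    return [
        sum(window[t] * frame[t] * math.cos(math.pi / n * (t + n0) * (k + 0.5))
            for t in range(2 * n))
        for k in range(n)
    ]


def imdct(coeffs, window):
    """Inverse MDCT: n coefficients in, 2n windowed (still aliased) samples out."""
    n = len(coeffs)
    n0 = (n + 1) / 2.0
    return [
        window[t] * (2.0 / n) * sum(
            coeffs[k] * math.cos(math.pi / n * (t + n0) * (k + 0.5))
            for k in range(n))
        for t in range(2 * n)
    ]


def analyze_synthesize(signal, n):
    """Run half-overlapped frames through mdct/imdct and overlap-add the results."""
    window = sine_window(n)
    out = [0.0] * len(signal)
    coefficient_count = 0
    for start in range(0, len(signal) - 2 * n + 1, n):  # hop of n samples
        coeffs = mdct(signal[start:start + 2 * n], window)
        coefficient_count += len(coeffs)  # n coefficients per n new samples
        for t, value in enumerate(imdct(coeffs, window)):
            out[start + t] += value  # aliasing cancels where two frames overlap
    return out, coefficient_count
```

In this sketch every sample covered by two overlapping frames is recovered to within floating-point rounding; only the first and last n samples of the signal, which are covered by a single frame, remain aliased.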
One aspect in which the MDCT is advantageously modified for use in
connection with a highly efficient stereophonic audio coder is the
provision of the ability to switch the length of the analysis window for
signal sections which have strongly non-stationary components in such a
fashion that it retains the critically sampled and exact reconstruction
properties. The incorporated U.S. patent application by Ferreira and
Johnston, entitled "A METHOD AND APPARATUS FOR THE PERCEPTUAL CODING OF
AUDIO SIGNALS," (referred to hereinafter as the "filter bank application")
filed of even date with this application, describes a filter bank
appropriate for performing the functions of element 202 in FIG. 2.
The perceptual model processor 204 shown in FIG. 2 calculates an estimate
of the perceptual importance, noise masking properties, or just noticeable
noise floor of the various signal components in the analysis bank. Signals
representative of these quantities are then provided to other system
elements to provide improved control of the filtering operations and
organizing of the data to be sent to the channel or storage medium. Rather
than using the critical band by critical band analysis described in J. D.
Johnston, "Transform Coding of Audio Signals Using Perceptual Noise
Criteria," IEEE J. on Selected Areas in Communications, February 1988, an
illustrative embodiment of the present invention advantageously uses finer
frequency resolution in the calculation of thresholds. Thus instead of
using an overall tonality metric as in the last-cited Johnston paper, a
tonality method based on that mentioned in K. Brandenburg and J. D.
Johnston, "Second Generation Perceptual Audio Coding: The Hybrid Coder,"
AES 89th Convention, 1990 provides a tonality estimate that varies over
frequency, thus providing a better fit for complex signals.
The psychoacoustic analysis performed in the perceptual model processor 204
provides a noise threshold for the L (Left), R (Right), M (Sum) and S
(Difference) channels, as may be appropriate, for both the normal MDCT
window and the shorter windows. Use of the shorter windows is
advantageously controlled entirely by the psychoacoustic model processor.
In operation, an illustrative embodiment of the perceptual model processor
204 evaluates thresholds for the left and right channels, denoted
THR.sub.l and THR.sub.r. The two thresholds are then compared in each of
the illustrative 35 coder frequency partitions (56 partitions in the case
of an active window-switched block). In each partition where the two
thresholds vary between left and right by less than some amount, typically
2 dB, the coder is switched into M/S mode. That is, the left signal for
that band of frequencies is replaced by M=(L+R)/2, and the right signal is
replaced by S=(L-R)/2. The actual amount of difference that triggers the
last-mentioned substitution will vary with bitrate constraints and other
system parameters.
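The per-partition mode decision described above may be sketched as follows. This is an illustration only: it assumes the thresholds are positive power-like quantities, so that their difference in dB is 10 log10 of their ratio, and the names choose_modes, ms_encode, ms_decode and max_diff_db are hypothetical.

```python
import math


def choose_modes(thr_left, thr_right, max_diff_db=2.0):
    """Return "MS" or "LR" for each partition from the left/right thresholds.

    Assumes positive, power-like thresholds; max_diff_db is the illustrative
    2 dB figure and would vary with bitrate constraints in practice.
    """
    modes = []
    for tl, tr in zip(thr_left, thr_right):
        diff_db = abs(10.0 * math.log10(tl) - 10.0 * math.log10(tr))
        modes.append("MS" if diff_db < max_diff_db else "LR")
    return modes


def ms_encode(left, right):
    """Replace L and R by M=(L+R)/2 and S=(L-R)/2 for one partition."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Invert ms_encode: L=M+S and R=M-S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

Because M and S are simply the half-sum and half-difference of L and R, the substitution is exactly invertible in the absence of quantization.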
The same threshold calculation used for L and R thresholds is also used for
M and S thresholds, with the threshold calculated on the actual M and S
signals. First, the basic thresholds, denoted BTHR.sub.m and BTHR.sub.s, are
calculated. Then, the following steps are used to calculate the stereo
masking contribution of the M and S signals.
1. An additional factor is calculated for each of the M and S thresholds.
This factor, called MLD.sub.m and MLD.sub.s, respectively, is calculated by
multiplying the spread signal energy (as derived, e.g., in J. D. Johnston, "Transform
Coding of Audio Signals Using Perceptual Noise Criteria," IEEE J. on
Selected Areas in Communications, February 1988; K. Brandenburg and J. D.
Johnston, "Second Generation Perceptual Audio Coding: The Hybrid Coder,"
AES 89th Convention, 1990; and Brandenburg, et al U.S. Pat. No. 5,040,217)
by a masking level difference factor shown illustratively in FIG. 3. This
calculates a second level of detectability of noise across frequency in
the M and S channels, based on the masking level differences shown in
various sources.
2. The actual threshold for M (THR.sub.m) is calculated as THR.sub.m
=max(BTHR.sub.m,min(BTHR.sub.s,MLD.sub.s)) and the threshold for S is
calculated as THR.sub.s =max(BTHR.sub.s,min(BTHR.sub.m,MLD.sub.m)).
In effect, the MLD signal substitutes for the BTHR signal in cases where
there is a chance of stereo unmasking. It is not necessary to consider the
issue of M and S threshold depression due to unequal L and R thresholds,
because of the fact that L and R thresholds are known to be equal.
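The two steps above may be sketched for a single frequency partition as follows. The function name stereo_thresholds and its parameter names are hypothetical, and mld_factor stands in for the masking level difference factor of FIG. 3, whose values are not reproduced here.

```python
def stereo_thresholds(bthr_m, bthr_s, spread_m, spread_s, mld_factor):
    """Combine basic and MLD thresholds for the M and S channels.

    bthr_m, bthr_s: basic thresholds of the M and S signals.
    spread_m, spread_s: spread signal energies of the M and S signals.
    mld_factor: masking level difference factor for this partition (assumed input).
    """
    # Step 1: the additional MLD factor for each channel.
    mld_m = spread_m * mld_factor
    mld_s = spread_s * mld_factor
    # Step 2: the actual thresholds.
    thr_m = max(bthr_m, min(bthr_s, mld_s))
    thr_s = max(bthr_s, min(bthr_m, mld_m))
    return thr_m, thr_s
```

For instance, with bthr_m=4.0, bthr_s=1.0, spread_m=8.0, spread_s=0.8 and mld_factor=0.25, the M threshold stays at its basic value of 4.0, while the S threshold rises from 1.0 to 2.0.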
The quantizer and rate control processor 206 used in the illustrative coder
of FIG. 2 takes the outputs from the analysis bank and the perceptual
model, and allocates bits, noise, and controls other system parameters so
as to meet the required bit rate for the given application. In some
example coders this may consist of nothing more than quantization so that
the just noticeable difference of the perceptual model is never exceeded,
with no (explicit) attention to bit rate; in some coders this may be a
complex set of iteration loops that adjusts distortion and bitrate in
order to achieve a balance between bit rate and coding noise. A
particularly useful quantizer and rate control processor is described in
incorporated U.S. patent application by J. D. Johnston, entitled "RATE
LOOP PROCESSOR FOR PERCEPTUAL ENCODER/DECODER," (hereinafter referred to
as the "rate loop application") filed of even date with the present
application. Also desirably performed by the rate loop processor 206, and
described in the rate loop application, is the function of receiving
information from the quantized analyzed signal and any requisite side
information, inserting synchronization and framing information. Again,
these same functions are broadly described in the incorporated
Brandenburg, et al, U.S. Pat. No. 5,040,217.
Entropy coder 208 is used to achieve a further noiseless compression in
cooperation with the rate control processor 206. In particular, entropy
coder 208, in accordance with another aspect of the present invention,
advantageously receives inputs including a quantized audio signal output
from quantizer/rate-loop 206, performs a lossless encoding on the
quantized audio signal, and outputs a compressed audio signal to the
communications channel/storage medium 106.
Illustrative entropy coder 208 advantageously comprises a novel variation
of the minimum-redundancy Huffman coding technique to encode each
quantized audio signal. The Huffman codes are described, e.g., in D. A.
Huffman, "A Method for the Construction of Minimum Redundancy Codes",
Proc. IRE, 40:1098-1101 (1952) and T. M. Cover and J. A. Thomas,
Elements of Information Theory, pp. 92-101 (1991). The useful adaptations
of the Huffman codes advantageously used in the context of the coder of
FIG. 2 are described in more detail in the incorporated U.S. patent
application by J. D. Johnston and J. Reeds (hereinafter the "entropy coder
application") filed of even date with the present application and assigned
to the assignee of this application. Those skilled in the data
communications arts will readily perceive how to implement alternative
embodiments of entropy coder 208 using other noiseless data compression
techniques, including the well-known Lempel-Ziv compression methods.
The use of each of the elements shown in FIG. 2 will be described in
greater detail in the context of the overall system functionality; details
of operation will be provided for the perceptual model processor 204.
2.1. The Analysis Filter Bank
The analysis filter bank 202 of the perceptual audio coder 104 receives as
input pulse code modulated ("PCM") digital audio signals (typically 16-bit
signals sampled at 48 kHz), and outputs a representation of the input
signal which identifies the individual frequency components of the input
signal. Specifically, an output of the analysis filter bank 202 comprises
a Modified Discrete Cosine Transform ("MDCT") of the input signal. See, J.
Princen et al, "Sub-band Transform Coding Using Filter Bank Designs Based
on Time Domain Aliasing Cancellation," IEEE ICASSP, pp. 2161-2164 (1987).
An illustrative analysis filter bank 202 according to one aspect of the
present invention is presented in FIG. 4. Analysis filter bank 202
comprises an input signal buffer 302, a window multiplier 304, a window
memory 306, an FFT processor 308, an MDCT processor 310, a concatenator
311, a delay memory 312 and a data selector 314.
The analysis filter bank 202 operates on frames. A frame is conveniently
chosen as the 2N PCM input audio signal samples held by input signal
buffer 302. As stated above, each PCM input audio signal sample is
represented by M bits. Illustratively, N=512 and M=16.
Input signal buffer 302 comprises two sections: a first section comprising
N samples in buffer locations 1 to N, and a second section comprising N
samples in buffer locations N+1 to 2N. Each frame to be coded by the
perceptual audio coder 104 is defined by shifting N consecutive samples of
the input audio signal into the input signal buffer 302. Older samples are
located at higher buffer locations than newer samples.
Assuming that, at a given time, the input signal buffer 302 contains a
frame of 2N audio signal samples, the succeeding frame is obtained by (1)
shifting the N audio signal samples in buffer locations 1 to N into buffer
locations N+1 to 2N, respectively, (the previous audio signal samples in
locations N+1 to 2N may be either overwritten or deleted), and (2) by
shifting into the input signal buffer 302, at buffer locations 1 to N, N
new audio signal samples from preprocessor 102. Therefore, it can be seen
that consecutive frames contain N samples in common: the first of the
consecutive frames having the common samples in buffer locations 1 to N,
and the second of the consecutive frames having the common samples in
buffer locations N+1 to 2N. Analysis filter bank 202 is a critically
sampled system (i.e., for every N audio signal samples received by the
input signal buffer 302, the analysis filter bank 202 outputs a vector of
N scalars to the quantizer/rate-loop 206).
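The frame update just described may be sketched as follows. The sketch assumes a Python list whose index 0 corresponds to buffer location 1, and it assumes new samples arrive in chronological order, so that they are reversed on entry to keep older samples at higher locations. The name next_frame is hypothetical.

```python
def next_frame(buffer, new_samples):
    """Form the succeeding frame of 2N samples from the current buffer.

    buffer: current frame, index 0 = buffer location 1 (newest sample).
    new_samples: N new samples in chronological order (oldest first).
    """
    n = len(new_samples)
    if len(buffer) != 2 * n:
        raise ValueError("buffer must hold exactly 2N samples")
    # (1) Locations 1..N move to locations N+1..2N; the old upper half is dropped.
    shifted = buffer[:n]
    # (2) The N new samples enter locations 1..N, newest first.
    return list(reversed(new_samples)) + shifted
```

With N=2 and a buffer holding samples 1 through 4 as [4, 3, 2, 1], shifting in samples 5 and 6 yields [6, 5, 4, 3]; the two frames share samples 3 and 4, which move from locations 1 to N into locations N+1 to 2N.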
Each frame of the input audio signal is provided to the window multiplier
304 by the input signal buffer 302 so that the window multiplier 304 may
apply seven distinct data windows to the frame. Each data window is a
vector of scalars called "coefficients". While all seven of the data
windows have 2N coefficients (i.e., the same number as there are audio
signal samples in the frame), four of the seven only have N/2 non-zero
coefficients (i.e., one-fourth the number of audio signal samples in the
frame). As is discussed below, the data window coefficients may be
advantageously chosen to reduce the perceptual entropy of the output of
the MDCT processor 310.
The information for the data window coefficients is stored in the window
memory 306. The window memory 306 may illustratively comprise a random
access memory ("RAM"), read only memory ("ROM"), or other magnetic or
optical media. Drawings of seven illustrative data windows, as applied by
window multiplier 304, are presented in FIGS. 14(a) through 14(g). Typical
vectors of coefficients for each of the seven data windows presented in
FIGS. 14(a) through 14(g) are presented in Appendix A. As may be seen in
both FIG. 14 and in Appendix A,
some of the data window coefficients may be equal to zero.
Keeping in mind that the data window is a vector of 2N scalars and that the
audio signal frame is also a vector of 2N scalars, the data window
coefficients are applied to the audio signal frame scalars through
point-to-point multiplication (i.e., the first audio signal frame scalar
is multiplied by the first data window coefficient, the second audio
signal frame scalar is multiplied by the second data window coefficient,
etc.). Window multiplier 304 may therefore comprise seven microprocessors
operating in parallel, each performing 2N multiplications in order to
apply one of the seven data windows to the audio signal frame held by the
input signal buffer 302. The output of the window multiplier 304 is seven
vectors of 2N scalars to be referred to as "windowed frame vectors".
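The point-to-point multiplication may be sketched as follows. The window values used in the usage note are placeholders, not the coefficients of Appendix A, and the name apply_windows is hypothetical.

```python
def apply_windows(frame, windows):
    """Multiply one frame of 2N samples by each data window, element by element.

    Returns one windowed frame vector per data window.
    """
    for window in windows:
        if len(window) != len(frame):
            raise ValueError("each data window must have 2N coefficients")
    return [[c * x for c, x in zip(window, frame)] for window in windows]
```

For a toy frame with N=2, apply_windows([1.0, 2.0, 3.0, 4.0], [[0.5, 1.0, 1.0, 0.5], [0.0, 1.0, 0.0, 0.0]]) returns [[0.5, 2.0, 3.0, 2.0], [0.0, 2.0, 0.0, 0.0]]. The second placeholder window has only N/2 = 1 non-zero coefficient, so its windowed frame vector is zero everywhere else.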
The seven windowed frame vectors are provided by window multiplier 304 to
FFT processor 308. The FFT processor 308 performs an odd-frequency FFT on
each of the seven windowed frame vectors. The odd-frequency FFT is a
Discrete Fourier Transform evaluated at frequencies:
##EQU1##
where k=1, 3, 5, . . . ,2N, and f.sub.H equals one half the sampling rate.
The illustrative FFT processor 308 may comprise seven conventional
decimation-in-time FFT processors operating in parallel, each operating on
a different windowed frame vector. An output of the FFT processor 308 is
seven vectors of 2N complex elements, to be referred to collectively as
"FFT vectors".
FFT processor 308 provides the seven FFT vectors to both the perceptual
model processor 204 and the MDCT processor 310. The perceptual model
processor 204 uses the FFT vectors to direct the operation of the data
selector 314 and the quantizer/rate-loop processor 206. Details regarding
the operation of data selector 314 and perceptual model processor 204 are
presented below.
MDCT processor 310 performs an MDCT based on the real components of each of
the seven FFT vectors received from FFT processor 308. MDCT processor
310 may comprise seven microprocessors operating in parallel. Each such
microprocessor determines one of the seven "MDCT vectors" of N real
scalars based on one of the seven respective FFT vectors. For each FFT
vector, F(k), the resulting MDCT vector, X (k), is formed as follows:
##EQU2##
The procedure need run k only to N, not 2N, because of redundancy in the
result. To wit, for N<k.ltoreq.2N:
X(k)=-X(2N-k)
MDCT processor 310 provides the seven MDCT vectors to concatenator 311 and
delay memory 312.
As discussed above with reference to window multiplier 304, four of the
seven data windows have N/2 non-zero coefficients (see FIGS. 14(c) through 14(f)). This
means that four of the windowed frame vectors contain only N/2 non-zero
values. Therefore, the non-zero values of these four vectors may be
concatenated into a single vector of length 2N by concatenator 311 upon
output from MDCT processor 310. The resulting concatenation of these
vectors is handled as a single vector for subsequent purposes. Thus, delay
memory 312 is presented with four MDCT vectors, rather than seven.
Delay memory 312 receives the four MDCT vectors from MDCT processor 310 and
concatenator 311 for the purpose of providing temporary storage. Delay
memory 312 provides a delay of one audio signal frame (as defined by input
signal buffer 302) on the flow of the four MDCT vectors through the filter
bank 202. The delay is provided by (i) storing the two most recent
consecutive sets of MDCT vectors representing consecutive audio signal
frames and (ii) presenting as input to data selector 314 the older of the
consecutive sets of vectors. Delay memory 312 may comprise random access
memory (RAM) of size:
M.times.2.times.4.times.N
where 2 is the number of consecutive sets of vectors, 4 is the number of
vectors in a set, N is the number of elements in an MDCT vector, and M is
the number of bits used to represent an MDCT vector element.
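As a rough illustration of the one-frame delay and the memory sizing described above, the following Python sketch models delay memory 312 as a two-slot buffer. The class name `DelayMemory`, the helper `ram_bits`, and the example values N = 512 and M = 16 are assumptions chosen for illustration only; they are not taken from the embodiment.

```python
from collections import deque


def ram_bits(m_bits, n_elements, sets=2, vectors_per_set=4):
    """Size in bits of the delay memory: M x 2 x 4 x N."""
    return m_bits * sets * vectors_per_set * n_elements


class DelayMemory:
    """Holds the two most recent sets of MDCT vectors; outputs the older."""

    def __init__(self):
        # A deque with maxlen=2 discards the oldest set automatically.
        self._sets = deque(maxlen=2)

    def push(self, mdct_set):
        """Store the newest set and return the set from the previous frame.

        Returns None for the very first frame, when no older set exists.
        """
        self._sets.append(mdct_set)
        return self._sets[0] if len(self._sets) == 2 else None


if __name__ == "__main__":
    # Hypothetical values: 16-bit elements, 512-element MDCT vectors.
    print(ram_bits(m_bits=16, n_elements=512))
    delay = DelayMemory()
    for frame in ("frame0", "frame1", "frame2"):
        print(frame, "->", delay.push(frame))
```

In this sketch the set returned by `push` always lags the set just stored by one frame, which mirrors how the data selector would see the older set while the newer frame is being analyzed.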
Data selector 314 selects one of the four MDCT vectors provided by delay
memory 312 to be output from the filter bank 202 to quantizer/rate-loop
206. As mentioned above, the perceptual model processor 204 directs the
operation of data selector 314 based on the FFT vectors provided by the
FFT processor 308. Due to the operation of delay memory 312, the seven FFT
vectors provided to the perceptual model processor 204 and the four MDCT
vectors concurrently provided to data selector 314 are not based on the
same audio input frame, but rather on two consecutive input signal
frames--the MDCT vectors based on the earlier of the frames, and the FFT
vectors based on the later of the frames. Thus, the selection of a
specific MDCT vector is based on information contained in the next
successive audio signal frame. The criteria according to which the
perceptual model processor 204 directs the selection of an MDCT vector are
described in Section 2.2, below.
For purposes of an illustrative stereo embodiment, the above analysis
filter bank 202 is provided for each of the left and right channels.
2.2. The Perceptual Model Processor
A perceptual coder achieves success in reducing the number of bits required
to accurately represent high quality audio signals, in part, by
introducing noise associated with quantization of information bearing
signals, such as the MDCT information from the filter bank 202. The goal
is, of course, to introduce this noise in an imperceptible or benign way.
This noise shaping is primarily a frequency analysis instrument, so it is
convenient to convert a signal into a spectral representation (e.g., the
MDCT vectors provided by filter bank 202), compute the shape and amount of
the noise that will be masked by these signals, and inject it by
quantizing the spectral values. These and other basic operations are
represented in the structure of the perceptual coder shown in FIG. 2.
The perceptual model processor 204 of the perceptual audio coder 104
illustratively receives its input from the analysis filter bank 202 which
operates on successive frames. The perceptual model processor inputs then
typically comprise seven Fast Fourier Transform (FFT) vectors from the
analysis filter bank 202. These are the outputs of the FFT processor 308
in the form of seven vectors of 2N complex elements, each corresponding to
one of the windowed frame vectors.
In order to mask the quantization noise by the signal, one must consider
the spectral contents of the signal and the duration of a particular
spectral pattern of the signal. These two aspects are related to masking
in the frequency domain, where signal and noise are approximately steady
state--given the integration period of the hearing system--and also to
masking in the time domain, where signal and noise are subjected to
different cochlear filters. The shape and length of these filters are
frequency dependent.
Masking in the frequency domain is described by the concept of simultaneous
masking. Masking in the time domain is characterized by the concept of
premasking and postmasking. These concepts are extensively explained in
the literature; see, for example, E. Zwicker and H. Fastl,
"Psychoacoustics, Facts, and Models," Springer-Verlag, 1990. To make these
concepts useful to perceptual coding, they are embodied in different ways.
Simultaneous masking is evaluated by using perceptual noise shaping models.
Given the spectral contents of the signal and its description in terms of
noise-like or tone-like behavior, these models produce a hypothetical
masking threshold that governs the quantization level of each spectral
component. This noise shaping represents the maximum amount of noise that
may be introduced in the original signal without causing any perceptible
difference. A measure called the PERCEPTUAL ENTROPY (PE) uses this
hypothetical masking threshold to estimate the theoretical lower bound of
the bitrate for transparent encoding. See J. D. Johnston, "Estimation of
Perceptual Entropy Using Noise Masking Criteria," ICASSP, 1989.
Premasking characterizes the (in)audibility of a noise that starts some
time before the masker signal which is louder than the noise. The noise
amplitude must be more attenuated as the delay increases. This attenuation
level is also frequency dependent. If the noise is the quantization noise
attenuated by the first half of the synthesis window, experimental
evidence indicates the maximum acceptable delay to be about 1 millisecond.
This problem is very sensitive and can conflict directly with achieving a
good coding gain. Assuming stationary conditions--which is a false
premise--the coding gain is greater for larger transforms, but the
quantization error spreads back to the beginning of the reconstructed time
segment. So, if a transform length of 1024 points is used, with a digital
signal sampled at a rate of 48000 Hz, the noise will appear at most 21
milliseconds before the signal. This scenario is particularly critical
when the signal takes the form of a sharp transient in the time domain
commonly known as an "attack". In this case the quantization noise is
audible before the attack. The effect is known as pre-echo.
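The 21 millisecond figure follows directly from the transform length and the sampling rate. The short Python sketch below reproduces that arithmetic; the function name `segment_duration_ms` and the 256-point comparison length are illustrative assumptions, not values stated in this passage.

```python
def segment_duration_ms(transform_length, sample_rate_hz):
    """Duration in milliseconds of one transform segment."""
    return 1000.0 * transform_length / sample_rate_hz


if __name__ == "__main__":
    # 1024 points is the length discussed in the text; 256 is an
    # assumed shorter length included only for comparison.
    for length in (1024, 256):
        ms = segment_duration_ms(length, 48000)
        print(f"{length} points at 48000 Hz: {ms:.1f} ms")
```

Because the quantization noise can spread over the whole reconstructed segment, a shorter transform confines it to a proportionally shorter interval ahead of an attack.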
Thus, a fixed length filter bank is neither a good perceptual solution nor a
good signal processing solution for non-stationary regions of the signal. It
will be shown later that a possible way to circumvent this problem is to
improve the temporal resolution of the coder by reducing the
analysis/synthesis window length. This is implemented as a window
switching mechanism when conditions of attack are detected. In this way,
the coding gain achieved by using a long analysis/synthesis window will be
affected only when such detection occurs with a consequent need to switch
to a shorter analysis/synthesis window.
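One way to picture such switching logic is sketched below in Python. It assumes that an attack is flagged when a per-segment measure (for instance, a perceptual-entropy-like value) jumps relative to the previous segment. The threshold, the window lengths, and all names (`choose_window_lengths`, `jump_threshold`, `LONG_WINDOW`, `SHORT_WINDOW`) are hypothetical placeholders rather than parameters of the embodiment.

```python
LONG_WINDOW = 1024   # assumed long analysis/synthesis window length
SHORT_WINDOW = 256   # assumed short window length (illustrative only)


def choose_window_lengths(pe_values, jump_threshold=2.0):
    """Pick a window length per segment from a sequence of PE-like values.

    A segment is treated as an attack when its value exceeds the previous
    segment's value by more than jump_threshold (a hypothetical criterion).
    """
    lengths = []
    previous = None
    for pe in pe_values:
        attack = previous is not None and (pe - previous) > jump_threshold
        lengths.append(SHORT_WINDOW if attack else LONG_WINDOW)
        previous = pe
    return lengths


if __name__ == "__main__":
    print(choose_window_lengths([1.0, 1.2, 4.5, 4.4, 1.1]))
```

In this sketch only the segment where the jump occurs uses the short window, so the coding gain of the long window is given up only around the detected attack.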
Postmasking characterizes the (in)audibility of a noise when it remains
after the cessation of a stronger masker signal. In this case the
acceptable delays are on the order of 20 milliseconds. Given that the
longer transformed time segment lasts 21 milliseconds (1024 samples), no
special care is needed to handle this situation.
WINDOW SWITCHING
The PERCEPTUAL ENTROPY (PE) measure of a particular transform segment gives
the theoretical lower bound of bits/sample to code that segment
transparently. Due to its memory properties, which are related to
premasking protection, this measure shows a significant increase in the PE
value relative to its previous value--that of the previous segment--when
situations of strong non-stationarity of the signal (e.g., an attack) are
presented. This important property is | | |