BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an apparatus for producing wideband speech
signals from narrowband speech signals and, in particular, relates to an
apparatus for producing wideband speech from telephone-band speech.
2. Description of the Related Art
Among prior methods of expanding speech bandwidth, there is the method
described in Y. Yoshida, T. Abe, et al. "Recovery of wideband speech from
narrowband speech by codebook mapping", Denshi Joho Tsushin Gakkai
Shingakuho SP 93-61 (1993) (in Japanese language) and the method described
in Y. Cheng, D. O'Shaughnessy, P. Mermelstein, "Statistical recovery of
wideband speech from narrowband speech", Proceed. ICSLP 92 (1992), pp.
1577-1580.
The method of Yoshida et al. requires a large number of code words, for
instance 512 codes, to expand speech bandwidth reliably, since the method
relies on codebook mapping. On the other hand, the method of Cheng et al.
has a problem in the quality of the synthesized speech, since white noise,
which is not correlated with the original speech, is added.
SUMMARY OF THE INVENTION
An object of the present invention is therefore to produce a wideband
speech signal from a narrowband speech signal using a small number of
codes.
Another object of the present invention is to produce a wideband speech
signal from a telephone-band speech signal.
A further object of the present invention is to produce a clear wideband
speech signal from a narrowband speech signal.
In order to achieve the aforementioned objects, the present invention
obtains a wideband speech signal from a narrowband speech signal by adding
thereto a signal of a frequency range outside the bandwidth of the
narrowband speech signal. Preferably, the present invention extracts
features from the narrowband speech signal to create a synthesized
wideband signal which is added to the narrowband speech signal. In a
further preferred composition, the present invention separates a
narrowband speech signal into a spectrum information signal and a residual
information signal to expand the bandwidth of both information signals and
to combine them.
By means of the above composition, the present invention expands the
bandwidth of a speech signal without altering the information contained in
the narrowband speech signal. Further, the present invention can produce a
synthesized signal having a great correlation with the narrowband speech
signal. Still further, the present invention can freely vary the precision
of the system by clarifying the process of expanding the bandwidth.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other objects and features of the present invention will become
clear from the following description taken in conjunction with the
preferred embodiments thereof with reference to the accompanying drawings
throughout in which like parts are designated by like reference numerals,
and in which:
FIG. 1 is a block diagram illustrating the apparatus for expanding the
speech bandwidth of an embodiment in accordance with the present
invention;
FIG. 2 is a block diagram illustrating the spectral envelope converter
shown in FIG. 1;
FIG. 3 is a block diagram illustrating another spectral envelope converter
of the embodiment in accordance with the present invention;
FIG. 4 is a block diagram illustrating another spectral envelope converter
of the embodiment in accordance with the present invention;
FIG. 5 is a block diagram illustrating another spectral envelope converter
of the embodiment in accordance with the present invention;
FIG. 6 is a block diagram illustrating the residual converter shown in FIG.
1;
FIG. 7 is a block diagram illustrating the apparatus for expanding the
speech bandwidth of another embodiment in accordance with the present
invention;
FIG. 8 is a schematic drawing illustrating the waveform smoother shown in
FIG. 1;
FIGS. 9 and 10 illustrate a graph of the number of subspaces and mean
distances between the original word speech and the word speech synthesized
according to the present invention, in which FIG. 9 shows the results
obtained by male speech and FIG. 10 shows those obtained by female speech;
and
FIG. 11 illustrates the results of a subjective test for evaluating the
present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The preferred embodiments according to the present invention will be
described below with reference to the attached drawings.
FIG. 1 is a block diagram illustrating the apparatus for expanding the
speech bandwidth of an embodiment in accordance with the present
invention. In FIG. 1, 101 is an A-D converter that converts an original
narrowband speech analog signal input thereto into a digital speech
signal. The output of the A-D converter 101 is fed to a signal adder 103
and an addition signal generator 102. The addition signal generator 102
extracts features from the output signal of the A-D converter 101 and
outputs a signal whose bandwidth is wider than that of the input signal.
Signal adder 103
algebraically adds the output of the A-D converter 101 and the output of
the addition signal generator 102 and outputs the resulting signal. A D-A
converter 104 converts the digital signal outputted from the signal adder
103 into an analog signal which is outputted. The present embodiment
generates an output signal of a bandwidth which is wider than that of the
original signal by this composition.
Next, the composition of the addition signal generator 102 is described.
The addition signal generator 102 comprises a bandwidth expander 106 and a
filter section 105. The bandwidth expander 106 reads the output signal of
the A-D converter 101 to generate a signal of a bandwidth which is wider
than that of the read signal.
The output signal of the bandwidth expander 106 is fed to a filter section
105. The filter section 105 extracts frequency components which exist
outside the bandwidth of the original signal. For example, if the original
signal has frequency components of 300 Hz to 3,400 Hz, then the bandwidth
of the components extracted by the filter section 105 is the band below
300 Hz and the band above 3,400 Hz.
However, it is not necessary to extract all components which exist outside
the bandwidth of the original signal. The filter section 105 is preferably
configured with a digital filter, which may be either an FIR filter or an
IIR filter. FIR and IIR filters are well known and can be realized,
for example, by the compositions described in Simon Haykin, "Introduction
to Adaptive Filters", (Macmillan).
Next, the composition and operation of the bandwidth expander 106 are
described. In the bandwidth expander 106, an LPC (Linear Predictive
Coding) analyzer 107 first reads the output signal of the A-D converter
101 to perform a linear predictive coding (LPC) analysis. The LPC analysis
is well known and can be realized, for example, by the methods described
in Lawrence R. Rabiner, "Digital processing of speech signals",
(Prentice-Hall). These methods are incorporated by reference. The LPC
analyzer 107 obtains LPC coefficients, which are also called linear
prediction coefficients. The number p of the LPC coefficients, i.e. the
dimension p of the feature vector extracted by the LPC analyzer 107, is
chosen in relation to the sampling frequency; it is set to ten or sixteen
here, since the sampling frequency is 16 kHz in the speech analysis. The
LPC analyzer 107 then obtains other sets of feature amounts from the LPC
coefficients by transformations. These feature amounts include reflection
coefficients, PARCOR (partial correlation) coefficients, cepstrum
coefficients, LSP (line spectrum pair) coefficients and others, and they
are all spectral envelope parameters derived from the LPC coefficients.
Further, the LPC analyzer 107
obtains a residual signal from the LPC coefficients. The residual signal
is the difference between the output signal of the A-D converter 101 and
the predicted signal output from an FIR filter having filter coefficients
given by the LPC coefficients. That is, if the output signal of the A-D
converter 101 is denoted by $y(t_n)$, wherein $t_n$ denotes the present
sampling time and $t_{n-i}$ (i=1, 2, . . . , p) denotes the sampling time i
samples before, and the LPC coefficients are denoted by $a_i$, i=1, 2, . .
. , p, then the residual signal $r(t_n)$ is
$$r(t_n) = y(t_n) - a_1 y(t_{n-1}) - a_2 y(t_{n-2}) - \cdots - a_p y(t_{n-p}) \qquad (1)$$
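By way of illustration only, equation (1) can be sketched in Python as follows. The function name `lpc_residual` is hypothetical, and the treatment of samples before the start of the frame as zero is an assumption that the description does not specify.

```python
def lpc_residual(y, a):
    """Residual r(t_n) = y(t_n) - sum_{i=1..p} a_i * y(t_{n-i}), per equation (1).

    y: list of input samples; a: list of p LPC coefficients a_1..a_p.
    Assumption: samples before the start of y are treated as zero.
    """
    p = len(a)
    residual = []
    for n in range(len(y)):
        predicted = 0.0
        for i in range(1, p + 1):
            if n - i >= 0:
                predicted += a[i - 1] * y[n - i]
        residual.append(y[n] - predicted)
    return residual
```

For a signal that is perfectly predicted by a first-order predictor, such as a geometric decay with ratio 0.5 and $a_1 = 0.5$, the residual is zero everywhere except at the first sample.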
The spectral envelope parameters outputted from the LPC analyzer 107 are
converted, by a spectral envelope converter 109, into spectral envelope
parameters of a bandwidth which is wider than the bandwidth of the IIR
filter constructed with the spectral envelope parameters outputted from
the LPC analyzer 107. On the other hand, the residual signal outputted
from the LPC analyzer 107 is converted, by a residual converter 110, into
a residual signal of a bandwidth which is wider than that of the residual
signal outputted from the LPC analyzer 107. An LPC synthesizer 108
synthesizes a digital speech signal from the output of the spectral
envelope converter 109 and the output of the residual converter 110.
The spectral envelope converter 109 converts the input spectral envelope
parameters into spectral envelope parameters of a wider bandwidth as
follows. Namely, let a denote an input feature vector having p elements
comprising the input spectral envelope parameters, and let fa denote the
output or converted feature vector obtained by the k-th linear mapping
function, given by the matrix $B_k = (b_{ij})$ (i, j=1, . . . , p; k=1, . .
. , M, where M is the number of linear mapping functions). Then fa is given
by the following equation:
$$fa = B_k a, \quad \text{i.e.} \quad fa_i = \sum_{j=1}^{p} b_{ij} a_j, \quad i = 1, \ldots, p \qquad (2)$$
The spectral envelope converter 109 can also be realized by the composition
shown in FIG. 2. In this composition, the spectral envelope converter 109
comprises a spectral envelope codebook 201 that has M spectral envelope
codes, for instance sixteen codes, each of which is representative of a
set of spectral envelope parameters, and a linear mapping function
codebook 202 that has M linear mapping functions, each of which
corresponds to a spectral envelope code of the spectral envelope codebook
201 one to one. The spectral envelope codes are created by dividing a
multi-dimensional space of the spectral envelope parameters into M
subspaces and by averaging the spectral envelope parameter vectors
belonging to each subspace. For example, if the jth feature value of the
ith spectral envelope parameter vector belonging to a subspace is
$a_{ij}$, then the jth feature value $c_j$ of the spectral envelope code
corresponding to that subspace is
$$c_j = \frac{1}{R} \sum_{i=1}^{R} a_{ij} \qquad (3)$$
where R is the number of spectral envelope parameter vectors (feature
vectors) belonging to the subspace.
The spectral envelope parameters obtained by the LPC analyzer 107 are fed
to a distance calculator 203, and a linear mapping function calculator
205. The distance calculator 203 calculates the distance between the
spectral envelope parameters a(j), j=1, . . . , p outputted from the LPC
analyzer 107 and each spectral envelope code stored in spectral envelope
codebook 201. If the jth feature value of the ith spectral envelope code
is $c_{ij}$, then the distance $d_i$ is obtained by the equation
$$d_i = \sum_{j=1}^{p} \left( a(j) - c_{ij} \right)^2 \qquad (4)$$
where i=1, . . . , M, and M is the number of spectral envelope codes which
is equal to the number of the divided subspaces. The calculated results of
the distance calculator 203 are inputted to a comparator or selector 204.
The comparator 204 selects the minimum of the input distances and outputs,
to a linear mapping function calculator 205, the linear mapping function
stored in the linear mapping function codebook 202 that corresponds to the
spectral envelope code giving the selected minimum distance. The linear
mapping function calculator 205 performs computations similar to equation
(2) based on the spectral envelope parameters outputted from the LPC
analyzer 107 and the linear mapping function outputted from the comparator
204. The output of the linear mapping function calculator 205 is the
converted spectral envelope parameters in the present composition.
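The composition of FIG. 2 may be sketched, purely as an illustration, as a nearest-code search followed by one matrix-vector product. The function names are hypothetical, and the squared Euclidean distance is an assumption about the distance measure of the distance calculator 203.

```python
def squared_distance(a, code):
    """Assumed distance measure: sum of squared differences."""
    return sum((aj - cj) ** 2 for aj, cj in zip(a, code))


def apply_mapping(B, a):
    """Matrix-vector product fa = B a, as in equation (2)."""
    return [sum(bij * aj for bij, aj in zip(row, a)) for row in B]


def convert_envelope(a, codebook, mappings):
    """Select the nearest spectral envelope code and apply its linear mapping.

    codebook: M code vectors (role of codebook 201).
    mappings: M matrices, one per code (role of codebook 202).
    Returns the converted parameters and the index of the selected code.
    """
    distances = [squared_distance(a, code) for code in codebook]  # calculator 203
    k = min(range(len(codebook)), key=lambda i: distances[i])     # comparator 204
    return apply_mapping(mappings[k], a), k                       # calculator 205
```

Only one of the M mappings is evaluated per frame, so the cost per frame is M distance computations plus a single p-by-p product.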
In the following, a learning method for determining spectral envelope codes
and corresponding linear mapping functions is explained:
(a) A plurality of word speech samples of a wideband are prepared.
(b) Each of these word speech samples is LPC analyzed to obtain LPC
parameters of the wideband.
(c) Each of these word speech samples is transformed to corresponding word
speech samples of a narrowband by filtering each original speech using a
low frequency cut filter and a high frequency cut filter. Then, each word
speech sample of the narrowband is LPC analyzed to obtain LPC parameters
of the narrowband.
(d) Next, the multi-dimensional space of the feature vectors thus obtained
for the word speech samples of the narrowband is divided into an
appropriate number of subspaces. The division is made so as to satisfy the
following conditions:
<d1> For each of the M subspaces, the mean value of the feature vectors
belonging to that subspace is calculated. The central value obtained from
these M mean values is as close as possible to the central value obtained
by averaging all the feature vectors under consideration.
<d2> The number of feature vectors belonging to each subspace is
substantially equal. Namely, the feature vectors are uniformly distributed
over all subspaces.
(e) When the division into M subspaces is achieved, linear mapping
functions are sought for M subspaces. Since the relationship between each
original word speech and the corresponding narrowband word speech has been
obtained, each linear mapping function is determined so that a distance
between the original word speech of the wideband and a word speech mapped
into the corresponding subspace by that linear mapping function can be
minimized.
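Step (e) may be illustrated, under stated assumptions, by an ordinary least-squares fit for a single subspace. The description only requires that a distance be minimized; the choice of squared error, the function names, and the requirement that the narrowband vectors span the feature space (so that the normal equations are non-singular) are all assumptions of this sketch.

```python
def solve(G, C):
    """Solve G X = C by Gauss-Jordan elimination with partial pivoting.

    Assumption: G is square and non-singular.
    """
    n = len(G)
    M = [list(map(float, G[i])) + list(map(float, C[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [row[n:] for row in M]


def fit_linear_mapping(narrow, wide):
    """Least-squares matrix B minimizing sum ||w - B a||^2 over paired vectors.

    narrow: narrowband feature vectors a belonging to one subspace.
    wide: the corresponding wideband feature vectors w.
    """
    p, q = len(narrow[0]), len(wide[0])
    # Normal equations: (sum a a^T) B^T = sum a w^T
    G = [[sum(a[i] * a[j] for a in narrow) for j in range(p)] for i in range(p)]
    C = [[sum(a[i] * w[j] for a, w in zip(narrow, wide)) for j in range(q)]
         for i in range(p)]
    Bt = solve(G, C)
    return [[Bt[i][j] for i in range(p)] for j in range(q)]
```

The fit would be repeated once per subspace, using only the training pairs whose narrowband vector falls in that subspace.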
FIGS. 9 and 10 illustrate a graph of the number of subspaces versus the
mean distances between the original word speech and the word speech
synthesized according to the present invention. FIG. 9 illustrates results
obtained for male speech and FIG. 10 illustrates results obtained for
female speech.
It is to be noted that the mean distance is minimized at 16 subspaces when
100 word speech samples have been used for learning. In other words, given
sufficient learning with a sufficient number of word speech samples, no
more than 16 subspaces are needed. This fact indicates that the method of
the present invention can simplify the expansion operation from narrowband
to wideband, resulting in a quick response.
FIG. 3 shows another composition of the spectral envelope converter 109. In
the composition of FIG. 3, the spectral envelope codebook 201, the linear
mapping function codebook 202, the distance calculator 203, and the linear
mapping function calculator 205 are the same as in FIG. 2. The spectral
envelope parameters outputted from the LPC analyzer 107 are inputted to the
distance calculator 203 and the linear mapping function calculator 205. The
distance calculator 203 calculates the distance
between the spectral envelope parameters outputted from the LPC analyzer
107 and each spectral envelope code stored in the spectral envelope
codebook 201. The results are inputted to a weights calculator 301. The
weights calculator 301 calculates a weight corresponding to each spectral
envelope code by the following equation (5).
##EQU4##
where $w_i$ is the weight corresponding to the ith spectral envelope
code, and $d_i$ is the distance to the ith spectral envelope code
calculated by the distance calculator 203. On the other hand, the linear
mapping function calculator 205 reads the spectral envelope parameters a
outputted from the LPC analyzer 107 and each linear mapping function
$B_i$ (i=1, . . . , M) stored in the linear mapping function codebook
202 to transform the former into spectral envelope parameters fa by a
method similar to equation (2). The output of the weights calculator 301
and the output of the linear mapping function calculator 205 are inputted
to a linear transformation results adder 302. The linear transformation
results adder 302 calculates the converted spectral envelope parameters wa
by the following equation (6):
##EQU5##
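The composition of FIG. 3 may be sketched as follows, for illustration only. The exact forms of equations (5) and (6) are not reproduced in this text, so the sketch substitutes normalized inverse-distance weights and a plain weighted sum; both are assumptions, as are the function names and the small constant `eps` that guards against division by zero.

```python
def soft_convert(a, codebook, mappings, eps=1e-12):
    """Blend the outputs of all M linear mappings with distance-based weights.

    Assumption: weights are normalized inverse squared distances, standing in
    for equation (5); the blend is a weighted sum, standing in for equation (6).
    """
    # Distance calculator 203 (squared Euclidean distance assumed)
    d = [sum((aj - cj) ** 2 for aj, cj in zip(a, code)) for code in codebook]
    # Weights calculator 301 (assumed form)
    inv = [1.0 / (di + eps) for di in d]
    total = sum(inv)
    w = [v / total for v in inv]
    # Linear mapping function calculator 205: fa_i = B_i a for every i
    fa = [[sum(b * aj for b, aj in zip(row, a)) for row in B] for B in mappings]
    # Linear transformation results adder 302
    size = len(fa[0])
    wa = [sum(w[i] * fa[i][k] for i in range(len(fa))) for k in range(size)]
    return wa, w
```

Compared with selecting a single mapping, blending avoids abrupt changes in the converted parameters when the input moves across a subspace boundary, at the cost of evaluating all M mappings.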
Another composition of the spectral envelope converter 109 is shown in FIG.
4. In this composition, the spectral envelope converter 109 has a
narrowband spectral envelope codebook 401 that has a plurality of spectral
envelope codes having narrowband spectral envelope information and a
wideband spectral envelope codebook 402 that has spectral envelope codes
having wideband spectral envelope information and a one-to-one
correspondence with the narrowband spectral codes. The spectral envelope
parameters outputted from the LPC analyzer 107 are inputted to the
distance calculator 203 of FIG. 2. Using the equation (4), the distance
calculator 203 calculates the distance between the spectral envelope
parameters outputted from the LPC analyzer 107 and each narrowband
spectral envelope code stored in narrowband spectral envelope codebook 401
to output the calculated results to the comparator 403. The distance
calculator 203 can use the following equation (7) in place of the equation
(4):
$$d_i = \sum_{j=1}^{p} \left| a(j) - c_{ij} \right|^{x} \qquad (7)$$
where x may be a number other than 2. Preferably, x is between 1.5 and 2.
The comparator 403 extracts, from the wideband spectral envelope codebook
402, the wideband spectral envelope code corresponding to the
narrowband spectral envelope code that gives the minimum value of the
distances calculated by distance calculator 203. The extracted wideband
spectral envelope code is made to be the converted spectral envelope
parameters in the present composition.
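The codebook mapping of FIG. 4 may be sketched as follows, for illustration only. The function names are hypothetical, and the distance is assumed to be the sum of absolute differences raised to the power x, so that x = 2 gives a squared Euclidean distance.

```python
def power_distance(a, code, x=2.0):
    """Assumed distance: sum over features of |a(j) - c(j)| ** x."""
    return sum(abs(aj - cj) ** x for aj, cj in zip(a, code))


def codebook_map(a, narrow_codebook, wide_codebook, x=2.0):
    """Return the wideband code paired with the nearest narrowband code.

    narrow_codebook and wide_codebook play the roles of codebooks 401 and 402
    and must have the same number of entries, in one-to-one correspondence.
    """
    k = min(range(len(narrow_codebook)),
            key=lambda i: power_distance(a, narrow_codebook[i], x))
    return wide_codebook[k]
```

Because the output is always one of the stored wideband codes, the precision of this composition is bounded by the size of the codebooks.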
Another composition of the spectral envelope converter 109 is shown in
FIG. 5. In this composition, a neural network is used to convert the
spectral envelope parameters. Neural networks are well-known techniques,
and can be realized, for example, by the methods described in R. P.
Lippmann, "An introduction to computing with neural nets", IEEE ASSP
Magazine (1987), pp. 4-22. The spectral envelope
parameters outputted from the LPC analyzer 107 are inputted to a neural
network 501. If the inputted spectral envelope parameters are a(i) i=1, .
. . , p, then the converted spectral envelope parameters in the present
method, fa(k), are
##EQU7##
where $w_{ij}$ and $w_{jk}$ are respectively the weights between the ith
layer and the jth layer and the weights between the jth layer and the kth
layer. Besides the three-layer composition shown in FIG. 5, the neural
network may be constructed with a greater number of layers. Further, the
equations for calculation may be different from (8) and (9).
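A three-layer network of the kind shown in FIG. 5 may be sketched as follows, for illustration only. Equations (8) and (9) are not reproduced in this text, so the sigmoid hidden units, the linear output units, and the absence of bias terms are all assumptions of this sketch, as is the function name `forward`.

```python
import math


def sigmoid(v):
    """Logistic activation (assumed; the description does not name one)."""
    return 1.0 / (1.0 + math.exp(-v))


def forward(a, w_ij, w_jk):
    """Forward pass of a three-layer network.

    a: input spectral envelope parameters a(i).
    w_ij[i][j]: weight from input unit i to hidden unit j.
    w_jk[j][k]: weight from hidden unit j to output unit k.
    Returns the converted parameters fa(k).
    """
    n_hidden = len(w_ij[0])
    n_out = len(w_jk[0])
    hidden = [sigmoid(sum(a[i] * w_ij[i][j] for i in range(len(a))))
              for j in range(n_hidden)]
    return [sum(hidden[j] * w_jk[j][k] for j in range(n_hidden))
            for k in range(n_out)]
```

The weights would be obtained by training on pairs of narrowband and wideband spectral envelope parameters; the training procedure is outside the scope of this sketch.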
Next, a preferred example of a residual converter 110 is described with
reference to FIG. 6. The residual signal outputted from the LPC analyzer
107 is fed to a power calculator 601 and a nonlinear processor 602. The
power calculator 601 calculates the power of the residual signal by
summing the squares of the residual signal values and dividing the
result by the number of samples. Specifically, the power g is calculated by
$$g = \frac{1}{p} \sum_{i=1}^{p} r(i)^2 \qquad (10)$$
where r(i), i=1, . . . , p are the residual signal values. The nonlinear
processor 602 performs nonlinear processing of the residual signal to
obtain a processed residual signal. The processed residual signal is fed
to a power calculator 603 and a gain controller 604. The gain controller
604 multiplies the processed residual signal outputted from the nonlinear
processor 602 by the ratio of the power obtained by the power calculator
601 to the power obtained by the power calculator 603. That is, if the
residual signal values processed by the nonlinear processor 602 are nr(i),
i=1, . . . , p, then the residual signal values fnr(i), i=1, . . . , p
outputted from the gain controller 604 are calculated by
$$fnr(i) = \frac{g_1}{g_2} \cdot nr(i), \qquad (11)$$
where $g_1$ is the power obtained by the power calculator 601 and $g_2$
is the power obtained by the power calculator 603. These fnr(i) are the
outputs of the residual converter 110 of the present example.
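The residual converter of FIG. 6 may be sketched as follows, for illustration only. The function names are hypothetical, rectification is used as the nonlinear processing, and returning the processed signal unchanged when its power is zero is an assumption made to avoid division by zero.

```python
def power(r):
    """Mean of the squared sample values, per equation (10)."""
    return sum(v * v for v in r) / len(r)


def full_wave_rectify(r):
    return [abs(v) for v in r]


def half_wave_rectify(r):
    return [v if v > 0.0 else 0.0 for v in r]


def convert_residual(r, nonlinear=full_wave_rectify):
    """Nonlinear processing followed by gain control, per equation (11)."""
    g1 = power(r)                 # power calculator 601
    nr = nonlinear(r)             # nonlinear processor 602
    g2 = power(nr)                # power calculator 603
    if g2 == 0.0:                 # assumption: pass through a silent frame
        return list(nr)
    gain = g1 / g2                # gain controller 604
    return [gain * v for v in nr]
```

The sketch follows equation (11) as written and scales the samples by the ratio of the two powers. An implementation that is meant to make the output power exactly equal to $g_1$ would scale by the square root of that ratio instead.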
The nonlinear processor 602 can be realized using full-wave rectification
or half-wave rectification. Alternatively, the nonlinear processor 602 can
be realized by setting a threshold value and fixing the residual signal
values at the threshold value if the magnitude of the original residual
signal values exceeds the threshold value. In this case, the threshold
value is preferably determined based on the power obtained by the power
calculator 601. For example, the threshold value is set at $0.8 \cdot g_1$,
where $g_1$ is the power outputted from the power calculator 601. Other
methods of calculating the threshold value are also possible.
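The threshold-based nonlinear processing may be sketched as follows, for illustration only. The function name `clip_residual` is hypothetical, and preserving the sign of a clipped sample is an assumption, since the description does not state how negative values are handled.

```python
import math


def clip_residual(r, factor=0.8):
    """Fix samples whose magnitude exceeds the threshold at the threshold value.

    The threshold is factor * g1, where g1 is the mean squared value of r.
    Assumption: a clipped sample keeps its original sign.
    """
    g1 = sum(v * v for v in r) / len(r)
    threshold = factor * g1
    return [math.copysign(threshold, v) if abs(v) > threshold else v for v in r]
```

Clipping flattens the peaks of the residual, which introduces harmonics above the original band; the subsequent gain control then restores the signal level.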
Another composition of the nonlinear processor 602 can be realized using
the multi-pulse method. The multi-pulse method is well known and
described, for example, in B. S. Atal et al., "A new model of LPC
excitation for producing natural-sounding speech at low bit rates",
Proceed. ICASSP (1982), pp. 614-617. In this composition, the nonlinear
processor 602 generates multi-pulses to perform nonlinear processing of
the residual signal obtained by the LPC analyzer 107.
A second embodiment in accordance with the present invention is described
in the following. As shown in FIG. 7, the present embodiment has a
waveform smoother 111 between the bandwidth expander 106 and the filter
section 105 of FIG. 1.
The composition of the waveform smoother 111 is next described using the
schematic illustration of FIG. 8. When the output signal of the bandwidth
expander 106 is obtained for each predetermined time period (frame length),
a discontinuity arises at each frame boundary if successive frame signals
are simply concatenated and fed to the filter section 105 as they are. In
the composition of the second embodiment, the discontinuity between the
frame signals is mitigated by the waveform smoother 111. If the bandwidth
expander 106 is constructed so that successive frame signals overlap in
time, then the output frame signals overlap as shown in (a) and
(d) of FIG. 8. The waveform smoother 111 multiplies the output signals of
the bandwidth expander 106 by waveform smoothing functions to add them
over the time domain, as shown in FIG. 8. Specifically, the output frame
signals (a) and (d) of the bandwidth expander 106 are respectively
multiplied by the smoothing function (b) and (e) of FIG. 8. The resulting
signals (c) and (f) are then added over the time domain to output the
signal (g). Let the output of the waveform smoother 111 and the output of
the bandwidth expander 106 be respectively D(N, x) and F(N, x), where N is
the frame number and x is the time within each frame, and let the waveform
smoothing weight functions for the past frame and the present frame be
respectively CFB and CFF. Then
$$D(N, x) = CFB(x) \cdot F(N-1, x) + CFF(x) \cdot F(N, x). \qquad (12)$$
Preferably, CFB and CFF are defined as
$$CFB(x) = (-2 \cdot x + L)/L, \qquad (13)$$
$$CFF(x) = 2 \cdot x / L, \qquad (14)$$
where L is the frame length.
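Equations (12) to (14) may be sketched as a linear cross-fade over the overlapping portion of two successive frames, for illustration only. The function names are hypothetical. The sketch assumes that x counts samples from the start of the overlap and that the overlap covers half a frame, since CFB falls from 1 to 0 and CFF rises from 0 to 1 as x goes from 0 to L/2.

```python
def cfb(x, L):
    """Weight for the past frame, equation (13)."""
    return (-2.0 * x + L) / L


def cff(x, L):
    """Weight for the present frame, equation (14)."""
    return 2.0 * x / L


def crossfade(prev_tail, cur_head, L):
    """Smoothed output over the overlap region, equation (12).

    prev_tail: samples of frame N-1 inside the overlap.
    cur_head: samples of frame N inside the overlap.
    Assumption: both lists are time-aligned and at most L/2 samples long.
    """
    return [cfb(x, L) * p + cff(x, L) * c
            for x, (p, c) in enumerate(zip(prev_tail, cur_head))]
```

Because CFB(x) + CFF(x) = 1 for every x, a signal that is identical in both frames passes through the smoother unchanged.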
FIG. 11 illustrates results of a subjective test for evaluating the present
invention. Test conditions are as follows:
(a) Content of test
Listening test comparing original narrowband speech with the corresponding
wideband speech recovered according to the present invention.
(b) Manner of evaluation
Seven-step evaluation of whether the synthesized speech has an expanded
frequency range in comparison with the original narrowband speech.
0 point: not distinguishable,
1 (-1) point: slightly distinguishable from the original speech
(synthesized one),
2 (-2) point: distinguishable from the original speech (synthesized one),
and
3 (-3) point: clearly distinguishable from the original speech (synthesized
one)
(c) Number of tested persons
12 persons including researchers of phonetics.
(d) Number of linear mapping functions used
16 linear mapping functions having been obtained by learning 100 word
speech samples.
(e) Sample data used for the test
10 sentences by a single speaker each having a length of about ten seconds.
(f) Loudspeaker used: monaural loudspeaker
The test was conducted by having each person listen to one set of original
and synthesized speech at a time without being told which was the original.
Each person gave a score after listening to each set.
The abscissa in FIG. 11 denotes the values of the seven-step evaluation and
the ordinate denotes the totals summed over the 12 persons.
FIG. 11 indicates that the speech synthesized according to the present
invention is perceived as having a widely expanded bandwidth relative to
the original narrowband speech.
It is to be noted that the A-D converter and the D-A converter can be
omitted in the case where the input speech signal is already a digital
speech signal.
Although the present invention has been fully described in connection with
the preferred embodiments thereof with reference to the accompanying
drawings, it is to be noted that various changes and modifications are
apparent to those skilled in the art. Such changes and modifications are
to be understood as included within the scope of the present invention.