BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech dialogue system for interfacing
between a human and a computer in the form of a dialogue using speech data.
2. Description of the Background Art
In recent years, the development of speech dialogue systems using speech
data as an interface between a human and a computer has advanced
considerably.
In a speech dialogue system, which is useful in a multi-media dialogue
system for displaying the visual data such as a graphic data and image
data along with the speech data output, when the human speaker utters
speech messages toward the microphone, the system recognizes these speech
messages, and outputs the appropriate response in speech data from a
loudspeaker, so as to carry out the dialogue with the human speaker.
For example, such a speech dialogue system may be employed in a hamburger
shop for taking the order from the customer. In this case, when the
customer utters the order such as "Two hamburgers and three orange juices"
toward the microphone, the system recognizes this speech input, and
outputs the synthetic speech response for making a confirmation such as
"Is it two hamburgers and three orange juices that you have just
ordered?". In response to this synthetic speech response, when the
customer utters "Yes", the recognized speech content is confirmed, and
subsequently notified to the shop worker.
In such a conventional speech dialogue system, however, when the
customer utters "Three hamburgers . . ." by mistake, it is not possible
for the customer to make a correction immediately; the customer must
first deny the confirmation response from the system such as "Is it three
hamburgers . . .?", and then make the correct
speech input such as "Two hamburgers . . ." again.
Moreover, in a case the customer uttered "Two hamburgers, one coke, and one
ice cream, please", and the system erroneously recognized this speech
input and outputs the synthetic speech response "Is it four potatoes, one
coke, and one ice cream you have just ordered?", the customer may very
well be tempted to make a correction by interrupting the synthetic speech
response as soon as the synthetic speech response reaches the portion
". . . four potatoes . . .", but even in such a case, in a conventional
speech dialogue system, the customer cannot make the correction until the
output of the entire synthetic speech response is completed.
For these reasons, in a conventional speech dialogue system, the dialogue
often requires a considerable amount of time, and it can be quite
cumbersome.
In other words, in a conventional speech dialogue system, it has not been
possible to carry out the reception of the speech input from the human
speaker and the output of the synthetic speech response simultaneously,
such that the input of the speech to be made by the human speaker can be
made only after the output of the entire synthetic speech response from
the system has been completed, and so consequently the dialogue can be
quite time consuming and inefficient especially when the system makes the
erroneous recognition of the speech input.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a speech
dialogue system capable of carrying out the reception of the speech input
from the human speaker and the output of the synthetic speech response
simultaneously, such that the input of the speech can be made by the human
speaker even while the output of the synthetic speech response from the
system is still in progress, and so consequently the dialogue can be less
time consuming and more efficient.
It is another object of the present invention to provide a speech dialogue
system capable of realizing a natural communication between the system and
the user of the system.
According to one aspect of the present invention there is provided a speech
dialogue system, comprising: microphone means for receiving a speech input
uttered by a human speaker and outputting microphone output signals;
speech recognition means for recognizing the speech input received by the
microphone means; synthetic speech response generation means for
generating a synthetic speech response appropriate for the speech input
recognized by the speech recognition means; loudspeaker means for
outputting the synthetic speech response to the human speaker; and
synthetic speech response cancellation means for cancelling the synthetic
speech response, which is outputted from the loudspeaker means and then
received by the microphone means, from the microphone output signals, to
obtain input signals to be supplied to the speech recognition means from
which the speech recognition means recognizes the speech input.
According to another aspect of the present invention there is provided a
speech dialogue system, comprising: input means for receiving input from a
user; input recognition means for recognizing the input received by the
input means; response generation means for generating a response including
a synthetic speech response, appropriate for the input recognized by the
input recognition means; output means for outputting the response
generated by the response generation means to the user; and control means
for controlling a mode of an output of the response to be outputted from
the output means, when there is the input from the user received by the
input means.
Other features and advantages of the present invention will become apparent
from the following description taken in conjunction with the accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic block diagram of a first embodiment of a speech
dialogue system according to the present invention.
FIG. 2 is a flow chart for the operation of the speech dialogue system of
FIG. 1.
FIG. 3 is a graph of subtracted signals used in the speech dialogue system
of FIG. 1 as a function of time, indicating the effect of the synthetic
speech response cancellation in the speech dialogue system of FIG. 1.
FIG. 4 is a detailed block diagram of one possible configuration for a
synthetic speech response generation unit in the speech dialogue system of
FIG. 1.
FIG. 5 is a schematic block diagram of a second embodiment of a speech
dialogue system according to the present invention.
FIG. 6 is a graph of speech input as a function of time, for explaining the
operation of the speech dialogue system of FIG. 5.
FIGS. 7A and 7B are graphs of a power and a pitch for a certain synthetic
speech response as a function of time, respectively, for explaining a
procedure to obtain a step gain in FLMS algorithm used in the speech
dialogue system of FIG. 5.
FIG. 8 is a flow chart for the procedure to obtain a step gain in FLMS
algorithm used in the speech dialogue system of FIG. 5.
FIG. 9 is a block diagram of a synthetic speech response cancellation unit
in a third embodiment of a speech dialogue system according to the present
invention.
FIGS. 10A and 10B are graphs of the output power of two smoothing filters in the
synthetic speech response cancellation unit of FIG. 9.
FIG. 11 is an enlarged view of central portions in the output power of
FIGS. 10A and 10B, shown in superposition of one on top of the other.
FIG. 12 is a graph of an accuracy for estimate of filter coefficients as a
function of time, indicating the effect of the synthetic speech response
cancellation unit of FIG. 9.
FIG. 13 is a graph of a speech recognition rate versus an accuracy for
estimate of filter coefficients, indicating the effect of the synthetic
speech response cancellation unit of FIG. 9.
FIG. 14 is a flow chart for the procedure to obtain a step gain in LMS
algorithm used in the synthetic speech response cancellation unit of FIG.
9.
FIG. 15 is a perspective view of an external configuration of the speech
dialogue system according to the present invention.
FIG. 16 is a schematic block diagram of a fourth embodiment of a speech
dialogue system according to the present invention.
FIG. 17 is a detailed block diagram of a speech segment detection unit in
the speech dialogue system of FIG. 16.
FIG. 18 is a graph of the power of an exemplary speech input, for explaining the speech
segment detection by the speech detection unit of FIG. 17.
FIG. 19 is a state transition diagram for explaining the speech segment
detection by the speech detection unit of FIG. 17.
FIG. 20 is a flow chart for the operation of the speech segment detection
by the speech detection unit of FIG. 17.
FIG. 21 is a frequency spectrum of a specific synthetic speech response,
indicating the effect of the use of the spectral pre-whitening of the
frequency spectrum.
FIG. 22 is a graph of an accuracy for estimate of filter coefficients as a
function of time, indicating the effect of the use of the addition of the
wide frequency band noise and the spectral pre-whitening of the frequency
spectrum.
FIG. 23 is a schematic block diagram of a fifth embodiment of a speech
dialogue system according to the present invention.
FIGS. 24A to 24E are illustrations of various responses to be determined by
a dialogue management unit in the speech dialogue system of FIG. 23.
FIG. 25 is a detailed block diagram of a response generation unit in the
speech dialogue system of FIG. 23.
FIG. 26 is an illustration of the response output timing data to be
generated at the response generation unit of FIG. 25.
FIG. 27 is an illustration of response interruption control data to be
generated by an interruption control unit of the speech dialogue system of
FIG. 23.
FIGS. 28 to 35 are diagrams indicating various exemplary modes of the
output of the response controlled by the response interruption control
data of FIG. 27.
FIG. 36 is an illustration of an evaluation of the content of the input to
be made at the interruption control unit of the speech dialogue system of
FIG. 23.
FIG. 37 is an illustration of an evaluation of the content of the response
to be made at the interruption control unit of the speech dialogue system
of FIG. 23.
FIGS. 38 to 41 are flow charts of various exemplary procedures for
controlling the mode of the output of the response by the interruption
control unit of the speech dialogue system of FIG. 23.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring now to FIG. 1, a first embodiment of a speech dialogue system
according to the present invention will be described in detail.
The speech dialogue system of this first embodiment comprises: a microphone
1 for receiving a speech input from a human speaker; a loudspeaker 8 for
outputting a synthetic speech response of the system in response to the
speech input; a synthetic speech response cancellation unit 2 for
cancelling the synthetic speech response superposed onto the speech input
entered by the human speaker at the microphone 1; a speech recognition
unit 5 for recognizing the content of the speech input entered by the
human speaker according to the output of the synthetic speech response
cancellation unit 2; a dialogue control unit 6 for selectively controlling
the synthetic speech response appropriate for the content of the speech
input recognized at the speech recognition unit 5; a synthetic speech
response generation unit 7 for outputting the synthetic speech response
selected by the dialogue control unit 6 to the loudspeaker 8 as well as to
the synthetic speech response cancellation unit 2; and a display unit 16
for displaying visual data such as graphic data and image data to the
human speaker.
The synthetic speech response cancellation unit 2 further comprises: a
look-up table 3a for memorizing various information on the various
available synthetic speech responses such as power information, pitch
information, amplitude information, information on voiced and unvoiced
sounds, and information on silent periods; an adaptive filter 3 for
correcting the synthetic speech response to be cancelled from the speech
input entered at the microphone 1 by calculating filter coefficients W of
the LMS (Least Mean Square)/Newton algorithm to be described below; and a
subtractor 4 for subtracting the output of the adaptive filter 3 from the
speech input entered at the microphone 1.
Here, the method of speech recognition used in the speech recognition unit
5 can be any known speech recognition method such as a word spotting
method or an HMM (Hidden Markov Model) method.
This speech dialogue system of FIG. 1 operates according to the flow chart
shown in FIG. 2, as follows.
First, when the human speaker enters the speech input at the microphone 1,
the speech signals of the speech input are supplied to the speech
recognition unit 5 through the synthetic speech response cancellation unit
2. At first, there is no synthetic speech response outputted from the
synthetic speech response generation unit 7, so that the processing at the
synthetic speech response cancellation unit 2 is not carried out and the
speech signals obtained at the microphone 1 are directly supplied to the
speech recognition unit 5.
Then, at the step ST1, the synthetic speech response appropriate for the
content of the speech input recognized by the speech recognition unit 5 is
selected by the dialogue control unit 6, and at the step ST2, the selected
synthetic speech response is transmitted from the synthetic speech
response generation unit 7 to the loudspeaker 8 while at the step ST3, the
selected synthetic speech response is transmitted to the adaptive filter
3.
Next, at the step ST4, the adaptive filter 3 calculates the filter
coefficients W of the LMS/Newton algorithm, which account for the effect
on the synthetic speech response received by the microphone 1 caused by
the reflection or the dissipation of the synthetic speech response
outputted from the loudspeaker 8 due to the environment surrounding the
system, and which are defined by the following equation (1).
$$W_{(k+1)} = W_{(k)} + 2\mu R'_{(k)} e_{(k)} X_{(k)} \qquad (1)$$
where: k is a factor indicating an iteration number; R' is an inverse of an
auto-correlation matrix of the synthetic speech response, which is to be
given by the look-up table 3a; $\mu$ is a convergence factor for
controlling the stability and the rate of adaptation; e is an error; and X
is an input vector representing the synthetic speech response.
Then, the output signal y to be supplied to the subtractor 4 is calculated
by multiplying the synthetic speech response X with the transpose $W^T$ of
the filter coefficients calculated according to the equation (1). In
other words, the output signal y of the adaptive filter 3 is given by the
following equation (2).
$$y = W^T X \qquad (2)$$
On the other hand, at the step ST5, the microphone 1 receives the speech
signals d to be supplied to the subtractor 4, representing the speech
input uttered by the human speaker which is superposed by the synthetic
speech response outputted by the loudspeaker 8.
Then, at the step ST6, the subtractor 4 subtracts the output signals y
supplied by the adaptive filter 3 from the speech signals d supplied by
the microphone 1. In other words, the subtracted signals S to be outputted
from the subtractor 4 are given by the following equation (3).
$$S = d - y \qquad (3)$$
Next, at the step ST7, the subtracted signals S obtained at the subtractor
4 are supplied to the speech recognition unit 5, in order to recognize the
content of the speech input uttered by the human speaker at the speech
recognition unit 5, and then to select the synthetic speech response
appropriate for the recognized speech input at the dialogue control unit
6, and to output the selected synthetic speech response from the synthetic
speech response generation unit 7.
Then, at the step ST8, the adaptive filter 3 updates the set of filter
coefficients W according to the new synthetic speech response newly
outputted from the synthetic speech response generation unit 7, and by
means of the step ST9, the above described process is repeated until the
completion of the speech input is indicated in a predetermined manner.
Thus, according to this first embodiment, the speech input uttered by the
human speaker is separated out from the speech signals received by the
microphone 1 by subtracting the synthetic speech response outputted from
the loudspeaker 8 appropriately modified by utilizing the LMS/Newton
algorithm, so that the human speaker can make the input of the speech
input at the microphone 1 even when the synthetic speech response is
outputted from the loudspeaker 8.
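The loop formed by the equations (1) to (3) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the error e is assumed to equal the subtracted signal S, the inverse auto-correlation matrix R' is passed in as a fixed matrix rather than read from the look-up table 3a, and the names `lms_newton_cancel` and `simulate` as well as the 3-tap echo path are hypothetical.

```python
import random


def lms_newton_cancel(x, d, r_inv, mu):
    """Cancel the echo of the reference x from the microphone signal d.

    x     : synthetic speech response samples (reference signal)
    d     : microphone samples (speech input plus echoed response)
    r_inv : L x L matrix standing in for R' of equation (1)
    mu    : convergence factor
    Returns (s, w): the subtracted signals S and the final coefficients W.
    """
    length = len(r_inv)
    w = [0.0] * length
    s = []
    for k in range(len(d)):
        # Input vector X_(k): the most recent L reference samples, newest first.
        xk = [x[k - i] if k - i >= 0 else 0.0 for i in range(length)]
        y = sum(wi * xi for wi, xi in zip(w, xk))        # equation (2)
        e = d[k] - y                                      # equation (3); assumed e = S
        s.append(e)
        # R' X_(k), then the coefficient update of equation (1).
        rx = [sum(r_inv[i][j] * xk[j] for j in range(length)) for i in range(length)]
        w = [wi + 2.0 * mu * e * rxi for wi, rxi in zip(w, rx)]
    return s, w


def simulate(n=2000, seed=0):
    """Hypothetical test data: white reference through a 3-tap echo path."""
    rng = random.Random(seed)
    h = [0.5, -0.3, 0.1]
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    d = [sum(h[i] * x[k - i] for i in range(3) if k - i >= 0) for k in range(n)]
    return x, d, h
```

With a unit-variance white reference, R' can be taken as the identity matrix, and the coefficients W then approach the simulated echo path while the subtracted signals S decay toward zero in the absence of a speech input.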
It is to be noted that, instead of the inverse of the auto-correlation
matrix R' used in the equation (1) for calculating the filter coefficients
W described above, the power of the speech, indicating such information as
the voiced and unvoiced sounds, vowel and consonant sounds, silent periods,
and sound duration, may be used. In a case of using the power p of the
sound in calculating the set of filter coefficients W of the LMS/Newton
algorithm, the equation (1) described above should be replaced by the
following equation (4).
$$W_{(k+1)} = W_{(k)} + 2\left(\frac{\mu}{p_{(k)} L}\right) e_{(k)} X_{(k)} \qquad (4)$$
where L is a dimension of the speech input vector, and the factor
$2(\mu/(p_{(k)} L))\, e_{(k)}$ is the step gain. In such a case, since the
characteristics of the synthetic speech response such as power information
and pitch information are memorized in the look-up table 3a in advance,
the set of filter coefficients W can be calculated according to the
characteristics of the selected synthetic speech response accurately.
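A single coefficient update according to the equation (4) may be sketched as follows. This is a minimal sketch under assumptions: the power p is simply passed in as an argument instead of being read from the look-up table 3a, the function name `power_normalized_update` is hypothetical, and skipping the update when the power is zero is an added safeguard not stated in the description.

```python
def power_normalized_update(w, xk, e, p, mu, eps=1e-12):
    """Return the updated coefficients W_(k+1) following equation (4).

    w  : current coefficients W_(k)
    xk : input vector X_(k) (same length as w)
    e  : error e_(k)
    p  : power p_(k) of the synthetic speech response at this timing
    mu : convergence factor
    """
    length = len(w)
    if p <= eps:
        # Assumption: leave W unchanged when the reference carries no power.
        return list(w)
    gain = 2.0 * (mu / (p * length)) * e   # step gain of equation (4)
    return [wi + gain * xi for wi, xi in zip(w, xk)]
```

Because the step gain is divided by the power, scaling the reference signal (and hence the error) by a constant leaves the coefficient update unchanged, which is what makes the adaptation insensitive to the level of the synthetic speech response.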
FIG. 3 shows characteristics of the level of the subtracted signals S
obtained at the subtractor 4, where a curve C1 represents a case in which
the synthetic speech response X has a constant power level, and a curve C2
represents a case in which the synthetic speech response X has the power
level characteristic indicated by a curve C3 which is memorized in the
look-up table 3a. As can be seen clearly in FIG. 3, the cancellation of
the synthetic speech response from the speech signals can be achieved more
effectively and accurately by carrying out the LMS/Newton algorithm using
the power information memorized in the look-up table 3a (a case of the
curve C2).
It is also to be noted that, in a case of outputting not only the synthetic
speech response but also a music from the loudspeaker 8, the synthetic
speech response generation unit 7 can be formed as shown in FIG. 4.
Namely, in such a case, the synthetic speech response generation unit 7
comprises: a speech synthesizing unit 10 for outputting the synthetic
speech response signals; a music synthesizing unit 11 for outputting the
music signals; and a mixer 9 for mixing the synthetic speech response
signals with the music signals. Here, the characteristics of the music
signals can easily be obtained from the musical notes used in the music,
so that by memorizing these characteristics in the look-up table 3a in
advance, the music signals can be cancelled from the speech signals
received at the microphone 1 in a manner similar to the cancellation of
the synthetic speech response described above.
In addition, the acoustic signals other than the speech and the music
including a natural sound such as a bird song and a buzzer sound may also
be incorporated. The cancellation of the buzzer sound can be achieved by
utilizing the fact that it is a periodical sound.
Furthermore, the cancellation of the random background noise can also be
achieved in a similar manner, by utilizing the fact that the random noise
is irregular but constantly present.
In a case where the signal outputted from the synthetic speech response
generation unit 7 is a wide frequency band noise (white noise), it is
known to be easy to estimate the set of filter coefficients W between the
loudspeaker 8 and the microphone 1. Now, the vocal sound (vowel sound) in
the speech signals gives the line spectrum in the short time frequency
spectrum as it is the periodical signal and has a property of being not
constantly present. For this reason, the spectral components for the vocal
sound (vowel sound) in the speech signals are not really distributed in
the wide frequency band, and this deteriorates the accuracy of the
estimation of the set of filter coefficients. Here, however, by adopting a
configuration shown in FIG. 4, the wide frequency band signals such as the
noise or the music can be added to the portions without the frequency
components in the synthetic speech response, so that it is possible to
improve the accuracy of the LMS and FLMS algorithms.
Referring now to FIG. 5, a second embodiment of a speech dialogue system
according to the present invention will be described in detail. Here,
those features which are substantially equivalent to the corresponding
features in the first embodiment described above will be given the same
reference numerals in the figure and their descriptions will be omitted.
This second embodiment differs from the first embodiment described above in
that, as shown in FIG. 5, there is provided a filter coefficient updating
control unit 15 between the speech recognition unit 5 and the dialogue
control unit 6, which controls the updating of the filter coefficients at
the adaptive filter 3. This filter coefficient updating control unit 15
functions to improve the accuracy of the estimation of the set of filter
coefficients at the adaptive filter 3 at the period at which there is a
speech input from the human speaker.
In this second embodiment, in estimating the filter coefficients W by using
the LMS/Newton algorithm, the past filter coefficients W are retained at
intervals of 100 ms over the most recent five seconds, for example. Namely, the
following filter coefficients are temporarily memorized in the adaptive
filter 3.
______________________________________
$W_{0}$      for the present timing
$W_{-1}$     for 100 ms before the present timing
$W_{-2}$     for 200 ms before the present timing
.
.
$W_{-50}$    for 5 sec before the present timing
______________________________________
Then, when the speech input is recognized at the speech recognition unit 5,
the setting of the filter coefficients W at the adaptive filter 3 is
changed to the filter coefficients before the utterance of the speech
input. For example, in a case the speech input has been entered by the
human speaker 750 ms ago, the filter coefficients $W_{0}$ for the present
timing are changed to the filter coefficients $W_{-8}$ for 800 ms before
the present timing. The reason for the effectiveness of this change of the
filter coefficient setting can be explained in conjunction with FIG. 6 as
follows.
In FIG. 6, a curve C4 indicates the synthetic speech response signals, and
a curve C5 indicates the speech input signals entered by the human
speaker. In the synthetic speech response cancellation unit 2, the
synthetic speech response is cancelled by updating the filter coefficients
every 100 ms, while in the speech recognition unit 5, the start point
$t_S$ and the end point $t_E$ of the speech input are detected.
Meanwhile, the filter coefficient updating control unit 15 makes a
judgment every 100 ms as to whether to update the present estimate
$W_{0}$ for the filter coefficients straightforwardly or to use a past
estimate $W_{i}$ (i=-1 to -50), according to the start point $t_S$
detected by the speech recognition unit 5. In this manner, it becomes
possible in this second embodiment to obtain more accurate estimate for
the set of filter coefficients W at the adaptive filter 3, even for the
period at which there is a speech input from the human speaker, so that
the overall efficiency of the cancellation of the synthetic speech
response at the synthetic speech response cancellation unit 2 can be
improved.
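The retention of past filter coefficients and the return to an estimate preceding the utterance may be sketched as follows. This is an illustrative sketch only: the class name `CoefficientHistory` and its methods are hypothetical, and rounding the elapsed time up to the next 100 ms boundary is an assumption chosen to reproduce the example above, in which an utterance begun 750 ms ago selects the coefficients of 800 ms before the present timing.

```python
import math
from collections import deque


class CoefficientHistory:
    """Keeps snapshots of W at fixed intervals, newest first."""

    def __init__(self, interval_ms=100, span_ms=5000):
        self.interval_ms = interval_ms
        # One slot for the present timing plus span_ms / interval_ms past slots.
        self.snapshots = deque(maxlen=span_ms // interval_ms + 1)

    def push(self, w):
        """Store a copy of the current estimate; call once per interval."""
        self.snapshots.appendleft(list(w))

    def index_before(self, elapsed_ms):
        """Number of intervals to step back to precede the utterance start."""
        return math.ceil(elapsed_ms / self.interval_ms)

    def rollback(self, elapsed_ms):
        """Return the stored estimate from before the utterance started."""
        i = min(self.index_before(elapsed_ms), len(self.snapshots) - 1)
        return list(self.snapshots[i])
```

For instance, `rollback(750)` returns the snapshot stored eight intervals earlier, and a request older than the retained five seconds falls back to the oldest snapshot available.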
Now, the procedure for estimating the accurate set of filter coefficients
according to some internal information such as that of time series for
power and pitch utilized in synthesizing the synthetic speech response in
the speech dialogue system of FIG. 5 will be described. Here, as an
illustrative example, a case of a specific synthetic speech response of
"torikeshimasu" (meaning "we are cancelling" in Japanese) will be
described. For this specific synthetic speech response of "torikeshimasu",
the power and the pitch as a function of time appear as shown in FIGS. 7A
and 7B.
In the speech dialogue system of FIG. 5, the step gain for FLMS used in
updating the filter coefficients is obtained according to the convergence
factor determined by the flow chart of FIG. 8 as follows. Here, the FLMS
estimates a transfer function which is a frequency spectrum of the filter
coefficients.
First, at the step ST11, for the first timing n=0, whether it is a silent
period or not is judged from the power information shown in FIG. 7A.
In a case it is judged as a silent period at the step ST11, next at the
step ST14, all of the convergence factors $\mu(f)$ for FLMS for all the
frequencies are set equal to zero. By this setting, the adaptive
estimation is suspended, so that the estimate for the transfer function
will be unaffected even when there is an input of a noise from the
microphone 1 during the silent period.
On the other hand, in a case it is judged as not a silent period at the
step ST11, next at the step ST12, whether it is a vowel sound or a
consonant sound is judged. This judgement can be made easily as the
phoneme is already known in advance.
In a case it is judged as a consonant sound at the step ST12, next at the
step ST15, whether the power is over the predetermined threshold level
(such as a surrounding environmental noise level plus 20 dB) or not is
judged. If it is judged to be over the threshold level at the step ST15,
all of the convergence factors $\mu(f)$ for FLMS for all the frequencies
are set equal to a predetermined constant convergence factor C at the step
ST17, whereas otherwise all of the convergence factors $\mu(f)$ for FLMS
for all the frequencies are set equal to zero at the step ST16.
On the other hand, in a case it is judged as a vowel sound at the step
ST12, next at the step ST13, whether the power is over the predetermined
threshold level (such as a surrounding environmental noise level plus 20
dB) or not is judged. If it is judged to be not over the threshold level
at the step ST13, all of the convergence factors $\mu(f)$ for FLMS for all
the frequencies are set equal to zero at the step ST18. On the contrary,
if it is judged to be over the threshold level at the step ST13, the
convergence factors $\mu(f)$ for FLMS are set to be such that $\mu(f)=C$
for the frequencies which are in a range of $\pm(1/3)f_p$ around the
integer multiples of the pitch frequency $f_p$ at each timing indicated
in FIG. 7B, and $\mu(f)=0$ for the remaining frequencies which are located
outside of this range. That is,
$$\mu(f) = C \quad \text{for } n f_p - \tfrac{1}{3} f_p < f < n f_p + \tfrac{1}{3} f_p$$
$$\mu(f) = 0 \quad \text{otherwise}$$
where n is an integer.
This convergence factor setting procedure is then repeated for all of the
timings by means of the step ST20, where the timing is updated in units of
10 ms, for example.
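The convergence factor setting for one timing may be sketched as follows. This is a hedged sketch rather than the procedure of FIG. 8 itself: the function name `convergence_factors` and its arguments are hypothetical, the silent, vowel and power information is assumed to be supplied by the caller (in the embodiment it is known in advance from the synthesis), and the integer n is assumed to start from 1 so that no band is placed around zero frequency.

```python
def convergence_factors(freqs, is_silent, is_vowel, power_db,
                        threshold_db, pitch_hz, c):
    """Return the convergence factor mu(f) for each frequency in freqs.

    freqs        : frequencies (Hz) of the FLMS bins
    is_silent    : True for a silent period
    is_vowel     : True for a vowel sound, False for a consonant sound
    power_db     : power of the synthetic speech response at this timing
    threshold_db : threshold level, e.g. environmental noise level plus 20 dB
    pitch_hz     : pitch frequency f_p at this timing (used for vowels only)
    c            : predetermined constant convergence factor C
    """
    # Silent period, or power not over the threshold: no adaptation at all.
    if is_silent or power_db <= threshold_db:
        return [0.0] * len(freqs)
    # Consonant over the threshold: adapt at every frequency.
    if not is_vowel:
        return [c] * len(freqs)
    # Vowel over the threshold: adapt only within (1/3) f_p of a harmonic n * f_p.
    factors = []
    for f in freqs:
        n = round(f / pitch_hz)          # nearest harmonic number
        near = n >= 1 and abs(f - n * pitch_hz) < pitch_hz / 3.0
        factors.append(c if near else 0.0)
    return factors
```

Since the bands of width $\pm(1/3)f_p$ around adjacent harmonics do not overlap, checking only the nearest harmonic is sufficient.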
Thus, in this second embodiment, the updating of the estimate for the
filter coefficients is carried out by placing more weight on the
frequency components of the synthetic speech response having larger
power.
Referring now to FIG. 9, a third embodiment of a speech dialogue system
according to the present invention will be described in detail.
In this third embodiment, the synthetic speech response cancellation unit 2
of the first or second embodiment described above is replaced by a
configuration shown in FIG. 9, while the rest of the speech dialogue
system is substantially the same as the first or second embodiment
described above, in order to obtain the filter coefficients stably at a
high accuracy even in a case there is a large fluctuation in the power of
the speech input signals.
In this third embodiment, the synthetic speech response cancellation unit
2A comprises: an A/D converter 31 for A/D converting the synthetic speech
response from the synthetic speech response generation unit 7; an A/D
converter 32 for A/D converting the speech input from the microphone 1;
first and second smoothing filters 33 and 34 for smoothing the A/D
converted synthetic speech response power signal, using different time
constants; a switching unit 35 for judging whether to carry out the
adaptation by the adaptive filter 3 according to the outputs of the first
and second smoothing filters 33 and 34; the adaptive filter 3 similar to
that used in the first or second embodiment described above; a convolution
calculation unit 36 for applying the convolution calculation to the output
of the adaptive filter 3; and the subtractor 4 for subtracting the output
of the convolution calculation unit 36 from the A/D converted speech input
to obtain the subtracted signals.
Here, the time constant for the first smoothing filter 33 is set to be
smaller than the time constant for the second smoothing filter 34, and for
example, the time constant t1 for the first smoothing filter 33 is set
equal to 10 ms while the time constant t2 for the second smoothing filter
34 is set equal to 100 ms.
The two channel A/D converters 31 and 32 make it possible to obtain the
synthetic speech response from the synthetic speech response generation
unit 7 and the speech input from the microphone 1 at the constant timings.
The sampling frequency of the A/D converters 31 and 32 can be set equal to
12 kHz, in view of the frequency range used in the speech signals.
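The effect of the two time constants can be illustrated with the following sketch. The description does not specify the form of the smoothing filters, so a first-order exponential smoother applied to the squared samples is assumed here purely for illustration; the function name `smooth_power` is hypothetical, and the default sampling frequency is the 12 kHz mentioned above.

```python
import math


def smooth_power(samples, time_constant_s, fs_hz=12000.0):
    """Smooth the instantaneous power of samples with a one-pole filter.

    time_constant_s : time constant, e.g. 0.010 for the first smoothing
                      filter and 0.100 for the second
    fs_hz           : sampling frequency of the A/D converter
    """
    # Per-sample weight corresponding to the requested time constant.
    alpha = 1.0 - math.exp(-1.0 / (time_constant_s * fs_hz))
    p = 0.0
    out = []
    for x in samples:
        p += alpha * (x * x - p)   # move the estimate toward the new power
        out.append(p)
    return out
```

With this assumed filter, the output for a 10 ms time constant follows an onset or a drop of the power much sooner than the output for a 100 ms time constant, while the latter gives the smoother curve.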
The switching unit 35 prohibits the adaptation by the adaptive filter 3
whenever the output of the first smoothing filter 33 is less than or equal
to a predetermined first threshold Va, and activates the adaptation by the
adaptive filter 3 whenever the output of the second smoothing filter 34 is
greater than or equal to a predetermined second threshold Vb.
For example, for a case of a specific synthetic speech response of "douzo"
(meaning "go ahead, please" in Japanese), the power for the outputs of the
first and second smoothing filters 33 and 34 appear as shown in FIG. 10A
and FIG. 10B, respectively, where the power Pb(k) of the output of the
second smoothing filter 34 shown in FIG. 10B is smoother than the power
Pa(k) of the output of the first smoothing filter 33 shown in
FIG. 10A, due to the larger time constant setting for the second smoothing
filter 34.
The portions of the power shown in FIGS. 10A and 10B at which the sound is
disrupted are shown together in enlargement in FIG. 11.
Now, in general, the accuracy of the estimate of the filter coefficients
drops abruptly in a short period of time such as 1 ms when the power of
the speech changes greatly, as at the boundary between a speech section
and a silent section. For this reason, it is possible to maintain the high
accuracy for the estimate of the filter coefficients by stopping the
adaptation as soon as a large change of the power of the speech occurs.
Consequently, in this third embodiment, the switching unit 35 prohibits the
adaptation by the adaptive filter 3 whenever the output of the first
smoothing filter 33 becomes less than or equal to a predetermined first
threshold Va indicated in FIG. 11, and activates the adaptation by the
adaptive filter 3 whenever the output of the second smoothing filter 34
becomes greater than or equal to a predetermined second threshold Vb
indicated in FIG. 11, such that the adaptation by the adaptive filter 3 is
not carried out when the power of the speech changes greatly.
Here, the appropriate values for the first and second thresholds Va and Vb
are empirically determined, and for example, the value of the first
threshold Va can be set equal to -20 dB which is a mean power of the
synthetic speech response.
In order to demonstrate the effect of this third embodiment, the accuracy
for the estimate of the filter coefficients for a case of using a specific
synthetic speech response of "irasshaimase" (meaning "welcome" in
Japanese) is shown in FIG. 12, where a curve C11 indicates a case with the
stopping of the adaptation as described above, while a curve C12 indicates
a case without the stopping of the adaptation. As can be clearly seen in
FIG. 12, the estimate of the filter coefficients can be obtained at much
higher accuracy with the stopping of the adaptation.
In addition, the speech recognition rate as a function of the accuracy of
the estimate for the filter coefficients is shown in FIG. 13, which
clearly indicates that the speech recognition rate becomes higher for the
higher accuracy of the estimate for the filter coefficients, i.e., for the
larger amount of cancellation of the synthetic speech response.
Thus, according to this third embodiment, the high accuracy for the
estimate of the filter coefficients can be achieved by stopping the
adaptation by the adaptive filter whenever there is a large change in the
power of the speech, and this high accuracy for the estimate of the filter
coefficients in turn ensures the high speech recognition rate, so that it
becomes possible in this third embodiment to carry out the cancellation of
the synthetic speech response more effectively.
In the speech dialogue system of this third embodiment, the step gain for
LMS used in updating the filter coefficients is obtained according to the
flow chart of FIG. 14 as follows.
First, the initial timing is set to k=0 at the step ST31, and for the first
timing k=0, whether the output power Pa(k) of the first smoothing filter
33 is not greater than the first threshold Va is judged at the step ST32.
In a case it is judged as not greater than the first threshold Va at the
step ST32, next at the step ST36, the step gain for LMS is set equal to
zero by setting the convergence factor $\mu=0$, so as to prohibit the
updating of the filter coefficients.
On the other hand, in a case it is judged as greater than the first
threshold Va at the step ST32, next at the step ST33, whether the output
power Pb(k) of the second smoothing filter 34 is not greater than the
second threshold Vb is judged.
In a case it is judged as not greater than the second threshold Vb at the
step ST33, next at the step ST37, the step gain for LMS is set equal to
zero by setting the convergence factor $\mu=0$, so as to prohibit the
updating of the filter coefficients.
On the other hand, in a case it is judged as greater than the second
threshold Vb at the step ST33, next at the step ST34, the step gain for
LMS is set according to the following equation (5), so as to carry out the
updating of the filter coefficients.
$$\text{step gain} = \frac{2\mu \cdot e(k)}{\mathrm{Pb}(k) \cdot L} \qquad (5)$$
where the convergence factor $\mu$ is a constant.
This step gain setting procedure is then repeated for all of the timings by
means of the step ST35.
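The step gain setting for one timing k may be sketched as follows. This is only a sketch of the decisions described above, assuming that the smoothed powers Pa(k) and Pb(k) have already been obtained from the two smoothing filters; the function name `lms_step_gain` and its argument names are hypothetical.

```python
def lms_step_gain(pa, pb, e, va, vb, mu, length):
    """Return the LMS step gain for one timing.

    pa, pb : output powers Pa(k) and Pb(k) of the first and second
             smoothing filters
    e      : error e(k)
    va, vb : first and second thresholds Va and Vb
    mu     : constant convergence factor
    length : dimension L of the input vector
    """
    if pa <= va:
        # Step ST32 to ST36: prohibit the updating of the filter coefficients.
        return 0.0
    if pb <= vb:
        # Step ST33 to ST37: prohibit the updating of the filter coefficients.
        return 0.0
    # Step ST34: step gain of equation (5).
    return 2.0 * mu * e / (pb * length)
```

Because the gain is forced to zero as soon as either smoothed power is at or below its threshold, the filter coefficients are held fixed around silent sections and abrupt power drops, and are updated only while both outputs indicate sustained power.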
It is to be noted that the amount of calculation required in estimating the
filter coefficients can be quite large, so that in order to carry out such
a large amount of calculation in real time fashion, a DSP board may be
used in the synthetic speech response cancellation unit 2A.
It is also to be noted that the synthetic speech response cancellation unit
2A of FIG. 9 may be modified to have a different number of the smoothing
filters such as one or three, instead of two as described above.
The speech dialogue system according to the present invention has an outer
appearance as sho