WikiPatents - Community Patent Review
Speech dialogue system for realizing improved communication between user and system    
United States Patent 5548681
Link to this page: http://www.wikipatents.com/5548681.html
Inventor(s): Gleaves; David (Kanagawa-ken, JP); Nagata; Yoshifumi (Kanagawa-ken, JP); Takebayashi; Yoichi (Kanagawa-ken, JP)
Abstract: In the system, a speech input uttered by a human is received by a microphone which outputs microphone output signals. The speech input received by the microphone is then recognized by a speech recognition unit, and a synthetic speech response appropriate for the speech input recognized by the speech recognition unit is generated and outputted from a loudspeaker to the human. In recognizing the speech input, the speech recognition unit receives input signals in which the synthetic speech response, outputted from the loudspeaker and then received by the microphone, is cancelled from the microphone output signals.

Owner/Assignee     Kabushiki Kaisha Toshiba (Kawasaki, JP)
Publication Date     August 20, 1996
Application Number     07/929,106
Filing Date     August 13, 1992
US Classification     704/233 704/226 704/256 704/256.1
Int'l Classification     G10L 005/06 G10L 009/14 G10L 003/02
Examiner     Knepper; David D.
Assistant Examiner    
Attorney/Law Firm     Oblon, Spivak, McClelland, Maier & Neustadt, P.C.
Address
Parent Case    
Priority Data     Aug 13, 1991 [JP] 3-202957; Mar 16, 1992 [JP] 4-058338
USPTO Field of Search     395/2..37 395/2.42 395/2.4 395/2 395/2.65 381/41 381/42 381/43 381/46 381/47
 U.S. References
 
Patent No.   Inventor   US Classification   Date
4937871      Hattori    704/233             Jun 1990
4905288      Gerson     704/245             Feb 1990
4852181      Morito     704/233             Jul 1989
Claims
 


What is claimed is:

1. A speech dialogue system, comprising:

microphone means for receiving a speech input uttered by a human speaker and outputting microphone output signals;

speech recognition means for receiving input signals and recognizing the speech input received by the microphone means;

synthetic speech response generation means for generating a synthetic speech response appropriate for the speech input recognized by the speech recognition means;

loudspeaker means for outputting the synthetic speech response to the human speaker; and

synthetic speech response cancellation means for cancelling the synthetic speech response, which is outputted from the loudspeaker means and then received by the microphone means, from the microphone output signals, to obtain input signals to be supplied to the speech recognition means from which the speech recognition means recognizes the speech input, the synthetic speech response cancellation means further comprising:

look-up table means for memorizing speech characteristic information on the synthetic speech response to be outputted from the loudspeaker means;

adaptive filter means for adapting the synthetic speech response generated by the synthetic speech response generation means by multiplying the synthetic speech response with filter coefficients calculated according to the speech characteristic information memorized in the look-up table means, to obtain an adapted synthetic speech response; and

subtractor means for subtracting the adapted synthetic speech response obtained by the adaptive filter means from the microphone output signals, to obtain the input signals supplied to the speech recognition means.

2. The speech dialogue system of claim 1, wherein the speech characteristic information memorized in the look-up table means indicates an auto-correlation characteristic of the synthetic speech response.

3. The speech dialogue system of claim 1, wherein the speech characteristic information memorized in the look-up table means indicates a power characteristic of the synthetic speech response.

4. The speech dialogue system of claim 1, further comprising control means for controlling an updating of the filter coefficients used in the adaptive filter means according to a presence of the speech input recognized by the speech recognition means.

5. The speech dialogue system of claim 4, wherein the control means sets values of the filter coefficients in the adaptive filter to values before the utterance of the speech input when the presence of the speech input is recognized by the speech recognition means.

6. The speech dialogue system of claim 5, wherein the control means controls the updating of the filter coefficients used in the adaptive filter means such that the filter coefficients are not changed unless a power level of the speech input is greater than a predetermined threshold.

7. The speech dialogue system of claim 1, further comprising speech segment detection means for detecting a speech segment in the input signals, wherein the speech recognition means recognize the speech input from the input signals only at the speech segment detected by the speech segment detection means.

8. The speech dialogue system of claim 7, wherein the speech segment detection means uses detection thresholds for determining the speech segment which vary according to a power level of the synthetic speech response to be cancelled at the synthetic speech response cancellation means.

9. The speech dialogue system of claim 8, wherein the speech segment detection means obtains the power level of the synthetic speech response by calculating a convolution of the synthetic speech response and an estimate for the filter coefficients.

10. The speech dialogue system of claim 9, wherein the estimate for the filter coefficients is obtained by adding a wide frequency band noise to the synthetic speech response.

11. The speech dialogue system of claim 9, wherein the estimate for the filter coefficients is obtained by a spectral pre-whitening of a frequency spectrum of the synthetic speech response.

12. A speech dialogue system, comprising:

microphone means for receiving a speech input uttered by a human speaker and outputting microphone output signals;

speech recognition means for receiving input signals and recognizing the speech input received by the microphone means;

synthetic speech response generation means for generating a synthetic speech response appropriate for the speech input recognized by the speech recognition means;

loudspeaker means for outputting the synthetic speech response to the human speaker; and

synthetic speech response cancellation means for cancelling the synthetic speech response, which is outputted from the loudspeaker means and then received by the microphone means, from the microphone output signals, to obtain input signals to be supplied to the speech recognition means from which the speech recognition means recognizes the speech input, the synthetic speech response cancellation means further comprising:

adaptive filter means for adapting the synthetic speech response generated by the synthetic speech response generation means, to obtain an adaptive filter output;

convolution means for calculating a convolution of the adaptive filter output obtained by the adaptive filter means and the synthetic speech response generated by the synthetic speech response generation means;

subtractor means for subtracting the convolution calculated by the convolution means from the microphone output signals, to obtain the input signals supplied to the speech recognition means;

at least one smoothing filter means for smoothing the synthetic speech response generated by the synthetic speech response generation means; and

switching means for controlling an operation of the adaptive filter means according to an output power level of the smoothing filter means.

13. The speech dialogue system of claim 7, wherein the smoothing filter means includes a first smoothing filter having a first time constant and a second smoothing filter having a second time constant which is larger than the first time constant, and the switching means stops the operation of the adaptive filter means when the output power level of the first smoothing filter is below a predetermined first threshold, and starts the operation of the adaptive filter means when the output power level of the second smoothing filter is above a predetermined second threshold.

14. The speech dialogue system of claim 13, wherein the filter coefficients used in the adaptive filter means are updated according to the output power level of the second smoothing filter.

15. The speech dialogue system of claim 12, wherein the synthetic speech response cancellation means further comprises two channel A/D converter means for inputting the synthetic speech response generated by the synthetic speech response generation means and the microphone output signals outputted by the microphone means into the synthetic speech response cancellation means synchronously.

16. The speech dialogue system of claim 12, further comprising speech segment detection means for detecting a speech segment in the input signals, wherein the speech recognition means recognize the speech input from the input signals only at the speech segment detected by the speech segment detection means.
Description
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech dialogue system for interfacing a human and a computer in the form of a dialogue using speech data.

2. Description of the Background Art

In recent years, the development of speech dialogue systems that use speech data as an interface between a human and a computer has advanced considerably.

A speech dialogue system is useful in a multi-media dialogue system that displays visual data such as graphic data and image data along with the speech data output. When the human speaker utters speech messages toward the microphone, the system recognizes these speech messages and outputs the appropriate response as speech data from a loudspeaker, so as to carry out the dialogue with the human speaker.

For example, such a speech dialogue system may be employed in a hamburger shop for taking the order from the customer. In this case, when the customer utters the order such as "Two hamburgers and three orange juices" toward the microphone, the system recognizes this speech input, and outputs the synthetic speech response for making a confirmation such as "Is it two hamburgers and three orange juices that you have just ordered?". In response to this synthetic speech response, when the customer utters "Yes", the recognized speech content is confirmed, and subsequently notified to the shop worker.

In such a conventional speech dialogue system, however, in a case where the customer has uttered "Three hamburgers . . ." by mistake, it is not possible for the customer to make a correction immediately. The customer must first deny the system's confirming synthetic speech response, such as "Is it three hamburgers . . .?", and then make the correct speech input such as "Two hamburgers . . ." again.

Moreover, in a case where the customer has uttered "Two hamburgers, one coke, and one ice cream, please", and the system has erroneously recognized this speech input and outputs the synthetic speech response "Is it four potatoes, one coke, and one ice cream you have just ordered?", the customer may very well be tempted to make a correction by interrupting the synthetic speech response as soon as it reaches the portion ". . . four potatoes . . .". Even in such a case, in a conventional speech dialogue system, the customer cannot make the correction until the output of the entire synthetic speech response is completed.

For these reasons, in a conventional speech dialogue system, the dialogue often requires a considerable amount of time, and it can be quite cumbersome.

In other words, in a conventional speech dialogue system, it has not been possible to carry out the reception of the speech input from the human speaker and the output of the synthetic speech response simultaneously, such that the input of the speech to be made by the human speaker can be made only after the output of the entire synthetic speech response from the system has been completed, and so consequently the dialogue can be quite time consuming and inefficient especially when the system makes the erroneous recognition of the speech input.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a speech dialogue system capable of carrying out the reception of the speech input from the human speaker and the output of the synthetic speech response simultaneously, such that the input of the speech can be made by the human speaker even while the output of the synthetic speech response from the system is still in progress, and so consequently the dialogue can be less time consuming and more efficient.

It is another object of the present invention to provide a speech dialogue system capable of realizing a natural communication between the system and the user of the system.

According to one aspect of the present invention there is provided a speech dialogue system, comprising: microphone means for receiving a speech input uttered by a human speaker and outputting microphone output signals; speech recognition means for recognizing the speech input received by the microphone means; synthetic speech response generation means for generating a synthetic speech response appropriate for the speech input recognized by the speech recognition means; loudspeaker means for outputting the synthetic speech response to the human speaker; and synthetic speech response cancellation means for cancelling the synthetic speech response, which is outputted from the loudspeaker means and then received by the microphone means, from the microphone output signals, to obtain input signals to be supplied to the speech recognition means from which the speech recognition means recognizes the speech input.

According to another aspect of the present invention there is provided a speech dialogue system, comprising: input means for receiving input from a user; input recognition means for recognizing the input received by the input means; response generation means for generating a response including a synthetic speech response, appropriate for the input recognized by the input recognition means; output means for outputting the response generated by the response generation means to the user; and control means for controlling a mode of an output of the response to be outputted from the output means, when there is the input from the user received by the input means.

Other features and advantages of the present invention will become apparent from the following description taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of a first embodiment of a speech dialogue system according to the present invention.

FIG. 2 is a flow chart for the operation of the speech dialogue system of FIG. 1.

FIG. 3 is a graph of subtracted signals used in the speech dialogue system of FIG. 1 as a function of time, indicating the effect of the synthetic speech response cancellation in the speech dialogue system of FIG. 1.

FIG. 4 is a detailed block diagram of one possible configuration for a synthetic speech response generation unit in the speech dialogue system of FIG. 1.

FIG. 5 is a schematic block diagram of a second embodiment of a speech dialogue system according to the present invention.

FIG. 6 is a graph of speech input as a function of time, for explaining the operation of the speech dialogue system of FIG. 5.

FIGS. 7A and 7B are graphs of a power and a pitch for a certain synthetic speech response as a function of time, respectively, for explaining a procedure to obtain a step gain in FLMS algorithm used in the speech dialogue system of FIG. 5.

FIG. 8 is a flow chart for the procedure to obtain a step gain in FLMS algorithm used in the speech dialogue system of FIG. 5.

FIG. 9 is a block diagram of a synthetic speech response cancellation unit in a third embodiment of a speech dialogue system according to the present invention.

FIGS. 10A and 10B are graphs of the output power of two smoothing filters in the synthetic speech response cancellation unit of FIG. 9.

FIG. 11 is an enlarged view of central portions in the output power of FIGS. 10A and 10B, shown in superposition of one on top of the other.

FIG. 12 is a graph of an accuracy for estimate of filter coefficients as a function of time, indicating the effect of the synthetic speech response cancellation unit of FIG. 9.

FIG. 13 is a graph of a speech recognition rate versus an accuracy for estimate of filter coefficients, indicating the effect of the synthetic speech response cancellation unit of FIG. 9.

FIG. 14 is a flow chart for the procedure to obtain a step gain in LMS algorithm used in the synthetic speech response cancellation unit of FIG. 9.

FIG. 15 is a perspective view of an external configuration of the speech dialogue system according to the present invention.

FIG. 16 is a schematic block diagram of a fourth embodiment of a speech dialogue system according to the present invention.

FIG. 17 is a detailed block diagram of a speech segment detection unit in the speech dialogue system of FIG. 16.

FIG. 18 is a graph of the power of an exemplary speech input, for explaining the speech segment detection by the speech detection unit of FIG. 17.

FIG. 19 is a state transition diagram for explaining the speech segment detection by the speech detection unit of FIG. 17.

FIG. 20 is a flow chart for the operation of the speech segment detection by the speech detection unit of FIG. 17.

FIG. 21 is a frequency spectrum of a specific synthetic speech response, indicating the effect of the use of the spectral pre-whitening of the frequency spectrum.

FIG. 22 is a graph of an accuracy for estimate of filter coefficients as a function of time, indicating the effect of the use of the addition of the wide frequency band noise and the spectral pre-whitening of the frequency spectrum.

FIG. 23 is a schematic block diagram of a fifth embodiment of a speech dialogue system according to the present invention.

FIGS. 24A to 24E are illustrations of various responses to be determined by a dialogue management unit in the speech dialogue system of FIG. 23.

FIG. 25 is a detailed block diagram of a response generation unit in the speech dialogue system of FIG. 23.

FIG. 26 is an illustration of the response output timing data to be generated at the response generation unit of FIG. 25.

FIG. 27 is an illustration of response interruption control data to be generated by an interruption control unit of the speech dialogue system of FIG. 23.

FIGS. 28 to 35 are diagrams indicating various exemplary modes of the output of the response controlled by the response interruption control data of FIG. 27.

FIG. 36 is an illustration of an evaluation of the content of the input to be made at the interruption control unit of the speech dialogue system of FIG. 23.

FIG. 37 is an illustration of an evaluation of the content of the response to be made at the interruption control unit of the speech dialogue system of FIG. 23.

FIGS. 38 to 41 are flow charts of various exemplary procedures for controlling the mode of the output of the response by the interruption control unit of the speech dialogue system of FIG. 23.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring now to FIG. 1, a first embodiment of a speech dialogue system according to the present invention will be described in detail.

The speech dialogue system of this first embodiment comprises: a microphone 1 for receiving a speech input from a human speaker; a loudspeaker 8 for outputting a synthetic speech response of the system in response to the speech input; a synthetic speech response cancellation unit 2 for cancelling the synthetic speech response superposed onto the speech input entered by the human speaker at the microphone 1; a speech recognition unit 5 for recognizing the content of the speech input entered by the human speaker according to the output of the synthetic speech response cancellation unit 2; a dialogue control unit 6 for selectively controlling the synthetic speech response appropriate for the content of the speech input recognized at the speech recognition unit 5; a synthetic speech response generation unit 7 for outputting the synthetic speech response selected by the dialogue control unit 6 to the loudspeaker 8 as well as to the synthetic speech response cancellation unit 2; and a display unit 16 for displaying visual data such as graphic data and image data to the human speaker.

The synthetic speech response cancellation unit 2 further comprises: a look-up table 3a for memorizing various information on the various available synthetic speech responses such as power information, pitch information, amplitude information, information on voiced and unvoiced sounds, and information on silent periods; an adaptive filter 3 for correcting the synthetic speech response to be cancelled from the speech input entered at the microphone 1 by calculating filter coefficients W of the LMS (Least Mean Square)/Newton algorithm to be described below; and a subtractor 4 for subtracting the output of the adaptive filter 3 from the speech input entered at the microphone 1.

Here, the method of speech recognition used in the speech recognition unit 5 can be any known speech recognition method such as a word spotting method or an HMM (Hidden Markov Model) method.

This speech dialogue system of FIG. 1 operates according to the flow chart shown in FIG. 2, as follows.

First, when the human speaker enters the speech input at the microphone 1, the speech signals of the speech input are supplied to the speech recognition unit 5 through the synthetic speech response cancellation unit 2. At first, there is no synthetic speech response outputted from the synthetic speech response generation unit 7, so that the processing at the synthetic speech response cancellation unit 2 is not carried out and the speech signals obtained at the microphone 1 are directly supplied to the speech recognition unit 5.

Then, at the step ST1, the synthetic speech response appropriate for the content of the speech input recognized by the speech recognition unit 5 is selected by the dialogue control unit 6, and at the step ST2, the selected synthetic speech response is transmitted from the synthetic speech response generation unit 7 to the loudspeaker 8 while at the step ST3, the selected synthetic speech response is transmitted to the adaptive filter 3.

Next, at the step ST4, the adaptive filter 3 calculates the filter coefficients W of the LMS/Newton algorithm, which are defined by the following equation (1) and which account for the effect on the synthetic speech response received by the microphone 1 of the reflection or the dissipation of the synthetic speech response outputted from the loudspeaker 8 due to the environment surrounding the system.

$$W_{(k+1)} = W_{(k)} + 2\mu R'_{(k)} e_{(k)} X_{(k)} \tag{1}$$

where: $k$ is a factor indicating an iteration number; $R'$ is an inverse of an auto-correlation matrix of the synthetic speech response, which is to be given by the look-up table 3a; $\mu$ is a convergence factor for controlling the stability and the rate of adaptation; $e$ is an error; and $X$ is an input vector representing the synthetic speech response.

Then, the output signal y to be supplied to the subtractor 4 is calculated by multiplying the synthetic speech response X with the transpose $W^{T}$ of the filter coefficients calculated according to the equation (1). In other words, the output signal y of the adaptive filter 3 is given by the following equation (2).

$$y = W^{T} X \tag{2}$$

On the other hand, at the step ST5, the microphone 1 receives the speech signals d to be supplied to the subtractor 4, representing the speech input uttered by the human speaker which is superposed by the synthetic speech response outputted by the loudspeaker 8.

Then, at the step ST6, the subtractor 4 subtracts the output signals y supplied by the adaptive filter 3 from the speech signals d supplied by the microphone 1. In other words, the subtracted signals S to be outputted from the subtractor 4 are given by the following equation (3).

$$S = d - y \tag{3}$$

Next, at the step ST7, the subtracted signals S obtained at the subtractor 4 are supplied to the speech recognition unit 5, in order to recognize the content of the speech input uttered by the human speaker at the speech recognition unit 5, and then to select the synthetic speech response appropriate for the recognized speech input at the dialogue control unit 6, and to output the selected synthetic speech response from the synthetic speech response generation unit 7.

Then, at the step ST8, the adaptive filter 3 updates the set of filter coefficients W according to the synthetic speech response newly outputted from the synthetic speech response generation unit 7, and by means of the step ST9, the above described process is repeated until the completion of the speech input is indicated in a predetermined manner.

Thus, according to this first embodiment, the speech input uttered by the human speaker is separated out from the speech signals received by the microphone 1 by subtracting the synthetic speech response outputted from the loudspeaker 8, appropriately modified by utilizing the LMS/Newton algorithm, so that the human speaker can enter the speech input at the microphone 1 even while the synthetic speech response is being outputted from the loudspeaker 8.
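
As an illustration only, one iteration of equations (1) to (3) might be sketched in Python as follows. The function name `lms_newton_step` and its argument layout are hypothetical, and the sketch assumes that the error e in equation (1) is the subtracted signal S of equation (3), which the text does not state explicitly.

```python
def lms_newton_step(W, X, d, R_inv, mu):
    """One illustrative iteration of equations (1)-(3).

    W      current filter coefficients (list of floats)
    X      input vector of synthetic speech response samples
    d      microphone sample (speech input plus superposed response)
    R_inv  inverse auto-correlation matrix R' as a list of rows
           (in the text this is supplied by the look-up table 3a)
    mu     convergence factor
    """
    # Equation (2): adaptive filter output y = W^T X
    y = sum(w * x for w, x in zip(W, X))
    # Equation (3): subtracted signal S = d - y
    S = d - y
    # Assumption: the error e of equation (1) equals the subtracted signal S
    e = S
    # R' X, computed row by row
    RX = [sum(r * x for r, x in zip(row, X)) for row in R_inv]
    # Equation (1): W(k+1) = W(k) + 2 mu R' e X
    W_next = [w + 2.0 * mu * e * rx for w, rx in zip(W, RX)]
    return W_next, y, S
```

With `R_inv` set to the identity matrix the update reduces to the ordinary LMS step, which is a convenient way to check the sketch by hand.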

It is to be noted that, instead of the inverse of the auto-correlation matrix R' used in the equation (1) for calculating the filter coefficients W described above, the power of the speech, indicating such information as the vocal and unvocal sounds, vowel and consonant sounds, silent periods, and sound duration, may be used. In a case of using the power p of the sound in calculating the set of filter coefficients W of the LMS/Newton algorithm, the equation (1) described above should be replaced by the following equation (4).

$$W_{(k+1)} = W_{(k)} + 2\,\frac{\mu}{p_{(k)} L}\, e_{(k)} X_{(k)} \tag{4}$$

where $L$ is a dimension of the speech input vector, and the factor $2\,(\mu / (p_{(k)} L))\, e_{(k)}$ is the step gain. In such a case, since the characteristics of the synthetic speech response such as power information and pitch information are memorized in the look-up table 3a in advance, the set of filter coefficients W can be calculated according to the characteristics of the selected synthetic speech response accurately.
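
The power-based update of equation (4) could be sketched as below. The function name `power_normalized_step` is hypothetical, the power value `p` is passed in directly rather than read from a look-up table, and the rejection of non-positive power is an added assumption, since the text does not say how silent periods are handled.

```python
def power_normalized_step(W, X, e, p, mu):
    """Illustrative update following equation (4).

    W   current filter coefficients
    X   input vector of synthetic speech response samples
    e   error for this iteration
    p   power of the synthetic speech response at this iteration
        (memorized in the look-up table 3a in the text)
    mu  convergence factor
    """
    if p <= 0.0:
        # Assumption: the sketch simply refuses a zero or negative power
        raise ValueError("power p must be positive")
    L = len(X)                          # dimension of the input vector
    gain = 2.0 * (mu / (p * L)) * e     # step gain of equation (4)
    return [w + gain * x for w, x in zip(W, X)]
```

Because the gain is divided by the power, loud portions of the response take proportionally smaller steps, which is the usual motivation for normalizing an LMS update.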

FIG. 3 shows characteristics of the level of the subtracted signals S obtained at the subtractor 4, where a curve C1 represents a case in which the synthetic speech response X has a constant power level, and a curve C2 represents a case in which the synthetic speech response X has the power level characteristic indicated by a curve C3 which is memorized in the look-up table 3a. As can be seen clearly in FIG. 3, the cancellation of the synthetic speech response from the speech signals can be achieved more effectively and accurately by carrying out the LMS/Newton algorithm using the power information memorized in the look-up table 3a (a case of the curve C2).

It is also to be noted that, in a case of outputting not only the synthetic speech response but also a music from the loudspeaker 8, the synthetic speech response generation unit 7 can be formed as shown in FIG. 4. Namely, in such a case, the synthetic speech response generation unit 7 comprises: a speech synthesizing unit 10 for outputting the synthetic speech response signals; a music synthesizing unit 11 for outputting the music signals; and a mixer 9 for mixing the synthetic speech response signals with the music signals. Here, the characteristics of the music signals can easily be obtained from the musical notes used in the music, so that by memorizing these characteristics in the look-up table 3a in advance, the music signals can be cancelled from the speech signals received at the microphone 1 in a manner similar to the cancellation of the synthetic speech response described above.

In addition, the acoustic signals other than the speech and the music including a natural sound such as a bird song and a buzzer sound may also be incorporated. The cancellation of the buzzer sound can be achieved by utilizing the fact that it is a periodical sound.

Furthermore, the cancellation of the random background noise can also be achieved in a similar manner, by utilizing the fact that the random noise is irregular but constantly present.

In a case where the signal outputted from the synthetic speech response generation unit 7 is a wide frequency band noise (white noise), it is known to be easy to estimate the set of filter coefficients W between the loudspeaker 8 and the microphone 1. On the other hand, the vocal sound (vowel sound) in the speech signals is a periodic signal, so that it gives a line spectrum in the short time frequency spectrum, and it is not constantly present. For this reason, the spectral components of the vocal sound (vowel sound) in the speech signals are not distributed over the wide frequency band, and this deteriorates the accuracy of the estimation of the set of filter coefficients. Here, however, by adopting the configuration shown in FIG. 4, wide frequency band signals such as the noise or the music can be added to the portions of the synthetic speech response that lack frequency components, so that it is possible to improve the accuracy of the LMS and FLMS algorithms.
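
The difference between wide band and periodic excitation can be illustrated with a small experiment. The sketch below is not the method of the patent: it uses a plain normalized LMS update rather than the LMS/Newton or FLMS algorithms named in the text, and the function name `nlms_identify`, the four-tap path `true_w`, and the noise-free microphone signal are all assumptions made for the illustration.

```python
import math
import random

def nlms_identify(signal, true_w, mu=0.5, eps=1e-8):
    """Estimate the path true_w from `signal`; return the final misalignment."""
    taps = len(true_w)
    w = [0.0] * taps
    x = [0.0] * taps                       # delay line, newest sample first
    for n, sample in enumerate(signal):
        x = [sample] + x[:-1]
        if n < taps - 1:
            continue                       # wait until the delay line is full
        d = sum(t * xi for t, xi in zip(true_w, x))   # simulated microphone signal
        y = sum(wi * xi for wi, xi in zip(w, x))      # adaptive filter output
        e = d - y
        power = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / power for wi, xi in zip(w, x)]
    # Euclidean distance between the estimate and the true coefficients
    return math.sqrt(sum((wi - ti) ** 2 for wi, ti in zip(w, true_w)))

rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(2000)]           # wide band input
tone = [math.cos(math.pi / 4 * n) for n in range(2000)]      # single spectral line
true_w = [0.5, -0.3, 0.2, 0.1]                               # hypothetical path
print("misalignment, white noise:", nlms_identify(white, true_w))
print("misalignment, pure tone:  ", nlms_identify(tone, true_w))
```

With the white noise input the estimate approaches the true coefficients, whereas a single tone only excites two of the four dimensions, so part of the path stays unidentified no matter how long the filter adapts.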

Referring now to FIG. 5, a second embodiment of a speech dialogue system according to the present invention will be described in detail. Here, those features which are substantially equivalent to the corresponding features in the first embodiment described above will be given the same reference numerals in the figure and their descriptions will be omitted.

This second embodiment differs from the first embodiment described above in that, as shown in FIG. 5, there is provided a filter coefficient updating control unit 15 between the speech recognition unit 5 and the dialogue control unit 6, which controls the updating of the filter coefficients at the adaptive filter 3. This filter coefficient updating control unit 15 functions to improve the accuracy of the estimation of the set of filter coefficients at the adaptive filter 3 at the period at which there is a speech input from the human speaker.

In this second embodiment, in estimating the filter coefficients W by using the LMS/Newton algorithm, the past filter coefficients W are retained at 100 ms intervals over the most recent five seconds, for example. Namely, the following filter coefficients are temporarily memorized in the adaptive filter 3.

______________________________________
W.sub.0     for the present timing
W.sub.-1    for 100 ms before the present timing
W.sub.-2    for 200 ms before the present timing
.
.
W.sub.-50   for 5 sec before the present timing
______________________________________

Then, when the speech input is recognized at the speech recognition unit 5, the setting of the filter coefficients W at the adaptive filter 3 is changed to the filter coefficients before the utterance of the speech input. For example, in a case where the speech input has been entered by the human speaker 750 ms ago, the filter coefficients W.sub.0 for the present timing are changed to the filter coefficients W.sub.-8 for 800 ms before the present timing, which are the most recent memorized filter coefficients preceding the start of the utterance. The reason for the effectiveness of this change of the filter coefficient setting can be explained in conjunction with FIG. 6 as follows.

In FIG. 6, a curve C4 indicates the synthetic speech response signals, and a curve C5 indicates the speech input signals entered by the human speaker. In the synthetic speech response cancellation unit 2, the synthetic speech response is cancelled by updating the filter coefficients every 100 ms, while in the speech recognition unit 5, a start point t.sub.S and an end point t.sub.E of the speech input are detected. Meanwhile, the filter coefficient updating control unit 15 makes a judgment every 100 ms as to whether to update the present estimate W.sub.0 for the filter coefficients straightforwardly or to use the past estimate W.sub.i (i=-1 to -50), according to the start point t.sub.S detected by the speech recognition unit 5. In this manner, it becomes possible in this second embodiment to obtain a more accurate estimate for the set of filter coefficients W at the adaptive filter 3, even for the period during which there is a speech input from the human speaker, so that the overall efficiency of the cancellation of the synthetic speech response at the synthetic speech response cancellation unit 2 can be improved.
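The memorizing and restoring of the past filter coefficients described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the class name CoefficientHistory and its methods are hypothetical, the coefficients are represented as plain lists, and the elapsed time since the start point of the speech input is assumed to be supplied by the caller.

```python
import math
from collections import deque

SNAPSHOT_INTERVAL_MS = 100   # one memorized set of coefficients per 100 ms
HISTORY_LEN = 51             # W_0 through W_-50, covering 5 seconds


class CoefficientHistory:
    """Hypothetical store for the past filter coefficients W_0 .. W_-50."""

    def __init__(self):
        # Index 0 holds W_0 (newest); index i holds W_-i.
        self._snapshots = deque(maxlen=HISTORY_LEN)

    def push(self, w):
        """Memorize the present coefficients; the oldest set is dropped."""
        self._snapshots.appendleft(list(w))

    def before_utterance(self, ms_since_speech_start):
        """Return the newest memorized coefficients that precede the utterance."""
        if not self._snapshots:
            raise ValueError("no coefficients have been memorized yet")
        steps = math.ceil(ms_since_speech_start / SNAPSHOT_INTERVAL_MS)
        steps = min(steps, len(self._snapshots) - 1)
        return self._snapshots[steps]


history = CoefficientHistory()
for i in range(60):              # 60 updates, so the oldest 9 are discarded
    history.push([float(i)])

restored = history.before_utterance(750)   # 750 ms ago -> W_-8
```

Rounding the elapsed time up to the next multiple of 100 ms reproduces the example in the text, in which a speech input entered 750 ms ago leads to the use of W.sub.-8.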

Now, the procedure for estimating the accurate set of filter coefficients according to some internal information such as that of time series for power and pitch utilized in synthesizing the synthetic speech response in the speech dialogue system of FIG. 5 will be described. Here, as an illustrative example, a case of a specific synthetic speech response of "torikeshimasu" (meaning "we are cancelling" in Japanese) will be described. For this specific synthetic speech response of "torikeshimasu", the power and the pitch as a function of time appear as shown in FIGS. 7A and 7B.

In the speech dialogue system of FIG. 5, the step gain for FLMS used in updating the filter coefficients is obtained according to the convergence factor determined by the flow chart of FIG. 8 as follows. Here, the FLMS estimates a transfer function which is a frequency spectrum of the filter coefficients.

First, at the step ST11, for the first timing n=0, whether it is a silent period or not is judged from the power information shown in FIG. 7A.

In a case it is judged as a silent period at the step ST11, next at the step ST14, all of the convergence factors .mu.(f) for FLMS for all the frequencies are set equal to zero. With this setting, no adaptive update is applied, so that the estimate for the transfer function remains unaffected even when there is an input of a noise from the microphone 1 during the silent period.

On the other hand, in a case it is judged as not a silent period at the step ST11, next at the step ST12, whether it is a vowel sound or a consonant sound is judged. This judgement can be made easily as the phoneme is already known in advance.

In a case it is judged as a consonant sound at the step ST12, next at the step ST15, whether it is over the predetermined threshold level (such as a surrounding environmental noise level plus 20 dB) or not is judged. If it is judged to be over the threshold level at the step ST15, all of the convergence factors .mu.(f) for FLMS for all the frequencies are set equal to a predetermined constant convergence factor C at the step ST17, whereas otherwise all of the convergence factors .mu.(f) for FLMS for all the frequencies are set equal to zero at the step ST16.

On the other hand, in a case it is judged as a vowel sound at the step ST12, next at the step ST13, whether it is over the predetermined threshold level (such as a surrounding environmental noise level plus 20 dB) or not is judged. If it is judged to be not over the threshold level at the step ST13, all of the convergence factors .mu.(f) for FLMS for all the frequencies are set equal to zero at the step ST18. On the contrary, if it is judged to be over the threshold level at the step ST13, the convergence factors .mu.(f) for FLMS are set to be such that .mu.(f)=C for the frequencies which are in a range of .+-.(1/3)f.sub.p around the integer multiples of the pitch frequency f.sub.p at each timing indicated in FIG. 7B, and .mu.(f)=0 for the rest which are located outside of this range. That is,

.mu.(f)=C for f.sub.p .multidot.n-(1/3)f.sub.p <f<f.sub.p .multidot.n+(1/3)f.sub.p

.mu.(f)=0 otherwise

where n is an integer.

This convergence factor setting procedure is then repeated for all of the timings by means of the step ST20, where the timing is updated in units of 10 ms, for example.
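The convergence factor setting for a single timing can be sketched as follows. This is a minimal sketch rather than the actual procedure of FIG. 8: the function name, the argument names, and the value chosen for the constant C are hypothetical, the phoneme class is assumed to be supplied as a string, and the multiple n=0 is assumed to be excluded.

```python
def convergence_factors(freqs_hz, kind, power_db, noise_db,
                        pitch_hz=None, c=0.1, margin_db=20.0):
    """Return mu(f) for each frequency in freqs_hz at one timing.

    kind is "silence", "consonant" or "vowel", which is known in advance
    from the phonemes of the synthetic speech response.
    """
    zeros = [0.0] * len(freqs_hz)
    if kind == "silence":
        return zeros                         # silent period: no adaptation
    if power_db <= noise_db + margin_db:
        return zeros                         # not over the threshold level
    if kind == "consonant":
        return [c] * len(freqs_hz)           # adapt at all the frequencies
    if kind != "vowel" or pitch_hz is None or pitch_hz <= 0:
        raise ValueError("a vowel requires a positive pitch frequency")
    factors = []
    for f in freqs_hz:
        n = round(f / pitch_hz)              # nearest multiple of the pitch
        near = n >= 1 and abs(f - n * pitch_hz) < pitch_hz / 3.0
        factors.append(c if near else 0.0)
    return factors


mu_vowel = convergence_factors([100.0, 120.0, 180.0, 250.0, 30.0],
                               kind="vowel", power_db=60.0, noise_db=30.0,
                               pitch_hz=120.0)
```

With a pitch frequency of 120 Hz, only the frequencies within 40 Hz of 120 Hz, 240 Hz, and so on receive the non-zero convergence factor.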

Thus, in this second embodiment, the updating of the estimate for the filter coefficients is carried out by placing more weight on the frequency components having larger power in the synthetic speech response.

Referring now to FIG. 9, a third embodiment of a speech dialogue system according to the present invention will be described in detail.

In this third embodiment, the synthetic speech response cancellation unit 2 of the first or second embodiment described above is replaced by a configuration shown in FIG. 9, while the rest of the speech dialogue system is substantially the same as the first or second embodiment described above, in order to obtain the filter coefficients stably at a high accuracy even in a case there is a large fluctuation in the power of the speech input signals.

In this third embodiment, the synthetic speech response cancellation unit 2A comprises: an A/D converter 31 for A/D converting the synthetic speech response from the synthetic speech response generation unit 7; an A/D converter 32 for A/D converting the speech input from the microphone 1; first and second smoothing filters 33 and 34 for smoothing the A/D converted synthetic speech response power signal, using different time constants; a switching unit 35 for judging whether to carry out the adaptation by the adaptive filter 3 according to the outputs of the first and second smoothing filters 33 and 34; the adaptive filter 3 similar to that used in the first or second embodiment described above; a convolution calculation unit 36 for applying the convolution calculation to the output of the adaptive filter 3; and the subtractor 4 for subtracting the output of the convolution calculation unit 36 from the A/D converted speech input to obtain the subtracted signals.

Here, the time constant for the first smoothing filter 33 is set to be smaller than the time constant for the second smoothing filter 34, and for example, the time constant t1 for the first smoothing filter 33 is set equal to 10 ms while the time constant t2 for the second smoothing filter 34 is set equal to 100 ms.

The two channel A/D converters 31 and 32 make it possible to obtain the synthetic speech response from the synthetic speech response generation unit 7 and the speech input from the microphone 1 at the constant timings. The sampling frequency of the A/D converters 31 and 32 can be set equal to 12 KHz, in view of the frequency range used in the speech signals.
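The relation between the two time constants can be illustrated by the following minimal sketch. The form of the smoothing filters is not specified above, so a first-order (exponential) smoother applied to the squared samples is assumed here, and the function name smooth_power is hypothetical. The sampling frequency of 12 KHz and the time constants of 10 ms and 100 ms are taken from the description.

```python
import math


def smooth_power(samples, time_constant_s, sample_rate_hz=12000.0):
    """Smooth the instantaneous power of samples with a first-order filter."""
    alpha = 1.0 - math.exp(-1.0 / (time_constant_s * sample_rate_hz))
    p = 0.0
    out = []
    for s in samples:
        p += alpha * (s * s - p)     # move toward the instantaneous power
        out.append(p)
    return out


# 100 ms of a unit-amplitude signal switched on at k = 0.
burst = [1.0] * 1200
pa = smooth_power(burst, time_constant_s=0.010)   # first smoothing filter
pb = smooth_power(burst, time_constant_s=0.100)   # second smoothing filter
```

After 100 ms the output with the 10 ms time constant has essentially reached the new power level, while the output with the 100 ms time constant has covered only about 63% of the change, so that the former reacts quickly to a change of the power and the latter gives a smoother power.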

The switching unit 35 prohibits the adaptation by the adaptive filter 3 whenever the output of the first smoothing filter 33 is less than or equal to a predetermined first threshold Va, and activates the adaptation by the adaptive filter 3 whenever the output of the second smoothing filter 34 is greater than or equal to a predetermined second threshold Vb.

For example, for a case of a specific synthetic speech response of "douzo" (meaning "go ahead, please" in Japanese), the power for the outputs of the first and second smoothing filters 33 and 34 appear as shown in FIG. 10A and FIG. 10B, respectively, where the power Pb(k) of the output of the second smoothing filter 34 shown in FIG. 10B is smoother than the power Pa(k) of the output of the first smoothing filter 33 shown in FIG. 10A, due to the larger time constant setting for the second smoothing filter 34.

The portions of the power shown in FIGS. 10A and 10B at which the sound is disrupted are shown together in enlargement in FIG. 11.

Now, in general, the accuracy of the estimate of the filter coefficients drops abruptly in a short period of time such as 1 ms when the power of the speech changes largely as in the border of the speech section and silent section. For this reason, it is possible to maintain the high accuracy for the estimate of the filter coefficients by stopping the adaptation as soon as a large change of the power of the speech occurs.

Consequently, in this third embodiment, the switching unit 35 prohibits the adaptation by the adaptive filter 3 whenever the output of the first smoothing filter 33 becomes less than or equal to a predetermined first threshold Va indicated in FIG. 11, and activates the adaptation by the adaptive filter 3 whenever the output of the second smoothing filter 34 becomes greater than or equal to a predetermined second threshold Vb indicated in FIG. 11, such that the adaptation by the adaptive filter 3 is not carried out when the power of the speech changes largely.

Here, the appropriate values for the first and second thresholds Va and Vb are empirically determined, and for example, the value of the first threshold Va can be set equal to -20 dB which is a mean power of the synthetic speech response.

In order to demonstrate the effect of this third embodiment, the accuracy for the estimate of the filter coefficients for a case of using a specific synthetic speech response of "irasshaimase" (meaning "welcome" in Japanese) is shown in FIG. 12, where a curve C11 indicates a case with the stopping of the adaptation as described above, while a curve C12 indicates a case without the stopping of the adaptation. As can be clearly seen in FIG. 12, the estimate of the filter coefficients can be obtained at much higher accuracy with the stopping of the adaptation.

In addition, the speech recognition rate as a function of the accuracy of the estimate for the filter coefficients is shown in FIG. 13, which clearly indicates that the speech recognition rate becomes higher for the higher accuracy of the estimate for the filter coefficients, i.e., for the larger amount of cancellation of the synthetic speech response.

Thus, according to this third embodiment, the high accuracy for the estimate of the filter coefficients can be achieved by stopping the adaptation by the adaptive filter whenever there is a large change in the power of the speech, and this high accuracy for the estimate of the filter coefficients in turn ensures the high speech recognition rate, so that it becomes possible in this third embodiment to carry out the cancellation of the synthetic speech response more effectively.

In the speech dialogue system of this third embodiment, the step gain for LMS used in updating the filter coefficients is obtained according to the flow chart of FIG. 14 as follows.

First, the initial timing is set to k=0 at the step ST31, and for the first timing k=0, whether the output power Pa(k) of the first smoothing filter 33 is not greater than the first threshold Va is judged at the step ST32.

In a case it is judged as not greater than the first threshold Va at the step ST32, next at the step ST36, the step gain for LMS is set equal to zero by setting the convergence factor .mu.=0, so as to prohibit the updating of the filter coefficients.

On the other hand, in a case it is judged as greater than the first threshold Va at the step ST32, next at the step ST33, whether the output power Pb(k) of the second smoothing filter 34 is not greater than the second threshold Vb is judged.

In a case it is judged as not greater than the second threshold Vb at the step ST33, next at the step ST37, the step gain for LMS is set equal to zero by setting the convergence factor .mu.=0, so as to prohibit the updating of the filter coefficients.

On the other hand, in a case it is judged as greater than the second threshold Vb at the step ST33, next at the step ST34, the step gain for LMS is set according to the following equation (5), so as to carry out the updating of the filter coefficients.

step gain=2.mu..multidot.e(k)/(Pb(k).multidot.L) (5)

where the convergence factor .mu. is a constant.

This step gain setting procedure is then repeated for all of the timings by means of the step ST35.
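The step gain setting for a single timing k can be sketched as follows. This is an illustrative sketch only: the function and argument names are hypothetical, e(k) is assumed to be the subtracted (residual) signal, and L in the equation (5) is assumed to be the number of filter coefficients, which is not defined in this part of the description.

```python
def step_gain(pa_k, pb_k, e_k, va, vb, mu, filter_len):
    """Return the step gain for LMS at one timing k.

    pa_k, pb_k : output powers of the first and second smoothing filters
    e_k        : residual signal at timing k (assumed meaning of e(k))
    va, vb     : first and second thresholds
    mu         : constant convergence factor
    filter_len : assumed meaning of L in the equation (5)
    """
    if pa_k <= va:          # ST32 -> ST36: updating is prohibited
        return 0.0
    if pb_k <= vb:          # ST33 -> ST37: updating is prohibited
        return 0.0
    # ST34: equation (5)
    return 2.0 * mu * e_k / (pb_k * filter_len)


gain = step_gain(pa_k=0.5, pb_k=0.5, e_k=0.2,
                 va=0.1, vb=0.1, mu=0.01, filter_len=100)
```

Because the second threshold Vb is checked before the division, the denominator is positive whenever Vb is chosen to be non-negative.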

It is to be noted that the amount of calculation required in estimating the filter coefficients can be quite large, so that in order to carry out such a large amount of calculation in real time fashion, a DSP board may be used in the synthetic speech response cancellation unit 2A.

It is also to be noted that the synthetic speech response cancellation unit 2A of FIG. 9 may be modified to have a different number of the smoothing filters such as one or three, instead of two as described above.

The speech dialogue system according to the present invention has an outer appearance as sho