WikiPatents - Community Patent Review
Speech dialogue system for facilitating improved human-computer interaction    
United States Patent 5577165
Link to this page: http://www.wikipatents.com/5577165.html
Inventor(s): Takebayashi; Yoichi (Kanagawa-ken, JP); Tsuboi; Hiroyuki (Hyogo-ken, JP); Sadamoto; Yoichi (Chiba-ken, JP); Yamashita; Yasuki (Hyogo-ken, JP); Nagata; Yoshifumi (Kanagawa-ken, JP); Seto; Shigenobu (Kanagawa-ken, JP); Shinchi; Hideaki (Kanagawa-ken, JP); Hashimoto; Hideki (Kanagawa-ken, JP)
Abstract: A speech dialogue system capable of realizing natural and smooth dialogue between the system and a human user, and easy maneuverability of the system. In the system, a semantic content of input speech from a user is understood and a semantic content determination of a response output is made according to the understood semantic content of the input speech. Then, a speech response and a visual response according to the determined response output are generated and outputted to the user. The dialogue between the system and the user is managed by controlling transitions between user states during which the input speech is to be entered and system states during which the system response is to be outputted. The understanding of a semantic content of input speech from a user is made by detecting keywords in the input speech, with the keywords to be detected in the input speech limited in advance, according to a state of a dialogue between the user and the system.
   














Owner/Assignee     Kabushiki Kaisha Toshiba (Kawasaki, JP); Toshiba Software Engineering Corp. (Tokyo, JP)
Publication Date     November 19, 1996
Application Number     08/312,541
Filing Date     September 26, 1994
US Classification     704/275 704/251 704/270
Int'l Classification     G01L 009/00
Examiner     MacDonald; Allen R.
Assistant Examiner     Chowdhury; Indranil
Attorney/Law Firm     Oblon, Spivak, McClelland, Maier & Neustadt, P.C.
Parent Case     This is a continuation of application Ser. No. 07/978,521, filed on Nov. 18, 1992, U.S. Pat. No. 5,357,596.
Priority Data     Nov. 18, 1991 [JP] 3-329475
USPTO Field of Search     381/41 381/42 381/43 381/44 381/45 395/2.79 395/2.84 395/2.4 395/2.6
References

U.S. References

Patent No.   Inventor      U.S. Class      Date
5357596      Takebayashi   704/275         Oct. 1994
5219291      Fong          434/323         Jun. 1993
5068645      Drumm         (none listed)   Nov. 1991
4856066      Lemelson      704/275         Aug. 1989
4677569      Nakano        704/275         Jun. 1987
Claims
 


What is claimed is:

1. A speech dialogue system, comprising:

speech understanding means for understanding an input speech from a user;

dialogue management means for determining a response output content according to the input speech understood by the speech understanding means;

response generation means for generating a speech response and a visual response according to the response output determined by the dialogue management means; and

output means for outputting the speech response and the visual response generated by the response generation means to the user.

2. The speech dialogue system of claim 1, wherein the response generation means generates the visual response which indicates at least one of an image of a human character delivering the speech response, text data of the speech response, and a content visualizing image of a content of the speech response.

3. The speech dialogue system of claim 1, wherein the output means outputs the speech response and the visual response by controlling at least one of an output order, an output timing, and a visual response output position.

4. The speech dialogue system of claim 1, further comprising user state detection means for detecting a state of the user, wherein the state of the user detected by the user state detection means is taken into account by the dialogue management means in determining the response output.

5. The speech dialogue system of claim 1, wherein the response generation means generates the visual response which includes an image of a human character delivering the speech response, where the image incorporates movement and facial expression of the human character.

6. The speech dialogue system of claim 5, wherein the response generation means generates the speech response incorporating a speech characteristic corresponding to the movement and the facial expression of the human character.

7. The speech dialogue system of claim 6, wherein the speech characteristic of the speech response includes at least one of an emotional expression, an intentional expression, and an intonation.

8. The speech dialogue system of claim 1, wherein the speech understanding means supplies a plurality of candidates for the input speech, and the dialogue management means determines the response output by evaluating said plurality of candidates in accordance with a dialogue history/state.

9. The speech dialogue system of claim 1, wherein the output means also outputs a visual indication for informing the user as to whether the speech dialogue system is ready to receive the input speech.

10. The speech dialogue system of claim 1, wherein the response generation means generates the visual response including a content visualizing image formed by pictures of objects mentioned in the speech response and a numerical figure indicating a quantity of each of the objects.

11. The speech dialogue system of claim 1, wherein the response generation means generates the speech response for making a confirmation of the input speech, while generating the visual response on the basis of a past history of a dialogue between the user and the speech dialogue system.

12. The speech dialogue system of claim 1, wherein the response generation means generates the visual response which includes text data of the speech response and graphic images other than the text data, and the response generation means generates the speech response and the text data for making a confirmation of the input speech while generating the graphic images on the basis of a past history of a dialogue between the user and the speech dialogue system.

13. The speech dialogue system of claim 1, wherein the response generation means generates the speech response for making a confirmation of the input speech, the speech response being changed from a full speech response to a simplified speech response according to a length of the speech response for making the confirmation.

14. The speech dialogue system of claim 13, wherein the length of the speech response for making the confirmation is determined from information items to be confirmed by the confirmation.

15. The speech dialogue system of claim 14, wherein the full speech response mentions all of the information items to be confirmed while the simplified speech response does not directly mention the information items to be confirmed.

16. The speech dialogue system of claim 15, wherein the simplified speech response contains a pronoun to refer to the visual response.

17. The speech dialogue system of claim 13, wherein the full speech response recites the response output explicitly while the simplified speech response does not recite the response output explicitly.

18. The speech dialogue system of claim 17, wherein the simplified speech response contains a pronoun to refer to the visual response.

19. The speech dialogue system of claim 13, wherein the output means outputs the visual response at an earlier timing than a timing for outputting the visual response when the speech response is the full speech response.

20. The speech dialogue system of claim 13, wherein the output means outputs the visual response before the speech response is outputted.

21. A method of speech dialogue between a human user and a speech dialogue system, comprising the steps of:

understanding an input speech from a user;

determining a response output according to the input speech;

generating a speech response and a visual response according to the response output; and

outputting the speech response and the visual response generated to the user.

22. The method of claim 21, wherein the generating step generates the visual response which includes at least one of an image of a human character delivering the speech response, text data of the speech response, and a content visualizing image of a content of the speech response.

23. The method of claim 21, wherein the outputting step outputs the speech response and the visual response by controlling at least one of an output order, an output timing, and a visual response output position.

24. The method of claim 21, further comprising the step of detecting a state of the user, and wherein the determining step determines the response output by taking the state of the user detected at the detecting step into account.

25. The method of claim 21, wherein the generating step generates the visual response which includes an image of a human character delivering the speech response, where the image incorporates movement and facial expression of the human character.

26. The method of claim 25, wherein the generating step generates the speech response incorporating a speech characteristic corresponding to the movement and the facial expression of the human character.

27. The method of claim 26, wherein the speech characteristic of the speech response includes at least one of an emotional expression, an intentional expression, and an intonation.

28. The method of claim 21, wherein the understanding step obtains a plurality of candidates for the input speech, and the determining step determines the response output by evaluating said plurality of candidates in accordance with a dialogue history/state.

29. The method of claim 21, wherein the outputting step also outputs a visual indication for informing the user as to whether the speech dialogue system is ready to receive the input speech.

30. The method of claim 21, wherein the generating step generates the visual response which includes a content visualizing image formed by pictures of objects mentioned in the speech response and a numerical figure indicating a quantity of each of the objects.

31. The method of claim 21, wherein at the generating step, the speech response for making a confirmation of the input speech is generated, while the visual response reflecting a past history of a dialogue between the user and the speech dialogue system is generated.

32. The method of claim 21, wherein at the generating step, the visual response includes text data of the speech response and graphic images other than the text data, and the speech response and the text data for making a confirmation of the input speech is generated while the graphic images reflecting a past history of a dialogue between the user and the speech dialogue system are generated.

33. The method of claim 21, wherein at the generating step, the speech response for making a confirmation of the input speech is generated, the speech response being changed from a full speech response to a simplified speech response, according to a length of the speech response for making the confirmation.

34. The method of claim 33, wherein the length of the speech response for making the confirmation is determined from information items to be confirmed by the confirmation.

35. The method of claim 34, wherein the full speech response mentions all of the information items to be confirmed while the simplified speech response does not directly mention the information items to be confirmed.

36. The method of claim 35, wherein the simplified speech response contains a pronoun to refer to the visual response.

37. The method of claim 33, wherein the full speech response recites the response output explicitly while the simplified speech response does not recite the response output explicitly.

38. The method of claim 37, wherein the simplified speech response contains a pronoun to refer to the visual response.

39. The method of claim 33, wherein at the outputting step, the visual response is outputted at an earlier timing than a timing for outputting the visual response when the speech response is the full speech response.

40. The method of claim 33, wherein at the outputting step, the visual response is outputted before the speech response is outputted.

41. A speech response system, comprising:

speech analyzing means for analyzing an input speech from a user;

response output means for outputting a speech response and a visual response according to the input speech analyzed by the speech analyzing means; and

management means for managing a dialogue between the user and the speech response system, by accepting a new input speech from the user after the visual response is outputted by the response output means even when the speech response is still being outputted by the response output means, and controlling the speech analyzing means to analyze the new input speech and the response output means to output a new speech response and a new visual response according to the new speech input analyzed by the speech analyzing means.

42. A speech dialogue system, comprising:

extracting means for extracting words of the input speech;

detecting means for detecting predetermined keywords among the extracted words; and

response output means for outputting a system response according to the detected keywords, the system response being generated in accordance with a prescribed rule corresponding to each of the keywords.
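
The switch between a full and a simplified confirmation recited in claims 13 to 16 (and in method claims 33 to 36) can be illustrated with a minimal sketch. It assumes a hypothetical ordering task, and the function name `confirmation_speech`, the threshold `max_items_for_full`, and the wording of the sentences are illustrative choices that do not come from the patent text. The number of information items to be confirmed stands in for the length of the speech response.

```python
def confirmation_speech(items, max_items_for_full=2):
    """Return a confirmation sentence for the items to be confirmed.

    items: dict mapping an item name to its quantity (hypothetical format).
    max_items_for_full: assumed threshold; the claims only say that the
    choice depends on the length of the confirmation.
    """
    if len(items) <= max_items_for_full:
        # Full speech response: mentions every information item explicitly.
        listed = ", ".join(f"{qty} {name}" for name, qty in items.items())
        return f"Your order is {listed}. Is that right?"
    # Simplified speech response: does not mention the items directly and
    # uses a pronoun ("this") that refers to the visual response on screen.
    return "Is this your order?"


if __name__ == "__main__":
    print(confirmation_speech({"coffee": 2}))
    print(confirmation_speech({"coffee": 2, "hamburger": 1, "cola": 3}))
```

In this sketch the simplified response only makes sense if the visual response already shows the items, which is consistent with claims 19 and 20 having the visual response output earlier than, or before, the speech response.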
Description
 


BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech dialogue system for realizing an interaction between a computer based system and a human speaker by utilizing various input and output techniques such as speech recognition and speech synthesis.

2. Description of the Background Art

In recent years, it has become possible to realize a so called human-computer interaction in various forms by inputting, outputting and processing multi-media such as characters, speech, graphics, and images.

In particular, in conjunction with a significant improvement in the capacities of a computer and a memory device, various applications of a work station and a personal computer which can handle the multi-media have been developed. However, such a conventional workstation or personal computer is only capable of handling various media separately and does not realize any coordination of the various media employed.

Meanwhile, it has become popular to use linguistic data of characters instead of the numerical data ordinarily used in a conventional computer.

As for the visual data, a capacity to handle the monochromatic image data ordinarily used in a conventional computer is expanded to deal with color images, animated images, three dimensional graphic images, and dynamic images.

As for the audio data, in addition to a conventionally used technique for handling speech signal levels, progress has been made to develop various other techniques such as a speech recognition and a speech synthesis, but these techniques are still too unstable to realize any practical applications except in some very limited fields.

Thus, for various types of data to be used in a computer based system such as character data, text data, speech data, and graphic data, there is a trend to make progress from a conventional input and output (recording and reproduction) functions to the understanding and generation functions. In other words, there is progress toward the construction of a dialogue system utilizing the understanding and generation functions for various media such as speech and graphics for the purpose of realizing more natural and pleasant human-computer interaction, by dealing with the content, structure, and meaning expressed in the media rather than the superficial manipulation of the media.

As for speech recognition, development has proceeded from isolated word recognition toward continuous word recognition and continuous speech recognition, primarily in specific task oriented environments accounting for the practical implementations. In such a practical application, it is more important for the speech dialogue system to understand the content of the speech than to recognize the individual words, and there has been progress in speech understanding systems that utilize the specialized knowledge of the application field on the basis of a keyword spotting technique.

On the other hand, as for speech synthesis, development has been made from a simple text-to-speech system toward a speech synthesis system suitable for a speech dialogue system in which a greater weight is given to the intonation.

However, the understanding and the generation of the media such as speech are not so simple as the ordinary input and output of data, so that errors or loss of information at a time of conversion among the media are inevitable. Namely, the speech understanding is a type of processing which extracts the content of the speech and the intention of the human speaker from the speech pattern data expressed in enormous data size, so that it is unavoidable to produce the speech recognition error or ambiguity in a process of compressing the data.

Consequently, it is necessary for the speech dialogue system to actively control the dialogue with the human speaker to make it progress as naturally and efficiently as possible by issuing appropriate questions and confirmations from the system side, so as to make up for the incompleteness of the speech recognition due to the unavoidable recognition error or ambiguity.

Now, in order to realize a natural and efficient dialogue with a human speaker, it is important for the speech dialogue system to be capable of conveying as much information on the state of the computer as possible to the human speaker. However, in a conventional speech dialogue system, the speech response is usually given by a mechanical voice reading of a response obtained by a text composition without any modulation of speech tone, so that it has often been difficult for the user to hear the message, and the message has been sometimes quite redundant. In the other types of a conventional speech dialogue system not using the speech response, the response from the system has usually been given only as a visual information in terms of text, graphics, images, icons, or numerical data displayed on a display screen, so that the human-computer dialogue has been heavily relying upon the visual sense of the user.

As described, in a conventional speech dialogue system, sufficient consideration has not been given to the use of the various media in the response from the system for the purpose of making up for the incompleteness of the speech recognition, and this has been the critical problem in the practical implementation of the speech recognition technique.

In other words, the speech recognition technique is associated with an instability due to the influence of the noises and unnecessary utterances by the human speaker, so that it is often difficult to convey the real intention of the human speaker in terms of speech, and consequently the application of the speech recognition technique has been confined to the severely limited field such as a telephone in which only the speech media is involved.

Thus, the conventional speech dialogue system has been a simple combination of the separately developed techniques related to the speech recognition, speech synthesis, and image display, and the sufficient consideration from a point of view of the naturalness and comfortableness of speech dialogue has been lacking.

More precisely, the conventional speech dialogue system has been associated with the essential problem regarding the lack of the naturalness due to the instability of the speech recognition caused by the recognition error or ambiguity, and the insufficient speech synthesis function to convey the feeling and intent resulting from the insufficient intonation control and the insufficient clarity of the speech utterance.

Moreover, the conventional speech dialogue system also lacked the sufficient function to generate the appropriate response on a basis of the result of the speech recognition.

Furthermore, there is an expectation for the improvement of the information transmission function by utilizing the image display along with the speech response, but the exact manner of using the two dimensional or three dimensional image displays in relation to the instantaneously and continuously varying speech response remains as the unsolved problem.

Also, it is important to determine what should be displayed in the speech dialogue system utilizing various other media.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide a speech dialogue system capable of realizing natural and smooth dialogue between the system and a human user, and easy maneuverability of the system.

According to one aspect of the present invention there is provided a speech dialogue system comprising: speech understanding means for understanding a semantic content of an input speech from a user; dialogue management means for making a semantic determination of a response output content according to the semantic content of the input speech understood by the speech understanding means; response generation means for generating a speech response and a visual response according to the response output content determined by the dialogue management means; and output means for outputting the speech response and the visual response generated by the response generation means to the user.
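
The four means named in this aspect form a simple pipeline, which can be sketched as follows. This is only an illustration under stated assumptions: the input is already-recognized text rather than audio, the task is a hypothetical food-ordering dialogue, and all names (`understand`, `manage_dialogue`, `generate_response`, `output`, `Response`, the `MENU` set) are invented for the example and do not appear in the patent.

```python
from dataclasses import dataclass

MENU = {"hamburger", "coffee", "cola"}  # hypothetical task vocabulary


@dataclass
class Response:
    speech: str  # text that a synthesizer would speak
    visual: str  # textual stand-in for what the display would show


def understand(utterance: str) -> dict:
    """Speech understanding: reduce the input to a semantic content."""
    words = utterance.lower().split()
    item = next((w for w in words if w in MENU), None)
    return {"act": "order" if item else "unknown", "item": item}


def manage_dialogue(semantics: dict) -> dict:
    """Dialogue management: decide the content of the response output."""
    if semantics["act"] == "order":
        return {"act": "confirm", "item": semantics["item"]}
    return {"act": "ask_again", "item": None}


def generate_response(content: dict) -> Response:
    """Response generation: build both a speech and a visual response."""
    if content["act"] == "confirm":
        return Response(
            speech=f"One {content['item']}, is that right?",
            visual=f"[picture of {content['item']}] x 1",
        )
    return Response(speech="Sorry, could you say that again?",
                    visual="[waiting icon]")


def output(response: Response) -> None:
    """Output: deliver both modalities to the user (printed here)."""
    print("SPEECH:", response.speech)
    print("SCREEN:", response.visual)


def respond(utterance: str) -> Response:
    return generate_response(manage_dialogue(understand(utterance)))


if __name__ == "__main__":
    output(respond("one coffee please"))
```

The point of the sketch is the separation of concerns: the dialogue manager decides what to say in semantic terms, and only the response generator turns that decision into speech and display content.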

According to another aspect of the present invention there is provided a method of speech dialogue between a human user and a speech dialogue system, comprising the steps of: understanding a semantic content of an input speech from a user; making a semantic determination of a response output content according to the semantic content of the input speech understood at the understanding step; generating a speech response and a visual response according to the response output content determined at the making step; and outputting the speech response and the visual response generated at the generating step to the user.

According to another aspect of the present invention there is provided a speech dialogue system comprising: speech understanding means for understanding a semantic content of an input speech from a user; response output means for outputting a system response according to the semantic content of the input speech understood by the speech understanding means; and dialogue management means for managing the dialogue between the user and the system by controlling transitions between user states in which the input speech is to be entered into the speech understanding means and system states in which the system response is to be outputted from the response output means.
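
The alternation between user states and system states can be pictured as a two-state machine. The sketch below is a deliberate simplification: it assumes strict turn-taking with exactly one user state and one system state, whereas the patent refers to states in the plural and claim 41 allows new input while a speech response is still being output. The class and method names are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    SYSTEM = auto()  # the system response is to be output
    USER = auto()    # input speech from the user is to be entered


class DialogueManager:
    """Toy manager that controls transitions between the two states."""

    def __init__(self):
        self.state = State.SYSTEM  # assumption: the system speaks first
        self.log = []

    def system_turn(self, response: str) -> None:
        if self.state is not State.SYSTEM:
            raise RuntimeError("system response attempted in a user state")
        self.log.append(("system", response))
        self.state = State.USER  # hand the turn to the user

    def user_turn(self, utterance: str) -> None:
        if self.state is not State.USER:
            raise RuntimeError("user input attempted in a system state")
        self.log.append(("user", utterance))
        self.state = State.SYSTEM  # hand the turn back to the system


if __name__ == "__main__":
    dm = DialogueManager()
    dm.system_turn("Welcome. What would you like?")
    dm.user_turn("two coffees")
    dm.system_turn("Two coffees, is that right?")
    print(dm.state, dm.log)
```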

According to another aspect of the present invention there is provided a speech dialogue system, comprising: speech understanding means for understanding a semantic content of an input speech from a user by detecting keywords in the input speech; dialogue management means for limiting the keywords to be detected in the input speech by the speech understanding means in advance, according to a state of a dialogue between the user and the system; and response output means for outputting a system response according to the semantic content of the input speech understood by the speech understanding means.
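
Limiting the keywords according to the dialogue state can be sketched as a lookup from the current state to the set of keywords that may be detected. The state names, the vocabularies, and the function `spot_keywords` are assumptions made for illustration. The sketch also filters already-recognized words, which is only a stand-in: in the system described here the restriction is applied in advance, so that the keyword detection itself searches only for the permitted keywords.

```python
# Hypothetical mapping from a dialogue state to the keywords allowed in it.
ACTIVE_KEYWORDS = {
    "ordering": {"hamburger", "coffee", "cola", "one", "two", "three"},
    "confirming": {"yes", "no"},
}


def spot_keywords(recognized_words, dialogue_state):
    """Return, in order, only the words that are keywords in this state."""
    allowed = ACTIVE_KEYWORDS[dialogue_state]
    return [w for w in recognized_words if w in allowed]


if __name__ == "__main__":
    words = "yes two coffee please".split()
    print(spot_keywords(words, "ordering"))    # keywords for placing an order
    print(spot_keywords(words, "confirming"))  # keywords for a yes/no answer
```

Restricting the vocabulary in this way shrinks the search space at each turn, which is one plausible reason for the design: fewer candidate keywords leave less room for spurious detections.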

Other features and advantages of the present invention will become apparent from the following description taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of a first embodiment of a speech dialogue system according to the present invention.

FIG. 2 is a detailed block diagram of a speech understanding unit in the speech dialogue system of FIG. 1.

FIG. 3 is an illustration of an example of a keyword lattice obtained from a continuous input speech in the speech understanding unit of FIG. 2.

FIG. 4 is an illustration of an example of a semantic utterance representation to be obtained by the speech understanding unit of FIG. 2.

FIG. 5 is an illustration of an exemplary list of keywords to be used in the speech understanding unit of FIG. 2.

FIG. 6 is an illustration of an example of a semantic response representation to be obtained by a dialogue management unit in the speech dialogue system of FIG. 1.

FIG. 7 is an illustration of an order table to be used in a dialogue management unit in the speech dialogue system of FIG. 1.

FIG. 8 is an illustration of a past order table to be used in a dialogue management unit in the speech dialogue system of FIG. 1.

FIG. 9 is a state transition diagram for an operation of a dialogue management unit in the speech dialogue system of FIG. 1.

FIG. 10 is a flow chart for an operation in a user state in the state transition diagram of FIG. 9.

FIG. 11 is a flow chart for an operation in a system state in the state transition diagram of FIG. 9.

FIGS. 12A and 12B are illustrations of examples of a semantic response representation and an order table for an exemplary case of the operation in a dialogue management unit in the speech dialogue system of FIG. 1.

FIG. 12C is an illustration indicating an exemplary dialogue between the system and the user in an exemplary case of the operation in a dialogue management unit in the speech dialogue system of FIG. 1.

FIGS. 12D and 12E are illustrations of examples of two semantic utterance representation candidates for an exemplary case of the operation in a dialogue management unit in the speech dialogue system of FIG. 1.

FIG. 13 is a flow chart for an operation in a user state in an exemplary case of the operation in a dialogue management unit in the speech dialogue system of FIG. 1 using the examples shown in FIGS. 12A to 12E.

FIG. 14 is a flow chart for an operation in a system state in an exemplary case of the operation in a dialogue management unit in the speech dialogue system of FIG. 1.

FIGS. 15A, 15B and 15C are illustrations of examples of a semantic utterance representation, a response act list, and a semantic response representation for an exemplary case of the operation in a dialogue management unit in the operation shown in the flow chart of FIG. 14.

FIG. 16B is an illustration of a table summarizing system responses for various cases in the speech dialogue system of FIG. 1.

FIG. 17 is an illustration of an input speech signal for explaining a determination of an input speech speed in the speech dialogue system of FIG. 1.

FIG. 18 is an illustration of an example of a semantic response representation supplied from the dialogue management unit to the response generation unit in the speech dialogue system of FIG. 1.

FIG. 19 is a detailed block diagram of a response generation unit in the speech dialogue system of FIG. 1.

FIG. 20 is an illustration of an example of a human character image information to be used in the response generation unit of FIG. 19.

FIG. 21 is an illustration of examples of a response sentence structure to be used in a response sentence generation unit in the response generation unit of FIG. 19.

FIG. 22A is a flow chart for an operation of the response sentence generation unit in the response generation unit of FIG. 19.

FIGS. 22B, 22C and 22D are illustrations of exemplary semantic response representation, response sentence structure, and generated response sentence to be used in the response sentence generation unit in the operation shown in the flow chart of FIG. 22A.

FIG. 23 is an illustration of a table used in a human character feature determination unit in the response generation unit of FIG. 19.

FIG. 24 is an illustration of a table used in a speech characteristic determination unit in the response generation unit of FIG. 19.

FIG. 25 is a detailed block diagram of a speech response generation unit in the response generation unit of FIG. 19.

FIG. 26 is a diagram for a fundamental frequency pattern model used in the speech response generation unit of FIG. 25.

FIGS. 27A-27F are diagrams of a fundamental frequency pattern used in the speech response generation unit of FIG. 25, without and with a modification for generating a speech response with a joyful expression.

FIGS. 28A-28F are diagrams of a fundamental frequency pattern used in the speech response generation unit of FIG. 25, without and with a modification for generating a speech response with a regretful expression.

FIG. 29 is a detailed block diagram of a speech waveform generation unit in the speech response generation unit of FIG. 25.

FIG. 30A is a timing chart for one example of a display timing control to be made in a response output control unit in the response generation unit of FIG. 19.

FIG. 30B is a timing chart for another example of a display timing control to be made in a response output control unit in the response generation unit of FIG. 19.

FIG. 31A is a timing chart for another example of a display timing control to be made in a response output control unit in the response generation unit of FIG. 19.

FIG. 31B is a timing chart for another example of a display timing control to be made in a response output control unit in the response generation unit of FIG. 19.

FIG. 32 to FIG. 38 are illustrations of various examples of display images to be used in the speech dialogue system of FIG. 1 obtained by the response generation unit of FIG. 19.

FIG. 39 is a diagram summarizing an overall operation in the speech dialogue system of FIG. 1.

FIG. 40 is a schematic block diagram of a second embodiment of a speech dialogue system according to the present invention.

FIG. 41 is a diagram for explaining an operation of a user state detection unit in the speech dialogue system of FIG. 40.

FIG. 42 is a timing chart for one example of an operation in the speech dialogue system of FIG. 40.

FIG. 43 is a timing chart for another example of an operation in the speech dialogue system of FIG. 40.

FIG. 44 is a flow chart for an operation in the speech dialogue system of FIG. 40.

FIG. 45 is a schematic block diagram of a third embodiment of a speech dialogue system according to the present invention.

FIGS. 46A and 46B are block diagrams of two alternative configurations for A/D and D/A conversion units in the speech dialogue system of FIG. 45.

FIG. 47 is an illustration of one example of a display image to be used in the speech dialogue system of FIG. 45.

FIG. 48 is an illustration of another example of a display image to be used in the speech dialogue system of FIG. 45.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

I. FIRST EMBODIMENT

Referring now to FIG. 1, a first embodiment of a speech dialogue system according to the present invention will be described in detail.

1. Overall System Configuration

In this first embodiment, the speech dialogue system comprises: a speech understanding unit 11 for understanding a semantic content of an input speech uttered by a user; a dialogue management unit 12 for making a semantic determination of a response output content according to the semantic content of the input speech understood by the speech understanding unit; a response generation unit 13 for generating a speech response and a visual response according to the response output content determined by the dialogue management unit; a display unit 14 for outputting the visual response generated by the response generation unit to the user; and a loudspeaker unit 15 for outputting the speech response generated by the response generation unit to the user.
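The flow of information among the units 11, 12, and 13 described above can be pictured with a minimal Python sketch. It is an illustration only, not the implementation of this embodiment: the function names, the toy vocabulary, the dictionary layouts, and the wording of the responses are all assumptions, and the input is taken to be a list of already detected keywords rather than actual speech.

```python
# Hypothetical toy vocabulary; not taken from the specification.
MENU = {"hamburger", "coffee", "potato"}
NUMBERS = {"one": 1, "two": 2, "three": 3}


def understand(keywords):
    """Stand-in for unit 11: turn detected keywords into a semantic content."""
    items = {}
    count = 1  # assumed default quantity when no number precedes an item
    for word in keywords:
        if word in NUMBERS:
            count = NUMBERS[word]
        elif word in MENU:
            items[word] = count
            count = 1
    return {"act": "order" if items else "unknown", "items": items}


def manage_dialogue(semantic_content, history):
    """Stand-in for unit 12: decide the response output content."""
    history.append(semantic_content)  # keep a simple dialogue history
    if semantic_content["act"] == "order":
        return {"act": "confirm", "items": semantic_content["items"]}
    return {"act": "ask_again", "items": {}}


def generate_response(response_content):
    """Stand-in for unit 13: build a speech response and a visual response."""
    if response_content["act"] == "confirm":
        listing = ", ".join(
            f"{n} {name}" for name, n in response_content["items"].items()
        )
        speech = f"Your order is {listing}. Is that right?"
    else:
        speech = "Please say your order again."
    # The visual response carries the text of the speech response plus the items.
    visual = {"text": speech, "items": response_content["items"]}
    return speech, visual


def dialogue_turn(keywords, history):
    """One pass through units 11, 12, and 13 for a single user utterance."""
    semantic = understand(keywords)
    content = manage_dialogue(semantic, history)
    return generate_response(content)
```

In this sketch the speech string would go to the loudspeaker unit 15 and the visual dictionary to the display unit 14; both output stages are omitted.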

The speech understanding unit 11 is not a simple speech recognition device for only recognizing words or sentences in the input speech, but is capable of extracting a semantic content intended to be expressed in the input speech by analyzing the input speech, in a form of a semantic utterance representation as will be described in detail below.

The dialogue management unit 12 makes the semantic determination of the response output content by using a dialogue history, a current dialogue state, a dialogue management procedure, and knowledge of the specialized application field, and supplies the response output content information indicating the appropriate response output to be generated to the response generation unit 13.

In addition, the dialogue management unit 12 achieves the improvement of the speech understanding and the reduction of the processing amount by properly treating the spoken input speech containing ellipsis and demonstrative pronouns, so as to enable the natural dialogue between the system and the user.

Moreover, the dialogue management unit 12 supplies the generated response output content information back to the speech understanding unit 11 in order to improve the efficiency of the speech understanding at the speech understanding unit 11 for the subsequent input speech by preliminarily limiting the candidates of the keywords, as well as syntactic and semantic rules to be utilized in the speech understanding, according to the response output content information generated in response to the current input speech, before the subsequent input speech is entered into the speech understanding unit. This preliminary limiting of the keywords and the syntactic and semantic rules is effective in reducing an amount of calculations required in a keyword spotting operation to be used in speech understanding.
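The effect of preliminarily limiting the keyword candidates can be illustrated with a short Python sketch. The dialogue state names, the keyword sets, and the cost model (one comparison per active keyword per analysis frame) are assumptions introduced for illustration and do not appear in this description.

```python
# Hypothetical dialogue states and keyword sets, chosen only for illustration.
KEYWORDS_BY_STATE = {
    "awaiting_order": {
        "hamburger", "coffee", "potato", "one", "two", "three", "please",
    },
    "awaiting_confirmation": {"yes", "no", "cancel"},
}


def active_keywords(dialogue_state):
    """Keywords the spotter is allowed to look for in the given state."""
    return KEYWORDS_BY_STATE[dialogue_state]


def spot_keywords(candidate_words, dialogue_state):
    """Keep only the candidates that belong to the active keyword set."""
    active = active_keywords(dialogue_state)
    return [word for word in candidate_words if word in active]


def matching_cost(num_frames, dialogue_state):
    """Assumed cost model: one comparison per active keyword per frame."""
    return num_frames * len(active_keywords(dialogue_state))
```

Under this assumed cost model, 100 analysis frames cost 700 comparisons while an order is awaited but only 300 while a confirmation is awaited, which is the kind of reduction in calculations the limiting is intended to achieve.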

Furthermore, the dialogue management unit 12 also supplies the response generation unit 13 with a human character image information indicating a human character image of a human character to deliver the speech response which is to be displayed on the display unit 14 while the speech response is outputted from the loudspeaker unit 15, and a content visualizing image information indicating a content visualizing image for visualizing the content of the speech response for the purpose of assisting the user's comprehension of the response from the system, which is also to be displayed on the display unit 14 while the speech response is outputted from the loudspeaker unit 15.

The response generation unit 13 generates the speech response in a synthesized voice to be outputted from the loudspeaker unit 15 according to the response output content information supplied from the dialogue management unit 12, and the visual response including the text data of the speech response, and the human character image and the content visualizing image to be displayed on the display unit 14 according to the human character image information and the content visualizing image information supplied from the dialogue management unit 12. Here, the human character image to be displayed on the display unit 14 incorporates the movement and the facial expression of the human character which are determined according to the response output content information and the human character image information supplied from the dialogue management unit 12. In other words, the response generation unit 13 generates the multimodal system response incorporating both the audio information and the visual information for supporting the smooth comprehension of the system response by the user, so as to establish the natural dialogue between the user and the system.

In addition, while the generated speech and visual responses are outputted from the response generation unit 13, the response generation unit 13 notifies the dialogue management unit 12 that the output of the responses is in progress. In response, the dialogue management unit 12 controls the timings of the speech understanding operation such as the start and end point detection and the keyword spotting for the subsequent input speech which is to be carried out by the speech understanding unit 11, according to this notification from the response generation unit 13, in order to improve the efficiency of the speech understanding at the speech understanding unit 11.

2. Individual System Elements

Now, further details of each element in this first embodiment of the speech dialogue system shown in FIG. 1 will be described. In the following description, a case of applying this speech dialogue system to a task of order taking in a fast food store will be used for the sake of definiteness of the description.

2.1. Speech Understanding Unit 11

The speech understanding unit 11 is required to achieve the understanding of the input speech uttered by the user by extracting a semantic content intended to be expressed in the input speech.

In general, the use of speech recognition of the speech uttered by the unspecified user has been contemplated for the specialized applications such as a ticket sale service system, a seat reservation service system, and a bank transaction service system, but such a speech recognition for the unspecified user has been encountering a considerable difficulty in achieving the accurate recognition of the actually spoken sentences because of the different manners of the speech utterance used by different users, the unnecessary words uttered by the user in conjunction with the actual message, the personal variations in the spoken language, and the influence of the background noises.

As a solution to such a difficulty encountered by the speech recognition for the unspecified user, there has been a proposition for the method of continuous speech understanding based on keyword lattice parsing in which the understanding of the semantic content of the continuously uttered speech is achieved by analyzing the keywords detected in the speech, as disclosed in H. Tsuboi and Y. Takebayashi: "A Real-Time Task-Oriented Speech Understanding System using Keyword Spotting", Proceedings of 1992 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 92), I-197 to I-200, San Francisco, U.S.A. (March 1992). Under the properly controlled circumstances, this method is capable of achieving the high speed understanding of the almost freely uttered speech while imposing very few restrictions on the user regarding the manner of speech utterance. Thus, in this first embodiment, this method of continuous speech understanding based on the keyword lattice parsing is employed in the speech understanding unit 11 of FIG. 1. A detailed implementation of the speech understanding unit 11 for realizing this method will now be described.

As shown in FIG. 2, the speech understanding unit 11 of this first embodiment generally comprises a keyword detection unit 21 and a syntactic and semantic analysis unit 22, where the keyword detection unit 21 further comprises a speech analyzer 21a and a keyword spotter 21b, while the syntactic and semantic analysis unit 22 further comprises a sentence start point detector 22a, a sentence candidate analyzer 22b, a sentence end point detector 22c, and a sentence candidate table 22d which is accessible from all of the sentence start point detector 22a, the sentence candidate analyzer 22b, and the sentence end point detector 22c.

The keyword detection unit 21 carries out the keyword spotting operation as follows. First, at the speech analyzer 21a, the input speech is passed through a low pass filter (not shown) and A/D converted by using the sampling frequency of 12 kHz and the 12-bit quantization. Then, at the speech analyzer 21a, the spectral analysis and the smoothing in the frequency region after the fast Fourier transformation are carried out on the obtained digital signals, and then the speech analysis result is obtained every 8 ms by using the 16 channel band pass filter (not shown) after the logarithmic transformation. Then, at the keyword spotter 21b, the known keyword spotting procedure is applied to the speech analysis result obtained by the speech analyzer 21a. Here, the known keyword spotting procedure such as that disclosed in Y. Takebayashi, H. Tsuboi, and H. Kanazawa: "A Robust Speech Recognition System using Word-Spotting with Noise Immunity Learning", Proceedings of 1991 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 91), pp. 905-908, Toronto, Canada (May, 1991), can be used, for example.
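The figures quoted for the speech analyzer 21a imply 96 samples per analysis frame (12,000 samples per second multiplied by 0.008 second). The following Python sketch shows one way a 16 channel log spectral vector could be computed for each such frame. Only the sampling frequency, the 8 ms period, and the channel count come from the description above; the non-overlapping frames, the equal-width bands, the naive discrete Fourier transform used in place of a fast Fourier transform, and the omission of the low pass filter, the quantization, and the frequency smoothing are simplifying assumptions.

```python
import cmath
import math

SAMPLE_RATE_HZ = 12_000  # from the description
FRAME_PERIOD_MS = 8      # from the description
NUM_CHANNELS = 16        # from the description
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_PERIOD_MS // 1000  # 96 samples per frame


def power_spectrum(frame):
    """Naive DFT power for bins 1..N/2 (a real system would use an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        spectrum.append(abs(acc) ** 2)
    return spectrum


def analyze(samples):
    """Return one 16-value log band-energy vector per non-overlapping frame."""
    bins_per_band = (FRAME_LEN // 2) // NUM_CHANNELS  # 48 bins / 16 bands = 3
    features = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        power = power_spectrum(samples[start:start + FRAME_LEN])
        # Assumed equal-width bands; the actual filter layout is not given.
        bands = [
            sum(power[b * bins_per_band:(b + 1) * bins_per_band])
            for b in range(NUM_CHANNELS)
        ]
        # Small offset avoids log(0) for empty bands.
        features.append([math.log(energy + 1e-10) for energy in bands])
    return features
```

With 96 samples per frame the bin spacing is 125 Hz, so each assumed band spans 375 Hz, and a sequence of such vectors is what a keyword spotter such as 21b would receive as its input.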

As a result of this keyword spotting procedure at the keyword spotter 21b, the keyword detection unit 21 obtains the keyword lattice enumerating all the keyword candidates from the continuous input speech. FIG. 3 shows an example of the keyword lattice obtained by the keyword detection unit 21 from the continuous input speech in Japanese equivalent to the English sentence of "Three hamburgers, coffees, and potatoes, please" uttered in Japanese, where the shaded words are the keywords detected in this continuous input speech. Here, it is to be noted that, in this FIG. 3, there is a correspondence between the continuous input speech uttered in Japanese as shown in FIG. 3 and the keywords of the keyword lattice in Japanese equivalents of those shown in FIG. 3, and consequently there is no correspondence between the continuous input speech as shown in FIG. 3 and the keywords of the keyword lattice as shown in FIG. 3. In other words, the keyword lattice shown in FIG. 3 is obtained for the Japanese keywords in the continuous input speech uttered in Japanese, and the English words appearing in the keyword lattice of FIG. 3 are straightforward translations of the Japanese keywords. Consequently, for the continuous input speech of "Three hamburgers, coffees, and potatoes, please" uttered in English, the keyword lattice expressed in En