| United States Patent | 5577165 |
| Link to this page | http://www.wikipatents.com/5577165.html |
| Inventor(s) | Takebayashi; Yoichi (Kanagawa-ken, JP);
Tsuboi; Hiroyuki (Hyogo-ken, JP);
Sadamoto; Yoichi (Chiba-ken, JP);
Yamashita; Yasuki (Hyogo-ken, JP);
Nagata; Yoshifumi (Kanagawa-ken, JP);
Seto; Shigenobu (Kanagawa-ken, JP);
Shinchi; Hideaki (Kanagawa-ken, JP);
Hashimoto; Hideki (Kanagawa-ken, JP) |
| Abstract | A speech dialogue system capable of realizing natural and smooth dialogue
between the system and a human user, and easy maneuverability of the
system. In the system, a semantic content of input speech from a user is
understood and a semantic content determination of a response output is
made according to the understood semantic content of the input speech.
Then, a speech response and a visual response according to the determined
response output are generated and outputted to the user. The dialogue
between the system and the user is managed by controlling transitions
between user states during which the input speech is to be entered and
system states during which the system response is to be outputted. The
understanding of a semantic content of input speech from a user is made by
detecting keywords in the input speech, with the keywords to be detected
in the input speech limited in advance, according to a state of a dialogue
between the user and the system. |
| Title | Speech dialogue system for facilitating improved human-computer
interaction |
| Publication Date |
November 19, 1996 |
| Filing Date |
September 26, 1994 |
| Parent Case |
This is a continuation of application Ser. No. 07/978,521, filed on Nov.
18, 1992, U.S. Pat. No. 5,357,596. |
|
| Priority Data |
Nov. 18, 1991 [JP] 3-329475
Claims  |
What is claimed is:
1. A speech dialogue system, comprising:
speech understanding means for understanding an input speech from a user;
dialogue management means for determining a response output content
according to the input speech understood by the speech understanding
means;
response generation means for generating a speech response and a visual
response according to the response output determined by the dialogue
management means; and
output means for outputting the speech response and the visual response
generated by the response generation means to the user.
2. The speech dialogue system of claim 1, wherein the response generation
means generates the visual response which indicates at least one of an
image of a human character delivering the speech response, text data of
the speech response, and a content visualizing image of a content of the
speech response.
3. The speech dialogue system of claim 1, wherein the output means outputs
the speech response and the visual response by controlling at least one of
an output order, an output timing, and a visual response output position.
4. The speech dialogue system of claim 1, further comprising user state
detection means for detecting a state of the user, wherein the state of
the user detected by the user state detection means is taken into account
by the dialogue management means in determining the response output.
5. The speech dialogue system of claim 1, wherein the response generation
means generates the visual response which includes an image of a human
character delivering the speech response, where the image incorporates
movement and facial expression of the human character.
6. The speech dialogue system of claim 5, wherein the response generation
means generates the speech response incorporating a speech characteristic
corresponding to the movement and the facial expression of the human
character.
7. The speech dialogue system of claim 6, wherein the speech characteristic
of the speech response includes at least one of an emotional expression,
an intentional expression, and an intonation.
8. The speech dialogue system of claim 1, wherein the speech understanding
means supplies a plurality of candidates for the input speech, and the
dialogue management means determines the response output by evaluating
said plurality of candidates in accordance with a dialogue history/state.
9. The speech dialogue system of claim 1, wherein the output means also
outputs a visual indication for informing the user as to whether the
speech dialogue system is ready to receive the input speech.
10. The speech dialogue system of claim 1, wherein the response generation
means generates the visual response including a content visualizing image
formed by pictures of objects mentioned in the speech response and a
numerical figure indicating a quantity of each of the objects.
11. The speech dialogue system of claim 1, wherein the response generation
means generates the speech response for making a confirmation of the input
speech, while generating the visual response on the basis of a past
history of a dialogue between the user and the speech dialogue system.
12. The speech dialogue system of claim 1, wherein the response generation
means generates the visual response which includes text data of the speech
response and graphic images other than the text data, and the response
generation means generates the speech response and the text data for
making a confirmation of the input speech while generating the graphic
images on the basis of a past history of a dialogue between the user and
the speech dialogue system.
13. The speech dialogue system of claim 1, wherein the response generation
means generates the speech response for making a confirmation of the input
speech, the speech response being changed from a full speech response to a
simplified speech response according to a length of the speech response
for making the confirmation.
14. The speech dialogue system of claim 13, wherein the length of the
speech response for making the confirmation is determined from information
items to be confirmed by the confirmation.
15. The speech dialogue system of claim 14, wherein the full speech
response mentions all of the information items to be confirmed while the
simplified speech response does not directly mention the information items
to be confirmed.
16. The speech dialogue system of claim 15, wherein the simplified speech
response contains a pronoun to refer to the visual response.
17. The speech dialogue system of claim 13, wherein the full speech
response recites the response output explicitly while the simplified
speech response does not recite the response output explicitly.
18. The speech dialogue system of claim 17, wherein the simplified speech
response contains a pronoun to refer to the visual response.
19. The speech dialogue system of claim 13, wherein the output means
outputs the visual response at an earlier timing than a timing for
outputting the visual response when the speech response is the full speech
response.
20. The speech dialogue system of claim 13, wherein the output means
outputs the visual response before the speech response is outputted.
21. A method of speech dialogue between a human user and a speech dialogue
system, comprising the steps of:
understanding an input speech from a user;
determining a response output according to the input speech;
generating a speech response and a visual response according to the
response output; and
outputting the speech response and the visual response generated to the
user.
22. The method of claim 21, wherein the generating step generates the
visual response which includes at least one of an image of a human
character delivering the speech response, text data of the speech
response, and a content visualizing image of a content of the speech
response.
23. The method of claim 21, wherein the outputting step outputs the speech
response and the visual response by controlling at least one of an output
order, an output timing, and a visual response output position.
24. The method of claim 21, further comprising the step of detecting a
state of the user, and wherein the determining step determines the
response output by taking the state of the user detected at the detecting
step into account.
25. The method of claim 21, wherein the generating step generates the
visual response which includes an image of a human character delivering
the speech response, where the image incorporates movement and facial
expression of the human character.
26. The method of claim 25, wherein the generating step generates the
speech response incorporating a speech characteristic corresponding to the
movement and the facial expression of the human character.
27. The method of claim 26, wherein the speech characteristic of the speech
response includes at least one of an emotional expression, an intentional
expression, and an intonation.
28. The method of claim 21, wherein the understanding step obtains a
plurality of candidates for the input speech, and the determining step
determines the response output by evaluating said plurality of candidates
in accordance with a dialogue history/state.
29. The method of claim 21, wherein the outputting step also outputs a
visual indication for informing the user as to whether the speech dialogue
system is ready to receive the input speech.
30. The method of claim 21, wherein the generating step generates the
visual response which includes a content visualizing image formed by
pictures of objects mentioned in the speech response and a numerical
figure indicating a quantity of each of the objects.
31. The method of claim 21, wherein at the generating step, the speech
response for making a confirmation of the input speech is generated, while
the visual response reflecting a past history of a dialogue between the
user and the speech dialogue system is generated.
32. The method of claim 21, wherein at the generating step, the visual
response includes text data of the speech response and graphic images
other than the text data, and the speech response and the text data for
making a confirmation of the input speech is generated while the graphic
images reflecting a past history of a dialogue between the user and the
speech dialogue system are generated.
33. The method of claim 21, wherein at the generating step, the speech
response for making a confirmation of the input speech is generated, the
speech response being changed from a full speech response to a simplified
speech response, according to a length of the speech response for making
the confirmation.
34. The method of claim 33, wherein the length of the speech response for
making the confirmation is determined from information items to be
confirmed by the confirmation.
35. The method of claim 34, wherein the full speech response mentions all
of the information items to be confirmed while the simplified speech
response does not directly mention the information items to be confirmed.
36. The method of claim 35, wherein the simplified speech
response contains a pronoun to refer to the visual response.
37. The method of claim 33, wherein the full speech response recites the
response output explicitly while the simplified speech response does not
recite the response output explicitly.
38. The method of claim 37, wherein the simplified speech response contains
a pronoun to refer to the visual response.
39. The method of claim 33, wherein at the outputting step, the visual
response is outputted at an earlier timing than a timing for outputting
the visual response when the speech response is the full speech response.
40. The method of claim 33, wherein at the outputting step, the visual
response is outputted before the speech response is outputted.
41. A speech response system, comprising:
speech analyzing means for analyzing an input speech from a user;
response output means for outputting a speech response and a visual
response according to the input speech analyzed by the speech analyzing
means; and
management means for managing a dialogue between the user and the speech
response system, by accepting a new input speech from the user after the
visual response is outputted by the response output means even when the
speech response is still being outputted by the response output means, and
controlling the speech analyzing means to analyze the new input speech and
the response output means to output a new speech response and a new visual
response according to the new speech input analyzed by the speech
analyzing means.
42. A speech dialogue system, comprising:
extracting means for extracting words of the input speech;
detecting means for detecting predetermined keywords among the extracted
words; and
response output means for outputting a system response according to the
detected keywords, the system response being generated in accordance with
a prescribed rule corresponding to each of the keywords. |
Description  |
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech dialogue system for realizing an
interaction between a computer based system and a human speaker by
utilizing various input and output techniques such as speech recognition
and speech synthesis.
2. Description of the Background Art
In recent years, it has become possible to realize a so called
human-computer interaction in various forms by inputting, outputting and
processing multi-media such as characters, speech, graphics, and images.
In particular, in conjunction with a significant improvement in the
capacities of a computer and a memory device, various applications of a
work station and a personal computer which can handle the multi-media have
been developed. However, such a conventional workstation or personal
computer is only capable of handling various media separately and does not
realize any coordination of the various media employed.
Meanwhile, it has become popular to use linguistic data of characters
instead of the numerical data ordinarily used in a conventional computer.
As for the visual data, a capacity to handle the monochromatic image data
ordinarily used in a conventional computer is expanded to deal with color
images, animated images, three dimensional graphic images, and dynamic
images.
As for the audio data, in addition to a conventionally used technique for
handling speech signal levels, progress has been made to develop various
other techniques such as a speech recognition and a speech synthesis, but
these techniques are still too unstable to realize any practical
applications except in some very limited fields.
Thus, for various types of data to be used in a computer based system such
as character data, text data, speech data, and graphic data, there is a
trend to make progress from a conventional input and output (recording and
reproduction) functions to the understanding and generation functions. In
other words, there is progress toward the construction of a dialogue
system utilizing the understanding and generation functions for various
media such as speech and graphics for the purpose of realizing more
natural and pleasant human-computer interaction, by dealing with the
content, structure, and meaning expressed in the media rather than the
superficial manipulation of the media.
As for speech recognition, the development has been made from an isolated
word recognition toward continuous word recognition and continuous speech
recognition, primarily in specific task oriented environments accounting
for the practical implementations. In such a practical application, it is
more important for the speech dialogue system to understand the content of
the speech rather than to recognize the individual words, and there has
been progress toward a speech understanding system utilizing the specialized
knowledge of the application field on the basis of a keyword spotting
technique.
On the other hand, as for the speech synthesis, development has been made
from a simple text-to-speech system toward a speech synthesis system
suitable for a speech dialogue system in which a greater weight is given
to the intonation.
However, the understanding and the generation of the media such as speech
are not so simple as the ordinary input and output of data, so that errors
or loss of information at a time of conversion among the media are
inevitable. Namely, the speech understanding is a type of processing which
extracts the content of the speech and the intention of the human speaker
from the speech pattern data expressed in enormous data size, so that it
is unavoidable to produce the speech recognition error or ambiguity in a
process of compressing the data.
Consequently, it is necessary for the speech dialogue system to actively
control the dialogue with the human speaker to make it progress as natural
and efficient as possible by issuing appropriate questions and
confirmations from the system side, so as to make up for the
incompleteness of the speech recognition due to the unavoidable
recognition error or ambiguity.
Now, in order to realize a natural and efficient dialogue with a human
speaker, it is important for the speech dialogue system to be capable of
conveying as much information on the state of the computer as possible to
the human speaker. However, in a conventional speech dialogue system, the
speech response is usually given by a mechanical voice reading of a
response obtained by a text composition without any modulation of speech
tone, so that it has often been difficult for the user to hear the
message, and the message has been sometimes quite redundant. In the other
types of a conventional speech dialogue system not using the speech
response, the response from the system has usually been given only as a
visual information in terms of text, graphics, images, icons, or numerical
data displayed on a display screen, so that the human-computer dialogue
has been heavily relying upon the visual sense of the user.
As described, in a conventional speech dialogue system, a sufficient
consideration has not been given to the use of the various media in the
response from the system for the purpose of making up for the incompleteness
of the speech recognition and this has been the critical problem in the
practical implementation of the speech recognition technique.
In other words, the speech recognition technique is associated with an
instability due to the influence of the noises and unnecessary utterances
by the human speaker, so that it is often difficult to convey the real
intention of the human speaker in terms of speech, and consequently the
application of the speech recognition technique has been confined to the
severely limited field such as a telephone in which only the speech media
is involved.
Thus, the conventional speech dialogue system has been a simple combination
of the separately developed techniques related to the speech recognition,
speech synthesis, and image display, and the sufficient consideration from
a point of view of the naturalness and comfortableness of speech dialogue
has been lacking.
More precisely, the conventional speech dialogue system has been associated
with the essential problem regarding the lack of the naturalness due to
the instability of the speech recognition caused by the recognition error
or ambiguity, and the insufficient speech synthesis function to convey the
feeling and intent resulting from the insufficient intonation control and
the insufficient clarity of the speech utterance.
Moreover, the conventional speech dialogue system also lacked the
sufficient function to generate the appropriate response on a basis of the
result of the speech recognition.
Furthermore, there is an expectation for the improvement of the information
transmission function by utilizing the image display along with the speech
response, but the exact manner of using the two dimensional or three
dimensional image displays in relation to the instantaneously and
continuously varying speech response remains as the unsolved problem.
Also, it is important to determine what should be displayed in the speech
dialogue system utilizing various other media.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide a speech
dialogue system capable of realizing natural and smooth dialogue between
the system and a human user, and easy maneuverability of the system.
According to one aspect of the present invention there is provided a speech
dialogue system comprising: speech understanding means for understanding a
semantic content of an input speech from a user; dialogue management means
for making a semantic determination of a response output content according
to the semantic content of the input speech understood by the speech
understanding means; response generation means for generating a speech
response and a visual response according to the response output content
determined by the dialogue management means; and output means for
outputting the speech response and the visual response generated by the
response generation means to the user.
According to another aspect of the present invention there is provided a
method of speech dialogue between a human user and a speech dialogue
system, comprising the steps of: understanding a semantic content of an
input speech from a user; making a semantic determination of a response
output content according to the semantic content of the input speech
understood at the understanding step; generating a speech response and a
visual response according to the response output content determined at the
making step; and outputting the speech response and the visual response
generated at the generating step to the user.
According to another aspect of the present invention there is provided a
speech dialogue system comprising: speech understanding means for
understanding a semantic content of an input speech from a user; response
output means for outputting a system response according to the semantic
content of the input speech understood by the speech understanding means;
and dialogue management means for managing the dialogue between the user
and the system by controlling transitions between user states in which the
input speech is to be entered into the speech understanding means and
system states in which the system response is to be outputted from the
response output means.
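The alternation between user states and system states described in this aspect can be pictured with a small sketch. It is an illustration only, under stated assumptions: the names Turn and DialogueManager, the rule that the system speaks first, and the two-state simplification are all hypothetical and are not taken from the specification, which refers to a state transition diagram (FIG. 9) for the actual procedure.

```python
from enum import Enum, auto


class Turn(Enum):
    USER = auto()    # input speech is to be entered
    SYSTEM = auto()  # system response is to be outputted


class DialogueManager:
    """Hypothetical two-state manager: it only tracks whose turn it is."""

    def __init__(self):
        # Assumption: the system opens the dialogue (e.g. with a greeting).
        self.turn = Turn.SYSTEM
        self.history = []

    def system_respond(self, response):
        if self.turn is not Turn.SYSTEM:
            raise RuntimeError("not in a system state")
        self.history.append(("system", response))
        self.turn = Turn.USER  # transition: system state -> user state

    def user_input(self, semantic_content):
        if self.turn is not Turn.USER:
            raise RuntimeError("not in a user state")
        self.history.append(("user", semantic_content))
        self.turn = Turn.SYSTEM  # transition: user state -> system state


manager = DialogueManager()
manager.system_respond("Welcome.")
manager.user_input({"keywords": ["hello"]})
```

Keeping the turn in one place means that input arriving in the wrong state is rejected explicitly rather than silently processed.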
According to another aspect of the present invention there is provided a
speech dialogue system, comprising: speech understanding means for
understanding a semantic content of an input speech from a user by
detecting keywords in the input speech; dialogue management means for
limiting the keywords to be detected in the input speech by the speech
understanding means in advance, according to a state of a dialogue between
the user and the system; and response output means for outputting a system
response according to the semantic content of the input speech understood
by the speech understanding means.
Other features and advantages of the present invention will become apparent
from the following description taken in conjunction with the accompanying
drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic block diagram of a first embodiment of a speech
dialogue system according to the present invention.
FIG. 2 is a detailed block diagram of a speech understanding unit in the
speech dialogue system of FIG. 1.
FIG. 3 is an illustration of an example of a keyword lattice obtained from
a continuous input speech in the speech understanding unit of FIG. 2.
FIG. 4 is an illustration of an example of a semantic utterance
representation to be obtained by the speech understanding unit of FIG. 2.
FIG. 5 is an illustration of an exemplary list of keywords to be used in the
speech understanding unit of FIG. 2.
FIG. 6 is an illustration of an example of a semantic response
representation to be obtained by a dialogue management unit in the speech
dialogue system of FIG. 1.
FIG. 7 is an illustration of an order table to be used in a dialogue
management unit in the speech dialogue system of FIG. 1.
FIG. 8 is an illustration of a past order table to be used in a dialogue
management unit in the speech dialogue system of FIG. 1.
FIG. 9 is a state transition diagram for an operation of a dialogue
management unit in the speech dialogue system of FIG. 1.
FIG. 10 is a flow chart for an operation in a user state in the state
transition diagram of FIG. 9.
FIG. 11 is a flow chart for an operation in a system state in the state
transition diagram of FIG. 9.
FIGS. 12A and 12B are illustrations of examples of a semantic response
representation and an order table for an exemplary case of the operation
in a dialogue management unit in the speech dialogue system of FIG. 1.
FIG. 12C is an illustration indicating an exemplary dialogue between the
system and the user in an exemplary case of the operation in a dialogue
management unit in the speech dialogue system of FIG. 1.
FIGS. 12D and 12E are illustrations of examples of two semantic utterance
representation candidates for an exemplary case of the operation in a
dialogue management unit in the speech dialogue system of FIG. 1.
FIG. 13 is a flow chart for an operation in a user state in an exemplary
case of the operation in a dialogue management unit in the speech dialogue
system of FIG. 1 using the examples shown in FIGS. 12A to 12E.
FIG. 14 is a flow chart for an operation in a system state in an exemplary
case of the operation in a dialogue management unit in the speech dialogue
system of FIG. 1.
FIGS. 15A, 15B and 15C are illustrations of examples of a semantic
utterance representation, a response act list, and a semantic response
representation for an exemplary case of the operation in a dialogue
management unit in the operation shown in the flow chart of FIG. 14.
FIG. 16B is an illustration of a table summarizing system responses for
various cases in the speech dialogue system of FIG. 1.
FIG. 17 is an illustration of an input speech signal for explaining a
determination of an input speech speed in the speech dialogue system of
FIG. 1.
FIG. 18 is an illustration of an example of a semantic response
representation supplied from the dialogue management unit to the response
generation unit in the speech dialogue system of FIG. 1.
FIG. 19 is a detailed block diagram of a response generation unit in the
speech dialogue system of FIG. 1.
FIG. 20 is an illustration of an example of a human character image
information to be used in the response generation unit of FIG. 19.
FIG. 21 is an illustration of examples of a response sentence structure to
be used in a response sentence generation unit in the response generation
unit of FIG. 19.
FIG. 22A is a flow chart for an operation of the response sentence
generation unit in the response generation unit of FIG. 19.
FIGS. 22B, 22C and 22D are illustrations of exemplary semantic response
representation, response sentence structure, and generated response
sentence to be used in the response sentence generation unit in the
operation shown in the flow chart of FIG. 22A.
FIG. 23 is an illustration of a table used in a human character feature
determination unit in the response generation unit of FIG. 19.
FIG. 24 is an illustration of a table used in a speech characteristic
determination unit in the response generation unit of FIG. 19.
FIG. 25 is a detailed block diagram of a speech response generation unit in
the response generation unit of FIG. 19.
FIG. 26 is a diagram for a fundamental frequency pattern model used in the
speech response generation unit of FIG. 25.
FIGS. 27A-27F are diagrams of a fundamental frequency pattern used in the
speech response generation unit of FIG. 25, without and with a
modification for generating a speech response with a joyful expression.
FIGS. 28A-28F are diagrams of a fundamental frequency pattern used in the
speech response generation unit of FIG. 25, without and with a
modification for generating a speech response with a regretful expression.
FIG. 29 is a detailed block diagram of a speech waveform generation unit in
the speech response generation unit of FIG. 25.
FIG. 30A is a timing chart for one example of a display timing control to
be made in a response output control unit in the response generation unit
of FIG. 19.
FIG. 30B is a timing chart for another example of a display timing control
to be made in a response output control unit in the response generation
unit of FIG. 19.
FIG. 31A is a timing chart for another example of a display timing control
to be made in a response output control unit in the response generation
unit of FIG. 19.
FIG. 31B is a timing chart for another example of a display timing control
to be made in a response output control unit in the response generation
unit of FIG. 19.
FIG. 32 to FIG. 38 are illustrations of various examples of display images
to be used in the speech dialogue system of FIG. 1 obtained by the
response generation unit of FIG. 19.
FIG. 39 is a diagram summarizing an overall operation in the speech
dialogue system of FIG. 1.
FIG. 40 is a schematic block diagram of a second embodiment of a speech
dialogue system according to the present invention.
FIG. 41 is a diagram for explaining an operation of a user state detection
unit in the speech dialogue system of FIG. 40.
FIG. 42 is a timing chart for one example of an operation in the speech
dialogue system of FIG. 40.
FIG. 43 is a timing chart for another example of an operation in the speech
dialogue system of FIG. 40.
FIG. 44 is a flow chart for an operation in the speech dialogue system of
FIG. 40.
FIG. 45 is a schematic block diagram of a third embodiment of a speech
dialogue system according to the present invention.
FIGS. 46A and 46B are block diagrams of two alternative configurations for
an A/D and D/A conversion units in the speech dialogue system of FIG. 45.
FIG. 47 is an illustration of one example of a display image to be used in
the speech dialogue system of FIG. 45.
FIG. 48 is an illustration of another example of a display image to be used
in the speech dialogue system of FIG. 45.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
I. FIRST EMBODIMENT
Referring now to FIG. 1, a first embodiment of a speech dialogue system
according to the present invention will be described in detail.
1. Overall System Configuration
In this first embodiment, the speech dialogue system comprises: a speech
understanding unit 11 for understanding a semantic content of an input
speech uttered by a user; a dialogue management unit 12 for making a
semantic determination of a response output content according to the
semantic content of the input speech understood by the speech
understanding unit 11; a response generation unit 13 for generating a speech
response and a visual response according to the response output content
determined by the dialogue management unit; a display unit 14 for
outputting the visual response generated by the response generation unit
to the user; and a loudspeaker unit 15 for outputting the speech response
generated by the response generation unit to the user.
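The flow of data through the five units listed above can be sketched as a single dialogue turn. This is a minimal sketch under stated assumptions, not the disclosed implementation: run_turn, Response, and the three stub functions are hypothetical names, plain strings stand in for audio and images, and the stubs replace the real understanding, management, and generation processing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Response:
    speech: str  # text that would be synthesized for the loudspeaker unit
    visual: str  # stand-in for the image/text shown on the display unit


def run_turn(
    input_speech: str,
    understand: Callable[[str], dict],     # role of speech understanding unit 11
    manage: Callable[[dict], dict],        # role of dialogue management unit 12
    generate: Callable[[dict], Response],  # role of response generation unit 13
    display: Callable[[str], None],        # role of display unit 14
    loudspeaker: Callable[[str], None],    # role of loudspeaker unit 15
) -> Response:
    semantic_content = understand(input_speech)
    output_content = manage(semantic_content)
    response = generate(output_content)
    display(response.visual)
    loudspeaker(response.speech)
    return response


# Hypothetical stubs, for illustration only.
def understand(speech: str) -> dict:
    return {"keywords": speech.lower().split()}


def manage(semantic: dict) -> dict:
    return {"act": "greeting" if "hello" in semantic["keywords"] else "ask_again"}


def generate(content: dict) -> Response:
    text = "Welcome." if content["act"] == "greeting" else "Please say that again."
    return Response(speech=text, visual="[character] " + text)


shown, spoken = [], []
result = run_turn("Hello", understand, manage, generate, shown.append, spoken.append)
```

Passing the units in as callables keeps the turn logic independent of how each unit is realized.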
The speech understanding unit 11 is not a simple speech recognition device
for only recognizing words or sentences in the input speech, but capable
of extracting a semantic content intended to be expressed in the input
speech by analyzing the input speech, in a form of a semantic utterance
representation as will be described in detail below.
The dialogue management unit 12 makes the semantic determination of the
response output content by using a dialogue history, a current dialogue
state, a dialogue management procedure, and knowledge of the specialized
application field, and supplies the response output content information
indicating the appropriate response output to be generated to the response
generation unit 13.
In addition, the dialogue management unit 12 achieves the improvement of
the speech understanding and the reduction of the processing amount by
properly treating the spoken input speech containing ellipsis and
demonstrative pronouns, so as to enable the natural dialogue between the
system and the user.
Moreover, the dialogue management unit 12 supplies the generated response
output content information back to the speech understanding unit 11 in
order to improve the efficiency of the speech understanding at the speech
understanding unit 11 for the subsequent input speech by preliminarily
limiting the candidates of the keywords, as well as syntactic and semantic
rules to be utilized in the speech understanding, according to the
response output content information generated in response to the current
input speech, before the subsequent input speech is entered into the
speech understanding unit. This preliminary limiting of the keywords and
the syntactic and semantic rules is effective in reducing an amount of
calculations required in a keyword spotting operation to be used in speech
understanding.
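The preliminary limiting described above can be illustrated with a minimal
Python sketch. The response types, the keyword sets, and the functions
active_keywords and spotting_cost are hypothetical names chosen only to fit
the fast food order-taking task; the specification does not define them, and
the cost model is a deliberately crude assumption.

```python
# Hypothetical mapping from the type of the last system response to the
# keywords worth spotting in the next user utterance (illustrative values).
KEYWORDS_BY_RESPONSE = {
    "greeting": {"hamburger", "coffee", "potato", "one", "two", "three", "please"},
    "confirmation": {"yes", "no", "add", "cancel"},
    "size_query": {"small", "medium", "large"},
}

# Union of all keyword sets: what would be spotted without any limiting.
FULL_VOCABULARY = set().union(*KEYWORDS_BY_RESPONSE.values())


def active_keywords(response_type):
    """Return the keyword candidates to spot after a given system response.

    Unknown response types fall back to the full vocabulary.
    """
    return KEYWORDS_BY_RESPONSE.get(response_type, FULL_VOCABULARY)


def spotting_cost(num_frames, keywords):
    """Assumed cost model: one template comparison per keyword per frame."""
    return num_frames * len(keywords)


if __name__ == "__main__":
    frames = 250  # 2 s of speech at one analysis frame per 8 ms
    print(spotting_cost(frames, FULL_VOCABULARY))
    print(spotting_cost(frames, active_keywords("confirmation")))
```

Under this assumed cost model, the work of the keyword spotting operation
scales with the number of active keywords, which is why limiting the
candidates before the next input speech arrives reduces the amount of
calculation.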
Furthermore, the dialogue management unit 12 also supplies the response
generation unit 13 with a human character image information indicating a
human character image of a human character to deliver the speech response
which is to be displayed on the display unit 14 while the speech response
is outputted from the loudspeaker unit 15, and a content visualizing image
information indicating a content visualizing image for visualizing the
content of the speech response for the purpose of assisting the user's
comprehension of the response from the system, which is also to be
displayed on the display unit 14 while the speech response is outputted
from the loudspeaker unit 15.
The response generation unit 13 generates the speech response in a
synthesized voice to be outputted from the loudspeaker unit 15 according
to the response output content information supplied from the dialogue
management unit 12, and the visual response including the text data of the
speech response, and the human character image and the content visualizing
image to be displayed on the display unit 14 according to the human
character image information and the content visualizing image information
supplied from the dialogue management unit 12. Here, the human character
image to be displayed on the display unit 14 incorporates the movement and
the facial expression of the human character which are determined
according to the response output content information and the human
character image information supplied from the dialogue management unit 12.
In other words, the response generation unit 13 generates the multimodal
system response incorporating both the audio information and the visual
information for supporting the smooth comprehension of the system response
by the user, so as to establish the natural dialogue between the user and
the system.
In addition, while the generated speech and visual responses are outputted
from the response generation unit 13, the response generation unit 13
notifies the dialogue management unit 12 that the output of the
responses is in progress. In response, the dialogue management unit 12
controls the timings of the speech understanding operation such as the
start and end point detection and the keyword spotting for the subsequent
input speech which is to be carried out by the speech understanding unit
11, according to this notification from the response generation unit 13,
in order to improve the efficiency of the speech understanding at the
speech understanding unit 11.
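The notification path described above may be sketched in Python as follows.
This is only an illustration under stated assumptions: the class names, the
notify_output method, and the policy of suspending input analysis while a
response is being outputted are all hypothetical. The text states only that
the timings of the speech understanding operation are controlled according
to the notification, not how.

```python
class SpeechUnderstandingStub:
    """Stand-in for the speech understanding unit 11 (hypothetical interface)."""

    def __init__(self):
        self.listening = True
        self.accepted = []

    def feed(self, frame):
        # Frames are analysed only while listening is enabled.
        if self.listening:
            self.accepted.append(frame)


class DialogueManagerStub:
    """Stand-in for the dialogue management unit 12 (hypothetical interface)."""

    def __init__(self, understanding):
        self.understanding = understanding

    def notify_output(self, in_progress):
        # Assumed policy: suspend input analysis while a response is output.
        self.understanding.listening = not in_progress


if __name__ == "__main__":
    su = SpeechUnderstandingStub()
    dm = DialogueManagerStub(su)
    dm.notify_output(True)    # response generation unit 13 starts its output
    su.feed("frame-during-output")
    dm.notify_output(False)   # output of the responses has finished
    su.feed("frame-after-output")
    print(su.accepted)
```

Other policies, such as continuing to listen at a reduced sensitivity during
the output, would fit the same notification interface.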
2. Individual System Elements
Now, further details of each element in this first embodiment of
the speech dialogue system shown in FIG. 1 will be described. In the
following description, a case of applying this speech dialogue system to
a task of order taking in a fast food store will be used for the sake of
definiteness of the description.
2.1. Speech Understanding Unit 11
The speech understanding unit 11 is required to achieve the understanding
of the input speech uttered by the user by extracting a semantic content
intended to be expressed in the input speech.
In general, speech recognition for unspecified users has been
contemplated for specialized applications such as a ticket sale service
system, a seat reservation service system, and a bank transaction service
system. Such speech recognition for unspecified users, however, has had
considerable difficulty in accurately recognizing actually spoken
sentences because of the different manners of speech utterance used by
different users, the unnecessary words uttered by the user in conjunction
with the actual message, the personal variations in the spoken language,
and the influence of background noise.
As a solution to this difficulty encountered by speech recognition for
unspecified users, a method of continuous speech understanding based on
keyword lattice parsing has been proposed, in which the understanding of
the semantic content of continuously uttered speech is achieved by
analyzing the keywords detected in the speech, as
disclosed in H. Tsuboi and Y. Takebayashi: "A Real-Time Task-Oriented
Speech Understanding System using Keyword Spotting", Proceedings of 1992
International Conference on Acoustics, Speech, and Signal Processing
(ICASSP 92), I-197 to I-200, San Francisco, U.S.A. (March 1992). Under the
properly controlled circumstances, this method is capable of achieving
high speed understanding of almost freely uttered speech while imposing
very few restrictions on the manner of speech utterance used by the
user. Thus, in this first embodiment, this method of continuous speech
understanding based on the keyword lattice parsing is employed in the
speech understanding unit 11 of FIG. 1. A detailed implementation of the
speech understanding unit 11 for realizing this method will now be
described.
As shown in FIG. 2, the speech understanding unit 11 of this first
embodiment generally comprises a keyword detection unit 21 and a syntactic
and semantic analysis unit 22, where the keyword detection unit 21 further
comprises a speech analyzer 21a and a keyword spotter 21b, while the
syntactic and semantic analysis unit 22 further comprises a sentence start
point detector 22a, a sentence candidate analyzer 22b, a sentence end
point detector 22c, and a sentence candidate table 22d which is accessible
from all of the sentence start point detector 22a, the sentence candidate
analyzer 22b, and the sentence end point detector 22c.
The keyword detection unit 21 carries out the keyword spotting operation as
follows. First, at the speech analyzer 21a, the input speech is passed
through a low pass filter (not shown) and A/D converted by using the
sampling frequency of 12 kHz and 12-bit quantization. Then, at the
speech analyzer 21a, the spectral analysis and the smoothing in the
frequency region after the fast Fourier transformation are carried out on
the obtained digital signals, and then the speech analysis result is
obtained for each 8 ms by using the 16 channel band pass filter (not
shown) after the logarithmic transformation. Then, at the keyword spotter
21b, the known keyword spotting procedure is applied to the speech
analysis result obtained by the speech analyzer 21a. Here, the known
keyword spotting procedure such as that disclosed in Y. Takebayashi, H.
Tsuboi, and H. Kanazawa: "A Robust Speech Recognition System using
Word-Spotting with Noise Immunity Learning", Proceedings of 1991
International Conference on Acoustics, Speech, and Signal Processing
(ICASSP 91), pp. 905-908, Toronto, Canada (May, 1991), can be used, for
example.
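The front-end analysis described above can be approximated by the following
minimal Python sketch. The sampling frequency of 12 kHz, the 12-bit
quantization, the 8 ms frame period, and the 16 channels are taken from the
text. The 256-sample analysis window, the Hamming window, the equal-width
frequency bands, and the order of band pooling and logarithm are assumptions
made for illustration, and a direct DFT stands in for the fast Fourier
transformation for the sake of self-containment. It is not the analyzer of
the cited references.

```python
import cmath
import math

SAMPLE_RATE_HZ = 12_000      # from the text
FRAME_PERIOD_S = 0.008       # from the text: one result every 8 ms
NUM_CHANNELS = 16            # from the text
WINDOW_LEN = 256             # assumption: analysis window length in samples

HOP = round(SAMPLE_RATE_HZ * FRAME_PERIOD_S)  # 96 samples per 8 ms


def quantize_12bit(x):
    """Map a sample in [-1.0, 1.0] to a signed 12-bit integer (clipped)."""
    return max(-2048, min(2047, round(x * 2047)))


def power_spectrum(frame):
    """Power spectrum of one Hamming-windowed frame via a direct DFT."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2):  # bins 0 .. n/2 - 1
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(windowed))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def analyze(samples):
    """Return one list of NUM_CHANNELS log band energies per 8 ms hop."""
    frames = []
    for start in range(0, len(samples) - WINDOW_LEN + 1, HOP):
        spec = power_spectrum(samples[start:start + WINDOW_LEN])
        width = len(spec) // NUM_CHANNELS  # 128 bins -> 8 bins per channel
        bands = [sum(spec[c * width:(c + 1) * width])
                 for c in range(NUM_CHANNELS)]
        # Small offset avoids log(0) for silent bands.
        frames.append([math.log(b + 1e-9) for b in bands])
    return frames


if __name__ == "__main__":
    # 40 ms of a 1 kHz tone, quantized to 12 bits.
    tone = [quantize_12bit(math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE_HZ))
            for t in range(480)]
    result = analyze(tone)
    print(len(result), len(result[0]))
```

With these assumed settings each channel covers 375 Hz of the 0 to 6 kHz
range, and each analysis frame is a vector of 16 values that a keyword
spotter could then compare against keyword templates.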
As a result of this keyword spotting procedure at the keyword spotter 21b,
the keyword detection unit 21 obtains the keyword lattice enumerating all
the keyword candidates from the continuous input speech. FIG. 3 shows an
example of the keyword lattice obtained by the keyword detection unit 21
from the continuous input speech in Japanese equivalent to the English
sentence "Three hamburgers, coffees, and potatoes, please", where the
shaded words are the keywords detected in this continuous input speech.
Here, it is to be noted that, in this FIG. 3, there is a correspondence
between the continuous input speech uttered in Japanese and the Japanese
equivalents of the keywords of the keyword lattice shown in FIG. 3, and
consequently there is no correspondence between the continuous input
speech as shown in FIG. 3 and the keywords of the keyword lattice as
shown in FIG. 3. In other
words, the keyword lattice shown in FIG. 3 is obtained for the Japanese
keywords in the continuous input speech uttered in Japanese, and the
English words appearing in the keyword lattice of FIG. 3 are
straightforward translations of the Japanese keywords. Consequently, for
the continuous input speech of "Three hamburgers, coffees, and potatoes,
please" uttered in English, the keyword lattice expressed in En