| United States Patent | 5946658 |
| Link to this page | http://www.wikipatents.com/5946658.html |
| Inventor(s) | Miyazawa; Yasunaga (Suwa, JP);
Inazumi; Mitsuhiro (Suwa, JP);
Hasegawa; Hiroshi (Suwa, JP);
Edatsune; Isao (Suwa, JP);
Urano; Osamu (Suwa, JP) |
| Abstract | A technique for improving speech recognition in low-cost, speech
interactive devices. This technique calls for selectively implementing a
speaker-specific word enrollment and detection unit in parallel with a
word detection unit to permit comprehension of spoken commands or messages
when no recognizable words are found. Preferably, specific speaker
detection will be based on the speaker's own personal list of words or
expressions. Other facets include complementing non-specific pre-registered
word characteristic information with individual, speaker-specific verbal
characteristics to improve recognition in cases where the speaker has
unusual speech mannerisms or accent, and response alteration in which
speaker-specification registration functions are leveraged to provide
access and permit changes to a predefined responses table according to
user needs and tastes. Also disclosed is the externalization and
modularization of non-specific speaker recognition, action and response
information to enhance adaptability of the speech recognizer without
sacrificing product cost competitiveness or overall device responsiveness. |
| Title | Cartridge-based, interactive speech recognition method with a response
creation capability |
| Publication Date |
August 31, 1999 |
| Filing Date |
October 2, 1998 |
| Parent Case |
CROSS REFERENCE TO RELATED APPLICATIONS
This is a Continuation of prior application Ser. No. 08/700,175 filed on
Aug. 20, 1996, now U.S. Pat. No. 5,842,168, which is a
continuation-in-part of Ser. No. 08/536,563 filed on Sep. 29, 1995 which
is now U.S. Pat. No. 5,794,204.
This application is related to copending application Ser. No. 08/700,181,
filed on Aug. 20, 1996, entitled "Voice Activated Interactive Speech
Recognition Device And Method", and copending application Ser. No.
08/699,874, filed on Aug. 20, 1996, entitled "Speech Recognition Device
And Processing Method", all commonly assigned with the present invention
to the Seiko Epson Corporation of Tokyo, Japan. This application is also
related to the following applications: application Ser. No. 08/078,027,
filed Jun. 18, 1993, entitled "Speech Recognition System", now abandoned;
application Ser. No. 08/641,268, filed Sep. 29, 1995, entitled "Speech
Recognition System Using Neural Networks", which is a continuation of
application Ser. No. 08/078,027 and which is now U.S. Pat. No. 5,751,904,
issued May 12, 1998; application Ser. No. 08/102,859, filed Aug. 6, 1993,
entitled "Speech Recognition Apparatus", now U.S. Pat. No. 5,481,644,
issued Jan. 2, 1996; application Ser. No. 08/485,134, filed Jun. 7, 1995,
entitled "Speech Recognition Apparatus Using Neural Network and Learning
Method Therefor", now U.S. Pat. No. 5,787,393, issued Jul. 28, 1998; and
application Ser. No. 08/536,550, filed Sep. 29, 1996, entitled
"Interactive Voice Recognition Method And Apparatus Using
Affirmative/Negative Content Discrimination"; all commonly assigned with
the present invention to the Seiko Epson Corporation of Tokyo, Japan. |
| Priority Data |
Aug. 21, 1995 [JP] 7-212249 |
Description  |
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates generally to speech recognition technology and is
particularly concerned with portable, intelligent, interactive devices
responsive to non-speaker specific commands or instructions.
2. Description of the Related Art
An example of conventional portable interactive speech recognition
equipment is a speech recognition toy. For example, the speech recognition
toy that was disclosed by the Japanese Laid Open Publication S62-253093
contains a plurality of pre-registered commands that are objects of
recognition. The equipment compares the voice signals emitted by the
children or others who are playing with the toy to voice signals
pre-registered by a specific speaker. If perceived voice happens to match
one or more of the pre-registered signals, the equipment generates a
pre-determined electrical signal corresponding to the matched voice
command, and causes the toy to perform specific operations based on the
electrical signal.
However, because these toys rely on a particular individual's speaking
characteristics (such as intonation, inflection, and accent) captured at a
particular point in time and recognize only a prestored vocabulary, they
quite frequently fail to recognize words and expressions spoken by another
person, and are apt not to tolerate even slight variations in
pronunciation by the registered speaker. These limitations typically lead
to misrecognition or nonrecognition errors which may frustrate or confuse
users of the toy, especially children, which, in turn, leads to disuse
once the initial novelty has worn off. Further, speaker and word
pre-registration is extremely time-consuming and cumbersome, since every
desired expression must be individually registered on a one-by-one basis
prior to use by a new speaker.
One potential solution may be to incorporate into such devices non-specific
speech recognition equipment which uses exemplars from a large population
of potential speakers (e.g. 200+ individuals). This technology does a much
better job in correctly recognizing a wide range of speakers, but it too
is limited to a predefined vocabulary. However, unlike speaker-specific
recognition equipment, the predefined vocabulary cannot be altered by the
user to suit individual needs or tastes. Further, proper implementation of
these non-speaker specific techniques for suitably large vocabularies
requires copious amounts of memory and processing power currently beyond
the means of most commercially available personal computers and digital
assistants, as typically each pre-registered word, along with every
speaker variation thereof, must be consulted in order to determine a
match. Accordingly, conventional non-speaker specific recognition simply
does not provide a practical recognition solution for the ultra-cost
sensitive electronic toy, gaming or appliance markets.
Moreover, although specific speech recognition devices can nevertheless
achieve relatively high recognition rates for a range of typical users,
they cannot always achieve a high recognition rate for all types of users.
For example, voice characteristics such as intonation and pitch vary
widely depending on the age and sex of the speaker. A speech recognition
device attuned to adult style speech may achieve extremely high
recognition rates for adults but may fail miserably with toddlers' voices.
Further, conventional non-specific speaker speech recognition could be
used by a wide range of people for wide-ranging purposes. Consider the
case of a speech recognition device used in an interactive toy context. In
this scenario, the degree and type of interaction must be rich and
developed enough to handle a wide age range from the toddler speaking his
or her first words to mature adolescents, and all the conversation content
variations and canned response variation must accommodate this broad range
of users in order to enhance the longevity and commercial appeal of such a
recognition toy. However, as already discussed, only limited memory and
processing resources can be devoted to speech recognition in order to make
such a speech recognition device cost effective and reasonably responsive.
So, heretofore a trade-off between hardware costs and responsiveness
versus interactivity has been observed in nonspecific speaker voice
recognizers.
It is, therefore, an object of the present invention to implement an
interactive speech recognition method and apparatus that can perform
natural-sounding conversations without increasing the number of
pre-registered words or canned responses characterized by conventional
canned matching type speech recognition. Moreover, it is a further object
of the present invention to incorporate recognition accuracy and features
approaching non-specific speaker speech recognition in a device relatively
simple in configuration, low in price, easily manufactured, and easily
adaptable to suit changing needs and uses. It is yet a further object of
the present invention to provide a highly capable, low-cost interactive
speech recognition method and apparatus which can be applied to a wide
range of devices such as toys, game machines and ordinary electronic
devices.
It is still a further object of the present invention to improve nonspecific
speaker recognition rates for a wider range of voices than heretofore
could be accommodated using conventional memory constructs. It is even a
further object of the present invention that a wider range of conversation
responses and detected phrases be accommodated on an as needed basis.
SUMMARY OF THE INVENTION
In accordance with these and related objects, the speech recognition
technique of the present invention includes: 1) voice analysis, which
generates characteristic voice data by analyzing perceived voice; 2)
non-specific speaker word identification, which reads the characteristic
voice data and outputs detected data corresponding to pre-registered words
contained within a word registry; 3) potentially, in addition to
nonspecific speaker word identification, specific-speaker word enrollment
that registers standard voice characteristic data for a select number of
words spoken by an individual speaker and outputs detected data when these
expressions are subsequently detected; 4) speech recognition and dialogue
management, which, based on either or both of non-specific and specific
speaker word identification, reads the detected voice data, comprehends its
meaning and determines a corresponding response; 5) voice synthesis, which
generates a voice synthesis output based on the determined response; and
6) voice output, which externally outputs the synthesized response.
According to the preferred embodiments, optional specific speaker word
registration outputs word identification data by DP-matching based on the
input voice from a specific speaker. It can comprise the following: an
initial word enrollment that creates standard patterns by reading
characteristic data relative to a specific speaker's prescribed voice
input from the voice analysis process; a standard pattern memory process
that stores the standard patterns created by the word enrollment process;
and a word detection process that outputs word detection data by reading
characteristic data relative to the specific-speaker's prescribed voice
input and by comparing the characteristic data with said standard
patterns. Further, specific speaker word enrollment comprises at least the
following: additional word enrollment that creates standard voice patterns
that are speaker-adapted based on the standard characteristic voice data
for non-specific speakers as spoken by the selected speaker along with
speaker-adapted standard pattern memory for storing both the standard
patterns that are speaker-adapted and those installed by speaker specific
word enrollment. Moreover, specific speaker word enrollment may read
characteristic data relative to the specific speaker's prescribed voice
input through voice analysis and output word detection data by comparing
the input characteristic data with the speaker-adapted standard patterns.
Further, the preferred embodiments may include a response creation
function. When a particular speaker wishes to add to or modify the
existing response list, the preferred embodiment can create response data
based on voice signals that have been input by a particular speaker and
register them according to instructions given by speech recognition and
dialogue management. This permits the creation of new and useful response
messages using the voices of a wide variety of people and allows a wide
variety of exchanges between the embodiment and users.
Moreover, according to the preferred embodiments of the present invention:
1) word registry storage, including standard pattern memories containing
the characteristic voice vectors for each registered word (either speaker
specific, non-speaker specific or a combination thereof); and/or 2)
conversation content storage for retaining canned context rules and
response procedures when recognized words or phrases are encountered;
and/or 3) response data storage for retaining response voice vector data
used in formulating an appropriate response to perceived and recognized
words and phrases and corresponding context and action rules, may
collectively or singularly reside within memory provided on a removable
cartridge external to and in communication with the speech recognition
processor. Of course, necessary protocol glue and buffering logic, along
with conventional bus architecture control drivers and protocols will be
included as necessary to permit proper (at least read-only) communications
between these cartridge memories and the various components of the speech
recognition processor, including, but not limited to, the word or phrase
identifier (preferably non-speaker specific), the speech recognition and
dialogue management unit, and the voice synthesis unit.
By offloading these memories and information onto a modular removable
cartridge and away from a central speech recognition processor, it becomes
possible to tailor conversations to users of various ages, backgrounds or
gender, as well as increase the available groups of pre-registered words
and/or responses, all without dramatically increasing memory size and
costly memory parts counts. Only a small additional expense will be
required to accommodate cartridge information transfer operations to the
speech processor, as well as engagement hardware to complete the
electrical interconnection between the cartridge memories and the main
speech recognition processing unit. Moreover, since it is anticipated that
the overall memory size of each cartridge approximates the memory size of
a conventional internalized memory speech recognition system, processing
matching speed and overall responsiveness should not be seriously impacted
by inclusion of the external cartridge paradigm. Again, here, the speech
recognition processing unit in this embodiment may be required to
implement additional communication overhead in order to communicate with
the coupled memory cartridge, but incorporating such additional processing
burdens is more than outweighed by the benefits of modularity and
adaptability secured by including recognition, context and response
information on removable storage such as the memory cartridge.
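As a rough illustration of the modular arrangement described above, the following Python sketch models a cartridge that may carry any combination of the three memories, with the processor using its built-in memory for anything the cartridge omits. The class names, field names, sample entries, and the fallback policy are assumptions made purely for illustration; the text above does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cartridge:
    # Any of the three memories may be absent from a given cartridge.
    word_registry: Optional[dict] = None         # word -> standard pattern
    conversation_content: Optional[dict] = None  # context and response rules
    response_data: Optional[dict] = None         # response id -> voice data


@dataclass
class SpeechProcessor:
    # Built-in memories, used when no cartridge supplies a replacement
    # (hypothetical fallback policy).
    word_registry: dict
    conversation_content: dict
    response_data: dict
    cartridge: Optional[Cartridge] = None

    def active_memory(self, name: str) -> dict:
        """Return the cartridge's copy of a memory if present, else the built-in one."""
        external = None
        if self.cartridge is not None:
            external = getattr(self.cartridge, name)
        return external if external is not None else getattr(self, name)


processor = SpeechProcessor(
    word_registry={"ohayou": "builtin-pattern"},
    conversation_content={"ohayou": "greet"},
    response_data={"greet": "builtin-voice"},
)
# A cartridge that externalizes only the response data.
processor.cartridge = Cartridge(response_data={"greet": "cartridge-voice"})
```

With this cartridge attached, `processor.active_memory("response_data")` comes from the cartridge while the word registry and conversation content still come from the built-in memories, so swapping cartridges changes the responses without touching the processor.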
Thus, one aspect of the present invention couples simple non-specific
speaker speech recognition with specific speaker expression enrollment and
detection. Further, non-specific pre-registered words can be
speaker-adapted to permit more accurate and quicker recognition. In
certain situations, some words are recognized and other words are not
depending on the manner in which a particular speaker utters them. With
some speakers, no non-specific pre-registered words can be recognized at
all. In such cases, words that fail to be recognized can be enrolled using
a specific-speaker voice enrollment function. This virtually eliminates
words that cannot be recognized and thus substantially improves the
overall recognition capability of the equipment. This function also allows
specific speakers to enroll new words suited to the user's individual
needs and tastes which are not included in the non-specific word registry.
Further, the preferred embodiments may include a response creation function
which permits alteration or additions to a predefined response list,
thereby improving its depth and range of usefulness.
Moreover, the non-speaker specific or speaker-specific word registries,
recognition contextual rules, conversation response action rules, and
audible response information may all be stored singularly or in
combination on external cartridge memory to accommodate wider ranges of
speakers and applications having disparate conversation sets without
significantly impacting device cost or composite recognition performance.
This is true, even though the rest of the speech recognition processing
equipment may be unitized to reduce cost and ease manufacturability. If,
in the case of a toy application, a cartridge is used to store
recognition, conversation control and response information, the toy can
adapt and grow with the child, even when "canned" non-speaker specific
phrase identification techniques are utilized. Also, the recognition
registry, conversation and response information can be changed or updated
as the general culture changes, thereby greatly increasing the longevity
and usefulness of the cartridge-equipped speech recognition apparatus. Of
course, the cartridge information can also be used to broaden potential
speakers and maintain acceptable recognition rates by tailoring the
"canned" non-speaker specific registration list to particular dialects,
regional lingual idiosyncrasies or even different languages. In such
cases, a given speaker may simply select and connect the most appropriate
cartridge for his or her own inflections, accent or language.
Other objects and attainments together with a fuller understanding of the
invention will become apparent and appreciated by referring to the
following description of the presently preferred embodiments and claims
taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings, wherein like reference symbols refer to like parts:
FIG. 1 is an overall block diagram of the first preferred embodiment;
FIGS. 2A-2E diagrammatically illustrate a sample input voice waveform and
resultant word lattice generated by the non-specific speaker word
identification unit of the embodiment shown in FIG. 1;
FIG. 3 illustrates an example setup switch according to the first and
second preferred embodiments;
FIGS. 4A-4E diagrammatically illustrate another sample input voice waveform
and resultant word lattice generated by the non-specific speaker word
identification unit of the embodiment shown in FIG. 1;
FIG. 5 shows an example response table stored in the response data memory
unit of the embodiment shown in FIG. 1;
FIG. 6 is an overall block diagram of a second preferred embodiment;
FIGS. 7A-7E diagrammatically illustrate a sample input voice waveform and
resultant word lattice generated by both the specific and non-specific
speaker word identification and enrollment units of the embodiment shown
in FIG. 6;
FIG. 8 is an overall block diagram of a third preferred embodiment;
FIG. 9 illustrates an example setup switch according to the embodiment
shown in FIG. 8;
FIG. 10 shows an example response table stored in the response data memory
unit of the embodiment shown in FIG. 8;
FIG. 11 is an overall block diagram of a fourth embodiment of the present
invention explaining modularized recognition, conversation control and
response information according to the present invention;
FIG. 12 is a more detailed block diagram of the embodiment of FIG. 11;
FIG. 13 is an alternative detailed block diagram of the embodiment shown in
FIG. 11 wherein only phrase registry information is contained on the
cartridge;
FIG. 14 is another detailed block diagram showing yet another alternative
configuration of the embodiment of FIG. 11 wherein only context and
conversation response information, along with response data, is
externalized to the cartridge; and
FIG. 15 is yet another detailed block diagram depicting still another
alternative configuration of the embodiment of FIG. 11 wherein only
response data is maintained external to the speech recognition response
processor.
DESCRIPTION OF THE PREFERRED EMBODIMENTS
As depicted in the figures, the presently preferred embodiments exemplify
speech recognition techniques of the present invention as applied to an
inexpensive voice-based toy, gaming device, or similar interactive
appliance. Though one having ordinary skill in the speech recognition art
will recognize that the teachings of the present invention are not so
limited, the presently preferred embodiments can be conveniently
implemented as a stand-alone speech recognition device residing within a
stuffed doll such as a dog, cat or bear suitable for young children.
FIG. 1 shows a configuration diagram that depicts the first preferred
embodiment of the present invention. In FIG. 1, the following components
are designed to recognize words spoken by non-specific speakers and to
generate response messages according to the results of the recognition:
voice input unit 1, which inputs the speaker's voice; voice analysis unit
2, which outputs characteristic voice data by analyzing the input voice;
non-specific speaker word identification unit 3, which reads the
characteristic voice data from voice analysis unit 2 and outputs the
detected data corresponding to the registered words contained in the input
voice, based on a non-specific speaker's standard characteristic voice
data relative to pre-registered recognizable words; speech recognition and
dialogue management unit 4; response data memory unit 5, which stores
pre-set response data; voice synthesis unit 6; and voice output unit 7.
Also shown in FIG. 1, a specific-speaker word registration means 8 is
provided that registers the standard characteristic voice data on the
words uttered by a specific speaker based on the specific speaker's input
voice and that outputs word detection data on the specific speaker's input
voice. Further, setup switch 9 is provided to serve as a data input setup
means for performing various data input setup actions by an individual
user.
The non-specific speaker word identification unit 3 preferably comprises the
following: standard pattern memory unit 31, which stores standard voice
vector patterns or standard characteristic voice data that correspond to
each pre-registered word contained in the word registry; and word
detection unit 32, which generates word detection data preferably in the
form of a word lattice by reading characteristic voice data from voice
analysis unit 2 and by comparing them against the standard non-specific
speaker patterns contained in the standard pattern memory unit 31.
The standard pattern memory unit 31 stores (registers) standard patterns of
target-of-recognition words that are created beforehand using the voices
of a large number of speakers (e.g., 200 people) for each of the words.
Since these embodiments are directed to a low-cost toy or novelty,
approximately 10 words are chosen as target-of-recognition words. Although
the words used in the embodiment are mostly greeting words such as the
Japanese words "Ohayou" meaning "good morning", "oyasumi" meaning "good
night", and "konnichiwa" meaning "good afternoon", the present invention
is, of course, by no means limited to these words or to merely the
Japanese language. In fact, various words in English, French or other
languages can be registered, and the number of registered words is not
limited to 10. Though not shown in FIG. 1, word detection unit 32 is
principally composed of a processor (the CPU) and ROM that stores the
processing program. Its function is to determine on what confidence level
the words registered in standard pattern memory unit 31 occur in the input
voice, and will be described in more detail hereinbelow.
On the other hand, specific-speaker word enrollment unit 8 preferably
comprises the following: word enrollment unit 81; standard pattern memory
unit 82, which stores input voice standard patterns as the standard
characteristic voice data on the input voice; and word detection unit 83.
In this embodiment, the specific-speaker word enrollment unit registers
the words uttered by specific speakers by entering their voice signals and
outputting the detected data in the form of a word lattice for
specific-speaker registered words relative to the input voice. In this
example, it is assumed that the input voice is compared with registered
standard voice patterns by DP-matching, and word identification data is
output from word detection unit 83 based on the results of the comparison.
The registration of words by specific-speaker word enrollment unit 8 can
be performed by setting the word registration mode using setup switch 9,
as will be discussed in greater detail hereinbelow.
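To make the DP-matching comparison mentioned above more concrete, the following Python sketch aligns an input feature sequence against each enrolled standard pattern by dynamic programming and reports the closest word. This is a generic textbook formulation (often called dynamic time warping), not necessarily the matching procedure used in the embodiment; the Euclidean local distance, the three-way step pattern, and the function names `dp_distance` and `detect_word` are all assumptions.

```python
import math


def dp_distance(pattern, observed):
    """Accumulated distance of the best time alignment between two vector sequences."""
    n, m = len(pattern), len(observed)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning pattern[:i] with observed[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(pattern[i - 1], observed[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # pattern frame advances alone
                cost[i][j - 1],      # observed frame advances alone
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


def detect_word(standard_patterns, observed):
    """Return (word, distance) for the enrolled pattern closest to the observed input."""
    return min(
        ((word, dp_distance(p, observed)) for word, p in standard_patterns.items()),
        key=lambda pair: pair[1],
    )
```

Because the alignment may stretch or compress time, an enrolled word spoken more slowly than at registration can still match its standard pattern with a small accumulated distance.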
Still referring to FIG. 1, voice input unit 1 is composed of the following
conventional sub-components which are not shown in the figure: a
microphone, an amplifier, a low-pass filter, an A/D converter, and so
forth. The voice which is input from the microphone is converted into an
appropriate audio waveform after the voice is allowed to pass through the
amplifier and the low-pass filter. The audio waveform is then converted
into digital signals (e.g., 12 kHz sampling rate at 16 bit resolution) by
the A/D converter and is output to voice analysis unit 2. Voice analysis
unit 2 takes the audio waveform signals transmitted from voice input unit
1 and uses a processor (the CPU) to perform a frequency analysis at short
time intervals, extracts characteristic vectors (commonly LPC-Cepstrum
coefficients) of several dimensions that express the characteristic of the
frequency, and outputs the time series of the characteristic vectors
(hereinafter referred to as "characteristic voice vector series"). It
should be noted that non-specific speaker word identification unit 3
can be implemented using the hidden Markov model (HMM) method or the
DP-matching method. However, in this example keyword-spotting processing
technology using the dynamic recurrent neural network (DRNN) method is
used as disclosed by Applicants in U.S. application Ser. No. 08/078,027,
filed Jun. 18, 1993, entitled "Speech Recognition System", commonly
assigned with the present invention to Seiko-Epson Corporation of Tokyo,
Japan, which is incorporated fully herein by reference. Also, this method
is disclosed in the counterpart laid open Japanese applications H6-4097
and H6-119476. DRNN is preferably used in order to perform speech
recognition of virtually continuous speech by non-specific speakers and to
output word identification data as described herein.
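The first step of the analysis described above, cutting the digitized waveform into short intervals before computing a characteristic vector for each one, can be sketched in Python as follows. The 12 kHz rate comes from the text; the 20 ms frame length, the 10 ms hop, and the function name `split_into_frames` are assumptions, since the text says only "short time intervals". The feature extraction itself (for example, LPC-Cepstrum coefficients) is not shown.

```python
def split_into_frames(samples, sample_rate_hz=12_000, frame_ms=20, hop_ms=10):
    """Split a sampled waveform into overlapping short-time frames."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    hop_len = sample_rate_hz * hop_ms // 1000      # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# One second of silence at 12 kHz, used only to show the resulting frame count.
one_second = [0] * 12_000
frames = split_into_frames(one_second)
```

Each frame would then be reduced to one characteristic vector, and the sequence of those vectors forms the characteristic voice vector series passed to word detection.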
The following is a brief explanation of the specific processing performed
by non-specific speaker word identification unit 3 with reference to
FIGS. 2A-2E. Word detection unit 32 determines the confidence level at
which a word registered in standard pattern memory unit 31 occurs at a
specific location in the input voice. Now, suppose that the speaker inputs
an example Japanese language phrase "asu no tenki wa . . . " meaning
"Concerning tomorrow's weather". Assume that in this case the stylized
voice signal shown in FIG. 2A represents the audio waveform for this
expression.
In the expression "asu no tenki wa . . . ", the contextual keywords include
"asu" (tomorrow) and "tenki" (weather). These are stored in the form of
patterns in standard pattern memory unit 31 as part of a
predetermined word registry, which in this case contains approximately
10 words. If 10 words are registered, signals are output in order to detect
words corresponding to these 10 words (designated word 1, word 2, word 3 .
. . up to word 10). From the information such as detected signal values,
the word identification unit determines the confidence level at which the
corresponding words occur in the input voice.
More specifically, if the word "tenki" (weather) occurs in the input voice
as word 1, the detection signal that is waiting for the signal "tenki"
(weather) rises at the portion "tenki" in the input voice, as shown in
FIG. 2B. Similarly, if the word "asu" (tomorrow) occurs in the input voice
as word 2, the detection signal that is waiting for the signal "asu" rises
at the portion "asu" in the input voice, as shown in FIG. 2C. In FIGS. 2B
and 2C, the numerical values 0.9 and 0.8 indicate respective confidence
levels that the spoken voice contains the particular pre-registered
keyword. The relative level or magnitude of this level can fluctuate
between approximately 0 and 1.0, with 0 indicating a nearly zero confidence
match factor and 1.0 representing a 100% confidence match factor. In the case of
a high confidence level, such as 0.9 or 0.8, the registered word having a
high confidence level can be considered to be a recognition candidate
relative to the input voice. Thus, the registered word "asu" occurs with a
confidence level of 0.8 at position w1 on the time axis. Similarly, the
registered word "tenki" occurs with a confidence level of 0.9 at position
w2 on the time axis.
Also, the example of FIGS. 2A-2E shows that, when the word "tenki" (weather)
is input, the signal that is waiting for word 3 (word 3 is assumed to be
the registered word "nanji" ("What time . . . ")) also rises at position w2
on the time axis with an uncertain confidence level of approximately 0.6.
Thus, if two or more registered words exist as recognition candidates at
the same time relative to an input voice signal, the recognition candidate
word is determined by one of two methods: 1) selecting the potential
recognition candidate with the highest degree of similarity to the input
voice, using confidence level comparisons, as the actually recognized
keyword; or 2) selecting one of the words as the recognized word by
consulting a correlation table, created beforehand, that expresses
correlation rules between words. In this case, the confidence level for
"tenki" (weather) indicates that it has the highest degree of similarity
to the input voice during time portion w2 on the time axis even though
"nanji" can be recognized as a potential recognition candidate. Based on
these confidence levels, the speech recognition and dialogue management
unit 4 performs the recognition of input voices.
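The two ways of resolving competing candidates at the same time position can be sketched in Python as follows. The confidence values for "tenki" and "nanji" are those of the example above; the function name `select_keyword`, the format of the correlation rules (a context word mapped to the word it favors), and the rule shown in the usage note are hypothetical, since the text does not specify how the correlation table is organized.

```python
def select_keyword(candidates, correlation_rules=None, context=None):
    """Pick one word among candidates competing for the same time position.

    candidates: dict mapping word -> confidence level (0 to 1).
    correlation_rules: optional dict mapping an already recognized context
        word to the candidate it favors (hypothetical table format).
    """
    if correlation_rules and context in correlation_rules:
        preferred = correlation_rules[context]
        if preferred in candidates:
            return preferred
    # Default method: the candidate with the highest confidence level wins.
    return max(candidates, key=candidates.get)


# Candidates at position w2 in the example of FIGS. 2A-2E.
at_w2 = {"tenki": 0.9, "nanji": 0.6}
```

Here `select_keyword(at_w2)` chooses "tenki" by confidence alone. With an illustrative rule such as `{"asu": "tenki"}` and `context="asu"`, "tenki" would be chosen even if its confidence were slightly lower than that of "nanji".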
Collectively, the detection information, including starting and ending
points on the time axis and the maximum magnitude of the detection signal
indicating the confidence level, for each pre-registered word contained in
non-specific speaker word registry within standard pattern memory unit 31
is known as a word lattice.
In FIGS. 2B-2E, only a partial lattice is shown for the sake of clarity,
but a word lattice including detection information for every
pre-registered non-specific word is in fact generated by the word
detection unit 32.
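One way to picture a word lattice entry is as a small record derived from a word's detection signal. The Python sketch below is a simplification under stated assumptions: the per-frame signal values, the 0.5 threshold used to mark the starting and ending points, the dictionary layout, and the choice to return `None` for a word that never rises are all illustrative and not taken from the text.

```python
def lattice_entry(word, detection_signal, threshold=0.5):
    """Summarize one word's detection signal as start, end, and confidence.

    detection_signal: list of per-frame signal values between 0 and 1.
    Returns None when the signal never reaches the threshold.
    Simplification: the span runs from the first to the last frame at or
    above the threshold, even if the signal dips in between.
    """
    active = [i for i, value in enumerate(detection_signal) if value >= threshold]
    if not active:
        return None
    return {
        "word": word,
        "start": active[0],
        "end": active[-1],
        "confidence": max(detection_signal),
    }


# Made-up detection signals loosely following the "asu no tenki wa" example.
signals = {
    "asu":   [0.1, 0.8, 0.7, 0.1, 0.0, 0.0, 0.0],
    "tenki": [0.0, 0.0, 0.1, 0.2, 0.9, 0.8, 0.1],
    "nanji": [0.0, 0.0, 0.0, 0.1, 0.6, 0.5, 0.0],
}
word_lattice = [lattice_entry(w, s) for w, s in signals.items()]
```

In this toy lattice "tenki" and "nanji" occupy the same frames with confidence levels 0.9 and 0.6, which is the overlapping-candidate situation discussed above.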
Though not shown in FIG. 1, speech recognition and dialogue management unit
4 is principally composed of a processor and ROM that stores the
processing program and performs the processing tasks described below.
Different CPUs may be provided in the individual units or, alternatively,
one CPU can perform the processing tasks for the different units.
Speech recognition and dialogue management unit 4 selects a recognition
word output from either non-specific word detection unit 32 or specific
speaker word detection unit 83. Based on the composite word lattice, the
speech recognition and dialogue management unit recognizes a voice
(comprehending the overall meaning of the input voice), references
response data memory unit 5, determines a response according to the
comprehended meaning of the input voice, and transmits appropriate
response information and control overhead to both voice synthesis unit 6
and voice output unit 7.
For example, when the detected data or partial word lattice shown in FIGS.
2B-2E is relayed from word detection unit 32, the speech recognition and
dialogue management unit determines one or more potential recognition
candidates denoted in the word lattice as a keyword occurring in the
input. In this particular example, since the input voice is "asu no tenki
wa" (the weather tomorrow), the words "asu" (tomorrow) and "tenki"
(weather) are detected. From the keywords "asu" and "tenki", the speech
recognition and dialogue management unit understands the contents of the
continuous input voice "asu no tenki wa".
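The keyword-based comprehension described above can be sketched as follows. This is a hedged illustration under stated assumptions: the confidence threshold, the rule that overlapping detections are resolved in favor of the more confident one, and the `MEANINGS` table are hypothetical and are not taken from the disclosure.

```python
from typing import Dict, FrozenSet, List, Tuple

# (word, start, end, confidence); values are illustrative only
Entry = Tuple[str, float, float, float]


def spot_keywords(lattice: List[Entry], threshold: float = 0.5) -> List[str]:
    """Pick confident, non-overlapping words, returned in time order."""
    chosen: List[Entry] = []
    # Consider the most confident detections first (assumed tie-breaking rule).
    for entry in sorted(lattice, key=lambda e: e[3], reverse=True):
        _, start, end, conf = entry
        if conf < threshold:
            continue
        overlaps = any(start < c[2] and c[1] < end for c in chosen)
        if not overlaps:
            chosen.append(entry)
    return [e[0] for e in sorted(chosen, key=lambda e: e[1])]


# Hypothetical table from keyword sets to a comprehended meaning.
MEANINGS: Dict[FrozenSet[str], str] = {
    frozenset({"asu", "tenki"}): "ask_weather_tomorrow",
    frozenset({"ohayou"}): "greeting_morning",
}


def comprehend(lattice: List[Entry]) -> str:
    """Map the spotted keyword set to a meaning, or "unknown" if none matches."""
    return MEANINGS.get(frozenset(spot_keywords(lattice)), "unknown")


lattice = [
    ("asu", 0.0, 0.3, 0.9),
    ("ohayou", 0.1, 0.5, 0.2),
    ("tenki", 0.6, 1.0, 0.8),
]
```

With this sample lattice, the words "asu" and "tenki" are spotted and the weak "ohayou" detection is discarded, so the utterance is treated as a question about tomorrow's weather.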
The speech recognition processing of virtually continuous voice by keyword
spotting processing, as described above, is applicable to other languages
as well as to Japanese. If the language to be used is English, for
instance, some of the recognizable words that can be registered might be
"good morning", "time", "tomorrow", and "good night". The characteristic
data on these recognizable registered words is stored in standard pattern
memory unit 31. If the speaker asks "What time is it now?", the word "time" in
the clause "what time is it now" is used as a keyword in this case. When
the word "time" occurs in the input voice, the detection signal that is
waiting for the word "time" rises at the portion "time" in the input
voice. When detected data (word lattice) from word detection unit 32 is
input, one or more words in the input voice are determined to be keywords.
Since in this example the input voice is "what time is it now", "time" is
detected as a keyword, and the speech recognition and dialogue management
unit understands the contents of the continuous input voice "what time is
it now?"
The above description concerns the case where word data is output from
non-specific speaker word identification unit 3, i.e., the words spoken by
the speaker are recognized. With some speakers, however, words like the
Japanese expression "Ohayou" (good morning) totally fail to be recognized.
Although in some cases changing the way words are spoken can solve the
problem, some speakers with voice idiosyncrasies entirely fail to be
recognized. In such cases, the words that fail to be recognized can be
registered as specific-speaker words. This feature is described below.
Referring still to FIG. 1, setup switch 9 is used to register
specific-speaker words. As shown in FIG. 3, setup switch 9 preferably
comprises number key unit 91, start-of-registration button 92,
end-of-registration button 93, response message selection button 94,
end-of-response message registration button 95, and response number input
button 96. Buttons such as response message selection button 94,
end-of-response message registration button 95, and response number input
button 96 will be described in more detail hereinbelow.
By way of example, this section explains the case where the word "Ohayou"
(good morning) is registered as a specific-speaker word because it is not
recognized. First, start-of-registration button 92 on setup switch 9 is
pushed. This button operation forces speech recognition and dialogue
management unit 4 to enter into specific-speaker word registration mode.
Normal recognition operations are not performed in this word registration
mode.
Suppose that the speaker enters the number for the word "Ohayou" (good
morning) (each registered word that is known to be recognizable is
preferably assigned a number) from number key unit 91, and "Ohayou" (good
morning) is number 1, for example. Then, when the speaker presses the
numeric key "1", speech recognition and dialogue management unit 4 detects
that the speaker is trying to register the word "Ohayou" (good morning)
and causes the device to output the prompt "Say 'good morning'". When the
speaker says "Ohayou" (good morning) in response to this prompt, the voice
is transmitted from voice input unit 1 to voice analysis unit 2. The
characteristic vector produced by the voice analysis is transmitted to word
enrollment unit 81. Word enrollment unit 81 creates a standard pattern for
the input voice as standard characteristic voice data. The standard pattern
is then stored in standard pattern memory unit 82.
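The registration sequence described above can be sketched as a small state model. This is an illustration only: the class `Registrar`, its method names, and the `WORD_NUMBERS` table are hypothetical, and the prompt text is simplified to use the registered word itself.

```python
from typing import Dict, List, Optional

# Hypothetical numbering of registered words; the text gives only "Ohayou" = 1.
WORD_NUMBERS: Dict[int, str] = {1: "Ohayou"}


class Registrar:
    """Toy model of the specific-speaker word registration mode."""

    def __init__(self) -> None:
        self.registering = False
        self.pending: Optional[str] = None
        self.patterns: Dict[str, List[List[float]]] = {}

    def press_start(self) -> None:
        # Start-of-registration button: normal recognition is suspended.
        self.registering = True

    def press_number(self, number: int) -> str:
        # Number key: select the word to register and return the prompt.
        if not self.registering:
            raise RuntimeError("not in registration mode")
        self.pending = WORD_NUMBERS[number]
        return f"Say '{self.pending}'"

    def hear(self, features: List[List[float]]) -> None:
        # Store the analyzed utterance as the word's standard pattern.
        if self.registering and self.pending is not None:
            self.patterns[self.pending] = features
            self.pending = None

    def press_end(self) -> None:
        # End-of-registration button: return to normal recognition.
        self.registering = False
```

A typical sequence would be `press_start()`, `press_number(1)`, `hear(...)` with the analyzed characteristic vectors, and finally `press_end()`.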
The characteristic pattern that is registered as described above can be a
standard pattern that uses the characteristic vector column of the word
"Ohayou" (good morning) exactly as uttered by the speaker. Alternatively,
the speaker can say "Ohayou" (good morning) several times; the individual
characteristic vector columns are then averaged to obtain a standard
characteristic vector column, from which the standard pattern is created.
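The averaging alternative can be sketched as follows. The text does not state how utterances of different lengths are aligned before averaging, so this example assumes a simple nearest-frame resampling to a fixed number of frames. That alignment step, the function names, and the sample data are assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Sequence = List[Vector]  # one characteristic vector per analysis frame


def resample(seq: Sequence, frames: int) -> Sequence:
    """Pick `frames` vectors spread evenly over a non-empty sequence (assumed alignment)."""
    if frames == 1:
        return [seq[0]]
    last = len(seq) - 1
    return [seq[round(i * last / (frames - 1))] for i in range(frames)]


def average_pattern(utterances: List[Sequence], frames: int) -> Sequence:
    """Average several utterances frame by frame into one standard pattern."""
    aligned = [resample(u, frames) for u in utterances]
    pattern: Sequence = []
    for t in range(frames):
        dims = len(aligned[0][t])
        pattern.append([
            sum(u[t][d] for u in aligned) / len(aligned) for d in range(dims)
        ])
    return pattern


# Two illustrative utterances of the same word, two frames of two features each.
utterances = [
    [[0.0, 2.0], [2.0, 4.0]],
    [[2.0, 0.0], [4.0, 2.0]],
]
pattern = average_pattern(utterances, frames=2)
```

Averaging several utterances in this way would be expected to smooth out variation between individual utterances, at the cost of requiring the speaker to repeat the word.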
In this manner, words that are uttered by a specific speaker and that
cannot be recognized can be registered. Naturally, the registration
technique can be applied to any unrecognizable word, not just "Ohayou"
(good morning).
The following describes specific examples of conversations between a
speaker and the first preferred embodiment. In the speaker's utterances,
the words enclosed in brackets indicate keywords used for character
recognition.
Suppose that the speaker says "[Ohayou] gozaimasu" meaning "[Good morning]
to you . . . ". The voice "Ohayou" is transmitted from voice input unit 1
to voice analysis unit 2, where a voice-analyzed characteristic vector is
generated. At this time, word detection unit 32 of non-specific speaker
word identification unit 3 and word detection unit 83 of specific speaker
word enrollment unit 8 are both waiting for a signal from voice analysis
unit 2. Word detection units 32 and 83 each output word detection data in
the form of the aforementioned word lattice that corresponds to the output
from voice analysis unit 2. However, the numeric v