WikiPatents - Community Patent Review
Cartridge-based, interactive speech recognition method with a response creation capability    
United States Patent 5946658
Link to this page: http://www.wikipatents.com/5946658.html
Inventor(s): Miyazawa; Yasunaga (Suwa, JP); Inazumi; Mitsuhiro (Suwa, JP); Hasegawa; Hiroshi (Suwa, JP); Edatsune; Isao (Suwa, JP); Urano; Osamu (Suwa, JP)
Abstract: A technique for improving speech recognition in low-cost, speech interactive devices. This technique calls for selectively implementing a speaker-specific word enrollment and detection unit in parallel with a word detection unit to permit comprehension of spoken commands or messages when no recognizable words are found. Preferably, specific speaker detection will be based on the speaker's own personal list of words or expressions. Other facets include complementing non-specific pre-registered word characteristic information with individual, speaker-specific verbal characteristics to improve recognition in cases where the speaker has unusual speech mannerisms or accent, and response alteration, in which speaker-specific registration functions are leveraged to provide access and permit changes to a predefined response table according to user needs and tastes. Also disclosed is the externalization and modularization of non-specific speaker recognition, action and response information to enhance adaptability of the speech recognizer without sacrificing product cost competitiveness or overall device responsiveness.
Title Information
Owner/Assignee     Seiko Epson Corporation (Tokyo, JP)
Publication Date     August 31, 1999
Application Number     09/165,512
Filing Date     October 2, 1998
US Classification     704/275 704/244 704/251 704/258
Int'l Classification     G10L 009/06 G10L 005/02
Examiner     Hudspeth; David R.
Assistant Examiner     Smits; Talivaldis Ivars
Attorney/Law Firm     Gabrik; Michael T.
Parent Case     CROSS REFERENCE TO RELATED APPLICATIONS This is a Continuation of prior application Ser. No. 08/700,175 filed on Aug. 20, 1996, now U.S. Pat. No. 5,842,168, which is a continuation-in-part of Ser. No. 08/536,563 filed on Sep. 29, 1995 which is now U.S. Pat. No. 5,794,204. This application is related to copending application Ser. No. 08/700,181, filed on Aug. 20, 1996, entitled "Voice Activated Interactive Speech Recognition Device And Method", and copending application Ser. No. 08/699,874, filed on Aug. 20, 1996, entitled "Speech Recognition Device And Processing Method", all commonly assigned with the present invention to the Seiko Epson Corporation of Tokyo, Japan. This application is also related to the following applications: application Ser. No. 08/078,027, filed Jun. 18, 1993, entitled "Speech Recognition System", now abandoned; application Ser. No. 08/641,268, filed Sep. 29, 1995, entitled Speech Recognition System Using Neural Networks, which is a continuation of application Ser. No. 08/078,027 and which is now U.S. Pat. No. 5,751,904, issued May 12, 1998; application Ser. No. 08/102,859, filed Aug. 6, 1993, entitled "Speech Recognition Apparatus", now U.S. Pat. No. 5,481,644, issued Jan. 2, 1996; application Ser. No. 08/485,134, filed Jun. 7, 1995, entitled "Speech Recognition Apparatus Using Neural Network and Learning Method Therefor", now U.S. Pat. No. 5,787,393, issued Jul. 28, 1998; and application Ser. No. 08/536,550, filed Sep. 29, 1996, entitled "Interactive Voice Recognition Method And Apparatus Using Affirmative/Negative Content Discrimination"; all commonly assigned with the present invention to the Seiko Epson Corporation of Tokyo, Japan.
Priority Data     Aug 21, 1995[JP]7-212249
USPTO Field of Search     704/244 704/251 704/258 704/275
References
U.S. References

Patent No.   Inventor      U.S. Class   Date
5842168      Miyazawa      704/275      Nov. 1998
5577165      Takebayashi   704/275      Nov. 1996
5548681      Gleaves       704/233      Aug. 1996
5384892      Strong        704/243      Jan. 1995
5357596      Takebayashi   704/275      Oct. 1994
4984177      Rondel        704/277      Jan. 1991
4305131      Best          715/716      Dec. 1981
Claims


What is claimed is:

1. A method for performing interactive speech recognition processing, comprising the steps of:

receiving voice and translating the received voice into digital form;

generating characteristic voice data for the received digitized voice;

determining whether the characteristic voice data substantially matches standard characteristic voice information corresponding to pre-registered expressions and generating phrase identification data in response thereto, wherein the pre-registered expressions are stored as standard speech patterns capable of recognition in a removable cartridge releasably communicating with said phrase identification unit, said removable cartridge comprising a first memory to retain the standard speech patterns;

recognizing a meaning from the received voice based on the received phrase identification data and formulating an appropriate response corresponding to the recognized meaning;

enabling the creation of response data based on inputted information; and

generating synthesized audio corresponding to the appropriate response formulated in said recognizing and formulating step.

2. The method of claim 1, wherein said removable cartridge includes a second memory to retain conversation content data used to recognize the meaning from the received and recognized voice.

3. The method of claim 2, wherein said removable cartridge includes a third memory to retain response data used to formulate and synthesize the appropriate response to the received and recognized voice.

4. The method of claim 3, wherein said first, second and third cartridge memories reside within at least one ROM device.

5. The method of claim 3, wherein said first, second and third cartridge memories reside within at least one EEPROM device.

6. The method of claim 1, wherein said removable cartridge includes a second memory to retain response data used to formulate and synthesize the appropriate response to the received and recognized voice.

7. A method for performing interactive speech recognition processing, comprising the steps of:

receiving voice and translating the received voice into digital form;

generating characteristic voice data for the received digitized voice;

determining whether the characteristic voice data substantially matches standard characteristic voice information corresponding to pre-registered expressions and generating phrase identification data in response thereto;

recognizing a meaning from the received voice based on the received phrase identification data and conversation content information stored in a first memory of a removable cartridge releasably communicating therewith, and formulating an appropriate response corresponding to the recognized meaning;

enabling the creation of response data based on inputted information; and

generating synthesized audio corresponding to the appropriate response formulated in said recognizing and formulating step.

8. The method of claim 7, wherein said removable cartridge includes a second memory to retain response data used to formulate and synthesize the appropriate response to the received and recognized voice.

9. The method of claim 8, wherein said first and second cartridge memories reside within at least one ROM device.

10. The method of claim 8, wherein said first and second cartridge memories reside within at least one EEPROM device.
Description


BACKGROUND OF THE INVENTION

1. Field of the Invention

This invention relates generally to speech recognition technology and is particularly concerned with portable, intelligent, interactive devices responsive to non-speaker specific commands or instructions.

2. Description of the Related Art

An example of conventional portable interactive speech recognition equipment is a speech recognition toy. For example, the speech recognition toy that was disclosed by the Japanese Laid Open Publication S62-253093 contains a plurality of pre-registered commands that are objects of recognition. The equipment compares the voice signals emitted by the children or others who are playing with the toy to voice signals pre-registered by a specific speaker. If perceived voice happens to match one or more of the pre-registered signals, the equipment generates a pre-determined electrical signal corresponding to the matched voice command, and causes the toy to perform specific operations based on the electrical signal.

However, because these toys rely on a particular individual's speaking characteristics (such as intonation, inflection, and accent) captured at a particular point in time and recognize only a prestored vocabulary, they quite frequently fail to recognize words and expressions spoken by another person, and are apt not to tolerate even slight variations in pronunciation by the registered speaker. These limitations typically lead to misrecognition or nonrecognition errors which may frustrate or confuse users of the toy, especially children, which, in turn, leads to disuse once the initial novelty has worn off. Further, speaker and word pre-registration is extremely time-consuming and cumbersome, since every desired expression must be individually registered on a one-by-one basis prior to use by a new speaker.

One potential solution may be to incorporate into such devices non-specific speech recognition equipment which uses exemplars from a large population of potential speakers (e.g. 200+ individuals). This technology does a much better job in correctly recognizing a wide range of speakers, but it too is limited to a predefined vocabulary. However, unlike speaker-specific recognition equipment, the predefined vocabulary cannot be altered by the user to suit individual needs or tastes. Further, proper implementation of these non-speaker specific techniques for suitably large vocabularies requires copious amounts of memory and processing power currently beyond the means of most commercially available personal computers and digital assistants, as typically each pre-registered word, along with every speaker variation thereof, must be consulted in order to determine a match. Accordingly, conventional non-speaker specific recognition simply does not provide a practical recognition solution for the ultra-cost sensitive electronic toy, gaming or appliance markets.

Moreover, although non-specific speaker speech recognition devices can achieve relatively high recognition rates for a range of typical users, they cannot always achieve high recognition rates for all types of users. For example, voice characteristics such as intonation and pitch vary widely depending on the age and sex of the speaker. A speech recognition device attuned to adult-style speech may achieve extremely high recognition rates for adults but may fail miserably with toddlers' voices. Further, conventional non-specific speaker speech recognition could be used by a wide range of people for wide-ranging purposes. Consider the case of a speech recognition device used in an interactive toy context. In this scenario, the degree and type of interaction must be rich and developed enough to handle a wide age range from the toddler speaking his or her first words to mature adolescents, and all the conversation content variations and canned response variations must accommodate this broad range of users in order to enhance the longevity and commercial appeal of such a recognition toy. However, as already discussed, only limited memory and processing resources can be devoted to speech recognition if such a device is to be cost effective and reasonably responsive. So, heretofore a trade-off between hardware costs and responsiveness versus interactivity has been observed in non-specific speaker voice recognizers.

It is, therefore, an object of the present invention to implement an interactive speech recognition method and apparatus that can perform natural-sounding conversations without increasing the number of pre-registered words or canned responses characteristic of conventional canned matching type speech recognition. Moreover, it is a further object of the present invention to incorporate recognition accuracy and features approaching non-specific speaker speech recognition in a device relatively simple in configuration, low in price, easily manufactured, and easily adaptable to suit changing needs and uses. It is yet a further object of the present invention to provide a highly capable, low-cost interactive speech recognition method and apparatus which can be applied to a wide range of devices such as toys, game machines and ordinary electronic devices.

It is still a further object of the present invention to improve non-specific speaker recognition rates for a wider range of voices than heretofore could be accommodated using conventional memory constructs. It is even a further object of the present invention that a wider range of conversation responses and detected phrases be accommodated on an as-needed basis.

SUMMARY OF THE INVENTION

In accordance with these and related objects, the speech recognition technique of the present invention includes: 1) voice analysis, which generates characteristic voice data by analyzing perceived voice; 2) non-specific speaker word identification, which reads the characteristic voice data and outputs detected data corresponding to pre-registered words contained within a word registry; 3) potentially, in addition to non-specific speaker word identification, specific-speaker word enrollment that registers standard voice characteristic data for a select number of words spoken by an individual speaker and outputs detected data when these expressions are subsequently detected; 4) speech recognition and dialogue management, which, based on either or both of non-specific and specific speaker word identification, reads the detected voice data, comprehends its meaning and determines a corresponding response; 5) voice synthesis, which generates a voice synthesis output based on the determined response; and 6) voice output, which externally outputs the synthesized response.

According to the preferred embodiments, optional specific speaker word registration outputs word identification data by DP-matching based on the input voice from a specific speaker. It can comprise the following: an initial word enrollment that creates standard patterns by reading characteristic data relative to a specific speaker's prescribed voice input from the voice analysis process; a standard pattern memory process that stores the standard patterns created by the word enrollment process; and a word detection process that outputs word detection data by reading characteristic data relative to the specific-speaker's prescribed voice input and by comparing the characteristic data with said standard patterns. Further, specific speaker word enrollment comprises at least the following: additional word enrollment that creates standard voice patterns that are speaker-adapted based on the standard characteristic voice data for non-specific speakers as spoken by the selected speaker along with speaker-adapted standard pattern memory for storing both the standard patterns that are speaker-adapted and those installed by speaker specific word enrollment. Moreover, specific speaker word enrollment may read characteristic data relative to the specific speaker's prescribed voice input through voice analysis and outputs word detection data by comparing the input characteristic data with the speaker-adapted standard patterns.
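
The DP-matching step mentioned above can be illustrated with a minimal dynamic time warping sketch. This is an assumption-laden illustration, not the patent's implementation: the function names `dp_match_distance` and `detect_word`, the Euclidean frame distance, and the symmetric step pattern are all hypothetical choices made here for clarity.

```python
import math

def dp_match_distance(input_series, standard_pattern):
    """Accumulated distance of the best time alignment between two
    series of characteristic vectors (classic DP / DTW recurrence).
    Euclidean frame distance is an assumption of this sketch."""
    n, m = len(input_series), len(standard_pattern)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(input_series[i - 1], standard_pattern[j - 1])
            # Allow a frame to repeat on either side, or both to advance.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def detect_word(input_series, standard_patterns):
    """Return (word, distance) for the enrolled pattern closest to the input.
    standard_patterns maps each enrolled word to its stored vector series."""
    scored = ((word, dp_match_distance(input_series, pattern))
              for word, pattern in standard_patterns.items())
    return min(scored, key=lambda pair: pair[1])
```

In this sketch a smaller distance means a better match, so an utterance spoken more slowly than the enrolled pattern can still align with zero cost if its frames merely repeat.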

Further, the preferred embodiments may include a response creation function. When a particular speaker wishes to add to or modify the existing response list, the preferred embodiment can create response data based on voice signals that have been input by a particular speaker and register them according to instructions given by speech recognition and dialogue management. This permits the creation of new and useful response messages using the voices of a wide variety of people and allows a wide variety of exchanges between the embodiment and users.
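
As a rough illustration of the response creation function, the sketch below layers user-registered responses over a preset response list. The class name `ResponseTable`, its method names, and the use of plain strings to stand in for recorded voice data are hypothetical; the text does not specify how response data is keyed or stored.

```python
class ResponseTable:
    """Preset responses plus user-created entries that add to or override them."""

    def __init__(self, preset):
        self._preset = dict(preset)   # pre-set response data
        self._custom = {}             # responses created from a speaker's input

    def register(self, response_id, response_data):
        """Store response data created from inputted information."""
        self._custom[response_id] = response_data

    def lookup(self, response_id):
        """Prefer a user-created response; fall back to the preset one."""
        if response_id in self._custom:
            return self._custom[response_id]
        return self._preset.get(response_id)
```

Keeping the preset entries untouched means a user-created response can be discarded later without losing the original.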

Moreover, according to the preferred embodiments of the present invention: 1) word registry storage, including standard pattern memories containing the characteristic voice vectors for each registered word (either speaker specific, non-speaker specific or a combination thereof); and/or 2) conversation content storage for retaining canned context rules and response procedures when recognized words or phrases are encountered; and/or 3) response data storage for retaining response voice vector data used in formulating an appropriate response to perceived and recognized words and phrases and corresponding context and action rules, may collectively or singularly reside within memory provided on a removable cartridge external to and in communication with the speech recognition processor. Of course, necessary protocol glue and buffering logic, along with conventional bus architecture control drivers and protocols will be included as necessary to permit proper (at least read-only) communications between these cartridge memories and the various components of the speech recognition processor, including, but not limited to, the word or phrase identifier (preferably non-speaker specific), the speech recognition and dialogue management unit, and the voice synthesis unit.

By offloading these memories and information onto a modular removable cartridge and away from a central speech recognition processor, it becomes possible to tailor conversations to users of various ages, backgrounds or gender, as well as increase the available groups of pre-registered words and/or responses, all without dramatically increasing memory size and costly memory parts counts. Only a small additional expense will be required to accommodate cartridge information transfer operations to the speech processor, as well as engagement hardware to complete the electrical interconnection between the cartridge memories and the main speech recognition processing unit. Moreover, since it is anticipated that the overall memory size of each cartridge approximates the memory size of a conventional internalized memory speech recognition system, processing matching speed and overall responsiveness should not be seriously impacted by inclusion of the external cartridge paradigm. Again, here, the speech recognition processing unit in this embodiment may be required to implement additional communication overhead in order to communicate with the coupled memory cartridge, but incorporating such additional processing burdens is more than outweighed by the benefits of modularity and adaptability secured by including recognition, context and response information on removable storage such as the memory cartridge.
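
The cartridge arrangement can be pictured as three optional memories that, when present on the cartridge, take the place of internal ones. The sketch below is only a data-structure illustration under stated assumptions: the names `MemorySet` and `SpeechDevice`, the dictionary contents, and the fall-back-to-internal behavior are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemorySet:
    """The three kinds of stored information; None means 'not provided here'."""
    standard_patterns: Optional[dict] = None      # word registry patterns
    conversation_content: Optional[dict] = None   # context rules and procedures
    response_data: Optional[dict] = None          # data used to build responses

@dataclass
class SpeechDevice:
    internal: MemorySet
    cartridge: Optional[MemorySet] = None          # removable; may hold any subset

    def memory(self, name):
        """Read a memory from the cartridge if it supplies one, else internally."""
        if self.cartridge is not None:
            external = getattr(self.cartridge, name)
            if external is not None:
                return external
        return getattr(self.internal, name)
```

Because each cartridge memory is optional, the same sketch covers a cartridge that carries all three memories as well as one that carries only response data.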

Thus, one aspect of the present invention couples simple non-specific speaker speech recognition with specific speaker expression enrollment and detection. Further, non-specific pre-registered words can be speaker-adapted to permit more accurate and quicker recognition. In certain situations, some words are recognized and other words are not depending on the manner in which a particular speaker utters them. With some speakers, no non-specific pre-registered words can be recognized at all. In such cases, words that fail to be recognized can be enrolled using a specific-speaker voice enrollment function. This virtually eliminates words that cannot be recognized and thus substantially improves the overall recognition capability of the equipment. This function also allows specific speakers to enroll new words suited to the user's individual needs and tastes which are not included in the non-specific word registry.
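
One way to picture the coupling of non-specific and specific speaker detection is to run both detectors on the same input and keep the most confident result. This is a simplified sketch: the function name `identify_word`, the detector callables, and the 0.7 acceptance threshold are assumptions introduced for illustration only.

```python
def identify_word(features, detectors, threshold=0.7):
    """Run every detector on the same input and keep the best confident hit.

    detectors maps a source label (for example "non-specific" or "specific")
    to a callable returning (word, confidence). The threshold is hypothetical.
    Returns (word, source, confidence), or None if nothing is confident enough.
    """
    results = [(source, *detect(features)) for source, detect in detectors.items()]
    source, word, confidence = max(results, key=lambda r: r[2])
    if confidence < threshold:
        return None
    return word, source, confidence
```

With this shape, a word the non-specific registry misses can still be recognized once the speaker has enrolled it.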

Further, the preferred embodiments may include a response creation function which permits alteration or additions to a predefined response list, thereby improving its depth and range of usefulness.

Moreover, the non-speaker specific or speaker-specific word registries, recognition contextual rules, conversation response action rules, and audible response information may all be stored singularly or in combination on external cartridge memory to accommodate wider ranges of speakers and applications having disparate conversation sets without significantly impacting device cost or composite recognition performance. This is true, even though the rest of the speech recognition processing equipment may be unitized to reduce cost and ease manufacturability. If, in the case of a toy application, a cartridge is used to store recognition, conversation control and response information, the toy can adapt and grow with the child, even when "canned" non-speaker specific phrase identification techniques are utilized. Also, the recognition registry, conversation and response information can be changed or updated as the general culture changes, thereby greatly increasing the longevity and usefulness of the cartridge-equipped speech recognition apparatus. Of course, the cartridge information can also be used to broaden potential speakers and maintain acceptable recognition rates by tailoring the "canned" non-speaker specific registration list to particular dialects, regional lingual idiosyncrasies or even different languages. In such cases, a given speaker may simply select and connect the most appropriate cartridge for his or her own inflections, accent or language.

Other objects and attainments together with a fuller understanding of the invention will become apparent and appreciated by referring to the following description of the presently preferred embodiments and claims taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings, wherein like reference symbols refer to like parts:

FIG. 1 is an overall block diagram of the first preferred embodiment;

FIGS. 2A-2E diagrammatically illustrate a sample input voice waveform and resultant word lattice generated by the non-specific speaker word identification unit of the embodiment shown in FIG. 1;

FIG. 3 illustrates an example setup switch according to the first and second preferred embodiments;

FIGS. 4A-4E diagrammatically illustrate another sample input voice waveform and resultant word lattice generated by the non-specific speaker word identification unit of the embodiment shown in FIG. 1;

FIG. 5 shows an example response table stored in the response data memory unit of the embodiment shown in FIG. 1;

FIG. 6 is an overall block diagram of a second preferred embodiment;

FIGS. 7A-7E diagrammatically illustrate a sample input voice waveform and resultant word lattice generated by both the specific and non-specific speaker word identification and enrollment units of the embodiment shown in FIG. 6;

FIG. 8 is an overall block diagram of a third preferred embodiment;

FIG. 9 illustrates an example setup switch according to the embodiment shown in FIG. 8;

FIG. 10 shows an example response table stored in the response data memory unit of the embodiment shown in FIG. 8;

FIG. 11 is an overall block diagram of a fourth embodiment of the present invention explaining modularized recognition, conversation control and response information according to the present invention;

FIG. 12 is a more detailed block diagram of the embodiment of FIG. 11;

FIG. 13 is an alternative detailed block diagram of the embodiment shown in FIG. 11 wherein only phrase registry information is contained on the cartridge;

FIG. 14 is another detailed block diagram showing yet another alternative configuration of the embodiment of FIG. 11 wherein only context and conversation response, along with response data is externalized to the cartridge; and

FIG. 15 is yet another detailed block diagram depicting still another alternative configuration of the embodiment of FIG. 11 wherein only response data is maintained external to the speech recognition response processor.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

As depicted in the figures, the presently preferred embodiments exemplify speech recognition techniques of the present invention as applied to an inexpensive voice-based toy, gaming device, or similar interactive appliance. Though one having ordinary skill in the speech recognition art will recognize that the teachings of the present invention are not so limited, the presently preferred embodiments can be conveniently implemented as a stand-alone speech recognition device residing within a stuffed doll such as a dog, cat or bear suitable for young children.

FIG. 1 shows a configuration diagram that depicts the first preferred embodiment of the present invention. In FIG. 1, the following components are designed to recognize words spoken by non-specific speakers and to generate response messages according to the results of the recognition: voice input unit 1, which inputs the speaker's voice; voice analysis unit 2, which outputs characteristic voice data by analyzing the input voice; non-specific speaker word identification unit 3, which reads the characteristic voice data from voice analysis unit 2 and outputs the detected data corresponding to the registered words contained in the input voice, based on a non-specific speaker's standard characteristic voice data relative to pre-registered recognizable words; speech recognition and dialogue management unit 4; response data memory unit 5, which stores pre-set response data; voice synthesis unit 6; and voice output unit 7. Also shown in FIG. 1, a specific-speaker word registration means 8 is provided that registers the standard characteristic voice data on the words uttered by a specific speaker based on the specific speaker's input voice and that outputs word detection data on the specific speaker's input voice. Further, setup switch 9 is provided to serve as a data input setup means for performing various data input setup actions by an individual user.

The non-specific speaker word identification unit 3 preferably comprises the following: standard pattern memory unit 31, which stores standard voice vector patterns or standard characteristic voice data that correspond to each pre-registered word contained in the word registry; and word detection unit 32, which generates word detection data preferably in the form of a word lattice by reading characteristic voice data from voice analysis unit 2 and by comparing them against the standard non-specific speaker patterns contained in the standard pattern memory unit 31.

The standard pattern memory unit 31 stores (registers) standard patterns of target-of-recognition words that are created beforehand using the voices of a large number of speakers (e.g., 200 people) for each of the words. Since these embodiments are directed to a low-cost toy or novelty, approximately 10 words are chosen as target-of-recognition words. Although the words used in the embodiment are mostly greeting words such as the Japanese words "ohayou" meaning "good morning", "oyasumi" meaning "good night", and "konnichiwa" meaning "good afternoon", the present invention is, of course, by no means limited to these words or to merely the Japanese language. In fact, various words in English, French or other languages can be registered, and the number of registered words is not limited to 10. Though not shown in FIG. 1, word detection unit 32 is principally composed of a processor (the CPU) and ROM that stores the processing program. Its function is to determine at what confidence level the words registered in standard pattern memory unit 31 occur in the input voice, and will be described in more detail hereinbelow.

On the other hand, specific-speaker word enrollment unit 8 preferably comprises the following: word enrollment unit 81; standard pattern memory unit 82, which stores input voice standard patterns as the standard characteristic voice data on the input voice; and word detection unit 83. In this embodiment, the specific-speaker word enrollment unit registers the words uttered by specific speakers by entering their voice signals and outputting the detected data in the form of a word lattice for specific-speaker registered words relative to the input voice. In this example, it is assumed that the input voice is compared with registered standard voice patterns by DP-matching, and word identification data is output from word detection unit 83 based on the results of the comparison. The registration of words by specific-speaker word enrollment unit 8 can be performed by setting the word registration mode using setup switch 9, as will be discussed in greater detail hereinbelow.

Still referring to FIG. 1, voice input unit 1 is composed of the following conventional sub-components which are not shown in the figure: a microphone, an amplifier, a low-pass filter, an A/D converter, and so forth. The voice which is input from the microphone is converted into an appropriate audio waveform after the voice is allowed to pass through the amplifier and the low-pass filter. The audio waveform is then converted into digital signals (e.g., 12 kHz sampling rate at 16 bit resolution) by the A/D converter and is output to voice analysis unit 2. Voice analysis unit 2 takes the audio waveform signals transmitted from voice input unit 1 and uses a processor (the CPU) to perform a frequency analysis at short time intervals, extracts characteristic vectors (commonly LPC-Cepstrum coefficients) of several dimensions that express the characteristic of the frequency, and outputs the time series of the characteristic vectors (hereinafter referred to as "characteristic voice vector series"). It should be noted that said non-specific speaker word data output means 3 can be implemented using the hidden Markov model (HMM) method or the DP-matching method. However, in this example keyword-spotting processing technology using the dynamic recurrent neural network (DRNN) method is used as disclosed by Applicants in U.S. application Ser. No. 08/078,027, filed Jun. 18, 1993, entitled "Speech Recognition System", commonly assigned with the present invention to Seiko-Epson Corporation of Tokyo, Japan, which is incorporated fully herein by reference. Also, this method is disclosed in the counterpart laid open Japanese applications H6-4097 and H6-119476. DRNN is preferably used in order to perform speech recognition of virtually continuous speech by non-specific speakers and to output word identification data as described herein.
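
To make the "analysis at short time intervals" concrete, the sketch below splits a 12 kHz sample stream into overlapping frames and emits one feature vector per frame. Only the 12 kHz rate comes from the text. The 20 ms frame length, the 10 ms hop, the function names, and the use of log energy as a one-dimensional stand-in for LPC-Cepstrum coefficients are all assumptions made to keep the example short.

```python
import math

SAMPLE_RATE_HZ = 12_000  # sampling rate given in the text

def split_frames(samples, frame_ms=20, hop_ms=10):
    """Cut the sample stream into overlapping short-time frames.
    Frame and hop lengths are hypothetical values for this sketch."""
    size = SAMPLE_RATE_HZ * frame_ms // 1000
    hop = SAMPLE_RATE_HZ * hop_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]

def log_energy(frame):
    """Stand-in feature: a real analyzer would compute LPC-Cepstrum here."""
    return math.log(sum(s * s for s in frame) + 1e-12)

def characteristic_vector_series(samples):
    """Time series of (one-dimensional) characteristic vectors, one per frame."""
    return [(log_energy(frame),) for frame in split_frames(samples)]
```

A 100 ms input therefore yields a short series of vectors, which is the form consumed by the word detection stage.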

The following is a brief explanation of the specific processing performed by non-specific speaker word data identification unit 3 with reference to FIGS. 2A-2E. Word detection unit 32 determines the confidence level at which a word registered in standard pattern memory unit 31 occurs at a specific location in the input voice. Now, suppose that the speaker inputs an example Japanese language phrase "asu no tenki wa . . . " meaning "Concerning tomorrow's weather". Assume that in this case the stylized voice signal shown in FIG. 2A represents the audio waveform for this expression.

In the expression "asu no tenki wa . . . ", the contextual keywords include "asu" (tomorrow) and "tenki" (weather). These are stored in the form of patterns in standard pattern memory unit 31 as part of a predetermined word registry, which in this case contains approximately 10 words. If 10 words are registered, signals are output in order to detect words corresponding to these 10 words (designated word 1, word 2, word 3 . . . up to word 10). From information such as the detected signal values, the word identification unit determines the confidence level at which the corresponding words occur in the input voice.

More specifically, if the word "tenki" (weather) occurs in the input voice as word 1, the detection signal that is waiting for the signal "tenki" (weather) rises at the portion "tenki" in the input voice, as shown in FIG. 2B. Similarly, if the word "asu" (tomorrow) occurs in the input voice as word 2, the detection signal that is waiting for the signal "asu" rises at the portion "asu" in the input voice, as shown in FIG. 2C. In FIGS. 2B and 2C, the numerical values 0.9 and 0.8 indicate respective confidence levels that the spoken voice contains the particular pre-registered keyword. The relative level or magnitude of this level can fluctuate between approximately 0 and 1.0, with 0 indicating a nearly zero confidence match factor and 1.0 representing a 100% confidence match factor. In the case of a high confidence level, such as 0.9 or 0.8, the registered word having a high confidence level can be considered to be a recognition candidate relative to the input voice. Thus, the registered word "asu" occurs with a confidence level of 0.8 at position w1 on the time axis. Similarly, the registered word "tenki" occurs with a confidence level of 0.9 at position w2 on the time axis.

Also, the example of FIGS. 2A-2E shows that, when the word "tenki" (weather) is input, the signal that is waiting for word 3 (word 3 is assumed to be the registered word "nanji" ("What time . . . ")) also rises at position w2 on the time axis with an uncertain confidence level of approximately 0.6. Thus, if two or more registered words exist as recognition candidates at the same time relative to an input voice signal, the recognition candidate word is determined by one of two methods: either 1) selecting, as the actually recognized keyword, the potential recognition candidate with the highest degree of similarity to the input voice using confidence level comparisons; or 2) selecting one of the words as the recognized word according to a correlation table, created beforehand, that expresses correlation rules between words. In this case, the confidence level for "tenki" (weather) indicates that it has the highest degree of similarity to the input voice during time portion w2 on the time axis even though "nanji" can be recognized as a potential recognition candidate. Based on these confidence levels, the speech recognition and dialogue management unit 4 performs the recognition of input voices.
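
The first of the two methods, selecting the candidate with the highest confidence level among words that compete for the same portion of the time axis, might be sketched as follows. The `Detection` record, the `select_keywords` function, the 0.5 candidate threshold, and the start and end times in the usage note are hypothetical; only the confidence values 0.8, 0.9 and 0.6 come from the example of FIGS. 2A-2E.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    word: str
    start: float       # start point on the time axis (hypothetical units)
    end: float         # end point on the time axis
    confidence: float  # peak of the detection signal, roughly 0 to 1.0

def overlaps(a, b):
    """True if two detections occupy a common portion of the time axis."""
    return a.start < b.end and b.start < a.end

def select_keywords(lattice, threshold=0.5):
    """Keep, for each time portion, only the highest-confidence candidate."""
    candidates = sorted(
        (d for d in lattice if d.confidence >= threshold),
        key=lambda d: d.confidence,
        reverse=True,
    )
    chosen = []
    for d in candidates:
        # A weaker candidate is dropped if a stronger one already
        # claims the same portion of the input voice.
        if not any(overlaps(d, c) for c in chosen):
            chosen.append(d)
    return sorted(chosen, key=lambda d: d.start)
```

With detections for "asu" (0.8) at w1 and for both "tenki" (0.9) and "nanji" (0.6) at w2, `select_keywords` keeps "asu" and "tenki" and discards "nanji".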

Collectively, the detection information, including starting and ending points on the time axis and the maximum magnitude of the detection signal indicating the confidence level, for each pre-registered word contained in the non-specific speaker word registry within standard pattern memory unit 31 is known as a word lattice.
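
One way to picture the word lattice is as one record per registered word, derived from that word's detection signal. In the sketch below, each detection signal is assumed to be a list of per-frame values; the names `lattice_entry` and `build_lattice`, the frame-index time axis, and the 0.3 floor used to locate the starting and ending points are assumptions, not details given in the specification.

```python
def lattice_entry(word, signal, floor=0.3):
    """Return (word, start_frame, end_frame, peak) for the strongest
    rise in a non-empty per-frame detection signal."""
    peak_idx = max(range(len(signal)), key=signal.__getitem__)
    peak = signal[peak_idx]
    start = end = peak_idx
    # Extend the span outward from the peak while the signal stays
    # at or above the floor.
    while start > 0 and signal[start - 1] >= floor:
        start -= 1
    while end < len(signal) - 1 and signal[end + 1] >= floor:
        end += 1
    return (word, start, end, peak)

def build_lattice(signals, floor=0.3):
    """Build a lattice with one entry for every registered word."""
    return [lattice_entry(w, s, floor) for w, s in signals.items()]
```

An entry is produced for every registered word, including words whose peak is low; it is left to the later recognition stage to decide which entries count as recognition candidates.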

In FIGS. 2B-2E, only a partial lattice is shown for the sake of clarity, but a word lattice including detection information for every pre-registered non-specific word is in fact generated by the word detection unit 32.

Though not shown in FIG. 1, speech recognition and dialogue management unit 4 is principally composed of a processor and ROM that stores the processing program and performs the processing tasks described below. Different CPUs may be provided in the individual units or, alternatively, one CPU can perform the processing tasks for the different units.

Speech recognition and dialogue management unit 4 selects a recognition word output from either non-specific word detection unit 32 or specific speaker word detection unit 83. Based on the composite word lattice, the speech recognition and dialogue management unit recognizes a voice (comprehending the overall meaning of the input voice), references response data memory unit 5, determines a response according to the comprehended meaning of the input voice, and transmits appropriate response information and control overhead to both voice synthesis unit 8 and voice output unit 9.

For example, when the detected data or partial word lattice shown in FIGS. 2B-2E is relayed from word detection unit 32, the speech recognition and dialogue management unit determines one or more potential recognition candidates denoted in the word lattice as a keyword occurring in the input. In this particular example, since the input voice is "asu no tenki wa" (the weather tomorrow), the words "asu"(tomorrow) and "tenki" (weather) are detected. From the keywords "asu" and "tenki", the speech recognition and dialogue management unit understands the contents of the continuous input voice "asu no tenki wa".
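
The step from detected keywords to a response can be pictured as a table lookup. The following sketch assumes that response data memory unit 5 behaves like a table keyed by keyword combinations; the table contents, the response identifiers, the fallback value, and the rule of preferring the largest matching combination are all hypothetical.

```python
# Hypothetical stand-in for the contents of response data memory unit 5.
RESPONSE_TABLE = {
    frozenset({"asu", "tenki"}): "weather_tomorrow",
    frozenset({"tenki"}): "weather_today",
    frozenset({"nanji"}): "current_time",
    frozenset({"ohayou"}): "greeting_morning",
}

def choose_response(detected, table=RESPONSE_TABLE, default="ask_again"):
    """Pick the response whose keyword combination is fully contained
    in the detected keywords, preferring the largest combination."""
    detected = set(detected)
    matches = [keys for keys in table if keys <= detected]
    if not matches:
        return default  # nothing understood: fall back
    return table[max(matches, key=len)]
```

For the input "asu no tenki wa", the detected keywords "asu" and "tenki" together select the response for tomorrow's weather rather than the response registered for "tenki" alone.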

The speech recognition processing of virtually continuous voice by keyword spotting processing, as described above, is applicable to other languages as well as to Japanese. If the language to be used is English, for instance, some of the recognizable words that can be registered might be "good morning", "time", "tomorrow", and "good night". The characteristic data on these recognizable registered words is stored in standard pattern memory unit 31. If the speaker asks "What time is it now?", the word "time" in the clause "what time is it now" is used as a keyword in this case. When the word "time" occurs in the input voice, the detection signal that is waiting for the word "time" rises at the portion "time" in the input voice. When detected data (word lattice) from word detection unit 32 is input, one or more words in the input voice are determined as keywords. Since in this example the input voice is "what time is it now", "time" is detected as a keyword, and the speech recognition and dialogue management unit understands the contents of the continuous input voice "what time is it now?"

The above description concerns the case where word data is output from non-specific speaker word data output means 3, i.e., the words spoken by the speaker are recognized. With some speakers, however, words like the Japanese expression "Ohayou" (good morning) totally fail to be recognized. Although in some cases changing the way words are spoken can solve the problem, some speakers with voice idiosyncrasies entirely fail to be recognized. In such cases, the words that fail to be recognized can be registered as specific-speaker words. This feature is described below.

Referring still to FIG. 1, setup switch 9 is used to register specific-speaker words. As shown in FIG. 3, setup switch 9 preferably comprises number key unit 91, start-of-registration button 92, end-of-registration button 93, response message selection button 94, end-of-response message registration button 95, and response number input button 96. Buttons such as response message selection button 94, end-of-response message registration button 95, and response number input button 96 will be described in more detail hereinbelow.

By way of example, this section explains the case where the word "Ohayou" (good morning) is registered as a specific-speaker word because it is not recognized. First, start-of-registration button 92 on setup switch 9 is pushed. This button operation forces speech recognition and dialogue management unit 4 to enter into specific-speaker word registration mode. Normal recognition operations are not performed in this word registration mode.

Suppose that the speaker enters the number for the word "Ohayou" (good morning) (each registered word that is known to be recognizable is preferably assigned a number) from number key unit 91, and "Ohayou" (good morning) is number 1, for example. Then, when the speaker presses the numeric key "1", speech recognition and dialogue management unit 4 detects that the speaker is trying to register the word "Ohayou" (good morning) and performs controls so that the unit outputs a response "Say 'good morning'". When the speaker says "Ohayou" (good morning) because of this prompt, his voice is transmitted from voice input unit 1 to voice analysis unit 2. The characteristic vector that has been voice-analyzed is transmitted to word enrollment unit 81. Word enrollment unit 81 creates standard patterns for the input voice as standard characteristic voice data. The standard pattern is then stored in standard pattern memory unit 82.
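
The registration sequence just described (start button, number key, prompt, utterance, storage) might be modeled as a small state machine. The class `EnrollmentController`, its method names, the returned status strings, and the use of a dictionary in place of standard pattern memory unit 82 are all assumptions of this sketch.

```python
class EnrollmentController:
    """Hypothetical model of the specific-speaker registration mode."""

    def __init__(self, numbered_words):
        self.numbered_words = numbered_words  # e.g. {1: "ohayou"}
        self.specific_patterns = {}  # stands in for standard pattern memory unit 82
        self.registering = False
        self.pending_word = None

    def press_start(self):
        # Start-of-registration button 92: enter word registration mode.
        self.registering = True

    def press_number(self, number):
        # Number key unit 91: choose the word to register and prompt the speaker.
        if not self.registering:
            raise RuntimeError("not in word registration mode")
        self.pending_word = self.numbered_words[number]
        return f"Say '{self.pending_word}'"

    def hear(self, feature_series):
        # In registration mode the utterance is stored as a standard
        # pattern; normal recognition is not performed.
        if self.registering:
            if self.pending_word is None:
                return "ignored"
            self.specific_patterns[self.pending_word] = list(feature_series)
            self.pending_word = None
            return "registered"
        return "recognize"

    def press_end(self):
        # End-of-registration button 93: return to normal recognition.
        self.registering = False
        self.pending_word = None
```

In this model an utterance heard outside registration mode is simply handed on to normal recognition, which is represented here only by the status string "recognize".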

The characteristic pattern that is registered as described above can be a standard pattern that uses the characteristic vector column of the word "Ohayou" (good morning) exactly as uttered by the speaker. Alternatively, the speaker can say "Ohayou" (good morning) several times, and the average standard characteristic vector column of the individual characteristic vector columns can be obtained, and a standard pattern can be created from the standard characteristic vector column.
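
The averaging alternative can be sketched as follows. Because repeated utterances of the same word rarely have the same number of frames, this sketch first brings each utterance to a common length by nearest-frame linear time normalization and then averages frame by frame. The specification does not state how the utterances are aligned, so the alignment method and the names `resample` and `average_pattern` are assumptions.

```python
def resample(series, length):
    """Pick `length` frames from `series`, evenly spaced in time
    (nearest-frame linear time normalization)."""
    n = len(series)
    if length == 1:
        return [series[0]]
    return [series[round(i * (n - 1) / (length - 1))] for i in range(length)]

def average_pattern(utterances):
    """Average several characteristic vector series of one word
    into a single standard pattern."""
    # Use the mean utterance length as the common length.
    length = max(1, round(sum(len(u) for u in utterances) / len(utterances)))
    aligned = [resample(u, length) for u in utterances]
    dims = len(utterances[0][0])
    return [
        [sum(u[t][d] for u in aligned) / len(aligned) for d in range(dims)]
        for t in range(length)
    ]
```

When only a single utterance is supplied, the result is that utterance's own characteristic vector series, which corresponds to the first alternative described above.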

In this manner, words that are uttered by a specific speaker and that cannot be recognized can be registered. Naturally, the registration technique can be performed on all unrecognizable words, not just "Ohayou" (good morning). It is in this manner that the registration of specific-speaker words from unrecognizable words is performed.

The following describes specific examples of conversations between a speaker and the first preferred embodiment. In the speaker's utterances, the words enclosed in brackets indicate keywords used for recognition.

Suppose that the speaker says "[Ohayou] gozaimasu" meaning "[Good morning] to you . . . ". The voice "Ohayou" is transmitted from voice input unit 1 to voice analysis unit 2, where a voice-analyzed characteristic vector is generated. At this time, word detection unit 32 of non-specific speaker word identification unit 3 and word detection unit 83 of specific speaker word enrollment unit 8 are both waiting for a signal from voice analysis unit 2. Word detection units 32 and 83 each output word detection data in the form of the aforementioned word lattice that corresponds to the output from voice analysis unit 2. However, the numeric v