WikiPatents - Community Patent Review
Method and apparatus for voice-interactive language instruction
United States Patent 5634086
Link to this page: http://www.wikipatents.com/5634086.html
Inventor(s): Rtischev; Dimitry (Menlo Park, CA); Bernstein; Jared C. (Palo Alto, CA); Chen; George T. (Menlo Park, CA); Butzberger; John W. (Foster City, CA)
Abstract: Spoken-language instruction method and apparatus employ context-based speech recognition for instruction and evaluation, particularly language instruction and language fluency evaluation. A system can administer a lesson, and particularly a language lesson, and evaluate performance in a natural interactive manner while tolerating strong foreign accents, and produce as an output a reading quality score. A finite state grammar set corresponding to the range of word sequence patterns in the lesson is employed as a constraint on a hidden Markov model (HMM) search apparatus in an HMM speech recognizer which includes a set of hidden Markov models of target-language narrations produced by native speakers of the target language. The invention is preferably based on use of a linguistic context-sensitive speech recognizer. The invention includes a system with an interactive decision mechanism which employs at least three levels of error tolerance to simulate a natural level of patience in human-based interactive instruction. A system for a reading phase is implemented through a finite state machine having at least four states which recognizes reading error at any position in a script and which employs a first set of actions. A related system for an interactive question phase is implemented through a finite state machine, but which recognizes reading errors as well as incorrect answers while invoking a second set of actions. A linguistically-sensitive utterance endpoint detector is provided for judging termination of a spoken utterance to simulate human turn-taking in conversational speech.
   














Owner/Assignee: SRI International (Menlo Park, CA)
Publication Date: May 27, 1997
Application Number: 08/529,376
Filing Date: September 18, 1995
US Classification: 704/270, 434/185, 704/255, 704/256, 704/256.2, 704/257, 704/270.1
Int'l Classification: G10L 003/00, G10L 005/06, G10L 009/00
Examiner: Hafiz; Tariq R.
Attorney/Law Firm: Albert; Philip H.; Townsend and Townsend and Crew LLP
Parent Case: This is a Continuation of application Ser. No. 08/032,850, filed Mar. 12, 1993, now abandoned.
USPTO Field of Search: 364/419, 381/41, 381/42, 381/43, 381/47, 395/2, 395/2.1, 395/2.4, 395/2.44, 395/2.42, 395/2.43, 395/2.41, 395/2.6, 395/2.64, 395/2.65, 395/2.75, 395/2.76, 395/2.55, 395/2.59, 395/22, 395/2.66
   
U.S. References

Patent No.  Inventor   US Class   Date
5333275     Wheatley   704/243    Jul 1994
5329609     Sanada                Jul 1994
5329608     Bocchieri             Jul 1994
5307444     Tsuboka    706/20     Apr 1994
5268990     Cohen      704/200    Dec 1993
5199077     Wilcox     704/256    Mar 1993
5148489     Erell      704/226    Sep 1992
5075896     Wilcox                Dec 1991
5027406     Roberts    704/244    Jun 1991
4969194     Ezawa      704/276    Nov 1990
4887212     Zamora     704/8      Dec 1989
4862408     Zamora     707/102    Aug 1989
4860360     Boggs      704/233    Aug 1989
4852180     Levinson   704/256.4  Jul 1989
4783803     Baker      704/252    Nov 1988
4641343     Holland    704/276    Feb 1987
4380438     Okamoto    434/157    Apr 1983
4276445     Harbeson   704/207    Jun 1981
Claims
 


What is claimed is:

1. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:

generating a grammar model from the preselected script;

imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;

generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;

parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence;

evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite; and

outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.
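
As an illustrative sketch only (not the source code of the microfiche appendix), the grammar-alteration and parsing steps of claim 1 might be pictured as follows. The names `ALT`, `build_altered_grammar` and `alt_distribution`, the flat-list grammar, and the (token, content) hypothesis format are all assumptions made for this example; the claim does not prescribe a data structure or say how accuracy is computed from the alt distribution.

```python
ALT = "<alt>"  # hypothetical token standing for nonscripted speech or a pause


def build_altered_grammar(script):
    """Interleave an ALT slot before, between, and after the script words."""
    grammar = [ALT]
    for word in script.split():
        grammar.append(word)
        grammar.append(ALT)
    return grammar


def alt_distribution(hypothesis):
    """Count script words and non-empty alts in a recognizer hypothesis.

    `hypothesis` is a list of (token, content) pairs. For an ALT token the
    content is "" (nothing spoken), "silence", or "nonscripted".
    """
    counts = {"words": 0, "silence": 0, "nonscripted": 0}
    for token, content in hypothesis:
        if token != ALT:
            counts["words"] += 1
        elif content:
            counts[content] += 1
    return counts


grammar = build_altered_grammar("the cat sat")

# An invented hypothesis: a pause after "the", stray speech after "cat".
hypothesis = [
    (ALT, ""), ("the", "the"), (ALT, "silence"),
    ("cat", "cat"), (ALT, "nonscripted"), ("sat", "sat"), (ALT, ""),
]
counts = alt_distribution(hypothesis)
```

A downstream scorer could then map `counts` to an accuracy indication; that mapping is left open here because the claim leaves it open.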

2. The method of claim 1, further comprising the steps of:

digitizing the input speech and storing digitized input speech in a digital memory;

storing the grammar model and the altered grammar model in the digital memory; and

using a digital computer to compare the input speech with the stored grammar models.

3. The method of claim 1, further comprising a step of, in response to the input speech, prompting the speaker to re-recite the preselected script with phonetic and semantic accuracy, according to at least three levels of patience.

4. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:

generating a grammar model from the preselected script;

imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;

generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;

parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence;

evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and

outputting an indication of the accuracy of the input speech to the speaker,

wherein the preselected script includes alternative texts, the method further comprising a step of generating an interactive conversation grammar model for the alternative texts, the interactive conversation grammar model comprising a first common alt element disposed before a selection of alternative phrases and a second common alt element disposed after the selection of an alternative phrase, thereby permitting alternative responses having phonetic accuracy and semantic inaccuracy.

5. The method of claim 4, further comprising a step of structuring an alt element as a plurality of transition arcs for events, including prolonged silence, prolonged out-of-script speech, speech alternating between periods of silence and periods of out-of-script speech, and speech without pauses or out-of-script speech.

6. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:

generating a grammar model from the preselected script;

imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;

generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;

parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:

a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;

b) determining reject density for the current segment; and

c) denoting the current segment as out-of-script speech if the reject density exceeds a reject density threshold;

evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and

outputting an indication of the accuracy of the input speech to the speaker.

7. The method of claim 6, wherein the step of determining the reject density for the current segment comprises the step of dividing a reject phone count returned by the speech recognizer for a preselected number of consecutive scripted words by a sum of the reject phone count and a count of the preselected number of consecutive scripted words.
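
The reject density of claim 7 and the threshold test of claim 6 reduce to a short calculation, sketched below. The function names and the default threshold of 0.5 are assumptions for illustration; the claims give no numeric threshold.

```python
def reject_density(reject_phone_count, scripted_word_count):
    """Reject phones divided by (reject phones + scripted words), per claim 7."""
    total = reject_phone_count + scripted_word_count
    if total == 0:
        return 0.0  # nothing recognized yet; treat the segment as clean
    return reject_phone_count / total


def is_out_of_script(reject_phone_count, scripted_word_count, threshold=0.5):
    """Claim 6, step c): flag the segment when the density exceeds a threshold.

    The 0.5 default is a placeholder, not a value taken from the patent.
    """
    return reject_density(reject_phone_count, scripted_word_count) > threshold
```

For example, 3 reject phones across a window of 5 scripted words gives a density of 3 / (3 + 5) = 0.375, which would not be flagged under the assumed threshold.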

8. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:

generating a grammar model from the preselected script;

imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;

generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;

parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:

a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;

b) determining a reject indicator for the current segment; and

c) denoting the current segment as out-of-script speech if the reject indicator exceeds a reject density threshold;

evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and

outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.

9. The method of claim 8, wherein the step of determining the reject indicator for the current segment comprises the step of summing a reject phone count returned by the speech recognizer for a preselected number of consecutive scripted words.

10. A language instruction and evaluation method using an automatic speech recognizer which generates word sequence hypotheses and phone sequence hypotheses from input speech and a grammar model, wherein the input speech is speech spoken by the speaker in response to a prompting of the speaker to recite a preselected script, the method comprising the steps of:

generating a grammar model from the preselected script;

imbedding alt elements in the grammar model between words and sentences of the preselected script thereby forming an altered grammar model, the alt elements representing potential nonscripted speech and pauses;

generating an input hypothesis from the input speech using the automatic speech recognizer with the altered grammar model, wherein the input hypothesis comprises a subset of sequences of words and alts allowed by the altered grammar model;

parsing the input hypothesis into sequences identified as one of words found in the preselected script, nonscripted speech and silence, wherein alts in the input hypotheses are associated with the nonscripted speech and the silence, the step of parsing comprising the steps of:

a) recurrently examining a current segment output by the speech recognizer for scripted words, pause phones and reject phones;

b) determining a pause indicator for the current segment; and

c) denoting the current segment as an actionable pause if the pause indicator exceeds a pause indicator threshold, the actionable pause representing a turn-taking point in interaction between the automatic speech recognizer and the speaker;

evaluating the accuracy of the input speech based on a distribution of alts in the input hypothesis; and

outputting an indication of the accuracy of the input speech to the speaker, thereby informing the speaker of how well the speaker has recited the preselected script.

11. The method of claim 10, further comprising a step of generating the pause indicator threshold as a threshold dependent upon linguistic context of the current segment and position of the current segment in the preselected script, the pause indicator threshold being smaller at ends of sentences and major clauses than elsewhere among words of sentences of the preselected script.

12. The method of claim 10, wherein the pause indicator determining step comprises a step of summing pause phones returned by the speech recognizer out of a preselected number of consecutive words of the preselected script.
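
Claims 10 through 12 together describe a context-dependent endpoint test, which might be sketched as follows. The function names, the per-word list of pause phone counts, and the two threshold values are assumptions chosen for illustration; the claims state only that the threshold is smaller at ends of sentences and major clauses than elsewhere.

```python
def pause_indicator(pause_phones_per_word, window):
    """Sum pause phones over the last `window` scripted words (claim 12)."""
    if window <= 0:
        raise ValueError("window must be a positive number of words")
    return sum(pause_phones_per_word[-window:])


def pause_threshold(at_boundary, boundary_threshold=2, mid_sentence_threshold=6):
    """Smaller threshold at sentence or major-clause ends (claim 11).

    Both default values are placeholders, not figures from the patent.
    """
    return boundary_threshold if at_boundary else mid_sentence_threshold


def is_actionable_pause(pause_phones_per_word, window, at_boundary):
    """Claim 10, step c): a turn-taking point when the indicator exceeds the threshold."""
    indicator = pause_indicator(pause_phones_per_word, window)
    return indicator > pause_threshold(at_boundary)
```

With these assumed thresholds, the same four pause phones count as a turn-taking point at a sentence end but not in the middle of a sentence, which mirrors the patience a human listener shows mid-phrase.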

13. A system for tracking speech of a speaker using an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from a grammar model and input speech spoken by a speaker prompted to recite a preselected script, the system comprising:

presentation means for presenting information to the speaker about a subject and the preselected script and for prompting the speaker to recite the preselected script;

means for electronically capturing the input speech spoken in response to prompts of the presentation means, wherein captured input speech is stored in a computer memory;

means for analyzing the captured input speech to determine a sequence of words and alts corresponding to the captured input speech, wherein a word is identified as being part of the preselected speech and alts represent nonscripted speech and pauses;

assessing means coupled to the analyzing means for assessing completeness of an utterance to determine accuracy of the recitation of the preselected script, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite; and

producing means coupled to the assessing means for producing a response, if the recitation is not accurate, instructing the speaker to correctly recite the preselected script.

14. The system according to claim 13, wherein the system for tracking is used for instruction in a language foreign to the speaker and wherein the producing means includes means for generating an audible response as an example of native pronunciation and rendition of speech in the language.

15. The system according to claim 13, further comprising means for measuring recitation speed comprising:

means for counting words recited to determine a recited word count;

means for measuring time duration of a recitation of scripted words; and

means for dividing the recited word count by the measured time elapsed.
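
The recitation speed of claim 15 is a single division, sketched here. The function name and the choice of seconds as the time unit are assumptions; the claim does not fix a unit.

```python
def recitation_speed(recited_word_count, duration_seconds):
    """Recited word count divided by the measured recitation time."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return recited_word_count / duration_seconds
```

For instance, 30 scripted words recited in 12 seconds gives 2.5 words per second.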

16. The system according to claim 13, further comprising means (192) for measuring recitation quality, thereby obtaining a recitation quality score (230), the means for measuring recitation quality comprising:

means (194) for counting words (195) in the preselected script to determine a preselected script word count;

means (196) for determining an optimum recitation time (197);

means (198) for counting reject phones (199) to determine a reject phone count;

means (200) for measuring a total time (201) elapsed during recitation of the preselected script;

means (202) for measuring good time (203) elapsed during recitation of phrases deemed acceptable by the analyzing means;

means (204) for dividing the good time (203) by the total time (201) to obtain a first quotient (205);

means (210) for outputting a preferred maximum value (211) which is a maximum of the optimum recitation time (197) and the good time (203);

means (212) for dividing the optimum recitation time (197) by the preferred maximum value (211) to obtain a second quotient (213);

means (218) for summing the reject phone count (199) and the preselected script word count (195) to obtain a quality value (219);

means (220) for dividing the preselected script word count (195) by the quality value (219) to obtain a third quotient (221); and

means for calculating the recitation quality score (230) as a weighted sum of the first quotient (208), the second score quotient (216) and the third score quotient (224).

17. A system for tracking speech of a speaker using an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from a grammar model and input speech spoken by a speaker prompted to recite a preselected script, the system comprising:

presentation means for presenting information to the speaker about a subject and the preselected script and for prompting the speaker to recite the preselected script;

means for electronically capturing the input speech spoken in response to prompts of the presentation means, wherein captured input speech is stored in a computer memory;

means for analyzing the captured input speech to determine a sequence of words and alts corresponding to the captured input speech, wherein a word is identified as being part of the preselected speech and alts represent nonscripted speech and pauses;

assessing means coupled to the analyzing means for assessing completeness of an utterance to determine accuracy of the recitation of the preselected script;

producing means coupled to the assessing means for producing a response, if the recitation is not accurate, instructing the speaker to correctly recite the preselected script;

means (192) for measuring recitation quality, thereby obtaining a recitation quality score (230), the means for measuring recitation quality comprising:

a) means (194) for counting words (195) in the preselected script to determine a preselected script word count;

b) means (196) for determining an optimum recitation time (197);

c) means (198) for counting reject phones (199) to determine a reject phone count;

d) means (200) for measuring a total time (201) elapsed during recitation of the preselected script;

e) means (202) for measuring good time (203) elapsed during recitation of phrases deemed acceptable by the analyzing means;

f) means (204) for dividing the good time (203) by the total time (201) to obtain a first quotient (205);

g) means (210) for outputting a preferred maximum value (211) which is a maximum of the optimum recitation time (197) and the good time (203);

h) means (212) for dividing the optimum recitation time (197) by the preferred maximum value (211) to obtain a second quotient (213);

i) means (218) for summing the reject phone count (199) and the preselected script word count (195) to obtain a quality value (219);

j) means (220) for dividing the preselected script word count (195) by the quality value (219) to obtain a third quotient (221); and

k) means for calculating the recitation quality score (230) as a weighted sum of the first quotient (208), the second score quotient (216) and the third score quotient (224), the means for calculating further comprising:

1) means (206) for weighting the first quotient (205) by a first weighting parameter (a) to obtain a first score component (208);

2) means (214) for weighting the second quotient (213) by a second weighting parameter (b) to obtain a second score component (216);

3) means (222) for weighting the third quotient (221) by a third weighting parameter (c) to obtain a third score component (224);

4) means (226) for summing the first score component (208), the second score component (216) and the third score component (224) to produce a score sum (227); and

5) means for weighting the score sum (227) by a scale factor (228) to obtain the recitation quality score (230).
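
Read as arithmetic, elements a) through k) of claim 17 amount to a scaled, weighted sum of three quotients. The sketch below is one way to write that out. The function name, the equal default weights, the scale factor of 100, and the treatment of the final "weighting" as a multiplication are assumptions; the claim names the parameters (a), (b), (c) and the scale factor (228) but gives no values.

```python
def recitation_quality_score(
    script_word_count,   # (195)
    optimum_time,        # (197)
    reject_phone_count,  # (199)
    total_time,          # (201)
    good_time,           # (203)
    a=1 / 3, b=1 / 3, c=1 / 3,  # weighting parameters; defaults are placeholders
    scale=100.0,                # scale factor (228); placeholder value
):
    """Scaled weighted sum of the three quotients described in claim 17."""
    if total_time <= 0 or optimum_time <= 0 or script_word_count <= 0:
        raise ValueError("times and word count must be positive")
    first = good_time / total_time                                    # (205)
    second = optimum_time / max(optimum_time, good_time)              # (213)
    third = script_word_count / (reject_phone_count + script_word_count)  # (221)
    score_sum = a * first + b * second + c * third                    # (227)
    return scale * score_sum                                          # (230)
```

Under these assumed weights, a recitation with no reject phones that takes exactly the optimum time, all of it "good" time, scores the full scale value, while one in which every quotient is one half scores half of it.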

18. A system for tracking speech and interacting with a speaker using spoken and graphic outputs and an automatic speech recognizer producing word sequence hypotheses and phone sequence hypotheses from input speech spoken by the speaker after being prompted to recite from a preselected script which includes a plurality of preselected script alternatives and from a grammar model, the system comprising:

presentation means for presenting information to the speaker about a subject and prompting the speaker to recite one of the plurality of preselected script alternatives;

sensing means for electronically capturing the input speech, wherein the captured input speech is stored in a computer memory;

analyzing means for analyzing the captured input speech to determine an input hypothesis corresponding to the input speech spoken by the speaker;

identifying means, coupled to the analyzing means, for identifying which preselected script alternative from the plurality of preselected script alternatives best corresponds to the input hypothesis;

assessing means, coupled to the identifying means, for assessing completeness of an utterance to determine accuracy of recitation of the identified preselected script alternative, the accuracy being a measure of how well the input speech corresponds with preselected script which the speaker of the input speech was prompted to recite;

output means, coupled to the assessing means, for outputting a response upon the completion of the utterance, the response indicating to the speaker the accuracy of the recitation of the identified preselected script alternative and the semantic appropriateness of the identified preselected script alternative.

19. The system according to claim 18, wherein the interacting system is for instruction in a language foreign to the speaker and wherein the producing means includes means for generating an audible response as an example of native pronunciation and rendition.

20. The language instruction and evaluation method of claim 1, wherein the step of outputting an indication is a step of indirectly outputting an indication and comprises the steps of:

inputting the indication to a lesson program; and

indicating, using the lesson program, to the speaker the accuracy of the speaker's recitation by taking an action consistent with the accuracy input to the lesson program.

Description
 


COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

MICROFICHE APPENDIX

This application has been filed with a microfiche appendix of 47 frames in length containing source code listings of elements of one embodiment of the present invention.

BACKGROUND OF THE INVENTION

This invention relates to speech recognition and more particularly to the types of such systems based on hidden Markov models (HMMs) for use in language or speech instruction.

By way of background, an instructive tutorial on hidden Markov modeling processes is found in a 1986 paper by Rabiner et al., "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, Jan. 1986, pp. 4-16.

Various hidden-Markov-model-based speech recognition systems are known and need not be detailed herein. Such systems typically use realizations of phonemes which are statistical models of phonetic segments (including allophones or, more generically, phones) having parameters that are estimated from a set of training examples.

Models of words are made by making a network from appropriate phone models, a phone being an acoustic realization of a phoneme, a phoneme being the minimum unit of speech capable of use in distinguishing words. Recognition consists of finding the most-likely path through the set of word models for the input speech signal.

Known hidden Markov model speech recognition systems are based on a model of speech production known as a Markov source. The speech units being modeled are represented by finite state machines. Probability distributions are associated with the transitions leaving each node (state), specifying the probability of taking each transition when visiting the node. A probability distribution over output symbols is associated with each node. The transition probability distributions implicitly model duration. The output symbol distributions are typically used to model speech signal characteristics such as spectra.

The probability distributions for transitions and output symbols are estimated using labeled examples of speech. Recognition consists of determining the path through the Markov network that has the highest probability of generating the observed sequence. For continuous speech, this path will correspond to a sequence of word models.
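
The highest-probability path search described above is commonly carried out with the Viterbi algorithm. The text does not name the algorithm, so the sketch below is a standard technique substituted for illustration, not a description of the Decipher recognizer. The two states, the two output symbols, and every probability are invented toy values.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability into s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Toy model: all values below are made up for the example.
states = ["speech", "pause"]
start_p = {"speech": 0.6, "pause": 0.4}
trans_p = {
    "speech": {"speech": 0.7, "pause": 0.3},
    "pause": {"speech": 0.4, "pause": 0.6},
}
emit_p = {
    "speech": {"loud": 0.9, "quiet": 0.1},
    "pause": {"loud": 0.2, "quiet": 0.8},
}

prob, path = viterbi(["loud", "quiet", "quiet"], states, start_p, trans_p, emit_p)
# path is ["speech", "pause", "pause"]
```

A practical recognizer would work with log probabilities to avoid underflow on long utterances and would search over networks of phone and word models rather than two hand-picked states.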

Models are known for accounting for out-of-vocabulary speech, herein called reject phone models but sometimes called "filler" models. Such models are described in Rose et al., "A Hidden Markov Model Based Keyword Recognition System," Proceedings of IEEE ICASSP, 1990.

The specific hidden Markov model recognition system employed in conjunction with the present invention is the Decipher speech recognizer, which is available from SRI International of Menlo Park, Calif. The Decipher system incorporates probabilistic phonological information, a trainer capable of training phonetic models with different levels of context dependence, multiple pronunciations for words, and a recognizer. The co-inventors have published, with others, papers and reports on instructional development peripherally related to this invention. Each mentions early versions of question and answer techniques. See, for example, "Automatic Evaluation and Training in English Pronunciation," Proc. ICSLP 90, Nov. 1990, Kobe, Japan; "Toward Commercial Applications of Speaker-Independent Continuous Speech Recognition," Proceedings of Speech Tech 91 (Apr. 23, 1991), New York, N.Y.; and "A Voice Interactive Language Instruction System," Proceedings of Eurospeech 91, Genoa, Italy, Sep. 25, 1991. These papers described only what an observer of a demonstration might experience.

Other language training technologies are known. For example, U.S. Pat. No. 4,969,194 to Ezawa et al. discloses a system for simple drilling of a user in pronunciation in a language. The system has no speech recognition capabilities, but it appears to have a signal-based feedback mechanism using a comparator which compares a few acoustic characteristics of speech and the fundamental frequency of the speech with a reference set.

U.S. Pat. No. 4,380,438 to Okamoto discloses a digital controller of an analog tape recorder used for recording and playing back a user's own speech. There are no recognition capabilities.

U.S. Pat. No. 4,860,360 to Boggs is a system for evaluating speech in which distortion in a communication channel is analyzed. There is no alignment or recognition of the speech signal against any known vocabulary, as the disclosure relates only to signal analysis and distortion measure computation.

U.S. Pat. No. 4,276,445 to Harbeson describes a speech analysis system which produces little more than an analog pitch display. It is not believed to be relevant to the subject invention.

U.S. Pat. No. 4,641,343 to Holland et al. describes an analog system which extracts formant frequencies which are fed to a microprocessor for ultimate display to a user. The only feedback is a graphic presentation of a signature which is directly computable from the input signal. There is no element of speech recognition or of any other high-level processing.

U.S. Pat. No. 4,783,803 to Baker et al. discloses a speech recognition apparatus and technique which includes means for determining where among frames to look for the start of speech. The disclosure contains a description of a low-level acoustically-based endpoint detector which processes only acoustic parameters, but it does not include higher level, context-sensitive end-point detection capability.

What is needed is a recognition and feedback system which can interact with a user in a linguistic context-sensitive manner to provide tracking of user-reading of a script in a quasi-conversational manner for instructing a user in properly-rendered, native-sounding speech.

SUMMARY OF THE INVENTION

According to the invention, an instruction system is provided which employs linguistic context-sensitive speech recognition for instruction and evaluation, particularly language instruction and language fluency evaluation. The system can administer a lesson, and particularly a language lesson, and evaluate performance in a natural voice-interactive manner while tolerating strong foreign accents from a non-native user. The lesson material and instructions may be presented to the learner in a variety of ways, including, but not limited to, video, audio or printed visual text. As an example, in one language-instruction-specific application, an entire conversation and interaction may be carried out in a target language, i.e., the language of instruction, while certain instructions may be in a language familiar to the user.

In connection with preselected visual information, the system may present aural information to a trainee. The system prompts the trainee-user to read text aloud during a reading phase while monitoring selected parameters of speech based on comparison with a script stored in the system. The system then asks the user certain questions, presenting a list of possible responses. The user is then expected to respond by reciting the appropriate response in the target language. The system is able to recognize and respond accurately and