WikiPatents - Community Patent Review
Word recognition in a speech recognition system using data reduced word templates    
United States Patent 4797929
Link to this page: http://www.wikipatents.com/4797929.html
Inventor(s): Gerson; Ira A. (Hoffman Estates, IL); Lindsley; Brett L. (Palatine, IL); Smanski; Philip J. (Palatine, IL)
Abstract: Described herein is an arrangement and method for processing speech information in a speech recognition system (300). The invention is best employed in a system where the speech information is depicted as words, each word representing a sequence of frames (510), and where the recognition system has means (120) for comparing present input speech to a word template, the word template being stored in template memory and derived from one or more previous input words. The invention describes combining contiguous acoustically similar frames (512) derived from the previous input word or words into representative frames to form a corresponding reduced word template, storing the reduced word template in template memory in an efficient manner, and comparing frames of the present input speech to the representative frames of the reduced word template according to the number of frames combined in the representative frames of the reduced word template. In doing so, a measure of similarity between the present input speech and the word template is generated.
   














Owner/Assignee     Motorola, Inc. (Schaumburg, IL)
Publication Date     January 10, 1989
Application Number     06/816,161
Filing Date     January 3, 1986
US Classification     704/243 704/238 704/239 704/245
Int'l Classification     G10L 005/00
Examiner     Kemeny; Emanuel S.
Attorney/Law Firm     Crawford; Robert J.
USPTO Field of Search     381/43
Patent Tags     word recognition speech recognition data reduced word templates
   
U.S. References

Patent No.     Inventor     US Class     Date
4513436     Nose     704/243     Apr 1985
4489435     Moshier     704/244     Dec 1984
4489434     Moshier     704/239     Dec 1984
4449233     Brantingham     704/258     May 1984
4449190     Flanagan     706/22     May 1984
4415767     Gill     704/243     Nov 1983
4412098     An     704/236     Oct 1983
4227177     Moshier     704/231     Oct 1980
4227176     Moshier     704/231     Oct 1980
4181813     Marley     704/251     Jan 1980
4132867     Siglow     370/512     Jan 1979
3812291     Brodes     704/253     May 1974
3582559     Hitchcock     62/243     Jun 1971
Claims
 


What is claimed is:

1. A method for processing speech information in a speech recognition system, wherein the information is represented by a sequence of frames, said speech recognition system being capable of comparing a given frame set to a template, and having template memory to store said template, said processing method comprising the steps of:

(a) combining contiguous acoustically similar frames of a previous frame set into representative frames to form a reduced template;

(b) storing said reduced template in template memory; and

(c) comparing frames of said given frame set to said representative frames of said reduced template according to the number of frames combined in said representative frames of said reduced template to produce a measure of similarity between the given frame set and the template.

2. The method of claim 1, wherein combining further includes the steps of generating a distortion measure for each representative frame and comparing each said distortion measure to a predetermined distortion threshold.

3. The method of claim 1, wherein combining further includes the step of determining a distortion measure corresponding to said reduced template.

4. The method of claim 1, further including the step of determining the number of frames combined in each representative frame.

5. The method of claim 1, wherein storing further includes the step of storing the number of frames combined into each representative frame.

6. The method of claim 1, further including the step of determining the number of representative frames representing said reduced template.

7. The method of claim 1, wherein storing further includes the step of storing a first element of frame data corresponding to the difference between a second element of frame data and said first element of frame data.

8. The method of claim 1, wherein storing further includes the step of storing an energy measure for each representative frame.

9. The method of claim 1, wherein comparing further includes the step of accumulating distance measures corresponding to the difference between the given input word and the reduced template.

10. A method for generating a measure of similarity for speech information in a speech recognition system, wherein the information is represented by a sequence of frames, said speech recognition system being capable of comparing a given input frame set to a template, said method comprising the steps of:

combining contiguous acoustically similar frames of a previous frame set into representative frames to form a reduced template;

comparing frames of said given frame set to said representative frames of said reduced template by accumulating a set of distance measures for each representative frame, each said set having a total number of accumulated distance measures corresponding to the number of frames combined in each said representative frame; and

determining a measure of similarity between said given frame set and said template based on said accumulated distance measures.

11. The method of claim 10, wherein comparing further includes the step of defining a maximum number of accumulations for a particular representative frame corresponding to the number of frames combined into said particular representative frame.

12. The method of claim 11, wherein comparing further includes the step of defining a minimum number of accumulations proportional to said maximum number of accumulations for said particular representative frame.

13. The method of claim 10, wherein comparing further includes the step of sequentially accumulating similarity measures corresponding to two representative frames, two said representative frames being separated by at least one other said representative frame, but without accumulating a similarity measure from said other representative frame.

14. An arrangement for processing speech information in a speech recognition system, wherein the information is represented by a sequence of frames, said speech recognition system being capable of comparing a given frame set to a template, and having template memory to store said template, said processing arrangement comprising:

(a) means for combining contiguous acoustically similar frames of a previous frame set into representative frames to form a reduced template;

(b) means for storing said reduced template in template memory; and

(c) means for comparing frames of said given frame set to said representative frames of said reduced template according to the number of frames combined in said representative frames of said reduced template to produce a measure of similarity between the given frame set and the template.

15. The arrangement of claim 1, wherein means for combining further includes means for generating a distortion measure for each representative frame and comparing each said distortion measure to a predetermined distortion threshold.

16. The arrangement of claim 1, wherein means for combining further includes means for determining a distortion measure corresponding to said reduced template.

17. The arrangement of claim 1, further including means for determining the number of frames combined in each representative frame.

18. The arrangement of claim 1, wherein means for storing further includes means for storing the number of frames combined into each representative frame.

19. The arrangement of claim 1, further including means for determining the number of representative frames representing said reduced template.

20. The arrangement of claim 1, wherein means for storing further includes means for storing a first element of frame data corresponding to the difference between a second element of frame data and said first element of frame data.

21. The arrangement of claim 1, wherein means for storing further including means for storing an energy measure for each representative frame.

22. The arrangement of claim 1, wherein means for comparing further includes means for accumulating distance measures corresponding to the difference between the given input word and the reduced template.

23. An arrangement for generating a measure of similarity for speech information in a speech recognition system, wherein the information is represented by a sequence of frames, said speech recognition system being capable of comparing a given input frame set to a template, said arrangement comprising:

means for combining contiguous acoustically similar frames of a previous frame set into representative frames to form a reduced template;

means for comparing frames of said given frame set to said representative frames of said reduced template by accumulating a set of distance measures for each representative frame, each said set having a total number of accumulated distance measures corresponding to the number of frames combined in each said representative frame; and

means for determining a measure of similarity between said given frame set and said template based on said accumulated distance measures.

24. The arrangement of claim 10, wherein means for comparing further includes means for defining a maximum number of accumulations for a particular representative frame corresponding to the number of frames combined into said particular representative frame.

25. The arrangement of claim 11, wherein means for comparing further includes means for defining a minimum number of accumulations proportional to said maximum number of accumulations for said particular representative frame.

26. The arrangement of claim 10, wherein means for comparing further includes means for sequentially accumulating similarity measures corresponding to two representative frames, two said representative frames being separated by at least one other said representative frame, but without accumulating a similarity measure from said other representative frame.

27. The method for processing speech information in a speech recognition system, wherein the information is represented by a sequence of frames, said speech recognition system being capable of comparing a given frame-set to a template using a word model comprising of a plurality of states, and having template memory to store said template, said processing method comprising the steps of:

(a) combining contiguous acoustically similar frames of a previous frame-set into representative frames;

(b) forming a word model having a plurality of states, each state corresponding to one of said representative frames; and

(c) comparing at least a predetermined minimum number of frames of the given frame-set to a state of the word model according to the number of frames combined in said representative frames to produce a measure of similarity between the given frame-set and the template.

28. The method for processing speech information in a speech recognition system, wherein the information is represented by a sequence of frames, said speech recognition system being capable of comparing a given frame-set to a template using a word model comprising of a plurality of states, and having template memory to store said template, said processing method comprising the steps of:

(a) combining contiguous acoustically similar frames of a previous frame-set into representative frames;

(b) forming a word model having a plurality of states, each state corresponding to one of said representative frames; and

(c) comparing not more than a predetermined maximum number of frames of the given frame-set to a state of the word model according to the number of frames combined in said representative frames to produce a measure of similarity between the given frame-set and the template.

29. The method for processing speech information in a speech recognition system, wherein the information is represented by a sequence of frames, said speech recognition system being capable of comparing a given frame-set to a template using a word model comprising of a plurality of states, and having template memory to store said template, said processing method comprising the steps of:

(a) combining contiguous acoustically similar frames of a previous frame-set into representative frames;

(b) forming a word model having a plurality of states, each state corresponding to one of said representative frames; and

(c) comparing at least a predetermined minimum number of frames, but not more than a predetermined maximum number of frames, of the given frame-set to a state of the word model according to the number of frames combined in said representative frames to produce a measure of similarity between the given frame-set and the template.
Description
 


BACKGROUND

The present invention relates to word recognition for a speech recognition system and, more particularly, to word recognition using word templates having a data reduced format.

Typically, speech recognition systems represent spoken words as word templates stored in system memory. When a system user speaks into the system, the system must digitally represent the speech for comparison to the word templates stored in memory.

Two particular aspects of such an implementation have received a great deal of attention. The first aspect pertains to the amount of memory which is required to store the word templates. The representation of speech is such that the data used for matching to an input word typically requires a significant amount of memory to be dedicated for each particular word. Moreover, a large vocabulary causes extensive computation time to be consumed for the match. In general, the computation time increases linearly with the amount of memory required for the template memory. Practical implementation in real time requires that this computation time be reduced. Of course, a faster processor architecture could be employed to reduce this computation time, but due to cost considerations, it is preferred that the data representing the word templates be reduced to lessen the computation.

The second aspect pertains to the particular matching techniques used in the system. Most word recognition techniques have been directed to the accuracy of the recognition process for a particular type of feature data used to represent the speech. Typically, channel bank information or LPC parameters represent the speech. When using feature data of a reduced format, the word recognition process must be sensitive to the format for an effective implementation.

The speech recognition system described herein clusters frames within the word templates to reduce the representative data, so the word recognition technique must give special consideration to the combined frames. Data reduced word templates represent spoken words in a compacted form. Matching an incoming word to a reduced word template without adequately compensating for its compacted form will result in degraded recognizer performance. An obvious method for compensating for data reduced word templates would be uncompacting the reduced data before matching. Unfortunately, uncompacting the reduced data defeats the purpose of data reduction. Hence, a word recognition method is needed which allows reduced data to be directly matched against an incoming spoken word without degrading the word recognition process.

OBJECTS AND SUMMARY OF THE PRESENT INVENTION

Accordingly, an object of the present invention is to provide a system of word recognition which reduces template data and recognizes the reduced data in an efficient manner.

The present invention teaches an arrangement and method for processing speech information in a speech recognition system. In a system where the information is depicted as words, each word represented by a sequence of frames, and where the recognition system has means for comparing present input speech to a word template, the word template stored in template memory and derived from one or more previous input words, the processing method includes (1) combining contiguous acoustically similar frames derived from the previous input word(s) into representative frames to form a corresponding reduced word template, (2) storing the reduced word template in template memory in an efficient manner, and (3) comparing frames of the present input speech to the representative frames of the reduced word template according to the number of frames combined in the representative frames of the reduced word template to produce a measure of similarity between the present input speech and the word template.
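Step (1) of the summary above can be illustrated with a minimal sketch. It is not the procedure of the preferred embodiment, which the figure descriptions indicate uses cluster paths and traceback; instead it assumes a simple greedy merge. The names `reduce_template`, `mean_frame`, `squared_distance`, and the `threshold` parameter are hypothetical, and squared Euclidean distance is an assumed distortion measure.

```python
from typing import List, Tuple

Frame = List[float]


def squared_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two feature frames (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_frame(frames: List[Frame]) -> Frame:
    """Element-wise average of a non-empty list of frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def reduce_template(frames: List[Frame], threshold: float) -> List[Tuple[Frame, int]]:
    """Greedily merge contiguous frames into (representative frame, count) pairs.

    A frame joins the current cluster only if every member of the enlarged
    cluster stays within `threshold` of the enlarged cluster's average.
    This greedy rule is an assumption made for illustration.
    """
    reduced: List[Tuple[Frame, int]] = []
    cluster: List[Frame] = []
    for frame in frames:
        candidate = cluster + [frame]
        average = mean_frame(candidate)
        distortion = max(squared_distance(f, average) for f in candidate)
        if cluster and distortion > threshold:
            # Close the current cluster and start a new one with this frame.
            reduced.append((mean_frame(cluster), len(cluster)))
            cluster = [frame]
        else:
            cluster = candidate
    if cluster:
        reduced.append((mean_frame(cluster), len(cluster)))
    return reduced
```

Each output pair keeps the number of frames that were combined, which is the quantity the comparison step later relies on.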

BRIEF DESCRIPTION OF THE DRAWINGS

Additional objects, features, and advantages in accordance with the present invention will be more clearly understood by reference to the following description taken in connection with the accompanying drawings, in the several figures of which like reference numerals identify like elements, and in which:

FIG. 1 is a general block diagram illustrating the technique of synthesizing speech from speech recognition templates according to the present invention;

FIG. 2 is a block diagram of a speech communications device having a user-interactive control system employing speech recognition and speech synthesis in accordance with the present invention;

FIG. 3 is a detailed block diagram of the preferred embodiment of the present invention illustrating a radio transceiver having a hands-free speech recognition/speech synthesis control system;

FIG. 4a is an expanded block diagram of the data reducer block 322 of FIG. 3;

FIG. 4b is a flowchart showing the sequence of steps performed by the energy normalization block 410 of FIG. 4a;

FIG. 4c is a detailed block diagram of the particular hardware configuration of the segmentation/compression block 420 of FIG. 4a;

FIG. 5a is a graphical representation of a spoken word segmented into frames for forming a cluster according to the present invention;

FIG. 5b is a diagram exemplifying output clusters being formed for a particular word template, according to the present invention;

FIG. 5c is a table showing the possible formations of an arbitrary partial cluster path according to the present invention;

FIGS. 5d and 5e show a flowchart illustrating a basic implementation of the data reduction process performed by the segmentation/compression block 420 of FIG. 4a;

FIG. 5f is a detailed flowchart of the traceback and output clusters block 582 of FIG. 5e, showing the formation of a data reduced word template from previously determined clusters;

FIG. 5g is a traceback pointer table illustrating a clustering path for 24 frames, according to the present invention, applicable to partial traceback;

FIG. 5h is a graphical representation of the traceback pointer table of FIG. 5g illustrated in the form of a frame connection tree;

FIG. 5i is a graphical representation of FIG. 5h showing the frame connection tree after three clusters have been output by tracing back to common frames in the tree;

FIGS. 6a and 6b comprise a flowchart showing the sequence of steps performed by the differential encoding block 430 of FIG. 4a;

FIG. 6c is a generalized memory map showing the particular data format of one frame of the template memory 160 of FIG. 3;

FIG. 7a is a graphical representation of frames clustered into average frames, each average frame represented by a state in a word model, in accordance with the present invention;

FIG. 7b is a detailed block diagram of the recognition processor 120 of FIG. 3, illustrating its relationship with the template memory 160;

FIG. 7c is a flowchart illustrating one embodiment of the sequence of steps required for word decoding according to the present invention;

FIGS. 7d and 7e comprise a flowchart illustrating one embodiment of the steps required for state decoding according to the present invention;

FIG. 8a is a detailed block diagram of the data expander block 346 of FIG. 3;

FIG. 8b is a flowchart showing the sequence of steps performed by the differential decoding block 802 of FIG. 8a;

FIG. 8c is a flowchart showing the sequence of steps performed by the energy denormalization block 804 of FIG. 8a;

FIG. 8d is a flowchart showing the sequence of steps performed by the frame repeating block 806 of FIG. 8a;

FIG. 9a is a detailed block diagram of the channel bank speech synthesizer 340 of FIG. 3;

FIG. 9b is an alternate embodiment of the modulator/bandpass filter configuration 980 of FIG. 9a;

FIG. 9c is a detailed block diagram of the preferred embodiment of the pitch pulse source 920 of FIG. 9a;

FIG. 9d is a graphic representation illustrating various waveforms of FIGS. 9a and 9c.

DESCRIPTION OF THE PREFERRED EMBODIMENT

1. System Configuration

Referring now to the accompanying drawings, FIG. 1 shows a general block diagram of user-interactive control system 100 of the present invention. Electronic device 150 may include any electronic apparatus that is sophisticated enough to warrant the incorporation of a speech recognition/speech synthesis control system. In the preferred embodiment, electronic device 150 represents a speech communications device such as a mobile radiotelephone.

User-spoken input speech is applied to microphone 105, which acts as an acoustic coupler providing an electrical input speech signal for the control system. Acoustic processor 110 performs acoustic feature extraction upon the input speech signal. Word features, defined as the amplitude/frequency parameters of each user-spoken input word, are thereby provided to speech recognition processor 120 and to training processor 170. Acoustic processor 110 may also include a signal conditioner, such as an analog-to-digital converter, to interface the input speech signal to the speech recognition control system. Acoustic processor 110 will be further described in conjunction with FIG. 3.

Training processor 170 manipulates this word feature information from acoustic processor 110 to provide word recognition templates to be stored in template memory 160. During the training procedure, the incoming word features are arranged into individual words by locating their endpoints. If the training procedure is designed to accommodate multiple training utterances for word feature consistency, then the multiple utterances may be averaged to form a single word template. Furthermore, since most speech recognition systems do not require all of the speech information to be stored as a template, some type of data reduction is often performed by a training processor 170 to reduce the template memory requirements. The word templates are stored in template memory 160 for use by speech recognition processor 120 as well as by speech synthesis processor 140. The exact training procedure utilized by the preferred embodiment of the present invention may be found in the description accompanying FIG. 2.

In the recognition mode, speech recognition processor 120 compares the word feature information provided by acoustic processor 110 to the word recognition templates provided by template memory 160. If the acoustic features of the present word feature information derived from the user-spoken input speech sufficiently match the acoustic features of a particular prestored word template derived from the template memory, then recognition processor 120 provides device control data to device controller 130 indicative of the particular word recognized. A further discussion of an appropriate speech recognition apparatus, and how the preferred embodiment incorporates data reduction into the training process may be found in the description accompanying FIGS. 3 through 5.
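The comparison against a reduced template can be sketched as a small dynamic program in which each representative frame acts as a state and the number of input frames assigned to a state is bounded by the number of frames that were combined into it. This is a sketch under stated assumptions, not the decoder of the preferred embodiment: the function names are hypothetical, the bounds of one half and twice the combined count (`min_factor`, `max_factor`) are assumed values, squared Euclidean distance is an assumed measure, and the state skipping mentioned in the claims is not modeled.

```python
import math
from typing import List, Tuple

Frame = List[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two frames (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_reduced_template(
    speech: List[Frame],
    template: List[Tuple[Frame, int]],
    min_factor: float = 0.5,
    max_factor: float = 2.0,
) -> float:
    """Return the lowest accumulated distance, or math.inf if no alignment fits.

    `template` holds (representative frame, frames combined) pairs. Each state
    may absorb between min_dwell and max_dwell consecutive input frames, both
    derived from the combined count using the assumed factors.
    """
    n_frames, n_states = len(speech), len(template)
    # best[s][t]: lowest cost with the first t input frames assigned to the
    # first s states, every state having met its dwell bounds.
    best = [[math.inf] * (n_frames + 1) for _ in range(n_states + 1)]
    best[0][0] = 0.0
    for s, (representative, count) in enumerate(template, start=1):
        min_dwell = max(1, math.floor(count * min_factor))
        max_dwell = max(min_dwell, math.ceil(count * max_factor))
        for t in range(1, n_frames + 1):
            accumulated = 0.0
            # Extend the dwell backwards one input frame at a time.
            for dwell in range(1, min(max_dwell, t) + 1):
                accumulated += frame_distance(speech[t - dwell], representative)
                total = best[s - 1][t - dwell] + accumulated
                if dwell >= min_dwell and total < best[s][t]:
                    best[s][t] = total
    return best[n_states][n_frames]
```

A lower return value indicates a closer match. An utterance that cannot be aligned within the dwell bounds yields `math.inf`, which a caller would treat as no match.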

Device controller 130 interfaces the entire control system to electronic device 150. Device controller 130 translates the device control data provided by recognition processor 120 into control signals adaptable for use by the particular electronic device. These control signals direct the device to perform specific operating functions as instructed by the user. (Device controller 130 may also perform additional supervisory functions related to other elements shown in FIG. 1.) An example of a device controller known in the art and suitable for use with the present invention is a microcomputer. Refer to FIG. 3 for further details of the hardware implementation.

Device controller 130 also provides device status data representing the operating status of electronic device 150. This data is applied to speech synthesis processor 140, along with word recognition templates from template memory 160. Synthesis processor 140 utilizes the status data to determine which word recognition template is to be synthesized into user-recognizable reply speech. Synthesis processor 140 may also include an internal reply memory, also controlled by the status data, to provide "canned" reply words to the user. In either case, the user is informed of the electronic device operating status when the speech reply signal is output via speaker 145.

Thus, FIG. 1 illustrates how the present invention provides a user-interactive control system utilizing speech recognition to control the operating parameters of an electronic device, and how a speech recognition template may be utilized to generate reply speech to the user indicative of the operating status of the device.

FIG. 2 illustrates in more detail the application of the user-interactive control system to a speech communications device comprising a part of any radio or landline voice communications system, such as, for example, a two-way radio system, a telephone system, an intercom system, etc. Acoustic processor 110, recognition processor 120, template memory 160, and device controller 130 are the same in structure and in operation as the corresponding blocks of FIG. 1. However, control system 200 illustrates the internal structure of speech communications device 210. Speech communication terminal 225 represents the main electronic network of device 210, such as, for example, a telephone terminal or a communications console. In this embodiment, microphone 205 and speaker 245 are incorporated into the speech communications device itself. A typical example of this microphone/speaker arrangement would be a telephone handset. Speech communications terminal 225 interfaces operating status information of the speech communications device to device controller 130. This operating status information may comprise functional status data of the terminal itself (e.g., channel data, service information, operating mode messages, etc.), user-feedback information of the speech recognition control system (e.g., directory contents, word recognition verification, operating mode status, etc.), or may include system status data pertaining to the communications link (e.g., loss-of-line, system busy, invalid access code, etc.).

In either the training mode or the recognition mode, the features of user-spoken input speech are extracted by acoustic processor 110. In the training mode, which is represented in FIG. 2 by position "A" of switch 215, the word feature information is applied to word averager 220 of training processor 170. As previously mentioned, if the system is designed to average multiple utterances together to form a single word template, the averaging is performed by word averager 220. Through the use of word averaging, the training processor can take into account the minor variances between two or more utterances of the same word, thereby producing a more reliable word template. Numerous word averaging techniques may be used. For example, one method would be to combine only the similar word features of all training utterances to produce a "best" set of features for the word template. Another technique may be to simply compare all training utterances to determine which one provides the "best" template. Still another word averaging technique is described by L. R. Rabiner and J. G. Wilpon in "A Simplified Robust Training Procedure for Speaker Trained, Isolated Word Recognition Systems", Journal of the Acoustical Society of America, vol. 68 (November 1980), pp. 1271-76.
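As a rough illustration of word averaging, the sketch below time-normalizes each training utterance to the length of the first one and then averages the frames element by element. This is only one simple possibility and is not the technique of the cited reference or necessarily that of word averager 220; `resample` and `average_utterances` are hypothetical names, and nearest-frame sampling is an assumed way to equalize lengths.

```python
from typing import List

Frame = List[float]


def resample(utterance: List[Frame], length: int) -> List[Frame]:
    """Linearly time-normalize an utterance by nearest-frame sampling."""
    n = len(utterance)
    return [utterance[min(n - 1, (i * n) // length)] for i in range(length)]


def average_utterances(utterances: List[List[Frame]]) -> List[Frame]:
    """Average several utterances of one word into a single frame sequence.

    Every utterance is first stretched or shrunk to the length of the first
    utterance (an assumption), then frames are averaged element-wise.
    """
    length = len(utterances[0])
    aligned = [resample(u, length) for u in utterances]
    return [
        [sum(values) / len(aligned) for values in zip(*frames)]
        for frames in zip(*aligned)
    ]
```

For example, averaging a two-frame utterance with a four-frame utterance of the same word produces a two-frame result whose features lie midway between the aligned frames.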

Data reducer 230 then performs data reduction upon either the averaged word data from word averager 220 or upon the word feature signals directly from acoustic processor 110, depending upon the presence or absence of a word averager. In either case, the reduction process consists of segmenting this "raw" word feature data and combining the data in each segment. The storage requirements for the template are then further reduced by differential encoding of the segmented data to produce "reduced" word feature data. This specific data reduction technique of the present invention is fully described in conjunction with FIGS. 4 and 5. To summarize, data reducer 230 compresses the raw word data to minimize the template storage requirements and to reduce the speech recognition computation time.
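The differential encoding step can be pictured with a short sketch. It assumes each frame is a list of integer channel values and stores the first value followed by the differences between adjacent values; the exact layout used by differential encoding block 430 is described with FIGS. 6a through 6c and may differ. The function names are hypothetical, and no bit packing is attempted here.

```python
from typing import List


def differential_encode(channels: List[int]) -> List[int]:
    """Store the first channel value, then each change from the previous channel."""
    if not channels:
        return []
    encoded = [channels[0]]
    for previous, current in zip(channels, channels[1:]):
        encoded.append(current - previous)
    return encoded


def differential_decode(encoded: List[int]) -> List[int]:
    """Rebuild the channel values by accumulating the stored differences."""
    decoded: List[int] = []
    running = 0
    for value in encoded:
        running += value
        decoded.append(running)
    return decoded
```

When adjacent channel values are close to one another, the stored differences are small numbers that can typically be held in fewer bits than the raw values, which is where the storage saving would come from.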

The reduced word feature data provided by training processor 170 is stored as word recognition templates in template memory 160. In the recognition mode, which is illustrated by position "B" of switch 215, recognition processor 120 compares the incoming word feature signals to the word recognition templates. Upon recognition of a valid command word, recognition processor 120 may instruct device controller 130 to cause a corresponding speech communications device control function to be executed by speech communications terminal 225. Terminal 225 may respond to device controller 130 by sending operating status information back to controller 130 in the form of terminal status data. This data can be used by the control system to synthesize the appropriate speech reply signal to inform the user of the present device operating status. This sequence of events will be more clearly understood by referring to the subsequent example.

Synthesis processor 140 is comprised of speech synthesizer 240, data expander 250, and reply memory 260. A synthesis processor of this configuration is capable of generating "canned" replies to the user from a prestored vocabulary (stored in reply memory 260), as well as generating "template" responses from a user-generated vocabulary (stored in template memory 160). Speech synthesizer 240 and reply memory 260 are further described in conjunction with FIG. 3, and data expander 250 is fully described in the text accompanying FIG. 8a. In combination, the blocks of synthesis processor 140 generate a speech reply signal to speaker 245. Accordingly, FIG. 2 illustrates the technique of using a single template memory for both speech recognition and speech synthesis.

The simplified example of a "smart" telephone terminal employing voice-controlled dialing from a stored telephone number directory is now used to describe the operation of the control system of FIG. 2. Initially, an untrained speaker-dependent speech recognition system cannot recognize command words. Therefore, the user must manually prompt the device to begin the training procedure, perhaps by entering a particular code into the telephone keypad. Device controller 130 then directs switch 215 to enter the training mode (position "A"). Device controller 130 then instructs speech synthesizer 240 to respond with the predefined phrase TRAINING VOCABULARY ONE, which is a "canned" response obtained from reply memory 260. The user then begins to build a command word vocabulary by uttering command words, such as STORE or RECALL, into microphone 205. The features of the utterance are first extracted by acoustic processor 110, and then applied to either word averager 220 or data reducer 230. If the particular speech recognition system is designed to accept multiple utterances of the same word, word averager 220 produces a set of averaged word features representing the best representation of that particular word. If the system does not have word averaging capabilities, the single utterance word features (rather than the multiple utterance averaged word features) are applied to data reducer 230. The data reduction process removes unnecessary or duplicate feature data, compresses the remaining data, and provides template memory 160 with "reduced" word recognition templates. A similar procedure is followed for training the system to recognize digits.

Once the system is trained with the command word vocabulary, the user must continue the training procedure by entering telephone directory names and numbers. To accomplish this task, the user utters the previously-trained command word ENTER. Upon recognition of this utterance as a valid user command, device controller 130 instructs speech synthesizer 240 to reply with the "canned" phrase DIGITS PLEASE? stored in reply memory 260. Upon entering the appropriate telephone number digits (e.g., 555-1234), the user says TERMINATE and the system replies NAME PLEASE? to prompt user-entry of the corresponding directory name (e.g., SMITH). This user-interactive process continues until the telephone number directory is completely filled with the appropriate telephone names and digits.

To place a phone call, the user simply utters the command word RECALL. When the utterance is recognized as a valid user command by recognition processor 120, device controller 130 directs speech synthesizer 240 to generate the verbal reply NAME? via synthesizing information provided by reply memory 260. The user then responds by speaking the name in the directory index corresponding to the telephone number that he desires to dial (e.g. JONES). The word will be recognized as a valid directory entry if it corresponds to a predetermined name index stored in template memory 160. If valid, device controller 130 directs data expander 250 to obtain the appropriate reduced word recognition template from template memory 160 and perform the data expansion process for synthesis. Data expander 250 "unpacks" the reduced word feature data and restores the proper energy contour for an intelligible reply word. The expanded word template data is then fed to speech synthesizer 240. Using both the template data and the reply memory data, speech synthesizer 240 generates the phrase JONES . . . (from template memory 160 through data expander 250) . . . FIVE-FIVE-FIVE, SIX-SEVEN-EIGHT-NINE (from reply memory 260).
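The "unpacking" step and the assembly of a reply from two memories can be sketched as follows. This is only an illustration and not the data expander of FIG. 8a. It assumes a reduced template is a list of (representative frame, frame count) pairs, it does not model the restoration of the energy contour, and the function names and dictionary layouts are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]
ReducedTemplate = List[Tuple[Frame, int]]


def expand_template(reduced: ReducedTemplate) -> List[List[float]]:
    """Repeat each representative frame by its stored frame count."""
    expanded: List[List[float]] = []
    for frame, count in reduced:
        expanded.extend(list(frame) for _ in range(count))
    return expanded


def build_recall_reply(name: str, digits: str,
                       template_memory: Dict[str, ReducedTemplate],
                       reply_memory: Dict[str, str]) -> list:
    """Assemble reply parts: the name from template memory, digits from reply memory."""
    parts: list = [("template", expand_template(template_memory[name]))]
    parts.extend(("canned", reply_memory[d]) for d in digits)
    return parts
```

The name portion comes from the user-trained template, while each digit is looked up among the prestored replies, mirroring the JONES example above.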

The user then says the command word SEND which, when recognized by the control system, instructs device controller 130 to send telephone number dialing information to speech communications terminal 225. Terminal 225 outputs this dialing information via an appropriate communications link. When the telephone connection is made, speech communications terminal 225 interfaces microphone audio from microphone 205 to the appropriate transmit path, and receive audio from the appropriate receive audio path to speaker 245. If a proper telephone connection cannot be made, terminal 225 provides the appropriate communications link status information to device controller 130. Accordingly, device controller 130 instructs speech synthesizer 240 to generate the appropriate reply word corresponding to the status information provided, such as the reply word SYSTEM BUSY. In this manner, the user is informed of the communications link status, and user-interactive voice-controlled directory dialing is achieved.
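The RECALL and SEND exchange described above can be viewed as a small state machine. The sketch below is a simplified illustration and not the device controller itself. The class name, the state names, the `link_ok` flag, and the plain-string replies are all assumptions, and recognition is assumed to have already produced a word.

```python
from typing import Dict, Optional


class DialerSketch:
    """Hypothetical controller for the RECALL -> name -> SEND dialogue."""

    def __init__(self, directory: Dict[str, str]) -> None:
        self.directory = dict(directory)   # name -> telephone number
        self.state = "IDLE"
        self.pending: Optional[str] = None  # number awaiting SEND
        self.dialed: Optional[str] = None   # last number sent to the terminal

    def hear(self, word: str, link_ok: bool = True) -> Optional[str]:
        """Process one recognized word; return the spoken reply, if any."""
        if self.state == "IDLE" and word == "RECALL":
            self.state = "AWAIT_NAME"
            return "NAME?"
        if self.state == "AWAIT_NAME" and word in self.directory:
            self.pending = self.directory[word]
            self.state = "AWAIT_SEND"
            return f"{word} {self.pending}"
        if self.state == "AWAIT_SEND" and word == "SEND":
            self.state = "IDLE"
            number, self.pending = self.pending, None
            if link_ok:
                self.dialed = number
                return None  # connection made, no status reply needed
            return "SYSTEM BUSY"
        return None  # word is not valid in the current state
```

A short usage example:

```python
dialer = DialerSketch({"JONES": "555-6789"})
dialer.hear("RECALL")               # reply: "NAME?"
dialer.hear("JONES")                # reply: "JONES 555-6789"
dialer.hear("SEND", link_ok=False)  # reply: "SYSTEM BUSY"
```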

The above operational description is merely one application of synthesizing speech from speech recognition templates according to the present invention. Numerous other applications of this novel technique to a speech communications device are contemplated, such as, for example, a communications console, a two-way radio, etc. In the preferred embodiment, the control system of the present invention is used with a mobile radiotelephone.

Although speech recognition and speech synthesis allow a vehicle operator to keep both eyes on the road, the conventional handset or hand-held microphone prohibits him from keeping both hands on the steering wheel or from executing proper manual (or automatic) transmission shifting. For this reason, the control system of the preferred embodiment incorporates a speakerphone to provide hands-free control of the speech communications device. The speakerphone performs the transmit/receive audio switching function, as well as the received/reply audio multiplexing function.

Referring now to FIG. 3, control system 300 utilizes the same acoustic processor block 110, training processor block 170, recognition processor block 120, template memory block 160, device controller block 130, and synthesis processor block 140 as the corresponding blocks of FIG. 2. However, microphone 302 and speaker 375 are not an integral part of the speech communications terminal. Instead, input speech signal from microphone 302 is directed to radiotelephone 350