WikiPatents - Community Patent Review
Method of data reduction in a speech recognition    
United States Patent 4905288
Link to this page: http://www.wikipatents.com/4905288.html
Inventor(s): Gerson; Ira A. (Hoffman Estates, IL); Lindsley; Brett L. (Palatine, IL)
Abstract: The present invention describes a method and arrangement for reducing a sequence of initial frames into a reduced set of representative frames by combining the initial frames into a plurality of representative frames, the combining process including generating a distortion measure associated with each representative frame and comparing each distortion measure to a distortion threshold. From these representative frames, a set of mutually exclusive frames is determined to minimize the number of representative frames, whereby each representative frame in the set represents a unique set of contiguous initial frames and has an associated distortion measure which does not exceed the distortion threshold.



Title Information
Method of data reduction in a speech recognition
Inventor     Gerson; Ira A. (Hoffman Estates, IL); Lindsley; Brett L. (Palatine, IL)
Owner/Assignee     Motorola, Inc. (Schaumburg, IL)
Publication Date     February 27, 1990
Application Number     07/262,173
Filing Date     October 18, 1988
US Classification     704/245 704/234 704/275
Int'l Classification     G10L 005/00
Examiner     Kemeny; Emanuel S.
Assistant Examiner    
Attorney/Law Firm     Southard; Donald B.
Address
Parent Case     This is a continuation of application Ser. No. 816,163, filed Jan. 3, 1986, now abandoned.
Priority Data    
USPTO Field of Search     381/30 381/35 381/43 375/122
Patent Tags     data reduction speech recognition
   
References

U.S. References
Patent No.  Inventor     U.S. Class  Issue Date
4624009     Glenn        704/231     Nov. 1986
4513436     Nose         704/243     Apr. 1985
4449233     Brantingham  704/258     May 1984
4449190     Flanagan     706/22      May 1984
4415767     Gill         704/243     Nov. 1983
4412098     An           704/236     Oct. 1983
4227177     Moshier      704/231     Oct. 1980
4227176     Moshier      704/231     Oct. 1980
4181813     Marley       704/251     Jan. 1980
4132867     Siglow       370/512     Jan. 1979
3812291     Brodes       704/253     May 1974
3582559     Hitchcock    62/243      Jun. 1971
Claims
 


What is claimed is:

1. In a speech processing system, wherein speech is represented as a sequence of original frames, a method for reducing the sequence of original frames into a reduced set of representative frames comprising the steps of:

storing a plurality of original frames from the sequence;

combining said stored original frames into a plurality of representative frames;

generating, for each representative frame, a distortion measure corresponding to the distance between each said representative frame and said original frames combined therein;

comparing each said distortion measure to a predetermined distortion threshold; and

determining a set of a minimum number of said representative frames representing said stored original frames and each representative frame having a generated distortion measure less than said predetermined distortion threshold.

2. The method of claim 1, wherein said set of representative frames represents every original frame in the series.

3. The method of claim 1, further including the step of invalidating all representative frames designated by original frames m through n, where m < n, if said associated distortion measure from a previously determined representative frame designated by original frames i through j, where i ≥ m, j ≤ n and i < j, exceeds said distortion threshold by a predetermined constant.

4. In a speech processing system, wherein speech is represented as a sequence of original frames, a method for reducing the sequence of original frames into a reduced set of representative frames comprising the steps of:

forming cluster paths ending at each original frame in the sequence, said frames in sequence designated m through n, where m<n, each said cluster path composed of a series of combined original frames;

forming an additional representative frame by combining frames j through n+1, wherein m<j<n and j is an integer designating a frame in the series, said forming of an additional representative frame including the steps of:

generating, for said additional representative frame, a distortion measure corresponding to the distance between said additional representative frame and original frames combined therein and comparing said distortion measure to a predetermined distortion threshold; and

appending said additional representative frame to said previously formed cluster paths if said distortion measure does not exceed said distortion threshold, whereby the resultant reduced set of representative frames is comprised of said additional representative frame appended to said cluster path formed at frame j-1.

5. The method of claim 1 or 4, wherein each representative frame includes at least a predetermined minimum number of original frames.

6. The method of claim 1 or 4, wherein each representative frame includes no more than a predetermined maximum number of original frames.

7. The method of claim 1 or 4, further including the step of recording the number of original frames combined in each representative frame in the set.

8. The method of claim 1 or 4, further including the step of recording said distortion measure associated with each representative frame in the set.

9. The method of claim 1 or 4, wherein at least one said representative frame in the set includes a single frame.

10. The method of claim 4, further including the step of invalidating at least one said cluster path when another cluster path is determined to have fewer representative frames.

11. The method of claim 1 or 4, further including the step of designating one or more representative frames in the set as an output frame.

12. The method of claim 1 or 4, further including the step of connecting said representative frames in the set with pointers.

13. The method of claim 1 or 4, including the step of generating a peak distortion measure.

14. The method of claim 1 or 4, further including the step of determining a convergence reference frame.

15. The method of claim 4, further including the step of comparing said distortion measures associated with two cluster paths having the same number of representative frames.

16. The method of claim 4, further including the step of determining a distortion measure associated with the set of representative frames.

17. The method of claim 4, further including the step of selecting representative frames from one end of said sequence to the other end of said sequence.

18. In a speech processing system, wherein speech is represented as a sequence of original frames, an arrangement for reducing the sequence of original frames into a reduced set of representative frames comprising:

means for storing a plurality of original frames from the sequence;

means for combining said stored original frames into a plurality of representative frames;

means for generating, for each representative frame, a distortion measure corresponding to the distance between each said representative frame and said original frames combined therein;

means for comparing each said distortion measure to a predetermined distortion threshold; and

means for determining a set of a minimum number of said representative frames representing said stored original frames, each representative frame having a generated distortion measure less than said predetermined distortion threshold.

19. The arrangement of claim 18, wherein said set of representative frames represents every original frame in the series.

20. The arrangement of claim 18, further including means for invalidating all representative frames designated by original frames m through n, where m < n, if said associated distortion measure from a previously determined representative frame designated by original frames i through j, where i ≥ m, j ≤ n and i < j, exceeds said distortion threshold by a predetermined constant.

21. In a speech processing system, wherein speech is represented as a sequence of original frames, a method for reducing the sequence of original frames into a reduced set of representative frames comprising:

means for forming cluster paths ending at each original frame in the sequence, said frames in sequence designated m through n, where m<n, each said cluster path composed of a series of combined original frames;

means for forming an additional representative frame by combining frames j through n+1, where m<j<n and j is an integer designating a frame in the series, said means for forming of an additional representative frame including:

means for generating, for said additional representative frame, a distortion measure corresponding to the distance between said additional representative frame and the original frames combined therein and means for comparing said distortion measure to a predetermined distortion threshold; and

means for appending said additional representative frame to said previously formed cluster paths if said distortion measure does not exceed said distortion threshold, whereby the resultant reduced set of representative frames is comprised of said additional representative frame appended to said cluster path formed at frame j-1.

22. The arrangement of claim 18 or 21, wherein each representative frame includes at least a predetermined minimum number of original frames.

23. The arrangement of claim 18 or 21, wherein each representative frame includes no more than a predetermined maximum number of original frames.

24. The arrangement of claim 18 or 21, further including means for recording the number of original frames combined in each representative frame in the set.

25. The arrangement of claim 18 or 21, further including means for recording said distortion measure associated with each representative frame in the set.

26. The arrangement of claim 18 or 21, wherein at least one said representative frame in the set includes a single frame.

27. The arrangement of claim 21, further including means for invalidating at least one said cluster path when another cluster path is determined to have fewer representative frames.

28. The arrangement of claim 18 or 21, further including means for designating one or more representative frames in the set as an output frame.

29. The arrangement of claim 18 or 21, further including means for connecting said representative frames in the set with pointers.

30. The arrangement of claim 18 or 21, including means for generating a peak distortion measure.

31. The arrangement of claim 18 or 21, further including means for determining a convergence reference frame.

32. The arrangement of claim 21, further including means for comparing said distortion measures associated with two cluster paths having the same number of representative frames.

33. The arrangement of claim 21, further including means for determining a distortion measure associated with the set of representative frames.

34. The arrangement of claim 21, further including means for selecting representative frames from one end of said sequence to the other end of said sequence.
Description
 


BACKGROUND OF THE INVENTION

The present invention relates to the practice of generating word templates and, more specifically, to the practice of reducing data representing word templates in a speech recognition system.

In systems that require digital storage of an analog waveform, a significant amount of memory must be allocated for an accurate representation. In a speech recognition system, where word recognition depends on such accuracy, storing speech digitally requires an excessive amount of memory. This is especially true for speech recognition systems requiring large vocabularies. Each word in the vocabulary is typically represented by a word template. Each word template includes frames, segmented in equal time intervals, representing a spoken word. To practically implement a large vocabulary into a speech recognition system, two problems must be overcome.

The first problem is the extensive memory which is required to digitally store the vocabulary. Memory is expensive in cost and in circuit board real estate.

The second problem is the computation time required to process this representative data. In general, the computation time increases linearly with the amount of memory required for the template data. In systems utilizing large vocabularies, these two problems are an enormous burden for practical operation of a speech recognition system in real-time. Accordingly, the need to reduce the required template data is well recognized in the field of speech recognition.

Reduction of template data can be applied to sounds within a word template which are acoustically similar. Speech is typically time segmented in equal intervals. Each segment is referred to as a frame. For example, words which are spoken slowly often have frames of speech which are merely a long continuation of the same sound. Since frames having acoustically similar sounds do not need to be represented repetitively, there has been discussion of combining these frames into a representative frame. Combining frames in this manner is referred to as clustering.

When clustering any number of word template frames, the resultant frame is somewhat distorted with respect to the original frames due to slight variations of the representative data in each frame. Typically, when two or more frames are measured to be acoustically similar, clustering the frames is not expected to produce an excessive distortion. Techniques for determining an accurate similarity measure between frames are used to determine whether two or more frames should be clustered.

Similarity of frame information is usually measured using a distance calculation, such as the Hamming or Chebyshev calculation, depending on the type of representative data. Two sequential frames from a word template can be clustered into a single frame if the `distance` between them is less than a predetermined distance. By clustering frames which have a small distance calculated between them, the data representing the speech can be reduced.
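The pairwise test described above can be sketched in a few lines. This is only an illustration: the Chebyshev distance is one of the measures named in the text, while the function names, the example feature values, and the `max_distance` parameter are assumptions made here.

```python
from typing import Sequence


def chebyshev_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute difference between corresponding features of two frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of features")
    return max(abs(x - y) for x, y in zip(a, b))


def can_cluster(a: Sequence[float], b: Sequence[float], max_distance: float) -> bool:
    """True if two sequential frames are closer than the predetermined distance."""
    return chebyshev_distance(a, b) < max_distance


# Hypothetical feature vectors for three sequential frames.
frame_1 = [10.0, 12.0, 7.0]
frame_2 = [11.0, 12.5, 6.0]
frame_3 = [20.0, 3.0, 6.0]

print(can_cluster(frame_1, frame_2, max_distance=2.0))  # True (distance 1.0)
print(can_cluster(frame_2, frame_3, max_distance=2.0))  # False (distance 9.5)
```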

However, clustering frames in this manner is a problem when the quantity of frames in the word template is large. To `optimally` reduce the word template, a representative word template must be generated which has the fewest representative frames while also satisfying a distortion criterion for each representative frame. Typically, this requires testing every possible clustering of frames in the word template. The clusters must be selected such that no other sequence of clusters will result in fewer clusters meeting the distortion criterion. The sequence of clusters is hereinafter referred to as a cluster path for the word template. The cluster path which results in the least distortion and the fewest number of clusters is the optimal cluster path. For a word template with a large number of frames, the search for the optimal cluster path results in an excessive amount of computation. For example, consider a word template comprised of 3 frames. There are a total of 4 possible cluster paths to consider: (1)(2)(3), (1 2)(3), (1)(2 3), and (1 2 3) (each cluster being shown in parentheses). For a 5 frame word template, there are 16 possible cluster paths to consider. In general, for a word template comprised of N frames, there are 2^(N-1) possible paths to consider. A word template comprised of 15 frames requires that 16,384 possible cluster paths be considered, with probably only one cluster formation optimally reducing the template data. The computational requirements of considering each of these possibilities are not practical in a real-time environment.
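The path count quoted above follows from the fact that each of the N-1 gaps between neighbouring frames either is or is not a cluster boundary. The brute-force enumeration below illustrates this; the function name `cluster_paths` is an assumption for this sketch and is not part of the patent.

```python
from itertools import product
from typing import Iterator, List


def cluster_paths(n_frames: int) -> Iterator[List[List[int]]]:
    """Yield every way to split frames 1..n_frames into contiguous clusters."""
    if n_frames < 1:
        return
    # Each gap between neighbouring frames is either a cluster boundary or not.
    for cuts in product([False, True], repeat=n_frames - 1):
        path: List[List[int]] = []
        cluster = [1]
        for frame, cut in zip(range(2, n_frames + 1), cuts):
            if cut:
                path.append(cluster)
                cluster = [frame]
            else:
                cluster.append(frame)
        path.append(cluster)
        yield path


# List the cluster paths for a 3 frame template, then count paths for larger ones.
for path in cluster_paths(3):
    print(path)
for n in (3, 5, 15):
    print(n, sum(1 for _ in cluster_paths(n)))
```

The counts printed for 3, 5 and 15 frames are 4, 16 and 16,384, which is why exhaustive search quickly becomes impractical.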

Another problem encountered when clustering in this manner pertains to matching an appropriate clustering method to the particular type of feature data representing the speech. Typically, filter bank information or linear predictive coefficient (LPC) information is used to represent the speech. Clustering a group of frames represented by filter bank information will not always produce the same distortion that LPC information would produce. Hence, minimal cluster combinations for one type of feature data may not be minimal for another type of feature data.

What is needed is a clustering method for word template data that can generate the optimal cluster path efficiently for any type of feature data and distance measure used.

OBJECTS AND SUMMARY OF THE INVENTION

Accordingly, it is an object of the present invention to provide a method of data reduction that reduces feature data such that upon completion of the reduction process there is no other possible reduction of the data that will result in greater data reduction while satisfying a distortion criterion.

It is another object of the present invention to provide a data reduction method that optimizes the required computation in finding the optimally reduced representative data set for the incoming speech.

It is a further object of the present invention to provide a method of data reduction that defines distortion incurred by data reduction given a distance measure for the feature data used to represent the speech.

It is yet a further object of the present invention to provide a method of data reduction that can be applied to infinite length frame sequences as well as to finite length frame sequences.

In summary, the present invention describes an optimal method and arrangement for reducing a sequence of initial frames into a reduced set of representative frames by combining the initial frames into a plurality of representative frames, the combining process including generating a distortion measure associated with each representative frame and comparing each distortion measure to a distortion threshold. From these representative frames, a set of mutually exclusive frames is determined to minimize the number of representative frames, whereby each representative frame in the set represents a unique set of contiguous initial frames and has an associated distortion measure which does not exceed the distortion threshold.
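The selection summarized above (the fewest contiguous, non-overlapping representative frames, each within the distortion threshold) can be sketched as a dynamic-programming search with traceback pointers. This is a minimal sketch under stated assumptions, not the patented implementation: it assumes a representative frame is the feature-wise average of its frames, that distortion is the peak absolute deviation from that average, and that ties in cluster count are broken by lower summed distortion. The names `reduce_frames`, `threshold`, and `max_size` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[float]


def average_frame(frames: Sequence[Frame]) -> List[float]:
    """Feature-wise average of the frames (assumed combining rule)."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def distortion(frames: Sequence[Frame]) -> float:
    """Peak deviation of any frame from the average (assumed distortion measure)."""
    rep = average_frame(frames)
    return max(max(abs(x - r) for x, r in zip(f, rep)) for f in frames)


def reduce_frames(frames: Sequence[Frame], threshold: float,
                  max_size: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, giving the fewest clusters."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    n = len(frames)
    inf = float("inf")
    # best[k] = (cluster count, summed distortion) of the best path covering frames[:k]
    best = [(inf, inf)] * (n + 1)
    back = [0] * (n + 1)  # traceback pointer: start index of the last cluster
    best[0] = (0, 0.0)
    for end in range(1, n + 1):
        lowest = 0 if max_size is None else max(0, end - max_size)
        for start in range(end - 1, lowest - 1, -1):
            d = distortion(frames[start:end])
            if d > threshold:
                continue  # this candidate cluster violates the threshold
            candidate = (best[start][0] + 1, best[start][1] + d)
            if candidate < best[end]:
                best[end], back[end] = candidate, start
    # Follow the pointers from the last frame back to the first.
    path, end = [], n
    while end > 0:
        path.append((back[end], end))
        end = back[end]
    return path[::-1]


frames = [[10, 4], [12, 4], [11, 5], [50, 9], [52, 9], [90, 1]]
print(reduce_frames(frames, threshold=2.0))  # [(0, 3), (3, 5), (5, 6)]
```

Because a path ending at each frame is built only from the best paths ending at earlier frames, the work grows polynomially with the number of frames rather than with the 2^(N-1) possible cluster paths.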

BRIEF DESCRIPTION OF THE DRAWINGS

Additional objects, features, and advantages in accordance with the present invention will be more clearly understood by reference to the following description taken in connection with the accompanying drawings, in the several figures of which like reference numerals identify like elements, and in which:

FIG. 1 is a general block diagram illustrating the technique of synthesizing speech from speech recognition templates according to the present invention;

FIG. 2 is a block diagram of a speech communications device having a user-interactive control system employing speech recognition and speech synthesis in accordance with the present invention;

FIG. 3 is a detailed block diagram of the preferred embodiment of the present invention illustrating a radio transceiver having a hands-free speech recognition/speech synthesis control system;

FIG. 4a is an expanded block diagram of the data reducer block 322 of FIG. 3;

FIG. 4b is a flowchart showing the sequence of steps performed by the energy normalization block 410 of FIG. 4a;

FIG. 4c is a detailed block diagram of the particular hardware configuration of the segmentation/compression block 420 of FIG. 4a;

FIG. 5a is a graphical representation of a spoken word segmented into frames for forming a cluster according to the present invention;

FIG. 5b is a diagram exemplifying output clusters being formed for a particular word template, according to the present invention;

FIG. 5c is a table showing the possible formations of an arbitrary partial cluster path according to the present invention;

FIGS. 5d and 5e show a flowchart illustrating a basic implementation of the data reduction process performed by the segmentation/compression block 420 of FIG. 4a;

FIG. 5f is a detailed flowchart of the traceback and output clusters block 582 of FIG. 5e, showing the formation of a data reduced word template from previously determined clusters;

FIG. 5g is a traceback pointer table illustrating a clustering path for 24 frames, according to the present invention, applicable to partial traceback;

FIG. 5h is a graphical representation of the traceback pointer table of FIG. 5g illustrated in the form of a frame connection tree;

FIG. 5i is a graphical representation of FIG. 5h showing the frame connection tree after three clusters have been output by tracing back to common frames in the tree;

FIGS. 6a and 6b comprise a flowchart showing the sequence of steps performed by the differential encoding block 430 of FIG. 4a;

FIG. 6c is a generalized memory map showing the particular data format of one frame of the template memory 160 of FIG. 3;

FIG. 7a is a graphical representation of frames clustered into average frames, each average frame represented by a state in a word model, in accordance with the present invention;

FIG. 7b is a detailed block diagram of the recognition processor 120 of FIG. 3, illustrating its relationship with the template memory 160;

FIG. 7c is a flowchart illustrating one embodiment of the sequence of steps required for word decoding according to the present invention;

FIGS. 7d and 7e comprise a flowchart illustrating one embodiment of the steps required for state decoding according to the present invention;

FIG. 8a is a detailed block diagram of the data expander block 346 of FIG. 3;

FIG. 8b is a flowchart showing the sequence of steps performed by the differential decoding block 802 of FIG. 8a;

FIG. 8c is a flowchart showing the sequence of steps performed by the energy denormalization block 804 of FIG. 8a;

FIG. 8d is a flowchart showing the sequence of steps performed by the frame repeating block 806 of FIG. 8a;

FIG. 9a is a detailed block diagram of the channel bank speech synthesizer 340 of FIG. 3;

FIG. 9b is an alternate embodiment of the modulator/bandpass filter configuration 980 of FIG. 9a;

FIG. 9c is a detailed block diagram of the preferred embodiment of the pitch pulse source 920 of FIG. 9a;

FIG. 9d is a graphic representation illustrating various waveforms of FIGS. 9a and 9c.

DESCRIPTION OF THE PREFERRED EMBODIMENT

1. System Configuration

Referring now to the accompanying drawings, FIG. 1 shows a general block diagram of user-interactive control system 100 of the present invention. Electronic device 150 may include any electronic apparatus that is sophisticated enough to warrant the incorporation of a speech recognition/speech synthesis control system. In the preferred embodiment, electronic device 150 represents a speech communications device such as a mobile radiotelephone.

User-spoken input speech is applied to microphone 105, which acts as an acoustic coupler providing an electrical input speech signal for the control system. Acoustic processor 110 performs acoustic feature extraction upon the input speech signal. Word features, defined as the amplitude/frequency parameters of each user-spoken input word, are thereby provided to speech recognition processor 120 and to training processor 170. Acoustic processor 110 may also include a signal conditioner, such as an analog-to-digital converter, to interface the input speech signal to the speech recognition control system. Acoustic processor 110 will be further described in conjunction with FIG. 3.

Training processor 170 manipulates this word feature information from acoustic processor 110 to provide word recognition templates to be stored in template memory 160. During the training procedure, the incoming word features are arranged into individual words by locating their endpoints. If the training procedure is designed to accommodate multiple training utterances for word feature consistency, then the multiple utterances may be averaged to form a single word template. Furthermore, since most speech recognition systems do not require all of the speech information to be stored as a template, some type of data reduction is often performed by training processor 170 to reduce the template memory requirements. The word templates are stored in template memory 160 for use by speech recognition processor 120 as well as by speech synthesis processor 140. The exact training procedure utilized by the preferred embodiment of the present invention may be found in the description accompanying FIG. 2.

In the recognition mode, speech recognition processor 120 compares the word feature information provided by acoustic processor 110 to the word recognition templates provided by template memory 160. If the acoustic features of the present word feature information derived from the user-spoken input speech sufficiently match the acoustic features of a particular prestored word template derived from the template memory, then recognition processor 120 provides device control data to device controller 130 indicative of the particular word recognized. A further discussion of an appropriate speech recognition apparatus, and of how the preferred embodiment incorporates data reduction into the training process, may be found in the description accompanying FIGS. 3 through 5.

Device controller 130 interfaces the entire control system to electronic device 150. Device controller 130 translates the device control data provided by recognition processor 120 into control signals adaptable for use by the particular electronic device. These control signals direct the device to perform specific operating functions as instructed by the user. (Device controller 130 may also perform additional supervisory functions related to other elements shown in FIG. 1.) An example of a device controller known in the art and suitable for use with the present invention is a microcomputer. Refer to FIG. 3 for further details of the hardware implementation.

Device controller 130 also provides device status data representing the operating status of electronic device 150. This data is applied to speech synthesis processor 140, along with word recognition templates from template memory 160. Synthesis processor 140 utilizes the status data to determine which word recognition template is to be synthesized into user-recognizable reply speech. Synthesis processor 140 may also include an internal reply memory, also controlled by the status data, to provide "canned" reply words to the user. In either case, the user is informed of the electronic device operating status when the speech reply signal is output via speaker 145.

Thus, FIG. 1 illustrates how the present invention provides a user-interactive control system utilizing speech recognition to control the operating parameters of an electronic device, and how a speech recognition template may be utilized to generate reply speech to the user indicative of the operating status of the device.

FIG. 2 illustrates in more detail the application of the user-interactive control system to a speech communications device comprising a part of any radio or landline voice communications system, such as, for example, a two-way radio system, a telephone system, an intercom system, etc. Acoustic processor 110, recognition processor 120, template memory 160, and device controller 130 are the same in structure and in operation as the corresponding blocks of FIG. 1. However, control system 200 illustrates the internal structure of speech communications device 210. Speech communication terminal 225 represents the main electronic network of device 210, such as, for example, a telephone terminal or a communications console. In this embodiment, microphone 205 and speaker 245 are incorporated into the speech communications device itself. A typical example of this microphone/speaker arrangement would be a telephone handset. Speech communications terminal 225 interfaces operating status information of the speech communications device to device controller 130. This operating status information may comprise functional status data of the terminal itself (e.g., channel data, service information, operating mode messages, etc.), user-feedback information of the speech recognition control system (e.g., directory contents, word recognition verification, operating mode status, etc.), or may include system status data pertaining to the communications link (e.g., loss-of-line, system busy, invalid access code, etc.).

In either the training mode or the recognition mode, the features of user-spoken input speech are extracted by acoustic processor 110. In the training mode, which is represented in FIG. 2 by position "A" of switch 215, the word feature information is applied to word averager 220 of training processor 170. As previously mentioned, if the system is designed to average multiple utterances together to form a single word template, the averaging is performed by word averager 220. Through the use of word averaging, the training processor can take into account the minor variances between two or more utterances of the same word, thereby producing a more reliable word template. Numerous word averaging techniques may be used. For example, one method would be to combine only the similar word features of all training utterances to produce a "best" set of features for the word template. Another technique may be to simply compare all training utterances to determine which one provides the "best" template. Still another word averaging technique is described by L. R. Rabiner and J. G. Wilpon in "A Simplified Robust Training Procedure for Speaker Trained, Isolated Word Recognition Systems", Journal of the Acoustical Society of America, vol. 68 (Nov. 1980), pp. 1271-76.
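As a rough illustration of word averaging, the sketch below takes the feature-wise mean of several training utterances. It is deliberately naive and is not one of the techniques cited above: it assumes every utterance already has the same number of frames and features, whereas real utterances would first need time alignment. The function name `average_utterances` is hypothetical.

```python
from typing import List, Sequence

Utterance = Sequence[Sequence[float]]  # frames, each a list of feature values


def average_utterances(utterances: Sequence[Utterance]) -> List[List[float]]:
    """Average corresponding features across utterances of equal length."""
    if not utterances:
        raise ValueError("at least one utterance is required")
    n_frames = len(utterances[0])
    if any(len(u) != n_frames for u in utterances):
        raise ValueError("this sketch assumes equal-length utterances")
    count = len(utterances)
    return [
        [sum(values) / count for values in zip(*frames_at_t)]
        for frames_at_t in zip(*utterances)  # the frames sharing one time index
    ]


# Two hypothetical utterances of the same word: 2 frames, 2 features each.
first = [[1.0, 2.0], [3.0, 4.0]]
second = [[3.0, 2.0], [5.0, 8.0]]
print(average_utterances([first, second]))  # [[2.0, 2.0], [4.0, 6.0]]
```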

Data reducer 230 then performs data reduction upon either the averaged word data from word averager 220 or upon the word feature signals directly from acoustic processor 110, depending upon the presence or absence of a word averager. In either case, the reduction process consists of segmenting this "raw" word feature data and combining the data in each segment. The storage requirements for the template are then further reduced by differential encoding of the segmented data to produce "reduced" word feature data. This specific data reduction technique of the present invention is fully described in conjunction with FIGS. 4 and 5. To summarize, data reducer 230 compresses the raw word data to minimize the template storage requirements and to reduce the speech recognition computation time.
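
The two reduction steps named above (segmenting and combining frames, then differential encoding) can be illustrated with a minimal sketch. It is not the procedure of FIGS. 4 and 5: it assumes a simple greedy pass that grows each segment of contiguous frames until a distortion threshold would be exceeded, assumes averaging as the combining operation, and assumes squared distance as the distortion measure. This passage also does not say which sequence of values is differentially encoded, so differential_encode is written for any list of integers. All function names are hypothetical.

```python
from typing import List, Tuple

Frame = List[float]


def mean_frame(frames: List[Frame]) -> Frame:
    """Combine frames by averaging each feature (assumed combining rule)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def distortion(frames: List[Frame], rep: Frame) -> float:
    """Largest squared distance from any frame to the representative frame."""
    return max(sum((x - r) ** 2 for x, r in zip(f, rep)) for f in frames)


def segment_frames(frames: List[Frame], threshold: float) -> List[Tuple[Frame, int]]:
    """Greedy segmentation of contiguous frames.

    Returns (representative frame, number of frames combined) per segment.
    Each segment's distortion does not exceed the threshold.
    """
    segments = []
    start = 0
    while start < len(frames):
        end = start + 1
        while end < len(frames):
            candidate = frames[start:end + 1]
            if distortion(candidate, mean_frame(candidate)) > threshold:
                break
            end += 1
        group = frames[start:end]
        segments.append((mean_frame(group), len(group)))
        start = end
    return segments


def differential_encode(values: List[int]) -> List[int]:
    """Keep the first value; store each later value as a difference
    from its predecessor, which tends to need fewer bits."""
    encoded = []
    previous = 0
    for index, value in enumerate(values):
        encoded.append(value if index == 0 else value - previous)
        previous = value
    return encoded
```

A greedy pass does not guarantee the smallest possible number of representative frames; it is used here only because it is short enough to read at a glance.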

The reduced word feature data provided by training processor 170 is stored as word recognition templates in template memory 160. In the recognition mode, which is illustrated by position "B" of switch 215, recognition processor 120 compares the incoming word feature signals to the word recognition templates. Upon recognition of a valid command word, recognition processor 120 may instruct device controller 130 to cause a corresponding speech communications device control function to be executed by speech communications terminal 225. Terminal 225 may respond to device controller 130 by sending operating status information back to controller 130 in the form of terminal status data. This data can be used by the control system to synthesize the appropriate speech reply signal to inform the user of the present device operating status. This sequence of events will be more clearly understood by referring to the subsequent example.

Synthesis processor 140 is comprised of speech synthesizer 240, data expander 250, and reply memory 260. A synthesis processor of this configuration is capable of generating "canned" replies to the user from a prestored vocabulary (stored in reply memory 260), as well as generating "template" responses from a user-generated vocabulary (stored in template memory 160). Speech synthesizer 240 and reply memory 260 are further described in conjunction with FIG. 3, and data expander 250 is fully described in the text accompanying FIG. 8a. In combination, the blocks of synthesis processor 140 generate a speech reply signal to speaker 245. Accordingly, FIG. 2 illustrates the technique of using a single template memory for both speech recognition and speech synthesis.

The simplified example of a "smart" telephone terminal employing voice-controlled dialing from a stored telephone number directory is now used to describe the operation of the control system of FIG. 2. Initially, an untrained speaker-dependent speech recognition system cannot recognize command words. Therefore, the user must manually prompt the device to begin the training procedure, perhaps by entering a particular code into the telephone keypad. Device controller 130 then directs switch 215 to enter the training mode (position "A"). Device controller 130 then instructs speech synthesizer 240 to respond with the predefined phrase TRAINING VOCABULARY ONE, which is a "canned" response obtained from reply memory 260. The user then begins to build a command word vocabulary by uttering command words, such as STORE or RECALL, into microphone 205. The features of the utterance are first extracted by acoustic processor 110, and then applied to either word averager 220 or data reducer 230. If the particular speech recognition system is designed to accept multiple utterances of the same word, word averager 220 produces a set of averaged word features representing the best representation of that particular word. If the system does not have word averaging capabilities, the single utterance word features (rather than the multiple utterance averaged word features) are applied to data reducer 230. The data reduction process removes unnecessary or duplicate feature data, compresses the remaining data, and provides template memory 160 with "reduced" word recognition templates. A similar procedure is followed for training the system to recognize digits.

Once the system is trained with the command word vocabulary, the user must continue the training procedure by entering telephone directory names and numbers. To accomplish this task, the user utters the previously-trained command word ENTER. Upon recognition of this utterance as a valid user command, device controller 130 instructs speech synthesizer 240 to reply with the "canned" phrase DIGITS PLEASE? stored in reply memory 260. Upon entering the appropriate telephone number digits (e.g., 555-1234), the user says TERMINATE and the system replies NAME PLEASE? to prompt user-entry of the corresponding directory name (e.g., SMITH). This user-interactive process continues until the telephone number directory is completely filled with the appropriate telephone names and digits.

To place a phone call, the user simply utters the command word RECALL. When the utterance is recognized as a valid user command by recognition processor 120, device controller 130 directs speech synthesizer 240 to generate the verbal reply NAME? via synthesizing information provided by reply memory 260. The user then responds by speaking the name in the directory index corresponding to the telephone number that he desires to dial (e.g. JONES). The word will be recognized as a valid directory entry if it corresponds to a predetermined name index stored in template memory 160. If valid, device controller 130 directs data expander 250 to obtain the appropriate reduced word recognition template from template memory 160 and perform the data expansion process for synthesis. Data expander 250 "unpacks" the reduced word feature data and restores the proper energy contour for an intelligible reply word. The expanded word template data is then fed to speech synthesizer 240. Using both the template data and the reply memory data, speech synthesizer 240 generates the phrase JONES . . . (from template memory 160 through data expander 250) . . . FIVE-FIVE-FIVE, SIX-SEVEN-EIGHT-NINE (from reply memory 260).
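
The way the spoken reply draws on two sources can be shown with a short sketch. It only models which memory supplies each word; it does not model data expansion or synthesis. The function name build_reply, the source labels, and the digit-to-word table (including the choice of ZERO for 0) are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed spoken form of each digit held in the prestored reply vocabulary.
DIGIT_WORDS = {
    "0": "ZERO", "1": "ONE", "2": "TWO", "3": "THREE", "4": "FOUR",
    "5": "FIVE", "6": "SIX", "7": "SEVEN", "8": "EIGHT", "9": "NINE",
}


def build_reply(name: str, number: str) -> List[Tuple[str, str]]:
    """Return the reply as (source, word) pairs.

    The directory name comes from the user-trained template memory
    (via the data expander); each digit comes from the prestored
    reply memory. Non-digit characters such as "-" are skipped.
    """
    reply = [("template_memory", name)]
    for character in number:
        if character in DIGIT_WORDS:
            reply.append(("reply_memory", DIGIT_WORDS[character]))
    return reply
```

For the example in the text, build_reply("JONES", "555-6789") yields the name from template memory followed by seven digit words from reply memory.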

The user then says the command word SEND which, when recognized by the control system, instructs device controller 130 to send telephone number dialing information to speech communications terminal 225. Terminal 225 outputs this dialing information via an appropriate communications link. When the telephone connection is made, speech communications terminal 225 interfaces microphone audio from microphone 205 to the appropriate transmit path, and receive audio from the appropriate receive audio path to speaker 245. If a proper telephone connection cannot be made, speech communications terminal 225 provides the appropriate communications link status information to device controller 130. Accordingly, device controller 130 instructs speech synthesizer 240 to generate the appropriate reply word corresponding to the status information provided, such as the reply word SYSTEM BUSY. In this manner, the user is informed of the communications link status, and user-interactive voice-controlled directory dialing is achieved.
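
The RECALL / name / SEND exchange described above behaves like a small state machine, which might be sketched as follows. The class name DialingDialog, the state names, the link_available callback standing in for terminal status data, and the choice to ignore unrecognized words silently are all assumptions for illustration; the text does not specify them.

```python
from typing import Callable, Dict, List, Optional


class DialingDialog:
    """Toy model of the voice-controlled directory dialing exchange."""

    def __init__(self, directory: Dict[str, str],
                 link_available: Callable[[], bool] = lambda: True) -> None:
        self.directory = dict(directory)      # name -> telephone number
        self.link_available = link_available  # stands in for link status data
        self.state = "IDLE"
        self.pending: Optional[str] = None    # number awaiting SEND
        self.dialed: List[str] = []           # numbers passed to the terminal

    def hear(self, word: str) -> str:
        """Process one recognized word and return the spoken reply ("" if none)."""
        if self.state == "IDLE" and word == "RECALL":
            self.state = "AWAIT_NAME"
            return "NAME?"
        if self.state == "AWAIT_NAME" and word in self.directory:
            self.pending = self.directory[word]
            self.state = "AWAIT_SEND"
            return f"{word} {self.pending}"   # name and number read back
        if self.state == "AWAIT_SEND" and word == "SEND":
            number, self.pending = self.pending, None
            self.state = "IDLE"
            if not self.link_available():
                return "SYSTEM BUSY"          # link status reported to user
            self.dialed.append(number)
            return ""
        return ""                             # assumed: other words are ignored
```

Keeping the link status behind a callback mirrors the separation in the text between the device controller, which decides what to say, and the terminal, which reports whether the connection could be made.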

The above operational description is merely one application of synthesizing speech from speech recognition templates according to the present invention. Numerous other applications of this novel technique to a speech communications device are contemplated, such as, for example, a communications console, a two-way radio, etc. In the preferred embodiment, the control system of the present invention is used with a mobile radiotelephone.

Although speech recognition and speech synthesis allow a vehicle operator to keep both eyes on the road, the conventional handset or hand-held microphone prohibits him from keeping both hands on the steering wheel or from executing proper manual (or automatic) transmission shifting. For this reason, the control system of the preferred embodiment incorporates a speakerphone to provide hands-free control of the speech communications device. The speakerphone performs the transmit/receive audio switching function, as well as the received/reply audio multiplexing function.

Referring now to FIG. 3, control system 300 utilizes the same acoustic processor block 110, training processor block 170, recognition processor block 120, template memory block 160, device controller block 130, and synthesis processor block 140 as the corresponding blocks of FIG. 2. However, microphone 302 and speaker 375 are not an integral part of the speech communications terminal. Instead, the input speech signal from microphone 302 is directed to radiotelephone 350 via speakerphone 360. Similarly, speakerphone 360 also controls the multiplexing of the synthesized audio from the control system and the receive audio from the communications link. A more detailed analysis of the switching/multiplexing configuration of the speakerphone will be described later. Additionally, the speech communications terminal is now illustrated in FIG. 3 as a radiotelephone having a transmitter and a receiver to provide the appropriate communications link via radio frequency (RF) channels. A detailed description of the radio blocks is also provided later.

Microphone 302, which is typically remotely-mounted at a distance from the user's mouth (e.g., on the automobile sun visor), acoustically couples the user's voice to control system 300. This speech signal is usually amplified by preamplifier 304 to provide input speech signal 305. This audio input is directly applied to acoustic processor 110, and is switched by speakerphone 360 before being applied to radiotelephone 350 via switched microphone audio line 315.

As previously mentioned, acoustic processor 110 extracts the features of the user-spoken input speech to provide word feature information to both training processor 170 and recognition processor 120. Acoustic processor 110 first converts the analog input speech into digital form by analog-to-digital (A/D) converter 310. This digital data is then applied t