Claims
What is claimed is:
1. In a speech processing system, wherein speech is represented as a
sequence of original frames, a method for reducing the sequence of
original frames into a reduced set of representative frames comprising the
steps of:
storing a plurality of original frames from the sequence;
combining said stored original frames into a plurality of representative
frames;
generating, for each representative frame, a distortion measure
corresponding to the distance between each said representative frame and
said original frames combined therein;
comparing each said distortion measure to a predetermined distortion
threshold; and
determining a set of a minimum number of said representative frames
representing said stored original frames and each representative frame
having a generated distortion measure less than said predetermined
distortion threshold.
2. The method of claim 1, wherein said set of representative frames
represents every original frame in the series.
3. The method of claim 1, further including the step of invalidating all
representative frames designated by original frames m through n, where
m<n, if said associated distortion measure from a previously determined
representative frame designated by original frames i through j, where
i≥m, j≤n and i<j, exceeds said distortion threshold by a
predetermined constant.
4. In a speech processing system, wherein speech is represented as a
sequence of original frames, a method for reducing the sequence of
original frames into a reduced set of representative frames comprising the
steps of:
forming cluster paths ending at each original frame in the sequence, said
frames in sequence designated m through n, where m<n, each said cluster
path composed of a series of combined original frames;
forming an additional representative frame by combining frames j through
n+1, wherein m<j<n and j is an integer designating a frame in the series,
said forming of an additional representative frame including the steps of:
generating, for said additional representative frame, a distortion measure
corresponding to the distance between said additional representative frame
and original frames combined therein and comparing said distortion measure
to a predetermined distortion threshold; and
appending said additional representative frame to said previously formed
cluster paths if said distortion measure does not exceed said distortion
threshold, whereby the resultant reduced set of representative frames is
comprised of said additional representative frame appended to said cluster
path formed at frame j-1.
5. The method of claim 1 or 4, wherein each representative frame includes
at least a predetermined minimum number of original frames.
6. The method of claim 1 or 4, wherein each representative frame includes
no more than a predetermined maximum number of original frames.
7. The method of claim 1 or 4, further including the step of recording the
number of original frames combined in each representative frame in the
set.
8. The method of claim 1 or 4, further including the step of recording said
distortion measure associated with each representative frame in the set.
9. The method of claim 1 or 4, wherein at least one said representative
frame in the set includes a single frame.
10. The method of claim 4, further including the step of invalidating at
least one said cluster path when another cluster path is determined to
have fewer representative frames.
11. The method of claim 1 or 4, further including the step of designating
one or more representative frames in the set as an output frame.
12. The method of claim 1 or 4, further including the step of connecting
said representative frames in the set with pointers.
13. The method of claim 1 or 4, including the step of generating a peak
distortion measure.
14. The method of claim 1 or 4, further including the step of determining a
convergence reference frame.
15. The method of claim 4, further including the step of comparing said
distortion measures associated with two cluster paths having the same
number of representative frames.
16. The method of claim 4, further including the step of determining a
distortion measure associated with the set of representative frames.
17. The method of claim 4, further including the step of selecting
representative frames from one end of said sequence to the other end of
said sequence.
18. In a speech processing system, wherein speech is represented as a
sequence of original frames, an arrangement for reducing the sequence of
original frames into a reduced set of representative frames comprising:
means for storing a plurality of original frames from the sequence;
means for combining said stored original frames into a plurality of
representative frames;
means for generating, for each representative frame, a distortion measure
corresponding to the distance between each said representative frame and
said original frames combined therein;
means for comparing each said distortion measure to a predetermined
distortion threshold; and
means for determining a set of a minimum number of said representative
frames representing said stored original frames, each representative frame
having a generated distortion measure less than said predetermined
distortion threshold.
19. The arrangement of claim 18, wherein said set of representative frames
represents every original frame in the series.
20. The arrangement of claim 18, further including means for invalidating
all representative frames designated by original frames m through n, where
m<n, if said associated distortion measure from a previously determined
representative frame designated by original frames i through j, where
i≥m, j≤n and i<j, exceeds said distortion threshold by a
predetermined constant.
21. In a speech processing system, wherein speech is represented as a
sequence of original frames, an arrangement for reducing the sequence of
original frames into a reduced set of representative frames comprising:
means for forming cluster paths ending at each original frame in the
sequence, said frames in sequence designated m through n, where m<n, each
said cluster path composed of a series of combined original frames;
means for forming an additional representative frame by combining frames j
through n+1, where m<j<n and j is an integer designating a frame in the
series, said means for forming of an additional representative frame
including:
means for generating, for said additional representative frame, a
distortion measure corresponding to the distance between said additional
representative frame and the original frames combined therein and means
for comparing said distortion measure to a predetermined distortion
threshold; and
means for appending said additional representative frame to said previously
formed cluster paths if said distortion measure does not exceed said
distortion threshold, whereby the resultant reduced set of representative
frames is comprised of said additional representative frame appended to
said cluster path formed at frame j-1.
22. The arrangement of claim 18 or 21, wherein each representative frame
includes at least a predetermined minimum number of original frames.
23. The arrangement of claim 18 or 21, wherein each representative frame
includes no more than a predetermined maximum number of original frames.
24. The arrangement of claim 18 or 21, further including means for
recording the number of original frames combined in each representative
frame in the set.
25. The arrangement of claim 18 or 21, further including means for
recording said distortion measure associated with each representative
frame in the set.
26. The arrangement of claim 18 or 21, wherein at least one said
representative frame in the set includes a single frame.
27. The arrangement of claim 21, further including means for invalidating
at least one said cluster path when another cluster path is determined to
have fewer representative frames.
28. The arrangement of claim 18 or 21, further including means for
designating one or more representative frames in the set as an output
frame.
29. The arrangement of claim 18 or 21, further including means for
connecting said representative frames in the set with pointers.
30. The arrangement of claim 18 or 21, including means for generating a
peak distortion measure.
31. The arrangement of claim 18 or 21, further including means for
determining a convergence reference frame.
32. The arrangement of claim 21, further including means for comparing said
distortion measures associated with two cluster paths having the same
number of representative frames.
33. The arrangement of claim 21, further including means for determining a
distortion measure associated with the set of representative frames.
34. The arrangement of claim 21, further including means for selecting
representative frames from one end of said sequence to the other end of
said sequence.
Description
BACKGROUND OF THE INVENTION
The present invention relates to the practice of generating word templates
and, more specifically, to the practice of reducing data representing word
templates in a speech recognition system.
In systems that require digital storage of an analog waveform, a
significant amount of memory must be allocated for an accurate
representation. In a speech recognition system, where word recognition
depends on such accuracy, storing speech digitally requires an excessive
amount of memory. This is especially true for speech recognition systems
requiring large vocabularies. Each word in the vocabulary is typically
represented by a word template. Each word template includes frames,
segmented in equal time intervals, representing a spoken word. To
practically implement a large vocabulary into a speech recognition system,
two problems must be overcome.
The first problem is the extensive memory which is required to digitally
store the vocabulary. Memory is expensive in cost and in circuit board
real estate.
The second problem is the computation time required to process this
representative data. In general, the computation time increases linearly
with the amount of memory required for the template data. In systems
utilizing large vocabularies, these two problems are an enormous burden
for practical operation of a speech recognition system in real-time.
Accordingly, the need to reduce the required template data is well
recognized in the field of speech recognition.
Reduction of template data can be applied to sounds within a word template
which are acoustically similar. Speech is typically time segmented in
equal intervals. Each segment is referred to as a frame. For example,
words which are spoken slowly often have frames of speech which are merely
a long continuation of the same sound. Since frames having acoustically
similar sounds do not need to be represented repetitively, there has been
discussion of combining these frames into a representative frame.
Combining frames in this manner is referred to as clustering.
When clustering any number of word template frames, the resultant frame is
somewhat distorted with respect to the original frames due to slight
variations of the representative data in each frame. Typically, when two
or more frames are measured to be acoustically similar, clustering the
frames is not expected to produce an excessive distortion. Techniques for
determining an accurate similarity measure between frames are used to
determine whether two or more frames should be clustered.
Similarity of frame information is usually measured using a distance
calculation, such as the Hamming or Chebyshev calculation, depending on
the type of representative data. Two sequential frames from a word
template can be clustered into a single frame if the `distance` between
them is less than a predetermined distance. By clustering frames which
have a small distance calculated between them, the data representing the
speech can be reduced.
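The pairwise test described above can be sketched as follows. This is a
minimal illustration only, assuming each frame is a list of numeric
features, assuming the Chebyshev distance is the chosen calculation, and
assuming frames are combined by averaging each feature. The names
chebyshev_distance, can_cluster, and average_frame are hypothetical and do
not come from the original text.

```python
from typing import List, Sequence


def chebyshev_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute difference between corresponding features."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and of equal length")
    return max(abs(x - y) for x, y in zip(a, b))


def can_cluster(a: Sequence[float], b: Sequence[float],
                max_distance: float) -> bool:
    """True when the distance between two sequential frames is less
    than the predetermined distance."""
    return chebyshev_distance(a, b) < max_distance


def average_frame(frames: Sequence[Sequence[float]]) -> List[float]:
    """Combine frames by averaging each feature (an assumed rule)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]
```

With a predetermined distance of 1.0, for instance, frames [1.0, 2.0] and
[1.5, 2.0] would be clustered, while frames [1.0, 2.0] and [3.0, 2.0]
would not.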
However, clustering frames in this manner is a problem when the quantity of
frames in the word template is large. To `optimally` reduce the word
template, a representative word template must be generated which has the
fewest representative frames while satisfying a distortion
criterion for each representative frame. Typically, this requires testing
every possible clustering of frames in the word template. The clusters
must be selected such that no other sequence of clusters will result in
fewer clusters meeting the distortion criterion. The sequence of clusters
is hereinafter referred to as a cluster path for the word template. The
cluster path which results in the least distortion and the fewest number
of clusters is the optimal cluster path. For a word template with a large
number of frames, the search for the optimal cluster path results in an
excessive amount of computation. For example, consider a word template
comprised of 3 frames. There are a total of 4 possible cluster paths to
consider: [1][2][3], [1 2][3], [1][2 3], [1 2 3] (each cluster being
bracketed). For a 5 frame word template, there are 16 possible cluster
paths to consider. In general, for a word template comprised of N frames,
there are 2^(N-1) possible paths to consider. A word template comprised of 15
frames requires that 16,384 possible cluster paths be considered, with
probably only one cluster formation optimally reducing the template data.
The computation required to consider each of these possibilities is
not practical in a real-time environment.
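The path counts quoted above can be checked by brute force. The following
sketch is illustrative only: it treats each of the N-1 gaps between
neighbouring frames as either a cluster boundary or not, which yields
2^(N-1) cluster paths. The function name cluster_paths is hypothetical.

```python
from itertools import product
from typing import Iterator, List, Tuple


def cluster_paths(n_frames: int) -> Iterator[List[Tuple[int, ...]]]:
    """Yield every split of frames 1..n_frames into contiguous clusters."""
    if n_frames < 1:
        raise ValueError("need at least one frame")
    frames = list(range(1, n_frames + 1))
    # Each gap between neighbouring frames is a boundary (True) or not.
    for cuts in product((False, True), repeat=n_frames - 1):
        path = []
        current = [frames[0]]
        for frame, cut in zip(frames[1:], cuts):
            if cut:
                path.append(tuple(current))
                current = [frame]
            else:
                current.append(frame)
        path.append(tuple(current))
        yield path


if __name__ == "__main__":
    for n in (3, 5, 15):
        print(n, sum(1 for _ in cluster_paths(n)))
```

Exhaustive enumeration of this kind doubles in cost with every added
frame, which is the growth the passage above identifies as impractical.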
Another problem encountered when clustering in this manner pertains to
matching an appropriate clustering method to the particular type of
feature data representing the speech. Typically, filter bank information
or linear predictive coefficient (LPC) information is used to represent
the speech. Clustering a group of frames represented by filter bank
information will not always produce the same distortion that LPC
information would produce. Hence, minimal cluster combinations for one
type of feature data may not be minimal for another type of feature data.
What is needed is a clustering method for word template data that can
generate the optimal cluster path efficiently for any type of feature data
and distance measure used.
OBJECTS AND SUMMARY OF THE INVENTION
Accordingly, it is an object of the present invention to provide a method
of data reduction that reduces feature data such that upon completion of
the reduction process there is no other possible reduction of the data
that will result in greater data reduction while satisfying a distortion
criterion.
It is another object of the present invention to provide a data reduction
method that optimizes the required computation in finding the optimally
reduced representative data set for the incoming speech.
It is a further object of the present invention to provide a method of data
reduction that defines distortion incurred by data reduction given a
distance measure for the feature data used to represent the speech.
It is yet a further object of the present invention to provide a method of
data reduction that can be applied to infinite length frame sequences as
well as to finite length frame sequences.
In summary, the present invention describes an optimal method and
arrangement for reducing a sequence of initial frames into a reduced set
of representative frames by combining the initial frames into a plurality
of representative frames, the combining process including generating a
distortion measure associated with each representative frame and comparing
each distortion measure to a distortion threshold. From these
representative frames, a set of mutually exclusive frames is determined to
minimize the number of representative frames, whereby each representative
frame in the set represents a unique set of contiguous initial frames and
has an associated distortion measure which does not exceed the distortion
threshold.
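One way to picture the search summarized above is a dynamic program that
keeps, for every frame position, the best cluster path ending there. The
sketch below is not taken from the specification and rests on stated
assumptions: frames are lists of numbers, a representative frame is the
feature-wise average of its members, the distortion measure is the
largest Chebyshev distance from that average to any member, and ties
between paths with the same number of clusters are broken by the lower
peak distortion. The names reduce_frames, distortion, and
max_cluster_size are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def average_frame(frames: Sequence[Sequence[float]]) -> List[float]:
    """Feature-wise average of the frames (assumed combining rule)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def distortion(frames: Sequence[Sequence[float]]) -> float:
    """Peak Chebyshev distance from the average frame to its members
    (an assumed distortion measure)."""
    rep = average_frame(frames)
    return max(max(abs(r - x) for r, x in zip(rep, f)) for f in frames)


def reduce_frames(frames: Sequence[Sequence[float]], threshold: float,
                  max_cluster_size: Optional[int] = None
                  ) -> List[Tuple[int, int]]:
    """Return (first, last) index pairs of the fewest contiguous clusters
    whose distortion does not exceed the threshold."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    n = len(frames)
    # best[k]: (cluster count, peak distortion) for the path over frames[0:k]
    best: List[Optional[Tuple[int, float]]] = [None] * (n + 1)
    # back[k]: index of the first frame in the last cluster of that path
    back: List[Optional[int]] = [None] * (n + 1)
    best[0] = (0, 0.0)
    for end in range(1, n + 1):
        lowest = 0 if max_cluster_size is None else max(0, end - max_cluster_size)
        for start in range(end - 1, lowest - 1, -1):
            if best[start] is None:
                continue
            d = distortion(frames[start:end])
            if d > threshold:
                continue  # this candidate cluster is too distorted
            candidate = (best[start][0] + 1, max(best[start][1], d))
            if best[end] is None or candidate < best[end]:
                best[end] = candidate
                back[end] = start
    # Trace back from the last frame to recover the chosen clusters.
    clusters = []
    k = n
    while k > 0:
        clusters.append((back[k], k - 1))
        k = back[k]
    clusters.reverse()
    return clusters
```

Because a single frame always has zero distortion under this measure, a
valid path exists for any non-negative threshold. The work grows with the
square of the number of frames rather than exponentially, and supplying
max_cluster_size bounds it further.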
BRIEF DESCRIPTION OF THE DRAWINGS
Additional objects, features, and advantages in accordance with the present
invention will be more clearly understood by reference to the following
description taken in connection with the accompanying drawings, in the
several figures of which like reference numerals identify like elements,
and in which:
FIG. 1 is a general block diagram illustrating the technique of
synthesizing speech from speech recognition templates according to the
present invention;
FIG. 2 is a block diagram of a speech communications device having a
user-interactive control system employing speech recognition and speech
synthesis in accordance with the present invention;
FIG. 3 is a detailed block diagram of the preferred embodiment of the
present invention illustrating a radio transceiver having a hands-free
speech recognition/speech synthesis control system;
FIG. 4a is an expanded block diagram of the data reducer block 322 of FIG.
3;
FIG. 4b is a flowchart showing the sequence of steps performed by the
energy normalization block 410 of FIG. 4a;
FIG. 4c is a detailed block diagram of the particular hardware
configuration of the segmentation/compression block 420 of FIG. 4a;
FIG. 5a is a graphical representation of a spoken word segmented into
frames for forming a cluster according to the present invention;
FIG. 5b is a diagram exemplifying output clusters being formed for a
particular word template, according to the present invention;
FIG. 5c is a table showing the possible formations of an arbitrary partial
cluster path according to the present invention;
FIGS. 5d and 5e show a flowchart illustrating a basic implementation of the
data reduction process performed by the segmentation/compression block 420
of FIG. 4a;
FIG. 5f is a detailed flowchart of the traceback and output clusters block
582 of FIG. 5e, showing the formation of a data reduced word template from
previously determined clusters;
FIG. 5g is a traceback pointer table illustrating a clustering path for 24
frames, according to the present invention, applicable to partial
traceback;
FIG. 5h is a graphical representation of the traceback pointer table of
FIG. 5g illustrated in the form of a frame connection tree;
FIG. 5i is a graphical representation of FIG. 5h showing the frame
connection tree after three clusters have been output by tracing back to
common frames in the tree;
FIGS. 6a and 6b comprise a flowchart showing the sequence of steps
performed by the differential encoding block 430 of FIG. 4a;
FIG. 6c is a generalized memory map showing the particular data format of
one frame of the template memory 160 of FIG. 3;
FIG. 7a is a graphical representation of frames clustered into average
frames, each average frame represented by a state in a word model, in
accordance with the present invention;
FIG. 7b is a detailed block diagram of the recognition processor 120 of
FIG. 3, illustrating its relationship with the template memory 160;
FIG. 7c is a flowchart illustrating one embodiment of the sequence of steps
required for word decoding according to the present invention;
FIGS. 7d and 7e comprise a flowchart illustrating one embodiment of the
steps required for state decoding according to the present invention;
FIG. 8a is a detailed block diagram of the data expander block 346 of FIG.
3;
FIG. 8b is a flowchart showing the sequence of steps performed by the
differential decoding block 802 of FIG. 8a;
FIG. 8c is a flowchart showing the sequence of steps performed by the
energy denormalization block 804 of FIG. 8a;
FIG. 8d is a flowchart showing the sequence of steps performed by the frame
repeating block 806 of FIG. 8a;
FIG. 9a is a detailed block diagram of the channel bank speech synthesizer
340 of FIG. 3;
FIG. 9b is an alternate embodiment of the modulator/bandpass filter
configuration 980 of FIG. 9a;
FIG. 9c is a detailed block diagram of the preferred embodiment of the
pitch pulse source 920 of FIG. 9a;
FIG. 9d is a graphic representation illustrating various waveforms of FIGS.
9a and 9c.
DESCRIPTION OF THE PREFERRED EMBODIMENT
1. System Configuration
Referring now to the accompanying drawings, FIG. 1 shows a general block
diagram of user-interactive control system 100 of the present invention.
Electronic device 150 may include any electronic apparatus that is
sophisticated enough to warrant the incorporation of a speech
recognition/speech synthesis control system. In the preferred embodiment,
electronic device 150 represents a speech communications device such as a
mobile radiotelephone.
User-spoken input speech is applied to microphone 105, which acts as an
acoustic coupler providing an electrical input speech signal for the
control system. Acoustic processor 110 performs acoustic feature
extraction upon the input speech signal. Word features, defined as the
amplitude/frequency parameters of each user-spoken input word, are thereby
provided to speech recognition processor 120 and to training processor
170. Acoustic processor 110 may also include a signal conditioner, such as
an analog-to-digital converter, to interface the input speech signal to
the speech recognition control system. Acoustic processor 110 will be
further described in conjunction with FIG. 3.
Training processor 170 manipulates this word feature information from
acoustic processor 110 to provide word recognition templates to be stored
in template memory 160. During the training procedure, the incoming word
features are arranged into individual words by locating their endpoints.
If the training procedure is designed to accommodate multiple training
utterances for word feature consistency, then the multiple utterances may
be averaged to form a single word template. Furthermore, since most speech
recognition systems do not require all of the speech information to be
stored as a template, some type of data reduction is often performed by
training processor 170 to reduce the template memory requirements. The
word templates are stored in template memory 160 for use by speech
recognition processor 120 as well as by speech synthesis processor 140.
The exact training procedure utilized by the preferred embodiment of the
present invention may be found in the description accompanying FIG. 2.
In the recognition mode, speech recognition processor 120 compares the word
feature information provided by acoustic processor 110 to the word
recognition templates provided by template memory 160. If the acoustic
features of the present word feature information derived from the
user-spoken input speech sufficiently match the acoustic features of a
particular prestored word template derived from the template memory, then
recognition processor 120 provides device control data to device
controller 130 indicative of the particular word recognized. A further
discussion of an appropriate speech recognition apparatus, and of how the
preferred embodiment incorporates data reduction into the training process,
may be found in the description accompanying FIGS. 3 through 5.
Device controller 130 interfaces the entire control system to electronic
device 150. Device controller 130 translates the device control data
provided by recognition processor 120 into control signals adaptable for
use by the particular electronic device. These control signals direct the
device to perform specific operating functions as instructed by the user.
(Device controller 130 may also perform additional supervisory functions
related to other elements shown in FIG. 1.) An example of a device
controller known in the art and suitable for use with the present
invention is a microcomputer. Refer to FIG. 3 for further details of the
hardware implementation.
Device controller 130 also provides device status data representing the
operating status of electronic device 150. This data is applied to speech
synthesis processor 140, along with word recognition templates from
template memory 160. Synthesis processor 140 utilizes the status data to
determine which word recognition template is to be synthesized into
user-recognizable reply speech. Synthesis processor 140 may also include
an internal reply memory, also controlled by the status data, to provide
"canned" reply words to the user. In either case, the user is informed of
the electronic device operating status when the speech reply signal is
output via speaker 145.
Thus, FIG. 1 illustrates how the present invention provides a
user-interactive control system utilizing speech recognition to control
the operating parameters of an electronic device, and how a speech
recognition template may be utilized to generate reply speech to the user
indicative of the operating status of the device.
FIG. 2 illustrates in more detail the application of the user-interactive
control system to a speech communications device comprising a part of any
radio or landline voice communications system, such as, for example, a
two-way radio system, a telephone system, an intercom system, etc.
Acoustic processor 110, recognition processor 120, template memory 160,
and device controller 130 are the same in structure and in operation as
the corresponding blocks of FIG. 1. However, control system 200
illustrates the internal structure of speech communications device 210.
Speech communication terminal 225 represents the main electronic network
of device 210, such as, for example, a telephone terminal or a
communications console. In this embodiment, microphone 205 and speaker 245
are incorporated into the speech communications device itself. A typical
example of this microphone/speaker arrangement would be a telephone
handset. Speech communications terminal 225 interfaces operating status
information of the speech communications device to device controller 130.
This operating status information may comprise functional status data of
the terminal itself (e.g., channel data, service information, operating
mode messages, etc.), user-feedback information of the speech recognition
control system (e.g., directory contents, word recognition verification,
operating mode status, etc.), or may include system status data pertaining
to the communications link (e.g., loss-of-line, system busy, invalid
access code, etc.).
In either the training mode or the recognition mode, the features of
user-spoken input speech are extracted by acoustic processor 110. In the
training mode, which is represented in FIG. 2 by position "A" of switch
215, the word feature information is applied to word averager 220 of
training processor 170. As previously mentioned, if the system is designed
to average multiple utterances together to form a single word template,
the averaging is performed by word averager 220. Through the use of word
averaging, the training processor can take into account the minor
variances between two or more utterances of the same word, thereby
producing a more reliable word template. Numerous word averaging
techniques may be used. For example, one method would be to combine only
the similar word features of all training utterances to produce a "best"
set of features for the word template. Another technique may be to simply
compare all training utterances to determine which one provides the "best"
template. Still another word averaging technique is described by L. R.
Rabiner and J. G. Wilpon in "A Simplified Robust Training Procedure for
Speaker Trained, Isolated Word Recognition Systems", Journal of the
Acoustical Society of America, vol. 68 (Nov. 1980), pp. 1271-76.
Data reducer 230 then performs data reduction upon either the averaged word
data from word averager 220 or upon the word feature signals directly from
acoustic processor 110, depending upon the presence or absence of a word
averager. In either case, the reduction process consists of segmenting
this "raw" word feature data and combining the data in each segment. The
storage requirements for the template are then further reduced by
differential encoding of the segmented data to produce "reduced" word
feature data. This specific data reduction technique of the present
invention is fully described in conjunction with FIGS. 4 and 5. To
summarize, data reducer 230 compresses the raw word data to minimize the
template storage requirements and to reduce the speech recognition
computation time.
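As a generic illustration of why differential encoding reduces storage,
the sketch below keeps the first value of a feature list and stores each
later value as its difference from the preceding one. It is an assumption
for illustration only: the text above does not say whether differences
are taken between features within a frame or between frames, nor how the
results are packed. The function names are hypothetical.

```python
from typing import List, Sequence


def differential_encode(values: Sequence[int]) -> List[int]:
    """Keep the first value, then store each value as the difference
    from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def differential_decode(deltas: Sequence[int]) -> List[int]:
    """Invert differential_encode by accumulating the differences."""
    out: List[int] = []
    total = 0
    for index, delta in enumerate(deltas):
        total = delta if index == 0 else total + delta
        out.append(total)
    return out
```

When neighbouring values are similar, the differences are small and can be
stored in fewer bits than the values themselves.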
The reduced word feature data provided by training processor 170 is stored
as word recognition templates in template memory 160. In the recognition
mode, which is illustrated by position "B" of switch 215, recognition
processor 120 compares the incoming word feature signals to the word
recognition templates. Upon recognition of a valid command word,
recognition processor 120 may instruct device controller 130 to cause a
corresponding speech communications device control function to be executed
by speech communications terminal 225. Terminal 225 may respond to device
controller 130 by sending operating status information back to controller
130 in the form of terminal status data. This data can be used by the
control system to synthesize the appropriate speech reply signal to inform
the user of the present device operating status. This sequence of events
will be more clearly understood by referring to the subsequent example.
Synthesis processor 140 is comprised of speech synthesizer 240, data
expander 250, and reply memory 260. A synthesis processor of this
configuration is capable of generating "canned" replies to the user from a
prestored vocabulary (stored in reply memory 260), as well as generating
"template" responses from a user-generated vocabulary (stored in template
memory 160). Speech synthesizer 240 and reply memory 260 are further
described in conjunction with FIG. 3, and data expander 250 is fully
described in the text accompanying FIG. 8a. In combination, the blocks of
synthesis processor 140 generate a speech reply signal to speaker 245.
Accordingly, FIG. 2 illustrates the technique of using a single template
memory for both speech recognition and speech synthesis.
The simplified example of a "smart" telephone terminal employing
voice-controlled dialing from a stored telephone number directory is now
used to describe the operation of the control system of FIG. 2. Initially,
an untrained speaker-dependent speech recognition system cannot recognize
command words. Therefore, the user must manually prompt the device to
begin the training procedure, perhaps by entering a particular code into
the telephone keypad. Device controller 130 then directs switch 215 to
enter the training mode (position "A"). Device controller 130 then
instructs speech synthesizer 240 to respond with the predefined phrase
TRAINING VOCABULARY ONE, which is a "canned" response obtained from reply
memory 260. The user then begins to build a command word vocabulary by
uttering command words, such as STORE or RECALL, into microphone 205. The
features of the utterance are first extracted by acoustic processor 110,
and then applied to either word averager 220 or data reducer 230. If the
particular speech recognition system is designed to accept multiple
utterances of the same word, word averager 220 produces a set of averaged
word features representing the best representation of that particular
word. If the system does not have word averaging capabilities, the single
utterance word features (rather than the multiple utterance averaged word
features) are applied to data reducer 230. The data reduction process
removes unnecessary or duplicate feature data, compresses the remaining
data, and provides template memory 160 with "reduced" word recognition
templates. A similar procedure is followed for training the system to
recognize digits.
Once the system is trained with the command word vocabulary, the user must
continue the training procedure by entering telephone directory names and
numbers. To accomplish this task, the user utters the previously-trained
command word ENTER. Upon recognition of this utterance as a valid user
command, device controller 130 instructs speech synthesizer 240 to reply
with the "canned" phrase DIGITS PLEASE? stored in reply memory 260. Upon
entering the appropriate telephone number digits (e.g., 555-1234), the
user says TERMINATE and the system replies NAME PLEASE? to prompt
user-entry of the corresponding directory name (e.g., SMITH). This
user-interactive process continues until the telephone number directory is
completely filled with the appropriate telephone names and digits.
To place a phone call, the user simply utters the command word RECALL. When
the utterance is recognized as a valid user command by recognition
processor 120, device controller 130 directs speech synthesizer 240 to
generate the verbal reply NAME? via synthesizing information provided by
reply memory 260. The user then responds by speaking the name in the
directory index corresponding to the telephone number that he desires to
dial (e.g., JONES). The word will be recognized as a valid directory entry
if it corresponds to a predetermined name index stored in template memory
160. If valid, device controller 130 directs data expander 250 to obtain
the appropriate reduced word recognition template from template memory 160
and perform the data expansion process for synthesis. Data expander 250
"unpacks" the reduced word feature data and restores the proper energy
contour for an intelligible reply word. The expanded word template data is
then fed to speech synthesizer 240. Using both the template data and the
reply memory data, speech synthesizer 240 generates the phrase JONES . . .
(from template memory 160 through data expander 250) . . . FIVE-FIVE-FIVE,
SIX-SEVEN-EIGHT-NINE (from reply memory 260).
The user then says the command word SEND which, when recognized by the
control system, instructs device controller 130 to send telephone number
dialing information to speech communications terminal 225. Terminal 225
outputs this dialing information via an appropriate communications link.
When the telephone connection is made, speech communications terminal 225
interfaces microphone audio from microphone 205 to the appropriate
transmit path, and receive audio from the appropriate receive audio path
to speaker 245. If a proper telephone connection cannot be made, terminal
controller 225 provides the appropriate communications link status
information to device controller 130. Accordingly, device controller 130
instructs speech synthesizer 240 to generate the appropriate reply word
corresponding to the status information provided, such as the reply word
SYSTEM BUSY. In this manner, the user is informed of the communications
link status, and user-interactive voice-controlled directory dialing is
achieved.
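The RECALL and SEND exchange described above can be sketched as a small state machine. This is an illustrative sketch only: the class DialingDialog, its state names, and the dial callback are hypothetical, and recognized words are assumed to arrive as plain strings. Re-prompting with NAME? when the spoken name is not in the directory is also an assumption, since the text does not state what happens in that case.

```python
from typing import Callable, Dict, Optional


class DialingDialog:
    """Hypothetical model of the device controller's recall/send dialog."""

    def __init__(self, directory: Dict[str, str],
                 dial: Callable[[str], bool]) -> None:
        self.directory = dict(directory)   # name index -> telephone digits
        self.dial = dial                   # returns True if the link is made
        self.state = "IDLE"
        self.selected: Optional[str] = None

    def hear(self, word: str) -> str:
        """Consume one recognized word; return the reply to synthesize."""
        if self.state == "IDLE" and word == "RECALL":
            self.state = "AWAIT_NAME"
            return "NAME?"
        if self.state == "AWAIT_NAME":
            if word in self.directory:
                self.selected = word
                self.state = "AWAIT_SEND"
                # Name comes from the word template, digits from reply memory.
                return f"{word} ... {self.directory[word]}"
            return "NAME?"  # assumption: re-prompt on an unknown name
        if self.state == "AWAIT_SEND" and word == "SEND":
            digits = self.directory[self.selected]
            self.state = "IDLE"
            self.selected = None
            # Report link status back to the user if the call fails.
            return "" if self.dial(digits) else "SYSTEM BUSY"
        return ""  # ignore words that are not valid in the current state


# Usage: a link that always fails yields the SYSTEM BUSY reply.
dialog = DialingDialog({"JONES": "555-6789"}, dial=lambda digits: False)
replies = [dialog.hear(w) for w in ("RECALL", "JONES", "SEND")]
```

Keeping the link status behind a callback reflects the separation in the text between the device controller, which chooses the reply, and the terminal, which reports whether the connection was made.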
The above operational description is merely one application of synthesizing
speech from speech recognition templates according to the present
invention. Numerous other applications of this novel technique to a speech
communications device are contemplated, such as, for example, a
communications console, a two-way radio, etc. In the preferred embodiment,
the control system of the present invention is used with a mobile
radiotelephone.
Although speech recognition and speech synthesis allow a vehicle operator
to keep both eyes on the road, the conventional handset or hand-held
microphone prohibits him from keeping both hands on the steering wheel or
from executing proper manual (or automatic) transmission shifting. For
this reason, the control system of the preferred embodiment incorporates a
speakerphone to provide hands-free control of the speech communications
device. The speakerphone performs the transmit/receive audio switching
function, as well as the received/reply audio multiplexing function.
Referring now to FIG. 3, control system 300 utilizes the same acoustic
processor block 110, training processor block 170, recognition processor
block 120, template memory block 160, device controller block 130, and
synthesis processor block 140 as the corresponding blocks of FIG. 2.
However, microphone 302 and speaker 375 are not an integral part of the
speech communications terminal. Instead, the input speech signal from
microphone 302 is directed to radiotelephone 350 via speakerphone 360.
Similarly, speakerphone 360 also controls the multiplexing of the
synthesized audio from the control system and the receive audio from the
communications link. The switching/multiplexing configuration of the
speakerphone is described in more detail later.
Additionally, the speech communications terminal is now illustrated in
FIG. 3 as a radiotelephone having a transmitter and a receiver to provide
the appropriate communications link via radio frequency (RF) channels. A
detailed description of the radio blocks is also provided later.
Microphone 302, which is typically remotely-mounted at a distance from the
user's mouth (e.g., on the automobile sun visor), acoustically couples the
user's voice to control system 300. This speech signal is usually
amplified by preamplifier 304 to provide input speech signal 305. This
audio input is directly applied to acoustic processor 110, and is switched
by speakerphone 360 before being applied to radiotelephone 350 via
switched microphone audio line 315.
As previously mentioned, acoustic processor 110 extracts the features of
the user-spoken input speech to provide word feature information to both
training processor 170 and recognition processor 120. Acoustic processor
110 first converts the analog input speech into digital form by
analog-to-digital (A/D) converter 310. This digital data is then applied
t | | |