BACKGROUND OF THE INVENTION
1. Field of the Invention.
The present invention relates, in general, to voice recognition, and, more particularly, to systems, software and methods for performing voice and speech recognition over a distributed network.
2. Relevant Background
Voice and speech recognition systems are increasingly common interfaces for obtaining user input into computer systems. Speech recognition is used to provide enhanced services such as interactive voice response (IVR), automated phone attendants,
voice mail, fax mail, and other applications. More sophisticated speech recognition systems are used for speech-to-text conversion systems used for dictation and transcription.
Voice and speech recognition systems are characterized by, among other things, their recognition accuracy, speed and vocabulary size. High speed, accurate, large vocabulary systems tend to be complex and so require significant computing
resources to implement. Moreover, such systems have increased training demands to develop accurate models of users' speech patterns. In applications where computing resources are limited or the ability to train to a particular user's speech patterns is
limited, speech recognition products tend to be slow and/or inaccurate. Currently, speech-recognition-enabled software applications must often choose between complex but accurate solutions and simple but less accurate solutions. In many
applications, however, the impracticality of meaningful training dictates that the application can only implement less accurate techniques.
Voice recognition is of two basic types, speaker-dependent and speaker-independent. A speaker dependent system operates in environments where the system has relatively frequent contact with each speaker, where sizable vocabularies are involved,
and where the cost of recognition errors is high. These systems are usually easier to develop, cheaper to buy and more accurate, but not as flexible as speaker-adaptive or speaker-independent systems. In a speaker-dependent system, a user trains the
system by, for example, providing speech samples and creating a correlation between the samples and text of what was provided, usually with some manual effort on the part of the speaker. Such systems often use a generic engine coupled with substantial
data files, called voice models, that characterize a particular speaker for which the system has been trained. The training process can involve significant effort to obtain high recognition rates. Moreover, the voice model files are tightly coupled to
the recognition software so that it is difficult to port the training investment to other hardware/software platforms.
A speaker-independent system operates for any speaker of a particular type (e.g., American English). These systems are the most difficult and most expensive to develop, and their accuracy is lower than that of speaker-dependent systems. However, they are highly
useful in a wide variety of applications where many users must use the system such as answering services, interactive voice response (IVR) systems, call processing centers, data entry and the like. Such applications sacrifice the accuracy of
speaker-dependent systems for the flexibility of enabling a heterogeneous group of speakers to use the system. Such applications are characterized in that high recognition rates are desirable, but the cost of recognition failure is relatively low.
A middle ground is sometimes defined as a speaker adaptive system. A speaker adaptive system dynamically adapts its operation to the characteristics of new speakers. These systems are more akin to speaker-dependent models, but allow the system
to be trained over time. Adaptive systems can improve their vocabulary over time and result in complex, but accurate speech models. Such systems still require significant training effort, however. As in speaker-dependent systems, the complex speech
models cannot be readily ported to other systems.
Training methods tend to be very product specific. Moreover, the data structures in which the relationships between a user's speech and text are correlated tend to be product specific. Hence, the significant training effort applied to a first
speech recognition program may not be reusable for any other program or system. In some cases, speakers must re-train systems between version updates of the same program. Temporary or permanent changes to a user's voice patterns affect performance and
may require retraining. This significant training burden and lack of portability between products has worked against wide scale adoption of speech recognition systems.
Moreover, even where a user has trained one or more speaker-dependent systems, this training effort cannot be leveraged to improve the performance of the many speaker-independent systems that are encountered. The speaker-independent systems
cannot, by design, access or use speaker-dependent speech models to improve their performance. Hence, a need exists for improved speech recognition systems, software and methods that enable portable speech models that can be used for a wide variety of
tasks and leverage the training efforts across a wide variety of systems.
The dichotomy between speaker-dependent and speaker-independent technologies has resulted in an interesting dilemma in industry. Many of the applications that could benefit most from accurate speech recognition (e.g., interactive voice response
systems) cannot afford the complexity of highly accurate speaker dependent systems, nor obtain the necessary voice models that would improve their accuracy. From a practical perspective, speakers will only invest the significant time required to develop
a high quality voice model in applications where the result is worth the effort. The benefits realized by a business cannot compel individual speakers to submit to the necessary training regimens. Hence, these applications settle for
speaker-independent solutions and invest heavily in improving the performance of such systems.
Increasingly, computer-implemented applications and services are targeting "thin clients" or computers with limited processing power and data storage capacity. Such devices are cost effective means of implementing user interfaces. Thin clients
are becoming prominent in appliances such as televisions, telephones, Internet terminals and the like. However, the limited computing resources make it difficult to implement complex functionality such as voice and speech recognition. A need exists for
voice processing systems, methods and software that can provide high quality voice processing services with reduced hardware requirements.
In the past, computers were used by one user, or perhaps a few users, to access a limited set of applications. As computers are used more frequently to provide interfaces to everyday appliances, the need to adapt user interfaces to multiple
users becomes more pressing. Voice processing, in particular, represents a user input mode that is difficult to adapt to multiple users. In current systems, a voice model must be developed on and stored in each machine for each user. Not only does
this tax the machine's resources, but it creates a burdensome need for each user to train each computer that they use.
Conversely, each user tends to access computer resources via a variety of computer-implemented interfaces and computing hardware. It is contemplated that any given user may wish to access voice-enabled television, voice-enabled software on a
personal computer, voice-enabled automobile controls, and the like. The effort to train and maintain each of these systems individually becomes significant with only a few applications, and prohibitive with the large number of applications that could
potentially become voice enabled.
Hence, a need exists for speech recognition systems, methods and software that provide increased accuracy with reduced cost. Moreover, there is a need for systems that require reduced effort on the part of the speaker. Further, a need exists for
systems and software that enable users to leverage training effort across multiple, disparate speech-recognition-enabled applications.
SUMMARY OF THE INVENTION
Briefly stated, the present invention involves a speech recognition system in which one or more speaker-dependent voice signatures are developed for each of a plurality of speakers. A plurality of configurable speech processing engines are
deployed and integrated with computer applications. A session is initiated between the configurable engine and a particular speaker. The configurable engine identifies the user using voice recognition or other explicit or implicit user-identification
methods. The configurable engine accesses a copy of the speaker dependent voice signature associated with the identified speaker to perform speaker-dependent speech recognition.
In another aspect, the present invention involves voice signatures that are configured to integrate with and be used by a plurality of disparate voice-enabled applications. The voice signature comprises a static data structure or a dynamically
adapting data structure that represents a correlation between a speaker's voice patterns and language constructs. The voice signature is preferably portable across multiple computer hardware and software platforms. Preferably, a plurality of voice
signatures are stored in a network accessible repository for access by voice-enabled applications as needed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows a computer environment in which the present invention is implemented;
FIG. 2 shows entities and relationships in a particular embodiment of the present invention;
FIG. 3 illustrates an exemplary packet structure in accordance with an embodiment of the present invention;
FIG. 4 shows a flow diagram of processes involved in an implementation of the present invention;
FIG. 5 depicts a distributed service model implementing functionality in accordance with the present invention; and
FIG. 6 illustrates an embodiment in which station-to-station duplex voice exchange is implemented.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
The present invention is directed to voice processing systems characterized by a number of distinct aspects. In general, the systems and methods of the present invention are intended to reduce the burden on users and developers of speech recognition
systems by enabling training files and voice models to be readily shared between disparate applications. Further, initial training and voice model adaptation can be implemented with greater efficiency by sharing voice information across multiple
disparate applications.
In one aspect, the present invention provides a "voice processing substrate" or "voice processing service" upon which other software applications can build. By providing high quality voice recognition and speech recognition services
ubiquitously, existing software applications can become "voice-enabled" with significantly lower development cost. Moreover, applications that would not have been practical heretofore due to the high cost and proprietary nature of voice recognition
software are made viable by the distributed, highly portable and scalable nature of the present invention.
In another aspect, the present invention involves applications of the voice processing service such as interactive voice response, dictation and transcription services, voice messaging services, voice automated application services, and the like
that share a common repository of speech recognition resources. These applications, typically implemented as software applications, can leverage the aggregate knowledge about their users' voice and speech patterns by using the shared common speech
recognition resources.
In yet another aspect the present invention involves a distributed voice processing system in which the various functions involved in voice processing can be performed in a pipelined or parallel fashion. Speech tasks differ significantly in
purpose and complexity. In accordance with this aspect of the present invention, the processes involved in speech processing are modularized and distributed amongst a number of processing resources. This enables the system to employ only the required
resources to complete a particular task. Also, this enables the processes to be implemented in parallel or in a pipelined fashion to greatly improve overall performance.
The present invention is illustrated and described in terms of a distributed computing environment such as an enterprise computing system using public communication channels such as the Internet. However, an important feature of the present
invention is that it is readily scaled upwardly and downwardly to meet the needs of a particular application. Accordingly, unless specified to the contrary the present invention is applicable to significantly larger, more complex network environments as
well as small network environments such as conventional LAN systems.
FIG. 1 shows an exemplary computing environment 100 in which the present invention may be implemented. Speech server 101 comprises program and data constructs that function to receive requests from a variety of sources, access voice resources
105, and provide voice services in response to the requests. The provided voice services involve accessing stored voice resources 105 that implement a central repository of resources that can be leveraged to provide services for a wide variety of
requests. The services provided by speech server 101 may vary in complexity from simply retrieving specified voice resources (e.g., obtaining a speech sample file for a particular user) to more complex speech recognition processes (e.g., feature
extraction, phoneme recognition, phoneme-to-text mapping).
Requests to speech server 101 may come directly from voice appliances 102; however, in preferred examples requests come from "voice portals" 110. Voice portals comprise software applications and/or software servers that provide a set of
fundamental behaviors and that are voice enabled by way of their coupling to speech server 101. Example voice portals include interactive voice response (IVR) services 111, dictation service 112 and voice mail service 113. However, the number and
variety of applications and services that can be voice-enabled in accordance with the present invention is nearly limitless. Because voice portals 110 access shared speech server 101 and shared voice resources 105, they do not each need to create,
obtain, or maintain duplicate or special-purpose instances of the voice resources. Instead, the voice portals can focus on implementing the logic necessary to implement their fundamental behaviors, effectively outsourcing the complex tasks associated
with voice processing.
A set 103 of voice appliances 102 represents the hardware and software devices used to implement voice-enabled user interfaces. Exemplary voice appliances 102 include, but are not limited to, personal computers with microphones or speech
synthesis programs, telephones, cellular telephones, voice over IP (VoIP) terminals, laptop and hand held computers, computer games and the like. Any given speaker may use a plurality of voice appliances 102. Likewise, any given voice appliance 102 may
be used by multiple speakers.
A variety of techniques are used to perform voice processing. Typically speech recognition starts with the digital sampling of speech followed by acoustic signal processing. Most techniques include spectral analysis such as Fast Fourier
Transform (FFT) analysis, LPC analysis (Linear Predictive Coding), MFCC (Mel Frequency Cepstral Coefficients), cochlea modeling and the like. Using phoneme recognition, the preprocessed files are parsed to identify groups of phonemes and words using
techniques such as DTW (Dynamic Time Warping), HMM (hidden Markov modeling), NNs (Neural Networks), expert systems, N-grams and combinations of techniques. Most systems use some knowledge of the language (e.g., syntax and context) to aid the recognition
process.
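As an illustration of one of the pattern-matching techniques named above, the following is a minimal sketch of dynamic time warping (DTW) in Python. It is not taken from the described embodiments. It assumes each utterance has already been reduced to a one-dimensional sequence of feature values, whereas practical systems compare multi-dimensional spectral feature vectors, and the name dtw_distance and the sample sequences are hypothetical.

```python
from math import inf


def dtw_distance(a, b):
    """Return the dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical feature sequences: a stored template, the same pattern spoken
# more slowly, and an unrelated pattern.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow_utterance = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 1.0]
unrelated = [3.0, 3.0, 3.0, 3.0, 3.0]

print(dtw_distance(template, slow_utterance))
print(dtw_distance(template, unrelated))
```

Because the warping path may repeat elements of either sequence, the slowed utterance aligns to the template at zero cost, while the unrelated pattern does not. This tolerance to speaking rate is the property that makes DTW useful for matching utterances against stored templates.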
The precise distribution of functionality amongst the various components shown in FIG. 1 can vary significantly. Modularization of components allows components to be placed or implemented rationally within the network architecture. For example,
analog-to-digital conversion (ADC) and digital signal processing (DSP) steps may occur within voice appliances 102 such that a digital preprocessed signal is communicated to voice portals 110. Alternatively, this preprocessing can be performed by voice
portals 110, or can be out-sourced to speech server 101. In many applications it is preferable to perform these preprocessing functions as near to the analog voice source (e.g., the speaker) as possible to avoid signal loss during communication.
Conversely, it is contemplated that copies of shared voice resources can be stored permanently or temporarily (i.e., cached) within voice portals 110 and/or voice appliances 102 so that more complex functions can be implemented without accessing speech
server 101 in each instance.
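The caching arrangement contemplated above can be sketched as a simple read-through cache. This is an illustrative Python sketch only. The class CachingPortal, the function fake_speech_server, and the dictionary layout of a voice resource are assumptions made for the example and are not part of the described system.

```python
class CachingPortal:
    """Voice portal that keeps local copies of shared voice resources."""

    def __init__(self, fetch_from_server):
        # fetch_from_server is any callable mapping a speaker ID to a resource.
        self._fetch = fetch_from_server
        self._cache = {}
        self.server_calls = 0  # counts round trips to the speech server

    def get_signature(self, speaker_id):
        # Contact the speech server only on a cache miss.
        if speaker_id not in self._cache:
            self.server_calls += 1
            self._cache[speaker_id] = self._fetch(speaker_id)
        return self._cache[speaker_id]


def fake_speech_server(speaker_id):
    # Stand-in for a request to the speech server (hypothetical payload).
    return {"speaker": speaker_id, "model": "placeholder"}


portal = CachingPortal(fake_speech_server)
first = portal.get_signature("speaker-42")
second = portal.get_signature("speaker-42")
print(portal.server_calls)
```

The second lookup is served from the local copy, so only one request reaches the server. A real deployment would also need an expiry or invalidation policy so that an adapted voice model is not masked by a stale copy.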
Each of the devices shown in FIG. 1 may include memory, mass storage, and a degree of data processing capability sufficient to manage their connection to a network. The computer program devices in accordance with the present invention are
implemented in the memory of the various devices shown in FIG. 1 and enabled by the data processing capability of the devices shown in FIG. 1. In addition to local memory and storage associated with each device, it is often desirable to provide one or
more locations of shared storage such as a disk farm (not shown) that provides mass storage capacity beyond what an individual device can efficiently use and manage. Selected components of the present invention may be stored in or implemented in shared
mass storage.
FIG. 2 shows conceptual relationships between entities in a specific embodiment of the present invention. Voice appliance 102 interacts with a speaker and communicates a voice signal over network 201 to voice portal 110. The term "voice signal"
is intended to convey a very broad range of signals that capture the voice utterances of a user in analog or digital form and which indicate an identity of the speaker. The speaker identification can be to a specific individual speaker, or an indication
of a group to which the speaker belongs (e.g., English-speaking children from Phoenix, Ariz.). The speaker identification can take a variety of forms, and may be explicitly provided by the speaker or voice appliance 102 or implied from the connection
through network 201 using techniques such as caller ID, area code information, or reverse telephone directory lookup.
Network 201 may comprise the public switched telephone network (PSTN) including cellular phone networks, as well as local area networks (LANs), wide area networks (WANs), metropolitan area networks (MANs), as well as public internetworks such as
the Internet. Any network or group of networks capable of transporting the voice signal and speaker identification information is a suitable implementation of network 201. Internet 202 is an example of a data communication network suitable for
exchanging data between components of the present invention. While Internet 202 is an example of a public IP-based network, any public, private, or hybrid network or internetwork topology, including LANs, WANs, and MANs, is a suitable equivalent in
most applications.
Voice portal 110 comprises speech-enabled application 204 and speech recognition (SR) front-end 203. Application 204 implements desired fundamental behaviors of the application, such as looking up telephone numbers, weather information, stock
quotes, addresses and the like. Speech-enabled application 204 has an interface that couples to SR front-end 203. This interface may be configured to receive voice-format data such as phoneme probabilities or text input, but may also be configured to
receive commands or other structured input such as structured query language (SQL) statements.
SR front-end 203 implements a defined interface that is protocol compliant with network 201 to communicate request and response traffic with voice appliances 102. SR front-end 203 receives requests from voice appliances 102 where the requests
identify the speaker and include a voice signal. SR front-end 203 generates a request to speech server 101 to access shared resources 105 needed to process the voice signal so as to generate input to speech-enabled application 204. The processing
responsibilities between SR front-end 203 and speech server 101 are agreed upon in advance, but can be varied significantly. In a particular example, the requests from SR front end 203 include a digitized speech signal, and the responses from speech
server 101 include a set of phoneme probabilities corresponding to the speech signal.
It is contemplated that a typical system will involve multiple SR front-end devices 203 communicating simultaneously with a single speech server 101. Each front end 203 may handle multiple voice appliances 102 simultaneously. One advantage of
the present invention is that centralized speech server 101 can be configured to process these requests in parallel more readily than could individual voice appliances 102. In such cases, requests to speech server 101 are preferably accompanied by a
source identification that uniquely identifies a particular SR front-end 203 and a stream identifier that uniquely identifies a particular voice session that is using the identified SR front end 203. In some cases the speaker ID can also be used to
identify the session, although when a particular voice appliance 102 is conducting multiple simultaneous sessions, the speaker ID alone may be an ambiguous reference. This information can be used to route the resources 105 to appropriate processes that
are using the resources.
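The routing role of the source identification and stream identifier can be sketched as follows. This is a minimal Python illustration under stated assumptions. The names SessionKey, ResourceRouter, source_id, and stream_id are hypothetical, and the per-session list stands in for whatever process consumes the routed resources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionKey:
    source_id: str  # hypothetical field: identifies one SR front-end
    stream_id: int  # hypothetical field: identifies one session on that front-end


class ResourceRouter:
    """Delivers voice resources to the session that requested them."""

    def __init__(self):
        self._inboxes = {}

    def open_session(self, key):
        # Each session gets its own inbox, keyed by (source, stream).
        self._inboxes[key] = []
        return self._inboxes[key]

    def deliver(self, key, resource):
        self._inboxes[key].append(resource)


router = ResourceRouter()
# The same speaker holds two simultaneous sessions through one front-end,
# so the speaker ID alone could not tell the sessions apart.
dictation = SessionKey("front-end-A", 1)
voice_mail = SessionKey("front-end-A", 2)
dictation_inbox = router.open_session(dictation)
voice_mail_inbox = router.open_session(voice_mail)

router.deliver(dictation, "signature for speaker-42")
print(dictation_inbox, voice_mail_inbox)
```

Keying on the pair rather than on the speaker ID keeps the two sessions separate even though both belong to the same speaker.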
SR front end 203 exchanges request/response traffic with speech server 101 over the Internet 202 in the example of FIG. 2. The request/response traffic comprises hypertext transfer protocol (HTTP) packets over TCP/IP in the particular example,
although other protocols are suitable and may be preferable in some instances. For example, user datagram protocol (UDP) can be faster, although it offers poorer reliability. The benefits of various protocol layers and stacks are well known and
readily consulted in the selection of particular protocols.
Voice resources 105 comprise speaker-dependent signatures 207 and speaker group signatures 208 in a particular embodiment. Speaker-dependent signatures 207 comprise one or more voice models associated with a particular speaker. In contrast,
speaker group signatures 208 comprise one or more voice models that are associated with a group of speakers such as English speaking children from Phoenix, Ariz., rather than a particular speaker. Group signatures are a useful middle ground where a
particular speaker cannot be identified with certainty, but the speaker can be identified generally as a member of a particular speaker group.
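One way to picture the relationship between the two kinds of signature is a lookup that prefers a speaker-dependent signature and falls back to a group signature. The Python sketch below is illustrative only. The class VoiceResources, its select method, and the identifier strings are assumptions made for the example, and the stored values are placeholders rather than real voice models.

```python
class VoiceResources:
    """Toy repository holding speaker-dependent and speaker group signatures."""

    def __init__(self):
        self.speaker_signatures = {}  # speaker ID -> voice model(s)
        self.group_signatures = {}    # group ID -> voice model(s)

    def select(self, speaker_id=None, group_id=None):
        # Prefer the signature trained for the specific speaker.
        if speaker_id is not None and speaker_id in self.speaker_signatures:
            return "speaker", self.speaker_signatures[speaker_id]
        # Otherwise fall back to the signature for the speaker's group.
        if group_id is not None and group_id in self.group_signatures:
            return "group", self.group_signatures[group_id]
        return None


resources = VoiceResources()
resources.speaker_signatures["speaker-42"] = "model-for-speaker-42"
resources.group_signatures["en-children-phoenix"] = "model-for-group"

known = resources.select(speaker_id="speaker-42", group_id="en-children-phoenix")
unknown = resources.select(speaker_id="unrecognized", group_id="en-children-phoenix")
print(known, unknown)
```

When neither identifier matches, select returns None, and a caller could then apply a speaker-independent model.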
The voice models essentially implement a mapping between voice signals and symbols, words, word portions (e.