Speech recognition software is provided in combination with application-specific software on a communications network. Analog voice data is digitized at a user's location, identified as voice data, and transmitted to the application software residing at a central location. The network server receiving data identified as voice data transmits it to a speech server. Speech recognition software resident at the speech server contains a dictionary and modules tailored to the voice of each of the users of the speech recognition software. As the user speaks, a translation of the dictation is transmitted back to the user's location and appears in print on the user's computer screen for examination and, if necessary, voice or typed correction of its contents. Multiple users have interleaved access to the speech recognition software so that transmission back to each of the users is contemporaneous.
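The interleaved access described in this abstract can be illustrated with a minimal sketch. The abstract does not say how interleaving is scheduled; the round-robin policy, the function name `interleave`, and the per-user queues below are all assumptions made for illustration only.

```python
from collections import deque
from typing import Deque, Dict, Iterator, Tuple


def interleave(queues: Dict[str, Deque[bytes]]) -> Iterator[Tuple[str, bytes]]:
    """Yield one pending voice chunk per user per pass (assumed round-robin policy).

    Each user's digitized voice chunks wait in that user's own queue. Taking one
    chunk from each non-empty queue in turn keeps any single user from
    monopolizing the shared speech server.
    """
    while any(queues.values()):          # an empty deque is falsy
        for user, pending in queues.items():
            if pending:
                yield user, pending.popleft()


# Hypothetical usage: two users dictating at the same time.
queues = {
    "alice": deque([b"a1", b"a2"]),
    "bob": deque([b"b1"]),
}
order = list(interleave(queues))
```

With this policy, each user's next chunk is served within one pass over the active users, which is one plausible way to keep the returned transcriptions roughly contemporaneous.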
A system and method for generating on-demand voiceprints are presented wherein voiceprints are created on the fly using voice recordings and associated metadata specified by an application. The application requests a voiceprint and specifies a description of the data necessary to generate the voiceprint, including the appropriate voice recordings, the requisite verification engine and other parameters that should be utilized to generate the voiceprint. The specified voice recordings are accessed from storage and a voiceprint is produced using the designated speech engine and application-specified parameters.
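The request flow in this abstract can be sketched as follows, under stated assumptions. The names `VoiceprintRequest`, `generate_voiceprint`, and `toy_engine`, the dictionary-based storage, and the list-of-floats voiceprint representation are hypothetical; the abstract specifies only that the application names the recordings, the engine, and the parameters.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumed engine interface: recordings plus parameters in, voiceprint out.
Engine = Callable[[List[bytes], Dict[str, float]], List[float]]


@dataclass
class VoiceprintRequest:
    recording_ids: List[str]                 # which stored recordings to use
    engine: str                              # which verification engine to run
    params: Dict[str, float] = field(default_factory=dict)


def generate_voiceprint(
    request: VoiceprintRequest,
    storage: Dict[str, bytes],
    engines: Dict[str, Engine],
) -> List[float]:
    """Build a voiceprint on demand instead of reading a precomputed one."""
    recordings = [storage[rid] for rid in request.recording_ids]
    engine = engines[request.engine]
    return engine(recordings, request.params)


def toy_engine(recordings: List[bytes], params: Dict[str, float]) -> List[float]:
    """Stand-in engine: one scaled mean byte value per (non-empty) recording."""
    scale = params.get("scale", 1.0)
    return [scale * sum(rec) / len(rec) for rec in recordings]


# Hypothetical usage.
storage = {"r1": bytes([1, 3]), "r2": bytes([10, 20])}
request = VoiceprintRequest(["r1", "r2"], engine="toy", params={"scale": 2.0})
voiceprint = generate_voiceprint(request, storage, {"toy": toy_engine})
```

Because nothing is cached in this sketch, the same recordings can be turned into a voiceprint for a different engine simply by changing the `engine` field of the request.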
A speech-enabled distributed processing system forming a Voice Web includes a gateway, one or more voice content sites coupled to the gateway over a wide area network, and a browser coupled to the gateway over a network, which may or may not be the wide area network. The gateway receives telephone calls from one or more users over telephony connections and performs endpointing of speech of each user. The browser provides the gateway with information enabling the gateway to selectively direct the endpointed speech to a voice content site via the wide area network. The gateway outputs the endpointed speech in the form of application protocol requests onto the wide area network to the appropriate site, as specified by the browser, or to the browser. The gateway receives prompts in the form of application protocol responses from the browser or a voice content site and plays the prompts to the appropriate user over the telephony connection. While accessing a selected voice content site, the gateway reroutes the endpointed speech to the browser if the endpointing result represents a hotword candidate.
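The routing decision the gateway makes for each endpointed utterance can be sketched as a small function. This is an illustrative reading of the abstract, not the disclosed implementation: the function name, the `BROWSER` constant, and the choice to send speech to the browser when no site is selected are assumptions.

```python
from typing import Optional

BROWSER = "browser"  # hypothetical identifier for the browser destination


def route_endpointed_speech(
    is_hotword_candidate: bool,
    selected_site: Optional[str],
) -> str:
    """Return the destination for one endpointed utterance.

    Speech normally goes to the voice content site the browser selected.
    A hotword candidate is rerouted to the browser even while a site is active.
    Assumption: with no site selected, speech goes to the browser.
    """
    if selected_site is None or is_hotword_candidate:
        return BROWSER
    return selected_site


# Hypothetical usage: a caller is on a weather site and says a hotword.
destination = route_endpointed_speech(True, "weather.example")
```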
A system is disclosed for facilitating speech recognition and transcription among users employing incompatible protocols for generating, transcribing, and exchanging speech. The system includes a system transaction manager that receives a speech information request from at least one of the users. The speech information request includes formatted spoken text generated using a first protocol. The system also includes a speech recognition and transcription engine, which communicates with the system transaction manager. The speech recognition and transcription engine receives the speech information request from the system transaction manager and generates a transcribed response, which includes a formatted transcription of the formatted spoken text. The system transmits the response to the system transaction manager, which routes the response to one or more of the users. The latter users employ a second protocol, which may be the same as or different from the first protocol, to handle the response. The system transaction manager utilizes a uniform system protocol for handling the speech information request and the response.
The invention provides a system, a method of transmitting messages selectively as text or non-text from an entity (104) in a network (100 and 102), and an entity in a network. A system in accordance with the invention includes at least one terminal (16); a network containing the at least one terminal; and an entity in the network which provides messages selectively as text or non-text to the network in a speech-encoded form. The messages are transmitted in the speech-encoded form by the network to the at least one terminal, which reproduces the messages to a user thereof either in a text form or by a sound reproduction device of the at least one terminal.
A technique for remotely processing a local audio command to control a local device includes: receiving at a local site an acoustic signal and generating a corresponding audio signal; transmitting the audio signal to a remote site; performing speech recognition processing on the audio signal at the remote site to determine whether the audio signal includes a command; performing voice recognition processing on the audio signal at the remote site to determine whether the audio signal has been supplied by an authorized user; generating a command signal in response to the audio signal including a command and being supplied by an authorized user; and transmitting the command signal to a device at the local site to effect a change in a state of the local device.
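The remote-site steps above can be sketched as a two-stage gate: a command signal is produced only when the audio both contains a command and comes from an authorized user. The names `CommandSignal`, `process_remote_audio`, `toy_recognizer`, and `toy_speaker_check` are hypothetical, and the toy functions stand in for real speech and voice recognition engines, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CommandSignal:
    device_id: str   # local device whose state should change
    command: str     # recognized command to apply


def process_remote_audio(
    audio: bytes,
    device_id: str,
    recognize_command: Callable[[bytes], Optional[str]],
    is_authorized_speaker: Callable[[bytes], bool],
) -> Optional[CommandSignal]:
    """Return a command signal only if both recognition checks pass."""
    command = recognize_command(audio)        # speech recognition: what was said
    if command is None:
        return None
    if not is_authorized_speaker(audio):      # voice recognition: who said it
        return None
    return CommandSignal(device_id=device_id, command=command)


# Toy stand-ins: "audio" is just b"<speaker>:<words>" for demonstration.
def toy_recognizer(audio: bytes) -> Optional[str]:
    _, _, spoken = audio.decode("utf-8", errors="ignore").partition(":")
    return spoken if spoken in {"lights on", "lights off"} else None


def toy_speaker_check(audio: bytes) -> bool:
    return audio.startswith(b"alice:")


signal = process_remote_audio(
    b"alice:lights on", "lamp-1", toy_recognizer, toy_speaker_check
)
```

In a real deployment the returned signal would be transmitted back to the local site; here it is simply returned so the gating logic can be inspected on its own.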