Description
TECHNICAL FIELD
This invention relates to methods and apparatus for automating various user
initiated telephony processes, particularly through the use of improved
recognition systems and methodology.
BACKGROUND ART
In the environment of telecommunications systems there has been a steady
trend toward automating what was originally operator assistance traffic.
Much current activity is directed to responding to directory assistance
calls by processing voice frequency instructions from the caller without
operator intervention. The instructions are used by an automatic speech
recognition unit to generate data signals corresponding to recognized
voice frequency signals. The data signals are then used to search a
database for a directory listing to derive the desired directory number. A
system of this type is described in U.S. Pat. No. 4,979,206 issued Dec.
18, 1990.
According to that patent such automated service is supplied by a switching
system equipped with an automatic speech recognition facility for
interpreting a spoken or keyed customer request comprising data for
identifying a directory listing. In response to recognition of data
conveyed by the request, the system searches a database to locate the
directory number listing corresponding to the request. This listing is
then automatically announced to the requesting customer. In implementing
this system the calling customer or caller receives a prompting
announcement requesting that the caller provide the zip code or spell the
name of the community of the desired directory number. The caller is also
prompted to spell the last name of the customer corresponding to the
desired directory number. If further data is required, the caller may be
prompted to spell the first name and street address of the desired party.
Following responses to prompting announcements a search is made to
determine if only one listing corresponds to the data supplied by the
caller. When this occurs the directory number is announced to the caller.
The aim of such a system has been to require a minimum of speech
recognition capability by the speech recognition facility--namely, only
letters of the alphabet and numbers.
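The prompting sequence described above can be sketched as a progressive
filter over the directory. This is a minimal illustration only, not the
method of the cited patent: the record layout, the field names, the sample
listings and the function find_unique_listing are all hypothetical, and
recognized responses are assumed to arrive as already-spelled strings.

```python
from dataclasses import dataclass

# Hypothetical record layout; field names are illustrative only.
@dataclass(frozen=True)
class Listing:
    community: str
    last_name: str
    first_name: str
    street: str
    number: str

# Prompt order taken from the description: community (or zip code) and last
# name first, then first name and street address only if still ambiguous.
PROMPT_ORDER = ("community", "last_name", "first_name", "street")

def find_unique_listing(listings, responses):
    """Filter listings by each recognized response, in prompt order.

    Returns (listing or None, number of prompts consumed).
    """
    candidates = list(listings)
    used = 0
    for field in PROMPT_ORDER:
        if field not in responses:
            break
        used += 1
        wanted = responses[field].strip().upper()
        candidates = [c for c in candidates
                      if getattr(c, field).upper() == wanted]
        # The first two prompts are always issued; stop afterwards as soon
        # as the search is down to one listing (or none).
        if used >= 2 and len(candidates) <= 1:
            break
    return (candidates[0] if len(candidates) == 1 else None), used

# Made-up sample data for illustration.
DIRECTORY = [
    Listing("SPRINGFIELD", "SMITH", "ANN", "1 ELM ST", "555-0101"),
    Listing("SPRINGFIELD", "SMITH", "BOB", "9 OAK AVE", "555-0102"),
    Listing("SPRINGFIELD", "JONES", "CAL", "4 PINE RD", "555-0103"),
]

listing, prompts = find_unique_listing(
    DIRECTORY,
    {"community": "Springfield", "last_name": "Smith", "first_name": "Bob"},
)
```

In this sketch the two SMITH listings force a third prompt, whereas a
request for JONES would be resolved after the second prompt.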
A typical public switched telephone network (PSTN) arrangement proposed to
effect such a system is illustrated in block diagram form in FIG. 1 of the
aforementioned patent (PRIOR ART). The network of FIG. 1 is here described
in some detail as a typical environment in which the method and apparatus
of the invention may be utilized. In FIG. 1 block 1 represents a
telecommunications switching system, or switch operating under stored
program control. Switch 1 may be a switch such as the 5ESS switch
manufactured by AT&T Technologies, Inc., arranged to offer the Operator
Services Position System (OSPS) features.
Shown within switch 1 are various blocks for carrying out the functions of
a program controlled switch. Control 10 is a distributed control system
operating under the control of a group of data and call processing
programs to control various sections or elements of switch 1. Element 12
is a voice and data switching network capable of switching voice and/or
data between inputs connected to that switching network, frequently
referred to as the switch fabric or network. Connected to network 12 is a
Voice Processing Unit (VPU) 14. Network 12 and VPU 14 operate under the
control of control 10. Trunks 31 and 33, customer line 44, data link 35,
and operator access facility 26 are connected to network 12 at input ports
31a, 33a, 44a, 35a, and 26a respectively, and control 10 is connected to
network 12 via data channel 11 at input port 11a.
VPU 14 receives speech or customer keyed information from callers at
calling terminals 40 or 42 and processes the voice signals or keyed tone
signals from a customer station using well known automatic speech
recognition techniques to generate data corresponding to the speech or
keyed information. These data are used by Directory Assistance Computers
(DAS/C) 56 in making a search for a desired telephone or directory number
listing. When a directory assistance request comes from a customer
terminal 42 via customer line 44, port 44a and switching network 12 to VPU
14, VPU 14 analyzes voice input signals to recognize individual ones of
various elements corresponding to a predetermined list of spoken
responses.
VPU 14 also generates voice messages or announcements to prompt a caller to
speak information into the system for subsequent recognition by the voice
processing unit. VPU 14 generates output data signals, representing the
results of the voice processing. These output signals are sent to control
10 whence they may be transmitted via data link 59 to DAS/C computer 56,
or be used within control 10 as an input to the program of control 10 for
controlling establishment of connections in switching network 12 or
requesting further announcements by VPU 14. VPU 14 includes announcement
circuits 13 and detection circuits, i.e., automatic speech recognition
circuits 15 both controlled by a controller of VPU 14. A Conversant 1
Voice System, Model 80, manufactured by AT&T Technologies, Inc., may be
used to carry out the functions of the VPU 14.
When the DAS/C computer 56 completes its data search and locates the
requested directory listing, it is connected via data link 58 to an Audio
Response Unit (ARU) 60, which is connected to the voice and data switching
network 12 for announcing the telephone number of an identified telephone
listing. Computer Consoles, Inc. (CCI) manufactures an Audio Response Unit
60 and the DAS/C terminal 52 which may be used in this environment. As
shown, the DAS/C computer 56 is directly connected to control 10 by data
link 59 but could be connected to control 10 via a link to network 12 and
a connection through network 12 via port 11a. After a directory listing is
found the directory number is reported to audio response unit 60 for
announcement to the caller.
Directory assistance calls can also be processed with the help of an
operator if the VPU fails to recognize adequate oral information.
Connected to switch 1 are trunks 31 and 33 connected to local switch 30 and
interconnection network 32. Local switch 30 is connected to calling
customer terminal 40 and interconnection network 32 is connected to a
called customer terminal 46. Switch 30 and network 32 connect customer
terminal signals from customer terminals to switch 1. Also connected to
switch 1 are customer lines including customer line 44 for connecting a
customer terminal 42 to switch 1.
In an alternate connection calling terminal 40 is connected via local
switch 30 to switch 1. In a more general case, other switches forming part
of a larger public telephone network such as interconnection network 32
would be required to connect calling terminal 40 to switch 1. Generally
speaking, calls are connected to switch 1 via communication links such as
trunks 31 and 33 and customer line 44. In the alternate connection calling
terminal 40 is connected by a customer line to a 1AESS 30, manufactured by
AT&T Technologies, Inc., and used here as a local switch or end office.
That switch is connected to trunk 31 which is connected to switch 1. Local
switch 30 is also connected to switch 1 by a data link 35 used for
conveying common channel signaling messages between these two switches.
Such common channel signaling messages are used herein to request switch
30 to initiate the setting up of a connection, for example, between
customer terminals 40 and 46. Switch 1 is connected in the example
terminating connection to called terminal 46 via interconnection network
32. If the calling terminal is not directly connected to switch 1, the
directory number of the calling terminal identified, for example, by
Automatic Number Identification (ANI), is transmitted from the switch
connected to the calling terminal to switch 1.
Operator position terminal 24 connected to switch 1 comprises a terminal
for use by an operator in order to provide operator assistance. Data
displays for the operator position terminal 24 are generated by control
10. Operator position terminal 24 is connected to switching network 12 by
operator access facility 26 which may include carrier facilities to allow
the operator position to be located far from switching network 12 or may
be a simple voice and data access facility if the operator positions are
located close to the switching network.
In order to handle directory assistance services, the directory assistance
operator has access to two separate operator terminals; terminal 24 for
communicating with the caller and switch 1 and terminal 52 used for
communicating via data link 54 with DAS/C computer 56. The operator at
terminals 24 and 52 communicates orally with a caller and on the basis of
these communications keys information into the DAS/C terminal 52 for
transmission to the DAS/C computer 56. The DAS/C computer 56 responds to
such keyed information by generating displays of information on DAS/C
terminal 52 which information may include the desired directory number.
Until the caller provides sufficient information to locate a valid listing
the caller is not connected to an audio response unit since there is
nothing to announce. Further details of the operation of the system of
FIG. 1 are set forth in U.S. Pat. No. 4,979,206.
Further examples of the use of voice recognition in the automation of
telephone operator assistance calls are found in U.S. Pat. Nos. 5,163,083,
issued Nov. 10, 1992; 5,185,781, issued Feb. 9, 1993; and 5,181,237,
issued Jan. 19, 1993, to Dowden et al.
Another proposed use for speech recognition in a telecommunications network
is voice verification. This is the process of verifying the person's
claimed identity by analyzing a sample of that person's voice. This form
of security is based on the premise that each person can be uniquely
identified by his or her voice. The degree of security afforded by a
verification technique depends on how well the verification algorithm
discriminates the voice of an authorized user from all unauthorized users.
It would be desirable to use voice verification to verify the identity of
a telephone caller. Such schemes to date, however, have not been
implemented in a fully satisfactory manner. One such proposal for
implementing voice verification is described in U.S. Pat. No. 5,297,194,
issued Mar. 22, 1994, to Hunt et al. In an embodiment of such a system
described in this patent a caller attempting to obtain access to services
via a telephone network is prompted to enter a spoken password having a
plurality of digits. Preferably, the caller is prompted to speak the
password beginning with the first digit and ending with a last digit. Each
spoken digit of the password is then recognized using a
speaker-independent voice recognition algorithm. Following entry of the
last digit of the password, a determination is made whether the password
is valid. If so, the caller's identity is verified using a voice
verification algorithm.
This method is implemented according to that patent using a system
comprising a digital processor for prompting the caller to speak the
password; speech processing means, controlled by the digital processor,
for effecting a multi-stage data reduction process and generating the
resulting voice recognition and voice verification parameter data; and
voice recognition and verification routines.
Following the digit based voice recognition step, the voice verification
routine is controlled by the digital processor and is responsive to a
determination that the password is valid for determining whether the
caller is an authorized user. This routine includes transformation means
that receives the speech feature data generated for each digit and the
voice verification feature transformation data and in response thereto
generates voice verification parameter data for each digit. A verifier
routine receives the voice verification parameter data and the
speaker-relative voice verification class reference data and in response
thereto generates an output indicating whether the caller is an authorized
user.
In operation a caller places a call from a conventional calling station
telephone to a financial institution or card verification company in order
to access account information. The caller has previously enrolled in the
voice verification database that includes his or her voice verification
class reference data. The financial institution includes suitable
input/output devices connected to the system (or integral therewith) to
interface signals to and from the telephone lines. Once the call setup
has been established, the digital processor controls the prompt means to
prompt the caller to begin digit-by-digit entry of the caller's
preassigned password. The voice recognition algorithm processes each digit
and uses a statistical recognition strategy to determine which digit (0-9
and "oh") is spoken. After all digits have been recognized, a test is made
to determine whether the entered password is valid for the system. If so,
the caller is conditionally accepted. In other words, if the password is
valid the system "knows" who the caller claims to be and where the account
information is stored.
Thereafter the system performs voice verification on the caller to
determine if the entered password has been spoken by a voice previously
enrolled in the voice verification reference database and assigned to the
entered password. If the verification algorithm establishes a "match,"
access to the data is provided. If the algorithm substantially matches the
voice to the stored version thereof but not within a predetermined
acceptance criterion, the system prompts the caller to input additional
personal information to further test the identity of the claimed owner of
the password. If the caller cannot provide such information, the system
rejects the access inquiry and the call is terminated.
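The three outcomes just described (access, a request for additional
personal information, or rejection) can be sketched as a small decision
function. This is an illustration under stated assumptions rather than
the algorithm of the cited patent: the similarity score in the range 0 to
1, the two threshold values and all names are hypothetical, and the
handling of an invalid password, of a clear mismatch, and of a caller who
does supply the additional information is assumed, since the text does
not spell those cases out.

```python
from enum import Enum
from typing import Optional

class Outcome(Enum):
    ACCESS = "access granted"
    CHALLENGE = "ask for additional personal information"
    REJECT = "reject and terminate call"

# Hypothetical thresholds on a hypothetical similarity score in [0, 1].
ACCEPT_AT = 0.90   # at or above: within the acceptance criterion
NEAR_AT = 0.70     # at or above: "substantially matches"

def verification_outcome(password_valid: bool,
                         score: float,
                         extra_info_ok: Optional[bool] = None) -> Outcome:
    """Decide what happens after the spoken password has been recognized.

    extra_info_ok is None until the caller has been challenged; afterwards
    it records whether the additional personal information checked out.
    """
    if not password_valid:
        return Outcome.REJECT            # assumption: no verification attempted
    if score >= ACCEPT_AT:
        return Outcome.ACCESS
    if score >= NEAR_AT:
        if extra_info_ok is None:
            return Outcome.CHALLENGE
        # Assumption: correct additional information grants access.
        return Outcome.ACCESS if extra_info_ok else Outcome.REJECT
    return Outcome.REJECT                # assumption: clear mismatch is refused
```

A caller scoring in the intermediate band is first challenged, and the
function is called again once the answer is known.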
Existing approaches for deploying speech recognition technology for
universal application are based on creating speech models based on
"average" voice features. This averaging approach tends to exclude persons
with voice characteristics beyond the boundaries created by the averaging.
The speech model averages are based on the training set used when the
models are created. For example, if the models are created using speech
samples from New Englanders, then the models will tend to exclude voices
with Southern accents or voices with Hispanic accents. If the models try
to average an all inclusive population, the performance deteriorates for
the entire spectrum.
BRIEF SUMMARY OF THE INVENTION
It is an object of the invention to provide a system and method for
accomplishing universal speech recognition on a reliable basis using a
unique combination of existing technologies and available equipment.
The new and improved methodology and system involve an initial two-step
passive and active procedure to preselect the most appropriate technology
model or device for each type of caller. The passive feature may be based
on numerous factors subject to determination without seeking active
participation by the customer or user. One such factor is demographics
which may be determined by identifying the geographic area of origin of
the call. This may be accomplished through the use of ANI or Caller ID or
any one of a number of other passively determinable factors such as ICLID,
DNIC, NNX, area code, time of day, snapshot, date or biometrics. If the
profile database constructed for the purpose of making an appropriate
choice of recognition technology model or device on the basis of passive
features is inconclusive, a second step or active procedure may be
initiated. This may take the form of an automated oral query or prompt to
solicit a customer or caller response that can be analyzed to select the
appropriate recognition model or device following the caller active step.
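The two step procedure can be sketched as follows. This is a minimal
illustration, assuming that the profile database maps (feature name,
feature value) pairs to a model label and that the passive step counts as
conclusive only when every matching feature points to the same model.
The function select_model, the feature names, the model labels and the
active_query callback are hypothetical.

```python
from typing import Callable, Mapping, Optional, Tuple

def select_model(
    passive_features: Mapping[str, str],
    profile_db: Mapping[Tuple[str, str], str],
    active_query: Callable[[], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (model label, which step decided)."""
    # Passive step: look up every feature obtained without caller help.
    matches = {
        profile_db[(name, value)]
        for name, value in passive_features.items()
        if (name, value) in profile_db
    }
    if len(matches) == 1:
        return next(iter(matches)), "passive"
    # Inconclusive (no match, or conflicting matches): active step.
    # active_query stands in for prompting the caller and analyzing the
    # spoken response; that analysis is outside this sketch.
    return active_query(), "active"

# Made-up profile data for illustration.
PROFILE_DB = {
    ("area_code", "305"): "model_spanish_accent",
    ("area_code", "617"): "model_new_england",
    ("time_of_day", "night"): "model_general",
}
```

For example, a call carrying only the feature ("area_code", "617") is
resolved passively, while a call whose features map to two different
models falls through to the active query.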
It has been recognized by the inventor that a factor in obtaining high
efficiency speech recognition is that the speech recognition products of
different vendors perform more or less satisfactorily under differing
specific circumstances. For example, the equipment of one vendor may
provide the best performance for continuous digit recognition, the
equipment of another vendor may provide the best performance for speaker
dependent recognition, the equipment of still another vendor may provide
the best performance for speaker independent/word spotting recognition,
the equipment of another vendor or different equipment of the same vendor
may provide the best performance for male voices or female voices, etc.
According to the invention this seeming limitation is utilized to advantage
by providing a platform (which may be distributed) which includes the
speech recognition equipment of multiple vendors. The recognition task is
then handled by directing a specific recognition question to the type of
equipment best able to handle that specific situation. Thus an optimal
arrangement might incorporate the algorithms of multiple vendors within a
single bus architecture so that multiple vendor boards are placed on the
main machine and the operating program directs the signal to be recognized
to the most appropriate board for processing.
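Directing each recognition task to the best-suited board can be pictured
as a lookup in a table of pre-established performance figures. The sketch
below is illustrative only; the task names, board names and accuracy
values are made up, and best_board is a hypothetical helper.

```python
# Hypothetical accuracy figures per task type and vendor board.
BENCHMARKS = {
    "continuous_digits": {"board_a": 0.97, "board_b": 0.92, "board_c": 0.90},
    "speaker_dependent": {"board_a": 0.88, "board_b": 0.96, "board_c": 0.91},
    "word_spotting":     {"board_a": 0.85, "board_b": 0.89, "board_c": 0.95},
}

def best_board(task, benchmarks=BENCHMARKS):
    """Return the board with the highest recorded score for this task."""
    scores = benchmarks.get(task)
    if not scores:
        raise KeyError(f"no benchmark data for task {task!r}")
    return max(scores, key=scores.get)
```

With these sample figures, a continuous digit task goes to board_a and a
word spotting task to board_c.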
In many cities it is known that certain areas are largely, if not
completely, populated by particular ethnic groups. As a part of the
passive step, the incoming call can be identified as to the area of call
origin and that call directed at the outset to a voice recognition
sub-system which is most effective for the language or accent of that
ethnic group. This may be accomplished by creating a demographic database
based on statistical data collected for the involved city. Thus each city
may have its own unique demographic database.
According to a preferred embodiment the recognition device may then
comprise a platform which includes multiple different recognition
resources. Specific resources are then selected for their pre-established
ability to handle different situations with high efficiency. With such
resources available across a backbone, such as an Ethernet, an executive
server can direct a speech input to a selected resource depending upon the
ethnic vocabulary needed at that time. The demographic database may be
advantageously associated with and controlled by the intelligence
available in the AIN ISCP. The incoming call can trigger the ISCP via the
AIN network on the basis of the ANI or Caller ID information to direct
call setup to the selected resource prior to connection of the caller.
This passive procedure is completely transparent to the caller.
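One way to picture the demographic lookup is a table keyed on the leading
digits of the calling number. The sketch below assumes a ten digit North
American number delivered as ANI or Caller ID, tries the area code plus
exchange (NNX) first, then the area code alone, and otherwise falls back
to a default resource. The table contents, the resource names and the
function resource_for_ani are hypothetical; an actual deployment would
hold this data in the ISCP rather than in a Python dictionary.

```python
import re

# Hypothetical demographic table: most specific key wins.
DEMOGRAPHIC_DB = {
    "212555": "resource_spanish",     # keyed by area code + NNX
    "212":    "resource_general_ny",  # keyed by area code only
}
DEFAULT_RESOURCE = "resource_general"

def resource_for_ani(ani, db=DEMOGRAPHIC_DB, default=DEFAULT_RESOURCE):
    """Pick a recognition resource from the calling number alone."""
    digits = re.sub(r"\D", "", ani)          # drop punctuation and spaces
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                  # strip a leading "1"
    if len(digits) != 10:
        return default                       # unusable number
    area_code, nnx = digits[:3], digits[3:6]
    return db.get(area_code + nnx) or db.get(area_code) or default
```

Because the lookup needs nothing but the calling number, it can run
before the caller is connected, in keeping with the passive step.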
Once the call is connected into a particular resource, a speech sample is
obtained which can be used to confirm that the call is in the correct
resource utilizing the appropriate models. If there is any question as to
the correctness of this solution, a direct question can be triggered to
obtain active caller participation. Thus the caller can be asked a
question which would require an answer tailored to permit more specific
language identification. In appropriate circumstances the caller may be
instructed to converse in what is tentatively established to be his/her
native language.
In addition to the foregoing it is a feature of the invention that the
intelligent recognition process can also detect behavioral information
such as anxiety, anger, inebriation, etc. This aspect of the invention
requires additional database data which may be provided for that purpose.
As a last resort, a caller can be connected to a live operator.
The foregoing discussion is directed to the situation in which a particular
call is directed to a single voice recognition resource selected on the
passive and/or active basis described above. However, in times of low
network traffic it is also a feature of the invention to process an
incoming call through multiple resources in parallel to provide maximum
reliability in recognition. For example, the involved telephone station,
particularly a public station, may include a more or less sophisticated
camera or optical/electronic device effective to accomplish lip reading,
or classify gender, or other physical characteristics of the caller.
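The parallel mode mentioned above leaves open how the outputs of several
resources are combined. One simple possibility, shown here purely as an
assumption, is a majority vote with a tie treated as no decision. The
recognizers are modeled as plain callables returning a recognized word or
None, and recognize_in_parallel is a hypothetical name.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def recognize_in_parallel(utterance, recognizers):
    """Run every recognizer on the utterance and return the majority result.

    Returns None when there is no usable result or the top two tie.
    """
    if not recognizers:
        return None
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        results = list(pool.map(lambda rec: rec(utterance), recognizers))
    votes = Counter(r for r in results if r is not None)
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None                      # tie: no reliable decision
    return ranked[0][0]
```

A None result in this sketch would be a natural point at which to fall
back on the active query or, as a last resort, a live operator.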
After speech recognition has been achieved according to the invention, the
resulting output signals may be utilized for any of a number of purposes,
such as in the directory assistance procedure illustrated and described in
relation to FIG. 1, or as a substitute for dialing where the desired
directory number is merely spoken by the caller. Still further, the high
reliability of the system makes possible enhanced services which would
permit a user to speak a predetermined identification word and then say
"home" or "office" to achieve automatic completion of a call to his/her
home or office.
Accordingly it is a primary object of the invention to provide an improved
system and method for accomplishing universal speech recognition in the
environment of a switched telephone network and most particularly a PSTN.
It is another object of the invention to provide a system and method for
accomplishing universal speech recognition for purposes of the transfer of
spoken intelligence as well as speaker authentication.
It is yet another object of the invention to provide an improved system and
method for accomplishing universal speech recognition on an efficient and
economic basis using features and technologies currently available to the
public switched telephone network.
It is another object of the invention to provide such a system using a two
step passive and active procedure wherein the passive