Claims
What is claimed is:
1. An automated transcription disambiguation method comprising the steps
of:
providing an input question having first and second words to a processor in
a form subject to misinterpretation by the processor;
generating a plurality of hypotheses with the processor, the hypotheses
including alternative interpretations of at least one of the first and
second words due to possible misinterpretations of the input question by
the processor;
producing with the processor an initial evaluation of the hypotheses;
gathering confirming evidence for the hypotheses by searching with the
processor in a text corpus for co-occurrences of hypothesized first and
second words for the hypotheses;
automatically and explicitly selecting with the processor from among the
plurality of hypotheses a preferred hypothesis as to both of the first and
second words based at least in part on the initial evaluation and at least
in part on the gathered confirming evidence; and
outputting a transcription result from the processor, the transcription
result representing the selected preferred hypothesis.
2. In the operation of a system comprising a processor, an input
transducer, an output facility, and a corpus comprising at least one
document comprising words represented in a first form, a method for
transcribing an input question by transforming the input question from a
sequence of words represented in a second form, subject to
misinterpretation by the processor, into a sequence of words represented
in the first form, the method comprising the steps of:
accepting the input question into the system, the question comprising a
sequence of words represented in the second form;
converting the input question into a signal with the input transducer;
converting the signal into a sequence of symbols with the processor;
generating a set of hypotheses from the sequence of symbols with the
processor, the hypotheses of the set comprising sequences of words
represented in the first form, the set of hypotheses including alternative
interpretations of at least one of the words to account for possible
misinterpretation of the input question;
producing with the processor an initial evaluation of the hypotheses;
automatically constructing a query from hypotheses of the set with the
processor;
executing the constructed query by searching with the processor in the
corpus for co-occurrences of hypothesized words for the hypotheses;
analyzing the co-occurrences and the initial evaluation with the processor
to produce a revised evaluation of the hypotheses of the set;
automatically and explicitly selecting a preferred hypothesis from the set
with the processor responsively to the revised evaluation, the preferred
hypothesis comprising a preferred sequence of words in the first form and
thus a preferred transcription of the sequence of words of the input
question; and
outputting the preferred hypothesis with the output facility.
3. The method of claim 2 wherein:
the corpus includes a plurality of documents;
the step of executing the constructed query includes retrieving documents
containing the co-occurrences;
the step of automatically and explicitly selecting the preferred hypothesis
further comprises selecting with the processor a preferred set of
documents, the preferred set of documents comprising a subset of the
retrieved documents that are relevant to the preferred hypothesis, and
the step of outputting the preferred hypothesis further comprises
outputting with the output facility at least a portion of a document
belonging to the preferred set of documents.
4. The method of claim 3 further comprising the steps, performed after the
step of outputting at least a portion of a document belonging to the
preferred set of documents, of:
accepting a relevance feedback input into the system, the relevance
feedback input comprising a sequence of words represented in the second
form, the sequence of words including a relevance feedback keyword and a
word that occurs in the outputted document;
converting the relevance feedback input into an additional query with the
processor; and
executing the additional query with the processor to retrieve an additional
document from the corpus.
5. The method of claim 2 wherein:
the step of automatically and explicitly selecting the preferred hypothesis
further comprises selecting a plurality of preferred hypotheses with the
processor; and
the step of outputting the preferred hypothesis further comprises
outputting the selected plurality of preferred hypotheses with the output
facility.
6. The method of claim 2 wherein:
the step of accepting an input question further comprises accepting
information into the system, the information concerning the locations of
word boundaries between words of the question; and
the step of converting the signal into a sequence of symbols further
comprises specifying subsequences of the sequence of symbols with the
processor according to the locations of word boundaries thus accepted.
7. The method of claim 2 wherein the step of generating a set of hypotheses
from the sequence of symbols further comprises generating hypothesized
locations of word boundaries with the processor.
8. The method of claim 2 wherein the step of converting the input question
into a signal comprises converting spoken input into an audio signal with
an audio transducer.
9. The method of claim 2 wherein the step of constructing a query from
hypotheses of the set comprises constructing a Boolean query with a
proximity constraint.
10. The method of claim 2 wherein the step of generating a set of
hypotheses from the sequence of symbols comprises detecting a keyword with
the processor to prevent inclusion of the keyword in hypotheses of the
set.
11. The method of claim 10 wherein the step of constructing a query from
hypotheses of the set comprises constructing a query from hypotheses of
the set with the processor, the query being responsive to the detected
keyword.
12. The method of claim 2 wherein the step of constructing a query from
hypotheses of the set comprises constructing an initial query with the
processor and prior to the outputting step automatically constructing a
reformulated query with the processor, the reformulated query comprising a
reformulation of the initial query.
13. The method of claim 2 wherein the step of outputting the preferred
hypothesis comprises visually displaying the preferred hypothesis.
14. The method of claim 2 wherein the step of outputting the preferred
hypothesis comprises synthesizing a spoken form of the preferred
hypothesis.
15. The method of claim 2 wherein the step of outputting the preferred
hypothesis comprises providing the preferred hypothesis to an applications
program.
16. The method of claim 15 further comprising the step of accepting the
preferred hypothesis into the applications program as textual input to the
applications program.
17. The method of claim 2 wherein the step of producing an initial
evaluation comprises determining an initial evaluation measurement for
each hypothesis.
18. In a system comprising a processor, a method for processing an input
utterance comprising speech, the method comprising the steps of:
accepting the input utterance into the system;
producing a phonetic transcription of the input utterance with the
processor;
responsively to the phonetic transcription, generating with the processor a
set of hypotheses, the hypotheses of the set being hypotheses as to a
first word contained in the input utterance and further as to a second
word contained in the input utterance, the set of hypotheses including
alternative interpretations of at least one of the words to account for
the error-prone nature of speech analysis;
determining with the processor an initial evaluation measurement for each
hypothesis;
automatically constructing an information retrieval query with the
processor, the query comprising the set of hypotheses and a proximity
constraint;
executing the constructed query in conjunction with an information
retrieval subsystem comprising a text corpus; and
responsively to the results of the executed query with respect to each
hypothesis of the set of hypotheses, and taking into consideration the
initial evaluation measurements of the hypotheses, automatically and
explicitly selecting with the processor from among the hypotheses of the
set a preferred hypothesis, the preferred hypothesis including the first
and second words.
19. The method of claim 18 wherein the step of generating a set of
hypotheses comprises matching portions of the phonetic transcription
against a phonetic index with the processor.
20. In a system comprising a processor, an error-prone input facility, and
an information retrieval subsystem, said information retrieval subsystem
comprising a natural-language text corpus, a method for accessing
documents of the corpus, the method comprising the steps of:
transcribing a question with the error-prone input facility and the
processor, the question comprising a sequence of words;
selecting a subset of words of the sequence with the processor;
forming with the processor a plurality of hypotheses about the selected
subset of words, the hypotheses of the plurality representing possible
alternative transcriptions of the question to account for the error-prone
nature of the input facility;
producing with the processor an initial evaluation of the hypotheses;
automatically constructing a co-occurrence query with the processor, the
co-occurrence query being based on hypotheses of the plurality;
executing the co-occurrence query in conjunction with the information
retrieval subsystem to retrieve a set of documents;
analyzing the initial evaluation and documents of the retrieved set with
the processor to produce a revised evaluation of the hypotheses;
responsively to the revised evaluation, automatically and explicitly
selecting with the processor a preferred hypothesis representing a
preferred transcription of the sequence of words of the question;
evaluating documents of the retrieved set with the processor with respect
to the selected hypothesis to determine a relevant document; and
outputting from the system the relevant document thus determined.
21. An automated system for producing a preferred transcription of a
question presented in a form prone to erroneous transcription, comprising:
a processor;
an input transducer, coupled to the processor, for accepting an input
question and producing a signal therefrom;
converter means, coupled to the input transducer, for converting the signal
to a string comprising a sequence of symbols;
hypothesis generation means, coupled to the converter means, for developing
a set of hypotheses from the string, each hypothesis of the set comprising
a sequence of word representations, the set of hypotheses representing a
set of possible alternative transcriptions of the input question to
account for the likelihood of erroneous transcription;
initial scoring means, coupled to the hypothesis generation means, for
determining an initial score for each hypothesis;
query construction means, coupled to the hypothesis generation means, for
automatically constructing at least one information retrieval query using
hypotheses of the set;
a corpus comprising documents, each document comprising word
representations;
query execution means, coupled to the query construction means and to the
corpus, for retrieving from the corpus documents responsive to said at
least one query;
analysis means, coupled to the query execution means, for generating an
analysis of the retrieved documents and evaluating the hypotheses of the
set based on the initial scores and the analysis to determine a preferred
hypothesis from among the hypotheses of the set, the preferred hypothesis
representing a preferred transcription of the sequence of words of the
input question; and
output means, coupled to the analysis means, for outputting the preferred
hypothesis.
22. A speech processing apparatus comprising:
input means for transducing a spoken utterance into an audio signal;
means for converting the audio signal into a sequence of phones;
means for analyzing the sequence of phones to generate a plurality of
hypotheses comprising sequences of words, the hypotheses representing
possible alternative transcriptions of the spoken utterance to account for
the error-prone nature of speech analysis;
means for determining an initial evaluation measurement for each
hypothesis;
means for automatically constructing a query using the hypotheses of the
plurality;
information retrieval means, coupled to a corpus of documents and to the
constructing means, for retrieving documents of the corpus relevant to the
constructed query;
means for automatically and explicitly ranking the hypotheses of the
plurality according to confirming evidence found in the retrieved
documents and further according to the initial evaluation measurements
previously determined; and
means for outputting a subset of the hypotheses thus ranked, each
hypothesis of the subset comprising a sequence of words representing a
possible transcription of the spoken utterance.
Description
COPYRIGHT NOTIFICATION
A portion of the disclosure of this patent document contains material which
is subject to copyright protection. The copyright owners have no objection
to the facsimile reproduction, by anyone, of the patent document or the
patent disclosure, as it appears in the patent and trademark office patent
file or records, but otherwise reserve all copyright rights whatsoever.
SOFTWARE APPENDIX
An appendix comprising 71 pages is included as part of this application.
The appendix provides two (2) files of a source code software program for
implementation of an embodiment of the method of the invention on a
digital computer.
The files reproduced in the appendix represent unpublished work that is
Copyright © 1993 Xerox Corporation. All rights reserved. Copyright
protection claimed includes all forms and matters of copyrightable
material and information now allowed by statutory or judicial law or
hereafter granted, including without limitation, material generated from
the software programs which are displayed on the screen such as icons,
screen display looks, etc.
BACKGROUND OF THE INVENTION
The present invention relates to systems and methods for transcribing words
from a form convenient for input by a human user, e.g., spoken or
handwritten words, into a form easily understood by an applications
program executed by a computer, e.g., text. In particular, it relates to
transcription systems and methods appropriate for use in conjunction with
computerized information-retrieval (IR) systems and methods, and more
particularly to speech-recognition systems and methods appropriate for use
in conjunction with computerized information-retrieval systems and methods
used with textual databases.
In prior art IR systems, the user typically enters input--either
natural-language questions, or search terms connected by specialized
database commands--by typing at a keyboard. Few IR systems permit the user
to use speech input, that is, to speak questions or search strings into a
microphone or other audio transducer. Systems that do accept speech input
do not directly use the information in a database of free-text
natural-language documents to facilitate recognition of the user's input
speech.
The general problem of disambiguating the words contained in an error-prone
transcription of user input arises in a number of contexts beyond speech
recognition, including but not limited to handwriting recognition in
pen-based computers and personal digital assistants (e.g., the Apple
Newton) and optical character recognition. Transcription of user input
from a form convenient to the user into a form convenient for use by the
computer has any number of applications, including but not limited to word
processing programs, document analysis programs, and, as already stated,
information retrieval programs. Unfortunately, computerized transcription
tends to be error-prone.
SUMMARY OF THE INVENTION
The present invention provides a technique for using information retrieved
from a text corpus to automatically disambiguate an error-prone
transcription, and more particularly provides a technique for using
co-occurrence information in the corpus to disambiguate such input.
According to the invention, a processor accepts an input question. The
processor is used to generate a hypothesis, typically as to a first word
and a second word in the input question, and then is used to gather
confirming evidence for the hypothesis by seeking a co-occurrence of the
first word and the second word in a corpus.
In one aspect, the present invention provides a system and method for
automatically transcribing an input question from a form convenient for
user input into a form suitable for use by a computer. The question is a
sequence of words represented in a form convenient for the user, such as a
spoken utterance or a handwritten phrase. The question is transduced into
a signal that is converted into a sequence of symbols. A set of hypotheses
is generated from the sequence of symbols. The hypotheses are sequences of
words represented in a form suitable for use by the computer, such as
text. One or more information retrieval queries are constructed and
executed to retrieve documents from a corpus (database). Retrieved
documents are analyzed to produce an evaluation of the hypotheses of the
set and to select one or more preferred hypotheses from the set. The
preferred hypotheses are output to a display, speech synthesizer, or
applications program. Additionally, retrieved documents relevant to the
preferred hypotheses can be selected and output.
In another aspect, the invention provides a system and method for
retrieving information from a corpus of natural-language text in response
to a question or utterance spoken by a user. The invention uses
information retrieved from the corpus to help it properly interpret the
user's question, as well as to respond to the question.
The invention takes advantage of the observation that the intended words in
a user's question usually are semantically related to each other and thus
are likely to co-occur in a corpus within relatively close proximity of
each other. By contrast, words in the corpus that spuriously match
incorrect phonetic transcriptions are much less likely to be semantically
related to each other and thus less likely to co-occur within close
proximity of each other. The invention retrieves from the corpus those
segments of text or documents that are most relevant to the user's
question by hypothesizing what words the user has spoken based on a
somewhat unreliable, error-prone phonetic transcription of the user's
spoken utterance, and then searching for co-occurrences of these
hypothesized words in documents of the corpus by executing Boolean queries
with proximity and order constraints. Hypotheses that are confirmed by
query matching are considered to be preferred interpretations of the words
of the user's question, and the documents in which they are found are
considered to be of probable relevance to the user's question.
A further understanding of the nature and advantages of the invention will
become apparent by reference to the remaining portions of the
specification and drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a system that embodies the invention;
FIG. 2 schematically depicts information flow in a system according to a
first specific embodiment of the invention;
FIG. 3 is a flowchart of method steps carried out according to a first
specific embodiment of the invention;
FIG. 4 illustrates a conceptual model of a portion of a phonetic index;
FIG. 5 is a flowchart of steps for phonetic index matching;
FIG. 6 is a flowchart of steps for query reformulation;
FIG. 7 is a flowchart of steps for scoring;
FIG. 8 schematically depicts an example of information flow in a system
according to a second specific embodiment of the invention;
FIG. 9 is a flowchart of method steps carried out according to a second
specific embodiment of the invention;
FIG. 10 illustrates a system in which the invention is used as a "front
end" speech-recognizer component module in the context of a
non-information-retrieval application; and
FIG. 11 is a specific embodiment that is adaptable to a range of input
sources, hypothesis generation mechanisms, query construction mechanisms,
and analysis techniques.
DESCRIPTION OF SPECIFIC EMBODIMENTS
The disclosures in this application of all articles and references,
including patent documents, are incorporated herein by reference.
1. Introduction
The invention will be described in sections 1 through 6 with respect to
embodiments that accept user input in the form of spoken words and that
are used in information retrieval (IR) contexts. In these embodiments, the
invention enables a person to use spoken input to access information in a
corpus of natural-language text, such as contained in a typical IR system.
The user is presented with information (e.g., document titles, position in
the corpus, words in documents) relevant to the input question. Some of
these embodiments can incorporate relevance feedback.
The invention uses information, particularly co-occurrence information,
present in the corpus to help it recognize what the user has said. The
invention provides robust performance in that it can retrieve relevant
information from the corpus even if it does not recognize every word of
the user's utterance or is uncertain about some or all of the words.
A simple example illustrates these ideas. Suppose that the corpus comprises
a database of general-knowledge articles, such as the articles of an
encyclopedia, and that the user is interested in learning about President
Kennedy. The user speaks the utterance, "President Kennedy," which is
input into the invention. The invention needs to recognize what was said
and to retrieve appropriate documents, that is, documents having to do
with President Kennedy. Suppose further that it is unclear whether the
user has said "president" or "present" and also whether the user has said
"Kennedy" or "Canada." The invention performs one or more searches in the
corpus to try to confirm each of the following hypotheses, and at the same
time, to try to gather documents that are relevant to each hypothesis:
______________________________________
president kennedy
present kennedy
president canada
present canada
______________________________________
The corpus is likely to include numerous articles that contain phrases such
as "President Kennedy," "President John F. Kennedy," and the like. Perhaps
it also includes an article on "present-day Canada," and an article that
contains the phrase "Kennedy was present at the talks . . . ." It does not
include any article that contains the phrase "president of Canada"
(because Canada has a prime minister, not a president).
The invention assumes that semantically related words in the speaker's
utterance will tend to appear together (co-occur) more frequently in the
corpus. Put another way, the invention assumes that the user has spoken
sense rather than nonsense, and that the sense of the user's words is
reflected in the words of articles of the corpus. Thus the fact that
"President Kennedy" and related phrases appear in the corpus much more
frequently than phrases based on any of the other three hypotheses
suggests that "President Kennedy" is the best interpretation of the user's
utterance and that the articles that will most interest the user are those
that contain this phrase and related phrases. Accordingly, the invention
assigns a high score to the articles about President Kennedy and assigns
lower scores to the article about present-day Canada and the article about
Kennedy's presence at the talks. The highest-scoring articles can be
presented to the user as a visual display on a computer screen, as phrases
spoken by a speech synthesizer, or both. Optionally, the user can make
additional utterances directing the invention to retrieve additional
documents, narrow the scope of the displayed documents, and so forth, for
example, "Tell me more about President Kennedy and the Warren Commission."
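The President Kennedy example above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program of the software appendix: the toy corpus, the five-word proximity window, the function names cooccur and score_hypotheses, and the use of a plain count of confirming documents as the score are all assumptions made for the example.

```python
from itertools import product

# Toy corpus standing in for encyclopedia articles (hypothetical data).
CORPUS = [
    "president kennedy was elected in 1960",
    "president john f kennedy addressed the nation",
    "an overview of present day canada",
    "kennedy was present at the talks",
]

def cooccur(words, text, window=5):
    """Return True if all words occur in text within `window` word
    positions of each other, in any order (assumed proximity rule)."""
    tokens = text.split()
    positions = [[i for i, t in enumerate(tokens) if t == w] for w in words]
    if any(not p for p in positions):
        return False  # at least one hypothesized word is absent
    return any(max(c) - min(c) < window for c in product(*positions))

def score_hypotheses(alternatives, corpus, window=5):
    """For every combination of word alternatives, count the documents
    in which the hypothesized words co-occur."""
    scores = {}
    for hypothesis in product(*alternatives):
        scores[hypothesis] = sum(cooccur(hypothesis, doc, window) for doc in corpus)
    return scores

# Two uncertain words, each with two candidate interpretations.
alternatives = [("president", "present"), ("kennedy", "canada")]
scores = score_hypotheses(alternatives, CORPUS)
best = max(scores, key=scores.get)
```

With this toy corpus, the hypothesis "president kennedy" is confirmed by two documents, "present kennedy" and "present canada" by one each, and "president canada" by none, so best is ("president", "kennedy"). A fuller treatment would also weigh the initial evaluation of each hypothesis, as the claims recite.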
The present invention finds application in information retrieval systems
with databases comprising free (unpreprocessed) natural-language text. It
can be used both in systems that recognize discrete spoken words and in
systems that recognize continuous speech. It can be used in systems that
accommodate natural-language utterances, Boolean/proximity queries,
special commands, or any combination of these.
More generally, the invention finds application in speech-recognition
systems regardless of what they are connected to. A speech recognizer that
embodies or incorporates the method of the invention with an appropriate
corpus or corpora can be used as a "front end" to any application program
where speech recognition is desired, such as, for example, a
word-processing program. In this context, the invention helps the
application program "make more sense" of what the user is saying and
therefore make fewer speech-recognition mistakes than it would otherwise.
This is discussed further in section 7 below.
Still more generally, the invention finds application beyond
speech-recognition in handwriting recognition, optical character
recognition, and other systems in which a user wishes to input words into
a computer program in a form that is convenient for the user but easily
misinterpreted by the computer. This is discussed further in section 8
below. The technique of the present invention, in which a sequence of
words supplied by a user and transcribed by machine in an error-prone
fashion is disambiguated and/or verified by automatically formulating
alternative hypotheses about the correct or best interpretation, gathering
confirming evidence for these hypotheses by searching a text corpus for
occurrences and co-occurrences of hypothesized words, and analyzing the
search results to evaluate which hypothesis or hypotheses best represents
the user's intended meaning, is referred to as semantic co-occurrence
filtering.
2. Glossary
The following terms are intended to have the following general meanings:
Corpus: A body of natural language text to be searched, used by the
invention. Plural: corpora.
Document match: The situation where a document satisfies a query.
FSM, finite-state recognizer: A device that receives a string of symbols
as input, computes for a finite number of steps, and halts in some
configuration signifying that the input has been accepted or else that it
has been rejected.
Hypothesis: A guess at the correct interpretation of the words of a user's
question, produced by the invention.
Inflected form: A form of a word that has been changed from the root form
to mark such distinctions as case, gender, number, tense, person, mood, or
voice.
Information retrieval, IR: The accessing and retrieval of stored
information, typically from a computer database.
Keyword: A word that receives special treatment when input to the
invention; for example, a common function word or a command word.
Match sentences: Sentences in a document that cause or help cause the
document to be retrieved in response to a query. Match sentences contain
phrases that conform to the search terms and constraints specified in the
query.
Orthographic: Pertaining to the letters in a word's spelling.
Phone: A member of a collection of symbols that are used to describe the
sounds uttered when a person pronounces a word.
Phonetic transcription: The process of transcribing a spoken word or
utterance into a sequence of constituent phones.
Query: An expression that is used by an information retrieval system to
search a corpus and return text that matches the expression.
Question: A user's information need, presented to the invention as input.
Root form: The uninflected form of a word; typically, the form that appears
in a dictionary citation.
Utterance: Synonym for question in embodiments of the invention that accept
spoken input.
Word index: A data structure that associates words found in a corpus with
all the different places such words exist in the corpus.
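As an illustration of the word index entry above, the following minimal Python sketch builds such a data structure for a two-document corpus. The function name build_word_index, the (document number, word position) representation of a "place," and the lowercasing of words are assumptions made for the example; the specification does not prescribe a particular layout.

```python
from collections import defaultdict

def build_word_index(documents):
    """Map each word to a list of (document id, word position) pairs,
    one pair for every place the word occurs in the corpus."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

docs = ["Kennedy was present at the talks", "President Kennedy spoke"]
index = build_word_index(docs)
# index["kennedy"] == [(0, 0), (1, 1)]: the word occurs once in each document
```

Keeping word positions, and not merely document numbers, is what would allow proximity and order constraints to be checked without rescanning the documents.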
3. System Components
Certain system components that are common to the specific embodiments of
the invention described in sections 4, 5, and 6 will now be described.
FIG. 1 illustrates a system 1 that embodies the present invention. System 1
comprises a processor 10 coupled to an input audio transducer 20, an
output visual display 30, an optional output speech synthesizer 31, and an
information retrieval (IR) subsystem 40 which accesses documents from
corpus 41 using a word index 42. Also in system 1 are a phonetic
transcriber 50, a hypothesis generator 60, a phonetic index 62, a query
constructor 70, and a scoring mechanism 80. Certain elements of system 1
will now be described in more detail.
Processor 10 is a central processing unit (CPU). Typically it is part of a
mainframe, workstation, or personal computer. It can comprise multiple
processing elements in some embodiments.
Transducer 20 converts a user's spoken utterance into a signal that can be
processed by processor 10. Transducer 20 can comprise a microphone coupled
to an analog-to-digital converter, so that the user's speech is converted
by transducer 20 into a digital signal. Transducer 20 can further comprise
signal-conditioning equipment including components such as a preamplifier,
a pre-emphasis filter, a noise reduction unit, a device for analyzing
speech spectra (e.g., by Fast Fourier Transform), or other audio signal
processing devices in some embodiments. Such signal-conditioning equipment
can help to eliminate or minimize spurious or unwanted components from the
signal that is output by transducer 20, or provide another representation
(e.g., spectral) of the signal.
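By way of illustration only, a pre-emphasis filter of the kind mentioned above is commonly realized as a first-order difference, y[n] = x[n] - a*x[n-1]. The Python sketch below assumes this form and a coefficient of 0.97, a value often used in speech processing; neither the form nor the coefficient is specified in this description, and the function name pre_emphasis is hypothetical.

```python
def pre_emphasis(samples, coefficient=0.97):
    """Apply y[n] = x[n] - coefficient * x[n-1] to a list of samples.
    The first sample is passed through unchanged."""
    if not samples:
        return []
    out = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        out.append(current - coefficient * previous)
    return out

# A slowly varying (here constant) input is strongly attenuated ...
constant = pre_emphasis([1.0, 1.0, 1.0, 1.0])
# ... while a rapidly alternating input is amplified.
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

After the first sample, the constant input yields values near 0.03, whereas the alternating input yields values of magnitude near 1.97, which shows the boost given to high-frequency components of the signal.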
Display 30 provides visual output to the user, for example, alphanumeric
display of the texts or titles of documents retrieved from corpus 41.
Typically, display 30 comprises a computer screen or monitor.
Speech synthesizer 31 optionally can be included in system 1 to provide
audio output, for example, to read aloud portions of retrieved documents
to the user. Speech synthesizer 31 can comprise speech synthesis hardware,
support software executed by CPU 10, an audio amplifier, and a speaker.
IR subsystem 40 incorporates a processor that can process queries to search
for documents in corpus 41. It can use processor 10 or, as shown in FIG.
1, can have its own processor 43. IR subsystem 40 can be located at the
same site as processor 10 or can be located at a remote site and connected
to processor 10 via a suitable communication network.
Corpus 41 comprises a database of documents that can be searched by IR
subsystem 40. The documents comprise natural-language texts, for example,
books, articles from newspapers and periodicals, encyclopedia articles,
abstracts, office documents, etc.
It is assumed that corpus 41 has been indexed to create word index 42, and
that corpus 41 can be searched by IR subsystem 40 using queries that
comprise words (search terms) of word index 42 with Boolean operators and
supplemental proximity and order constraints expressible between the
words. This functionality is provided by many known IR systems. Words in
word index 42 can correspond directly to their spellings in corpus 41, or
as is often the case in IR systems, can be represented by their root
(uninflected) forms.
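The kind of search assumed here can be sketched as follows. This Python fragment is a simplified illustration, not the interface of any particular IR system: the root-form table ROOTS, the function names, and the reading of the proximity constraint as a maximum difference in word positions are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import product

# Hypothetical root-form table; a real system would use a morphological analyzer.
ROOTS = {"presidents": "president", "talks": "talk"}

def root(word):
    """Return the root (uninflected) form of a word, lowercased."""
    return ROOTS.get(word.lower(), word.lower())

def build_index(documents):
    """Word index keyed by root form: root -> {document id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(documents):
        for pos, word in enumerate(text.split()):
            index[root(word)][doc_id].append(pos)
    return index

def search(index, terms, max_span, ordered=False):
    """Return ids of documents that contain every term (Boolean AND) with
    all terms within max_span word positions of each other and, if
    ordered is True, in the order given."""
    postings = [index.get(root(t), {}) for t in terms]
    candidates = set(postings[0]).intersection(*postings[1:])
    hits = []
    for doc_id in sorted(candidates):
        for combo in product(*(p[doc_id] for p in postings)):
            in_order = all(a < b for a, b in zip(combo, combo[1:]))
            if max(combo) - min(combo) <= max_span and (in_order or not ordered):
                hits.append(doc_id)
                break
    return hits

docs = [
    "Kennedy was present at the talks",
    "President Kennedy spoke",
    "the presidents met Kennedy",
]
index = build_index(docs)
loose = search(index, ["president", "kennedy"], max_span=2)                   # [1, 2]
tight = search(index, ["president", "kennedy"], max_span=1)                   # [1]
reverse = search(index, ["kennedy", "president"], max_span=2, ordered=True)   # []
```

Because the index is keyed by root forms, the query term "president" also matches the inflected form "presidents" in the third document.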
Transcriber 50, hypothesis generator 60, phonetic index 62, query
constructor 70, and scoring mechanism 80 are typically implemented as
software modules executed by processor 10. The operation and function of
these modules is described more fully below for specific embodiments, in
particular with reference to the embodiments of FIGS. 2 and 8. It will be
observed that corresponding elements in FIGS. 1, 2, 8, and 10 are
similarly numbered.
3.1 Query Syntax
It is assumed that IR subsystem 40 can perform certain IR query operations.
IR queries are formulated in a query language that expresses Boolean,
proximity, and ordering or sequence relationships between search terms in
a form understandable by IR subsystem 40. For purposes of discussion the
query language is represented as follows:
__________________________________________________________________________
term represents the single search term term. A
term can be an individual word or in some
cases another query.
<p term1 term2 . . . >
represents strict ordering of terms. The IR
subsystem determines that a document matches
this query if and only if all the terms
enclosed in the angle brackets appear in the