|
|
|
| United States Patent | 5418948 |
| Link to this page | http://www.wikipatents.com/5418948.html |
| Inventor(s) | Turtle; Howard R. (Woodbury, MN) |
| Abstract | A computer implemented process for creating a search query for an
information retrieval system in which a database is provided containing a
plurality of stopwords and phrases. A natural language input query defines
the composition of the text of documents to be identified. Each word of
the natural language input query is compared to the database in order to
remove stopwords from the query. The remaining words of the input query
are stemmed to their basic roots, and the sequence of stemmed words in the
list is compared to phrases in the database to identify phrases in the
search query. The phrases are substituted for the sequence of stemmed
words from the list so that the remaining elements, namely the substituted
phrases and unsubstituted stemmed words, form the search query. The
completed search query elements are query nodes of a query network used to
match representation nodes of a document network of an inference network.
The database includes as options a topic and key database for finding
numerical keys, and a synonym database for finding synonyms, both of which
are employed in the query as query nodes. |
|
|
|
Title Information  |
|
|
|
|
|
Drawing from US Patent 5418948 |
|
|
Concept matching of natural language queries with a database of document
concepts |
|
|
|
|
|
| Publication Date |
*
May 23, 1995 |
|
|
|
|
|
| Filing Date |
September 8, 1993 |
|
|
|
|
|
|
|
|
|
|
|
| Parent Case |
This is a continuation of application Ser. No. 07/773,101, filed Oct. 8,
1991, now U.S. Pat. No. 5,265,065 granted Nov. 23, 1993. |
|
|
|
|
|
|
|
|
|
|
|
|
|
Title Information  |
|
|
References  |
|
|
| *references marked with an asterisk below are user-added references |
|
U.S. References |
|
|
| Add a new US reference: |
| | Reference | Relevancy | Comments | Reference | Relevancy | Comments | 5317507 Gallant 715/532 May,1994 |      Your vote accepted [0 after 0 votes] | | 5301109 Landauer 704/9 Apr,1994 |      Your vote accepted [0 after 0 votes] | | 5297042 Morita
Mar,1994 |      Your vote accepted [0 after 0 votes] | | 5265065 Turtle 707/4 Nov,1993 |      Your vote accepted [0 after 0 votes] | | 5255386 Prager 707/5 Oct,1993 |      Your vote accepted [0 after 0 votes] | | 5251131 Masand 704/9 Oct,1993 |      Your vote accepted [0 after 0 votes] | | 5220625 Hatakeyama 715/809 Jun,1993 |      Your vote accepted [0 after 0 votes] | | 5123103 Ohtaki 707/5 Jun,1992 |      Your vote accepted [0 after 0 votes] | | 5117349 Tirfing 707/3 May,1992 |      Your vote accepted [0 after 0 votes] | | 5109509 Katayama 704/9 Apr,1992 |      Your vote accepted [0 after 0 votes] | | 5099425 Kanno: Yuji (Kawasaki, JP) 704/9 Mar,1992 |      Your vote accepted [0 after 0 votes] | | 4991087 Burkowski 707/3 Feb,1991 |      Your vote accepted [0 after 0 votes] | | 4974191 Amirghodsi 704/8 Nov,1990 |      Your vote accepted [0 after 0 votes] | | 4972349 Kleinberger 707/1 Nov,1990 |      Your vote accepted [0 after 0 votes] | | 4931935 Ohira 704/8 Jun,1990 |      Your vote accepted [0 after 0 votes] | | 4918588 Barrett 707/10 Apr,1990 |      Your vote accepted [0 after 0 votes] | | 4914590 Loatman 704/8 Apr,1990 |      Your vote accepted [0 after 0 votes] | | 4868750 Kucera 704/8 Sep,1989 |      Your vote accepted [0 after 0 votes] | | 4862408 Zamora 707/102 Aug,1989 |      Your vote accepted [0 after 0 votes] | | 4839853 Deerwester
Jun,1989 |      Your vote accepted [0 after 0 votes] | | 4823306 Barbic 707/5 Apr,1989 |      Your vote accepted [0 after 0 votes] | | 4787035 Bourne 700/247 Nov,1988 |      Your vote accepted [0 after 0 votes] | | 4775956 Kaji 704/7 Oct,1988 |      Your vote accepted [0 after 0 votes] | | 4706212 Toma 704/2 Nov,1987 |      Your vote accepted [0 after 0 votes] | | 4688195 Thompson 706/11 Aug,1987 |      Your vote accepted [0 after 0 votes] | | 4670848 Schramm 706/62 Jun,1987 |      Your vote accepted [0 after 0 votes] | | 4580218 Raye 707/1 Apr,1986 |      Your vote accepted [0 after 0 votes] | | 4554631 Reddington 707/4 Nov,1985 |      Your vote accepted [0 after 0 votes] | | 4499553 Dickinson 715/533 Feb,1985 |      Your vote accepted [0 after 0 votes] | | 4471459 Dickinson 715/533 Sep,1984 |      Your vote accepted [0 after 0 votes] | | 4384329 Rosenbaum 704/10 May,1983 |      Your vote accepted [0 after 0 votes] | | 4358824 Glickman 707/5 Nov,1982 |      Your vote accepted [0 after 0 votes] | | 4270182 Asija 704/8 May,1981 |      Your vote accepted [0 after 0 votes] | | 4241402 Mayper, Jr. 707/6 Dec,1980 |      Your vote accepted [0 after 0 votes] | | |
|
|
|
|
U.S. References |
|
|
Foreign References |
|
|
|
|
|
|
Foreign References |
|
|
Other References |
|
|
|
|
|
|
Other References |
|
|
|
|
|
References  |
|
|
Claims  |
|
|
What is claimed is:
1. A computer-implemented process for forming a search query for searching
a document database by a computer-implemented search process, the search
process identifying documents likely to match the search query by matching
individual terms of the search query to individual terms and sequences of
terms in the document database, the process for forming the search query
comprising:
a) providing a first database containing a plurality of phrases derived
from domain specific natural language phrases, each of said phrases
consisting of a plurality of terms in original order;
b) input to a computer an input query composed in natural language and
comprising a plurality of terms arranged in a user-selected order;
c) parsing said input query into separate terms in an ordered sequence, the
order of the terms in the sequence being the same as the order of the
terms in the input query;
d) selecting groups of terms, each group consisting of a plurality of
successive terms of the sequence;
e) comparing each group of terms to each phrase in said first database to
identify each group of terms of the input query that matches a phrase in
said first database; and
f) replacing each identified group of terms of the input query by a
representation of the matching phrase from said first database, the search
query comprising each representation substituted for groups of terms of
the input query and each remaining term of the input query.
2. A computer-implemented process for forming a search query according to
claim 1 further including providing a second database containing a
plurality of topics having a descriptive topical text and an associated
unique numerical key, each topical text being composed of a plurality of
terms, comparing the terms of the input query or the search query to each
of the terms of the topical texts in the second database, assigning a
statistical weight to each topical text reflecting the probability that
the topical text matches the query, ranking the topical texts based on the
statistical weight, and inserting into the search query the numerical keys
associated with up to n highest ranked topical texts, where n is a
predetermined integer.
3. A computer-implemented process for forming a search query according to
claim 2 wherein the step of inserting the numerical keys into the search
query includes comparing the statistical weights of the topical texts to a
predetermined threshold, and inserting the numerical keys into the search
query which are associated with topical texts having statistical weights
which exceed the predetermined threshold.
4. A computer-implemented process for forming a search query according to
claim 2 wherein the statistical weight for each topical text is determined
by comparing each term of the query to each term of the topical text,
determining the probability that the query term is a correct descriptor to
the topical text in accordance with the relationship
P(c.sub.i .vertline.d.sub.j)=0.4+0.6.multidot.idf.sub.i
.multidot.tf.sub.ij,
where idf.sub.i is based on the frequency of texts in the second database
containing the query term and tf.sub.ij is based on the frequency with
which the query term appears in the respective topical text, and for each
topical text adding the probabilities for all terms of the query and
normalizing the sum of the probabilities by the number of terms in the
query.
5. A computer-implemented process for forming a search query according to
clam 1 wherein the input query may include one or more groups of terms
forming citations, each citation including numerical terms, said process
further includes:
g) identifying each group of terms forming a citation in said input query,
and
h) replacing each identified group of terms forming a citation by a
citation word which comprises a representation of the citation.
6. A computer-implemented process for forming a search query according to
claim 5 wherein the citation word comprises the numerical terms of the
group of terms forming the citation and a predetermined word-level
proximity number.
7. A computer-implemented process for forming a search query according to
claim 1 further including before step f, removing stopwords from the input
query.
8. A computer system for forming a search query according to claim 7
wherein the first database further includes a plurality of stopwords,
fifth comparing means for comparing each term in said register means to
the stopwords in the first database, and deleting means responsive to the
fifth comparing means for deleting each term from said register means that
matches a stopword.
9. A computer implemented process for forming a search query according to
clam 1 further including, before step (f) for each identified group of
terms, identifying those terms which are shared by two successive
identified groups of terms, and assigning the shared term to only one of
the two successive groups.
10. A computer system for forming a search query according to claim 7
further including third processing means for identifying those terms which
are shared by two successive identified groups of terms and assigning the
shared term to only one of the two successive groups.
11. A computer-implemented process for forming a search query according to
claim 1 further including stemming the terms of said input query.
12. A computer-implemented process according to claim 1 further including,
after step
g) comparing each term and representation of the search query to individual
terms of a document database containing representations of the contents of
texts of a plurality of documents,
h) identifying the number of occurrences of respective terms,
representations and partial representations of the search query in the
representations for each document,
i) assigning a statistical weight to individual documents based on each
occurrence of respective terms, representations and partial
representations of the search query in the representations for each
document, and
j) identifying the probability that the document matches the search query
by summing the statistical weights.
13. A computer-implemented process according to claim 12 wherein the
statistical weight for each occurrence of a representation in a document
matching a part of a representation of the search query is a fraction of
the statistical weight for an occurrence of a representation in the
document that matches the corresponding full representation of the search
query.
14. A computer-implemented process according to claim 1 further including,
after step f,
g) comparing each term and representation of the search query to individual
terms of a document database containing representations of the contents of
texts of a plurality of documents,
h) identifying terms of a document that at least partially match a
representation of the search query, and
i) assigning a statistical weight to the document based on the number of
occurrences of the partially matched terms in the document.
15. A computer system for forming a search query for searching a document
database by a computer-implemented search process, the search process
identifying documents likely to match the search query by matching
individual terms of the search query to individual terms and sequences of
terms in the document database, said system comprising:
a) a first database consisting of a plurality of phrases derived from
domain specific natural language phrases, each of said phrases consisting
of a plurality of terms in original order;
b) register means for storing an input query composed in natural language,
the input query comprising a plurality of terms arranged in a
user-selected order;
c) parsing means responsive to said register means for parsing said input
query into separate terms;
d) first processing means for forming an ordered sequence of terms, the
order of the terms being the same as the order of the terms in the input
query;
e) selecting means for selecting groups of terms, each group consisting of
a plurality of successive terms of the sequence;
f) first comparing means for comparing each group of terms in said register
means to each phrase in said first database to identify each group of
terms in the register means which matches a phrase in said first database;
and
g) second processing means for replacing each identified group of terms in
said register means by a representation of the matching phrase in said
first database.
16. A computer system for forming a search query according to claim 15
wherein said read only memory further contains a second database
consisting of a plurality of topics each having a descriptive topical text
and an associated unique numerical key, each topical text being composed
of a plurality of terms, second comparing means for comparing the terms of
the input query or the search query to each of the terms of the topical
texts in the second database, third processing means for assigning a
statistical weight to each topical text reflecting the probability that
the topical text matches the query, ranking means for ranking the topical
texts based on the statistical weight, said register means being
responsive to the ranking means to store the numerical keys associated
with up to n highest ranked topical texts, where n is a predetermined
integer.
17. A computer system for forming a search query according to claim 16
further including third comparing means for comparing the statistical
weight of the topical texts to a predetermined threshold, said register
means being responsive to the third comparing means to store numerical
keys which are associated with topical texts having statistical weights
which exceed the predetermined threshold.
18. A computer system for forming a search query according to claim 16
further including fourth comparing means for comparing each term of the
query to each term of the topical text, fourth processing means for
determining the probability that the query term is a correct descriptor of
the topical text in accordance with the relationship
P(c.sub.i .vertline.d.sub.j)=0.4+0.6.multidot.idf.sub.i
.multidot.tf.sub.ij,
where idf.sub.i is based on the frequency of texts in the second database
containing the query term and tf.sub.ij is based on the frequency with
which the query term appears in the respective topical text, adding means
for adding for each topical text the probabilities for all terms of the
query, and normalizing means responsive to the adding means for
normalizing the sum of the probabilities by the number of terms in the
query.
19. A computer system for forming a search query according to claim 15
wherein said input query may include one or more groups of terms forming
citations, each citation having numerical terms said computer system
further including:
h) fifth processing means for identifying each group of terms forming a
citation in said input query, and
i) sixth processing means for replacing each identified group of terms
forming a citation by a citation word which comprises a representation of
the citation.
20. A computer system for forming a search query according to claim 19
wherein the citation word formed by the sixth processing means comprises
the numerical terms of the group of terms forming the citation and a
predetermined word-level proximity number.
21. A computer system for forming a search query according to claim 15
further including means for stemming the terms of said input query.
22. A computer system according to claim 15 further including,
h) a second database containing representations of the contents of texts of
a plurality of documents, each of said representations comprising a
plurality of terms,
i) fifth comparing means responsive to the second processing means and the
second database for comparing each term and representation of the search
query to individual terms of the second database,
j) seventh processing means responsive to the fifth comparing means for
identifying the number of occurrences of respective terms, representations
and partial representations of the search query in the representations for
each document,
k) eighth processing means responsive to the seventh processing means for
assigning a statistical weight to individual documents based on each
occurrence of respective terms, representations and partial
representations of the search query in the representations for each
document, and
l) summing means responsive to the eighth processing means for identifying
the probability that the document matches the search query by summing the
statistical weights.
23. A computer system according to claim 22 wherein the eighth processing
means assigns a statistical weight for each occurrence of a representation
in a document matching a part of a representation of the search query as a
fraction of the statistical weight for an occurrence of a representation
in the document that matches the corresponding full representation of the
search query.
24. A computer system according to claim 15 further including
h) a second database containing representations of the contents of texts of
a plurality of documents, each of said representations comprising a
plurality of terms,
i) fifth comparing means responsive to the second processing means and the
second database for comparing each term and representation of the search
query to individual terms of the second database,
j) seventh processing means responsive to the fifth comparing means for
identifying terms of a document that at least partially match a
representation of the search query, and
k) eighth processing means responsive to the seventh processing means for
assigning a statistical weight to the document based on each occurrence of
matched terms in the document.
25. A computer-implemented process for forming a search query for searching
a document database by a computer-implemented search process, the search
process identifying documents likely to match the search query by matching
individual terms of the search query to individual terms and sequences of
terms in the document database, the process for forming the search query
comprising:
a) providing a database containing a plurality of topics each having a
descriptive topical text and an associated unique numerical key, each
topical text being composed of a plurality of terms;
b) input to a computer an input query composed in natural language;
c) comparing the terms of the input query or the search query to each of
the terms of the topical texts in the database;
d) assigning a statistical weight to each topical text reflecting the
probability that the topical text matches the query;
e) ranking the topical texts based on the statistical weight; and
f) inserting into the search query the numerical keys associated with up to
n highest ranked topical texts, where n is a predetermined integer.
26. A computer-implemented process for forming a search query according to
claim 25 wherein the step of inserting the numerical keys into the search
query includes comparing the statistical weights of the topical texts to a
predetermined threshold, and inserting the numerical keys into the search
query which are associated with topical texts having statistical weights
which exceed the predetermined threshold.
27. A computer-implemented process for forming a search query according to
claim 25 wherein the statistical weight for each topical text is
determined by comparing each term of the query to each term of the topical
text, determining the probability that the query term is a correct
descriptor of the topical text in accordance with the relationship
P(c.sub.i .vertline.d.sub.j)=0.4+0.6.multidot.idf.sub.i
.multidot.tf.sub.ij,
where idf.sub.i is based on the frequency of texts in the database
containing the query term and tf.sub.ij is based on the frequency with
which the query term appears in the respective topical text, and for each
topical text adding the probabilities for all terms of the query and
normalizing the sum of the probabilities by the number of terms in the
query.
28. A computer system for forming a search query for searching a document
database by a computer-implemented search process, the search process
identifying documents likely to match the search query by matching
individual terms of the search query to individual terms and sequences of
terms in the document database, said system comprising:
a) a read only memory containing a database consisting of a plurality of
topics each having a descriptive topical text and an associated unique
numerical key, each topical text being composed of a plurality of terms;
b) register means for storing an input query composed in natural language,
the input query comprising a plurality of terms arranged in a
user-selected order;
c) first comparing means for comparing the terms of the input query or the
search query to each of the terms of the topical texts in the database;
e) first processing means for assigning a statistical weight to each
topical text reflecting the probability that the topical text matches the
query; and
f) ranking means for ranking the topical texts based on the statistical
weight, said register means being responsive to the ranking means to store
the numerical keys associated with up to n highest ranked topical texts,
where n is a predetermined integer.
29. A computer system for forming a search query according to claim 28
further including second comparing means for comparing the statistical
weight of the topical texts to a predetermined threshold, said register
means being responsive to the second comparing means to store numerical
keys which are associated with topical texts having statistical weights
which exceed the predetermined threshold.
30. A computer system for forming a search query according to claim 28
further including third comparing means for comparing each term of the
query to each term of the topical text, second processing means for
determining the probability that the query term is a correct descriptor of
the topical text in accordance with the relationship
P(c.sub.i .vertline.d.sub.j)=0.4+0.6.multidot.idf.sub.i
.multidot.tf.sub.ij,
where idf.sub.i is based on the frequency of texts in the database
containing the query term and tf.sub.ij is based on the frequency with
which the query term appears in the respective topical text, adding means
| | |