|
Claims  |
|
|
What is claimed is:
1. In a computer system for identifying a predetermined number of documents
of a document collection containing representations that have high
probabilities of matching a query containing a plurality of concepts, in
which the system has a database containing identifications of documents in
the document collection and defining a plurality of representations
representing the contents of the documents, the collection comprising a
plurality of documents, and query means for defining the query, apparatus
comprising:
sample selection means for iteratively selecting successive samples of a
plurality of documents from the collection, each sample containing fewer
documents than the entire collection and each successive sample containing
documents different from each previous sample;
processing means responsive to the sample selection means for calculating,
during each iteration, probabilities that documents contained in the
sample contain representations that match the query and for identifying a
preselected number of documents having the highest probabilities, the
documents being identified during an iteration from a group consisting of
the respective sample of documents and the documents identified during the
next previous iteration, the preselected number being different for each
iteration and no greater than the predetermined number; and
output means outputting the identifications of the predetermined number of
documents identified by the processing means.
2. The apparatus according to claim 1 further including threshold setting
means responsive to the processing means for setting a probability
threshold equal to the probability of a first identified document.
3. The apparatus according to claim 2 including determining means operable
during each respective iteration and responsive to the identification of
the preselected number of documents by the processing means to determine
if an additional document of the respective sample has a probability
greater than the probability threshold, the processing means being
responsive to the determining means identifying an additional document
having a probability greater that the probability threshold to replace the
previously-identified document having the lowest probability by the
additional document, and the threshold setting means being responsive to
the processing means to reset the probability threshold to the probability
of the identified document having the new lowest probability.
4. The apparatus according to claim 1 wherein the preselected number is
equal to the number of the respective iteration, and the predetermined
number is equal to the number of the last iteration.
5. The apparatus according to claim 1 including estimating means responsive
to the processing means for estimating a maximum probability for a second
document different from the first document based on a partially calculated
probability for the second document and an assumption that the
representations in the second document match the concepts of the query for
which probabilities have not been calculated, the processing means being
responsive to the estimating means to calculate partial probabilities that
representations in the second document match concepts of the query until
either the estimated maximum probability does not at least equal the
threshold or the probability is calculated for all the concepts in the
query.
6. The apparatus according to claim 5 wherein the output means includes a
result list ranking the identified documents in probability order, the
threshold setting means being responsive to the result list to reset the
probability threshold equal to the probability of the document lowest on
the result list.
7. The apparatus according to claim 1 wherein the processing means includes
a result list ranking the identified documents in probability order.
8. A system for identifying documents matching a comprising:
a memory containing a database containing identification of documents in a
document collection and defining a plurality of representations
representing the contents of the documents, the collection comprising a
plurality of documents, the database further containing indications of the
frequencies of occurrence of documents containing first representations in
the collection;
computer means responsive to a query defining a plurality of concepts, the
computer means including
matching means for matching the concepts to representations,
estimating means for estimating the frequency of occurrence of documents
containing a second selected representation in the collection, the second
selected representation being different from any of the first
representations, the estimating means including
sample selection means for selecting a sample comprising a plurality of
documents from the collection, the sample containing fewer documents than
the entire collection;
frequency identifying means responsive to the sample selection means for
identifying the frequency of occurrence of documents containing the second
selected representation in the selected sample of documents;
processor means responsive to the memory means and to the frequency
identifying means for calculating a maximum and a minimum probable
frequency of occurrence of documents containing the second selected
representation in the collection; and
selection means responsive to the processor means for selecting the
midpoint, of the maximum and minimum probable frequencies as the estimated
frequency of occurrence of the second selected representation;
retrieval means for selecting documents meeting the query based on the
frequencies of occurrence of documents containing first representations
which match the concepts and the estimated frequencies of occurrence of
documents containing second representations which match the concepts, and
output means responsive to the retrieval means and the memory for
outputting identifications of the selected documents.
9. The system according to claim 8 wherein the processor means includes
means for identifying if the difference between the maximum and minimum
probable frequencies is within a preselected limit, and further including
adjusting means responsive to the processor means for adding additional
documents from the collection to the sample of documents if the
calculating difference between the maximum and minimum probable
frequencies exceeds the preselected limit.
10. The system according to claim 8 where the processor means calculates
the maximum probable frequency, f.sub.max, and the minimum probable
frequency, f.sub.min, in accordance with relationships based on the number
of gaps between documents in the sample containing the second selected
representation (n.sub.i), the number of documents in the collection
(n.sub.c), and the number of documents in the sample (x.sub.i).
11. The system according to claim 10 where f.sub.max and f.sub.min are
calculated accordance with the relationships
##EQU5##
where s.sub.i is the greater of x.sub.i /n.sub.i or the standard deviation
of the n.sub.i gaps, and
z is the standard critical value for normal distribution for a preselected
reliability.
12. In a system for identifying documents matching a query, in which the
system has a database containing identifications of documents in a
document collection and defining a plurality of representations
representing the contents of the documents, the collection compromising a
plurality of documents and the database containing a frequency of
occurrence of documents containing each of at least some of the
representations in the collection of documents, query means for defining a
query containing a plurality of concepts, matching means for matching
concepts to representations means for selecting documents meeting the
query based on frequencies of occurrence of documents in the collection
containing representations matching the concepts, and output means
responsive to the means for selecting documents for outputting
identifications of selected documents, the improvement comprising a
process of estimating the frequency of occurrence of documents containing
a representation in the collection of documents for which the database
does not contain a frequency of occurrence, comprising:
identifying, on the basis of concepts in the query, a representation for
which the database does not contain a frequency of occurrence;
selecting a sample comprising a plurality of documents from the collection,
the sample containing fewer documents than the entire collection;
identifying the frequency of occurrence of documents containing the
identified representation in the selected sample of documents;
calculating a maximum and a minimum probable frequency of occurrence of
documents containing the identified representation in the collection; and
selecting a midpoint of the maximum and minimum probable frequencies as the
estimated frequency of occurrence of documents containing the identified
representation,
whereby the means for selecting documents meeting the query is responsive
to the frequencies of occurrence in the database of documents in the
collection containing representations matching the concepts and to
estimated frequencies of occurrence to select documents in the collection
containing representations matching the concepts.
13. The process according to claim 12 further including identifying whether
the difference between the maximum and minimum probable frequencies is
within a preselected limit, and adding additional documents to the sample
from the collection if the calculated difference between the maximum and
minimum probable frequencies exceeds the preselected limit.
14. The process according to claim 13 where the preselected limit is 0.05.
15. The process according to claim 12 where the maximum probable frequency,
f.sub.max, and the minimum probable frequency, f.sub.min, are calculated
in accordance with the relationships
##EQU6##
where n.sub.i is the number of gaps between documents in the sample
containing the selected representation,
n.sub.c is the number of documents in the collection,
x.sub.i is the number of documents in the sample,
s.sub.i is the greater of x.sub.i /n.sub.i or the standard deviation of the
n.sub.i gaps, and
z is the standard critical value for normal distribution for a preselected
reliability.
16. The process according to claim 15 where the selected representation
contains a plurality of terms, the method including setting f.sub.min
equal to n.sub.i if the calculated f.sub.min is smaller than n.sub.i,
setting f.sub.max equal to n.sub.i +(n.sub.c 31 x.sub.i) if the calculated
f.sub.max is smaller than zero or smaller than n.sub.i, and setting
f.sub.max equal to an a priori maximum if the calculated f.sub.max is
greater than the a priori maximum.
17. The process according to claim 16 wherein the selected representation
is a synonym represented by a plurality of terms, and wherein the a priori
maximum is equal to the sum of all frequencies of occurrence of documents
in the collection containing a term of the synonym, said method including
setting f.sub.min equal to an a priori minimum if the calculated f.sub.min
is smaller than the a priori minimum, where the a priori minimum is equal
to the frequency of occurrence of documents containing the term of the
synonym appearing in the greatest number of documents in the collection.
18. The process according to claim 16 wherein the selected representation
is a phrase containing a plurality of terms, and the a priori maximum is
equal to the frequency of occurrence of documents containing the term of
the phrase appearing in the least number of documents in the collection.
19. The process according to claim 15 where the preselected reliability is
0.995 and z is 2.8070.
20. The process according to claim 12 wherein the midpoint selected between
the maximum and minimum probable frequencies is the mean of the maximum
and minimum probable frequencies.
21. In a computer system for identifying documents matching a query, in
which the system has a database containing identifications of documents in
a document collection and defining a plurality of representations
representing the contents of the documents, the collection comprising a
plurality of documents, and query means for defining a query containing a
plurality of concepts, apparatus for identifying documents of the document
collection containing representations that match the query containing a
plurality of concepts, the apparatus comprising:
processing means for calculating probabilities that documents match the
query and for identifying a first document having a calculated
probability;
threshold setting means responsive to the processing means for setting a
probability threshold equal to the probability of the first document;
estimating means responsive to the processing means for estimating a
maximum probability for a second document different from the first
document based on a partially calculated probability and an assumption
that the representations in the second document match the concepts of the
query for which probabilities have not been calculated;
the processing means being responsive to the estimating means to calculate
partial probabilities that representations in the second document match
concepts of the query until either the estimated maximum probability for
the second document does not at least equal the probability threshold or
the probability is calculated for all the concepts in the query;
the estimating means being further responsive to the processing means
ceasing or completing the calculation of the probability for the second
document to estimate a maximum probability for a third document different
from the first and second documents; and
output means responsive to the processing means for outputting
identifications of only documents whose probability is calculated for all
concepts in the query.
22. The apparatus according to claim 21 wherein the output means includes a
result list identifying in probability order, up to a predetermined number
of documents whose probability is calculated for all concepts in the
query, the threshold setting means being responsive to the result list to
reset the probability threshold equal to the probability of the document
lowest on the result list.
23. Apparatus according to claim 21 wherein the threshold setting means is
responsive to the processing means calculating the probability for the
second document for all the concepts in the query to set the probability
threshold equal to the probability of the second document.
24. The apparatus according to claim 21 wherein the output means includes a
result list identifying in probability order, up to a predetermined number
of documents whose probability is calculated for all concepts in the
query.
25. A document identification system for identifying a predetermined number
of documents matching a query, comprising:
a read-only memory containing a database containing identifications of
documents in a document collection and defining a plurality of
representations representing the contents of documents in the document
collection, the collection comprising a plurality of documents;
query means for defining the query containing a plurality of concepts;
computer means responsive to the query containing a plurality of concepts,
the computer means including matching means for matching the concepts to
representations;
sample selection means for iteratively selecting successive samples of a
plurality of documents from the collection for examination, each sample
containing fewer documents than the entire collection, and each successive
sample containing documents different from each previous sample;
processing means responsive to the sample selection means for calculating,
during each iteration probabilities that documents contained in the sample
contain representations that match the query and for identifying up to a
preselected number of documents having the highest probabilities, the
documents being identified during each iteration from a group consisting
of the respective sample of documents and the documents identified during
the next previous iteration, the preselected number being different for
each iteration and no greater than the predetermined number; and
output means outputting identifications of the predetermined number of
documents identified by the processing means.
26. The system according to claim 25 further including threshold setting
means responsive to the processing means for setting a probability
threshold equal to the probability of a first identified document.
27. The system according to claim 26 including determining means operable
during each respective iteration and responsive to the identification of
the preselected number of documents by the processing means to determine
if an additional document of the respective sample has a probability
grater than the probability threshold, the processing means being
responsive to the determining means identifying an additional document
having a probability greater than the probability threshold to replace the
previously-identified document having the lowest probability with the
additional document, and the threshold setting means is responsive to the
processing means to reset the probability threshold to the probability of
the identified document having the new lowest probability.
28. The system according to claim 25 including estimating means responsive
to the processing means for estimating a maximum probability for a second
document different from the first document based on a partially calculated
probability for the second document and an assumption that the
representations in the second document match the concepts of the query for
which probabilities have not been calculated, the processing means being
responsive to the estimating means to calculate partial probabilities that
representations in the second document match concepts of the query until
either the estimated maximum probability for the second document does not
at least equal the threshold or the probability is calculated for all the
concepts in the query.
29. The system according to claim 28 wherein the output means includes a
result list ranking the identified documents in probability order, the
threshold setting means being responsive to the result list to reset the
probability threshold equal to the probability of the document lowest on
the result list.
30. The system according to claim 25 wherein the output means includes a
result list ranking the identified documents in probability order.
31. A document identification system for identifying documents matching a
query, comprising:
a read-only memory containing a database containing identifications of
documents in a document collection and defining a plurality of
representations representing the contents of documents in a document
collection, the collection comprising a plurality of documents, the
database further containing indications of the frequencies of occurrences
of a plurality of representations in the documents;
query means for defining the query containing a plurality of concepts;
computer means responsive to the query, the computer means including
matching means for matching the concepts to representations;
calculating means for calculating the probabilities that documents meet the
query based on the frequencies of occurrence of representations in the
respective documents which match the concepts;
processing means responsive to the calculating means for identifying a
first document contained in the sample having the highest calculated
probability;
threshold setting means responsive to the processing means for setting a
probability threshold equal to the probability of the first document;
estimating means responsive to the calculating means for estimating a
maximum probability for a second document different from the first
document based on a partially calculated probability for the second
document and an assumption that the representations in the second document
match the concepts of the query for which probabilities have not been
calculated,
said calculating means being responsive to the estimating means to
calculate partial probabilities that representations in the second
document match concepts in the query until either the estimated maximum
probability for the second document does not at least equal the
probability threshold or the probability is calculated for all concepts in
the query,
the estimating means being further responsive to the calculating means
ceasing or the completing the calculation of the probability for the
second document to estimate a maximum probability for a third document
different from the first and second documents; and
output means responsive to the processing means for outputting
identifications of only documents whose probability is calculated for all
concepts in the query.
32. The document identification system according to claim 31 wherein said
output means includes a result list responsive to the calculating means to
identify in probability order up to a predetermined number of those
documents whose probability is calculated for all concepts in the query,
said threshold setting means being responsive to the result list to reset
the probability threshold equal to the probability of the document lowest
on the result list.
33. In a computer system for identifying documents matching a query, in
which the system has a database containing identifications of documents in
a document collection and defining a plurality of representations
representing the contents of the documents, the collection comprising a
plurality of documents, and query means for defining a query containing a
plurality of concepts, a process of identifying a predetermined number of
documents of the document collection containing representations that have
high probabilities of matching the query containing a plurality of
concepts, the process comprising:
iteratively selecting successive samples of a plurality of documents from
the collection for examination, each sample containing fewer documents
than the entire collection, and each successive sample containing
documents different from each previous sample;
calculating the probabilities that documents contained in the sample
contain representations that match the query;
identifying, during each iteration, a preselected number of documents
having the highest probabilities, the documents being selected from a
group consisting of a respective sample of documents and the documents
identified during the next previous iteration, the preselected number
being different for each iteration and no greater than the predetermined
number; and
outputting identifications of the predetermined number of identified
documents upon completion of the last iteration.
34. The process according to claim 33 including setting a probability
threshold to the probability of the identified document having the lowest
probability of all identified documents, and during each respective
iteration and after the preselected number of documents has been
identified, determining if an additional document of the respective sample
has been identified having a probability greater than the probability
threshold, and if so, replacing the previously-identified document having
the lowest probability with the additional document and resetting the
probability threshold to the probability of the identified document having
the new lowest probability.
35. The process according to claim 33 wherein the preselected number is
equal to the number of the respective iteration, and the predetermined
number is equal to the number of the last iteration.
36. The process according to claim 33 including setting a probability
threshold equal to the probability of a first document, estimating a
maximum probability for a second document different from the first
document based on a partially calculated probability for the second
document and an assumption that the representations in the second document
match the concepts of the query for which probabilities have not been
calculated, and calculating partial probabilities that representations in
the second document match concepts in the query until either the estimated
maximum probability for the second document does not at least equal the
threshold or the probability is calculated for all the concepts in the
query.
37. The process according to claim 36 including ranking the identified
documents in probability order, and resetting the probability threshold
equal to the probability of the document lowest on the list.
38. The process according to claim 33 including ranking the identified
documents in probability order.
39. In a computer system for identifying documents matching a query, in
which the system has a database containing identifications of documents in
a document collection and defining a plurality of representations
representing the contents of the documents, the collection comprising a
plurality of documents, and query means for defining a query containing a
plurality of concepts, a process of identifying documents of the document
collection containing representations that match the query containing a
plurality of concepts, the process comprising:
computing the full probability that a first document matches the concepts
in the query;
setting a probability threshold equal to the full probability of the first
document;
calculating a partial probability that a second document matches some but
not all concepts in the query;
estimating a maximum probability for the second document based on the
calculated probability and an assumption that the representations in the
document match the concepts of the query for which probabilities have not
been calculated;
repeating the steps of calculating and estimating for additional query
concepts until either the estimated maximum probability tier the second
document is not as large as the probability threshold or the full
probability of the second document is calculated for all concepts in the
query;
repeating the repetitive steps of calculating and estimating for a third
document different from the first and second documents; and
outputting identifications of only documents having a full probability at
least as great as the probability threshold.
40. The process according to claim 39 wherein a predetermined number of
documents of the document collection is identified and wherein documents
whose probabilities are calculated for all concepts in the query are
identified to a result list in probability order, up to said predetermined
number, said process further including resetting the probability threshold
equal to the probability of the document lowest on the result list.
41. In a system identifying a predetermined number of documents matching a
query, in which the system has a database containing identifications of
documents in a document collection and defining a plurality of
representations representing the contents of the documents, the collection
comprising a plurality of documents, query means for defining containing a
plurality of concepts, means for determining a probability that a document
meets the query based on matches of representations in the document and
concepts in the query, and output means for outputting the identifications
of documents having a probability at least as great as a probability
threshold, apparatus for establishing the probability threshold
comprising:
sample selection means for iteratively selecting successive samples of a
plurality of documents from the collection for examination, each sample
containing fewer documents than the entire collection and each successive
sample containing documents different from each previous sample;
calculating means for calculating probabilities that documents contained in
the sample contain representations that match the query;
processing means responsive to the sample selection means to identify,
during each iteration, up to a preselected number of documents having the
highest probabilities, the documents being identified during each
iteration from a group consisting of a respective sample of documents end
the documents identified during the previous iteration; and
threshold setting means responsive to the processing means for setting the
probability threshold to the probability of the identified document having
the lowest probability.
42. The apparatus according to claim 41 including determining means
operable during each respective iteration and responsive to the
identification of the preselected number of documents by the processing
means to determine if the processing means identifies an additional
document of the respective sample having a probability greater than the
probability threshold, the processing means being responsive to the
determining means to replace the previously-identified document having the
lowest probability by the additional document, and the threshold setting
means is responsive to the processing means to reset the probability
threshold to the probability of the identified document having the new
lowest probability.
43. The apparatus according to claim 41 wherein the preselected number is
equal to the number of the respective iteration.
44. In a system for identifying a predetermined number of documents
matching a query, in which the system has a database containing
identifications of documents in a document collection and defining a
plurality of representations representing the contents of the documents,
the collection comprising a plurality of documents, query means for
defining a query containing a plurality of concepts, means for determining
a probability that a document meets the query based on a match of
representations in the document and concepts in the query, and output
means for outputting the identifications of documents having a probability
at least as great as a probability threshold, a process for establishing
the probability threshold comprising:
iteratively selecting successive samples of a plurality of documents from
the collection for examination, each sample containing fewer documents
than the entire collection, and each successive sample containing
documents different from each previous sample;
calculating probabilities that documents in the sample contain
representations that match the query;
identifying, during each iteration, up to a preselected number of documents
having the highest probabilities, the documents being identified during
each iteration from a group consisting of a respective sample of documents
and the documents identified during the next previous iteration; and
setting the probability threshold to the probability of the identified
document having the lowest probability.
45. The process according to claim 44 including during each respective
iteration and after the preselected number of documents has been
identified, determining if an additional document of the sample has been
identified having a probability greater than the probability threshold,
replacing the previously-identified document having the lowest probability
by the additional document, and resetting the probability threshold to the
probability of the identified document having the new lowest probability.
46. The process according to claim 44 wherein the preselected number is
equal to the number of the respective iteration. |
|
|
|
|
Claims  |
|