WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
System of document representation retrieval by successive iterated probability sampling    
United States Patent5488725   
Link to this pagehttp://www.wikipatents.com/5488725.html
Inventor(s)Turtle; Howard R. (Rosemount, MN); Morton; Gerald J. (South St. Paul, MN); Larntz; F. Kinley (Shoreview, MN)
AbstractAn information retrieval system based on probabilities that documents meet information needs. The frequency of occurrence of a representation in a collection of documents is estimated by identifying the frequency of occurrence of the representation in a sample of documents and calculating the difference between the maximum and minimum probable frequencies of occurrence of the representation in the collection. If the difference does not exceed a limit, a midpoint of the maximum and minimum probable frequencies is the estimated frequency of occurrence of the representation. Document distribution probabilities are optimized and probability thresholds are established for the identification of documents. An initial probability threshold is established and is adjusted as the probabilities are scored for documents in samples. The document result list is iteratively adjusted through the samples.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5488725
System of document representation retrieval by successive iterated

     probability sampling - US Patent 5488725 Drawing
System of document representation retrieval by successive iterated probability sampling
Inventor     Turtle; Howard R. (Rosemount, MN); Morton; Gerald J. (South St. Paul, MN); Larntz; F. Kinley (Shoreview, MN)
Owner/Assignee     West Publishing Company (Eagan, MN)
Patent assignment
All assignments
Publication Date     January 30, 1996
Application Number     08/039,757
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     March 30, 1993
US Classification     707/5
Int'l Classification     G06F 017/30
Examiner     Black; Thomas G.
Assistant Examiner     Amsbury; Wayne
Attorney/Law Firm     Kinney & Lange
Address
Parent Case     CROSS REFERENCE TO RELATED APPLICATIONS This application is a continuation-in-part of application Ser. No. 07/773,101 filed Oct. 8, 1991, U.S. Pat. No. 5,265,065.
Priority Data    
USPTO Field of Search     395/600 364/419.19
Patent Tags     document representation retrieval successive iterated probability sampling
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5335345
Frieder

Aug,1994

[0 after 0 votes]
5325298
Gallant
704/9
Jun,1994

[0 after 0 votes]
5321833
Chang
707/5
Jun,1994

[0 after 0 votes]
5301109
Landauer
704/9
Apr,1994

[0 after 0 votes]
5297042
Morita

Mar,1994

[0 after 0 votes]
5278980
Pedersen
707/4
Jan,1994

[0 after 0 votes]
5265065
Turtle
707/4
Nov,1993

[0 after 0 votes]
5263159
Mitsui
707/5
Nov,1993

[0 after 0 votes]
5220625
Hatakeyama
715/809
Jun,1993

[0 after 0 votes]
5159667
Borrey
715/500
Oct,1992

[0 after 0 votes]
5109509
Katayama
704/9
Apr,1992

[0 after 0 votes]
4870568
Kahle
707/5
Sep,1989

[0 after 0 votes]
4843389
Lisle
341/106
Jun,1989

[0 after 0 votes]
4554631
Reddington
707/4
Nov,1985

[0 after 0 votes]
4422158
Galie
707/5
Dec,1983

[0 after 0 votes]
4384329
Rosenbaum
704/10
May,1983

[0 after 0 votes]
5418948
Turtle
707/4
Dec,1969

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


What is claimed is:

1. In a computer system for identifying a predetermined number of documents of a document collection containing representations that have high probabilities of matching a query containing a plurality of concepts, in which the system has a database containing identifications of documents in the document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining the query, apparatus comprising:

sample selection means for iteratively selecting successive samples of a plurality of documents from the collection, each sample containing fewer documents than the entire collection and each successive sample containing documents different from each previous sample;

processing means responsive to the sample selection means for calculating, during each iteration, probabilities that documents contained in the sample contain representations that match the query and for identifying a preselected number of documents having the highest probabilities, the documents being identified during an iteration from a group consisting of the respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and

output means outputting the identifications of the predetermined number of documents identified by the processing means.

2. The apparatus according to claim 1 further including threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of a first identified document.

3. The apparatus according to claim 2 including determining means operable during each respective iteration and responsive to the identification of the preselected number of documents by the processing means to determine if an additional document of the respective sample has a probability greater than the probability threshold, the processing means being responsive to the determining means identifying an additional document having a probability greater that the probability threshold to replace the previously-identified document having the lowest probability by the additional document, and the threshold setting means being responsive to the processing means to reset the probability threshold to the probability of the identified document having the new lowest probability.

4. The apparatus according to claim 1 wherein the preselected number is equal to the number of the respective iteration, and the predetermined number is equal to the number of the last iteration.

5. The apparatus according to claim 1 including estimating means responsive to the processing means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated, the processing means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts of the query until either the estimated maximum probability does not at least equal the threshold or the probability is calculated for all the concepts in the query.

6. The apparatus according to claim 5 wherein the output means includes a result list ranking the identified documents in probability order, the threshold setting means being responsive to the result list to reset the probability threshold equal to the probability of the document lowest on the result list.

7. The apparatus according to claim 1 wherein the processing means includes a result list ranking the identified documents in probability order.

8. A system for identifying documents matching a comprising:

a memory containing a database containing identification of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, the database further containing indications of the frequencies of occurrence of documents containing first representations in the collection;

computer means responsive to a query defining a plurality of concepts, the computer means including

matching means for matching the concepts to representations,

estimating means for estimating the frequency of occurrence of documents containing a second selected representation in the collection, the second selected representation being different from any of the first representations, the estimating means including

sample selection means for selecting a sample comprising a plurality of documents from the collection, the sample containing fewer documents than the entire collection;

frequency identifying means responsive to the sample selection means for identifying the frequency of occurrence of documents containing the second selected representation in the selected sample of documents;

processor means responsive to the memory means and to the frequency identifying means for calculating a maximum and a minimum probable frequency of occurrence of documents containing the second selected representation in the collection; and

selection means responsive to the processor means for selecting the midpoint, of the maximum and minimum probable frequencies as the estimated frequency of occurrence of the second selected representation;

retrieval means for selecting documents meeting the query based on the frequencies of occurrence of documents containing first representations which match the concepts and the estimated frequencies of occurrence of documents containing second representations which match the concepts, and

output means responsive to the retrieval means and the memory for outputting identifications of the selected documents.

9. The system according to claim 8 wherein the processor means includes means for identifying if the difference between the maximum and minimum probable frequencies is within a preselected limit, and further including adjusting means responsive to the processor means for adding additional documents from the collection to the sample of documents if the calculating difference between the maximum and minimum probable frequencies exceeds the preselected limit.

10. The system according to claim 8 where the processor means calculates the maximum probable frequency, f.sub.max, and the minimum probable frequency, f.sub.min, in accordance with relationships based on the number of gaps between documents in the sample containing the second selected representation (n.sub.i), the number of documents in the collection (n.sub.c), and the number of documents in the sample (x.sub.i).

11. The system according to claim 10 where f.sub.max and f.sub.min are calculated accordance with the relationships ##EQU5## where s.sub.i is the greater of x.sub.i /n.sub.i or the standard deviation of the n.sub.i gaps, and

z is the standard critical value for normal distribution for a preselected reliability.

12. In a system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection compromising a plurality of documents and the database containing a frequency of occurrence of documents containing each of at least some of the representations in the collection of documents, query means for defining a query containing a plurality of concepts, matching means for matching concepts to representations means for selecting documents meeting the query based on frequencies of occurrence of documents in the collection containing representations matching the concepts, and output means responsive to the means for selecting documents for outputting identifications of selected documents, the improvement comprising a process of estimating the frequency of occurrence of documents containing a representation in the collection of documents for which the database does not contain a frequency of occurrence, comprising:

identifying, on the basis of concepts in the query, a representation for which the database does not contain a frequency of occurrence;

selecting a sample comprising a plurality of documents from the collection, the sample containing fewer documents than the entire collection;

identifying the frequency of occurrence of documents containing the identified representation in the selected sample of documents;

calculating a maximum and a minimum probable frequency of occurrence of documents containing the identified representation in the collection; and

selecting a midpoint of the maximum and minimum probable frequencies as the estimated frequency of occurrence of documents containing the identified representation,

whereby the means for selecting documents meeting the query is responsive to the frequencies of occurrence in the database of documents in the collection containing representations matching the concepts and to estimated frequencies of occurrence to select documents in the collection containing representations matching the concepts.

13. The process according to claim 12 further including identifying whether the difference between the maximum and minimum probable frequencies is within a preselected limit, and adding additional documents to the sample from the collection if the calculated difference between the maximum and minimum probable frequencies exceeds the preselected limit.

14. The process according to claim 13 where the preselected limit is 0.05.

15. The process according to claim 12 where the maximum probable frequency, f.sub.max, and the minimum probable frequency, f.sub.min, are calculated in accordance with the relationships ##EQU6## where n.sub.i is the number of gaps between documents in the sample containing the selected representation,

n.sub.c is the number of documents in the collection,

x.sub.i is the number of documents in the sample,

s.sub.i is the greater of x.sub.i /n.sub.i or the standard deviation of the n.sub.i gaps, and

z is the standard critical value for normal distribution for a preselected reliability.

16. The process according to claim 15 where the selected representation contains a plurality of terms, the method including setting f.sub.min equal to n.sub.i if the calculated f.sub.min is smaller than n.sub.i, setting f.sub.max equal to n.sub.i +(n.sub.c 31 x.sub.i) if the calculated f.sub.max is smaller than zero or smaller than n.sub.i, and setting f.sub.max equal to an a priori maximum if the calculated f.sub.max is greater than the a priori maximum.

17. The process according to claim 16 wherein the selected representation is a synonym represented by a plurality of terms, and wherein the a priori maximum is equal to the sum of all frequencies of occurrence of documents in the collection containing a term of the synonym, said method including setting f.sub.min equal to an a priori minimum if the calculated f.sub.min is smaller than the a priori minimum, where the a priori minimum is equal to the frequency of occurrence of documents containing the term of the synonym appearing in the greatest number of documents in the collection.

18. The process according to claim 16 wherein the selected representation is a phrase containing a plurality of terms, and the a priori maximum is equal to the frequency of occurrence of documents containing the term of the phrase appearing in the least number of documents in the collection.

19. The process according to claim 15 where the preselected reliability is 0.995 and z is 2.8070.

20. The process according to claim 12 wherein the midpoint selected between the maximum and minimum probable frequencies is the mean of the maximum and minimum probable frequencies.

21. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, apparatus for identifying documents of the document collection containing representations that match the query containing a plurality of concepts, the apparatus comprising:

processing means for calculating probabilities that documents match the query and for identifying a first document having a calculated probability;

threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of the first document;

estimating means responsive to the processing means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated;

the processing means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts of the query until either the estimated maximum probability for the second document does not at least equal the probability threshold or the probability is calculated for all the concepts in the query;

the estimating means being further responsive to the processing means ceasing or completing the calculation of the probability for the second document to estimate a maximum probability for a third document different from the first and second documents; and

output means responsive to the processing means for outputting identifications of only documents whose probability is calculated for all concepts in the query.

22. The apparatus according to claim 21 wherein the output means includes a result list identifying in probability order, up to a predetermined number of documents whose probability is calculated for all concepts in the query, the threshold setting means being responsive to the result list to reset the probability threshold equal to the probability of the document lowest on the result list.

23. Apparatus according to claim 21 wherein the threshold setting means is responsive to the processing means calculating the probability for the second document for all the concepts in the query to set the probability threshold equal to the probability of the second document.

24. The apparatus according to claim 21 wherein the output means includes a result list identifying in probability order, up to a predetermined number of documents whose probability is calculated for all concepts in the query.

25. A document identification system for identifying a predetermined number of documents matching a query, comprising:

a read-only memory containing a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of documents in the document collection, the collection comprising a plurality of documents;

query means for defining the query containing a plurality of concepts;

computer means responsive to the query containing a plurality of concepts, the computer means including matching means for matching the concepts to representations;

sample selection means for iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample;

processing means responsive to the sample selection means for calculating, during each iteration probabilities that documents contained in the sample contain representations that match the query and for identifying up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of the respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and

output means outputting identifications of the predetermined number of documents identified by the processing means.

26. The system according to claim 25 further including threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of a first identified document.

27. The system according to claim 26 including determining means operable during each respective iteration and responsive to the identification of the preselected number of documents by the processing means to determine if an additional document of the respective sample has a probability grater than the probability threshold, the processing means being responsive to the determining means identifying an additional document having a probability greater than the probability threshold to replace the previously-identified document having the lowest probability with the additional document, and the threshold setting means is responsive to the processing means to reset the probability threshold to the probability of the identified document having the new lowest probability.

28. The system according to claim 25 including estimating means responsive to the processing means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated, the processing means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts of the query until either the estimated maximum probability for the second document does not at least equal the threshold or the probability is calculated for all the concepts in the query.

29. The system according to claim 28 wherein the output means includes a result list ranking the identified documents in probability order, the threshold setting means being responsive to the result list to reset the probability threshold equal to the probability of the document lowest on the result list.

30. The system according to claim 25 wherein the output means includes a result list ranking the identified documents in probability order.

31. A document identification system for identifying documents matching a query, comprising:

a read-only memory containing a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of documents in a document collection, the collection comprising a plurality of documents, the database further containing indications of the frequencies of occurrences of a plurality of representations in the documents;

query means for defining the query containing a plurality of concepts;

computer means responsive to the query, the computer means including matching means for matching the concepts to representations;

calculating means for calculating the probabilities that documents meet the query based on the frequencies of occurrence of representations in the respective documents which match the concepts;

processing means responsive to the calculating means for identifying a first document contained in the sample having the highest calculated probability;

threshold setting means responsive to the processing means for setting a probability threshold equal to the probability of the first document;

estimating means responsive to the calculating means for estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated,

said calculating means being responsive to the estimating means to calculate partial probabilities that representations in the second document match concepts in the query until either the estimated maximum probability for the second document does not at least equal the probability threshold or the probability is calculated for all concepts in the query,

the estimating means being further responsive to the calculating means ceasing or the completing the calculation of the probability for the second document to estimate a maximum probability for a third document different from the first and second documents; and

output means responsive to the processing means for outputting identifications of only documents whose probability is calculated for all concepts in the query.

32. The document identification system according to claim 31 wherein said output means includes a result list responsive to the calculating means to identify in probability order up to a predetermined number of those documents whose probability is calculated for all concepts in the query, said threshold setting means being responsive to the result list to reset the probability threshold equal to the probability of the document lowest on the result list.

33. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, a process of identifying a predetermined number of documents of the document collection containing representations that have high probabilities of matching the query containing a plurality of concepts, the process comprising:

iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample;

calculating the probabilities that documents contained in the sample contain representations that match the query;

identifying, during each iteration, a preselected number of documents having the highest probabilities, the documents being selected from a group consisting of a respective sample of documents and the documents identified during the next previous iteration, the preselected number being different for each iteration and no greater than the predetermined number; and

outputting identifications of the predetermined number of identified documents upon completion of the last iteration.

34. The process according to claim 33 including setting a probability threshold to the probability of the identified document having the lowest probability of all identified documents, and during each respective iteration and after the preselected number of documents has been identified, determining if an additional document of the respective sample has been identified having a probability greater than the probability threshold, and if so, replacing the previously-identified document having the lowest probability with the additional document and resetting the probability threshold to the probability of the identified document having the new lowest probability.

35. The process according to claim 33 wherein the preselected number is equal to the number of the respective iteration, and the predetermined number is equal to the number of the last iteration.

36. The process according to claim 33 including setting a probability threshold equal to the probability of a first document, estimating a maximum probability for a second document different from the first document based on a partially calculated probability for the second document and an assumption that the representations in the second document match the concepts of the query for which probabilities have not been calculated, and calculating partial probabilities that representations in the second document match concepts in the query until either the estimated maximum probability for the second document does not at least equal the threshold or the probability is calculated for all the concepts in the query.

37. The process according to claim 36 including ranking the identified documents in probability order, and resetting the probability threshold equal to the probability of the document lowest on the list.

38. The process according to claim 33 including ranking the identified documents in probability order.

39. In a computer system for identifying documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, and query means for defining a query containing a plurality of concepts, a process of identifying documents of the document collection containing representations that match the query containing a plurality of concepts, the process comprising:

computing the full probability that a first document matches the concepts in the query;

setting a probability threshold equal to the full probability of the first document;

calculating a partial probability that a second document matches some but not all concepts in the query;

estimating a maximum probability for the second document based on the calculated probability and an assumption that the representations in the document match the concepts of the query for which probabilities have not been calculated;

repeating the steps of calculating and estimating for additional query concepts until either the estimated maximum probability tier the second document is not as large as the probability threshold or the full probability of the second document is calculated for all concepts in the query;

repeating the repetitive steps of calculating and estimating for a third document different from the first and second documents; and

outputting identifications of only documents having a full probability at least as great as the probability threshold.

40. The process according to claim 39 wherein a predetermined number of documents of the document collection is identified and wherein documents whose probabilities are calculated for all concepts in the query are identified to a result list in probability order, up to said predetermined number, said process further including resetting the probability threshold equal to the probability of the document lowest on the result list.

41. In a system identifying a predetermined number of documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, query means for defining containing a plurality of concepts, means for determining a probability that a document meets the query based on matches of representations in the document and concepts in the query, and output means for outputting the identifications of documents having a probability at least as great as a probability threshold, apparatus for establishing the probability threshold comprising:

sample selection means for iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection and each successive sample containing documents different from each previous sample;

calculating means for calculating probabilities that documents contained in the sample contain representations that match the query;

processing means responsive to the sample selection means to identify, during each iteration, up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of a respective sample of documents end the documents identified during the previous iteration; and

threshold setting means responsive to the processing means for setting the probability threshold to the probability of the identified document having the lowest probability.

42. The apparatus according to claim 41 including determining means operable during each respective iteration and responsive to the identification of the preselected number of documents by the processing means to determine if the processing means identifies an additional document of the respective sample having a probability greater than the probability threshold, the processing means being responsive to the determining means to replace the previously-identified document having the lowest probability by the additional document, and the threshold setting means is responsive to the processing means to reset the probability threshold to the probability of the identified document having the new lowest probability.

43. The apparatus according to claim 41 wherein the preselected number is equal to the number of the respective iteration.

44. In a system for identifying a predetermined number of documents matching a query, in which the system has a database containing identifications of documents in a document collection and defining a plurality of representations representing the contents of the documents, the collection comprising a plurality of documents, query means for defining a query containing a plurality of concepts, means for determining a probability that a document meets the query based on a match of representations in the document and concepts in the query, and output means for outputting the identifications of documents having a probability at least as great as a probability threshold, a process for establishing the probability threshold comprising:

iteratively selecting successive samples of a plurality of documents from the collection for examination, each sample containing fewer documents than the entire collection, and each successive sample containing documents different from each previous sample;

calculating probabilities that documents in the sample contain representations that match the query;

identifying, during each iteration, up to a preselected number of documents having the highest probabilities, the documents being identified during each iteration from a group consisting of a respective sample of documents and the documents identified during the next previous iteration; and

setting the probability threshold to the probability of the identified document having the lowest probability.

45. The process according to claim 44 including during each respective iteration and after the preselected number of documents has been identified, determining if an additional document of the sample has been identified having a probability greater than the probability threshold, replacing the previously-identified document having the lowest probability by the additional document, and resetting the probability threshold to the probability of the identified document having the new lowest probability.

46. The process according to claim 44 wherein the preselected number is equal to the number of the respective iteration.