An information retrieval system finds information in a Distributed Information System (DIS), e.g., the Internet, using query learning and meta search to add documents to resource directories contained in the DIS. A selection means generates training data characterized as positive and negative examples of a particular class of data residing in the DIS. A learning means generates from the training data at least one query that can be submitted to any one of a plurality of search engines for searching the DIS to find "new" items of the particular class. An evaluation means determines and verifies that the new items form a new subset of the particular class and adds them to, or updates, the particular class in the resource directory.
RELATED APPLICATION
Provisional Application, Ser. No. 60/015,231, filed Apr. 10, 1996 and assigned to the same assignee as that of the present invention.
APPENDIX ON CD-ROM
Appendix 3 to this Specification is a computer program listing appendix, submitted on a CD and incorporated by reference in its entirety.
NOTICE
This document discloses source code for implementing the invention. No license is granted directly, indirectly or by implication to the source code for any purpose by disclosure in this document except copying for informational purposes only or as authorized in writing by the assignee under suitable terms and conditions.
A method and apparatus are provided for producing a general data extraction procedure capable of extracting data from data sources on a network regardless of data format. The general data extraction procedure is determined from a plurality of pairs of data from the network, each pair including a data source and a program that accurately extracts data from that data source. The pairs are processed by a learning system to learn a general program for extracting data from new data sources.
A lightweight rule induction method is described that generates compact Disjunctive Normal Form (DNF) rules. Each class may have an equal number of unweighted rules. A new example is classified by applying all rules and assigning the example to the class with the most satisfied rules. The induction method attempts to minimize the training error with no pruning. An overall design is specified by setting limits on the size and number of rules. During training, cases are adaptively weighted using a simple cumulative error method. The induction method is nearly linear in time relative to an increase in the number of induced rules or the number of cases. Experimental results on large benchmark datasets demonstrate that predictive performance can rival the best reported results in the literature.
An apparatus and method of content-based image retrieval generate a plurality of training images, divided into a first part and a second part. Each training image of the first part is labeled as either a positive bag or a negative bag: a positive bag if the training image has a desirable character, and a negative bag if it does not. From the set of all training images of the second part, a set of N1 training images is identified, namely those images having a feature that most closely matches a first feature instance of a training image labeled as a positive bag. A first value corresponding to the first feature instance is then calculated, based on the number of images labeled as positive bags among the identified set of N1 training images.
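The scoring of a single feature instance can be sketched as below. Several details are assumptions made only for illustration, since the abstract does not fix them: the images being searched are assumed to carry bag labels, an image's distance to the instance is taken as the Euclidean distance of its closest feature vector, and the "first value" is taken as the fraction of positive bags among the N1 nearest images. The names `bag_distance` and `instance_score` are hypothetical.

```python
import math

def bag_distance(instance, bag):
    """Distance from a feature instance to an image (a bag of feature
    vectors), taken here as the distance to the bag's closest vector."""
    return min(math.dist(instance, other) for other in bag)

def instance_score(instance, images, n1):
    """Score a feature instance by the share of positive bags among the
    n1 images whose features most closely match it.

    images is a list of (bag, is_positive) pairs.
    """
    if n1 < 1 or not images:
        raise ValueError("need at least one image and n1 >= 1")
    ranked = sorted(images, key=lambda image: bag_distance(instance, image[0]))
    nearest = ranked[:n1]
    positives = sum(1 for _, is_positive in nearest if is_positive)
    return positives / len(nearest)

images = [
    ([(0.0, 0.0), (5.0, 5.0)], True),
    ([(0.1, 0.1)], True),
    ([(3.0, 3.0)], False),
    ([(0.2, 0.0)], False),
]
score = instance_score((0.0, 0.0), images, n1=3)
```

A high score suggests that the instance is characteristic of positive bags; a raw count or a weighted count would serve equally well under the abstract's wording.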
A computer method and apparatus identify the content owner of a Web site. A collecting step or element collects candidate names from the subject Web site. For each candidate name, a test module (or testing step) runs tests that provide a quantitative or statistical evaluation of whether the candidate name is the content owner name of the subject Web site. The test results are combined mathematically, such as by a Bayesian network, into an indication of the content owner name.
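One simple way to combine per-candidate test results is a naive Bayes model, which is the simplest form of Bayesian network (all tests are treated as conditionally independent given whether the candidate is the owner). The sketch below assumes that structure; the test names, the probabilities, and the function `posterior_owner` are hypothetical and are not taken from the source.

```python
import math

def posterior_owner(outcomes, likelihoods, prior):
    """Return P(candidate is owner | test outcomes) under naive Bayes.

    outcomes:    {test_name: bool}, whether the candidate passed each test
    likelihoods: {test_name: (P(pass | owner), P(pass | not owner))}
    prior:       P(candidate is owner) before any test is run
    """
    log_odds = math.log(prior / (1.0 - prior))
    for name, passed in outcomes.items():
        p_owner, p_other = likelihoods[name]
        if passed:
            log_odds += math.log(p_owner / p_other)
        else:
            log_odds += math.log((1.0 - p_owner) / (1.0 - p_other))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical tests and illustrative probabilities.
likelihoods = {
    "in_copyright_notice": (0.8, 0.2),
    "in_page_title": (0.6, 0.3),
}
candidates = {
    "Acme Corp": {"in_copyright_notice": True, "in_page_title": True},
    "Widget News": {"in_copyright_notice": False, "in_page_title": True},
}
scores = {name: posterior_owner(outcomes, likelihoods, prior=0.5)
          for name, outcomes in candidates.items()}
best = max(scores, key=scores.get)
```

Working in log-odds keeps the arithmetic stable when many tests are combined; a fuller Bayesian network would also model dependencies between tests.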
A method of learning a user query concept is provided which includes a sample selection stage and a feature reduction stage. During the sample selection stage, sample objects are selected from a query concept sample space bounded by a k-CNF and a k-DNF; the selected sample objects include feature sets that are no more than a prescribed amount different from a corresponding feature set defined by the k-CNF. During the feature reduction stage, individual features are removed from the k-CNF that are identified as differing from corresponding individual features of sample objects indicated by the user to be close to the user's query concept. Also during the feature reduction stage, individual features are removed from the k-DNF that are identified as not differing from corresponding individual features of sample objects indicated by the user to be not close to the user's query concept.
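The feature reduction stage can be illustrated with a deliberately simplified sketch. It assumes k = 1, so that both the k-CNF and the k-DNF reduce to a mapping from feature name to an expected numeric value, and it treats two values as "differing" when they are more than a tolerance apart. The function `reduce_features`, the feature names, and the tolerance parameter are hypothetical; sample selection is not shown.

```python
def reduce_features(cnf, dnf, samples, tolerance=0.0):
    """Apply one round of feature reduction.

    cnf, dnf: {feature_name: expected_value} (k = 1 simplification)
    samples:  list of (features, is_close) pairs, where is_close is the
              user's judgment that the sample is close to the query concept
    """
    cnf, dnf = dict(cnf), dict(dnf)
    for features, is_close in samples:
        if is_close:
            # Drop k-CNF features that differ from a sample marked close.
            cnf = {name: value for name, value in cnf.items()
                   if abs(features[name] - value) <= tolerance}
        else:
            # Drop k-DNF features that do NOT differ from a sample
            # marked not close.
            dnf = {name: value for name, value in dnf.items()
                   if abs(features[name] - value) > tolerance}
    return cnf, dnf

start = {"color": 1.0, "texture": 0.0, "shape": 1.0}
samples = [
    ({"color": 1.0, "texture": 1.0, "shape": 1.0}, True),
    ({"color": 0.0, "texture": 0.0, "shape": 1.0}, False),
]
cnf, dnf = reduce_features(start, start, samples)
```

In this toy run the sample marked close removes `texture` from the k-CNF, and the sample marked not close removes `texture` and `shape` from the k-DNF, so each round of feedback narrows the space between the two bounds.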