The invention concerns a method for estimating characteristic information of data items in a data set, such as a database, based on parameters of a multifractal distribution. The invention facilitates efficient estimation of such characteristic information of data contained in a data set more accurately than known estimation methods and without requiring an exhaustive analysis of the data. The invention also concerns an efficient technique for generating the parameters for the multifractal distribution.
Electronic marketplaces typically apply catalog schema in the format of name-value pairs to store product attribute names and values to achieve a very high degree of flexibility. This vertical schema approach prevents traditional relational database management systems from accurately estimating constraint selectivity and generating efficient query plans. In this invention, methods and systems are disclosed for building and maintaining external histograms and a query planner uses these external histograms to assist query planning in relational databases.
A method for estimating frequency moments of a database, the frequency moments F.sub.k of a sequence containing m.sub.i elements of type i, for 1<i<n, being ##EQU1## for each k.gtoreq.0. The method entails receiving a sequence of data elements for input to the database; selecting a subset of the sequence of data elements; initializing a counter corresponding to each data element of the subset; comparing data elements input to the database with each data element in the subset and incrementing counters of subset elements which correspond to the input data elements; computing a random variable for each data item in the subset, which depends on the counter of this subset item; and, processing the plurality of random variables to determine an estimate of the frequency moment. The estimated frequency moment is optimized to be accurate with a high degree of confidence.
Techniques for maintaining a random sample of a relation in a database in the presence of updates to the relation. The random sample of the relation is referred to as a "backing sample," and it is maintained in the presence of insert, modify and delete operations involving the relation. When a new tuple is inserted into the relation, a sample of the given tuple is added to the backing sample if the size of the backing sample is below an upper bound. Otherwise, a randomly-selected tuple of the backing sample is replaced with the new tuple if a sample of the new tuple must be inserted into the backing sample to maintain randomness or another characteristic. When a tuple in the relation is the subject of a modify operation, the backing sample is left unchanged if the modify operation does not affect an attribute of interest to an application which uses the backing sample. Otherwise, a value field in a sample of the tuple in the backing sample is updated. When a tuple is deleted from the relation, any sample of that tuple in the backing sample is removed. A new backing sample may be computed if this removal causes the size of the backing sample to fall below a prespecified lower bound. The backing sample can be of a size which is negligible in comparison to the relation, and need only be modified very infrequently. As a result, its overhead in terms of computation time and storage space is minimal.
A method, database system, and computer program for collecting statistics about a table are disclosed. The table includes one or more rows and each row includes a respective value. The method includes creating zero or more histogram buckets. Each histogram bucket includes a width representing a respective range of values and a height representing a count of rows having values in the range of values. The method further includes creating one or more high-bias buckets, each high-bias bucket represents one or more values that appear in a minimum percentage of rows.