WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
Calibration of logical cost formulae for queries in a heterogeneous DBMS using synthetic database    
United States Patent5412806   
Link to this pagehttp://www.wikipatents.com/5412806.html
Inventor(s)Du; Weimin (San Jose, CA); Krishnamurthy; Ravi (Cupertino, CA); Shan; Ming-Chien (Saratoga, CA)
AbstractA programmable machine system and method for managing electronic data access among multiple different relational databases in a network distributed database environment. The machine is programmed so that it can construct cost-effective access strategies for any of the participating databases absent any DBMS-specific cost models. The system provides query optimization across different database management systems in a network distributed database environment based on a calibrating database relying only on typical relational database statistics and cost data is developed by running queries in the various databases against the calibrating database. A logical cost model is constructed using the resulting cost data and is used to estimate the cost of a given query based on logical characteristics of the DBMS, the relations, and the query itself. The cost of a complex query is estimated using primitive queries. Optimal query access strategies are thereby designed and used to control execution of the queries across relational databases controlled by two or more different database management systems.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5412806
Calibration of logical cost formulae for queries in a heterogeneous DBMS

     using synthetic database - US Patent 5412806 Drawing
Calibration of logical cost formulae for queries in a heterogeneous DBMS using synthetic database
Inventor     Du; Weimin (San Jose, CA); Krishnamurthy; Ravi (Cupertino, CA); Shan; Ming-Chien (Saratoga, CA)
Owner/Assignee     Hewlett-Packard Company (Palo Alto, CA)
Patent assignment
All assignments
Publication Date     May 2, 1995
Application Number     07/932,426
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     August 20, 1992
US Classification     707/2 707/9
Int'l Classification     G06F 015/16 G06F 015/403
Examiner     Black; Thomas G.
Assistant Examiner     Homere; Jean R.
Attorney/Law Firm    
Address
Parent Case    
Priority Data    
USPTO Field of Search     395/600 395/800 395/275 395/100 395/155 395/200
Patent Tags     calibration logical cost formulae queries heterogeneous dbms synthetic database
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5091852
Tsuchida
707/2
Feb,1992

[0 after 0 votes]
5043872
Cheng
707/2
Aug,1991

[0 after 0 votes]
4956774
Shibamiya
707/2
Sep,1990

[0 after 0 votes]
4769772
Dwyer
707/2
Sep,1988

[0 after 0 votes]
4531186
Knapman
707/5
Jul,1985

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


We claim:

1. A database access system for optimizing database queries in a heterogeneous distributed database system, the system comprising:

a first database machine incorporating a first relational database management system and accompanying first database;

a second database machine incorporating a relational database management system and accompanying second database;

the first and second relational database management systems being different but conforming at least to a predetermined structured query language (SQL);

communication means for electronic bidirectional communications between the different database machines;

means coupled to the communication means for sending and receiving an electronic message to and from any of the database machines, the message containing data defining a database query;

a data access logical cost model comprising logical cost formulae for optimizing queries in each database in the system:

a synthetic database for use in calibrating the data access logical cost model for each relational database management system in the distributed database system;

means for querying the synthetic data base on each database machine to determine cost coefficients for use in said logical costs formula to calibrate the data access logical cost model: and

means responsive to a database query for accessing each of the first and second databases of said first and second database machines in accordance with a least cost index obtained from said data access logical cost model.

2. A system according to claim 1 wherein the means for querying the synthetic database for calibrating a data access logical cost model includes:

a test suite of SQL queries for querying the synthetic database;

means for transmitting the test suite of SQL queries from the first database machine to the second database machine and determining cost model data for responses to the SQL queries; and

a system wide catalog for storing resulting cost model data.

3. A system according to claim 1 wherein the synthetic database for calibrating a data access cost model includes seven columns for any number (n) of relations containing 2.sup.n tuples, said columns, 1 through 7, having the attributes including:

Column.sub.1 :integer {0, n}, indexed and clustered;

Column.sub.2 :integer {0, 2.sup.n -1}, indexed and de facto clustered, but not specified to the DBMS as such;

Column.sub.3 :integer {0, n}, indexed and unclustered;

Column.sub.4 :integer {0, n}, having no index;

Column.sub.5 :integer {0, 2.sup.n -1}, indexed and unclustered;

Column.sub.6 :integer {0,2.sup.n -1}, having no index; and

Column.sub.7 :a long character string to meet the size of the tuple requirement.

4. A system according to claim 1 including means defining a space of executions for a database query including a remote join across the first and second database management systems.

5. A system according to claim 1 wherein the data access logical cost model includes means for determining data access cost information based on logical information of the databases including coefficients defined as functions having values comprising a composite cost of CPU utility and input/output overhead.

6. In a system for accessing data in a plurality of relational computer databases on distributed network of database machines, a method of structuring access strategies based on derived cost models for at least two participating database management systems (DBMS), each DBMS having a structured query language (SQL), but differing associated cost models, the method comprising:

constructing a model database wherein all rows, columns and relational structures are known and controlled;

conducting a series of access tests on each participating DBMS using said model database;

deriving access cost data for each participating DBMS according to said access tests;

storing the access cost data as a logical cost model in a datadictionary/catalog;

determining an optimum application plan for subsequent distributed database queries relying on the logical cost model stored in said data-dictionary/catalog; and

executing subsequent queries to return data from the distributed databases in accordance with said optimum application plan.

7. A method according to claim 6 wherein the model database construction step further comprises:

creating a database wherein for an integer "n," R.sub.n is a relation of seven columns containing 2.sup.n tuples,

each of said seven columns (C.sub.n, where n={1 . .7}) having the following attributes:

C.sub.1 :integer {0, n}, indexed and clustered;

C.sub.2 :integer {0, 2.sup.n -1}, indexed, defacto clustered, but not specified to the DBMS as such;

C.sub.3 :integer {0, n}, indexed and unclustered;

C.sub.4 :integer {0, n}, having no index;

C.sub.5 :integer {0, 2.sup.n -1}, indexed and unclustered;

C.sub.6 :integer {0,2.sup.n -1}, having no index; and

C.sub.7 :a long character string to meet the size of the tuple requirement;

setting the value for the seventh attribute to be a padding field; and

setting the multicolumn key for the relation to be {C.sub.1, C.sub.2 } such that the relation is indexed on this key, in ascending order, with C.sub.1 being the major key and C.sub.2 being the minor key.

8. A method according to claim 6 wherein the step of conducting a series of access tests further comprises:

applying a minimum of twelve individual queries to each participating DBMS; and

posing each query a minimum of thirty times to derive a statistically significant sampling.

9. A method according to claim 6 wherein the step of deriving access cost data further comprises employing a Least Square Fitting algorithm to reduce errors and estimate cost coefficients.

10. A method according to claim 6 including:

structuring the cost model in accordance with a logical execution of the database queries; and

estimating the cost of a given query based on logical characteristics of each participating DBMS, logical relations between data stored in the databases, and a logical structure of the query.

11. A method according to claim 10 including estimating the cost of a complex query using a set of primitive queries.

12. A computer system for querying a plurality of databases on a network of database computers comprising:

a data storage means in each of the computers for holding database datadictionaries/catalogs and database access mechanisms;

input and display means connected to each of the computers for inputting database queries to the system and displaying results in human readable format;

communications means for transmitting and receiving queries and results by and between the plurality of database computers;

each computer including a database management system (DBMS) having a DBMS query mechanism for accessing a database in response to a database query:

a cost model including a data-dictionary/catalog containing cost data based on coefficients of composite costs including at least CPU and I/O operations relative to a database query, derived by running each individual database computer's DBMS query mechanism against a model database for each of the tested DBMSs in the network; and

means for building query access strategies for each database computer's DBMS based on the cost data stored in said data-dictionary/catalog.

13. A system according to claim 12 in which the cost model includes means defining data access Cost information about communication time between the database computers via the communication means.

14. A system according to claim 12 in which the cost model is structured in accordance with a logical execution of the database queries, including means for estimating the cost of a given query based on logical characteristics of a DBMS, logical relations between data stored in a database, and a logical structure of a query.

15. A system according to claim 12 including:

means for constructing a model database to derive the cost data;

means for running each individual database computer's DBMS query mechanisms against the model database; and

means for deriving cost data for each of the tested DBMSs in the network and loading the resulting cost data into the data-dictionary/catalog.

16. A system according to claim 12 in which the means for building query access strategies is operative to construct a query access strategy including a tuple of variable size, and the cost model includes means for estimating a cost of the strategy as a linear function of the size of the tuple.

17. A system according to claim 12 in which the means for building query access strategies is operative to construct a query access strategy including a number of selection and projection clauses, and the cost model includes means for estimating a cost of the strategy as linear functions of the selection and projection clauses.

18. A method of using a programmable system to perform electronic data management among a plurality of electronic relational databases and corresponding DBMSs in a network distributed environment, said programmable system having a plurality of machine components each including a data storage device, a display device, and a communications means for interconnecting the machine components for bidirectional data communications therebetween, the method comprising:

constructing a model database in which all relational structures and components are known and controlled;

running a series of controlled access tests against the model database by each of the plurality of DBMSs in the network;

tracking and recording resulting cost data for each access test;

storing said cost data as a logical cost model in a network datadictionary/catalog file in the data storage device;

determining an optimum application plan for subsequent distributed database queries relying on the logical cost model stored in said data-dictionary/catalog; and

executing subsequent queries to return data from the distributed databases in accordance with said optimum application plan.

19. A method according to claim 18 wherein the constructing a model database step further comprises:

setting up a relation of seven columns containing 2.sup.n tuples for any integer n such that

Column 1 contains integers {0, n}, indexed and clustered;

Column 2 contains integers {0, 2.sup.n -1}, indexed and de facto clustered, but not specified to the DBMS as such;

Column 3 contains integers {0, n}, indexed and unclustered;

Column 4 contains integers {0, n}, having no index;

Column 5 contains integers {0, 2.sup.n -1}, indexed and unclustered;

Column 6 contains integers {0,2.sup.n -1}, having no index; and

Column 7 contains a long character string to meet the size of the tuple requirement.

20. A method according to claim 18 wherein the tracking and recording step further comprises employing a Least Square Fitting algorithm to reduce errors and estimate cost coefficients.
 Description Submit all comments and votes
 


A portion of the disclosure of this patent document contains material which is the subject of copyright protection. The copyright owner has no objection to facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office file or records, but otherwise reserves all copyrights whatsoever.

BACKGROUND OF THE INVENTION

This invention relates to multi-computer systems, and more particularly to database management systems within interconnected computer networks.

The term "distributed database management system" as applied here means a computer system for a database management system (hereinafter, "DBMS") involving multiple computer sites, each with a local database connected together in a communications network, in which a user at any site can access data stored at any other site.

Each site, in turn, has a DBMS in its own right: It has its own terminals and users, its own local storage and CPU (central processing unit), running its own database and database administration functions (i.e., a local DBMS). It also has its own data communications manager with the additional responsibility for controlling the exchange of messages with other sites in the overall distributed database system. Taken together, a distributed database management system wherein individual database systems may be from different manufacturers, the overall system is often referred to in the literature as a heterogeneous distributed database management system, or HDBMS. An HDBMS must support various database systems with different database models, languages, and services.

An example of such a system is shown in FIG. 1. The example represents a simple distributed banking system with two sites, for example, one in Portland, Oreg. and one in Washington, D.C. Of course, real distributed systems usually involve more than just two sites. But suppose account records for the Washington, D.C. area are stored in a local database at the D.C. site, while account records for the Oregon area are stored in a local database at the Portland site. Suppose further that the two sites are linked together to form a single "global" or distributed database. The system combines efficiency of processing (the data is stored close to the point where it is most frequently used) with increased accessibility (it is possible to access a Washington, D.C. account from Portland, Oreg., and vice versa, via a communications link).

A Review of the Objectives of a Distributed Database System

A major objective of distributed database systems is to provide what is typically called location transparency, meaning that users should not need to know at which site any given piece of data is stored, but should be able to behave as if the entire database were stored at their local site. A request for some remote piece of data should cause the system to find that data automatically by consulting the system catalog. The system catalog is a data dictionary and may be regarded as a database in its own right (a system database, rather than an end-user database). The contents of the catalog can be regarded as "data about data"--that is, descriptions of other objects in the system, rather than "raw data." In particular, all the various schemas and mappings are stored. The catalog includes base tables, views, indexes, users, application plans, access privileges, etc. For instance, and of specific importance here, optimizers use catalog information to choose specific access strategies.

In a distributed database system, the system catalog not only includes the usual catalog data reviewed above, but also all necessary control information to enable the system to provide the desired location, fragmentation, and replication transparency mentioned in this section. The advantages of such transparency are that it simplifies the logic of application programs, and it allows data to be moved from one site to another as usage patterns change, without necessitating any reprogramming. In fact, location transparency is nothing more than another aspect of physical data independence (e.g., immunity of applications to change in storage structure and access strategy), as that concept applies here in the distributed model.

A second objective of distributed database systems is to support data fragmentation. A system supports data fragmentation if a given logical object, say the complete accounts file, can be divided into pieces (fragments) for physical storage purposes. In fact, we are tacitly assuming such support in our banking example, since we are storing Washington account records in D.C., and Oregon account records in Portland. The fragments in that example consist of, respectively, a "restriction" of the total accounts file (relation) to just those records having the location field set to either "Portland, Oreg.," or "Washington, D.C." Alternatively, we could decide to store checking account records in D.C., while storing savings account records in Portland; the fragments would again be "restrictions." In general, a fragment could be any arbitrary subrelation that can be derived from the original relation by means of "restriction operations."

A system that supports data fragmentation should also support fragmentation transparency;, that is, users should be able to behave in all cases as if the relation were not fragmented at all (data independence). In other words, users should be presented with a view of the data in which the fragments are combined together by means of suitable join and union operations.

Another objective for distributed database systems is to support data replication, and its corollary, replication transparency. The basic idea here is that a given logical object, say a given account record, may be represented at the physical level by many distinct copies (replicas) of the same stored object, at many distinct sites. For example, a given account record could be stored in both the D.C. and Portland databases. One advantage of such an arrangement is that retrievals can be directed to the nearest replica. The corresponding disadvantage is that updates must be directed to all replicas. Replication transparency means that users should not need to be aware of replication, but should be able to behave as if every logical object were represented by a single stored object.

Location, fragmentation, and replication transparency together imply that a distributed system should look and feel like a centralized system to the user. However, achieving this objective is not without problems, in particular, the problem of query access optimization.

The Basic Problem

The basic issue is that a distributed database system is a collection of sites, or nodes in a network, and networks, at least long haul networks, are slow. Long haul networks, which bind geographically dispersed sites, use telephone lines, in which the data rate is typically 50K-100K bits per second or less. Thus, an overriding objective in distributed database systems is to minimize the number and volume of messages. This objective in turn gives rise to problems in subsidiary areas, and in particular here, query processing.

The problem of query processing, particularly in the HDBMS environment, focuses on optimization which, accordingly, is the single most important consideration in distributed systems design. Optimization is the process that, for a given query, determines the optimum "execution" or "application" plan. An optimizer is a major subcomponent of the system responsible for producing an application plan (i.e., the machine code instructions to implement SQL statements). In the distributed system, an application plan must take into consideration multiple database systems, in a networked setting, with attendant overheads such as network communications speed mentioned above. Therefore, optimization is crucial to effective query processing, particularly in the HDBMS environment. To better understand the role of optimization and the problem solved by the invention, a brief review of the classical DBMS structure is in order.

A Review of the Classical DBMS Structure

From the user's viewpoint, there are typically four components to a state-of-the-art relational DBMS, namely, the Precompiler, Bind, Runtime Supervisor, and a Stored Data Manager. A pre-compiler is a preprocessor for application programs that contain embedded SOL statements. It collects those statements into a database request module (DBRM), replacing them in the original program by host language CALLs to a Runtime Supervisor. A Bind component compiles one or more related DBRMs, to produce an application plan (i.e., machine code instructions to implement the SOL statements in those DBRMs, including machine code calls to a Stored Data Manager.) A Runtime Supervisor oversees SOL application programs during execution. When such a program requests some database operation, control goes first to the Runtime Supervisor according to the CALLs inserted by the pre-compiler. The Runtime Supervisor then routes control to the application plan, and the application plan in turn, invokes a Stored Data Manager to perform the required function. A Stored Data Manager manages the actual database, storing and retrieving records as requested by application plans. It invokes other low-level components as necessary to perform detail-level functions such as buffering data, locking, sorting, and the like during the performance of its basic tasks.

From an internal operational viewpoint, the same four components are at play namely: before the application source code can be compiled by its regular language compiler, it must be pre-processed by a Precompiler to strip out the SQL function statements and replace said SQL with call lines to the Runtime Supervisor; then the stripped SQL, which is gathered into a DBRM, is compiled into an application plan, which is then used by the Runtime Supervisor every time it encounters a CALL from the executing application. However, a closer look at the Bind step is necessary to understand the optimization issue.

As already suggested, Bind is really a database compiler: it converts high level database requests, in effect SQL statements, into machine code. However, Bind is actually an optimizing compiler: the output from Bind is not just machine code, it is optimized code. The input to Bind is one or more DBRMs. The output from Bind (i.e., the compiled code, which is an "application plan") is stored away in the system catalog, where it can be found when needed by the Runtime Supervisor.

A major subcomponent of Bind is an Optimizer. Its function is to choose, for each SQL statement processed, an efficient access strategy for implementing that statement. Recall that data manipulation statements in SQL such as SELECT specify only what data the user wants, not how to get to that data. The access path for getting to that data will be chosen by the optimizer. Programs are thus independent of such access paths, which is desirable for reasons of data independence. As an example, consider the following simple SELECT SQL statement:

______________________________________ EXEC SQL SELECT DOCKET INTO :XDCKT FROM ALEX WHERE ALEX# = '17AUG92' ______________________________________

Even in this very simple case, there are at least two ways of performing the desired retrieval: 1) by doing a physical sequential scan of table ALEX until the record for Aug. 17, 1992 is found; or 2) if there is an index on the ALEX# column of that table then by using that index and thus going directly to the Aug. 17, 1992 record.

The optimizer will choose which of these strategies to adopt. More generally, given any particular SQL statement to be optimized, the optimizer will make its choice of strategy on the basis of considerations such as:

--which tables are referenced in the request;

--how large those tables are;

--what indexes exist;

--how selective those indexes are;

--how the data is physically clustered on the disk(s);

--the form of the WHERE clause in the request; and so on. The optimizer will then generate machine code dependent on the choice of strategy. If, for example, the optimizer decides to make use of some existing index, say X, then them will be machine code instructions in the application that refer explicitly to X. The resulting strategy is often the referred to as cost modeling.

The point is that there will be many possible strategies for processing a given query (in general). For example, a request for a join of a relation R(a) stored at site PDX and a relation R(b) stored at site DC could be carried out by moving R(a) to DC or by moving R(b) to PDX, or by moving both R(a) and R(b) to a third site, LA (etc.). In other words, there might be six plausible query processing strategies (relying on a certain set of assumptions) where the response time could run anywhere from several seconds to several days. Accordingly, each processing strategy will have an associated cost. The goal of optimization then is the selection of a least cost strategy.

Query processing optimization remains one of the stumbling blocks to effective distributed heterogeneous database management systems (HDBMSs). More often than not, a distributed database system will be heterogeneous rather than homogenous in nature. In other words, each site within the distributed system may have its own particular flavor of database management system. Importantly, access to internal system management data regarding query access optimization at the local level may be severely restricted or unavailable altogether. Yet, an end-user application requesting data from this distributed environment must be able to optimize its data access strategies lest database query operations over the network be unacceptably slow.

In a heterogeneous DBMS, the execution space must be extended to handle global queries across all of its constituent DBMSs. This can be done simply by means of new join methods that extend across these DBMSs. Therefore, execution space and search strategies of the kind used with existing commercial DBMSs can be used in a heterogeneous DBMS only if a cost model were made available for all categories of DBMSs in the heterogeneous DBMS. Such cost models have not been provided by existing systems. The crux of the problem, then, is to derive a cost model for each of the DBMSs. This involves calibrating a given relational DBMS and deducing the cost coefficients of the cost formulae.

In reality, distributed database systems end up requiring tremendous cooperation and compromise among satellite sites in selecting and using a common database management system which will accommodate such an operational setting. Thus, heterogeneous, or "open" distributed database systems, although very desirable, are typically impractical. It can be seen, therefore, that in certain large corporate computing settings, great cost is involved in systems integration, and in many cases, retooling altogether.

Accordingly, there remains a need for a heterogenous query optimization method that extends the traditional optimizer strategy widely used in a commercial DBMS to allow execution of queries over both a known DBMS and foreign DBMS in a heterogeneous distributed database management system.

SUMMARY OF THE INVENTION

One object of the invention is to successfully optimize and execute queries in a heterogenous database management system (HDBMS) in which there are a variety of database managers.

Another object of the invention as aforementioned is to optimize database queries in different databases with minimal knowledge of physical performance or operational models for each of the database managers.

The invention is a heterogenous database management system that employs a query optimization method capable of optimizing database queries to different databases in a seamless fashion over both known and/or foreign vendor DBMSs that conform to some standard such as providing the usual relational database statistics. The optimizer can optimize database queries in different databases with no knowledge of performance or operational models for each of the database managers being required in order to calculate optimal query access strategies across all databases within the distributed network.

The invention employs a calibrating database that is synthetically created so as to make the process of deducing the cost model coefficients substantially devoid of unpredictability problems. The particular cost model that is employed in the invention makes its estimates by using logical characteristics of query performance rather than physical aspects of the query means. More particularly, the cost of a query is estimated based on logical characteristics of the DMBS, the relations, and the query itself; the cost of complex queries is estimated by using primitive queries.

The calibrating database is utilized in a series of benchmarkings to calibrate the coefficients in the cost formulae for any given relational DBMS without requirement for local database query access cost model data which may be severely restricted or unavailable altogether. The resulting calibrating cost models are stored in a HDBMS system catalog for subsequent access strategy development and management.

The synthetic calibrating database is configured such that, given a test query, the resulting execution is predictable; that is, the optimizer will choose the predicted access and join methods. The system cannot be instrumented to measure the constants (e.g., number of I/O issued) of a traditional cost model. Further, the construction of the database, posing of the queries, and making the observations are performed as a user to a `black-box` DBMS; that is, the calibration effort need know nothing of the performance characteristics or cost models for a given participating DBMS. Therefore, in order to deduce coefficients in the cost formulae, it is imperative that the queries and the database are configured such that the resulting execution is predictable; i.e., the optimizer will choose the predicted access and join methods. Even if the execution is predictable, it must be free from the above-mentioned distortion or else the determination of the cause for the observed effect is not possible. For example, if all the tuples having a particular value for an attribute just happened to be in single page then the observed value could be misleading.

Accordingly, a further object of the invention as aforementioned is to create a database and a set of quedes that are free of these distortions so that subsequently the coefficients of the cost formulae can be accurately deduced. To meet this further object, in another aspect of the invention, the data access logical cost model which results from the calibration process employing the synthetic database, relies on logical information of databases such that the cost coefficients are viewed as functions, and the values associated to these coefficients are a composite cost of CPU utility and input/output overhead.

The foregoing and other objects, features and advantages of the invention will become more readily apparent from the following detailed description incorporating a Case Study which implements the invention, which proceeds with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a simple multidatabase machine network of two or more machines interconnected by bidirectional digital data communications channels.

FIG. 2A illustrates the process of constructing a synthetic database and applying it against each of the participating database managers (DBMS) on each database machine in the heterogeneous distributed database network.

FIG. 2B illustrates the process of actually calibrating participating DBMSs in the network as an elaboration of FIG. 2A.

FIG. 3 - unused.

FIG. 4 is a table which lists the actual queries used for calibration.

FIG. 5 is an elaboration of FIG. 3. The process of actually deriving the cost model coefficients involves an additional optimizing step and then storing the cost model data into a system wide catalog for future use by application plans.

FIG. 6 is a description of the cost and join formulae used in the Case Study which is an implementation of the invention in a test environment.

FIG. 7 is a table of values for the six principal attributes of fourth relation (R.sub.4) in the calibrating model employed in the Case Study which implements the invention.

FIG. 8 is the definition for the i.sup.th tuple in R.sub.n and the values of the corresponding attributes utilized in the calibration model employed in the Case Study which implements the invention.

FIG. 9 is a table of the calibrating relations used in the calibrating model employed in the Case Study which implements the invention.

FIG. 10 is a table of the SOL test queries used in the actual calibration process in the Case Study which implements the invention.

FIG. 11 is a table of the cost formulae coefficient values used for the Allbase, DB2 and informix RDBMS systems used in the Case Study which implements the invention.

FIGS. 12A and 12B are a complete list of all SOL test join queries used in the Case Study which implements the invention.

FIG. 13 is a graph of the elapsed time for the test queries running on the DB2 RDBMS for the specified relations and their test joins used in the Case Study which implements the invention.

FIG. 14 is is a graph analysis of the comparison of estimated value to observed values for the type 3.1 join queries used in the Case Study.

FIG. 15 and FIG. 16 are two graph analyses illustrating the comparison of predicted and observed costs of the test queries against the DB2 RDBMS utilized in the Case Study which implements the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring to FIG. 1, a distributed computer database network 10 has a requestor database machine system 11 and one or more database machine systems 12. The configuration represents, for example, a simple distributed banking system with two sites, one in Portland, Oreg. and one in Washington, D.C. In the invention either a local area or long haul network will work; for purposes of this discussion, we consider the distributed network over a long haul communication means with machines in separate cities.

Query accesses in this network distributed database system involve a requestor machine system 11 having its database manager 13 search databases 14, 14A on itself and one or more other machine systems 12, and return a search result to a requestor workstation 16. High speed communications facilities between machines can be implemented through a variety of means 18 from standard 10 megabit per second ethernet local area networking to long-haul data communications facilities which transfer data over great distances at rates of 50 to 100 kilobits per second. The system 10 is considered heterogeneous if the individual local databases are managed by separate and distinct sub-systems called database managers 13, 13A, etc. whose only unifying means is a standard structured query language. The intended result is that the information database, although distributed across multiple machine systems, perhaps geographically dispersed, appears to be one seamless large system. The machine interconnects and separate components, therefore, are transparent.

The present invention allows this heterogeneous distributed database system 10 to accept unlimited number database queries from an unlimited number of participating workstations 16, 17, directed at the overall database as distributed by the architecture. Moreover, the present invention provides the ability to store, maintain, and modify data in a multi-machine, multi-database network independent of the make or particular nuances of the individual database management sub-systems 13, 13A, etc. The system as detailed below is a significant improvement over the prior art because for the first time, database managers from a variety of makers can be combined to form one seamless large database management network with acceptable, if not excellent, performance in terms of the cost of query accesses across these differing database management systems.

System Overview

Referring to FIG. 2A, the process of calibration requires first constructing a synthetic database 20 which is used to calibrate query methods for all database managers 13, 13A, etc. within the system 5 of FIG. 1. The synthetic database can be constructed on any machine and is applied once to derive cost models associated with accessing data through each of the database managers. The database 20, also referred to as a calibrating database, can be stored for future use should a new DBMS be added to the network. The database 20 is located 22 on each machine system 11, 12 of FIG. 1 and a calibration is run 24 against the database by each local database manager 13, 13A, etc. of FIG. 1 using standard structured query language queries. The performance results of the calibration are retained 26 for future use by application programs requiring data access on any of the participating database machines.

The Synthetic Database

Referring to FIG. 2B, a synthetic database 30 is constructed to calibrate the coefficients in the cost formulae for the relational DBMSs 13, 13A, etc. of FIG. 1. As described above, this synthetic database is constructed and queried by each of the participating DBMSs. This process is iterative in nature, sequentially applying a series of queries through each DBMS against the synthetically created database which temporarily replaces the usual database for calibration purposes. Cost metric values (e.g., elapsed time) for the queries are observed to deduce coefficients.

The synthetic database is created by allowing for an integer "n," R.sub.n to be a relation of seven columns containing 2.sup.n tuples. The seven columns have the following attributes:

C.sub.1 : integer [0, n], indexed and clustered;

C.sub.2 : integer [0, 2.sup.n -1], indexed, defacto clustered, but not specified to the DBMS as such;

C.sub.3 : integer [0, n], indexed, unclustered;

C.sub.4 : integer [0, n], no index;

C.sub.5 : integer [0, 2.sup.n -1], indexed, unclustered;

C.sub.6 : integer [0,2.sup.n -1], no index; and

C.sub.7 : a long character string to meet the size of the tuple requirement.

The value for the seventh attribute is a padding field and can be set to anything and therefore, for convenience, omitted from the rest of the description.

The multicolumn key for the relation is (C.sub.1, C.sub.2). This relation is indexed on this key, in ascending order, with C.sub.1 being the major key and C.sub.2 being the minor key. This index is clustered in the sense that the values of the major key are clustered. The values of the minor key (i.e., C.sub.2) are also clustered. In fact, the values in C.sub.2 are unique and have 2.sup.n values in the range [0, 2.sup.n -1] and therefore these values can also be in ascending order. Therefore, the column, C.sub.2, can be viewed as a sequence number for the rows. This C.sub.2 value is referred to as the row index. The need for the multicolumn key and index is so that the tuples are ordered in the disk pages based on C.sub.2 and the system is informed that C.sub.1 has a clustered index (see the Case Study infra).

Application of the Synthetic Database

Referring again to FIG. 2B, (having spawned a database according to the description above), the synthetic database 30 is subsequently deployed 32 under each of the participating DBMSs 13, 13A, etc. of FIG.1 as a temporary replacement for the regular databases 14, 14A of FIG. 1.

A test suite of queries 33 are run against the synthetic database under each DBMS 13, 13A, etc. of FIG, 1. Resulting performance data (e.g., elapsed time) 34 are stored into a system wide catalog 36. Within the preferred embodiment, the calibration is set up to use mostly single table queries. This is not only because the join queries are time-consuming and therefore take too long to calibrate the system, but also because the cost of most join queries can be estimated using those of single table queries.

There are sixteen relations used in these calibrations (see the Case Study infra). Each type of relation is instantiated with two sizes of tuples and the smaller tuple relation is duplicated. This duplication is required because the join queries need two identical relations. The actual queries used in the calibration are given in FIG. 4, where R.sub.n is a table of cardinality 2.sub.n and c is a constant which determines the selectivity. For each type of query against R.sub.n, a set of queries with selectivity 2.sup.-i (where i=1,2, . . .,n) are constructed and observed.

Referring to FIG. 5, for each query 40 the elapsed time in the DBMS is recorded 42. The elapsed time is calculated by subtracting the start timestamp (when the query is received) from the end timestamp (when the result has been sent out). In all DBMSs the queries are posed after flushing all the buffers to eliminate the distortion due to buffering. Each query is issued thirty times and the average elapse time is calculated. Relative error between actual data and average value is within 5% with a confidence coefficient of 95%. Thus, the repeatability of the observation is assured. From this data in 42, the coefficients for the cost formulae can be deduced. A Least Squares fitting algorithm is employed 44 to minimize any errors and estimate the coefficients.

The resulting cost data 34 of FIG. 2B is then added to the system catalog 36 as a part of a cost model. The cost model includes data access cost information about communication time between the database machines via the communications network as well as cost data concerning operations of the central processing units and input/output structures of the database machines. The cost model is structured in accordance with the logical execution of the database queries and estimating the cost of a given query based on logical characteristics of the DBMS to which a query is direc