|
Claims  |
|
|
What is claimed is:
1. A machine implemented method for automatically determining an optimal
execution strategy for a request comprising query, update or transaction
operations on a distributed database system having a plurality of data
site, with each site having a computer and data storage facility, and said
sites are interconnected by communication lines, said machine implemented
method comprising the steps of:
A. inputting an ad hoc relational query or update request from a first
computer process that formulates the relational query or update request as
a compacted tree having lead nodes that are select, project or join
operators and non-leaf nodes that are union, intersection, difference,
delete, modify or insert operators to a second computer process which
thereafter performs step B;
B. for each node in the compacted tree perform the following starting at
the root node:
1. if the current node is a leaf node perform the following:
a. materialization planning to choose the data sites from which data is to
be accessed;
b. local process planning to determine which operations can be processed
locally at each site determined by materialization planning and estimate
parameters of resultant data from the local operations;
c. non-local process planning to determine strategy for remaining
operations which can not be performed locally without transmission of data
between sites;
d. execution strategy building to build an execution strategy for each data
site in the distributed database system at which data is processed by
access or manipulation;
2.
2. if the current node is a non-leaf node perform the following:
a. for each child node make the child a root of a subtree and perform step
1 above for the subtree; and
b. having processed all child nodes of the current non-leaf node perform
the following for the current non-leaf node:
1. materialization planning to choose the data sites from which data is to
be accessed;
2. local process planning to determine which operations can be processed
locally at each site determined by materialization planning and estimate
parameters of resultant data from the local operations;
3. non-local process planning to determine strategy for remaining
operations which can not be performed locally without transmisson of data
between sites;
4. execution strategy building to build an execution strategy for each data
site in the distributed database system at which data is processed by
access or manipulation; and then
C. outputting execution strategy commands from said second computer process
to a third computer process that coordinates the execution of the
execution strategy commands at sites within the distributed database
system. 2. The method of claim 1 wherein the step of materialization
planning for a leaf node comprises the steps of:
A. choosing a site for each base relation that is an operand of the leaf
node by choosing the sites that have the most number of base relations;
and
B. choosing result data processor, a site where the final result of the
operations will be built, by choosing the site that has the most number of
base relations.
3. The method of claim 1 wherein the step of local process planning for a
leaf node comprises the steps of:
A. determining the select, project and join operations that can be
performed on base relations or temporary relations that reside at the same
node; and
B. estimating parameters of temporary relations produced by select, project
and join operators using the parameter of the base relations.
4. The method of claim 1 wherein the step of non-local process planning for
a leaf node further comprises determining a strategy for performing the
remaining join operators by transmission of base relations and temporary
relations among the sites by an optimization criteria using the parameters
of the temporary relations estimated by local process planning and the
parameters of the base relations.
5. The method of claim 4 wherein the optimization criteria is to minimize
the total time for communications among the sites.
6. The method of claim 5 wherein the communication time between sites is a
function of the size parameter of the data to be transmitted.
7. The method of claim 6 wherein the size parameter is reduced by use of
semi join operations on the base relations or temporary relations.
8. The method of claim 7 wherein the reduction of the size parameter by use
of the semi join operations is estimated by use of selectivity parameters.
9. The method of claim 4 wherein the optimization criteria is to minimize
the response time for communication among the sites.
10. The method of claim 9 wherein the communication time between sites is a
function of the size parameter of the data to be transmitted.
11. The method of claim 10 wherein the size parameter is reduced by use of
semi join operations on the base relations or temporary relations.
12. The method of claim 11 wherein the reduction of the size parameter by
use of the semi join operations is estimated by use of selectivity
parameters.
13. The method of claim 4 wherein the optimization criteria is to ignore
the parameters and to have all base relations and temporary relations
transmitted to the result data processor chosen by materialization
planning and perform remaining operators at the result data processor.
14. The method of claim 1 wherein the step of building for a leaf node
further comprises the step of building select, project, join and move
requests resulting from the local process planning and non-local process
planning steps.
15. The method of claim 1 wherein the step of materialization planning for
a non-leaf node comprises the steps of:
A. determining a site for each temporary relation that is an operand of a
non-leaf node based upon the result data processor chosen for each child
node;
B. for a union, intersection or difference operator, choosing a result data
processor, a site where the final result of the operations will be built,
by choosing the site that has the most number of temporary relations; and
C. for a delete, modify or insert operator, determining all sites where the
base relation to be updated resides.
16. The method of claim 1 wherein the step of local process planning for a
non-leaf node comprises the steps of:
A. determining union, intersection or difference operations that can be
performed on temporary relations that reside at the same node; and
B. determining delete, modify or insert operations that can be performed on
base relations that reside at the same node as the temporary relations.
17. The method of claim 1 wherein the step of non-local process planning
for a non-leaf node comprises the steps of:
A. for a union, intersection or difference operator, transmitting all
temporary relations to the result data processor chosen by materialization
planning and perform the remaining operations at the result data
processor; and
B. for a delete, modify or insert operator, transmitting all temporary
relations to the sites where the base relation to be updated resides as
determined by materialization planning and performing the delete, modify
or insert operation at those sites.
18. The method of claim 1 wherein the step of execution strategy building
for a non-leaf node further comprises the step of building union,
intersection, difference, delete, modify, insert and move requests
resulting from the local process planning and non-local process planning
steps. |
|
|
|
|
Claims  |
|
|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to computer database systems; and more specifically
to a method for optimizing queries, updates and transactions in a
distributed database system.
2. Description of the Prior Art
Modern database systems are frequently distributed, meaning that the data
is kept at widely dispersed locations. Several different computers control
access to different parts of the data, and several computers are used to
interface with users at different locations; these may or may not be the
same machines that access the data itself. The various computers are
connected by communications links, and it is normal for these links to be
relatively low speed, compared with, say, the speed at which a file can be
read off of a disk. The consequence of this assumption about communication
is that the transfer of data between computers becomes a bottleneck, and
most of the issues unique to distributed systems concern ways of dealing
with this bottleneck.
A distributed database consists of a number of sites, each of which is a
computer and a facility for storing data. In general, each site has both a
transaction server to process queries and updates generated by a user, and
a file server to handle data access. However, it is possible that one of
these two functions is missing at any particular site, and often one
computer will act as both the transaction and file server.
A database is composed of "items", which are individually lockable pieces
of data. It is a possibility that some items are duplicated, in the sense
that they appear at two or more sites. The reason for doing so is that
accesses to frequently read items may be sped up if the item is available
at the same site that processed the read request. Further, the redundant
storage of items makes it less likely that the item will be lost in a
crash. The penalty is that more messages must be sent to update or lock a
duplicated item than an item that appears only once.
Some or all pairs of sites are connected by links, over which messages can
be sent and data transmitted. Based on the technology of the early 1980's,
the time to send even a short message between geographically distributed
sites is nontrivial, say 0.1 seconds, and the rate at which data can be
transmitted across a link is not too great, perhaps around 10,000 bytes
per second or less. The tenth of a second overhead to send a message
represents the time for the two processors connected by a link to execute
the protocol necessary for one to send a message, assured that the other
will receive it and know what to do with it. The shipping rate for data is
limited by the speed at which a processor can send or receive data, or the
bandwidth of the link (rate at which data can be transmitted across the
line) or both.
A prior art system which optimizes queries in a distributed database system
is System R*. System R* is the experimental extension of System R to the
distributed environment. Like its parent, it performs optimization in an
exhaustive way, confident that the cost of considering many strategies
will pay off in lower execution costs for queries, especially for queries
that are compiled and used many times.
The optimization method of System R* applies to an algebraic query, where
the expression to be optimized represents the query applied to logical
relations, which in turn are expressed in terms of physical relations. The
operators assumed to appear in the query are the usual select, project,
join and union, plus a new operator choice, which represents the ability
of the system to choose any of the identical, replicated copies of a given
relation. The System R* method of optimization considers all ways to
evaluate the nodes of the compacted tree, taking into account
1. the various orders in which an operator like union or join of many
relations can be performed as a sequence of binary steps,
2. the various sites at which each node of the tree could have its result
computed, and
3. several different ways that the join could be computed.
When the System R* method considers each of the options listed above, it
must evaluate the cost of these methods. The actual cost function used in
System R* is quite complex, it takes account of the computation cost at a
site as well as the transmission cost between sites. Further, the System
R* method for optimizing distributed queries exhaustively considers all
possible strategies for query evaluation that can be built from a fixed
collection of options. The System R* is described in greater detail in
Principles of Database Systems, by Jeffrey D. Ullman, Computer Science
Press, 1982, which is incorporated herein by reference.
The System R* method has the disadvantage that the computation time spent
in determining an optimum query execution strategy may exceed the time to
be saved by execution of the optimal strategy. Further, the System R*
method requires the site at which optimization is being performed to have
access to sufficient information about the various sites so that the
computation cost at the sites can be calculated. This site information has
to be transmitted to the optimization site and must be updated as
conditions at the various sites change. In addition, the System R*
optimization method requires that all, global and local, optimization must
be performed before any execution of the query can be started.
OBJECTS OF THE INVENTION
Accordingly, it is an object of the present invention to provide a
distributed database query optimization method that allows some of the
optimization to be done locally.
It is another object of the present invention to provide a distributed
database query optimization method that separates optimization into global
and local processes allowing multiple local optimizations to be done in
parallel thereby decreasing the elapsed optimization time.
It is a further object of the present invention to provide a distributed
database query optimization method which has a less exhaustive global
optimization process and allows some of the optimization to be done
locally.
It is a yet further object of the present invention to provide a
distributed database query optimization method which does not require
local database parameters to be present at all sites which perform query
optimization.
It is a still further object of the present invention to provide a
distributed database query optimization method which performs local
optimization at the local sites using current local database parameters.
This invention is pointed out with particularly in the appended claims. An
understanding of the above and further objects and advantages of this
invention can be obtained by referring to the following description taken
in conjunction with the drawings.
SUMMARY OF THE INVENTION
A method for determining an optimal execution strategy for a request
comprising query, update or transaction operations on a distributed
database system having multiple data sites interconnected by communication
lines in which optimization is done by inputting the request in the form
of a compacted tree having leaf and non-leaf nodes and then for each node
in the compacted tree: (1) doing materialization planning to choose the
data sites from which data is to be accessed; (2) doing local process
planning to determine which operations can be processed locally at each
site determined by materialization planning and estimate parameters of
resultant data from the local operations; (3) doing non-local process
planning to determine strategy for remaining operations which can not be
performed locally without transmission of data between sites; and then (4)
building the requests for each data site in the distributed database
system at which data is processed by access or manipulation. After
performing the optimization, the built requests are output to a process
that coordinates the execution of the built requests at sites within the
distributed database system.
BRIEF DESCRIPTION OF THE DRAWINGS
The manner in which the method of the present invention is performed can
best be understood in light of the following detailed description together
with the accompanying drawings in which like reference numbers identify
like elements in the several figures and in which:
FIG. 1 is a diagram of the five schema architecture of the distributed data
system.
FIG. 2 is a diagram of a distributed data system logical architecture
showing data processors and application processors connected by a
communications network.
FIG. 3 is a diagram of the steps associated with processing a distributed
data system query, update or transaction.
FIG. 4 is a diagram of an example Entry-Category-Relationship (ECR) schema.
FIG. 5 is a diagram of an example relational schema corresponding to the
example ECR schema of FIG. 4.
FIG. 6 is a diagram of an example binary tree.
FIG. 7 is a compacted tree corresponding to the binary tree of FIG. 6.
FIG. 8 is a flow diagram of the optimization method of the present
invention.
DESCRIPTION OF THE PREFERRED EMBODIMENT
In a Distributed Database System (DDS), database management and transaction
management are extended to a distributed environment.
This distributed environment may provide multiple user interfaces including
non-procedural query interfaces and programming language interfaces;
distributed database processing; an active Data Dictionary/Directory
(DD/D) system that may be distributed; multiple database managers; and a
comprehensive set of user support facilities.
The Materialization and Access Planning (MAP) process for a distributed
query, update, or transaction is an important part of the processing of
the query, update, or transaction. Materialization and access planning
results in a strategy for processing a query, update, or transaction in
the Distributed Database Management System (DDBMS). Materialization
consists of selecting data copies used to process the query, update, or
transaction. This step is necessary since data may be stored at more than
one site (i.e., computer) on the network. Access planning consists of
choosing the execution order of operations and the actual execution site
of each operation.
The MAP process of the preferred embodiment selects a good materialization
for all data in the query, update, or transaction. Three different access
planning methods are provided, GENERAL (RESPONSE), GENERAL (TOTAL), and
Initial Feasible Solution (IFS). GENERAL (RESPONSE) and GENERAL (TOTAL)
build optimal execution strategies using a cost function of the
communications costs. Local processing costs are not considered. GENERAL
(RESPONSE) builds an execution strategy with minimum response time, the
time elapsed between the start of the first transmission and the time at
which the result arrives at the required computer. GENERAL (TOTAL) builds
an execution strategy with minimum total time, the sum of the time of all
transmissions. IFS does not perform any optimization.
The IFS solution is optimal when small amounts of data are required for
transmission by the user request. The IFS solution always has the smallest
number of messages; GENERAL (RESPONSE) and GENERAL (TOTAL) consider
solutions with additional messages that result in additional local
processing in an attempt to minimize the amount of data transmitted, i.e.,
these solutions have more messages than the IFS solution, but the sum of
the size of the messages is smaller. When small amounts of data in the
database are being accessed, the solution of IFS is optimal.
Further performance improvements can be obtained by improving the cost
function used by GENERAL (RESPONSE) and GENERAL (TOTAL). The current
function only includes communications costs. The communications costs were
assumed to be a linear function of the message size, and the same cost
function was used for sending a message between any two sites. Local
processing costs could be included in the cost function. In addition, the
load and capacity of each site are very important to overall performance.
For the DDS system environment in which the sites are locally distributed,
these factors are not negligible compared to communications. For
geographically distributed environments, they will be much less important.
GENERAL (RESPONSE) should be used when optimal user response time is the
goal of the system; GENERAL (TOTAL) should be used when optimal total
communications time is the goal of the system. The choice between GENERAL
(RESPONSE) and GENERAL (TOTAL) depends on the particular system
applications and goals. The cost functions used by GENERAL (RESPONSE) and
GENERAL (TOTAL) can be tuned for each system. The communications cost
function can be determined from system performance data. For a
geographically distributed system, this function may be sufficient. In a
locally distributed system, the inclusion of local processing costs, load,
and capacity of the sites should be considered. The MAP method of the
preferred embodiment is also described in "A Study of Materialization and
Access Planning", by P. A. Dwyer, CSC-84-10: 8212, Honeywell Inc. Computer
Sciences Center, February 1984, which is incorporated herein by reference.
The Distributed Database System Architecture
In this section, the information architecture, system architecture, and
transaction management of DDS are described. The information architecture
describes how information is to be conceptually modeled, represented, and
allocated. The system architecture describes processor characteristics,
processor locations, communications, and system level protocols for
distributed execution. Transaction management describes the processing
steps required for a query, update, or transaction using the information
and system architectures.
Information Architecture
The ANSI/SPARC 3-schema architecture described in "The ANSI/X3/SPARC Report
of the Study Group on Data Base Management Systems", edited by A. Klug and
D. Tsichritzis, AFIPS Press, 1977, which is incorporated herein by
reference, a database system is extended in DDS to provide for a
distributed system. In the 3-schema architecture, the three levels of
database description are the conceptual schema, the external schemas, and
the internal schema. The conceptual schema is a description of the portion
of the world being modeled by the database. An external schema is a
consistent and logical subset of the conceptual schema; it is the
description of a user view of the database. The internal schema describes
the system representation of the database.
This 3-schema proposal for centralized DBMSs has been extended to the
5-schema architecture in DDS as illustrated in FIG. 1 as described in
"Transaction Processing in a Multi-schema Distributed Database Testbed
System" by C. C. Devor, R. Elmasri and S. Rahimi, HR-81-253: 17-38,
Honeywell Corporate Computer Sciences Center, February 1981, which is
incorporated herein by reference. The external schemas 4a, 4b and 4c are
descriptions of the users 2a, 2b, 2c and 2d views, and serve as the user
interface. The conceptual schema 6 is a semantic description of the total
distributed database. The global representation schema 8 is a synthetic
representation of the total database content. The local representation
schemas 12a, 12b and 12c are syntactic descriptions of the data resident
at specific nodes. The local internal schemas 14a, 14b and 14c are the
local database management system implementations that manage the database
fragments 16a, 16b and 16c.
In DDS, the global representation schema 8 and the local representation
schemas 12a, 12b and 12c are described with the relational model, as
described in "A Relational Model for Large Shared Data Banks" by E. F.
Codd, Communications of the ACM, Vol. 13, No. 6, June 1970, which is
incorporated herein by reference, because the relational data model is
convenient for describing the distribution and replication of data across
the sites of the DDBMS, and data described in this model can be accessed
using a nonprocedural set oriented data manipulation language as opposed
to a procedural, record at a time data manipulation language. In the DDS
implementation of the preferred embodiment, the local internal schemas
14a, 14b and 14c use the network model because the local database
management systems are IDS/II systems described in GCOS 6 IDS/II User's
Guide, Honeywell Information Systems Inc., Waltham, Mass., November 1978,
which is incorporated herein by reference. Other DBMSs with possibly
different local internal schemas can be used.
System Architecture
DDS consists of a set of Application Processors (APs) 20a, 20b and 20c and
Data Processors (DPs) 24a, 24b and 24c connected by a communications
network 22 (See FIG. 2). The APs control the interface to the users 24a,
24b and 24c and all transaction management, and the DPs perform the data
management of the databases 26a, 26b and 26c. DDS of the preferred
embodiment is implemented on Honeywell Level 6 computers 24a, 24b and 24c
with the GCOS6 MOD400 operating system. Each Level 6 computer 24a, 24b
and 24c can support zero or more APs and/or zero or more DPs. Additional
details of the DDS system architecture can be found in "Implementing the
Distributed Database Testbed System (DDTS) in a Multi-computer GCOS 6/MOD
Environment", by C. C. Devor, M. Spinrad, P. A. Tovo, K. Koeneman, and M.
Smith, HR-81-264: 17-38, Honeywell Corporate Computer Sciences Center,
June 1981, which is incorporated herein by reference.
Transaction Management
The user's view of DDS is built around the concepts of a query, an update,
or a transaction. The steps involved in processing a query, an update, or
a transaction are illustrated in FIG. 3. A query, an update, and a
transaction in DDS are defined in the GORDAS high-level data manipulation
and description language as described in "GORDAS: A Data Definition, Query
and Update Language for the Entity-Category-Relationship Model of Data",
by R. Elmasri, HR-81-250: 17-38, Honeywell Corporate Computer Sciences
Center, January 1981, which is incorporated herein by reference. GORDAS
(Graph ORiented DAta Selection Language) is a language for graph-oriented
data models such as the Entity-Relationship (ER) model as described in
"The Entity-Relationship Model: Towards a Unified View of Data", by P.
P-S. Chen, ACM Transactions on Database Systems, Volumn 1, Number 1, March
1976, which is incorporated herein by reference and the
Entity-Category-Relationship (ECR) model described in "The
Entity-Category-Relationship Model of Data" by J. A. Weeldreyer,
HR-80-250: 17-38, Honeywell Corporate Computer Sciences Center, March
1980, which is incorporated herein by reference. A transaction expressed
in GORDAS 30 is simply more than one GORDAS query or update to be
processed as a group. Currently, GORDAS transactions cannot contain
control constructs such as, IF-THEN-ELSE-ENDIF and DO-WHILE. A query or an
update in GORDAS 30 is translated to a single cluster tree that is an
internal non-procedural representation of the query or update (see FIG.
3). A transaction expressed in GORDAS 30 is translated to a list of
cluster trees 34 by Transaction and Integrity Control (TIC) 32; there is
one cluster tree per query or update in the transaction. The cluster tree
operators operate on the data described by the global representation
schema.
After translation of a query, an update, or a transaction, materialization
and access planning 38 take place. For each query or update, a
materialization is chosen and an execution strategy 40 is built. The
execution strategy is expressed in terms of the local representation
schemas. Both TIC 32 and MPA 38 use information from the database
dictionary directory (DD/D) 36 to perform their functions.
A Distributed Execution Monitor (DEM) 42 coordinates the processing of the
execution strategy. The DEM 42 sends commands that specify the local
relational operations to one or more Local Execution Monitors (LEMs) 44a,
44a for local optimization and execution. The DEM 42 coordinates
synchronization among the commands, and is responsible for commitment and
concurrency control of updates at multiple sites. The local relational
operators are translated to the operations on the local DBMS 46a and 46b
by the Local Operations Module (LOM) 45a and 45b and are executed against
the databases 48a, 48b.
Compiling and saving a transaction for later execution offers an
alternative scheme. "Compilation" consists of processing the transaction
through materialization and access planning 38. At this point the
operations are expressed in terms of the local representation schemas.
They are stored at the local nodes for later execution. At execution time
the DEM 42 sends commands to the local execution monitors 44a and 44b
specifying which stored oper | | |