|
Description  |
|
|
TECHNICAL FIELD
The present invention relates generally to query optimization, and more
specifically to a system and method for generating uniqueness properties
of intermediate query subexpressions.
BACKGROUND ART
Relational DataBase Management System (RDBMS) products using a Structured
Query Language (SQL) interface are well known in the art. The SQL
interface has evolved into a standard language for RDBMS products and has
been adopted as such by both the American Nationals Standard Organization
(ANSI) and the International Standards Organization (ISO). In RDBMS
products, all data is externally structured into tables. The SQL interface
allows users to formulate relational operations on the tables either
interactively, in batch files, or embedded in host languages such as C,
COBOL, etc. Operators are provided in SQL that allow the user to
manipulate the data, wherein each operator operates on either one or two
tables and produces a new table as a result. The power of SQL lies in its
ability to link information from multiple tables or views together to
perform complex sets of procedures with a single statement.
The current state-of-the-art in query optimization provides few solutions
for optimizing query expressions involving joins, outer joins and full
outer joins. Some have suggested that queries could be optimized if
certain uniqueness information is known about relations in a given
expression. In particular, C. J. Date, and Hugh Darwen, in "Relational
Database Writings 1989-1991," pp 133-153, indicate that uniqueness
information can be used to optimize database queries.
However, there is little guidance regarding how to determine this much
needed uniqueness information. Thus, if full advantage of certain
optimization techniques is to be obtained, a technique for determining
uniqueness information is needed.
DISCLOSURE OF INVENTION
The present invention is directed toward a system and method for
determining uniqueness properties of an expression so that queries
containing join and outer join operations can be optimized. According to
the invention, two processes are implemented to determine the uniqueness
properties of a given expression.
The first process, termed an augmented unique process, is a recursive
process applied to the roots of the expression in question. The augmented
unique process is used to call a second process, termed a uniqueness
process. The uniqueness process is used to determine uniqueness properties
of the elements of the expression in question. In operation, the augmented
unique process begins at the root of the expression in question. The
augmented unique process calls an appropriate procedure depending on the
child, or children, of the root. Subsequent procedures of the augmented
unique process are called in a recursive manner depending on the progeny
of the root.
When a base relation is reached, the augmented unique process calls an
appropriate procedure of the uniqueness process to determine the
uniqueness properties of that base relation. The augmented unique process
`unwinds` by calling an appropriate procedure of the uniqueness process
for each of the progenitors of the base relation. If the root of the
expression in question is a binary operator, the processes described above
are repeated for the other progeny of the binary operator.
A feature of the invention is that the uniqueness properties determined for
the relations in the expression can be used for numerous applications. For
example, in one application, the uniqueness properties are used to
determine whether an intersect all operation can be replaced by an
intersect distinct operation, thus simplifying the expression.
The uniqueness properties provide important information pertaining to the
attributes of the expression. For each relation, base or otherwise, the
uniqueness properties determined can include: the weakly bound set, the
strongly bound set, the predictor set and the key set.
Further features and advantages of the present invention, as well as the
structure and operation of various embodiments of the present invention,
are described in detail below with reference to the accompanying drawings.
BRIEF DESCRIPTION OF DRAWINGS
The present invention is described with reference to the accompanying
drawings. In the drawings, like reference numbers indicate identical or
functionally similar elements. Additionally, the left-most digit(s) of a
reference number identifies the drawing in which the reference number
first appears.
FIG. 1 is a diagram illustrating an exemplary computer hardware embodiment
in which the present invention can be implemented;
FIG. 2 is an operational flow diagram illustrating a method of interpreting
and executing SQL statements;
FIG. 3 is an operational flow diagram illustrating a method of interpreting
and executing SQL statements embedded in source code;
FIG. 4 is an operational flow diagram illustrating an application of an
augmented unique process according to one embodiment of the invention;
FIG. 5 is an operational flow diagram illustrating steps associated with
the augmented unique process according to one embodiment of the invention;
FIG. 6 is an operational flow diagram illustrating an application of
procedure two of the uniqueness process according to one embodiment of the
invention;
FIG. 7 is an operational flow diagram illustrating an application of
procedure three of the uniqueness process according to one embodiment of
the invention;
FIG. 8 is an operational flow diagram illustrating an application of
procedure four of the uniqueness process according to one embodiment of
the invention;
FIG. 9 is an operational flow diagram illustrating an application of
procedure five of the uniqueness process according to one embodiment of
the invention; and
FIG. 10 is an operational flow diagram illustrating an application of
procedure six of the uniqueness process according to one embodiment of the
invention.
BEST MODE FOR CARRYING OUT THE INVENTION
The present invention is directed toward a system and method for optimizing
SQL queries. According to the invention, uniqueness properties of each
subexpression of an SQL query are determined using a recursive process
called the augmented unique process. The augmented unique process
recursively determines the uniqueness properties of each relation using a
uniqueness process.
Further according to the invention, a concept of strong and weak bindings
is used to propagate uniqueness information up an expression tree for a
query. Then, for example, such information can be used to eliminate
unnecessary duplicate removal operations, convert Intersect All to
Intersect Distinct .andgate..sub.a to .andgate..sub.d which can then be
converted to joins and/or used for "what-if" .delta.-pushdown to do
subtree pruning. These transformations can provide orders of magnitude
performance improvement in query execution times.
1. Environment of the Invention
FIG. 1 illustrates an exemplary computer hardware environment that in which
the present invention can be implemented. In this exemplary environment, a
computer system 102 is comprised of one or more processors connected to
one or more electronic storage devices 104 and 106, such as disk drives,
on which one or more databases are stored.
Operators of the computer system 102 use a standard terminal interface 108,
such as IMS/DB/DC, CICS, TSO, or other interface, to perform various
search and retrieval functions, termed queries, against the databases.
These functions are typically performed using Relational DataBase
Management System (RDBMS) software that incorporates the Structured Query
Language (SQL) standard. In one embodiment of the present invention, the
RDBMS software comprises the DB2 product offered by IBM for the MVS or
OS/2 operating systems. As will become apparent to those skilled in the
relevant art, the present invention has application to any RDBMS that uses
SQL or any other data manipulation language.
The DB2 architecture includes three major components: (1) the IMS Resource
Lock Manager (IRLM) 110, the Systems Services module 112, and the Database
Services module 114. The IRLM 110 handles locking services for concurrency
control. These locking services are provided because DB2 treats data as a
shared resource, thereby allowing any number of users to access the same
data simultaneously. Therefore, concurrency control is required to isolate
users and to maintain data integrity.
The Systems Services module 112 controls the overall DB2 execution
environment, including managing log data sets 106, gathering statistics,
handling startup and shutdown, and providing management support.
At the center of the DB2 architecture is the Database Services module 114.
The Database Services module 114 contains several submodules, including
the Relational Database System (RDS) 116, the Data Manager 118, the Buffer
Manager 120 and other components 122 such as an SQL compiler/interpreter.
These modules support the functions of the SQL language, i.e., definition,
access control, retrieval, and update of user and system data.
FIG. 2 is a flowchart illustrating the steps necessary for the
interpretation and execution of SQL statements in an interactive
environment. Block 202 represents the input of SQL statements into the
computer from the user. Block 204 represents the step of compiling or
interpreting the SQL statements. An optimization function within block 204
may reorder the SQL query for optimum performance.
Block 206 represents the step of generating from the compiled SQL
statements a compiled set of runtime structures, termed an application
plan. Generally, the SQL statements received as input from the user
specify only the data that the user wants, but do not specify how to get
to that data. This step considers both the available access paths
(indexes, sequential reads, etc.) and system-held statistics on the data
to be accessed (the size of the table, the number of distinct values in a
particular column, etc.), to choose what it considers to be the most
efficient access path for the query. Block 208 represents the execution of
the application plan, and block 210 represents the output of the results
of the application plan to the user.
FIG. 3 is a flowchart illustrating the steps necessary for the
interpretation and execution of SQL statements embedded in source code
according to the present invention. Block 302 represents program source
code containing a host language (such as COBOL or C) and embedded SQL
statements. The program source code is then input to a pre-compile step
304. There are two outputs from the pre-compile step 304: a modified
source module 306 and a database request module (DBRM) 308. The modified
source module 306 contains host language calls to DB2, which the
pre-compile step 304 inserts in place of SQL statements. The DBRM 308
consists of the SQL statements from the program source code 302.
A compile and link-edit step 310 uses the modified source module 306 to
produce a load module 312, while an optimize and bind step 314 uses the
DBRM 308 to produce a compiled set of runtime structures for the
application plan 316. As indicated above in conjunction with FIG. 2, the
SQL statements from the program source code 302 specify only the data that
the user wants, but not how to get to it. The optimize and bind step 314
may reorder the SQL query so as to optimize the query. Thereafter, the
optimize and bind step 314 considers both the available access paths
(indexes, sequential reads, etc.) and system held statistics on the data
to be accessed (the size of the table, the number of distinct values in a
particular column, etc.), to choose what it considers to be the most
efficient access path for the query. The load module 312 and application
plan 316 are then executed together at step 318.
The present invention is described in terms of this example environment.
Description in these terms is provided for convenience only. It is not
intended that the invention be limited to application in this example
environment. In fact, after reading the following description, it will
become apparent to a person skilled in the relevant art how to implement
the invention in alternative environments.
2.0 Conventional Optimization Techniques
In a given query expression, information about the uniqueness (or,
"duplicate-free" status) of a sub(result) can be used to influence the
optimal plan selection for the query. For example, the paper "Exploiting
uniqueness in query optimization," by Paulley, G. N. and Larson, P-A.,
CASCON, pp. 804-822, Vol. II, October 1993, illustrates the case when an
expensive, duplicate elimination clause such as SQL's SELECT DISTINCT can
be converted to a much cheaper SELECT if it can be determined that the
result is already duplicate free.
As another example the paper "Extensible/rule based query rewrite
optimization in Starburst," by Pirahesh et al., SIGMOD, pp. 39-48, San
Diego, Calif., June 1992, illustrates how special rules are allowed to
fire in a rule-based optimizer when a "one-tuple condition" is determined.
Uniqueness propagation exploits available information about the uniqueness
of tuples in base relations (enforced, e.g., through integrity
constraints, key definitions, index uniqueness, etc.), along with
properties of algebraic expressions that preserve and/or generate unique
results, to influence the optimization of the query. For example, if it is
known that one of the operands of an intersect all operation is an
expression with unique tuples, then intersect all can be replaced by
intersect distinct. See, for example, Paulley, G. N. and Larson, P-A.,
"Exploiting uniqueness in query optimization," CASCON, pp. 804-822, Vol.
II, October 1993.
3.0 Optimization Using Uniqueness Propagation
Uniqueness propagation as described by Paulley and Larsen can be exploited
as follows for query optimization:
1. If one of the operands of the intersect all is known to have unique
tuples, a SELECT DISTINCT . . . can be applied to this sub-expression,
which may allow for the "pruning" away of some expensive outer joins from
this sub-expression. See, for example, Chen, A. L. P., "Outerjoin
optimization in multidatabase systems," 2nd International Symposium on
Databases in Parallel and Distributed Systems, 1990
2. The newly transformed intersect distinct operator could be transformed
into a join operator. See for example, Pirahesh et al., "Extensible/rule
based query rewrite optimization in Starburst," SIGMOD, pp. 39-48, San
Diego, Calif., June 1992.
3. The join operator could then be part of an extensive join, outer join,
full outer join reorderability evaluation algorithm.
This invention defines the concept of weak bindings in the context of
queries involving joins, outer joins, and intersection operations. This,
together with new uniqueness determination identities, is used to present
a technique unique for determining the unique set of a (sub)tree
corresponding to a relational (sub-)expression.
3.1 Comparison With Conventional Techniques
(Paulley, G. N. and Larson, P-A., "Exploiting uniqueness in query
optimization," CASCON, pp. 804-822, Vol. II, October 1993) provides an
algorithm to determine when the results of a join operation are duplicate
free. That work is aimed at converting SQL's SELECT DISTINCT to SELECT for
a class of queries that includes selections, projections, and joins,
together with intersection operations. That work does not consider queries
involving outer join, or full join operations, nor does it describe how
uniqueness results can propagate across join, outer join, full outer join,
and intersection operators.
4. Basic definitions
Before describing the invention in detail, some important definitions are
set forth.
Tuple
A tuple t is a mapping from a finite set of attributes, R.orgate.V, to a
set of atomic (possibly null) values where R is non-empty set of real
attributes and V is a non-empty set of virtual attributes, R.andgate.V={},
and the mapping t maps at least one virtual attribute v.di-elect cons.V to
a non-null value. For an attribute set X, t[X] represents the values
associated with attributes X under the mapping t, where X.OR
right.R.circleincircle.V and X.noteq.{}.
The set of real attributes of a tuple t is the same as the schema of the
tuple in the traditional relational algebras. These attributes are
accessible to users and can be referenced externally, e.g., in user
queries, etc. On the other hand, virtual attributes are (mostly) used to
provide unique conceptual tuple-ids to tuples, and are not available for
user queries. In some cases, when real attributes are not sufficient to
identify tuples, virtual attributes have to be manipulated explicitly by
the RDBMS.
Relation
A relation r is a triple (R, V, E) where R is a non-empty set of real
attributes, V is a non-empty set of virtual attributes, and E, the
extension of relation r, is a set of tuples such that (.A-inverted.t.sub.1
.di-elect cons.E) (.A-inverted.t.sub.2 .di-elect cons.E) (t.sub.1
.noteq.t.sub.2 t.sub.1 [V].noteq.t.sub.2 [V]). R.orgate.V is called the
schema of relation r. We use sch(r) to denote the set of real attributes
in the schema of r. Virtual and real attributes of relations are used to
provide a conceptual framework for relations with duplicates.
Predicate
A predicate p over a set of real attributes sch(p), called the schema of p,
is a total mapping from tuples to the Boolean values {TRUE, FALSE}, where
sch(p) is the minimum set of attributes such that for all tuples t.sub.1
and t.sub.2 (t.sub.1 [sch(p)]=t.sub.2 [sch(p)]p(t.sub.1)=p(t.sub.2)). For
a tuple t with real schema .OR left.sch(p), p(t) is TRUE if and only if
(.A-inverted.A.di-elect cons.sch(p)) (substitution of t[A] for an A in p
causes it to evaluate to TRUE).
Null-intolerant
A predicate p is null-intolerant if p evaluates to FALSE for tuples
undefined on one or more attributes in sch(p). More formally, p is
null-intolerant if (.A-inverted.t)(.E-backward..di-elect cons.sch(p))
(t[A]=NULLp(t)=FALSE).
Throughout this document it is assumed that predicates are null-intolerant,
and predicates do not contain disjunction.
5. Algebraic operators and expressions
Following are definitions for algebraic operators and notational
conventions used in SQL queries. These definitions are helpful in
understanding later portions of this document.
5.1 Algebraic operators
In the following, let r=<R, V, E>, r.sub.1 =<R.sub.1, V.sub.1, E.sub.1 >
and r.sub.2 =<R.sub.2, V.sub.2, E.sub.2 > denote relations such that
R.sub.1 .andgate.R.sub.2 ={} and V.sub.1 .andgate.V.sub.2 ={}.
Projection
The projection, .pi..sup.a .sub.X (r), of relation r onto attributes X is
the relation <X, V, E'> where X.OR right.R and
E'={t.v.vertline.(.E-backward.t'.di-elect cons.E) (t=t'[X]v=t'[V])}.
The .pi..sup.R operator is a projection operator that does not remove
"duplicates" in the real attributes part of the source expression. The
superscript a in .pi..sup.R denotes the fact that all the virtual
attributes of the source expression are included in the virtual schema of
the result expression. For ease of reading, the superscript a is omitted
from .pi..sup.R when there is no ambiguity and the operator is simply
written as .pi..
The projection.sup.c, .pi..sup.c.sub.X.sbsb.R.sub.X.sbsb.V (r) , of
relation r on attributes X.sub.R X.sub.V is the relation <X.sub.R,
X.sub.V, E'>, where X.sub.R .OR right.R, X.sub.V .OR right.V and:
E'={t.v.vertline.(.E-backward.t'.di-elect cons.E)(t=t'[X.sub.R
]v=t'[X.sub.V ])}.
In contrast to .pi., .pi..sup.c allows a choice in the selection of the
virtual attributes from the source expression. This operation is useful in
defining the outer join operator defined below.
Delta Projection
The delta-projection, .delta..sub.X.sbsb.R.sub.X.sbsb.V (r) , of r on
attributes X.sub.R X.sub.V is the relation <X.sub.R X.sub.V, V.sub.new,
E'>, where X.sub.R .OR right.R, X.sub.V .OR right.V, and:
E'={t.vertline.(.E-backward.t'.di-elect cons.E) (t[X.sub.R X.sub.V
]=t'[X.sub.R X.sub.V ]t[V.sub.new ]=a new, unique value)}.
The .delta. operator models the SELECT DISTINCT . . . construct of SQL
which allows elimination of "duplicates" from a relation. The .delta.
operator is called the distinct projection operator and produces a result
relation which has distinct values on the attributes X.sub.R X.sub.V and a
new virtual attribute.
Selection
The selection, .sigma..sub.p (r), of relation r on predicate p is the
relation <R, V, E'>, where sch(p).OR right.R, and:
E'={t.vertline.t.di-elect cons.Ep(t)}
Cross Product and Difference
The cross product, r.sub.1 .times.r.sub.2, and difference, r.sub.1
-r.sub.2, of relations r.sub.1 and r.sub.2 is the relation <R.sub.1
R.sub.2, V.sub.1 V.sub.2, E.sub.1 .times.E.sub.2 > and <R, V, E.sub.1
-E.sub.2 >, respectively.
Join
The join, r.sub.1 r.sub.2, or relations r.sub.1 and r.sub.2 is the relation
<R.sub.1 R.sub.2, V.sub.1 V.sub.2, E'>, where:
E'={t.vertline.t.di-elect cons.(E.sub.1 .times.E.sub.2)p(t)}.
Equi-join predicate
The predicate p is called an equi-join predicate if it consists of
conjuncts of clauses of the form X.sub.i =Y.sub.j, where X.sub.i and
Y.sub.j are attributes of the relations referenced by the predicate.
Union Compatible
A relation r.sub.1 =<{A.sub.1, A.sub.2, . . . , A.sub.n }, V.sub.1, E.sub.1
> is said to be union compatible with relation r.sub.2 =<{B.sub.1,
B.sub.2, . . . , B.sub.n }, V.sub.2, E.sub.2 > iff:
domain(A.sub.i)=domain (B.sub.i), for 1.ltoreq.i.ltoreq.n.
That is, the real attributes of r.sub.1 and r.sub.2 can be ordered such
that the domains of the first real attributes of r.sub.1 and r.sub.2 are
the same, the domains of the second real attributes of r.sub.1 and r.sub.2
are the same, and so on.
Union and Outer Union
The union, r.sub.1 .orgate.r.sub.2, of union compatible relations r.sub.1
and r.sub.2 is the relation (R, V, E.sub.1 .orgate.E.sub.2). The outer
union, r.sub.1 r.sub.2, is the relation (R.sub.1 .orgate.R.sub.2, V.sub.1
.orgate.V.sub.2, E'), where:
E'={t.vertline.(.E-backward.t'.di-elect cons.E.sub.1)(t[R.sub.1 V.sub.1
]=t'(.A-inverted.A .di-elect cons.(R.sub.2 -R.sub.1).orgate.(V.sub.2
-V.sub.1))(t[A]=NULL))
(.E-backward.t".di-elect cons.E.sub.2)(t[R.sub.2 V.sub.2
]=t"(.A-inverted.A.di-elect cons.(R.sub.1 -R.sub.2).orgate.(V.sub.1
-V.sub.2))(t[A]=NULL))}.
Note that rows in r.sub.1 r.sub.2 are padded with NULLs for those
attributes that are not present either in relation r.sub.1 or in relation
r.sub.2.
The following notational convention is useful for the following
definitions. If r=<R, V, E> is a relation, the term ext(r) is used to
represent E, the extension part of r. In order to maintain readability in
the definitions, in the following few definitions we adopt the shorthand
convention of simply writing E.sub.1 E.sub.2 instead of ext(r.sub.1
r.sub.2), or .pi..sub.x (E.sub.1 E.sub.2) instead of ext(.pi..sub.X
(r.sub.1 r.sub.2)), etc.
Left Outer Join
The left outer join, r.sub.1 r.sub.2, is the relation <R.sub.1 R.sub.2,
V.sub.1 V.sub.2, E'>, where
E'=(E.sub.1 E.sub.2)(E.sub.1 -.pi..sup.c.sub.R.sbsb.1.sub.V.sbsb.1 (E.sub.1
E.sub.2)).
Preserved Relation and Null Supplying Relation
Relation r.sub.1 in the above definition of left outer join, r.sub.1
r.sub.2, is called the preserved relation and relation r.sub.2 is called
the null supplying relation. The right outer join, r.sub.1 r.sub.2, can
similarly be defined in which r.sub.1 is the null supplying relation and
r.sub.2 is the preserved relation.
Full Outer Join
The full outer join, r.sub.1 r.sub.2, of relations r.sub.1 and r.sub.2 is
the relation <R.sub.1 R.sub.2, V.sub.1 V.sub.2, E'>, where
E'=(E.sub.1 E.sub.2)(E.sub.1 -.pi..sup.c.sub.R.sbsb.1.sub.V.sbsb.1 (E.sub.1
E.sub.2))(E.sub.2 -.pi..sup.c.sub.R.sbsb.2.sub.V.sbsb.2 (E.sub.1 E.sub.2))
.
Equivalent
Values v.sub.1 and v.sub.2 are said to be equivalent, denoted v.sub.1
.cuberoot.v.sub.2 if either both v.sub.1, v.sub.2 are non-NULL and v.sub.1
=v.sub.2, or if both are NULL.
Intersect Distinct
The intersect distinct, r.sub.1 .andgate..sub.d r.sub.2, of two union
compatible relations r.sub.1 and r.sub.2 is the relation <{C.sub.1,
C.sub.2, . . . , C.sub.n}, V.sub.new, E'>, where
domain(A.sub.i)=domain(B.sub.i)=domain(C.sub.i), 1.ltoreq.i.ltoreq.n, and
E'={t.vertline.(.E-backward.t.sub.1 .di-elect
cons.E.sub.2)(.A-inverted.i)(t[C.sub.i ].tbd.t.sub.1 [A.sub.i ]t[C.sub.i
].tbd.t.sub.2 [B.sub.i ]t[V.sub.new ]=a new, unique value)}
Intersect distinct retains the common tuples in relations r.sub.1 and
r.sub.2. If a tuple t.sub.1 .di-elect cons.E.sub.l contains null value in
attribute A.sub.i and t.sub.2 .di-elect cons.E.sub.2 contains null value
in attribute B.sub.i and identical no-null values in the remaining
attributes then t.sub.1 and t.sub.2 are considered equivalent and only a
single copy is retained in the result. If either operand contains
duplicates then only one copy of the common tuple is retained in the
result.
Intersect All
By contrast, the operand intersect all, .andgate..sub.a, retains "some" of
the duplicate copies of the common tuples. More precisely, in two union
compatible relations r.sub.1 and r.sub.2, if a common tuple t appears i
times in r.sub.1 and j times in r.sub.2, then t appears min{i,j} times in
the result relation r.sub.1 .andgate..sub.a r.sub.2.
5.2 Expressions and expression trees
The following provides a recursive definition of expressions.
1. If r=<R, V, E> is a relation, then r is an expression. Henceforth,
whenever there is no ambiguity, the shorthand notation e is used to
represent the expression e=<R, V, E>,
2. If e=<R, V, E> is an expression then .pi..sup.R.sub.X (e) is an
expression, where X.OR right.R.
3. If e=<R, V, E> is an expression then .delta..sub.X.sbsb.R.sub.X.sbsb.V
(e) is an expression, where X.sub.R .OR right.R and X.sub.V .OR right.V.
4. If e=<R, V, E> is an expression then .sigma..sub.p (e) is an expression,
where sch(p).OR right.R.
5. If e.sub.1 =<R.sub.1, V.sub.1, E.sub.1 >and e.sub.2 =<R.sub.2, V.sub.2,
E.sub.2 > are expressions then e.sub.1 e.sub.2 is an expression, where:
.circle-w/dot..di-elect cons.{, , , , .andgate..sub.d, .andgate..sub.a },
and p is a predicate such that sch(p).OR right.R.sub.1 R.sub.2, and
sch(p).andgate.R.sub.1 .noteq.{}, and sch(p).andgate.R.sub.2 .noteq.{}.
6. If e=<R, V, E> is an expression then so is (e), where (e)=<R, V, E>.
This parenthesization is required due to the fact that some of the binary
operations defined above are not fully associative. Parenthesization
determines the evaluation order so that expressions can be evaluated
un-ambiguously. However, whenever there is no ambiguity, parenthesis will
be dropped freely.
Expression Tree
An expression can also be represented by a corresponding expression tree in
which the inner nodes are the operators occurring in the expression and
the leaves are the base relations referenced in the expression.
Let denote one of the binary operators defined in the previous section,
then an expression tree T with left subtree T.sub.1, right subtree T.sub.r
and root is denoted by (T.sub.1 T.sub.r).
Henceforth, the two equivalent representations are used interchangeably.
6. Uniqueness Properties
As stated above, the invention is directed toward a system and method for
generating uniqueness information for relations in an expression. In one
application, this information is used to optimize a query, such as an SQL
query. To accomplish this objective, several innovations can be used as
described below. These innovations include:
The concept of weak binding in the context of outer joins;
Powerful, yet simple, new identities for key set determination;
A process, termed `uniqueness process` which provides a method for
determining the unique set of a subtree corresponding to a relational
sub-expression; and
A process, termed `augmented-unique process` for selecting roots of the
expression to which the uniqueness process is applied. Roots are selected
in the post order.
A unique set is the set of those attribute groups that uniquely identify
tuples in the expression. For base relations, this is the same as the key
set; for other expressions, the unique set may contain attributes which
are not part of the expression's schema. In this latter case, the
uniqueness test boils down to simply determining whether the schema of the
expression contains a member of the unique set.
This point may be best illustrated by the following example: Consider the
relation r=<R, V, E> having key K, and the expression .pi..sub.X (r). Then
the unique set for both r and .pi..sub.X (r) contains the key K. However,
tuples in .pi..sub.X (r) will be unique if X.OR left.K.
The uniqueness process, described in detail below, can be applied
recursively in order to determine the unique set of every subtree in a
given expression tree T. First, the unique sets of the base relations are
determined. After the unique sets of the base relation have been
determined, they are used to determine the unique sets of every subtree of
expression tree T.
To determine the unique sets, the concept of a keySet(r) is used. First,
KEY is used to denote a key of base relations r=<R, V, E>, where KEY.OR
right.R and KEY is minimal in the sense that no subset of the attributes
in KEY is a key for r. Then, keySet(r) is used to denote the set
containing all minimal keys of r. Therefore, keySet(r) is called the
unique set of relation r. Note, keySet(r) may be an empty set.
If U(e.sub.x) and U(e.sub.y) are two unique sets then
U(e.sub.x).smallcircle.U(e.sub.y) denotes the following set:
U(e.sub.x).smallcircle.U(e.sub.y)={K.sub.1 .orgate.K.sub.2
.vertline.K.sub.1 .di-elect cons.U(e.sub.x), K.sub.2 .di-elect
cons.U(e.sub.y)}
Intuitively, U(e.sub.x).smallcircle.U(e.sub.y) is the set of all possible
keys that can be generated by combining the keys of e.sub.x and e.sub.y.
Strong Binding
A concept called strong binding is also important in the invention. A real
attribute A is said to be strongly bound if all tuples defined on schemas
containing A have exactly the same non-null value for attribute A.
Typically, an attribute A in the real schema of expression e gets strongly
bound when the select operation .sigma..sub.p.sbsb.x (e) assigns a literal
constant to attribute A.di-elect cons.sch(p.sub.x). For example
##STR1##
selects all tuples of expression r where A.sub.1 is equal to five. Because
A.sub.1 =5 for all tuples in the selected set of tuples, A.sub.1 is said
to be strongly bound.
Further, if R is a set of attributes, the set:
sBind(p.sub.x, R)={A.vertline.p.sub.x strongly binds attribute AA.di-elect
cons.(sch(p.sub.x).andgate.R) }.
Weak Bindings
Equally important to the invention is a concept called weak binding. A real
attribute A is said to be weakly bound if all tuples defined on schemas
containing A have either a null value or exactly the same non-null value
for attribute A.
Typically, a strongly bound attribute A in the real schema of expression
e.sub.1 gets weakly bound in the result of an expression such as e.sub.0
.fwdarw.e.sub.1.
Predictor Sets
Yet another key concept of the invention is the identification of a
predictor set. If A is a (strongly or weakly) bound attribute in the real
schema of expression e, the predictor set of A, denoted pSet(A), is
defined as follows:
1. A.di-elect cons.pSet(A).
2. (Closure under predicate P:) If X.di-elect cons.pSet(A) and subclause
X=Y is part of an (inner/outer) join predicate P in e, then Y.di-elect
cons.pSet(A).
In the uniqueness process, information about bound attributes is used to
get smaller keys from already available keys. More precisely, given that X
is a key for expression e, and A.di-elect cons.X is bound, the uniqueness
process attempts to determine if X-A is also a key. This is trivially true
if A is strongly bound and X-A.noteq.{}.
If, however, A is weakly bound, the following result occurs: If X is a key
for e and A.di-elect cons.X is weakly bound to constant c, then X-A is a
key for e if:
.pi..sub.X-A (.sigma..sub.A=c (e)).andgate..pi..sub.X-A (e-.sigma..sub.A=c
(e))={}, where X-A.noteq.{}
In other words, if the same values for attributes X-A are not present in
tuples having the value c for column A and tuples having NULL for column
A, then A can be pruned from the key X.
The uniqueness process, described below, uses one manifestation of this
sufficient condition using predictor sets (pSets) to determine the unique
sets of the expression. The query expression is processed bottom-up, and
pSets are built as bound attributes and equijoin predicates that reference
these attributes are encountered.
As a later example shows, it is important to remember "levels" at which
attributes are added to pSets. From a structural point-of-view, attribute
B will be at a higher level than C in a query expression, if either B is
from the preserved side and C is from the null-supplying side of an outer
join on predicate B=C, or the query specifies application of predicate A=B
after that of predicate A=C. Semantically, if B and C are attributes in
pSet(A) and B has a higher level than C, then whenever values for C are
non-null, so are the values for B. With this in mind, the definition of
pSets is augmented to be a set of <attribute,level> pairs.
6.2 Identities for uniqueness propagation
Paulley, G. N. and Larson, P-A., in their paper "Exploiting uniqueness in
query optimization," CASCON, pp. 804-822, Vol. II, October 1993, present
the following result for join queries: If relations r and s both have
primary keys, then key(r) .smallcircle.key(s) is a key for rx. However,
that work does not consider key recognition across binary operators,
.fwdarw., .rarw., and .
The following identity applies to binary operations of , .fwdarw., .rarw.,
and . For .fwdarw., .rarw., and the identity is novel and unique. For
operations the identity is generally known, but is presented here for the
sake of completeness.
If e=LR is an expression having L and R as its left and right operands,
respectively, and is a binary operator.di-elect cons.{.sup.1, .fwdarw.,
.rarw., }, then:
keySet(e)=S.sub.1 .orgate.S.sub.2 .orgate.S.sub.3,
where
##EQU1##
.sup.1 For operator this is prior art.
To optimize th | | |