WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
System and method for generating uniqueness information for optimizing an SQL query    
United States Patent5659728   
Link to this pagehttp://www.wikipatents.com/5659728.html
Inventor(s)Bhargava; Gautam (Cupertino, CA); Goel; Piyush (Monte Sereno, CA); Iyer; Balakrishna R. (San Jose, CA)
AbstractA system and method of determining uniqueness properties of an expression. A root of the expression is first determined, where the root is one of a base relation, a unary operation or a binary operation. Once the root is determined, a first procedure of an augmented unique process is called to determine uniqueness properties of a child of that root. The procedure called is chosen based on the determined root. Where the root is a base relation, a first procedure of a uniqueness process is applied to determine the uniqueness properties of the base relation. Where the root is a unary or binary operation, the called procedure is suspended, a second procedure of the augmented unique process is called to determine the uniqueness properties of the child of the operation, and this process is repeated until a base relation is reached. Once a base relation is reached, the first procedure of the uniqueness process is applied to determine the uniqueness properties of the reached base relation. A next procedure of a uniqueness process is applied to determine the uniqueness properties of a parent operator of the based relation. The procedure applied is chosen based on a type of operation represented by the parent. The process then unwinds to determine the uniqueness properties for each ancestor of the base relation(s).



 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5659728
System and method for generating uniqueness information for optimizing

     an SQL query - US Patent 5659728 Drawing
System and method for generating uniqueness information for optimizing an SQL query
Inventor     Bhargava; Gautam (Cupertino, CA); Goel; Piyush (Monte Sereno, CA); Iyer; Balakrishna R. (San Jose, CA)
Owner/Assignee     International Business Machines Corporation (Armonk, NY)
Patent assignment
All assignments
Publication Date     August 19, 1997
Application Number     08/366,560
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     December 30, 1994
US Classification     707/2
Int'l Classification     G06F 017/30
Examiner     Black; Thomas G.
Assistant Examiner     Alam; Hosain T.
Attorney/Law Firm     Johnson; Prentiss Sterne, Kessler, Goldstein & Fox P.L.L.C., Wayne
Address
Parent Case    
Priority Data    
USPTO Field of Search     395/600 395/602
Patent Tags     generating uniqueness information optimizing sql query
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5551031
M. Cheng
707/2
Aug,1996

[0 after 0 votes]
5423035
DePrez
707/2
Jun,1995

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


Having thus described our invention, what we claim as new and desire to secure by Letters Patent is:

1. A method of determining uniqueness properties of an expression for query optimization, comprising the steps of:

a) determining a root of the expression, wherein said root is one of a base relation, a unary operation or a binary operation;

b) calling a first one of a plurality of procedures of an augmented unique process to determine uniqueness properties of a child of the root, wherein the procedure called is chosen based on said determined root;

c) where said root is a base relation, applying a first procedure of a uniqueness process to determine the uniqueness properties of said base relation;

d) where said root is a unary or binary operation,

i suspending the called procedure,

ii calling a second one of a plurality of procedures of said augmented unique process to determine the uniqueness properties of the child of said unary or binary operation,

iii repeating said steps i and ii until a base relation is reached, and

iv applying a first procedure of a uniqueness process to determine the uniqueness properties of said reached base relation; and

e) applying a second of a plurality of procedures of a uniqueness process to determine the uniqueness properties of a parent operator of said base relation, wherein said procedure applied is chosen based on a type of operation represented by said parent; and

f) repeating said step (e) for each ancestor of a base relation.

2. The method of claim 1, wherein said first procedure of said uniqueness process comprises the steps of:

setting a unique set for said base relation equal to a key set for said base relation; and

defining a strongly bound set, a weakly bound set and a join attribute set as empty sets for the expression.

3. The method of claim 1, wherein if said root is a selector operation, said second of a plurality of procedures of said uniqueness process comprises the steps of:

determining whether there exists a set of attributes, K, in a unique set of a child of the expression such that all attributes in K are strongly bound by a predicate to determine whether a one-tuple condition exists;

if said one-tuple condition exists, adding all attributes currently in a node to a unique set of said node and to a strongly bound set of said node;

if said one-tuple condition does not exist, adding attributes in K that are not strongly bound to said unique set and to said strongly bound set of said node;

defining a weakly bound set and a join attribute set as empty sets for said node; and

defining each attribute that is strongly bound as being a level one attribute.

4. The method of claim 1, wherein if said root is a delta projection operation, said second of a plurality of procedures of said uniqueness process comprises the steps of:

(a) defining an attribute set as including all attributes for the projection;

(b) determining whether said attribute set contains a joining attribute set of a child of the projection;

(c) if said attribute set contains a joining attribute set of the child, then for each attribute in said attribute set that is found in a weakly bound set of its child as well as in said attribute set, determining whether a level of said attribute is lower than the level of another attribute in said attribute set; and

(d) if said level of said attribute is lower than the level of another attribute in said attribute set, removing said first attribute from said attribute set.

5. The method of claim 4, wherein said second of a plurality of procedures of said uniqueness process further comprises the step of removing from said attribute set all of said attributes that are strongly bound in the child.

6. The method of claim 5, wherein said second of a plurality of procedures of said uniqueness process further comprises the steps of:

determining whether there are any attributes remaining in said attribute set;

if no attributes remain in said attribute set, a unique set is defined as all of said attributes originally in said defined attribute set; and

if one or more of said attributes remain in said attribute set, said unique set is defined as all of the attributes in the child combined with all of said attributes remaining in said attribute set.

7. The method of claim 6, wherein said second of a plurality of procedures of said uniqueness process further comprises the steps of:

defining a strongly bound set as the intersection of a strongly bound set of the child with a set comprising all of said attributes originally in said defined attribute set; and

defining a weakly bound set as the intersection of a weakly bound set of the child with a set comprising all of said attributes originally in said defined attribute set.

8. The method of claim 1, wherein if said root is a non-duplicate removing projection operation, said second of a plurality of procedures of said uniqueness process comprises the steps of:

defining a unique set as a unique set of a child of the root;

defining a strongly bound set as the intersection of a strongly bound set of the child with a set of all attributes of said root;

defining a weakly bound set as the intersection of a weakly bound set of the child with a set of all attributes of said root; and

defining a join attribute set as a join attribute set of the child.

9. The method of claim 1, wherein if said root is a binary operator having a left child and a right child, said second of a plurality of procedures of said uniqueness process comprises the steps of:

concentrating each element in a unique set of a left child with each element in a unique set of the right child to create a first set;

if the operator is a join or left outer join operator, and a predicate P.sub.LR is an equi-join, and if the attributes that are common to both a schema of the predicate and a schema of the right child are in a unique set of the right child, then a second set is defined as a unique set of the left child, otherwise, said second set is defined as an empty set;

if the operator is a join or right outer join operator, and the predicate P.sub.LR is an equi-join, and if the attributes that are common to both the schema of the predicate and the schema of the left child are in the unique set of the left child, then a third set is defined as a unique set of the left child, otherwise said third set is defined as the empty set; and

defining uniqueness properties of the operator as minimum attributes of said first, second and third sets.

10. The method of claim 9, wherein if said root is a join operation, said second of a plurality of procedures of said uniqueness process further comprises the steps of:

defining a strongly bound set of said attributes as the union of a strongly bound set of the left child and the strongly bound set of the right child; and

defining a weakly bound set of said attributes as the union of a weakly bound set of the left child and the weakly bound set of the right child;

wherein if said root is a left-outer join operation, said second of a plurality of procedures of said uniqueness process further comprises the steps of:

defining a strongly bound set of said attributes as a strongly bound set of the left child; and

defining a weakly bound set of said attributes as the union of a weakly bound set of the left child and the weakly bound set of the right child and the strongly bound set of the right child;

wherein if said root is a right-outer join operation, said second of a plurality of procedures of said uniqueness process further comprises the steps of:

defining a strongly bound set of said attributes as the strongly bound set of the right child; and

defining a weakly bound set of said attributes as the union of a weakly bound set of the left child and the weakly bound set of the right child and the strongly bound set of the left child; and

wherein if said root is a full-outer join operation, said second of a plurality of procedures of said uniqueness process further comprises the steps of:

defining a strongly bound set of said attributes as an empty set; and

defining a weakly bound set of said attributes as the union of a weakly bound set of the left child and the weakly bound set of the right child and the strongly bound set of the left child and the strongly bound set of the right child.

11. The method of claim 10, further comprises the steps of:

for each attribute in said strongly bound set and said weakly bound set, determining a closure of a predictor set under the predicate,

wherein if the predicate contains subclause A=B, one of the following steps is performed

(1) if said root is a join operation or a full-outer join operation, the attribute level of B is the same as the level of A,

(2) if said root is a left-outer join, the attribute level of B is one lower than the attribute level of A, and

(3) if said root is a right-outer join, the attribute level of B is one higher than the attribute level of A.

12. The method of claim 9, wherein if said root is a binary intersect operator, said second of a plurality of procedures of said uniqueness process further comprises the steps of:

defining a strongly bound set as a minimum of a renamed combination of a strongly bound set of the left child and a strongly bound set of the right child;

defining a weakly bound set and a join attribute set as the empty set;

if the intersect operator is an intersect distinct operator, defining a unique set as the minimum of the renamed set of the combination of the output attributes of the intersect distinct and the unique set of the left child and the unique set of the right child; and

if the intersect operator is not an intersect distinct operator, defining the unique set as the minimum of the renamed set of the combination of the unique set of the left child and the unique set of the right child.

13. The method of claim 9, wherein said second of a plurality of procedures of said uniqueness process comprises the steps of:

defining a join attribute set as the union of the attribute set of the left child and the attribute set of the right child and the attributed mentioned in the predicate.

14. The method of claim 1, further comprising the step of using said uniqueness properties to optimize an SQL query.
 Description Submit all comments and votes
 


TECHNICAL FIELD

The present invention relates generally to query optimization, and more specifically to a system and method for generating uniqueness properties of intermediate query subexpressions.

BACKGROUND ART

Relational DataBase Management System (RDBMS) products using a Structured Query Language (SQL) interface are well known in the art. The SQL interface has evolved into a standard language for RDBMS products and has been adopted as such by both the American Nationals Standard Organization (ANSI) and the International Standards Organization (ISO). In RDBMS products, all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host languages such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator operates on either one or two tables and produces a new table as a result. The power of SQL lies in its ability to link information from multiple tables or views together to perform complex sets of procedures with a single statement.

The current state-of-the-art in query optimization provides few solutions for optimizing query expressions involving joins, outer joins and full outer joins. Some have suggested that queries could be optimized if certain uniqueness information is known about relations in a given expression. In particular, C. J. Date, and Hugh Darwen, in "Relational Database Writings 1989-1991," pp 133-153, indicate that uniqueness information can be used to optimize database queries.

However, there is little guidance regarding how to determine this much needed uniqueness information. Thus, if full advantage of certain optimization techniques is to be obtained, a technique for determining uniqueness information is needed.

DISCLOSURE OF INVENTION

The present invention is directed toward a system and method for determining uniqueness properties of an expression so that queries containing join and outer join operations can be optimized. According to the invention, two processes are implemented to determine the uniqueness properties of a given expression.

The first process, termed an augmented unique process, is a recursive process applied to the roots of the expression in question. The augmented unique process is used to call a second process, termed a uniqueness process. The uniqueness process is used to determine uniqueness properties of the elements of the expression in question. In operation, the augmented unique process begins at the root of the expression in question. The augmented unique process calls an appropriate procedure depending on the child, or children, of the root. Subsequent procedures of the augmented unique process are called in a recursive manner depending on the progeny of the root.

When a base relation is reached, the augmented unique process calls an appropriate procedure of the uniqueness process to determine the uniqueness properties of that base relation. The augmented unique process `unwinds` by calling an appropriate procedure of the uniqueness process for each of the progenitors of the base relation. If the root of the expression in question is a binary operator, the processes described above are repeated for the other progeny of the binary operator.

A feature of the invention is that the uniqueness properties determined for the relations in the expression can be used for numerous applications. For example, in one application, the uniqueness properties are used to determine whether an intersect all operation can be replaced by an intersect distinct operation, thus simplifying the expression.

The uniqueness properties provide important information pertaining to the attributes of the expression. For each relation, base or otherwise, the uniqueness properties determined can include: the weakly bound set, the strongly bound set, the predictor set and the key set.

Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

The present invention is described with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.

FIG. 1 is a diagram illustrating an exemplary computer hardware embodiment in which the present invention can be implemented;

FIG. 2 is an operational flow diagram illustrating a method of interpreting and executing SQL statements;

FIG. 3 is an operational flow diagram illustrating a method of interpreting and executing SQL statements embedded in source code;

FIG. 4 is an operational flow diagram illustrating an application of an augmented unique process according to one embodiment of the invention;

FIG. 5 is an operational flow diagram illustrating steps associated with the augmented unique process according to one embodiment of the invention;

FIG. 6 is an operational flow diagram illustrating an application of procedure two of the uniqueness process according to one embodiment of the invention;

FIG. 7 is an operational flow diagram illustrating an application of procedure three of the uniqueness process according to one embodiment of the invention;

FIG. 8 is an operational flow diagram illustrating an application of procedure four of the uniqueness process according to one embodiment of the invention;

FIG. 9 is an operational flow diagram illustrating an application of procedure five of the uniqueness process according to one embodiment of the invention; and

FIG. 10 is an operational flow diagram illustrating an application of procedure six of the uniqueness process according to one embodiment of the invention.

BEST MODE FOR CARRYING OUT THE INVENTION

The present invention is directed toward a system and method for optimizing SQL queries. According to the invention, uniqueness properties of each subexpression of an SQL query are determined using a recursive process called the augmented unique process. The augmented unique process recursively determines the uniqueness properties of each relation using a uniqueness process.

Further according to the invention, a concept of strong and weak bindings is used to propagate uniqueness information up an expression tree for a query. Then, for example, such information can be used to eliminate unnecessary duplicate removal operations, convert Intersect All to Intersect Distinct .andgate..sub.a to .andgate..sub.d which can then be converted to joins and/or used for "what-if" .delta.-pushdown to do subtree pruning. These transformations can provide orders of magnitude performance improvement in query execution times.

1. Environment of the Invention

FIG. 1 illustrates an exemplary computer hardware environment that in which the present invention can be implemented. In this exemplary environment, a computer system 102 is comprised of one or more processors connected to one or more electronic storage devices 104 and 106, such as disk drives, on which one or more databases are stored.

Operators of the computer system 102 use a standard terminal interface 108, such as IMS/DB/DC, CICS, TSO, or other interface, to perform various search and retrieval functions, termed queries, against the databases. These functions are typically performed using Relational DataBase Management System (RDBMS) software that incorporates the Structured Query Language (SQL) standard. In one embodiment of the present invention, the RDBMS software comprises the DB2 product offered by IBM for the MVS or OS/2 operating systems. As will become apparent to those skilled in the relevant art, the present invention has application to any RDBMS that uses SQL or any other data manipulation language.

The DB2 architecture includes three major components: (1) the IMS Resource Lock Manager (IRLM) 110, the Systems Services module 112, and the Database Services module 114. The IRLM 110 handles locking services for concurrency control. These locking services are provided because DB2 treats data as a shared resource, thereby allowing any number of users to access the same data simultaneously. Therefore, concurrency control is required to isolate users and to maintain data integrity.

The Systems Services module 112 controls the overall DB2 execution environment, including managing log data sets 106, gathering statistics, handling startup and shutdown, and providing management support.

At the center of the DB2 architecture is the Database Services module 114. The Database Services module 114 contains several submodules, including the Relational Database System (RDS) 116, the Data Manager 118, the Buffer Manager 120 and other components 122 such as an SQL compiler/interpreter. These modules support the functions of the SQL language, i.e., definition, access control, retrieval, and update of user and system data.

FIG. 2 is a flowchart illustrating the steps necessary for the interpretation and execution of SQL statements in an interactive environment. Block 202 represents the input of SQL statements into the computer from the user. Block 204 represents the step of compiling or interpreting the SQL statements. An optimization function within block 204 may reorder the SQL query for optimum performance.

Block 206 represents the step of generating from the compiled SQL statements a compiled set of runtime structures, termed an application plan. Generally, the SQL statements received as input from the user specify only the data that the user wants, but do not specify how to get to that data. This step considers both the available access paths (indexes, sequential reads, etc.) and system-held statistics on the data to be accessed (the size of the table, the number of distinct values in a particular column, etc.), to choose what it considers to be the most efficient access path for the query. Block 208 represents the execution of the application plan, and block 210 represents the output of the results of the application plan to the user.

FIG. 3 is a flowchart illustrating the steps necessary for the interpretation and execution of SQL statements embedded in source code according to the present invention. Block 302 represents program source code containing a host language (such as COBOL or C) and embedded SQL statements. The program source code is then input to a pre-compile step 304. There are two outputs from the pre-compile step 304: a modified source module 306 and a database request module (DBRM) 308. The modified source module 306 contains host language calls to DB2, which the pre-compile step 304 inserts in place of SQL statements. The DBRM 308 consists of the SQL statements from the program source code 302.

A compile and link-edit step 310 uses the modified source module 306 to produce a load module 312, while an optimize and bind step 314 uses the DBRM 308 to produce a compiled set of runtime structures for the application plan 316. As indicated above in conjunction with FIG. 2, the SQL statements from the program source code 302 specify only the data that the user wants, but not how to get to it. The optimize and bind step 314 may reorder the SQL query so as to optimize the query. Thereafter, the optimize and bind step 314 considers both the available access paths (indexes, sequential reads, etc.) and system held statistics on the data to be accessed (the size of the table, the number of distinct values in a particular column, etc.), to choose what it considers to be the most efficient access path for the query. The load module 312 and application plan 316 are then executed together at step 318.

The present invention is described in terms of this example environment. Description in these terms is provided for convenience only. It is not intended that the invention be limited to application in this example environment. In fact, after reading the following description, it will become apparent to a person skilled in the relevant art how to implement the invention in alternative environments.

2.0 Conventional Optimization Techniques

In a given query expression, information about the uniqueness (or, "duplicate-free" status) of a sub(result) can be used to influence the optimal plan selection for the query. For example, the paper "Exploiting uniqueness in query optimization," by Paulley, G. N. and Larson, P-A., CASCON, pp. 804-822, Vol. II, October 1993, illustrates the case when an expensive, duplicate elimination clause such as SQL's SELECT DISTINCT can be converted to a much cheaper SELECT if it can be determined that the result is already duplicate free.

As another example the paper "Extensible/rule based query rewrite optimization in Starburst," by Pirahesh et al., SIGMOD, pp. 39-48, San Diego, Calif., June 1992, illustrates how special rules are allowed to fire in a rule-based optimizer when a "one-tuple condition" is determined.

Uniqueness propagation exploits available information about the uniqueness of tuples in base relations (enforced, e.g., through integrity constraints, key definitions, index uniqueness, etc.), along with properties of algebraic expressions that preserve and/or generate unique results, to influence the optimization of the query. For example, if it is known that one of the operands of an intersect all operation is an expression with unique tuples, then intersect all can be replaced by intersect distinct. See, for example, Paulley, G. N. and Larson, P-A., "Exploiting uniqueness in query optimization," CASCON, pp. 804-822, Vol. II, October 1993.

3.0 Optimization Using Uniqueness Propagation

Uniqueness propagation as described by Paulley and Larsen can be exploited as follows for query optimization:

1. If one of the operands of the intersect all is known to have unique tuples, a SELECT DISTINCT . . . can be applied to this sub-expression, which may allow for the "pruning" away of some expensive outer joins from this sub-expression. See, for example, Chen, A. L. P., "Outerjoin optimization in multidatabase systems," 2nd International Symposium on Databases in Parallel and Distributed Systems, 1990

2. The newly transformed intersect distinct operator could be transformed into a join operator. See for example, Pirahesh et al., "Extensible/rule based query rewrite optimization in Starburst," SIGMOD, pp. 39-48, San Diego, Calif., June 1992.

3. The join operator could then be part of an extensive join, outer join, full outer join reorderability evaluation algorithm.

This invention defines the concept of weak bindings in the context of queries involving joins, outer joins, and intersection operations. This, together with new uniqueness determination identities, is used to present a technique unique for determining the unique set of a (sub)tree corresponding to a relational (sub-)expression.

3.1 Comparison With Conventional Techniques

(Paulley, G. N. and Larson, P-A., "Exploiting uniqueness in query optimization," CASCON, pp. 804-822, Vol. II, October 1993) provides an algorithm to determine when the results of a join operation are duplicate free. That work is aimed at converting SQL's SELECT DISTINCT to SELECT for a class of queries that includes selections, projections, and joins, together with intersection operations. That work does not consider queries involving outer join, or full join operations, nor does it describe how uniqueness results can propagate across join, outer join, full outer join, and intersection operators.

4. Basic definitions

Before describing the invention in detail, some important definitions are set forth.

Tuple

A tuple t is a mapping from a finite set of attributes, R.orgate.V, to a set of atomic (possibly null) values where R is non-empty set of real attributes and V is a non-empty set of virtual attributes, R.andgate.V={}, and the mapping t maps at least one virtual attribute v.di-elect cons.V to a non-null value. For an attribute set X, t[X] represents the values associated with attributes X under the mapping t, where X.OR right.R.circleincircle.V and X.noteq.{}.

The set of real attributes of a tuple t is the same as the schema of the tuple in the traditional relational algebras. These attributes are accessible to users and can be referenced externally, e.g., in user queries, etc. On the other hand, virtual attributes are (mostly) used to provide unique conceptual tuple-ids to tuples, and are not available for user queries. In some cases, when real attributes are not sufficient to identify tuples, virtual attributes have to be manipulated explicitly by the RDBMS.

Relation

A relation r is a triple (R, V, E) where R is a non-empty set of real attributes, V is a non-empty set of virtual attributes, and E, the extension of relation r, is a set of tuples such that (.A-inverted.t.sub.1 .di-elect cons.E) (.A-inverted.t.sub.2 .di-elect cons.E) (t.sub.1 .noteq.t.sub.2 t.sub.1 [V].noteq.t.sub.2 [V]). R.orgate.V is called the schema of relation r. We use sch(r) to denote the set of real attributes in the schema of r. Virtual and real attributes of relations are used to provide a conceptual framework for relations with duplicates.

Predicate

A predicate p over a set of real attributes sch(p), called the schema of p, is a total mapping from tuples to the Boolean values {TRUE, FALSE}, where sch(p) is the minimum set of attributes such that for all tuples t.sub.1 and t.sub.2 (t.sub.1 [sch(p)]=t.sub.2 [sch(p)]p(t.sub.1)=p(t.sub.2)). For a tuple t with real schema .OR left.sch(p), p(t) is TRUE if and only if (.A-inverted.A.di-elect cons.sch(p)) (substitution of t[A] for an A in p causes it to evaluate to TRUE).

Null-intolerant

A predicate p is null-intolerant if p evaluates to FALSE for tuples undefined on one or more attributes in sch(p). More formally, p is null-intolerant if (.A-inverted.t)(.E-backward..di-elect cons.sch(p)) (t[A]=NULLp(t)=FALSE).

Throughout this document it is assumed that predicates are null-intolerant, and predicates do not contain disjunction.

5. Algebraic operators and expressions

Following are definitions for algebraic operators and notational conventions used in SQL queries. These definitions are helpful in understanding later portions of this document.

5.1 Algebraic operators

In the following, let r=<R, V, E>, r.sub.1 =<R.sub.1, V.sub.1, E.sub.1 > and r.sub.2 =<R.sub.2, V.sub.2, E.sub.2 > denote relations such that R.sub.1 .andgate.R.sub.2 ={} and V.sub.1 .andgate.V.sub.2 ={}.

Projection

The projection, .pi..sup.a .sub.X (r), of relation r onto attributes X is the relation <X, V, E'> where X.OR right.R and E'={t.v.vertline.(.E-backward.t'.di-elect cons.E) (t=t'[X]v=t'[V])}.

The .pi..sup.R operator is a projection operator that does not remove "duplicates" in the real attributes part of the source expression. The superscript a in .pi..sup.R denotes the fact that all the virtual attributes of the source expression are included in the virtual schema of the result expression. For ease of reading, the superscript a is omitted from .pi..sup.R when there is no ambiguity and the operator is simply written as .pi..

The projection.sup.c, .pi..sup.c.sub.X.sbsb.R.sub.X.sbsb.V (r) , of relation r on attributes X.sub.R X.sub.V is the relation <X.sub.R, X.sub.V, E'>, where X.sub.R .OR right.R, X.sub.V .OR right.V and:

E'={t.v.vertline.(.E-backward.t'.di-elect cons.E)(t=t'[X.sub.R ]v=t'[X.sub.V ])}.

In contrast to .pi., .pi..sup.c allows a choice in the selection of the virtual attributes from the source expression. This operation is useful in defining the outer join operator defined below.

Delta Projection

The delta-projection, .delta..sub.X.sbsb.R.sub.X.sbsb.V (r) , of r on attributes X.sub.R X.sub.V is the relation <X.sub.R X.sub.V, V.sub.new, E'>, where X.sub.R .OR right.R, X.sub.V .OR right.V, and:

E'={t.vertline.(.E-backward.t'.di-elect cons.E) (t[X.sub.R X.sub.V ]=t'[X.sub.R X.sub.V ]t[V.sub.new ]=a new, unique value)}.

The .delta. operator models the SELECT DISTINCT . . . construct of SQL which allows elimination of "duplicates" from a relation. The .delta. operator is called the distinct projection operator and produces a result relation which has distinct values on the attributes X.sub.R X.sub.V and a new virtual attribute.

Selection

The selection, .sigma..sub.p (r), of relation r on predicate p is the relation <R, V, E'>, where sch(p).OR right.R, and:

E'={t.vertline.t.di-elect cons.Ep(t)}

Cross Product and Difference

The cross product, r.sub.1 .times.r.sub.2, and difference, r.sub.1 -r.sub.2, of relations r.sub.1 and r.sub.2 is the relation <R.sub.1 R.sub.2, V.sub.1 V.sub.2, E.sub.1 .times.E.sub.2 > and <R, V, E.sub.1 -E.sub.2 >, respectively.

Join

The join, r.sub.1 r.sub.2, or relations r.sub.1 and r.sub.2 is the relation <R.sub.1 R.sub.2, V.sub.1 V.sub.2, E'>, where:

E'={t.vertline.t.di-elect cons.(E.sub.1 .times.E.sub.2)p(t)}.

Equi-join predicate

The predicate p is called an equi-join predicate if it consists of conjuncts of clauses of the form X.sub.i =Y.sub.j, where X.sub.i and Y.sub.j are attributes of the relations referenced by the predicate.

Union Compatible

A relation r.sub.1 =<{A.sub.1, A.sub.2, . . . , A.sub.n }, V.sub.1, E.sub.1 > is said to be union compatible with relation r.sub.2 =<{B.sub.1, B.sub.2, . . . , B.sub.n }, V.sub.2, E.sub.2 > iff:

domain(A.sub.i)=domain (B.sub.i), for 1.ltoreq.i.ltoreq.n.

That is, the real attributes of r.sub.1 and r.sub.2 can be ordered such that the domains of the first real attributes of r.sub.1 and r.sub.2 are the same, the domains of the second real attributes of r.sub.1 and r.sub.2 are the same, and so on.

Union and Outer Union

The union, r.sub.1 .orgate.r.sub.2, of union compatible relations r.sub.1 and r.sub.2 is the relation (R, V, E.sub.1 .orgate.E.sub.2). The outer union, r.sub.1 r.sub.2, is the relation (R.sub.1 .orgate.R.sub.2, V.sub.1 .orgate.V.sub.2, E'), where:

E'={t.vertline.(.E-backward.t'.di-elect cons.E.sub.1)(t[R.sub.1 V.sub.1 ]=t'(.A-inverted.A .di-elect cons.(R.sub.2 -R.sub.1).orgate.(V.sub.2 -V.sub.1))(t[A]=NULL))

(.E-backward.t".di-elect cons.E.sub.2)(t[R.sub.2 V.sub.2 ]=t"(.A-inverted.A.di-elect cons.(R.sub.1 -R.sub.2).orgate.(V.sub.1 -V.sub.2))(t[A]=NULL))}.

Note that rows in r.sub.1 r.sub.2 are padded with NULLs for those attributes that are not present either in relation r.sub.1 or in relation r.sub.2.

The following notational convention is useful for the following definitions. If r=<R, V, E> is a relation, the term ext(r) is used to represent E, the extension part of r. In order to maintain readability in the definitions, in the following few definitions we adopt the shorthand convention of simply writing E.sub.1 E.sub.2 instead of ext(r.sub.1 r.sub.2), or .pi..sub.x (E.sub.1 E.sub.2) instead of ext(.pi..sub.X (r.sub.1 r.sub.2)), etc.

Left Outer Join

The left outer join, r.sub.1 r.sub.2, is the relation <R.sub.1 R.sub.2, V.sub.1 V.sub.2, E'>, where

E'=(E.sub.1 E.sub.2)(E.sub.1 -.pi..sup.c.sub.R.sbsb.1.sub.V.sbsb.1 (E.sub.1 E.sub.2)).

Preserved Relation and Null Supplying Relation

Relation r.sub.1 in the above definition of left outer join, r.sub.1 r.sub.2, is called the preserved relation and relation r.sub.2 is called the null supplying relation. The right outer join, r.sub.1 r.sub.2, can similarly be defined in which r.sub.1 is the null supplying relation and r.sub.2 is the preserved relation.

Full Outer Join

The full outer join, r.sub.1 r.sub.2, of relations r.sub.1 and r.sub.2 is the relation <R.sub.1 R.sub.2, V.sub.1 V.sub.2, E'>, where

E'=(E.sub.1 E.sub.2)(E.sub.1 -.pi..sup.c.sub.R.sbsb.1.sub.V.sbsb.1 (E.sub.1 E.sub.2))(E.sub.2 -.pi..sup.c.sub.R.sbsb.2.sub.V.sbsb.2 (E.sub.1 E.sub.2)) .

Equivalent

Values v.sub.1 and v.sub.2 are said to be equivalent, denoted v.sub.1 .cuberoot.v.sub.2 if either both v.sub.1, v.sub.2 are non-NULL and v.sub.1 =v.sub.2, or if both are NULL.

Intersect Distinct

The intersect distinct, r.sub.1 .andgate..sub.d r.sub.2, of two union compatible relations r.sub.1 and r.sub.2 is the relation <{C.sub.1, C.sub.2, . . . , C.sub.n}, V.sub.new, E'>, where domain(A.sub.i)=domain(B.sub.i)=domain(C.sub.i), 1.ltoreq.i.ltoreq.n, and

E'={t.vertline.(.E-backward.t.sub.1 .di-elect cons.E.sub.2)(.A-inverted.i)(t[C.sub.i ].tbd.t.sub.1 [A.sub.i ]t[C.sub.i ].tbd.t.sub.2 [B.sub.i ]t[V.sub.new ]=a new, unique value)}

Intersect distinct retains the common tuples in relations r.sub.1 and r.sub.2. If a tuple t.sub.1 .di-elect cons.E.sub.l contains null value in attribute A.sub.i and t.sub.2 .di-elect cons.E.sub.2 contains null value in attribute B.sub.i and identical no-null values in the remaining attributes then t.sub.1 and t.sub.2 are considered equivalent and only a single copy is retained in the result. If either operand contains duplicates then only one copy of the common tuple is retained in the result.

Intersect All

By contrast, the operand intersect all, .andgate..sub.a, retains "some" of the duplicate copies of the common tuples. More precisely, in two union compatible relations r.sub.1 and r.sub.2, if a common tuple t appears i times in r.sub.1 and j times in r.sub.2, then t appears min{i,j} times in the result relation r.sub.1 .andgate..sub.a r.sub.2.

5.2 Expressions and expression trees

The following provides a recursive definition of expressions.

1. If r=<R, V, E> is a relation, then r is an expression. Henceforth, whenever there is no ambiguity, the shorthand notation e is used to represent the expression e=<R, V, E>,

2. If e=<R, V, E> is an expression then .pi..sup.R.sub.X (e) is an expression, where X.OR right.R.

3. If e=<R, V, E> is an expression then .delta..sub.X.sbsb.R.sub.X.sbsb.V (e) is an expression, where X.sub.R .OR right.R and X.sub.V .OR right.V.

4. If e=<R, V, E> is an expression then .sigma..sub.p (e) is an expression, where sch(p).OR right.R.

5. If e.sub.1 =<R.sub.1, V.sub.1, E.sub.1 >and e.sub.2 =<R.sub.2, V.sub.2, E.sub.2 > are expressions then e.sub.1 e.sub.2 is an expression, where:

.circle-w/dot..di-elect cons.{, , , , .andgate..sub.d, .andgate..sub.a },

and p is a predicate such that sch(p).OR right.R.sub.1 R.sub.2, and sch(p).andgate.R.sub.1 .noteq.{}, and sch(p).andgate.R.sub.2 .noteq.{}.

6. If e=<R, V, E> is an expression then so is (e), where (e)=<R, V, E>. This parenthesization is required due to the fact that some of the binary operations defined above are not fully associative. Parenthesization determines the evaluation order so that expressions can be evaluated un-ambiguously. However, whenever there is no ambiguity, parenthesis will be dropped freely.

Expression Tree

An expression can also be represented by a corresponding expression tree in which the inner nodes are the operators occurring in the expression and the leaves are the base relations referenced in the expression.

Let denote one of the binary operators defined in the previous section, then an expression tree T with left subtree T.sub.1, right subtree T.sub.r and root is denoted by (T.sub.1 T.sub.r).

Henceforth, the two equivalent representations are used interchangeably.

6. Uniqueness Properties

As stated above, the invention is directed toward a system and method for generating uniqueness information for relations in an expression. In one application, this information is used to optimize a query, such as an SQL query. To accomplish this objective, several innovations can be used as described below. These innovations include:

The concept of weak binding in the context of outer joins;

Powerful, yet simple, new identities for key set determination;

A process, termed `uniqueness process` which provides a method for determining the unique set of a subtree corresponding to a relational sub-expression; and

A process, termed `augmented-unique process` for selecting roots of the expression to which the uniqueness process is applied. Roots are selected in the post order.

A unique set is the set of those attribute groups that uniquely identify tuples in the expression. For base relations, this is the same as the key set; for other expressions, the unique set may contain attributes which are not part of the expression's schema. In this latter case, the uniqueness test boils down to simply determining whether the schema of the expression contains a member of the unique set.

This point may be best illustrated by the following example: Consider the relation r=<R, V, E> having key K, and the expression .pi..sub.X (r). Then the unique set for both r and .pi..sub.X (r) contains the key K. However, tuples in .pi..sub.X (r) will be unique if X.OR left.K.

The uniqueness process, described in detail below, can be applied recursively in order to determine the unique set of every subtree in a given expression tree T. First, the unique sets of the base relations are determined. After the unique sets of the base relation have been determined, they are used to determine the unique sets of every subtree of expression tree T.

To determine the unique sets, the concept of a keySet(r) is used. First, KEY is used to denote a key of base relations r=<R, V, E>, where KEY.OR right.R and KEY is minimal in the sense that no subset of the attributes in KEY is a key for r. Then, keySet(r) is used to denote the set containing all minimal keys of r. Therefore, keySet(r) is called the unique set of relation r. Note, keySet(r) may be an empty set.

If U(e.sub.x) and U(e.sub.y) are two unique sets then U(e.sub.x).smallcircle.U(e.sub.y) denotes the following set:

U(e.sub.x).smallcircle.U(e.sub.y)={K.sub.1 .orgate.K.sub.2 .vertline.K.sub.1 .di-elect cons.U(e.sub.x), K.sub.2 .di-elect cons.U(e.sub.y)}

Intuitively, U(e.sub.x).smallcircle.U(e.sub.y) is the set of all possible keys that can be generated by combining the keys of e.sub.x and e.sub.y.

Strong Binding

A concept called strong binding is also important in the invention. A real attribute A is said to be strongly bound if all tuples defined on schemas containing A have exactly the same non-null value for attribute A.

Typically, an attribute A in the real schema of expression e gets strongly bound when the select operation .sigma..sub.p.sbsb.x (e) assigns a literal constant to attribute A.di-elect cons.sch(p.sub.x). For example ##STR1## selects all tuples of expression r where A.sub.1 is equal to five. Because A.sub.1 =5 for all tuples in the selected set of tuples, A.sub.1 is said to be strongly bound.

Further, if R is a set of attributes, the set:

sBind(p.sub.x, R)={A.vertline.p.sub.x strongly binds attribute AA.di-elect cons.(sch(p.sub.x).andgate.R) }.

Weak Bindings

Equally important to the invention is a concept called weak binding. A real attribute A is said to be weakly bound if all tuples defined on schemas containing A have either a null value or exactly the same non-null value for attribute A.

Typically, a strongly bound attribute A in the real schema of expression e.sub.1 gets weakly bound in the result of an expression such as e.sub.0 .fwdarw.e.sub.1.

Predictor Sets

Yet another key concept of the invention is the identification of a predictor set. If A is a (strongly or weakly) bound attribute in the real schema of expression e, the predictor set of A, denoted pSet(A), is defined as follows:

1. A.di-elect cons.pSet(A).

2. (Closure under predicate P:) If X.di-elect cons.pSet(A) and subclause X=Y is part of an (inner/outer) join predicate P in e, then Y.di-elect cons.pSet(A).

In the uniqueness process, information about bound attributes is used to get smaller keys from already available keys. More precisely, given that X is a key for expression e, and A.di-elect cons.X is bound, the uniqueness process attempts to determine if X-A is also a key. This is trivially true if A is strongly bound and X-A.noteq.{}.

If, however, A is weakly bound, the following result occurs: If X is a key for e and A.di-elect cons.X is weakly bound to constant c, then X-A is a key for e if:

.pi..sub.X-A (.sigma..sub.A=c (e)).andgate..pi..sub.X-A (e-.sigma..sub.A=c (e))={}, where X-A.noteq.{}

In other words, if the same values for attributes X-A are not present in tuples having the value c for column A and tuples having NULL for column A, then A can be pruned from the key X.

The uniqueness process, described below, uses one manifestation of this sufficient condition using predictor sets (pSets) to determine the unique sets of the expression. The query expression is processed bottom-up, and pSets are built as bound attributes and equijoin predicates that reference these attributes are encountered.

As a later example shows, it is important to remember "levels" at which attributes are added to pSets. From a structural point-of-view, attribute B will be at a higher level than C in a query expression, if either B is from the preserved side and C is from the null-supplying side of an outer join on predicate B=C, or the query specifies application of predicate A=B after that of predicate A=C. Semantically, if B and C are attributes in pSet(A) and B has a higher level than C, then whenever values for C are non-null, so are the values for B. With this in mind, the definition of pSets is augmented to be a set of <attribute,level> pairs.

6.2 Identities for uniqueness propagation

Paulley, G. N. and Larson, P-A., in their paper "Exploiting uniqueness in query optimization," CASCON, pp. 804-822, Vol. II, October 1993, present the following result for join queries: If relations r and s both have primary keys, then key(r) .smallcircle.key(s) is a key for rx. However, that work does not consider key recognition across binary operators, .fwdarw., .rarw., and .

The following identity applies to binary operations of , .fwdarw., .rarw., and . For .fwdarw., .rarw., and the identity is novel and unique. For operations the identity is generally known, but is presented here for the sake of completeness.

If e=LR is an expression having L and R as its left and right operands, respectively, and is a binary operator.di-elect cons.{.sup.1, .fwdarw., .rarw., }, then:

keySet(e)=S.sub.1 .orgate.S.sub.2 .orgate.S.sub.3,

where ##EQU1##

.sup.1 For operator this is prior art.

To optimize th