|
Description  |
|
|
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention concerns data base system generally and more specifically concerns side effects of transactions in data base systems.
2. Description of the Prior Art
A popular type of data base system is the relational data base system, in which the data in the data base is logically organized into relations, that is, tables with rows and columns. When a user of a data base system wishes to obtain
information from it, the user makes a query, which describes the information to be obtained in terms of relations, columns, and limitations on the values of the columns in individual rows. The result of the query is another relation, which is termed a
view. The relations from which the view is computed are termed the base relations of the view. If the view is in turn used as a base relation in the computation of a second view, the computation of the second view may be speeded up by materializing the
view, that is, storing it in the data base system, rather than recomputing it each time it is used as a base relation.
In addition to querying data bases, users often add information to the data base and delete information from the data base. Adding or deleting information is termed a transaction. Of course, when a transaction changes a relation which is a base
relation for a materialized view, the materialized view may need to be changed as well. One way of changing the materialized view is to recompute it from the base relations, but the whole point of the materialized view is to avoid recomputation. The
problem of finding ways of updating materialized views without completely recomputing them is termed in the art the view maintenance problem.
Materialized views are however not the only components of data base systems which are affected by changes in base relations. Many data base systems have components which monitor changes in the data base and take actions if certain changes occur. Included among these components are integrity constraints, that is, rules for preventing transactions which are inconsistent with some property of the data base, monitors, which notify the user when certain changes occur, triggers, which take action when
the changes occur, and active queries, which are computed in response to triggers. The same techniques which are used to avoid complete recomputation of views when a base relation changes can also be used generally to reduce the amount of computation
needed to determine whether an action is to take place in response to a change in a base relation.
For the most part, prior art solutions to the view maintenance problem have been algorithmic, that is, for each kind of change to the base relations, an algorithm is provided which computes the changes to the view base relations, an algorithm is
produced that computes the changes to the view. A typical example of the algorithmic approach is S. Ghandeharizadeh, R. Hull, D. Jacobs et al., On implementing a language for specifying active database execution models, In: Proc. VLDB-93. Qian and
Widerhold have employed equational reasoning, that is, the change required in the view which is required by the change in the base relations is determined by means of a series of algebraic translations of the change in the base relations. See X. Qian
and G. Wiederhold, Incremental Recomputation of Active Relational Expressions, IEEE Transactions on Knowledge and Data Engineering, 3(3):337"341,1991.
Equational reasoning offers a number of advantages over the algorithmic approach:
Unlike the algorithmic approach, it provides us with precise semantics of changes to the views. Consequently, using the equational approach makes it easier to prove correctness of the change propagation algorithm.
This approach is robust: if language changes (e.g. new primitives are added), one only has to derive new rules for the added primitives, leaving all other rules intact. As long as the new rules are correct, the correctness of the change
propagation algorithm is not affected.
The resulting changes to the view are obtained in form of expressions in the same language used to define the view. This makes additional optimizations possible. For example, the expressions for changes that are to be made (e.g. for sets/bags
of tuples to be deleted/added) can be given as an input to any query optimizer that might find an efficient way of calculating them.
Qian and Wiederhold's work and most of the algorithmic approaches have assumed that relations are set-valued, that is, duplicate rows are eliminated from the view. However, most practical database systems do not eliminate duplicates from views,
and consequently cannot be modelled by sets. Instead, they must be modelled by bags, that is, multisets, or collections in which duplicates are permitted. The ability to handle duplicates is particularly important in dealing with aggregate functions in
data base systems. Such functions obtain a value by aggregating other values. For instance, if the average salary of employees is to be computed, then one applies the aggregate function AVG to II.sub.Salary (Employees). Duplicates cannot be removed
from the view since the result would be wrong when at least two employees had the same salary. Not eliminating duplicates also speeds up query evaluation, as duplicate elimination is generally a rather expensive operation.
Unfortunately, the theoretical work done on relations that are assumed to have no duplicates does not carry over to relations that do have them. Further, the little work which has been done on view maintenance when duplicates are present is
algorithmic, see A. Gupta, I. S. Mumick, and V. S. Subrahmanian, Maintaining views incrementially, in: SIGMOD-93, pp. 157-166. Finally, Qian and Wiederhold's work contains an error which renders it unable to guarantee a minimality condition, namely
that no rows are changed unnecessarily in the view being updated in response to a change in its base relations.
It is thus an object of the present invention to provide solutions for the view update problem which take the existence of duplicate rows in the view into account, which employ equational reasoning, and which are in fact able to guarantee a
strong minimality condition.
SUMMARY OF THE INVENTION
The above object is achieved by making a definition of the side effect resulting from the transaction in terms of the relation or relations whose alteration by the transaction causes the side effect. The definition is then transformed using
equational reasoning into a pre-expression which defines the effects of the side effect and can deal correctly with duplicates. In a preferred embodiment, the method deals correctly with duplicates because the definition, the equational reasoning, and
the pre-expression are all done in terms of a bag algebra. One important application of the invention is making updates of materialized views.
Other objects and advantages of the apparatus and methods disclosed herein will be apparent to those of ordinary skill in the art upon perusal of the following Drawing and Detailed Description, wherein:
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a block diagram of a prior-art data base system;
FIG. 2 is a diagram of a data dictionary;
FIG. 3 is a diagram of an equational materialized view description generator;
FIG. 4 is a diagram showing how a materialized view maintenance description is used to update a materialized view;
FIG. 5 shows example base relations and an example view relation;
FIG. 6 shows change propagation equations used in a preferred environment; and
FIG. 7 shows recursive functions used in a preferred embodiment to compute a pre-expression.
Reference numbers in the Drawing have two parts: the two least-significant digits are the number of an item in a figure; the remaining digits are the number of the figure in which the item first appears. Thus, an item with the reference number
201 first appears in FIG. 2.
DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT
The following Detailed Description will begin with an overview of a data base system, will then present a general view of the problem of materialized view maintenance and the details of how equational reasoning may be employed in materialized
view maintenance, and will finally present details of an implementation of materialized view maintenance in a data base system.
Overview of a Data Base System: FIGS. 1 and 2
FIG. 1 is a block diagram of a data base system 101 implemented in a digital computer system. The digital computer system has the usual components: a processor unit 109 with a processor 119 and a memory 111, an input device 107, for example a
keyboard or a pointing device such as a mouse, connected to processor unit 109 for receiving inputs from a user, an output device 101 such as a CRT or flat panel display for displaying output from processing unit 109 to the user, and mass storage device
121, for example, a hard disk drive, which stores persistent data. The digital computer system may be of any size, from a personal computer to a mainframe or supercomputer.
When the digital computer system is being used to implement a data base system 101, processor 119 is executing a data base program 113 from memory 111 using program data 115. As the digital computer system executes the program, it responds to
inputs from input device 107 and displays outputs on display 101. Data base user interface 102, which controls operation of data base system 101, is displayed in a window of display 101. For present purposes, data base user interface 102 has two
components: command area 103 and view display 105 Command area 103 displays input from input device 107. The input may be a query, in which case, a view 117 which satisfies the query is constructed in memory 111 and is displayed in view display 105, or
it may be a transaction, in which case data base system 101 changes the data it contains as specified in the transaction.
The actual information in the data base is generally contained in mass storage 121, although it may be maintained in memory 111 in systems where extremely rapid access is required. The information itself is contained in relations 124. There are
two kinds of relations: base relations 125, in which the data in the data base is originally stored, and which are therefore the relations that are directly changed in a transaction, and materialized views, which are views, i.e., relations derived from
other relations by a query, which have been written to mass storage 121. Data dictionary 123, finally, is what makes the data in relations 124 into relations. For each relation in relations 124, data dictionary 123 specifies the relation's name, the
names and data types of the relations's column, and the actual location of the data for the relation in mass storage 121.
To operate data base system 101, a user uses input device 107 to input a command to command display 103. The command may specify either a transaction or a query. In either case, relations and fields in relations are specified in the command by
name and the command may also specify conditions on field values. The command is typically written in a query language such as SQL, but graphical command interfaces may also be employed. If the command is a query, processor 119 under control of data
base program 113 responds to the query by using data dictionary 123 to interpret the relation names and field names in the query, retrieves the rows and fields specified in the query into program data 115 in memory 111, uses the conditions to determine
which of the retrieved rows and fields are to be used in view 117 specified by the query, sets up view 117 in memory 111, and then displays it in view display 105.
If the command is a transaction, it specifies relations, fields, and perhaps even conditions as in a query, but it additionally specifies changes in the relations or fields, for example, deletions or additions of rows in relations or changes in
the values of fields. Processor 119 under control of data base program 113 again uses data dictionary 123 to interpret the transaction and locate the row which needs to be deleted or changed or to determine where a new row should be stored in mass
storage 121. In the case of a deletion, data base program deletes the indicated row; in the case of an addition or change, program 113 constructs the new record in program data 115 and then writes it to the proper location in mass storage 121.
As is apparent from the foregoing description of data base system 101, data dictionary 123 contains the descriptions which permit data base program 113 to properly interpret queries and transactions. FIG. 2 presents a block diagram of a data
dictionary 123 of the type employed in systems which require view maintenance. For present purposes, two parts of data dictionary 123 are of interest: relation descriptions 201 and transaction response descriptions 207. Relation descriptions 201
contain a description of each of the relations 124 in mass storage 121. In a system which materializes views, there are two kinds of relation descriptions: base relation descriptions 203, which describe base relations 125, and materialized view
descriptions 205, which describe materialized views. A materialized view description 205 is locatable in data dictionary 123 from the base relation descriptions 203 for the base relations used to make the view.
Transaction response descriptions 207 describe data base system 101's responses to a change in a relation in relations 124. As indicated above, depending on the data base system, the responses may include maintaining materialized views (209),
performing consistency checks (211), executing trigger code (211), executing monitor code (215), and performing active queries (217). Transaction response descriptions which involve a particular base relation are accessible from the base relation
description 203 for the base relation. When data base system 101 executes a transaction, the transaction response descriptions 207 for the base relations involved in the transaction are checked and if any of them describes the transaction being
performed, the responses specified in the transaction response descriptions are carried out as part of the transaction.
Producing Improved Response Descriptions
As pointed out in the Description of the Prior Art, supra, the prior art has not provided techniques for producing transaction response descriptions which were both based on equational reasoning and were able to deal with duplicates. The
following discussion of novel techniques for producing such transaction response descriptions begins with an example which illustrates the inadequacies of prior-art techniques based on equational reasoning for producing transaction response descriptions,
then explains the bag algebra upon which the novel techniques are based, next specifies the equations underlying the novel techniques, and finally presents the algorithm which implements the equations.
An Example: FIG. 5
FIG. 5 shows a data base 501 which will be used as an example in the following discussion. Data base 501 has three base relations, indicated by the reference numbers 502, 507, and 511, and a view 513. Each relation is a table which is made up
of one or more rows 503 and columns 505. Relations S1(Pid, Cost, Date) 502 S2(Pid, Cost, Date) 507 are relations for recording shipments of parts received from two different suppliers. In the notation used to describe the relations, S1 is the name of
the relation and the names in parentheses following the name of the relation are the names of the columns. Thus, each row in relations S1 and S2 has three attributes (or fields). The attribute Pid is a unique identifier for the part with which the row
is concerned. Cost is the cost of the part identified by Pid, and Date is the day the shipment which included the part arrived. In addition we have the relation Paid(Pid, Cost, S), which is a table of parts that have already been paid for. Since there
are only two suppliers, S1 and S2, The attribute S in this relationship must have either the value 1 or the value 2, indicating which supplier was paid.
We would like to compute the total amount of money we owe--the cost of all parts received by not yet paid for. One way of doing this is to define a view Unpaid 513 which contains a row for each part which has been received but not yet paid for.
The attributes of the row are Pid and Cost. Once the view has been computed, the total amount owed can be found by adding the values in Cost. One way of defining Unpaid 513 is to produce a first view V.sub.1 which contains rows made up of the Pid and
Cost fields from both S1 and S2 and a second view V.sub.2 which contains rows made up of the Pid and Cost fields from Paid, and then produce Unpaid by finding all the rows in V.sub.1 which are not in V.sub.2.
A definition of these operations in bag algebra uses the projection operator II, the additive union operator , and the monus operator -. II makes a relation which has a row for each row of the relation from which the projection is made, but has
only the attributes specified in the operation. is a union operator that includes duplicates when it makes a union of two bags. - subtracts multiplicities. If a record r occurs n times in S and m times in T, then the number of occurrences of r in S-T
is n-m if n.gtoreq.m and 0 if n<m. The bag algebra definition looks like this: ##EQU1## Here, because is used, V.sub.1 contains two copies of the record [Pid P1, Cost 1,200]. Because- is used, Unpaid contains 1 row for P1 and two rows for P4.
Once we have Unpaid, the TOTAL() aggregation operator can be used to obtain the amount still owed:
Note that multiset semantics gives the correct answer here, while set-theoretic semantics, which do not permit duplicate rows, do not. For example, for the relations shown in FIG. 5, the amount Owe is $7,400. However, if we switch to set
semantics and calculate Unpaid as (II.sub.Pid,Cost (S1).orgate. II.sub.Pid,Cost (S2))-II.sub.Pid,Cost (Paid), then Owe calculated by the same formula equals $4,800.
The materialized view maintenance problem arises if we materialize Unpaid 513. Then, if we find that Paid contains a mistake, for instance, a payment was made to supplier S1 for part P3 instead of to supplier S2 for part P5 and we correct the
mistake in Paid, we must also update Unpaid. In the bag algebra, this correction is represented by using the delete bag .gradient.Paid to represent the rows that the correction will delete from Paid and the insert bag .DELTA.Paid to represent the rows
that the correction will insert in Paid. The transaction which corrects Paid may thus be described in the bag algebra as follows:
Rather than recomputing the entire view Unpaid from scratch, we would like to find expressions .gradient.Unpaid and .DELTA.Unpaid such that
This has the potential of greatly reducing the amount of computational resources needed to recompute this new value.
As required for our correction, let .gradient.Paid contain the single record [Pid P5 Cost 4,000, S 2] and .DELTA.Paid contain the single record [Pid P3, Cost 1,300, S 1]. (That is, we discovered that a payment was made to the first supplier for
P3 rather than the second for PS.) Then it is fairly easy to see that .gradient.Unpaid should evaluate to [Pid P3, Cost 1,300] and that .DELTA.Unpaid should evaluate to [Pid P5, Cost 4,000].
Since our algorithm works with delete bags and insert bags, it need not know anything about what kind of changes occurred in the rows being inserted and deleted. Thus, in the example, it simply produces the delete bag, .gradient.Unpaid, as
follows:
and the insert bag, .DELTA.Unpaid, like this:
The min operator used to produce .gradient.Unpaid is defined as follows: S min T is a multiset such that the number of occurrences of a record r in it is min(n,m), where n and m are numbers of occurrences in S and T respectively.
An advantage of the way that the delete and insert bags are defined is that evaluation of the expressions for .gradient.Unpaid and .DELTA.Unpaid can be made very efficient. First we assume that all relations S1,S2 and Unpaid have an index built
on the them that uses Pid as a key. Then, in order to evaluate the expressions for .gradient.Unpaid and .DELTA.Unpaid we only have to find the numbers of occurrences of elements of II.sub.Pid,Cost (.gradient.Paid) and II.sub.Pid,Cost (.DELTA.Paid) in
V.sub.1, V.sub.2 and II.sub.Pid,Cost (Paid). For example, to find .gradient.Unpaid, for each r .DELTA.Paid we find x,y,z, .upsilon. as the numbers of occurrences of r in .DELTA.Paid, .gradient.Paid, V.sub.1 and V.sub.2. Then R occurs min
{(x-y),(z-.upsilon.)} times in .gradient.Unpaid. Thus, the complexity of such evaluation depends on how fast we can access elements of the base relations, not on how fast we can access the views. Typically, access to base relations is much faster than
access to views. Even if no index exists, the time complexity is still linear in the sizes of the base relations.
Another big benefit of the above approach is reduced memory requirements. Whereas recomputing the whole view Unpaid would require memory space linear in the size of base relations, the propagation algorithm only requires that we find the number
of occurrences of certain records in base relations and then evaluate an arithmetic expression. Therefore, space needed for updating the view Unpaid is linear in the size of changes to the base relations. Typically, the number of rows which change is
relatively small compared to the number of rows in the relations. Thus, calculating changes to the view as opposed to recomputing the view leads to substantial improvement in space usage.
Once changes to Unpaid are calculated, the new value of Owe is found as
The correctness of this is guaranteed for our solution. Indeed, Owe.sup.new calculated above is $10,100, and one can see that it is the correct amount still owed once changes to Paid have been made.
The Bag Algebra
Several equivalent approaches to bag-based database languages have been proposed. See S. Grumbach, T. Milo, Towards tractable algebras for bags, In PODS-93, pp.49-58; L. Libkin, L. Wong, Aggregate functions, conservative extension, and linear
order, in Proc. Database Programming Languages, Springer Verlag, 1994, pp. 97-114; and L. Libkin, L. Wong, New techniques for studying set languages, bag languages, and aggregate functions, in PODS-94, pp. 155-166. The bag-based database language
that we use to formulate the change propagation algorithm differs from the proposed languages in that it is restricted to to flat bags (that is, bag-valued attributes are not allowed).
Definitions of Bag Algebra Operations
In what follows, base relation names are denoted by the symbols R, R.sub.1, R.sub.2, . . .. Let p range over quantifier-free predicates, and A range over sets of attribute names. expressions are generated by the following grammar. ##EQU2##
To define the semantics of these operations, let count(x, S) be the number of occurrences of x in a bag S. Then, for any operation e in the language, we define count(x, e(S,T))or count(x,e(S))as a function of count(x, S) and count(x, T) as
follows: ##EQU3## This language is not intended to be minimal. For example, min can be defined as S min T def S- (S-T). For the full characterization of interdefinability of the operations of , consult Libkin and Wong, Some Properties . . ., supra.
We use the symbols S, T, W, and Z to denote arbitrary expressions, and s to denote a database state, that is, a partial map from relation names to multisets. If s is a database state and T is a expression such that s is defined on all relation
names mentioned in T, then s(T) denotes the multiset resulting from evaluating T in the state s. (Note that s is a function, so we consider evaluating T in s as the result of applying s to T.) The notation T=.sub.b S means that for all database states s,
if s is defined on all relation names mentioned in S and T, then s(T)=s(S).
Specification of Transactions
A transaction is a program that changes the state of a database in one atomic step. There are many approaches to languages for specifying transactions. Here, we adopt an abstract view of transactions, in order to make the results independent of
a particular language used, but at the same time readily applicable to any such language.
The abstract transactions to be considered are of the form ##EQU4## As before, the expression .gradient.R.sub.i represents a bag deleted from relation R.sub.i and .DELTA.R.sub.i represents a bag inserted into base relation R.sub.i. More
formally, when transaction t is executed in state s of the data base, then value of R.sub.i in state t(s) becomes s((R.sub.i -.gradient.R.sub.i) .DELTA.R.sub.i).
A pre-expression for a transaction is an expression which can be evaluated before the transaction to determine the result of the transaction. In the case of an update of a materialized view, the pre-expression indicates how the materialized view
is to be updated; in the case of consistency checking, the value of the pre-expression indicates whether the results of the transaction are consistent with the data base's consistency constraints; in the cases of monitors, triggers, and active queries,
the value of the pre-expression indicates which monitor, trigger, or active query will be activated.
More formally, The expression T is a pre-expression of S w.r.t. t if for every database state s we have s(T)=.sub.b t(s)(S). It is easy to check that ##EQU5## is a pre-expression of S w.r.t.t.
Goals to be Achieved by the Technique
The aim of the technique is to provide an algorithm for generating strongly minimal pre-expressions for view maintenance at compile time which are computationally more efficient than recomputation of the materialized view. More particularly,
suppose S(R.sub.1, . . . R.sub.n) is a expression and t is a transaction. We would like to determine how t's changes to the base relations propagate to changes in the value of S. In particular, we seek to construct expressions .DELTA.S and .gradient.S,
called a solution for pre(t,S), which may be used in pre-expressions of the form
Clearly, not all pre-expressions having the above form are equally acceptable. For example, .gradient.S=S and .DELTA.S=pre(t,S) is always a solution. What are "good" solutions? First, if S is a materialized view, then it will be generally
cheaper to evaluate (S-.gradient.S) .DELTA.S than to evaluate pre(t,S) in the current state (or to evaluate S after t has been executed). Second, we will impose some "minimality" conditions on .gradient.S and .DELTA.S to make sure that no unnecessary
tuples are produced. In particular,
1. .gradient.S-S=.sub.b .phi.: We only delete tuples that are in S.
2. .DELTA.S min .gradient.S=.sub.b .phi.: We do not delete a tuple and then reinsert it.
A solution meeting condition (1) will be called weakly minimal, while a solution meeting both conditions (1) and (2) will be called strongly minimal. Note that, in contrast to the relational case, it does not make sense to insist that S be
disjoint from .DELTA.S since a transaction may increase the multiplicities of elements in S.
Minimality (weak or strong) is especially desirable due to the way in which changes interact with aggregate functions. For example, we have
assuming a (weakly or strongly) minimal solution.
Again, not all strongly minimal solutions are equally acceptable. For example, the pair
and
is a strongly minimal solution. However, there is no advantage in using it to maintain the view given by Q.
The Algorithm for Finding Strongly Minimal Pre-Conditions
The following discussion will first present the collection of change propagation equations employed in the preferred embodiment and will then describe the recursive algorithm used in the preferred embodiment to apply the change propagation
equations.
The Change Propagation Equations: FIG. 6
As already pointed out, techniques for doing equational reasoning using equations written in a relational algebra have been described in the Qian-Wiederhold reference supra. That reference discloses a change propagation algorithm which is based
on a collection of equations that are used to "bubble up" change sets to the top of an expression. For example, the Qian-Wiederhold reference uses the equation
to take the insertion .DELTA.S into S and propagate it upward to the insertion .DELTA.S-T into S-T.
Our algorithm is based on a collection of such propagation rules for expressions in the bag algebra . Finding such a collection is complicated by the fact that expressions, unlike set-valued relational expressions, do not obey the familiar
Boolean algebra used with set-valued relational expressions. When (S.orgate..DELTA.S)-T is replaced with the bag expression (S .DELTA.S)-T, the above example becomes
which is not immediately obvious. FIG. 6 shows the collection 601 of equations for change propagation in bag expressions which are used in the preferred embodiment of the algorithm. Some subexpressions are annotated with a .gradient. (for a
deletion bag) or a .DELTA. (for an insertion bag). This annotation simply emphasizes the intended application of these equations: when read as left-to-right rewrite rules, they tell us how to propagate changes upward in an expression. Note that the
correctness of these equations involves no assumptions concerning minimality of the change bags.
By repeated applications of the equations in FIG. 6, we can propagate any number of changes upward. For example, consider the expression U=S T. Suppose that
The changes to S and T can be propagated upward and expressed as changes to U as follows: ##EQU6## where .gradient..sub.1 U=(.gradient.S min S) (.gradient.T min T) and .DELTA..sub.1 U=.DELTA.S .DELTA.T. The last step is simply an application of
the general rules
G1. (S T) W=.sub.b S (T W)
G2. (S-T)-W=.sub.b S-(T W)
which are applied in order to collect all deletions into one delete bag and all insertions into one insert bag.
Repeated application of the equations of FIG. 6 guarantees a solution, but not necessarily a strongly minimal one. However, the following theorem shows that any solution can be transformed into a strongly minimal one:
Theorem 1 Suppose that W=.sub.b (Q-.gradient..sub.1 Q) .DELTA..sub.1 Q. Let .gradient..sub.2 Q=(Q min .gradient..sub.1 Q)-.DELTA..sub.1 Q and .DELTA..sub.2 Q=.DELTA..sub.1 Q-(Q min .gradient..sub.1 Q). Then
a) W=.sub.b (Q-.gradient..sub.2 Q) .DELTA..sub.2 Q
b) .gradient..sub.2 Q-Q=.sub.b .phi.
c) .gradient..sub.2 Q min .DELTA..sub.2 | | |