|
Description  |
|
|
BACKGROUND ART
In the development of information processing systems, relational database
management programs evolved allowing the user to search, access and alter
data contained in numerous different database tables by using specific
fields common to all such tables.
As these database systems improved, the speed and efficiency of access to
these records in the database increased and additional capability was
provided. For example, more recent data processing systems began to
provide support for multiple simultaneous users enabling each user to even
access data concurrently at a sub-page level.
Notwithstanding such improvements, one area that remained of great concern
was in providing for recovery of data, such as, after I/O or power
failures, i.e., system crashes. One reason for this was the vast amount of
time and money which became associated with the compilation of data
resident in the database as well as the great dependence which users came
to have on their increasingly vital database resources. Accordingly, a
great deal of development effort was expended in attempting to solve the
problems associated with such data loss.
Perhaps one of the most obvious general approaches to the problem was to
provide for redundancy whereby backup copies of the data were available in
the event that the database or portions thereof needed to be reconstructed
due to such incomplete log writes or detected log write failures.
Accordingly, several techniques were developed in the art for providing
such redundancy, one of the earliest being known as shadow paging which
essentially involves retaining a copy of an entire page of data while
updates were made to a second copy. After the newer copy containing the
changes was safely written to the permanent medium, the archival copy
could thence be written over. This technique was employed for example in
the database product of the IBM Corporation known commercially as SQL/DS.
A survey of various systems employing this shadow copy technique may be
found in "File Servers for Network Based Distribution Systems", Liba
Svobodova, ACM Computing Surveys, Vol. 16, No. 4 (December 1984), pages
353-399.
Although shadow paging appeared to be a viable solution in some
environments it was not without its disadvantages including the expense
and space involved in maintaining such shadow copies. Accordingly,
database systems began implementing the transaction recovery facility by
only writing changes to database records to both the changed record and to
a database recovery log. The information recorded in the database recovery
log insured that changes of committed transactions were incorporated into
the database state during system restart following a system failure (as
well as allowing changes to database records to be reversed or undone in
support of transaction rollback for uncommitted transactions).
A form of this technique became developed known as write-ahead logging
wherein the protocol required that changes be written to the recovery log
in the permanent file prior to being made to the actual database records
themselves. One problem with such logging related to the aforementioned
desirability of concurrency wherein multiple users could simultaneously
access the database desirably at a sub-page level, the need for such
concurrency being translated for example into a need for concurrent access
to index files commonly used by database programs to provide quick and
efficient access to records.
Information contained in index nodes of these index files was extremely
important in providing key record information that was frequently deleted
or inserted as records were deleted or inserted into the database tables,
and consequently such concurrent accessibility on a sub-page level was
highly desirable. A particularly important aspect of such index files was
that individual fields of a record in a database might frequently
logically contain data which was not kept in the record for itself but
rather, (by means of a pointer or descriptor) kept in a separate file.
Example of such a file is known as a long field file, wherein a long field
is contained, which may have an image associated with large data set type
items such as audio or image data which can be extremely valuable, thus
illustrating the importance of such indexes.
With the foregoing in mind, it will be appreciated that it was desirable to
provide for a database recovery system of the write-ahead logging type
which nevertheless provided for such sub-page level concurrency. Systems
were accordingly developed such as those described in U.S. Pat.
Application Ser. No. 07/059,666, filed Jun. 8, 1987, and entitled "Method
for Managing Sub-Page Concurrency Control and Partial Transaction Rollback
in a Transaction-Oriented System of the Write-Ahead Logging Type", now
abandoned, and refiled On Sept. 7, 1989, as pending continuation Ser. No.
07/406,186, as well as pending U.S. Pat. Application Ser. No. 07/115,146,
filed Oct. 30, 1987, and entitled "Method for Concurrent Record Access
Using an Index Tree", U.S. Pat. No. 4,914,569. An additional reference
that discusses these index files such as those commonly configured in a
B-tree structure known in the art is "Efficient Locking for Concurrent
Operation on a B-Tree"by Lehman and Yao, ACM Transactions on Database
Systems, Vol. 6, No. 4, (December 1981), pages 650-670, the hereinbefore
noted references being incorporated herein by reference.
Notwithstanding the aforementioned advances, problems nevertheless remained
in providing for effective database recovery First on restart processing
of such systems, files with I/O errors were not readily detectable so as
to prevent and safeguard restart operations from accessing the files with
attendant data loss. Further, means were not provided for readily
detecting incomplete log writes or detected log write failures in order to
stop the further writing of transactions. Moreover, no effective means was
provided for readily identifying such error files during restart.
Additionally, rebuilding of error file indexes was by no means automatic
but rather required explicit user action and invalidated access plans
related to the failing index.
Accordingly, systems and methods were desired for reducing data loss due to
I/O errors and power failure during non-atomic writes to disk in a
transaction management system using write-ahead logging protocol. Such
systems and methods were highly sought whereby I/O error on index files,
including system tables, caused no data loss. Also, techniques were
desired for providing automatic recovery from the errors without an
explicit user action to rebuild the affected indexes. Means were desired
whereby power failure during log file writes caused no data loss without
the necessity for employing double writes, shadow paging or the like. It
was further highly desired to provide effective means whereby I/O error on
user tables had limited data loss effect to the table in error.
Additionally, it was desirable to provide a technique for index file
rebuilds which did not invalidate the access plans related to the index.
These and other desired features not met by the prior art are provided by
the subject invention as hereinafter described in greater detail.
DISCLOSURE OF THE INVENTION
A system and method is provided for data recovery due to database system
crashes during non-atomic memory writes wherein a transaction management
system with write-ahead logging protocol is employed. In a preferred
embodiment log records are written during normal processing. The recovery
log is traversed during REDO insuring completion of logged operations.
Recovery from detected incomplete or failed log writes is effected and
non-committed transactions undone. Files with I/O errors are detected and
flagged preventing subsequent RESTART operations from accessing the files.
The system further provides for automatic rebuilding of error index files
as part of the RESTART procedure without requiring explicit user action
for invalidating access plans related to the failed index.
During normal processing flagged data file error is reported to the
application attempting access thereto. Flagged long field file error is
however reported to the application only when the application attempts use
of the particular long field file. Flagged index file error effects
rebuilding thereof without error status to the application when the system
accesses the underlying data file or the application accesses the index
file to access the correlative data records. Loss of previously committed
data as a result of detected I/O error on index files and log write
failure or power failure during log write is avoided.
BRIEF DESCRIPTION OF THE DRAWINGS
The novel features believed to be characteristic of the invention are set
forth in the appended claims. The invention itself, however, as well as
other features and advantages thereof, will be best understood by
reference to the following description of the preferred embodiment, when
read in conjunction with the accompanying figures, wherein:
FIG. 1 is an illustration of a database table;
FIG. 2 is an illustration of data pages of the database table of FIG. 1
conceptually as they would be stored in media;
FIG. 3 is an illustration of an index B-tree for the database table of FIG.
1 as it would be stored conceptually in media;
FIG. 4. is an illustration of long field file data corresponding to the
employees in the table of FIG. 1;
FIG. 5 is a flow chart illustrating the overall operation of various
components of the invention;
FIG. 6 is a flow chart illustrating the method of writing log pages to
media;
FIG. 7 is a flow chart illustrating the processing of the log pages written
according to the method of FIG. 5 and the handling of error detected
during database restart processing;
FIG. 8 is a flow chart illustrating the detection and processing of an
error on a data, index, or long field file during database restart
processing;
FIG. 9 is a flow chart illustrating an error file processing during normal
non-database restart processing;
FIG. 10 is a flow chart illustrating an error index file processing during
normal non-database restart index processing;
FIG. 11 is a flow chart illustrating the recreation of error index files
after completion of system restart.
FIG. 12 is a general functional block diagram illustrating a computerized
system in accordance with the invention for executing the routines
described with reference to FIGS. 5-11.
BEST MODE FOR CARRYING OUT THE INVENTION
In order to better understand the invention, first with reference to FIGS.
1-4 a more detailed description of representative data and the manner in
which it is stored in a database system will be given using the
illustration of the employee table of FIG. 1. With reference to FIG. 3, an
example will be given of the concept of a database index. Next, with
reference to FIG. 4, the use of such index information of FIG. 3 will be
illustrated in accessing a particularly important form of files shown in
FIG. 4, i.e., long field files. As will become more apparent, these
indexes actually reference corresponding data records which in turn
reference correlative long fields.
Next, a general description of the overall operation of the invention in a
system restart mode and the restart redo and restart undo phases will be
given with reference to FIG. 5. This will be followed by a more detailed
discussion of operations during the redo with reference to FIGS. 7 and 8
and a more detailed description of system operation after the system
restart shown in FIG. 5 with reference to FIG. 11.
FIG. 1 conceptually represents an example of actual data that a user may
have entered in a database. This data is essentially in this example a
list having headers such as employee name 10, some form of employee number
12, and some type of image data 14 such as a corresponding employee
picture. Thus, the first record in this table is Andrew, employee number
1, and data comprising a picture of Andrew. Similar records appear for
other employees entered into the database.
FIG. 2 is an illustration of the file representing part of the table data
of FIG. 1 as it might be stored on computer disk. The file represents,
sequentially, data pages such as pages 0-2 (reference numerals 16, 18, and
20, respectively) each having some of the records of the FIG. 1 table.
Each data page has a header and trailer. Page 0, for example, has 22 for
the header and 30 for the trailer; page 1 has 36 and 44; and 20 has 50 and
58 for the header and trailer, respectively. The header and trailer each
contain a log sequence number or LSN. This LSN is copied from the header
to the trailer prior to writing out a page. After reading back a page, the
header and trailer are compared to make sure that they are identical in
order to verify that the previous write was completed.
Referring more particularly to page 0 (reference numeral 16), there is a
record pointer directory 24 which has a first slot 0 pointer 32. This
points to the record 26 which itself contains the name "Edwards", the
serial number "24", and a pointer to a long field file containing video
data of Edward's image. The next slot in record pointer directory 24 is
not used in this example, i.e., it contains no pointer pointing to another
long field file with video data. Slot 2 has a pointer 34 pointing to
record 28, which, in turn, contains a pointer to a long field containing
video data of Andrew's image In summary the indexes will point to data
records directories containing pointers to records. These records contain
names, serial numbers and long field descriptors or pointers to the actual
long field image data.
Similarly, the page 1 (18) has a record pointer directory 38. In this case
a slot 0 pointer 48 points to record 42 containing Howell's data wherein a
long field pointer to the image data resides. The slot 1 pointer 46,
points to a record 40 containing a long field pointer to a long field
containing Baker's image data. Similarly, for page 2, the record pointer
directory 52 has a slot 0 having a pointer 62 which points to a record 56.
This record contains a long field pointer to image data of Chester in a
corresponding long field file. Slot 1, in like manner, has a pointer 60
which points to record 54 having, in turn, a pointer to a long field
wherein Edgar'image data resides.
The long fields, as is well known in the art, are where the actual
digitized video data of the images are stored, preferably in a compressed
format. Thus, the pointers 32, 34, 46, 48, 60 and 62 to the records, and
the records contain actual indicated displacements to the beginning of
their respective long field files wherein image data resides. As an
example in FIG. 2 the pointer 62 points to a record 56 which would contain
a number corresponding to 1098 (as shown in the top portion of FIG. 4)
which is the beginning of the segment directory to the long field where
the compressed image data of Chester 82 resides.
In relational databases, it is necessary to efficiently access the data
such as that of Table 1, FIG. 1. This is conventionally done by means of
an index file, an example of which is schematically illustrated (for the
Table 1 data) in FIG. 3, in a manner well known in the art. FIG. 3 depicts
an index for Table 1 wherein the index is in the form of a two level tree.
The root level on page 3 (64) of the database file has pointers to the
three leaves or nodes 66, 68 and 70. These nodes have index data stored in
pages 5, 4, and 6 of the index file, respectively. D, F, and "null" in
page 3 (64) of the index file each represent the highest possible "key" in
the particular node 66, 68, 70 to which they point, respectively. In this
case, a key is the first alphabetic character of the employee name
although in other applications wherein numeric data is stored the keys may
be numbers. Andrew, Baker, and Chester have first alpha characters less
than D and thus indexing data related to them are stored in node 66. F in
root 64 points to node 68 which contains Edgar and Edward's indexing data
because their first alpha characters are less than F but greater than D.
The null pointer of root 64 is the highest allowable key and thus points
to the last leaf 70. Leaf 70 contains Howell since its first alpha, H, is
greater than F (the highest key in the preceding node 68).
The significance of the numbers in FIG. 3 following the names is that they
represent the record identifications or i.d.s of their respective names.
For example, for node 66, Andrew 0,2 indicates that the image data for
Andrew may be located by first retrieving the location 34 of record 28
containing the long field pointer from page 0's record pointer directory
24, slot 2, and thence retrieving the actual long field data at 92 (FIG.
4) pointed to by the long field pointer contained in record 28.
FIG. 4 is an illustration of the storage of long field files in a computer
disk. Reference numeral 72 indicates the beginning of a long field file
having the first image data at 82. An internal segment directory for each
image provides details about the respective image data. Thus directory 80
for Chester's image data 82 indicates one entry with a length of 10,500
bytes and a starting displacement at 1106. Similarly, in reference number
18, page 1 of FIG. 2, the long field pointer in record 42 to image data of
Howell would point to 11,800, (FIG. 4) the beginning of the internal
segment directory 86 for the actual image data 88 of Howell.
With reference to beginnings 74 and 76 of additional long field data,
appropriate correlative long field file pointers depicted in FIG. 2 would
point to corresponding internal segment directories 90, 94, 98, and 102.
These directories, in turn, would contain details about their respective
actual compressed image data 92, 96, 100, and 104 relating to Andrew,
Edwards, Baker, and Edgar. It will be noted that although in the
embodiment herein depicted the segment directories appear in the long
field areas, these areas may if desired carry only image data, in which
case the segment directories may appear as a matter of choice in the
records depicted in FIG. 2.
Now that a clearer understanding has been gained of the roll of indexes in
database systems an example of which is in accessing and updating through
a pointer found from the index an important type of database file just
described, namely the long field file, the overall operation of the
recovery system of the present invention will be described with reference
to FIG. 5.
When a system restart begins, 260, log records will begin to be read in,
262. Each log record will contain an indication of a particular page to
which the log record and the database action contained therein relates.
Essentially the software through steps 260-272 will read in the data pages
referenced by these log records or the LSNs associated therewith in order
to verify that the database operations or actions of the log records have
been done - this phase being referred to as the system restart-redo phase.
At 266 for a particular log record, the associated page file TYPE such as
a data, index, or long field file is identified. The particular page is
then read by the system, 268, and a check is made at 270 as to whether the
LSN on the page just read in is equal to or greater than the LSN of the
associated log record. If Yes, it is known that the operation identified
in the log record has been completed and the process loops back to read
the next log record at 262. If, however, the check yields a No, this
indicates that the original database action or operation associated with
the log record just read must be redone, 272. The basic idea, thus, per
the aforementioned U.S. Pat. Application Ser. No. 07/115,146, now U.S.
Pat. No. 4,914,569. (Write-Ahead Logging Protocol), is that the entire log
file is read until no more log records are left, as checked at 264,
whereupon if Yes at 264, this indicates that all history has been repeated
and the restart-redo phase has been completed. Thus, exiting the decision
block 264 at Yes signifies that now all logged transactions which did not
commit must be rolled back, and accordingly the system restart-undo phase
is entered at 274.
Generally, during this undo phase, each log record representing uncommitted
transactions is undone in time order, i.e., all log records are checked in
the backward direction to see if they are part of uncommitted
transactions. If so, an abort is performed on the log record. This undo
phase may be seen represented by steps 274-282 of FIG. 5.
More particularly, the previous log record is read at 276 and a check is
made as to whether any log record has been found at 278, If not, a check
is made at 279 of whether during system profile or configuration the user
specified a request that error indexes automatically be recreated. If No,
normal operation resumes at 281, i.e. system restart is completed. If Yes
the software program performs step 284. This corresponds to what
transpires right after the system restart operation of FIG. 5. Wherein the
process exits to complete recreation of error index files as shown in more
detail with reference to FIG. 11.
As previously noted regarding the purpose of the start-undo phase, when a
log record is found, the check at 280 determines whether the specific log
record is associated with an uncommitted transaction. If Yes, the system
performs and undo of the database action identified in the log record at
282 whereupon the process loops back to read the next previous log record,
276. If, of course, the particular log record under consideration is not
involved in an uncommitted transaction at 280, the process loops back to
276 to read the next previous log record without performing any undo
operation such as that of step 282.
With reference to the steps 262 and 266 at the top of FIG. 5, it is of
course obvious that a page must be read to obtain a log record as shown at
step 136 of FIG. 7. In addition to describing in more detail such
operation of reading log pages as shown in FIG. 7, the steps of FIG. 7
also indicate operation of the system of the present invention when errors
are encountered in reading pages. For example, in one case wherein the
page is found with a log error, the process may continue to step 268. If
on the other hand, for example, I/O errors occur on both pages such as
indicated at 180 of FIG. 7, a fatal error has occurred causing a return
error log file status. Another case with reference to FIG. 7 is when an
end of log file status is returned such as 170, this obviously indicating
with reference back to FIG. 5 again that the step 264 is reached in the
process whereby all history has been repeated, i.e., the restart-redo
phase is completed.
In like manner to FIG. 7 providing more detail as to steps 262 and 266,
FIG. 8, to be hereinafter described in greater detail, provides more
details as to the steps actually involved in reading pages from the data,
long field, and index files as shown at step 268 of FIG. 5. Particularly
shown are steps effected upon encountering an error in reading a page from
the file at 184, and 186 of FIG. 8 wherein the file is renamed at 194 and
whereupon the process loops back on the Yes path exiting block 270 to read
the next log record per step 262.
Thus, it will be noted that FIGS. 7, 8 and 11 relate to operations of the
invention during system restart. Specifically FIG. 7 details the
processing of log pages and handling errors detected during the restart.
FIG. 8 details the detecting and processing of errors relating to the
actual data, index, and long field files during restart, and FIG. 11
details the steps in recreating error index files after completion of
restart-undo. In contrast, FIGS. 6, 9 and 10 relate to the normal forward
processing operations of the database system. More particularly, FIG. 6
details the writing of log pages during such normal processing, FIG. 9
details processing of read errors conventionally encountered in normal
database operations, and FIG. 10 describes error index file processing
during such normal, i.e., non-restart, index processing during
conventional operation of the database system.
FIG. 6 is a flow chart illustrating the method of writing log pages to a
computer media such as a hard file. Generally the system is in an "append"
mode whereby log records are continuously being added or written to the
recovery log. Periodically a long page is then written out to disk. 106
indicates the start of writing log pages to disk in response to a
requester function which has issued a commit. Eventually the log record
requested by the requester function will be written out to a page after
the requester has issued a commit, however it will be appreciated that the
process depicted in FIG. 6 will occur before the page is actually written
out. At commit time it is desirable to write out the log page containing
the data of all the log records created up to that point.
At 108, the process inquires as to whether the log page to be written is
filled or not. If the log page is not filled (i.e., the log page is less
than the page size), a partial page is indicated and the process proceeds
to 114 wherein it is determined whether that page has already been
written. If not, the process proceeds to 116 and the page is written at
location n+1. Next the page write counter is set to 1 at 118, and there is
a return 120 to the caller of this routine which started at 106 and which
actually completes the algorithm to write a page out.
In block 114, if the page has already been written, exiting on the Yes
branch occurs to 122 where the page counter is incremented. If the page
counter is even (as determined at 124) then the process proceeds to 128,
causing the nth log page to be written at location n+2. If at 124 the page
write counter is odd, then the procedure goes to 126 where the nth log
page is written at location n+1. In either case, return is then effected
at 130 or 132 to the requester or caller of this routine. Returning to
108, if the log page size is not less than the page size (that is to say a
complete page has been filled) the process moves to step 110 wherein the
nth log page is written at location n, and thence the process returns at
112 to the caller.
FIG. 7 is a flow chart illustrating the processing of log pages and the
handling of an error detected during database restart processing. 134 is
the start of this processing wherein at 136 the nth log page is being read
in. If, after reading in the log page, it is determined at 138 that there
is an I/O error or a mismatch of the header and trailer LSNs, then at 172
the next log page n+1 is read in. If there is I/O error or mismatched LSN,
determined at 174, then an error status is returned, 180. In this
particular case at 180 the database could not be recovered and must thence
be restored from a backup copy. If, at 174, there is no I/O error or
mismatched header-trailer LSN's, then decision block 176 is reached. If
the LSN of page n+1 is less than the start LSN of the log, then it is
known that this page has not previously been written out and the process
proceeds to 178. At 178 end of log file status is returned, indicating
processing of all pages has been completed.
Continuing in FIG. 7 with the "No" branch of 176, if the LSN of page n+1 is
greater than or equal to the start LSN of the log, then processing
continues at 148 wherein the LSN at page n+1 is checked to see whether it
is less than the start LSN of the log. If it is, then this process of FIG.
6 is completed and an end of log file status is returned at 154. In 148,
if the LSN of page n+1 is not less than the start LSN of the log, at 152
the next log page n+2 is read. After reading that page, at 156 it is
determined whether an I/O error or mismatched LSN header/trailer has
occurred. If so, then the log records from page n+1 are used and we return
to the caller at 162. Returning back to 156, if there is no I/O error or
mismatch, at 158 it is determined whether the LSN for page n+2 is less
than the LSN for page n+1. If so, again the process returns page n+1 at
162. If not, the process returns page n+ 2 as shown at 160. At this stage
the process is simply determining the page with the highest LSN and
returning that page. So long as there is a valid LSN, i.e., one which is
greater than the beginning of the log, the process will continue returning
the highest page of the two alternate pages which could be written out if
an I/O error or mismatch is encountered. It will be noted that in
referring to returning pages actually specific page numbers are returned
and then the appropriate records from the numbered page.
Returning to decision block 138, after reading in the nth log page at 136 a
check is made to determine whether an I/O error or mismatched
header/trailer LSN has occurred. If not, at 140 a check is made of whether
the LSN page n is less than the start LSN of the log. If so, this
indicates that work has occurred only on pages n+1 or n+2, and flow to 144
then occurs wherein the next log page n+1 is read. However, if the LSN of
page n is greater than or equal to the start LSN of the log, this
indicates that the page has been read in successfully and is the correct
page to use, and consequently a read successful status is returned per
142.
Returning back to 144, after reading in the next log page n+1, a check for
I/O error or mismatched header/trailer LSN is made at 146. If mismatching
or error is detected then the next log page n+2 is read in per 150. Next,
I/O error or mismatch | | |