Claims
What is claimed is:
1. In a multiplexed data area computer complex including processors, main
storage having program address spaces, at least one non-volatile storage
device connected to said main storage, at least one subsystem in one of
said program address spaces in said main storage and shared status data
stored in fields in a predetermined area in said main storage for each of
said processors, the apparatus for communicating subsystem status among
said processors comprising:
a single, shared to-be-read-from version of said shared status data
including a first storage area formatted on said storage device, said
first storage area including records for containing changed data fields,
a single, shared to-be-written-to version of said shared status data
including a second storage area formatted on said storage device, said
second storage area including records for containing changed data fields,
and
a program in one of said processors or address spaces for updating
subsystem status on said to-be-written-to version by performing output
operations on said to-be-written-to version and input operations on said
to-be-read-from version when one of said processors solely controls and
has access to said shared data, said operations comprising:
means for reading from said first storage area and overwriting portions of
said shared status data in said predetermined area with data in said
changed data fields read from said first storage area,
means for reading only changed ones of said records in said to-be-read-from
version and overwriting said shared status data in said predetermined area
with said changed ones of said records,
means for writing said one of said changed records in said shared status
data in said predetermined area to said to-be-written-to version,
means for recording in-progress updates to said data fields in said
predetermined area corresponding to one of said processors that solely
controls and has access to said shared status data in said predetermined
area,
means for writing in-progress updates that have been written to said shared
status data in said predetermined area to said to-be-written-to version,
means for flip-flopping said to-be-read-from version to said
to-be-written-to version and said to-be-written-to version to said
to-be-read-from version following said input and output by said solely
controlling processor, and
means for repeating said operations with a subsequent one of said
processors controlling and having access to said shared status data.
2. In a multi-access spool computer complex including processors, main
storage, at least one direct access storage device (DASD) connected to
said main storage, a subsystem in an address space in said main storage
and a checkpoint data set stored in a predetermined area in said main
storage containing control blocks for each of said processors, the
improvement comprising:
a single, shared to-be-read-from version of said checkpoint data set
including a first journal formatted on a track of a DASD, said first
journal including records for containing changed control blocks,
a single, shared to-be-written-to version of said checkpoint data set
including a second journal formatted on a track of a DASD, said second
journal including records for containing changed control blocks, and
a channel program in said address space for performing input/output
operations on said versions when one of said processors solely controls
and has access to said checkpoint data set, said operations comprising:
means for reading from said first journal and overwriting portions of said
checkpoint data set in said predetermined area with data read from said
first journal,
means for reading only changed ones of said records in said to-be-read-from
version and overwriting records in said checkpoint data set in said
predetermined area with said changed ones of said records,
means for writing said ones of said changed records in said checkpoint data
set in said predetermined area to said to-be-written-to version,
means for recording the status of in-progress updates to control blocks in
said checkpoint data set corresponding to said one of said processors that
solely controls and has access to said checkpoint data set in said
predetermined area,
means for writing in-progress updates that have been written to said
checkpoint data set in said predetermined area to said second journal in
said to-be-written-to version,
means for flip-flopping said to-be-read-from version to said
to-be-written-to version and said to-be-written-to version to said
to-be-read-from version following said input/output operations by said
solely controlling processor, and
means for repeating said operations with a different one of said processors
controlling and having access to said checkpoint data set.
3. In a multi-access spool computer complex including processors, main
storage, at least one direct access storage device (DASD) connected to
said main storage by a channel subsystem, a subsystem in an address space
in said main storage and a checkpoint data set stored in a predetermined
area in said main storage containing control blocks for each of said
processors, the method for communicating subsystem status among said
processors comprising the steps of:
creating a single, shared to-be-read-from version of said checkpoint data
set including a first journal formatted on a track of a DASD, said first
journal including records for containing changed control blocks, and a
single, shared to-be-written-to version of said checkpoint data set
including a second journal formatted on a track of a DASD, said second
journal including records for containing changed control blocks, and
executing a channel program in said address space for performing
input/output operations on said versions when one of said processors
solely controls and has access to said checkpoint data set, said
operations comprising the steps of:
reading from said first journal and overwriting portions of said checkpoint
data set in said predetermined area with only changed data read from said
first journal,
reading only changed ones of said records in said to-be-read-from version
and overwriting records in said checkpoint data set in said predetermined
area with said changed ones of said records,
writing said ones of said changed records in said checkpoint data set in
said predetermined area to said to-be-written-to version,
recording status information of in-progress updates to control blocks in
said checkpoint data set corresponding to said one of said processors that
solely controls and has access to said checkpoint data set in said
predetermined area,
writing in-progress updates that have been written to said checkpoint data
set in said predetermined area to said second journal in said
to-be-written-to version,
flip-flopping said to-be-read-from version to said to-be-written-to version
and said to-be-written-to version to said to-be-read-from version
following said input/output operations by said solely controlling
processor, and
repeating said operations with a different one of said processors
controlling and having access to said checkpoint data set.
4. The multi-access spool computer complex of claim 3 in which said status
of in-progress updates is recorded in a change area on said address space.
5. The multi-access spool computer complex of claim 2 in which control
bytes having bits each corresponding to a processor in said complex
control said I/O operations on said versions.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention generally relates to a multi-access spool computer complex
in which the checkpoint mechanism for providing communication among
processors in the complex has been improved. In particular, this invention
relates to the reduction in input/output to a subsystem checkpoint data
set and the increase in the amount of time that a processor in the complex
can update the checkpoint data set.
2. Prior Art
The checkpoint (data set) function has existed since the HASP (Houston
Automatic Spooling Program) in the late 1960s. In HASP, and in early
versions of the job entry subsystem (JES) following HASP, the first
several tracks of direct access storage devices were formatted into
special checkpoint records. Periodically, the storage copy of the job and
output queues was written to the direct access storage device (DASD),
hence the name "checkpoint". Originally the sole purpose of the checkpoint
was to maintain a copy of the work queues on the DASD so that the system
could be restarted. When the processor was restarted after a failure or
normal shutdown, HASP could read the checkpoint from the spool and
continue processing as if it had never been interrupted. (The "spool" is a
space allocated by the JES for containing data related to the jobs to be
run and job output.)
Subsequent changes were made such that the checkpoint data set now resides
on a shared DASD which is accessible to all the processors in a
multi-access spool (MAS) complex. By reading and writing data from/to the
checkpoint data set, a processor maintains an up-to-date copy of the
checkpoint information in its storage. When a processor changes data in
the queues, it must reflect the change to other processors via the
checkpoint data set.
The MAS complex was introduced with the release of MVS and JES Release 3 in
order to permit loose coupling of from two to seven processors (members).
In the MAS complex, the checkpoint now serves as the communication pathway
between processors (or multiple JES address spaces on one processor) each
running asynchronously but cooperatively. Since hardware control over the
updating of information was needed, the checkpoint information was moved
to a separate MVS data set. In a JES MAS complex, all processors share
access to the same set of job and output queues. A job can be read in by
any processor, can execute on any processor, its output can be printed or
punched by each of the processors, and operators can control jobs anywhere
in the complex. Each processor in a MAS complex maintains an in-storage
copy of these job and output queues. All processors in the complex are
therefore equal in control and in processing responsibility.
As the main mechanism for communication between processors in a MAS
complex, the checkpoint data set contains much of the information that JES
needs in order to control its functions. Aside from the job queues and
output queues from which work is selected, the checkpoint also contains a
record of member values describing the overall configuration of the MAS
environment and specific characteristics and information describing the
current status of each member. All of this data is used to control the
operation of the checkpoint and related aspects of JES. The checkpoint
data set also contains data that the processors in the complex use to
"share" the spool space. This includes extent information about the spool
volumes which are currently mounted and the bit map from which spool space
is allocated. In fact, structures in the job and output queues contain
pointers through which all data on the spool volumes are actually
located. The checkpoint data set is shared by the members using a
time-slicing protocol. Each member is assured the privilege of exclusively
updating the data set during the time that it has control of and access to
the data set.
As installations grew, the size of their checkpoint data sets also grew and
some MAS installations began to experience performance problems related to
the amount of time required to perform checkpoint I/O (read/write)
operations. JES had to read the entire data set because there was no way
of determining which records had been changed by other processors in the
complex. As a result, JES could become essentially serialized until the
checkpoint I/O operation completed. This led to remote job entry timeouts
and slow responses to operator commands. (It was not anticipated that the
speed of processors in a MAS would be increased more rapidly than the
speed of checkpoint I/O operations.) Also, much of the I/O for the
checkpoint data set was unnecessary because many of the records in the job
queue and job output table were unchanged. Thus the records being read
overlaid identical data already in storage.
The problems were alleviated to some extent with the implementation of
control bytes which reside on the checkpoint data set itself for
identifying records that were changed. As a result, only the changed
records had to be read. The physical format of the data set was changed to
consist of fixed length 4K blocks. JES also began using the services of
MVS to manage the real storage associated with checkpoint I/O buffers.
Nevertheless, problems with respect to unrecoverable checkpoint failures
continued to occur and the amount of time required to perform I/O
operations was still too great. Clearly JES was incapable of providing
adequate error recovery for its checkpoint data set and did not provide
efficient communication among processors within a complex. The slowness of
checkpoint I/O operations and the loss of the checkpoint data set itself
are still major problems in large MAS installations.
Special error detection mechanisms, including unusual channel programming
and a philosophy of minimal dependence on hardware error reporting, were
implemented. An auxiliary locking mechanism to supplement the shared DASD
RESERVE/RELEASE locking facility was invented. A copy of the checkpoint
data set was maintained as a backup in case of media damage or other
failures. The mechanism for recovery from severe failures is the
availability of a backup (duplex) copy of the checkpoint data set. The
duplex data set must reside on a separate DASD device from the (primary)
checkpoint data set.
JES CHECKPOINT CYCLE
There are four stages in a checkpoint cycle. The first stage begins with a
read operation to update the in-storage queues and to ensure exclusive
control of the checkpoint data set. The first part of the read operation
issues a RESERVE for the checkpoint device and reads control data from the
first DASD track. The second part of the read operation reads all of the
records that were changed by other members. In the second stage, a member
is said to "own" the checkpoint. During this "useful time" it can perform
processing on and make updates to the checkpoint data. During this stage,
a write operation causes the transfer, to the duplex data set, of all
updated records that were not yet recorded on the duplex data set. This is
a duplex write. Also during this stage, the member (processor) can cause
intermediate update levels (intermediate WRITE) of the checkpoint data to
be written to DASD without losing its ownership. At the completion of the
"ownership" stage or HOLD interval, the final write of the data set is
made and a RELEASE operation is performed which enables the next member to
begin the checkpoint cycle. This is the third stage. The fourth stage is a
dormant period during which a processor makes no attempt to access the
data set and thereby allows other processors time to complete their active
phases.
Since the JES checkpoint data set can be accessed by several processors,
the RESERVE/RELEASE feature of shared DASD is used to control access to
it. When a particular JES system wishes to access the checkpoint, a
RESERVE is issued. This allows the member to update the data set until it
issues a RELEASE. If another processor attempts to access the device on
which the checkpoint data set resides while the first processor is still
holding the RESERVE, the second processor will be returned a busy
condition to its I/O operation. When the processor holding the RESERVE
finally issues the RELEASE for the DASD, an interrupt (device end) is
signalled to all processors which experienced the busy condition. At this
point, the DASD is unlocked, and the other members can again try to
RESERVE it.
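By way of illustration only, the RESERVE/RELEASE behavior described above can be sketched as a small in-memory model. The class and method names (SharedDevice, reserve, release) are hypothetical and are not taken from JES or MVS; the sketch assumes a single device and ignores channel paths and timing.

```python
class SharedDevice:
    """Toy model of a shared DASD with RESERVE/RELEASE semantics."""

    def __init__(self):
        self.holder = None    # member currently holding the RESERVE
        self.saw_busy = []    # members that were returned a busy condition

    def reserve(self, member):
        """Return True if the RESERVE succeeds, False for a busy condition."""
        if self.holder is None or self.holder == member:
            self.holder = member
            return True
        if member not in self.saw_busy:
            self.saw_busy.append(member)
        return False

    def release(self, member):
        """Release the device and return the members to be signalled."""
        if self.holder != member:
            raise RuntimeError(f"{member} does not hold the RESERVE")
        self.holder = None
        signalled, self.saw_busy = self.saw_busy, []
        return signalled  # stands in for the device-end interrupt


dev = SharedDevice()
first = dev.reserve("A")      # member A obtains the device
second = dev.reserve("B")     # member B is returned a busy condition
signalled = dev.release("A")  # B is signalled once A releases
retry = dev.reserve("B")      # B may now try the RESERVE again
```

In this model a member that was returned a busy condition simply retries after being signalled, which corresponds to the point at which the DASD is unlocked and the other members can again try to RESERVE it.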
There are two basic modes of operating a MAS complex. The most prevalent is
in a controlled environment where each processor "gets its turn" owning
the checkpoint data set. The other, contention mode, is not recommended.
In contention mode, it is possible to allow a processor to compete for the
checkpoint at all times, essentially by eliminating the "dormant" period.
That is, all processors attempt the RESERVE simultaneously.
CHECKPOINT DATA SET FORMAT
The first track on DASD contains three control records: an 8-byte CHECK
record, a LOCK record composed of an 8-byte key field and an 8-byte data
field, and the MASTER record. The job queues and output queues are
segmented into 4K records which reside on the remainder of the data set.
There is a possibility that some portion of the last 4K record in the job
queue and in the job output table may not be used, since the sizes of
checkpoint structures are rounded to fit into 4K boundaries. The
checkpoint data set is a non-standard data set, containing both keyed and
non-keyed records.
The CHECK record is an 8-byte record at the beginning of the first track of
the checkpoint data set. It contains a check value used to help determine
whether the remainder of the data in the checkpoint is valid. That is, it
will be used in conjunction with a companion value in another record to
indicate whether the previous update operation completed successfully.
The LOCK record is the only keyed record in the checkpoint data set. It
consists of an 8-byte key portion and an 8-byte data portion which are
identical. It is used as a software lock in addition to the normal
hardware RESERVE/RELEASE mechanism. This record is used to control access
to the data set.
The JES MASTER record is on the first track of the checkpoint data set. It
contains data such as initialization parameters which affect the
configuration of the MAS complex, the shared queue control element (QSE)
data areas that represent the status of each processor in the complex and
the checkpoint control bytes (CTLBs). A copy of the "check" value is in
the first part of this record. This copy is compared to the value in the
CHECK record during "read" operations to determine whether the records in
the checkpoint data set are all at the same update level or state.
CHECKPOINT DATA SET LOCK
The checkpoint data set lock is used as a backup for the RESERVE/RELEASE
feature of a shared DASD. RESERVE/RELEASE, by itself, is not an adequate
mechanism to guarantee that simultaneous updates will not occur, because
it has a tendency to open the lock, unintentionally, when failures occur.
The checkpoint data set lock provided by JES, on the other hand, tends to
lock closed under these conditions. It will always ensure that the data in
the checkpoint data set is good by prohibiting simultaneous updates under
any circumstances. Because of this characteristic of locking closed when a
failure occurs, this lock requires a manual operation to reset it.
When a processor gets control of the shared checkpoint, it will write a
value into the key and data portions of the LOCK record. When the lock is
not held by any processor, a value of zero will be recorded. Processors in
the MAS complex can determine whether the shared data set is available by
using a predetermined channel "search" command (CCW) with a zero data
field. If this command is successful, the remainder of a channel program
(CCWs) will set the key field to the appropriate value of the processor.
The channel program basically performs a "compare and swap" operation on
the lock record. See U.S. Pat. No. 3,886,525 which is assigned to IBM, and
which discloses and claims the "compare and swap" concept.
The operation to obtain the lock will be done as part of the initial "read"
channel program which is executed as each processor's checkpoint interval
begins. The initial "read" channel program begins by locating the LOCK
record in the checkpoint data set. It then ensures that the lock is
currently unowned. If the key of the LOCK record is currently 0 (lock
unowned), it then goes on to "read" the MASTER record and the CHECK
record. Finally, the initial "read" operation sets the lock by writing the
ID of the member that will be owning the lock into the key and data of the
LOCK record. If the lock record is currently owned (key of the LOCK record
is currently non-zero), the channel program reads the value of the lock
record in order to determine which member (processor) owns the checkpoint
lock.
When this situation occurs, JES will attempt its own error recovery. If the
system which lost the RESERVE is still running, it will eventually clear
the key value, allowing the looping members to proceed. If it is not
running, a JES operator command will be used by one of the other members
to reset the lock value, on behalf of the failed member. Thus this lock
can be used as a backup for the RESERVE/RELEASE hardware mechanism, since
it tends to lock "closed", rather than "open", when hardware or software
failures occur.
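The lock protocol described in this section may be sketched as follows, purely as an illustration. The names LockRecord, try_obtain_lock, release_lock and operator_reset are hypothetical. The sketch assumes that member IDs are non-zero and that the test and the write happen as one indivisible step, as they would within a single channel program; ordinary single-threaded Python stands in for that here.

```python
UNOWNED = 0  # key value recorded when no processor holds the lock


class LockRecord:
    """Keyed LOCK record: the key and data portions are kept identical."""

    def __init__(self):
        self.key = UNOWNED
        self.data = UNOWNED


def try_obtain_lock(lock, member_id):
    """Compare-and-swap on the LOCK record; return (obtained, owner)."""
    if lock.key != UNOWNED:
        # Lock is owned: read the record to learn which member holds it.
        return False, lock.data
    # "Search" for a zero key succeeded: write the member ID to key and data.
    lock.key = member_id
    lock.data = member_id
    return True, member_id


def release_lock(lock, member_id):
    """Clear the lock; only the owning member may do so."""
    if lock.key != member_id:
        raise RuntimeError("lock is not held by this member")
    lock.key = lock.data = UNOWNED


def operator_reset(lock):
    """Manual reset on behalf of a failed member (operator command)."""
    lock.key = lock.data = UNOWNED


lock = LockRecord()
got_a = try_obtain_lock(lock, 1)  # member 1 obtains the lock
got_b = try_obtain_lock(lock, 2)  # member 2 learns that member 1 owns it
operator_reset(lock)              # member 1 failed; the lock is reset
got_b_retry = try_obtain_lock(lock, 2)
```

Because nothing clears the key except the owner or an explicit reset, the model stays locked "closed" if the owner fails, which is the property relied upon above.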
CHECKPOINT DATA SET CHECK RECORD
JES is capable of detecting when the checkpoint data read at the beginning
of the checkpoint cycle is invalid. If another member fails during a
checkpoint update, the data in the primary (as opposed to the backup)
checkpoint data set may have been partially updated and is therefore not
valid. This occurrence can be detected by using the check value in the
CHECK record of the checkpoint data set. Each time that data is written to
the checkpoint, an incremented counter (which ranges from 1 to 127) will
be recorded in both the MASTER checkpoint record and the CHECK record. All
checkpoint write operations will write the counter value in the MASTER
record first. Then, after all of the changed queue records have been
written, the CHECK record will be written as the last write operation
(just before the I/O completion verification). When a member is to "read"
the checkpoint data, it first compares the counter value with the check
value to determine the integrity of the checkpoint data. If the values are
unequal, then one of the other members in the complex must have failed
during a checkpoint write operation and error recovery action is necessary.
This recovery always involves a warm start, i.e., a restart of JES that
retains the work already in the queues.
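The write ordering and the validity test described above may be illustrated by the following sketch. The names CheckpointDataSet, write_checkpoint and is_consistent are hypothetical, and the wrap from 127 back to 1 is an assumption; the text states only that the counter ranges from 1 to 127.

```python
def next_counter(value):
    """Advance the check counter, assumed here to wrap from 127 to 1."""
    return value % 127 + 1


class CheckpointDataSet:
    """Toy model holding the CHECK value, its MASTER copy and the records."""

    def __init__(self):
        self.check = 1         # value in the CHECK record
        self.master_check = 1  # copy of the value in the MASTER record
        self.records = {}


def write_checkpoint(ds, changed, fail_before_check=False):
    """Write MASTER first, then changed records, then CHECK last."""
    counter = next_counter(ds.master_check)
    ds.master_check = counter
    ds.records.update(changed)
    if fail_before_check:
        return  # simulate a member failing part-way through the write
    ds.check = counter


def is_consistent(ds):
    """A reader compares the two values before trusting the data."""
    return ds.check == ds.master_check


ds = CheckpointDataSet()
write_checkpoint(ds, {"job1": "queued"})
ok_after_good_write = is_consistent(ds)
write_checkpoint(ds, {"job1": "running"}, fail_before_check=True)
ok_after_failed_write = is_consistent(ds)
```

Writing the CHECK record last means that any interruption between the first and last write leaves the two values unequal, so a later reader can detect the partial update.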
CHECKPOINT DATA SET I/O OPERATIONS
As discussed above, the checkpoint cycle begins with a read operation to
update the in-storage queues (control blocks). Each read operation is
divided into two separate I/Os. The first issues a RESERVE for the
checkpoint device and reads control data including the control bytes from
the first DASD track. The control bytes (CTLBs) are used to build a
channel program to read all of the records that were changed by other
members.
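One way the control bytes could identify the records to be read is sketched below, for illustration only. The bit convention is an assumption made for this sketch and is not taken from the text: one control byte per 4K record, with bit i set while member i has not yet read the latest copy of that record. All function names are hypothetical.

```python
MEMBERS = 7  # a MAS complex couples from two to seven members


def mark_changed(ctlbs, page, writer):
    """Writer flags a changed page as unread by every other member."""
    all_bits = (1 << MEMBERS) - 1
    ctlbs[page] |= all_bits & ~(1 << writer)


def pages_to_read(ctlbs, reader):
    """Pages whose control byte shows a change the reader has not seen."""
    bit = 1 << reader
    return [page for page, byte in enumerate(ctlbs) if byte & bit]


def mark_read(ctlbs, reader):
    """Clear the reader's bit in every control byte after the read."""
    bit = 1 << reader
    for page in range(len(ctlbs)):
        ctlbs[page] &= ~bit


ctlbs = [0] * 4                      # four 4K records, nothing changed yet
mark_changed(ctlbs, 2, writer=0)     # member 0 updates record 2
for_member_1 = pages_to_read(ctlbs, 1)
for_member_0 = pages_to_read(ctlbs, 0)
mark_read(ctlbs, 1)
after_read = pages_to_read(ctlbs, 1)
```

The list returned by pages_to_read plays the role of the input to building the channel program: only the listed records need to be transferred.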
In addition to the storage actually used to contain the checkpoint data,
JES maintains a checkpoint I/O buffer. During normal checkpoint I/O,
actual CCWs transfer data to this buffer. Using the control bytes, JES
fixes the real frames associated with the pages in the buffer that will be
used. After a channel program has been built and executed, JES moves each
of the "read pages" to the appropriate area of the actual checkpoint
storage. JES then releases the "old" data in the associated frames (or
storage slots on DASD) before "replacing" the data. JES then releases the
frames used as an I/O buffer. The above operation is performed in a loop,
a page at a time, so that the real frame requirement for the I/O never
exceeds by more than one the total number of pages read.
The write operations performed by the checkpoint processor are substantially
the reverse of the read operations. Each page of the I/O buffer that will
be used is fixed. For each page in the checkpoint data set that was
changed, the storage in the actual checkpoint area is moved to the fixed
page in the I/O buffer, and real addresses in the channel program are
adjusted. The changed pages are then written to DASD. The pages in the I/O
buffer that were fixed are released. It is important to note that entire
records are read and written regardless of the number of bytes of data
that were changed or modified within the records.
A rather complete discussion of direct access storage devices is found in a
book entitled "Introduction to IBM Direct Access Storage Devices" written
by Marilyn Bohl and published by Science Research Associates in 1981. This
book is incorporated herein by reference. Additional pertinent references
include U.S. Pat. Nos. 4,507,751 and 4,310,883, both assigned to IBM.
Faster central processors tend to require more JES services per second than
slower ones. This means that the number of checkpoint records updated per
second increases with processor speed. This, in turn, increases the number
of records that must be read by each member of the configuration before
useful work can begin. During the time that the data set is being read or
written, (except for `intermediate` WRITES), all updates are prevented.
This, in turn, inhibits almost all JES functions because most JES
processing requires the ability to update the checkpoint data set.
Some installations find that faster processors are constrained by the
inability to complete all of the requested JES functions during the time
that the data set is "owned" by a processor. This causes poor response
time and inability to fully utilize the processing power of the computer.
It is therefore an object of this invention to reduce I/O to a checkpoint
data set and to increase the amount of time that a processor can update
the checkpoint data set in order to speed up checkpoint I/O operations
thereby increasing the performance of a multi-access spool complex.
It is also an object of this invention to improve communication among
processors by speeding up the data set I/O operations.
A further object of the invention is to improve the reliability,
availability and serviceability characteristics of the data set.
An object of the invention is to reduce the dependency of the processor
executing the subsystem on I/O operations to the data set.
SUMMARY OF THE INVENTION
A method and apparatus for providing improved communication among
processors in a multi-plexed data area computer complex is disclosed and
claimed. The complex includes processors, main storage with program
address spaces, at least one non-volatile storage device connected to the
main storage, at least one subsystem in one of the program address spaces
and shared data stored in fields in an area in main storage for each
processor and program address space.
The improvement is directed to:
a to-be-read-from version of the shared data including a storage area,
which has records for containing changed data fields, formatted on a
storage device;
a to-be-written-to version of the shared data including another storage
area, which also has records for containing changed data fields, formatted
on a storage device; and
a program in one of the processors or address spaces that performs
input/output operations on the versions when one of the processors solely
controls and has access to the shared data. The versions then flip-flop
following the performance of the input/output operations.
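The flip-flop of the two versions may be pictured with the following simplified sketch, offered under stated assumptions rather than as the disclosed implementation. The names Version, DualDataSet and checkpoint_cycle are hypothetical. The journals are omitted, and a direct comparison of records stands in for whatever mechanism identifies changed records.

```python
class Version:
    """One version of the shared data on the storage device."""

    def __init__(self, records):
        self.records = dict(records)


class DualDataSet:
    """Pair of versions whose roles are exchanged after each cycle."""

    def __init__(self, records):
        self.read_from = Version(records)  # to-be-read-from version
        self.write_to = Version(records)   # to-be-written-to version

    def flip_flop(self):
        self.read_from, self.write_to = self.write_to, self.read_from


def checkpoint_cycle(dual, in_storage, updates):
    """One member's cycle while it solely controls the shared data."""
    # Input: overwrite only in-storage records that differ from the
    # to-be-read-from version.
    for key, value in dual.read_from.records.items():
        if in_storage.get(key) != value:
            in_storage[key] = value
    # The controlling member applies its own updates in storage.
    in_storage.update(updates)
    # Output: write only records that differ to the to-be-written-to version.
    for key, value in in_storage.items():
        if dual.write_to.records.get(key) != value:
            dual.write_to.records[key] = value
    # The version just written becomes the one the next member reads.
    dual.flip_flop()


dual = DualDataSet({"job1": "queued"})
member_a = {"job1": "queued"}
member_b = {"job1": "queued"}
checkpoint_cycle(dual, member_a, {"job1": "running"})
checkpoint_cycle(dual, member_b, {"job2": "queued"})
```

After the two cycles, member_b holds both updates, the to-be-read-from version matches it, and the to-be-written-to version is one update level behind.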
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1a is a schematic block diagram of the computer complex showing the
dual data set configuration;
FIG. 1b is a schematic block diagram of the dual data set configuration
showing the data set versions and each corresponding journal;
FIG. 2 shows the records in a journal in a data set version as well as
other records in the data set version on a direct access storage device;
FIG. 3 shows the events of checkpoint cycle operation involving the dual
data set configuration;
FIG. 4 shows the format of a data set to include a journal;
FIG. 5 shows the format of journal records;
FIG. 6 is a schematic diagram showing the records in a data set in storage;
FIGS. 7a-d show the progression of data set versions, update levels and
in-storage data set copies in a four-processor complex during the
operation of the checkpoint cycle;
FIG. 8 shows the events of the checkpoint cycle involving the dual data set
configuration as each of three processors gains control of and has access
to the data set versions;
FIG. 9 shows the update level changes to each data set version and to the
in-storage copy of the data set for one processor during a checkpoint
cycle operation;
FIGS. 10A-D show the events that occur on the direct access storage device
and in virtual storage during READ 1 processing, READ 2 processing,
primary WRITE processing and intermediate and final WRITE processing
during a checkpoint cycle operation;
FIG. 11 is a schematic block diagram of lists in the change area in virtual
storage having entries that are updated each time a corresponding control
block is updated by a processor in the complex;
FIG. 12a is a flow chart showing the READ 1 processing operation;
FIG. 12b is a flow chart showing the READ 2 processing operation;
FIG. 12c is a flow chart showing the primary, intermediate and final WRITE
processing operation;
FIG. 13 is a schematic block diagram of the control blocks, channel command
words and service routine provided for the subsystem in the virtual
storage area;
FIG. 14 is a schematic block diagram that shows the mapping of the storage
configuration of the logical areas of the data set by checkpoint
information tables; and
FIGS. 15A-D show how a control byte corresponding to a 4K page that
contains control blocks is changed as processor access to the data set
changes during a checkpoint cycle operation.
DESCRIPTION OF THE PREFERRED EMBODIMENT
In accordance with the invention, a journal is implemented in order to
reduce input/output (I/O) to a subsystem's shared data area (such as a job
entry subsystem (JES) checkpoint data set) on a non-volatile shared
storage device such as a direct access storage device (DASD), and to
increase the amount of time that is available for data update access to
the shared data in fields of main storage by a given member, i.e. a
processor, of a multi-plexed data area computer complex as shown in FIG.
1a. To simplify the discussion of the preferred embodiment, the shared
data area is a checkpoint data set, such as the one used by the job entry
subsystem to provide for communication among processors in a multi-access
spool computer complex, and the shared storage device is DASD.
Multi-plexed data area means that each member owns the checkpoint at
different times. The journal (or change log) is intended to alleviate the
overall complex checkpoint constraint by increasing the amount of time
that a member may "own" (or "hold" or "have control of and access to") the
shared data, i.e. may actually be permitted to update the data residing in
the checkpoint data set, relative to the time spent acquiring and
relinquishing "ownership" of it. Most useful work can be accomplished only
during periods that allow update access, i.e. when a member "owns" and
controls the checkpoint data set, excluding the periods of time for
acquiring and releasing data set "ownership". Also, in accordance with the
invention, a dual data set configuration is implemented (created) to
provide the necessary data integrity in case of a DASD or member failure
while the checkpoint data set is being updated with the data in the
journal. The dual data set configuration (10) is made up of two data sets
or two versions (12 and 14) of the same data set on DASD as shown in FIG.
1a. One is a to-be-written-to version and the other is a to-be-read-from
version; both are further discussed below. There are three coherent copies
of the data set, representing its three most recent update levels
(discussed below), that can be reconstructed from the versions and
associated journals. Two versions allow the writing of updated 4K records
to take place at the time of a primary WRITE operation, i.e. during
"useful time", rather than during a final WRITE (non-useful time). This
increases the useful time available for updates to the checkpoint data
set.
"Useful time", in the context of this invention, is the time during which
one of the members in the complex is permitted to update the checkpoint
data set, and thus to perform functions that require such updates.
"Non-useful time" is defined herein to be the period of time during which
no member in the complex "owns" the data set (all are "dormant") or a
member "owns" the data set but is in the process of performing initial
READ or final WRITE operations.
Also in accordance with the invention, the checkpoint data set is read and
written in a way that shortens the length of time required for a
checkpoint cycle to occur. Fewer bytes of data are read and written during
a checkpoint cycle. A checkpoint cycle is defined to begin at the time a
member acquires "ownership" and control of the data set. This includes the
time that the member takes to read the checkpoint and "overwrite" journal
updates in storage, as well as the time that the member takes to write the
new "update level" of the data and to relinquish "ownership" of the data
set. There is reduced I/O to the checkpoint data set during substantially
all update accesses by a given member of the complex. Also, the dual data
set configuration eliminates the need for a backup (or duplex) copy. Since
performance is a primary consideration with respect to checkpoint I/O, the
improvement disclosed and claimed herein increases performance
substantially over the prior art. For example, in the prior art, if a two
member complex had 1.8 million calls to the execute channel program
service routine per week, then approximately 16.5 hours of channel time
was used per week. Of these 16.5 hours, approximately 3.9 hours were
non-useful time due to unnecessary rotations of DASD. This non-useful time
has been substantially eliminated. There is no requirement that the
versions reside on separate DASD devices, although this is recommended for
increased performance.
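The prior-art figures quoted above can be checked with simple arithmetic.
The sketch below uses only the three numbers given in the text; the
variable names are illustrative, and the per-call values are derived
averages, not measurements reported in the source.

```python
# Figures quoted in the text for a two-member prior-art complex (per week).
calls_per_week = 1_800_000   # calls to the execute channel program routine
channel_hours = 16.5         # total channel time
non_useful_hours = 3.9       # time lost to unnecessary DASD rotations

# Share of channel time that was non-useful.
non_useful_fraction = non_useful_hours / channel_hours

# Average channel time per call, in milliseconds (derived, not measured).
ms_per_call = channel_hours * 3600 * 1000 / calls_per_week
non_useful_ms_per_call = non_useful_hours * 3600 * 1000 / calls_per_week

print(f"non-useful share: {non_useful_fraction:.1%}")
print(f"average channel time per call: {ms_per_call:.1f} ms")
print(f"of which non-useful: {non_useful_ms_per_call:.1f} ms")
```

On these figures, roughly 24 percent of the weekly channel time, or about
7.8 ms of an average 33 ms per call, was non-useful.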
Reduction in the checkpoint cycle time and I/O time is due, in large part,
to the new journal portion of each version of the data set shown in FIGS.
1a and 1b. The journal (16) of version 1 (12) and the journal (18) of
version 2 (14) are shown in FIG. 1b. As a result of implementing the
journal, entire records (each 4096 bytes in length and otherwise referred
to as 4K pages) will not be read or written during the checkpoint cycle.
Instead, only the changed elements of each record, e.g. a control block or
data field or data byte(s), will either be read from or written to the
journal portion of the data set on DASD. The journal will be composed of
journal records which, like the checkpoint records, are 4096 bytes in
length. These records will contain only changed control blocks (each with
an average size of less than 100 bytes) and identifying information. The
implementation of versions of the checkpoint data set with journal, and
the dual data set configuration, does not depend upon current or future
DASD technology or features beyond that required for the prior art. FIG.
1b shows conceptually the checkpoint cycle (20) to include a journal (16,
18) in each version (12, 14) of the data set in accordance with the
invention.
The main journal area (records which contain the journal data) is located
(formatted) on DASD track 1, immediately following the MASTER record, as
shown in FIG. 2, so that it can be read or written in a single I/O
operation thereby eliminating the rotational and head positioning delay
incurred by the "reads" and "writes" of scattered 4K pages. That is, by
placing the journal on the first (DASD) track, the DASD will rotate past
the journal's records (22a, 22b, 22c) on the first track in the process of
reading or writing records on that track. As shown in FIG. 2, additional
journal records (22d, 22e) may be placed on track 2 (and subsequent tracks
of the data set) if required. The journal substantially reduces the need
to read and write to the checkpoint records that are potentially scattered
across multiple tracks and, possibly, cylinders throughout the data set
during the non-useful time. As a result, the journal provides for a
greater proportion of useful checkpoint update time by a member in the
complex. Furthermore, the journal reduces the number of bytes of data that
must be read and written to DASD during non-useful time (over the prior
art). Therefore, a further reduction in I/O time will be realized even if
future DASD technology eliminates rotational delays.
FIGS. 2 and 4 show the checkpoint data set format (data records) in which a
journal begins at the end of the MASTER record on track 1 (and may
continue to other tracks). In FIGS. 2 and 4, a single block represents a
non-keyed data record, and a double block represents a keyed record. The
size of the journal is kept in the MASTER record (28), as is the "active"
size of the journal. The active size of the journal is given in bytes and
is divided by 4096 and rounded up to compute the number of journal records
actually in use. The MASTER record also contains a byte of data
information corresponding to each record in the checkpoint data set. These
bytes are called control bytes (CTLBs) and identify the records in the
checkpoint that have been changed or updated since a member relinquished
ownership of the data set. Also there are control bytes called CLCBs
residing in the subsystem virtual storage address space which identify the
changed journal records. The first track of the checkpoint data set
includes the CHECK record (30), the keyed LOCK record (32) and the MASTER
record (28) as well as all or part of the journal. As shown in FIG. 4 the
CHECK record (30) includes the data set name (dsn) and volume serial number
(volser) of a version. The LOCK record key (32a) contains an indication of
several new states ("me" "1v") associated with the configuration of the
checkpoint data set in dual mode of operation. The LOCK record key is used
as a means of preventing data set access in the case of a DASD hardware
lock (RESERVE) failure and as a means of determining whether or not to
read the entire contents of track 1. ("Ownership" of the authority to
change the checkpoint data is controlled using the RESERVE/RELEASE feature
of shared DASD. RESERVE/RELEASE also ensures the consistency of the
checkpoint data by preventing concurrent updates.)
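The computation of the number of journal records in use, described above,
can be sketched as follows. This is a minimal illustration only; the
function name is hypothetical, and the 4096-byte record size is the one
stated in the text.

```python
JOURNAL_RECORD_SIZE = 4096  # bytes per journal record, as stated in the text


def journal_records_in_use(active_size_bytes: int) -> int:
    """Return how many journal records the active journal size covers.

    The active size (in bytes) is divided by 4096 and rounded up.
    """
    if active_size_bytes < 0:
        raise ValueError("active size cannot be negative")
    # Integer ceiling division avoids floating-point rounding.
    return (active_size_bytes + JOURNAL_RECORD_SIZE - 1) // JOURNAL_RECORD_SIZE


if __name__ == "__main__":
    for size in (0, 1, 4096, 4097, 10000):
        print(size, journal_records_in_use(size))
```

An active size of 4097 bytes, for example, occupies two journal records,
so only those two records need to be transferred.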
New records (22a, 22b, 22c) which contain the journal data are formatted
onto track 1 following the MASTER record. FIGS. 2 and 4 show the records
that are formatted onto the first track of a checkpoint data set in the
order in which they are read. The format of journal records is shown in
FIG. 5. Journal records are composed of updated control blocks (34, 36,
38, 40) preceded by identifying information in an address list entry
(ALE). The control blocks can be variable in length and can span records
as indicated by updated control block 38 in FIG. 5. The content of the ALE
is identical to the content of the change log address list entry CALE in
the change area in storage (which is discussed below).
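The journal record layout described above, in which each updated control
block is preceded by an ALE and may span 4096-byte records, can be sketched
as follows. The ALE layout used here (a 4-byte target offset and a 2-byte
length) is an assumption made purely for illustration, as are the function
names; the actual ALE content is defined by the CALE discussed below.

```python
import struct

JOURNAL_RECORD_SIZE = 4096      # journal record length stated in the text
ALE = struct.Struct(">IH")      # ASSUMED layout: target offset, block length


def pack_journal(changes):
    """Pack (offset, block) pairs into fixed-size journal records.

    Returns (records, active_size). Blocks are laid end to end, so a block
    may start in one record and finish in the next.
    """
    stream = bytearray()
    for offset, block in changes:
        stream += ALE.pack(offset, len(block)) + block
    active_size = len(stream)
    records = [
        bytes(stream[i:i + JOURNAL_RECORD_SIZE]).ljust(JOURNAL_RECORD_SIZE, b"\x00")
        for i in range(0, active_size, JOURNAL_RECORD_SIZE)
    ]
    return records, active_size


def unpack_journal(records, active_size):
    """Recover the (offset, block) pairs from the journal records."""
    stream = b"".join(records)[:active_size]
    changes, pos = [], 0
    while pos < active_size:
        offset, length = ALE.unpack_from(stream, pos)
        pos += ALE.size
        changes.append((offset, stream[pos:pos + length]))
        pos += length
    return changes


if __name__ == "__main__":
    # The second block is large enough to cross a record boundary.
    demo = [(0, b"A" * 90), (8192, b"B" * 4050), (100, b"C" * 60)]
    recs, size = pack_journal(demo)
    print(len(recs), size, unpack_journal(recs, size) == demo)
```

Treating the journal as one contiguous byte stream is a design choice of
this sketch; it keeps the spanning case simple because record boundaries
matter only when the stream is cut into 4096-byte units for I/O.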
As indicated above, the checkpoint data set is divided into two versions,
CKPT1 and CKPT2, each having a corresponding journal (16, 18) as shown in
FIG. 1b. In this embodiment, each version and its corresponding journal
reside on a DASD volume or separate storage device. Each member of the
complex (shown in FIG. 1a) retains a copy of a "state", i.e. update level,
of the data set in the subsystem address space, in this case, the virtual
storage address space belonging to JES. The update level of the checkpoint
retained by any one member of the complex is based on the status of the
data in the checkpoint during the most recent member ownership period. The
integrity of the data set is maintained by either of the two versions
because it is always possible to reconstruct at least two coherent update
levels of the checkpoint from one of the versions and its corresponding
journal, and either one or two coherent update levels of the checkpoint
from the other version. Furthermore, if both DASD versions are damaged,
it is possible to dynamically allocate new versions and recover from the
member's copy of the in-storage data set. The data set is converted to a
more recent update level by "overwriting" the data in a version of the
data set with the updated data in the corresponding journal. In other
words, 4K pages in this version of the checkpoint data set are updated (to
an update level) when a member overwrites the journal's corresponding
checkpoint data set version with the journaled changes. This occurs
immediately following the reading of the data in the previous update level
of this version of the data set (on DASD) by a member in the complex. The
actual "overwriting" is done to the I/O buffer (shown in FIGS. 10A-10D)
whose changed records are written as the "new" update level of the data to
the version of the checkpoint data set on DASD during the primary WRITE
portion of the checkpoint cycle which is discussed below.
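The "overwriting" step described above can be sketched as follows. This is
a simplified model under stated assumptions: the I/O buffer is a flat byte
array holding the 4K pages of one version, and each journaled change is an
(offset, data) pair. The function name and the pair representation are
hypothetical and are not taken from the source.

```python
PAGE_SIZE = 4096  # size of a checkpoint record (4K page), per the text


def apply_journal(io_buffer: bytearray, journal_entries):
    """Overwrite the buffered pages with journaled changes.

    Returns the sorted indices of the 4K pages that were touched, i.e. the
    pages to be written during the primary WRITE.
    """
    changed_pages = set()
    for offset, data in journal_entries:
        end = offset + len(data)
        if offset < 0 or end > len(io_buffer):
            raise ValueError("journal entry falls outside the buffer")
        if not data:
            continue
        io_buffer[offset:end] = data  # same-length slice assignment
        # A change may straddle a page boundary, so mark every page it touches.
        changed_pages.update(range(offset // PAGE_SIZE, (end - 1) // PAGE_SIZE + 1))
    return sorted(changed_pages)


if __name__ == "__main__":
    buf = bytearray(3 * PAGE_SIZE)
    touched = apply_journal(buf, [(10, b"abc"), (4094, b"WXYZ")])
    print(touched)
```

Only the pages returned by the function would need to be written back, which
mirrors the point that changed records, not the whole data set, make up the
"new" update level.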
Further, in accordance with the invention, one version of the data set is
used on an equal basis as the other version, with the members
"flip-flopping" between the two versions (CKPT1 and CKPT2) on alternating
checkpoint cycles. Flip-flopping is controlled by using appropriate fields
in the data on track 1 of the versions as explained herein. Therefore,
this mode of operation is referred to as "dual mode" and the versions are
alternately referred to as the "to-be-read-from" and "to-be-written-to"
versions. That is, if a member uses CKPT1 as the to-be-written-to data set
and CKPT2 as the to-be-read-from data set, then the next member to "own"
the checkpoint data set will use CKPT1 as the to-be-read-from data set and
CKPT2 as the to-be-written-to data set. In essence, the one data set that
has the most recent updates will become the to-be-read-from data set. As
is shown in FIG. 4 a level number (LEVEL) in the data area of the CHECK
record (30) is used to indicate which data set version has the most recent
updates, i.e. which data set is to-be-read-from. The other data set
becomes the to-be-written-to version. The level number is incremented
during each primary WRITE operation. (Primary WRITEs are described below.)
The higher the number, the more recent the update. "Flip-flopping" between
data set versions eliminates the need for a backup (duplex) copy of the
data set.
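The role of the level number in "flip-flopping" can be sketched as follows.
The class and function names are hypothetical. Two details are assumptions
made for the illustration and are not stated in the text: ties are resolved
in favor of the first version, and the primary WRITE gives the written
version a level one higher than that of the version just read.

```python
from dataclasses import dataclass


@dataclass
class CheckRecord:
    name: str   # version name, e.g. "CKPT1"
    level: int  # LEVEL field from the CHECK record


def select_versions(a: CheckRecord, b: CheckRecord):
    """Return (to_be_read_from, to_be_written_to).

    The version with the higher level holds the most recent updates and is
    read; the other version is written. Tie-breaking is an assumption.
    """
    return (a, b) if a.level >= b.level else (b, a)


def primary_write(read_from: CheckRecord, write_to: CheckRecord) -> None:
    """Model the level increment during a primary WRITE (assumed rule)."""
    write_to.level = read_from.level + 1


if __name__ == "__main__":
    ckpt1, ckpt2 = CheckRecord("CKPT1", 5), CheckRecord("CKPT2", 4)
    for cycle in range(3):
        src, dst = select_versions(ckpt1, ckpt2)
        print(f"cycle {cycle}: read {src.name}, write {dst.name}")
        primary_write(src, dst)
```

Because each primary WRITE leaves the written version with the higher level,
the roles of the two versions swap on every checkpoint cycle.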
CHECKPOINT CYCLE OPERATION
(See FIG.