FIELD OF INVENTION
This invention relates to distributed file systems and more particularly to
a method and system for reconciling different versions of files, in which
the files are stored in computers at two or more separate locations or
sites.
BACKGROUND OF THE INVENTION
There is a problem, especially with the portability of computers and floppy
disks, that a given file, for instance in a lap top, may not reflect the
same information or data as the same file at a desktop or fixed work
station.
This is because work is frequently taken from location to location. As
frequently happens, a file created at a fixed work station at the office
may be modified at a remote location, such as one's home, by merely
transporting a disk or diskette containing the file and modifying it at
the remote location. Multiple versions of the same file can also exist in
distributed networks when files are modified or manipulated by multiple
users.
Problems thus arise when the versions of the file at two sites, such as
home and office, do not agree because they have not been identically
updated. This can occur by accident when one forgets to transport a floppy
disk from one location to the other; or when one forgets to load the disk
altogether.
It is of course desirable to have some synchronization between versions of
the same file when created or modified at two different sites. For
instance, it is possible to have the same version of a file at two sites
and only access one at a time. When, however, versions of a file are
created at two sites, it is important to be able to update or reconcile
the files at both sites so as to appropriately update both files, or only
one file.
In the past, systems have compared the times that a file was updated at
different sites, have automatically selected the most recent version, and
have copied this version into the appropriate file at both sites. Such
systems include Novell Netware, the Sun Microsystems Network File System
(NFS), and the Andrew File System. All of these systems have problems with
their automatic updating procedures.
It is also a feature of NFS, the Andrew File System, and Netware that they
automatically alter files immediately after they are modified. This
results in significant performance problems as new versions of files are
transmitted. Moreover all updates are distributed throughout the network,
exposing raw work product to all on the system. It can also be an
embarrassment because of the automation process, where those connected to
the distributed system immediately have knowledge of new unedited data and
changes.
It will of course be appreciated that when there are multiple users or
contributors to a single file, such as in writing software, or as in
editing documents, it is very important to alert all users of the same
file as to what others are doing so that at some point there is control in
each of the users as to what updating or reconciling of multiple versions
of the file will be permitted. It is particularly annoying for the writer
of software to have someone else edit his software without his knowledge.
Likewise, it is equally unfortunate for the word processing public to have
one user edit a work without giving adequate notice to the other user.
More specifically, an inadequate solution to the problem of multiple
versions of the file at different locations exists in distributed file
system technology as represented by the NFS, Andrew, Apple Share, Novell,
and Research Systems software such as Coda and Ficus. All of these systems
give the impression of being a single global file system. The advantages
of having a single global file system are automatic updating, sharing, and
familiar time sharing systems semantics. However, the problems with such
systems are that they fail or degrade when disconnected, are unpredictable
in performance, are unacceptable in that updates are at the system's
convenience and not at the user's, and that they require a modified
operating system, often requiring a single vendor.
Another inadequate solution to the problem of multiple revisions of a file
is found in the explicit file transfer technology associated with
diskette/tape, E-mail, Lap-Link and file transfer protocols. What these
systems attempt to do is copy files and carry or mail them. While the
advantages are complete user control, flexible transport, and conversion
between different systems, the disadvantages include complicated and
error-prone protocols, in which overwriting of useful data can occur
accidentally and in which there are no "merges" of different versions.
In all these systems, the most recent version of the file in one computer
is automatically copied to the other. Thus, current programs seek to
establish which file is correct by date and time, a technique called "time
stamping". However, these types of systems are far from failsafe. For
instance, assuming one wishes to delete a file on a lap top, deleting the
file at the lap top may not result in deleting the file at the fixed work
station, but rather in restoration of the obsolete file found at the work
station. Thus automatic reconciling systems are error-prone.
More generally, if some work is to be accomplished on a file in more than
one place, then it is possible that neither version supersedes the other.
Time stamp based reconciliation may thus overwrite relevant information.
As a result, a user's work embodied in the older
version may be lost without any warning. It is also possible that this
will only happen when one forgets to hook up the computers for the
reconciliation between versions of the file.
What is important is to know when a file has been edited in two places,
what has been done, whether or not to authorize a merge of the two
versions, and on what basis. It is therefore important to devise a system
by which a merge is done in a safe way. It is also important to provide a
system in which conflicts are recognized, with the conflict not
necessarily being resolved automatically, but rather at the option of an
individual operator who has been alerted to the fact of a conflict.
Note that one prior art way of determining a conflict is the so-called
"journaling" technique which is to keep a record of what has transpired at
one central location. Using a single centralized computer, a forward log
or journal type of reconciliation may be accomplished.
SUMMARY OF THE INVENTION
However, rather than keeping a centralized journal, it is a feature of the
present invention that each computer or system keep its own journal. The
journal, which is a history of file versions, indicates the file which is
edited and its date/time stamp. Optionally the journal may also keep a
detail of the type of editing that was involved should a conflict be
determined.
For reconciliation, if the files are the same and the journals agree, there
is no conflict.
On the other hand, when one works on one computer but not the other, and
the resulting files are subsequently to be merged together, the Subject
System first compares the two journals to see if one has more journal
entries than the other. Note that the comparison may be facilitated by
combining the journals in a merge operation. Once having determined that
there are differences in
some of the journal entries, then the system automatically copies those
files for which the journal indicates no conflict, and alerts the user so
that actions can be taken to resolve any conflict found.
Different versions of the same file are thus reconciled by each computer
maintaining its own journal and by the comparison of the two journals at
times specified by the user, with the reconciling system automatically
updating file revisions when appropriate, or providing the user with an
indication that such automatic updating is inappropriate.
Specifically, the Subject System can be configured to either delete a file
which has been determined to be the non-desired file, or to copy the most
recent file, in a replacing operation, into the computer which does not
have the most recent file. At some point the journals of both of the
computers will be in synchronization. Thereafter if no journals change,
there need be no indication made to the user that a conflict exists. If
one of the journals changes at only one site, then it is possible to
simply instruct the machine at the other site on command to do the same
actions. However, if both journals are changed, it is very important to
alert the user that a conflict cannot be resolved.
Note that, in the Subject Invention, reconciling not only includes the
concept of copying or deleting; one can also increase the level of detail
of the individual entries in the logs to alert the user that a
simple merge/purge performed on a time stamp basis will not work. For
example, if the user is warned, the user may run a program called DIFF
which highlights the differences between the two files. At that point, the
user may decide which of the two files he prefers or which changes should
be made in what file.
Thus, in the Subject Invention, in a distributed file system, instead of
giving the user the impression that there is only one set of files, the
system provides the user with the impression that there are different versions
of a file which must be occasionally reconciled, although only at the
convenience of the user. The Subject System solves the problem of multiple
versions of the same file by reconciling on demand. Each computer has a
local version of the same data, reconciled by comparing journals of local
changes, with user intervention being called into play if conflicting
changes are discovered.
Applications for the subject reconciling system include file cataloging and
reconciliation, office applications and database management systems.
Hardware can involve organizers, palm tops, pen based tablets and
notebooks.
Further applications of the Subject System include merging records within
files. Moreover, it is possible to batch update by exchanging journals.
It is an important feature of the subject invention that the reconciliation
may be invoked under user or application control, either at the beginning
or end of a working session or overnight, for instance.
In summary, in a distributed file environment, a system for safely updating
a file without risk of losing work performed at one site due to work
performed on the file at another site includes maintaining a journal or
log at each site which is updated after a file is modified. This journal
is compared with the logs from other sites before a file is used at any
one site, so that new versions can be propagated automatically and safely
to out-of-date sites, with the user immediately alerted if conflicting
versions of the file exist at different sites. Different versions of the
same file are thus reconciled by each computer maintaining its own journal
and by the comparison of the two journals at times specified by the user,
with the reconciling system automatically updating file revisions when
appropriate, or providing the user with an indication that such automatic
updating is inappropriate. The reconciliation can be applied to
collections of files, automatically updating only those files for which it
is safe and necessary to do so. Since reconciliation occurs at times
selected by the user, inconsistent or partially completed versions of
files need not be propagated to other sites. Additionally, logs may be
built incrementally by occasionally observing the state of the systems in
terms of the files and their time stamps and creating additional log
entries reflecting appearance, disappearance and changes of files.
Furthermore, logs may be purged of obsolete entries by including
additional log entries indicating the most recent time each site has
participated in a reconciliation and deleting obsolete entries that all
sites have seen.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other features of the Subject Invention will be better understood
from the detailed description taken in conjunction with the drawings, of
which:
FIG. 1 is a diagrammatic representation of the transfer of versions of a
file from an office computer to a remote location, for instance, in the
home;
FIG. 2 is a diagrammatic representation of the problem of generating two
different versions of the same file through the modification of the file
at two different locations;
FIG. 3 is a diagrammatic representation of a method for reconciling
versions of a file including the creation of logs at various sites and the
comparison of the logs prior to permitting either automatic updating or
manual updating when a comparison between the logs indicates a
discrepancy;
FIG. 4 is a block diagram illustrating the generation of journals at two
sites, forming a combined journal, detecting a conflict, and providing
actions based on conflict resolution; and
FIG. 5 is a series of diagrams indicating journals and situations where no
update is needed, an update is needed, a delete is needed, or a conflict
is to be indicated.
DETAILED DESCRIPTION
Referring now to FIG. 1, in a typical operational setting, Version #1 of a
file may be created at a fixed terminal 10 at an office 12 in which the
version created at terminal 10 is stored in a storage device 14. The
information in this file may be transferred as via a floppy diskette 16 to
a computer 18 at a home site 19, with the file being modifiable so as to
produce a Version #2 which is stored in a storage device 20.
Referring now to FIG. 2, it will be seen how it is possible to modify files
at two different sites or locations such that work performed in the file
may not necessarily be on the most recently updated file. As can be seen,
Work 1 in storage device 22 contains the Version #1 which is copied onto a
diskette 24 that is then copied into storage device 26 at a remote site or
location. The information in storage device 26 may be modified so as to
produce a Version #2 at 26 which is then copied, for instance, onto a
diskette 28 that is copied into storage device 22 at the first site as
Version #2. This version may be further modified and placed in storage as
Version #3. Thereafter this version may be copied onto diskette 30 which
is intended to be downloaded to storage device 26. However in the process
either the disk is lost or not downloaded, at which point the modification
as illustrated at V_x is copied onto a diskette 32. It will be
appreciated that there is now a problem in that Version #3 is different
than Version V_x, which was created by modifying Version V_2 as
opposed to Version V_3. This creates an error which is difficult to
rectify and may go unnoticed.
The problem, of course, is that these are two versions of the same file.
The first version will be Version Three and the second version will be
Version X. Merely updating one of the computers with one version or the
other will not solve the problem of reconciliation, because Version X does
not have the updates of Version Three. Thus it is impossible to
automatically update either of the versions at either of the different
locations; and it is for this reason that time stamp based systems for
reconciliation fail.
Thus, in terms of a typical scenario, considering the case of writing a
book using personal computers at office and home, carrying files back and
forth on a diskette, the normal procedure is to copy all the working files
from the diskette to the computer about to be used, edit one or more
chapters, and copy the edited files back to the diskette when done.
As a result one has three different copies of the files, one stored in the
office computer, one at home, and one on the diskette. Even though there
are really three copies, one thinks of them as being different versions of
the same files.
If one forgets to copy the files edited at the office, one can then go home
with an out-of-date diskette. The diskette carried home is then copied to
the home computer and editing continues, not noticing that one is starting
with stale information. The next day the updated files are copied back to
the office computer, losing the previous work.
There are some things one can do to help protect against this common error.
For example, some file copying programs have an option to check dates and
refuse to replace a newer version of a file with an older one. This helps
considerably, but is not perfect. It does not detect the error described
above, for example, since the versions of the files edited at home in the
evening do have a later date than the versions edited yesterday at work.
It also fails to handle the case of deleting obsolete files.
Referring now to FIG. 3, the subject file reconciling system solves the
above problems by embedding a program called RECONCILE in a system which
detects conflicting updates, so one can use it to update files safely. The
system will replace a file with a later version only if it is sure that
the later version was derived from the one being replaced. If the file to
be replaced is not an earlier version, the system will report an error so
that one can resolve the conflict.
More specifically, assuming that a Version #1 of a file is stored at a
storage device 40, the Subject System creates a log, Log_1, here
illustrated at 42. When Version V_1 is copied onto a diskette 44,
Log_1 also appears on the diskette. This diskette may then be loaded
into a storage device 46 at a remote location, where Version #1
may be modified to produce Version #2, which is again stored at
storage device 46. Concomitant with the modification of Version #1 to
Version #2, a further log is created, Log_2, as illustrated at 48. When
this file is to be transferred to the work location, it is downloaded to a
diskette 50 which contains not only Version V_2 but also
Log_1 + Log_2, as illustrated. This is downloaded back to storage
device 40 at the original work location, where it may be modified as
illustrated by Version V_3, again stored at storage device 40. Upon
modification of the V_2 version, the system creates an additional log,
Log_3, as illustrated at 52.
Again, when this version of the file is to be transferred to the remote
location, it is downloaded to a diskette 54 which then contains not only
Version V_3, but also Log_1 + Log_2 + Log_3. This diskette,
however, in the example given is not downloaded to storage device 46.
Rather, as inadvertently sometimes happens, V_2 is modified to produce
Version V_x. At the same time that V_x is formed, a log,
Log_x, is created as illustrated at 56. Version V_x may ultimately
be transferred to a diskette 58. This diskette will have Version V_x
downloaded to it plus Log_1 + Log_2 + Log_x. If diskette 58 is
then to be loaded back into the storage device 40, upon accessing of this
file a unit 60 compares the logs previously generated at the work site,
with the logs associated with diskette 58 which has been loaded at the
worksite. The result of the comparison step is either to alert the
operator at 62 to a difference in the logs for this file which will not
permit automatic updating, or permit automatic updating as indicated by
merge 64.
This being the case, a system is provided through the comparison of logs to
either permit automatic updating or to alert the user that automatic
updating is inappropriate.
In the scenario of FIG. 3, neither the office nor the evening versions of
the files were derived from the other, so RECONCILE will prevent the
system from overwriting them. Note the versions were both derived from the
same earlier version, but not from each other.
The system knows when one version of a file was derived from another by
keeping a history of past versions of files. If one history indicates that
a file has gone through Versions #1, 2 and 3 while the other has only
Version #1 and 2, it is safe to copy Version #3. But if one history shows
Versions #1, 2 and 3 while the other shows Versions #1, 2 and 4, there is
a conflict since neither Version #3 nor Version #4 was derived from the
other.
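By way of illustration only, the derivation test just described may be sketched as a prefix comparison of two version histories. The function name compare_histories and its return labels are hypothetical and are not taken from RECONCILE itself; the sketch assumes each history is simply an ordered list of version identifiers.

```python
def compare_histories(a, b):
    """Classify two version histories of the same file.

    Each history is an ordered list of version identifiers.
    Returns "same", "a_newer", "b_newer", or "conflict".
    (Hypothetical helper, for illustration only.)
    """
    if a == b:
        return "same"
    shorter = min(len(a), len(b))
    # If the common-length prefixes differ, neither history extends the other.
    if a[:shorter] != b[:shorter]:
        return "conflict"
    # Otherwise the longer history was derived from the shorter one.
    return "a_newer" if len(a) > len(b) else "b_newer"


print(compare_histories([1, 2, 3], [1, 2]))     # safe to copy Version #3
print(compare_histories([1, 2, 3], [1, 2, 4]))  # neither derived from the other
```

A history that is a strict prefix of the other identifies the out-of-date site; any other difference is treated as a conflict to be reported rather than resolved automatically.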
More specifically, FIG. 4 shows how two journals are reconciled. Starting
with the two separate journals, for Sites X and Y, here illustrated at 66
and 68 respectively, each journal or log contains entries describing the
history of five files, named A, B, C, D and E. In addition to the file
name, the journal entries indicate the action which was taken, either
Create, Update or Delete, and the time at which that action was taken, at
that particular site. For example, the journal of Site X shows that file E
was created at 10:55 and deleted at 10:56. Note, only times are shown for
convenience, since the log typically indicates both time and date.
Note that the journals are ordered by file name: A, B, C, D and E; and by
timestamp for the same file name. Journals are combined by merging them
according to this rule: Identical entries (including the action taken) are
combined into a single entry during the merge. The combined journal as
illustrated at 70 also records which sites had each entry. This could be
X, Y, or both X and Y. For example, the combined journal shows that E was
created at 10:55 known to both X and Y, and deleted at 10:56, known only
to Site X.
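The merge rule above may be sketched as follows. This is a minimal illustration under stated assumptions, not the code of Appendix A: a journal entry is assumed to be a (name, time, action) tuple, times are assumed to be zero-padded "HH:MM" strings so that they sort correctly, and the name merge_journals is hypothetical. Only the entries for file E are taken from the example described above.

```python
import heapq
from itertools import groupby


def merge_journals(journals):
    """Combine per-site journals, recording which sites hold each entry.

    journals maps a site label to a list of (name, time, action) tuples.
    Returns a list of (entry, set_of_sites), ordered by name and then time.
    """
    # Tag every entry with its site; each stream is sorted by name, then time.
    streams = [
        [(entry, site) for entry in sorted(entries)]
        for site, entries in journals.items()
    ]
    # Merge the sorted streams; identical entries end up adjacent.
    merged = heapq.merge(*streams)
    # Collapse identical entries into one, keeping the set of sites.
    return [
        (entry, {site for _, site in group})
        for entry, group in groupby(merged, key=lambda pair: pair[0])
    ]


site_x = [("E", "10:55", "create"), ("E", "10:56", "delete")]
site_y = [("E", "10:55", "create")]
combined = merge_journals({"X": site_x, "Y": site_y})
for entry, sites in combined:
    print(entry, sorted(sites))
```

In this sketch the creation of E is attributed to both sites, while the deletion is attributed to Site X alone, which is the information the reconciliation step needs.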
The goal of reconciliation is to bring the individual journals up to date
by performing missing actions. Thus, in FIG. 4, the missing creation of
file B at Site Y can be fixed by copying file B from Site X to Site Y. The
missing update of C at X can similarly be fixed by copying C from Site Y.
In the case of file E, the missing action is a deletion, which can be
corrected by deleting the copy of file E at Site Y. As these
reconciliation actions are taken, the missing journal entries are filled
in and the individual journals updated.
There is a conflict in the case of file D. Both sites agree that the file
was originally created at 10:33, but they show independent updates
occurring at different times, and neither site knows about the other
site's update. The automatic reconciliation procedure reports this
conflict rather than replacing either version, leaving it up to the user
to perform whatever correction or merge of the individual files is
necessary.
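The choice among copying, deleting and reporting a conflict may be sketched as below. The sketch assumes two sites labeled X and Y, journal entries of the form (name, time, action), and hypothetical function names; apart from the 10:33, 10:55 and 10:56 entries mentioned above, the times are invented purely for illustration and are not those of FIG. 4.

```python
from collections import defaultdict


def missing_action(src, dst, last_action):
    """Describe how to bring site dst up to date from site src."""
    if last_action == "delete":
        return f"delete at {dst}"
    return f"copy {src} -> {dst}"


def plan_actions(journal_x, journal_y):
    """Return {file name: action} for two site journals.

    Each journal is a list of (name, time, action) tuples.
    """
    histories = defaultdict(lambda: ([], []))
    for name, time, action in sorted(journal_x):
        histories[name][0].append((time, action))
    for name, time, action in sorted(journal_y):
        histories[name][1].append((time, action))

    plan = {}
    for name, (hx, hy) in sorted(histories.items()):
        if hx == hy:
            plan[name] = "nothing to do"
        elif hx[:len(hy)] == hy:        # Y's history is a prefix of X's
            plan[name] = missing_action("X", "Y", hx[-1][1])
        elif hy[:len(hx)] == hx:        # X's history is a prefix of Y's
            plan[name] = missing_action("Y", "X", hy[-1][1])
        else:                           # independent changes at both sites
            plan[name] = "conflict: report to user"
    return plan


journal_x = [
    ("A", "09:00", "create"),
    ("B", "09:10", "create"),
    ("C", "09:20", "create"),
    ("D", "10:33", "create"), ("D", "10:40", "update"),
    ("E", "10:55", "create"), ("E", "10:56", "delete"),
]
journal_y = [
    ("A", "09:00", "create"),
    ("C", "09:20", "create"), ("C", "09:45", "update"),
    ("D", "10:33", "create"), ("D", "10:50", "update"),
    ("E", "10:55", "create"),
]
for name, action in plan_actions(journal_x, journal_y).items():
    print(name, action)
```

With this illustrative data the plan is to copy B to Site Y, copy C to Site X, delete E at Site Y, and report D as a conflict, which mirrors the outcomes described for FIG. 4.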
By way of further description, the following definitions are useful in
understanding the Subject Invention.
For purposes of this invention, a file is a body of closely related
information stored in a computer. Typical examples of files would be
documents edited with a word processor, or spreadsheets, or messages. Each
individual memorandum, letter, or book chapter is kept in its own file. In
addition to its contents, a file has a name and a timestamp. The name
identifies the file in general and the timestamp indicates when the file
was created or changed. As time passes, a file with the same name will
have different versions, which can be distinguished by their different
timestamps.
A directory is defined as a collection of files. Usually the files in a
directory have some loose relationship, for example that they are all part
of some larger body of information like the chapters in a book; or that
they were created by the same person, relate to the same topic, or are
owned by the same organization. Directories also have names, and may also
have timestamps, although directory timestamps are not very useful.
Most computer systems allow files and directories to be arranged in a
hierarchy or tree, which means that directories can contain
subdirectories. An advantage of subdirectories is that more closely
related files can be grouped together. To find a file one works one's way
into the successive subdirectories until the desired file is reached.
A working session is defined as a period of work on a single computer.
During the course of a working session, files may be in an incomplete or
inconsistent state. One ordinarily doesn't want to make a permanent record
of these files or to send copies elsewhere. Usually one tries to finish a
day's work by cleaning up the inconsistencies before ending the working
session, although occasionally a session may last several days. Note, a
session can be anything one chooses. It is, however, important to note
that one doesn't use RECONCILE to copy files during a working session, but
only at the beginning and/or end.
A site is a specific storage location for a directory hierarchy. As
understood herein, several sites are considered as all containing versions
of the same directory hierarchy. These versions may be the same or
different. The basic purpose of the Subject System is to combine
hierarchies at different sites, making them all the same by safely
updating versions of individual files.
One should not think of a site as being the total disk storage on any one
computer. Usually a site would contain a number of unrelated hierarchies
defined according to the user's convenience. A personal computer, for
example, might contain separate hierarchies for system software, installed
applications, and one or more individuals' working files. While actual
systems often glue these into a single super-hierarchy, it is easier to
think of them as being separate.
A site may also be nothing more than a diskette. In fact, the way one
copies files to and from the diskette in the above scenario is to
reconcile the diskette version with the computer at home or office that
one is copying to or from. At the beginning of a working session, the
Subject System will detect newer files on the diskette and copy them to
the computer. At the end of the session, the Subject System will detect
newer files on the computer and copy them to the diskette.
By way of definition, a journal is a history of file versions. To do its
work, the Subject System creates a journal for each site, merges them to
look for missing versions, and either updates by copying more recent
non-conflicting versions, or else reports errors if there are conflicts.
As with database journals, the journals used by the Subject System contain
not only names and timestamps but also actions. For this system these are
very simple: either "update" or "delete", which can be inferred from the
fact that a previously-present file has disappeared.
Including deletion operations in journals means that reconcile can safely
propagate deletions to other sites, again checking for conflicts.
There are actually two kinds of journals: internal and external. An
internal journal is stored as a special file within the directory it
describes. In each hierarchy, each directory has its own internal journal.
An external journal contains the same information, but has been extracted
into a separate file, and stored somewhere else. Although this system can
use both kinds, it can only update to or from internal journals. External
journals may be used as sources of information about necessary updates,
but the actual files and directories involved are not directly accessible.
One implementation of the Subject System is described in the version of
RECONCILE attached hereto as Appendix A.
The simplest and standard way to use RECONCILE is to apply it to several
directly accessible sites such as mounted disks or diskettes. For example
the command
reconcile . a:\
would reconcile the current working directory (named "." in most systems)
with the diskette in drive A. The order of the two parameters doesn't
matter. In this scenario, RECONCILE would be run when one begins using
either the office or the home computer, and again at the end. So long as
one never forgets to do this, all updating is automatic. One can even
delete obsolete files without having them "come back" at the other
computer.
Suppose one does happen to forget to reconcile at the beginning or end of a
session, and one then updates some file. The next time one reconciles with
the two conflicting versions of the file, one will obtain the error
message:
reconcile: Conflicting Versions, ./oops and a:oops
At this point the two users should consult their memories of what the
conflicting updates were, or use a tool such as diff to find and display
the differences between the two versions. One now edits one or the other
to merge changes, if necessary. Finally, the resulting good version of the
file is copied to the other site, replacing the bad version there, usually
with a copy program which copies the timestamp as well as the contents. This
will leave a record of the conflict in the journals, but since there is a
more recent, non-conflicting version at both sites, RECONCILE will not
indicate any conflict.
In one embodiment, the Subject System builds journals by comparing the
actual directories with the previous versions of its own journals each
time it is run. This means that it makes sense to run RECONCILE even for a
single site:
reconcile .
This updates the internal journal of the current working directory. If one
makes several successive versions of a file, RECONCILE will only see the
last one since the last time it was run. This can actually be an advantage
since the other versions are of no particular significance as long as they
are not transmitted to any other site.
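A journal may be extended in this way by a routine along the following lines. This is a sketch under assumptions rather than the method of Appendix A: the directory listing is represented as a plain dictionary of name to timestamp (which in practice might be gathered with a call such as os.scandir), only the "update" and "delete" actions are recorded, and the name extend_journal and its parameters are hypothetical.

```python
def extend_journal(journal, current, now):
    """Append entries for files that changed or vanished since the last run.

    journal: list of (name, time, action) tuples from earlier runs,
             in the order they were recorded
    current: {name: timestamp} as observed in the directory now
    now:     time to record for deletions, since a vanished file
             leaves no timestamp behind
    """
    last = {}
    for name, time, action in journal:
        last[name] = (time, action)     # later entries overwrite earlier ones

    new_entries = []
    # A file whose timestamp differs from its last journal entry is an update.
    for name, stamp in sorted(current.items()):
        if last.get(name) != (stamp, "update"):
            new_entries.append((name, stamp, "update"))
    # A file the journal believes present, but which is gone, is a delete.
    for name, (time, action) in sorted(last.items()):
        if action != "delete" and name not in current:
            new_entries.append((name, now, "delete"))
    return journal + new_entries


old = [("a", 1, "update"), ("b", 1, "update")]
new = extend_journal(old, {"a": 2, "c": 3}, now=5)
print(new[len(old):])
```

Because only the current state of the directory is observed, several successive versions made between runs yield a single new entry, and running the routine again on an unchanged directory adds nothing.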
One can choose how often one wants to run RECONCILE. Even if one forgets to
run RECONCILE at the end of a working session, one will not lose anything
permanently. The cost of forgetting a reconciliation will be an increased
probability of conflicting updates, needing manual intervention at a later
time.
Other applications for the Subject System are as follows:
Supposing the joint writing of a research paper with a colleague, one
stores the various sections of the paper in a directory to which each has
access. Ordinarily both users communicate directly to avoid conflicting
updates, but sometimes one of the users forgets. This is handled with
RECONCILE. Each user makes a private copy of the entire directory.
Assuming the directories are named ~tom/paper, ~dick/paper, and
~public/paper, and that Tom is the user in question, before beginning a
working session, Tom performs the command
reconcile ~tom/paper ~public/paper
At this point there may be conflicts. If there are, Tom may need to give
Dick a call to resolve them. Having done so, Tom is sure that his working
version of the paper is in agreement with the shared version. During the
course of the work various sections might be temporarily wrong, or
inconsistent with each other, but since this is just a working copy and
not the public version, Tom is not concerned. Eventually Tom will be happy
with the final version having proofread it, and checks it back in with
exactly the same command as above.
COMMAND SYNTAX
The syntax of the "RECONCILE" command is
reconcile [options] [[-mode] (directory | file)] ...
If no directories or files are provided, the current working directory
(".") is used.
______________________________________
directory  names a directory containing an internal journal
           and files
file       names an external journal describing some
           remote site
______________________________________
("--" refers to an external journal on standard input or output)
mode is one or more of the following letters:
______________________________________
r read the journal but don't write it
w write the journal but don't read it
o do not update files, only the journal
______________________________________
Option parameters are: