WikiPatents - Community Patent Review
Create Free Account  |  License or Sell Your Patent  |  WikiPatents Marketplace  |  WikiPatents Blog
Username:  Password:  
    
Advanced Search
System and method for reducing storage requirement in backup subsystems utilizing segmented compression and differencing    
United States Patent5574906   
Link to this pagehttp://www.wikipatents.com/5574906.html
Inventor(s)Morris; Robert J. T. (Los Gatos, CA)
AbstractIn a client/server environment, a method and means for reducing the storage requirement in the backup subsystem and further reducing the load on the transmission bandwidth where base files are maintained on the server in a segmented compressed format. When a file is modified on the client, the file is transmitted to the server and compared with the segmented compressed base version of the file utilizing a differencing function but without decompressing the entire base file. A delta file which is the difference between the compressed base file and the modified version of the file is created and stored on a storage medium which is part of the backup subsystem. Alternatively, a copy of frequently accessed base files are maintained on the client in a compressed format. Whenever the client detects that a frequently accessed file has been modified, the modified version of the file is differenced against the base version of that file without decompressing the entire base file and a delta file is generated. The delta file is then transmitted to the server to be stored at the server for storage medium to be utilized either immediately or at a later time to update the base version of the modified file on the server.
   














 Title Information Submit all comments and votes
 
Patent Text Patent PDF Print Page Summary File History
Plain text PDF images Print Summary File History
Drawing from US Patent 5574906
System and method for reducing storage requirement in backup subsystems

     utilizing segmented compression and differencing - US Patent 5574906 Drawing
System and method for reducing storage requirement in backup subsystems utilizing segmented compression and differencing
Inventor     Morris; Robert J. T. (Los Gatos, CA)
Owner/Assignee     International Business Machines Corporation (Armonk, NY)
Patent assignment
All assignments
Publication Date     November 12, 1996
Application Number     08/328,204
PAIR File History     Application Data   Transaction History
Image File Wrapper   Patent Term   Fees
Litigation
Filing Date     October 24, 1994
US Classification     707/1 707/203
Int'l Classification     G06F 017/30
Examiner     Black; Thomas G.
Assistant Examiner     Wang; Peter Y.
Attorney/Law Firm     Saber; Paik
Address
Parent Case    
Priority Data    
USPTO Field of Search     395/600 395/200
Patent Tags     reducing storage requirement backup subsystems utilizing segmented compression differencing
   
Enter a comma (,) or semicolon (;) between multiple tag words/phrases.
Describe this patent:
 Amusing   
 Clever   
 Complex   
 Efficient   
 Historic   
 Important   
 Innovative   
 Interesting   
 Practical   
 Simple   
[no votes]
Patent WIKI

Share information and news about this patent, including information and news about the technology, inventors, company, ligation and licensing.

 References Submit all comments and votes
 
*references marked with an asterisk below are user-added references
 U.S. References
 
Add a new US reference:  
ReferenceRelevancyCommentsReferenceRelevancyComments
5347653
Flynn
707/203
Sep,1994

[0 after 0 votes]
5278979
Foster
707/203
Jan,1994

[0 after 0 votes]
5276860
Fortier
714/6
Jan,1994

[0 after 0 votes]
5263154
Eastridge
714/6
Nov,1993

[0 after 0 votes]
5133065
Cheffetz
714/2
Jul,1992

[0 after 0 votes]
5089958
Horton

Feb,1992

[0 after 0 votes]
4912637
Sheedy
707/203
Mar,1990

[0 after 0 votes]
4809170
Leblang
717/122
Feb,1989

[0 after 0 votes]
4686620
Ng
707/10
Aug,1987

[0 after 0 votes]
4646229
Boyle
707/203
Feb,1987

[0 after 0 votes]
 Foreign References
 Other References
 Market Review Submit all comments and votes
   
Market Size
Estimate the gross annual revenues of the relevant market sector:
> $10B
$5B - $10B
$2B - $5B
$500M - $2B
$100M - $500M
$10M - $100M
$1M - $10M
$500K - $1M
$100K - $500K
< $100K
[No votes]
$0
 
$0   $2.5B   $5B   $7.5B   $10B
Market Share
Estimate the percentage of the relevant market sector this invention will capture:
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Reasonable Royalty
What percentage of gross sales should the inventor or assignee be paid?
75% - 100%
50% - 74.99%
25% - 49.99%
10 - 24.99%
5 - 9.99%
2 - 4.99%
1 - 1.99%
< 1%
[No votes]
0.0%
 
0%   25%   50%   75%   100%
Public's "Guesstimation" of Royalty Value
Market SizeN/A[No votes]
xMarket ShareN/A[No votes]
xReasonable RoyaltyN/A[No votes]

N/A

License Availablity
If you are NOT the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
License Availablity
If you ARE the owner or assignee, answer here:
Yes, license is available for purchase

No, license is not currently available



[No votes]
Competitive Advantage
Does this invention have a significant competitive advantage over similar technologies?
Yes

No



[No votes]
Most helpful competitive advantage comment
[No comments]

Commercial Alternatives
Are there viable commercial alternatives for this invention?
Yes

No



[No votes]
Most helpful commercial alternative comment
[No comments]

 Technical Review Submit all comments and votes
 Claims Submit all comments and votes
 


We claim:

1. A method for storing information in a clinet-server environment having a client and a backup subsystem, the backup subsystem comprising a backup server, a server backup program, a storage medium, and a copy of a versioned sequence of a file stored in the storage medium, said versioned sequence comprising a base version of a file in multiple compressed segments and a predetermined number of delta files, the client having a client backup program and a storage medium, the client and the backup server are connected to each other by a communication link, comprising the steps of:

establishing a backup session between the client and the backup server;

detecting, using the client backup program, a changed version of said file at the client;

transmitting a changed version of said file in an uncompressed format, using the communication link, from the client to the backup subsystem; and

differencing using the backup subsystem, the changed version of said file and the multiple compressed segments of the base version of said file, said differencing further including the steps of:

compressing the changed version of said file at the backup server one line at a time;

comparing each compressed line of the changed version of said file with the corresponding line of the base version of said file one line at a time;

determining whether a difference between the compressed line of the changed version of said file and the compressed line of the base version of said file has been detected;

decompressing the segment of the base version of said file, known as the current segment, if a difference has been detected;

comparing the uncompressed current segment of the base version of said file and the corresponding segment of the changed version of said file where the difference has been detected;

creating a delta file which includes the differences detected;

determining whether the differencing procedure is back in synch at the end of the current segment of the base version of said file; and

storing the changed version of said file in the compressed segmented format.

2. A method for storing information in a client-server environment having a client and a backup subsystem, the backup subsystem comprising a backup server, a server backup program, a storage medium, and a copy of a versioned sequence of a file stored in the storage medium, said versioned sequence comprising a base version of a file in multiple compressed segments and a predetermined number of delta files, the client and the server are in communication with each other by a communication link, comprising the steps of:

establishing a backup session between the client and the backup subsystem;

detecting a changed version of said file at the client;

differencing, using the client, the changed version of said file and the multiple compressed segments of the base version of said file at the client to create a delta file, said differencing further including the steps of:

compressing the changed version of said file at the client one line at a time;

comparing each compressed line of the changed version of said file with the corresponding line of the base version of said file one line at a time;

determining whether a difference between the compressed line of the changed version of said file and the compressed line of the base version of said file has been detected;

decompressing the segment of the base version of said file, known as the current segment, if a difference has been detected;

comparing the uncompressed current segment of the base version of said file and the corresponding segment of the changed version of said file where the difference has been detected;

creating a delta file which includes the differences detected;

determining whether the differencing procedure is back in synch at the end of the current segment of the base version of said file; and

storing the changed version of said file in the compressed segmented format;

transmitting said delta file, using the communication link, to the backup subsystem: and

storing said delta file in the backup subsystem.

3. In a client-server environment having a client and a backup subsystem, the backup subsystem comprising a backup server and a backup storage medium, and where the client comprises a client storage medium, the client and the backup server connected to each other by a communication link, a method for reducing the storage requirements and transmission loads in said client-server environment, comprising the steps of:

storing a versioned sequence of a file, using the backup storage medium, in the backup subsystem, said versioned sequence comprising a base version of a file in multiple compressed segments and a predetermined number of delta files;

detecting a changed version of said file at the client;

differencing, using the client, the changed version of the said file and the multiple compressed segments of the base version of said file to create a delta file at the client, said differencing further including the steps of:

compressing the changed version of said file at the client one line at a time;

comparing, at the client, each compressed line of the changed version of said file with the corresponding line of the base version of said file one line at a time;

determining whether a difference between the compressed line of the changed version of said file and the compressed line of the base version of said file has been detected;

decompressing the segment of the base version of said file, known as the current segment, if a difference has been detected;

comparing the uncompressed current segment of the base version of said file and the corresponding segment of the changed version of said file where the difference has been detected;

determining whether the differencing procedure is back in synch at the end of the current segment of the base version of said file; and

storing the changed version of said file in multiple compressed segments;

transmitting, using the communication link, said delta file to the server; and

storing said delta file in the backup subsystem.

4. In a client-server environment having a client and a backup subsystem, the backup subsystem comprising a backup server and a storage medium, and where the client comprises a client storage medium, the client and the server connected to each other by a communication link, said client-server environment comprising:

mean for storing a versioned sequence of a file in the backup subsystem, said versioned sequence comprising a base version of a file in multiple compressed segments and a predetermined number of delta files;

means for storing a copy of the base version of said file in the client;

means for detecting a changed version of said file at the client;

means for differencing the changed version of said file and the multiple compressed segments of the base version of said file to create a delta file at the client, said means for differencing further including:

means for compressing the changed version of said file at the client one line at a time;

means for comparing, at the client, each compressed line of the changed version of said file with the corresponding line of the base version of said file one line at a time;

means for determining whether a difference between the compressed line of the changed version of said file and the compressed line of the base version of said file has been detected;

means for decompressing the segment of the base version of said file, known as the current segment, if a difference has been detected;

means for comparing the uncompressed current segment of the base version of said file and the corresponding segment of the changed version of said file where the difference has been detected;

means for determining whether the differencing procedure is back in synch at the end of the current segment of the base version of said file; and

means for storing the changed version of said file in multiple compressed segments;

means for transmitting the delta file to the backup subsystem; and

means for storing said delta file in the backup subsystem so the base version of said file can be updated at a predetermined time.
 Description Submit all comments and votes
 


CROSS-REFERENCE TO RELATED APPLICATION

U.S. patent application Ser. No. 08/328,633, now pending, entitled SYSTEM AND METHOD FOR REDUCING STORAGE REQUIREMENT IN.sub.-- BACKUP SUBSYSTEMS UTILIZING DIFFERENCING, was filed on the same day, owned by a common assignee and having the same inventor as the present invention.

BACKGROUND OF THE INVENTION

1. Technical Field

This invention relates in general to improvements in the field of computer systems having backup/restore or archive/retrieve subsystems. More particularly, this invention relates to a method and system for reducing the storage requirements of backup subsystems in client-server environments.

2. Description of the Background Art

In a data processing system, a backup/restore subsystem, usually referred to as backup subsystem, is typically used as a means to save a recent copy or version of a file, plus some number of earlier versions of the same file, on some form of backup storage devices such as magnetic disk drives, tapes, or optical storage devices. The backup subsystem is used as a means of protecting against loss of data in a given data processing system. For example, if an on-line version of a file is destroyed or corrupted because of power failure, hardware, or software error, user error or some other type of problem, the latest version of that file which is stored in a backup subsystem can be restored and therefore the risk of loss of data is minimized. Another important use of backup subsystems is that even if failures do not occur, but files or data are deleted or changed (either accidentally or intentionally), those files or data could be restored to their earlier state thus minimizing the loss of data.

Therefore, it can readily be apparent that backup subsystems are and will remain an important part of the field of data processing.

A closely related concept to the backup subsystem is a method and system called archive/retrieve, usually referred to as an archive subsystem. Archiving refers to making copies of files on lower cost storage such as tape so that files can be deleted from more expensive technology such as disk storage. Since disk storage is frequently being updated, an archival copy also allows the state of a collection of data to be captured for later reference, even if the primary copy of the data is not going to be deleted. An example would be the archiving of a set of financial data at the end of a fiscal period. Although the improved method of carrying out the backup disclosed in this application is primarily described for a backup system, it will be obvious to the person of ordinary skill in the art of data processing that the systems and methods described herein are also applicable to archive systems or other related storage management systems.

At the present time, the majority of backup systems run on host systems located in a data processing environment. Typically, a new version (also referred to as changed version) of a file is backed-up based on a predetermined schedule such as, at the end of each day, or after each time that a file has been updated and saved.

Backup systems generally consume large amount of storage media because multiple versions of large amounts of data are being backed up on a regular basis. Therefore, those engaged in the field of data processing and especially in the field of backup/restore systems are continuously striving to find improved methods and systems to reduce the storage demand in backup systems. Current backup systems typically utilize one or both of the following methods to enable the storage of and retrieval of multiple versions of a given file. These are: (1) the full backup method and (2) the incremental backup method.

The full backup method is the most basic method used which requires the backup of an entire collection of files, or a file system, regardless of whether individual files in that collection have been updated or not. Furthermore, in the full backup method multiple full versions of each file are maintained on a storage device. Since maintaining multiple full copies of many files consumes substantial amount of storage, some type of compression technique is sometimes used to reduce the amount of data stored. Compression techniques basically rely on the presence of redundancy within the file, so called intra-file redundancy, in order to achieve this reduction. The most common method is the use of a method of file compression known as Lempel-Ziv method (also known as Adaptive Dictionary Encoder or LZ coding) described in a book by T. C. Bell et. al., titled Text Compression, pp 206-235. The essence of Lempel-Ziv coding is that phrases are replaced with a pointer to where they have occurred earlier in the text, thereby saving the storage space associated with multiple occurrence of any given phrase. This is a general method which can be applied to any file and typically results in compression ratios of the order of between 2 and 3.

Incremental backup method is an alternative to full backup used in backup systems where only those files, in any given collection of files, are backed up which have been changed since the previous incremental or full backup.

It is apparent to those skilled in the art that in any given backup system, the higher the backup frequency, the more accurately the backup copy will represent the present state of data within a file. Considering the large volume of data maintained and continuously generated in a typical data processing system, the amount of storage, time, and other resources associated with backing up data are very substantial. Thus, those skilled in the art are continuously engaged in searching for better alternatives and more storage and time efficient systems and methods for backing up data.

Aside from the compression technique which is heavily utilized to reduce storage requirement in a backup system, there exists a quite different method of achieving reduction in file size, known as delta versioning. Delta versioning has never been used in any backup system.

Delta versioning which is also referred to as "differencing" or "deltaing" relies on comparison between two files where multiple version of a file is saved in a form of a "base" file, also called a "base version" of a file, together with predetermined number of small files which represent only the changes to the base file. The small files, also referred to as "delta" files or "difference" files, contain the difference or delta from the base file. Delta files are generated as a result of comparing the base file with a later (newly arrived) or an earlier version of the base file. Thus this method of storage reduction exploits redundancy between files, or "inter-file" redundancy, in order to achieve reduction in storage requirement. This method which is used in the software art of Source Code Control Systems, discussed in a reference below, can provide substantial storage saving in backup systems, since frequently the selection of a file for incremental backup occurs after a small change has been made to that file. Therefore, since many copies are frequently made in backup systems to files that differ only slightly from one another, the differencing method offers great potential for substantial reduction in the amount of data stored in backup subsystems.

At the present time none of the backup systems that use compression techniques utilize delta versioning. Moreover, no one has ever invented a method and system allowing the use of compression and delta versioning together in the same backup system.

Delta versioning falls into two general classes: one is where the base file is the oldest version of a file and the delta files represent newer versions. This method is referred to as "forward" deltas. The other is where the base file is the latest version of a file and the delta files represent older versions. This method is referred to as "reverse" deltas. The "reverse" delta is the more common method because usually the most utilized version of a file is the last version created.

A technical paper by M. J. Rochkind, titled "The Source Code Control System", IEEE Transaction on Software Engineering, Vol. SE-1, No. 4, December 1975, PP 364-370, teaches a software tool, known as source code control system (SCCS) which is designed to help managing changes to a source code (source program) in the field of software development tools. In SCCS environment, every time a module (file) is changed the change is stored as a discrete delta where the space required to store a delta is only slightly greater than the amount of text inserted by that delta. However, Rochkind does not teach or suggest the use of delta files in a backup and archiving subsystem in either a central or a client-server environment as a means for reducing the storage requirements of such subsystems.

U.S. Pat. No. 4,912,637 issued on Mar. 27, 1990 to C. R. Sheedy et al., teaches a system for preserving, generating, and merging various versions of the same file by a modified delta method. Sheedy teaches using an indexed line file where every line active in any version of a given file is stored, together with a variant history file where the history of the status of each line in various versions is recorded. Using these two files, any desired version of a program may be generated directly without the need for creating any of the intermediate versions. However, Sheedy does not teach or suggest the use of this modified method in backup and archiving systems in either a central processing or a client/server environment as a means for reducing the storage requirement of a backup system.

U.S. Pat. No. 5,263,154, issued on Nov. 16, 1993 to L. E. Eastridge et al., teaches a method and system for incremental backup copying of a file in a data processing system which minimizes the suspension of the data processing system during such backup copying. This is done by first physically backing up a data set on a storage subsystem on a scheduled or opportunistic basis. Thereafter, creating side-files of the data set modified. The side-files are then used in the next scheduled or opportunity to update the backed-up data set. However, Eastridge does not teach or suggest the use of delta files as a means of minimizing storage requirement in a backup and archiving subsystem in either a central processing environment or a client-server environment.

U.S. Pat. No. 5,278,979 issued on Jan. 11, 1994 to R. D. Foster, et al., teaches a method and system in the field of software code development known as Single Entity Versioning where by creating and maintaining a unique version identification and a control data file, multiple versions of source data is efficiently stored in a single entity. However, Foster does not teach or suggest the use of delta versioning or Single Entity Versioning as a means of minimizing storage requirement in a backup and archiving subsystem in either a central processing environment or a client-server environment.

Considering that the amount of the data generated on the daily basis by the computers is growing at a very fast rate, there is a need for an improved and innovative method and system to reduce the storage requirements of backup systems in central data processing systems and further in client-server environments which as will be discussed below present unique backup issues.

Backup Subsystems in a Client-Server Environment

Recently, the emergence of low cost local area networking, personal computer, and workstation technology has promoted a new type of data processing architecture known as the "client-server" system or environment. A client-server system 10, as shown in FIG. 1, typically consists of: (1) client computers (also referred to as clients) 11 such as personal computers or workstations with their own local storage medium 12 such as disk storage devices; (2) a local area network (also referred to as LAN or network) 13 such as an Ethernet or a Token Ring which links the clients to the LAN server(s); and (3) one or more LAN server computers 14 such as a personal computer or perhaps a workstation with its own local storage mediums 15 such as disk storage devices, tape storage and/or optical storage devices.

In a client-server environment, the majority of the data processing is usually carried out at the clients which are connected by a local area or other network to a LAN server. The LAN server usually contains various programs or data which are commonly used by many of the clients. Computer users which usually use clients to carry on their data processing tasks, are generally in control of the client computers whereas the LAN server(s) is usually administered by an expert administrator of a data processing (computing) center.

The client-server environment presents a number of major issues as relates to data processing, integrity, and backup of such data. One major concern in the client-server environment is that a substantial amount of important data is located on client subsystems which lack the security, reliability or care of administration that is typically applied to the server machine(s). There is a further concern that data may accidentally be lost from a client computers, because the users of such computers do not take time and necessary care to back up the data on a regular basis. There is yet another concern that the amount of data residing on the clients are so substantial that even if a client-server backup subsystem could be developed to attempt to backup all these data, the amount of backup storage required to save all the data on the clients would be inordinate and impractical. The lack of an efficient backup system and method has been a major barrier to the adoption and rapid growth of client-server technology despite its many attractive features.

Recently a number of client-server backup systems have been developed to alleviate some of the concerns listed above. An example is an IBM's ADSM (ADSTAR Distributed Storage Manager) product. This technology overcomes some of the deficiencies mentioned above by making backup copies of the client data on a backup server. The client copies are made automatically without user involvement and are stored on storage devices which are administered by the backup server.

A typical client-server backup subsystem such as ADSM operates as follows. In the client computer a program exists, known as the client backup program, which at pre-specified or periodic times is activated and makes contact with a program residing on the backup server, known as the server backup program. After establishing contact and establishing authentication, the server backup program then consults "policy data" which instructs the server backup program as to what sort of a backup operation should occur and which files on the client computer are the subjects of the current backup. It then searches all or a subset of files on the client computer, determining which files should be backed up. For example, a data file which has changed since the backup program was last run may cause that file to be selected for the backup operation. After selecting the files to be backed up, the client backup program transmits those files, using the LAN, to the server backup program. The server backup program then makes an entry in a "backup catalog" for each file received and then stores those files on storage devices attached to the backup server.

The server backup program also carries out several other important operations. One such operation is the maintenance of its storage pools. For example, backup copies of files that were made many months ago may be moved from disk storage to tape storage in order to reduce storage costs. Another important function of the client and server backup programs occurs when the user requests the restoration of a file. The client backup program contacts the server backup program which consults its backup catalog to establish the location of the backup copy of the file. It then returns that file across the network to the client computer which in turn makes it available to the user.

Hardware which is typically needed for implementing a backup system in a client-server system includes: one or more server computers such as PC or workstations and storage mediums such as IBM 3390 magnetic storage system, IBM 3494 tape storage library or IBM 3595 optical library. These libraries which provide automated mechanical mounting and demounting of tape or optical cartridges into read/write drives and retrieve them from or replace them within the storage shelves are sometimes referred to as "jukeboxes".

Despite the recent improvements made in the field of client-server backup systems, several shortcomings have remained in all client-server backup systems including ADSM. One of the shortcomings, as mentioned earlier, is that the very large number of files on the clients now being regularly backed up tend to generate very large amounts of data resulting in large storage requirements and therefore substantially more cost in backing up data. Although systems such as ADSM compress this data on the storage devices, the amount of data remains very large. A second difficulty that is being observed is that the local area network technology is frequently unable to complete transmission of all of the changed files, even in only an incremental backup, to the backup server during the designated period for backup operations (e.g., a night shift). This is due to the bandwidth limitation of the communication network (which might include low speed remote telephony data links) and large amount of data that has to be transmitted from numerous clients to the backup server.

It is apparent now that implementation of an efficient backup subsystem in a computer processing environment is a formidable task and in a client-server environment poses significant challenges of its own. Therefore, there is a need for a new and novel backup method and system in a client-server environment that not only substantially reduces the storage requirement of backup subsystem but also minimizes the burden on the communication link between the clients and the backup server. The present invention addresses these two major deficiencies currently present in all client-server backup subsystem by providing alternative methods and systems which can be used to reduce the amount of data storage required in a client-server backup subsystem and reduces the burden on the bandwidth of the transmission network.

SUMMARY OF THE INVENTION

It is therefore an object of the present invention to provide improved backup and archiving methods and subsystems in a data processing environment.

It is a further object of the present invention to provide improved backup and archiving methods and subsystems in a client-server environment.

It is a further object of the present invention to reduce network transmission cost in a client-server environment.

It is another object of the present invention to reduce data processing loads on a backup server in a client-server environment.

It is another object of the present invention to reduce the transmission bandwidth requirement in a client-server environment.

It is another object of this invention to provide a method and system for combining compression and differencing so that both intra-file and inter-file compression can be exploited together in a backup subsystem.

It is yet a further object of the present invention to provide a method and system for combining the difference method with compression methods in a client-server subsystem, thereby reducing the network transmission cost.

It is another object of the present invention to provide a method for combining the difference method with compression methods in a client-server subsystem, thereby reducing the storage requirement in the backup subsystem of a client-server system.

It is another object of the present invention to provide a method for combining the difference method with segmented compression methods in a client-server subsystem, thereby reducing the storage requirement in the backup subsystem of a client-server system.

The foregoing objects are achieved by the invention disclosed herein. Briefly stated, in one embodiment of the invention, a file, called the "new" file (also referred to as new version of the base file or changed version of the base file), is recognized to have been changed at the client, and is then transmitted to the backup server (referred to as server). At the server the new file is differenced against the previous latest version of the file, called the "previous" file (also referred to as "base" file) which is in the compressed form to produce a file referred to as new delta. That is, the differencing is carried out on the compressed version of the base file and the changed version of the base file without decompressing the entire base file. This new delta can simply be stored along with the "new" file or the "base" file depending on whether reverse or forward delta versioning is used, respectively. Based on controlling policy management, if a fixed limit on the number of delta files is enforced, then the oldest delta is deleted.

In another embodiment of the present invention, in a client-server environment, the differencing is carried out on the compressed version of the base file and the changed version of the base file without decompressing the entire base file by the client rather than the server. However, in order for the client to carry out the differencing operation, it needs to keep a copy of the base file at the client. When the base file is modified on the client, then a new delta file is created by the client and transmitted to the server. The server may simply save the new delta and then use it at a later time to modify the compressed base file in the backup subsystem, or may apply it immediately against the base file to create a copy of the new file.

Alternatively, instead of storing a copy of the compressed base file at the client for the purpose of differencing, the compressed base file may simply be transmitted to a client whenever the client needs to modify the base file. Once the compressed base file is modified on the client, a new delta file is created and transmitted back to the server to be used and stored by backup subsystem. Note that by carrying out the differencing at the client rather than the server, the size of the file that has to be sent back to the server is substantially reduced, thereby substantially reducing the burden on the transmission network.

By using differencing and segmented compression substantial reduction in storage requirement, network transmission bandwidth, and CPU efforts expended on compression/decompression of the base file can be achieved. The use of differencing method and compression can be carried out together either at clients or server or at both the clients and the server.

In the past, the use of compression and differencing together on the same file was considered not possible since small change to a file could result in large changes to the compressed version of the file. This is because a change of one byte in a file can cause arbitrary changes in the remainder of the compressed file. Thus, if two files differ in the middle by only one character, the second half of compressed version of each file could turn out to be totally dissimilar. Furthermore, such a difference could result in an arbitrary change in the length of compressed file. Because of this problem, compression and differencing has never been used together in any backup or storage subsystem. However, this problem, according to the invention presented here, is solved by compressing the base file into multiple segments.

According to this embodiment of the present invention, the base file is compressed into multiple segments. A new file which is detected is also compressed, and as is being compressed, the base file and the newly compressed file are then compared without decompressing. As soon as a difference between the two files is detected, the appropriate segment of the base file is decompressed and compared with the corresponding uncompressed portion of the new file so a delta file can be generated. When the comparison between the appropriate segments are completed, if the differencing method is "back in synch" (i.e., in the state of not detecting any more differences), then the comparison of the compressed version of the base file and the new file is continued. If the differencing method is not back in synch at the end of the segments under comparison, comparison of the uncompressed segments of data in the base file and the new file is continued possibly to the end of the files.

The innovative embodiment stated in the previous paragraph is applicable in computing the deltas when differencing is carried out either at the server or client or both. Furthermore, this embodiment teaches a novel method and system to use both segmented compression and differencing together in a backup and archiving system. This is a significant breakthrough in the art of backup systems since current backup systems such as ADSM only make use of file compression but have not been able to implement a method that uses compression and differencing together.

It should be noted that whereas compression typically results in a space saving factor of two or three, the space saving when utilizing differencing can be much larger. For example, if n versions of a file are saved and they contain only small differences (e.g., a few lines are changed or appended in each file), then the space saving factor can approach n.

BRIEF DESCRIPTION OF THE DRAWINGS

For a fuller understanding of the nature and advantages of the present invention, as well as the preferred mode of use, reference should be made to the following detailed description read in conjunction with the accompanying drawings.

FIG. 1 is a schematic diagram of a typical client-server environment;

FIG. 2 is a schematic diagram of a client-server environment having a backup subsystem;

FIG. 3A is a flow chart illustrating the backup operation at the client of the preferred embodiment of the present invention;

FIG. 3B is a flow chart illustrating the backup operation at the server of the preferred embodiment of the present invention;

FIG. 4 is a flow chart illustrating implementation of diff(,,) operation;

BEST MODE FOR CARRYING OUT THE INVENTION

The following description is the best mode presently contemplated for carrying out the invention. This description and the number of alternative embodiments shown are made for the purpose of illustrating the general principle of the present invention and is not meant to limit the inventive concepts claimed herein.

With reference now to FIG. 2, there is shown a client-server system 20 having a backup subsystem. System 20 typically includes a plurality of client computers 21 with their own local storage medium 22 such as disk storage devices. The client computers (clients) 21 may typically be personal computers of the type having a system unit (not shown) which includes CPU (processor), I/O control, and semiconductor and magnetic memories and DOS, OS/2, or Apple Macintosh operating systems. The client computers 21 may further be workstations of the type having AIX, UNIX or equivalent operating systems. These operating systems are well known to those skilled in the art of data processing and need no further description.

Still referring to FIG. 2, the client-server system 20 further includes a local area network (LAN) 23 such as Ethernet or a Token Ring which provides the communication link between the clients 21 to the backup server(s) 25.

Backup server computer 25 may typically be a personal computer of the type having a system unit (not shown) which includes CPU (processor), I/O control, and semiconductor and magnetic memories and DOS, OS/2 or Apple Macintosh operating system. It may also be a workstation having a system unit and UNIX or AIX or equivalent operating system. It may also be a large system running the AS/400, VM or MVS operating systems. Computer 25 further has its own local storage mediums 26 such as disk storage devices 27, optical library (storage) devices 28, or tape library (storage) devices 29. In a client-server system 20 shown in FIG. 2, backup system usually resides at the backup server 25. A typical backup system that resides at the server is IBM Advanced Distributed Storage Manager (ADSM) which has been explained in detail in the background section of this application. The operation and physical implementation of personal computers, workstations, disk storage devices, optical library, tape library and their constituents are well known to those skilled in the art of data processing and requires no further description.

General Notation

We now state a general notation for describing "base file" and "delta file" storage which will be applicable to all the embodiments described herein.

We will assume that a versioned sequence of files which are stored can be represented in the general form:

[d.sub.-- 1, d.sub.-- 2, . . . , d.sub.13 n, F, d