Distributed file access structure lock    
United States Patent 5175852
Link to this page: http://www.wikipatents.com/5175852.html
Inventor(s): Johnson; Donavon W. (Georgetown, TX); Shaheen-Gouda; Amal A. (Austin, TX); Smith; Todd A. (Austin, TX)
Abstract: A distributed file management system (DFS) with a plurality of nodes and a plurality of files is disclosed. The DFS uses the UNIX operating system tree structure employing inodes (data structures containing the administrative information of each file) to manage the local files and surrogate inodes (s_inode) to manage access to files existing on another node. In addition, the DFS uses a file access structure lock (fas_lock) to manage multiple requests to a single file. The primary reason for the addition of the fas_lock for each file is to avoid the problem of deadlocks. The inodes and s_inodes use the fas_lock to synchronize their accesses to a file and avoid a deadlock situation where both s_inode and inode await the use of a file that is locked by the other.
   














Inventor: Johnson; Donavon W. (Georgetown, TX); Shaheen-Gouda; Amal A. (Austin, TX); Smith; Todd A. (Austin, TX)
Owner/Assignee: International Business Machines Corporation (Armonk, NY)
Publication Date: December 29, 1992
Application Number: 07/418,750
Filing Date: October 4, 1989
Examiner: Clark; David L.
Assistant Examiner: Von Buhr; Maria N.
Attorney/Law Firm: Whitham & Marhoefer
Parent Case: This is a continuation of application Ser. No. 014,900 filed Feb. 13, 1987, now abandoned.
References

U.S. References
Patent No.   Inventor      Classification   Date
4819159      Shipley       714/19           Apr. 1989
4527237      Frieder       709/253          Jul. 1985
4414624      Summer, Jr.   712/21           Nov. 1983
4224664      Trinchieri    714/25           Sep. 1980
4104718      Poublan       707/8            Aug. 1978
Claims
 


Having thus described our invention, what we claim as new and desire to secure by Letters Patent is set forth in the following claims:

1. A distributed system having data in a file residing at a server data processing system, said file data being accessible by a plurality of processes in at least one client data processing system and said server data processing system, said server data processing system and said at least one client data processing system connected by a communication means, said distributed system comprising:

a first lock means, in said server data processing system, for serializing access to the data in the file by processes at the server data processing system;

a second lock means, in said client data processing system, for serializing access to data in a cache in the client data processing system corresponding to said data in said file by processes at the client data processing system;

a third lock means in said server data processing system for serializing access to a file access structure list containing descriptions of locks granted by said first lock means at said server data processing system; and

means for using said third lock means, to lock said file access structure list in said server data processing system, instead of using said first lock means to lock said data in said file, by an operation which is capable of causing a remote procedure call to be generated between the server data processing system and the client data processing system, said remote procedure call requiring access to the data in the cache at the client data processing system and to the data in the file at the server data processing system, thereby avoiding a lock simultaneously existing on both said first lock means and said second lock means during a remote procedure call.

2. The distributed system of claim 1 wherein said third lock means serializes use of said first lock means when one of said plurality of processes closes said file.

3. The distributed system of claim 1 wherein said third lock means serializes use of said first lock means when one of said plurality of processes opens said file.

4. The distributed system of claim 1 wherein said third lock means serializes use of said first lock means when a list of processes having current access to said file is interrogated.

5. The distributed system of claim 1 wherein said third lock means serializes use of said first lock means during a write to said file when said file is open for write access in more than one of a plurality of data processing systems.

6. The distributed system of claim 1 wherein said first lock means for serializing access to said data in said file at said server data processing system is locked by a process during an operation to said file in said server data processing system.

7. The distributed system of claim 1 wherein said first lock means for serializing access to said data in said file at said server data processing system is locked by a process during a read operation to said data in said file at said server data processing system.

8. The distributed system of claim 1 wherein said first lock means for serializing access to said data in said file at said server data processing system is locked by a process during a write operation to said data in said file at said server data processing system.

9. The distributed system of claim 1 wherein said first lock means for serializing access to said data in said file at said server data processing system is unlocked by a process before said remote procedure call is sent to said client data processing system from said server data processing system if the remote procedure call requires a lock on said second lock means for serializing access to said data in said cache in said client data processing system.

10. The distributed system of claim 1 wherein said first lock means for serializing access to said data in said file is unlocked by a process at said server data processing system when said server data processing system originates said remote procedure call to the client data processing system, thereby allowing the server data processing system to accept read and write requests after said remote procedure call is sent.

11. The distributed system of claim 1 wherein said second lock means for serializing access to said data in said cache at said client data processing system is locked by a process during an operation at said client data processing system requiring access to said data in said cache in said client data processing system.

12. The distributed system of claim 1 wherein said second lock means for serializing access to said data in said cache at said client data processing system is unlocked by a process if an operation at said client data processing system requires a lock on said third lock means.

13. The distributed system of claim 1 wherein said second lock means for serializing access to said data in said cache at said client data processing system is locked by a process during an operation at said client data processing system requiring access to said data in said cache until said remote procedure call is sent from client data processing system to said server data processing system.

14. The distributed system of claim 1 wherein said second lock means for serializing access to said data in said cache is unlocked at one of a plurality of client data processing systems by a process when said remote procedure call is sent to said server data processing system from said one of said plurality of client data processing systems.

15. A method, in a data processing system, of preventing a deadlock between a first lock that serializes access to data in a file at a server data processing system and a second lock that serializes file access within a client process in a client data processing system, said method comprising:

locking a third lock for serializing access to a list of data corresponding to at least said first lock, representing client data processing systems having current access to said file;

locking said first lock for serializing access to said data in said file at said server data processing system; and

unlocking said first lock at said server data processing system before sending, by said server data processing system, a remote procedure call to said client data processing system to lock said second lock, by an operation executing at said client data processing system requiring access to the file.

16. A method, in a data processing system, of preventing a deadlock between a first lock that serializes access to data in a file at a server data processing system and a second lock that serializes file access in a cache in a client data processing system, said method comprising:

locking said second lock at said client data processing system by an operation executing at said client data processing system and accessing data in said cache corresponding to said file;

unlocking said second lock by an operation at said client data processing system before a remote procedure call request is sent from said client data processing system to said server data processing system;

locking a third lock in said server data processing system, by said remote procedure call request, for serializing access to a list of client data processing systems having current access to said file;

locking said first lock by an operation executing at said server data processing system and requiring access to said file at said server data processing system; and

unlocking said first lock by said server data processing system while maintaining said third lock before sending said remote procedure call to said client data processing system if said remote procedure call requires a lock on said second lock, thereby allowing said server data processing system to accept read and write operations requiring a lock on said first lock after said remote procedure call is sent.

17. A method, in a data processing system, of preventing a deadlock between a first lock that serializes access to data in a file at a server data processing system and a second lock that serializes access to data, corresponding to said file, in a cache in a client data processing system, said method comprising:

locking said first lock by an operation executing at said server data processing system and requiring access to said file at said server data processing system;

locking said second lock by an operation executing at the client data processing system requiring access to said data in said cache;

unlocking said second lock by an operation at said client data processing system if said operation generates a remote procedure call from said client data processing system to said server data processing system; and

locking a third lock for serializing access to a list of files representing client accesses to said files in said server data processing system by a second operation executing in said server data processing system in response to said remote procedure call received by said server data processing system from said client data processing system, thereby avoiding locking of both said first lock and said second lock during said remote procedure call.

18. A method of preventing deadlocks in a distributed data processing system of the type having at least one server data processing system having at least one file physically residing at the server data processing system and having a first data structure representing said file at said server data processing system, and at least one client data processing system having access to data from said file by a communications link between said server data processing system and said client data processing system, said client data processing system having access to said data from said file from a cache at the client data processing system, said client data processing system further having a second data structure representing said cached data at said client data processing system, said method of preventing deadlocks in a distributed processing system comprising:

locking said first data structure during execution of an operation from a process, at said server data processing system, requiring access to at least a portion of said file at said server data processing system;

locking said second data structure during execution of an operation from a process, at said client data processing system, requiring access to at least a portion of said file in the cache at said client data processing system; and

unlocking said second data structure before controlling access to at least a portion of said file residing at said server data processing system by locking a third data structure, in said server data processing system, instead of locking said first data structure by said at least one client data processing system, thereby maintaining the control of the locking of said first data structure by said server data processing system and eliminating said first data structure as a critical locking resource between said client data processing system and said server data processing system.
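
As an informal illustration only, and not part of the claims, the locking discipline recited in claims 15 through 18 might be sketched as follows. The class names, the method names, and the use of a cache-invalidation callback as the remote procedure call that needs the second lock are all assumptions made for this sketch; remote procedure calls are modeled as ordinary method calls within one process.

```python
import threading


class Server:
    """Hypothetical server node holding the file."""

    def __init__(self):
        self.inode_lock = threading.Lock()  # "first lock": file data at the server
        self.fas_lock = threading.Lock()    # "third lock": file access structure list
        self.fas_list = []                  # (client_id, mode) entries for current opens
        self.clients = {}                   # client_id -> Client, stands in for the network

    def open_rpc(self, client_id, mode):
        """Handle an open request that arrived by RPC from a client."""
        with self.fas_lock:                 # held for the whole operation
            with self.inode_lock:           # short critical section on the file itself
                self.fas_list.append((client_id, mode))
                others = sorted({c for c, _ in self.fas_list if c != client_id})
            # The first lock is released here, before any callback RPC is sent,
            # while the third lock is still held.
            for other in others:
                assert not self.inode_lock.locked()
                self.clients[other].invalidate_rpc()
            return len(self.fas_list)


class Client:
    """Hypothetical client node caching data from the server's file."""

    def __init__(self, client_id, server):
        self.client_id = client_id
        self.server = server
        self.s_inode_lock = threading.Lock()  # "second lock": cached data at the client
        self.cache_valid = False
        server.clients[client_id] = self

    def open_file(self, mode):
        with self.s_inode_lock:               # local bookkeeping under the second lock
            self.cache_valid = False
        # The second lock is released before the RPC leaves this node.
        assert not self.s_inode_lock.locked()
        count = self.server.open_rpc(self.client_id, mode)
        with self.s_inode_lock:
            self.cache_valid = True
        return count

    def invalidate_rpc(self):
        """Callback RPC from the server; it needs the second lock."""
        with self.s_inode_lock:
            self.cache_valid = False
```

Because neither side holds its own data lock while a call is in flight to the other side, the situation in which each node waits on a lock held by the other does not arise in this sketch; only the third lock stays held across the server's outgoing call.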
Description
 


CROSS REFERENCE TO RELATED APPLICATIONS

This application is related in subject matter to the following applications filed concurrently herewith and assigned to a common assignee:

Application Ser. No. 07/014,899 filed by A. Chang, G. H. Neuman, A. A. Shaheen-Gouda, and T. A. Smith for A System And Method For Using Cached Data At A Local Node After Re-opening A File At A Remote Node In A Distributed Networking Environment; now U.S. Pat. No. 4,897,781, issued Jan. 30, 1990.

Application Ser. No. 07/014,884 filed by D. W. Johnson, L. W. Henson, A. A. Shaheen-Gouda, and T. A. Smith for Negotiating Communication Conventions Between Nodes in a Network; now abandoned.

Application Ser. No. 07/014,897 filed by D. W. Johnson, G. H. Neuman, C. H. Sauer, A. A. Shaheen-Gouda, and T. A. Smith for A System And Method For Accessing Remote Files In A Distributed Networking Environment; now U.S. Pat. No. 4,887,204, issued Dec. 12, 1989.

Application Ser. No. 07/014,891 filed by L. W. Henson, A. A. Shaheen-Gouda, and T. A. Smith for Distributed File and Record Locking; now abandoned.

Application Ser. No. 07/014,892 filed by D. W. Johnson, L. K. Loucks, C. H. Sauer, and T. A. Smith for Single System Image; Uniquely Defining an Environment For Each User In a Data Processing System; now abandoned.

Application Ser. No. 07/014,888 filed by D. W. Johnson, L. K. Loucks, A. A. Shaheen-Gouda for Interprocess Communication Queue Location Transparency; now U.S. Pat. No. 5,133,053, issued Jul. 21, 1992.

Application Ser. No. 07/014,889 filed by D. W. Johnson, A. A. Shaheen-Gouda, and T. A. Smith for Directory Cache Management In a Distributed Data Processing System.

The disclosures of the foregoing co-pending applications are incorporated herein by reference.

DESCRIPTION

Field of the Invention

This invention generally relates to improvements in operating systems for a distributed data processing system and, more particularly, to an operating system for a multi-processor system interconnected by a local area network (LAN) or a wide area network (WAN). IBM's Systems Network Architecture (SNA) may be used to construct the LAN or WAN. The operating system according to the invention permits the accessing of files by processors in the system, no matter where those files are located in the system. The invention is disclosed in terms of a preferred embodiment which is implemented in a version of the UNIX¹ operating system; however, the invention could be implemented in other and different operating systems.

BACKGROUND OF THE INVENTION

Virtual machine operating systems are known in the prior art which make a single real machine appear to be several machines. These machines can be very similar to the real machine on which they are run or they can be very different. While many virtual machine operating systems have been developed, perhaps the most widely used is VM/370 which runs on the IBM System/370. The VM/370 operating system creates the illusion that each of several users operating from terminals has a complete System/370 with varying amounts of disk and memory capacity.

The physical disk devices are managed by the VM/370 operating system. The physical volumes residing on disk are divided into virtual volumes of various sizes and assigned and accessed by users carrying out a process called mounting. Mounting defines and attaches physical volumes to a VM/370 operating system and defines the virtual characteristics of the volumes such as size, security and ownership.

Moreover, under VM/370 a user can access and use any of the other operating systems running under VM/370 either locally on the same processor or remotely on another processor. A user in Austin can use a function of VM/370 called "passthru" to access another VM/370 or MVS/370 operating system on the same processor or, for example, a processor connected into the same SNA network and located in Paris, France. Once the user has employed this function, the files attached to the other operating system are available for processing by the user.

There are some significant drawbacks to this approach. First, when the user employs the "passthru" function to access another operating system either locally or remotely, the files and operating environment that were previously being used are no longer available until the new session has been terminated. The only way to process files from the other session is to send the files to the other operating system and effectively make duplicate copies on both disks. Second, the user must have a separate "logon" on all the systems that are to be accessed. This provides the security necessary to protect the integrity of the system, but it also creates a tremendous burden on the user. For further background, the reader is referred to the text book by Harvey M. Deitel entitled An Introduction to Operating Systems, published by Addison-Wesley (1984), and in particular to Chapter 22 entitled "VM: A Virtual Machine Operating System". A more in depth discussion may be had by referring to the text book by Harold Lorin and Harvey M. Deitel entitled Operating Systems, published by Addison-Wesley (1981), and in particular to Chapter 16 entitled "Virtual Machines".

The invention to be described hereinafter was implemented in a version of the UNIX operating system but may be used in other operating systems having characteristics similar to the UNIX operating system. The UNIX operating system was developed by Bell Telephone Laboratories, Inc., for use on a Digital Equipment Corporation (DEC) minicomputer but has become a popular operating system for a wide range of minicomputers and, more recently, microcomputers. One reason for this popularity is that the UNIX operating system is written in the C programming language, also developed at Bell Telephone Laboratories, rather than in assembly language so that it is not processor specific. Thus, compilers written for various machines to give them C capability make it possible to transport the UNIX operating system from one machine to another. Therefore, application programs written for the UNIX operating system environment are also portable from one machine to another. For more information on the UNIX operating system, the reader is referred to UNIX™ System, User's Manual, System V, published by Western Electric Co., January 1983. A good overview of the UNIX operating system is provided by Brian W. Kernighan and Rob Pike in their book entitled The Unix Programming Environment, published by Prentice-Hall (1984). A more detailed description of the design of the UNIX operating system is to be found in a book by Maurice J. Bach, Design of the Unix Operating System, published by Prentice-Hall (1986).

AT&T Bell Labs has licensed a number of parties to use the UNIX operating system, and there are now several versions available. The most current version from AT&T is version 5.2. Another version known as the Berkeley version of the UNIX operating system was developed by the University of California at Berkeley. Microsoft, the publisher of the popular MS-DOS and PC-DOS operating systems for personal computers, has a version known under their trademark as XENIX. With the announcement of the IBM RT² PC (RISC (reduced instruction set computer) Technology Personal Computer) in 1985, IBM Corp. released a new operating system called AIX³ (Advanced Interactive Executive) which is compatible at the application interface level with AT&T's UNIX operating system, version 5.2, and includes extensions to the UNIX operating system, version 5.2. For more description of the AIX operating system, the reader is referred to AIX Operating System Technical Reference, published by IBM Corp., First Edition (Nov. 1985).

The invention is specifically concerned with distributed data processing systems characterized by a plurality of processors interconnected in a network. As actually implemented, the invention runs on a plurality of IBM RT PCs interconnected by IBM's Systems Network Architecture (SNA), and more specifically SNA LU 6.2 Advanced Program to Program Communication (APPC). SNA uses as its link level Ethernet⁴, a local area network (LAN) developed by Xerox Corp., or SDLC (Synchronous Data Link Control). A simplified description of local area networks including the Ethernet local area network may be found in a book by Larry E. Jordan and Bruce Churchill entitled Communications and Networking for the IBM PC, published by Robert J. Brady (a Prentice-Hall company) (1983). A more definitive description of communications systems for computers, particularly of SNA and SDLC, is to be found in a book by R. J. Cypser entitled Communications Architecture for Distributed Systems, published by Addison-Wesley (1978). It will, however, be understood that the invention may be implemented using other and different computers than the IBM RT PC interconnected by other networks than the Ethernet local area network or IBM's SNA.

As mentioned, the invention to be described hereinafter is directed to a distributed data processing system in a communication network. In this environment, each processor at a node in the network potentially may access all the files in the network no matter at which nodes the files may reside. As shown in FIG. 1, a distributed network environment 1 may consist of two or more nodes A, B and C connected through a communication link or network 3. The network 3 can be a local area network (LAN) as mentioned or a wide area network (WAN), the latter comprising a switched or leased teleprocessing (TP) connection to other nodes or to a SNA network of systems. At any of the nodes A, B or C there may be a processing system 10A, 10B or 10C, such as the aforementioned IBM RT PC. Each of these systems 10A, 10B and 10C may be a single user system or a multi-user system with the ability to use the network 3 to access files located at a remote node in the network. For example, the processing system 10A at local node A is able to access the files 5B and 5C at the remote nodes B and C.

The problems encountered in accessing remote nodes can be better understood by first examining how a standalone system accesses files. In a standalone system, such as 10 shown in FIG. 2, a local buffer 12 in the operating system 11 is used to buffer the data transferred between the permanent storage 2, such as a hard file or a disk in a personal computer, and the user address space 14. The local buffer 12 in the operating system 11 is also referred to as a local cache or kernel buffer. For more information on the UNIX operating system kernel, see the aforementioned books by Kernighan et al. and Bach. The local cache can be best understood in terms of a memory resident disk. The data retains the physical characteristics that it had on disk; however, the information now resides in a medium that lends itself to faster data transfer rates very close to the rates achieved in main system memory.

In the standalone system, the kernel buffer 12 is identified by blocks 15 which are designated as device number and logical block number within the device. When a read system call 16 is issued, it is issued with a file descriptor of the file 5, and a byte range within the file 5 as shown in step 101 in FIG. 3. The operating system 11 takes this information and converts it to device number and logical block numbers of the device in step 102. Then the operating system 11 reads the cache 12 according to the device number and logical block numbers in step 103.

Any data read from the disk 2 is kept in the cache block 15 until the cache block 15 is needed for other data. Consequently, any successive read requests from an application program 4 that is running on the processing system 10 for the same data previously read from the disk are satisfied from the cache 12 and not the disk 2. Reading from the cache is less time consuming than accessing the disk; therefore, by reading from the cache, performance of the application 4 is improved. Obviously, if the data which is to be accessed is not in the cache, then a disk access must be made, but this requirement occurs infrequently.
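
The read path of steps 101 through 103 can be illustrated with a minimal sketch. It assumes, purely for brevity, that a file occupies consecutive logical blocks starting at a known block number (a real UNIX kernel consults the inode's block map instead), and every class, function, and variable name below is hypothetical.

```python
BLOCK_SIZE = 4  # deliberately tiny block size so the example is easy to trace


class Disk:
    """Stand-in for permanent storage; counts physical reads."""

    def __init__(self, blocks):
        self.blocks = blocks      # {(device, block_no): bytes}
        self.reads = 0

    def read_block(self, device, block_no):
        self.reads += 1
        return self.blocks[(device, block_no)]


class BufferCache:
    """Kernel buffer whose blocks are identified by device and logical block number."""

    def __init__(self, disk):
        self.disk = disk
        self.buffers = {}

    def get_block(self, device, block_no):
        key = (device, block_no)
        if key not in self.buffers:          # miss: a disk access must be made
            self.buffers[key] = self.disk.read_block(device, block_no)
        return self.buffers[key]             # hit: served from memory


def read(cache, file_table, fd, offset, length):
    """Step 101: the call names a file descriptor and a byte range."""
    device, first_block = file_table[fd]     # step 102: convert to device and blocks
    start = offset // BLOCK_SIZE
    end = (offset + length - 1) // BLOCK_SIZE
    data = b""
    for i in range(start, end + 1):          # step 103: read through the cache
        data += cache.get_block(device, first_block + i)
    skip = offset - start * BLOCK_SIZE
    return data[skip:skip + length]
```

In this sketch a second read of the same byte range leaves the disk read counter unchanged, which is the performance benefit the paragraph above describes.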

Similarly, data written from the application 4 is not saved immediately on the disk 2 but is written to the cache 12. This again saves time, improving the performance of the application 4. Modified data blocks in the cache 12 are saved on the disk 2 periodically under the control of the operating system 11.
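
The delayed-write behavior can be sketched in the same spirit. The names below are hypothetical, and the explicit `sync` method merely stands in for the periodic flush that the operating system performs on its own schedule.

```python
class WriteBackCache:
    """Writes land in memory first; modified blocks reach the disk only on sync."""

    def __init__(self):
        self.disk = {}          # stand-in for permanent storage
        self.buffers = {}       # (device, block_no) -> bytes held in memory
        self.dirty = set()      # keys of blocks modified since the last sync
        self.disk_writes = 0    # counts physical writes

    def write_block(self, device, block_no, data):
        key = (device, block_no)
        self.buffers[key] = data    # the application's write completes here
        self.dirty.add(key)         # remember that the disk copy is out of date

    def sync(self):
        """Save every modified block to disk, as the periodic flush would."""
        for key in sorted(self.dirty):
            self.disk[key] = self.buffers[key]
            self.disk_writes += 1
        self.dirty.clear()
```

Two writes to the same block followed by one `sync` cost a single physical write in this sketch, and only the latest contents are saved.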

Use of a cache in a standalone system that utilizes the AIX operating system, which is the environment in which the invention was implemented, improves the overall performance of the system disk and minimizes access time by eliminating the need for successive read and write disk operations.

In the distributed networking environment shown in FIG. 1, there are two ways the processing system 10C in local node C could read the file 5A from node A. In one way, the processing system 10C could copy the whole file 5A and then read it as if it were a local file 5C residing at node C. Reading the file in this way creates a problem if another processing system 10B at node B, for example, modifies the file 5A after the file 5A has been copied at node C. The processing system 10C would not have access to the latest modifications to the file 5A.

Another way for processing system 10C to access a file 5A at node A is to read one block at a time as the processing system at node C requires it. A problem with this method is that every read has to go across the network communications link 3 to the node A where the file resides. Sending the data for every successive read is time consuming.

Accessing files across a network presents two competing problems as illustrated above. One problem involves the time required to transmit data across the network for successive reads and writes. On the other hand, if the file data is stored at the accessing node to reduce network traffic, the file integrity may be lost. For example, if one of the several nodes is also writing to the file, the other nodes accessing the file may not be accessing the latest updated file that has just been written. As such, the file integrity is lost, and a node may be accessing incorrect and outdated files. Within this document, the term "server" will be used to indicate the processing system where the file is permanently stored, and the term "client" will be used to mean any other processing system having processes accessing the file. The invention to be described hereinafter is part of an operating system which provides a solution to the problem of managing distributed information.
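
The two competing problems can be made concrete with a small sketch that contrasts the whole-file copy with block-at-a-time reads. All names are hypothetical, and a counter of transfers stands in for traffic on the communications link.

```python
class ServerNode:
    """Node where the file permanently resides; counts network transfers."""

    def __init__(self, data, block_size):
        self.data = bytearray(data)
        self.block_size = block_size
        self.transfers = 0

    def fetch_whole_file(self):
        self.transfers += 1
        return bytes(self.data)

    def fetch_block(self, block_no):
        self.transfers += 1
        start = block_no * self.block_size
        return bytes(self.data[start:start + self.block_size])

    def write(self, offset, new_bytes):
        """A write by some other node, applied at the server."""
        self.data[offset:offset + len(new_bytes)] = new_bytes


def read_via_copy(local_copy, offset, length):
    """First approach: read from a copy made earlier; no network traffic."""
    return local_copy[offset:offset + length]


def read_via_blocks(server, offset, length):
    """Second approach: every block needed crosses the network on every read."""
    first = offset // server.block_size
    last = (offset + length - 1) // server.block_size
    data = b"".join(server.fetch_block(n) for n in range(first, last + 1))
    skip = offset - first * server.block_size
    return data[skip:skip + length]
```

After another node writes to the file, `read_via_copy` still returns the old bytes at no network cost, while `read_via_blocks` returns current bytes at the price of one transfer per block on every read.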

Other approaches to supporting a distributed data processing system in the UNIX operating system environment are known. For example, Sun Microsystems has released a Network File System (NFS) and Bell Laboratories has developed a Remote File System (RFS). The Sun Microsystems NFS has been described in a series of publications including S. R. Kleiman, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX", Conference Proceedings, USENIX 1986 Summer Technical Conference and Exhibition, pp. 238 to 247; Russel Sandberg et al., "Design and Implementation of the Sun Network Filesystem", Conference Proceedings, Usenix 1985, pp. 119 to 130; Dan Walsh et al., "Overview of the Sun Network File System", pp. 117 to 124; JoMei Chang, "Status Monitor Provides Network Locking Service for NFS"; JoMei Chang, "SunNet", pp. 71 to 75; and Bradley Taylor, "Secure Networking in the Sun Environment", pp. 28. The AT&T RFS has also been described in a series of publications including Andrew P. Rifkin et al., "RFS Architectural Overview", USENIX Conference Proceedings, Atlanta, Ga. (June 1986), pp. 1 to 12; Richard Hamilton et al., "An Administrator's View of Remote File Sharing", pp. 1 to 9; Tom Houghton et al., "File Systems Switch", pp. 1 to 2; and David J. Olander et al., "A Framework for Networking in System V", pp. 1 to 8.

One feature of the distributed services system in which the subject invention is implemented which distinguishes it from the Sun Microsystems NFS, for example, is its handling of server state. Sun's approach was to design what is essentially a stateless machine. More specifically, the server in a distributed system may be designed to be stateless. This means that the server does not store any information about client nodes, including such information as which client nodes have a server file open, whether client processes have a file open in read_only or read_write modes, or whether a client has locks placed on byte ranges of the file. Such an implementation simplifies the design of the server because the server does not have to deal with error recovery situations which may arise when a client fails or goes off-line without properly informing the server that it is releasing its claim on server resources.

An entirely different approach was taken in the design of the distributed services system in which the present invention is implemented. More specifically, the distributed services system may be characterized as a "stateful implementation". A "stateful" server, such as that described here, does keep information about who is using its files and how the files are being used. This requires that the server have some way to detect the loss of contact with a client so that accumulated state information about that client can be discarded. The cache management strategies described here, however, cannot be implemented unless the server keeps such state information. The management of the cache is affected, as described below, by the number of client nodes which have issued requests to open a server file and the read/write modes of those opens.

SUMMARY OF THE INVENTION

It is therefore a general object of this invention to provide a distributed services system for an operating system which supports a multiprocessor data processing system interconnected in a communications network that provides user transparency as to file location in the network and as to performance.

It is another, more specific object of the invention to provide a technique for providing a distributed file management system (DFS) with a file access structure lock (fas_lock) for preventing the problem of deadlocks.

According to the invention, these objects are accomplished by creating a fas_lock for each file accessed from a remote system. The fas_lock is locked in place of the file's inode. This makes it possible for the DFS to regulate accesses to files and avoid the problem of a deadlock occurring.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, aspects and advantages of the invention will be better understood from the following detailed description of the preferred embodiment of the invention with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram showing a typical distributed data processing system in which the subject invention is designed to operate;

FIG. 2 is a block diagram illustrating a typical standalone processor system;

FIG. 3 is a flowchart showing the steps performed by an operating system when a read system call is made by an application running on a processor;

FIG. 4 is a block diagram of the data structure illustrating the scenario for following a path to a file operation at a local node as performed by the operating system which supports the subject invention;

FIGS. 5 and 6 are block diagrams of the data structures illustrating the before and after conditions of the scenario for a mount file operation at a local node as performed by the operating system;

FIG. 7 is a block diagram, similar to FIG. 1, showing a distributed data processing system according to the invention;

FIG. 8 is a block diagram of the data structure for the distributed file system shown in FIG. 7;

FIGS. 9A to 9F are block diagrams of component parts of the data structure shown in FIG. 8;

FIGS. 10, 11 and 12 are block diagrams of the data structures illustrating the scenarios for a mount file operation and following a path to a file at a local and remote node in a distributed system as performed by the operating system;

FIG. 13 is a block diagram showing in more detail a portion of the distributed data processing system shown in FIG. 7;

FIG. 14 is a state diagram illustrating the various synchronization modes employed by the operating system which supports the present invention;

FIG. 15 is a block diagram, similar to FIG. 13, which illustrates the synchronous mode operations;

FIG. 16 is a state diagram, similar to the state diagram of FIG. 14, which shows an example of the synchronization modes of the distributed file system;

FIG. 17 is a diagram showing the control flow of accesses to a file by two client nodes;

FIG. 18 is a diagram showing a deadlock when two operations are currently executing; and

FIG. 19 is a diagram showing the execution steps of an open request from a client node.

DESCRIPTION OF THE PREFERRED EMBODIMENT

The following disclosure describes solutions to problems which are encountered when creating a distributed file system in which the logic that manages a machine's files is altered to allow files that physically reside in several different machines to appear to be part of the local machine's file system. The implementation described is an extension of the file system of the AIX operating system. Reference should be made to the above-referenced Technical Reference for more information on this operating system. Specific knowledge of the following AIX file system concepts is assumed: tree structured file systems; directories; and file system organization, including inodes.

In a UNIX operating system, an individual disk (or diskette or partition of a disk) contains a file system. The essential aspects of the file system that are relevant to this discussion are listed below:

a) each file on an individual file system is uniquely identified by its inode number;

b) directories are files, and thus a directory can be uniquely identified by its inode number;

c) a directory contains an array of entries of the following form:

name--inode number, where the inode number may be that of an individual file or that of another directory; and

d) by convention, the inode number of the file system's root directory is inode number 2.

Following the path "/dir1/dir2/file" within a device's file system thus involves the following steps:

1. Read the file identified by inode number 2 (the device's root directory).

2. Search the directory for an entry with name=dir1.

3. Read the file identified by the inode number associated with dir1 (this is the next directory in the path).

4. Search the directory for an entry with name=dir2.

5. Read the file identified by the inode number associated with dir2 (this is the next directory in the path).

6. Search the directory for an entry with name=file.

7. The inode number associated with file in this directory is the inode number of the file identified by the path "/dir1/dir2/file".
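The seven steps above can be sketched in a few lines of Python. This is a minimal illustration only: the dictionary model of a device and the inode numbers 10, 11 and 12 are assumptions made for the example, not part of the disclosed system.

```python
# Hypothetical in-memory model of one device's file system:
# `directories` maps a directory's inode number to its {name: inode number} entries.
ROOT_INODE = 2  # by convention, the root directory is inode number 2


def resolve(directories, path):
    """Return the inode number named by an absolute path, or None if not found."""
    inode = ROOT_INODE                      # step 1: start at the root directory
    for name in path.strip("/").split("/"):
        if name == "":
            continue                        # the path was "/" alone
        entries = directories.get(inode)    # read the current directory
        if entries is None or name not in entries:
            return None                     # not a directory, or no such entry
        inode = entries[name]               # move to the next inode in the path
    return inode


# Illustrative directory contents (inode numbers are made up).
directories = {
    2: {"dir1": 10},
    10: {"dir2": 11},
    11: {"file": 12},
}
```

With this data, `resolve(directories, "/dir1/dir2/file")` walks inodes 2, 10 and 11 in turn and returns the inode number recorded for `file`.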

The file trees which reside on individual file systems are the building blocks from which a node's aggregate file tree is built. A particular device (e.g., hard file partition) is designated as the device which contains a node's root file system. The file tree which resides on another device can be added to the node's file tree by performing a mount operation. The two principal parameters to the mount operation are (1) the name of the device which holds the file tree to be mounted and (2) the path to the directory upon which the device's file tree is to be mounted. This directory must already be part of the node's file tree; i.e., it must be a directory in the root file system, or it must be a directory in a file system which has already been added (via a mount operation) to the node's file tree.

After the mount has been accomplished, paths which would ordinarily flow through the "mounted over" directory instead flow through the root inode of the mounted file system. A mount operation proceeds as follows:

1. Follow the path to the mount point and get the inode number and device number of the directory which is to be covered by the mounted device.

2. Create a data structure which contains essentially the following:

a) the device name and inode number of the covered directory; and

b) the device name of the mounted device.

Path following in the node's aggregate file tree consists of (a) following the path in a device file tree until encountering an inode which has been mounted over (or, of course, the end of the path); (b) once a mount point is encountered, using the mount data structure to determine which device is next in the path; and (c) continuing to follow the path at inode 2 (the root inode) in the device indicated in the mount structure.
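The traditional whole-device mount and the path following just described might be sketched as follows. This is a simplified model under stated assumptions: the device names `hd0` and `hd1`, the dictionary layout, and the function names `follow` and `mount` are all hypothetical.

```python
ROOT_INODE = 2


def follow(devices, mounts, root_device, path):
    """Follow an absolute path through a node's aggregate file tree.

    devices: {device name: {directory inode number: {name: inode number}}}
    mounts:  {(device name, covered inode number): mounted device name}
    Returns (device name, inode number), or None if a component is missing.
    """
    device, inode = root_device, ROOT_INODE
    for name in [c for c in path.split("/") if c]:
        entries = devices[device].get(inode)
        if entries is None or name not in entries:
            return None
        inode = entries[name]                       # (a) follow the device tree
        while (device, inode) in mounts:            # (b) a mount point was hit
            device, inode = mounts[(device, inode)], ROOT_INODE  # (c) restart at inode 2
    return device, inode


def mount(devices, mounts, root_device, mounted_device, mount_point):
    """Record that mounted_device covers the directory named by mount_point."""
    covered = follow(devices, mounts, root_device, mount_point)
    if covered is None:
        raise FileNotFoundError(mount_point)
    mounts[covered] = mounted_device                # volatile: kept in memory only


devices = {
    "hd0": {2: {"u": 15}, 15: {}},
    "hd1": {2: {"dept54": 71}, 71: {"status": 12}},
}
mounts = {}
mount(devices, mounts, "hd0", "hd1", "/u")
```

After the mount, `follow(devices, mounts, "hd0", "/u/dept54/status")` crosses from `hd0` into `hd1` at `/u`. Because `mounts` lives only in memory, it would have to be rebuilt after every restart, which mirrors the volatility noted in the text that follows.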

The mount data structures are volatile; they are not recorded on disk. The list of desired mounts must be re-issued each time the machine is powered up as part of the Initial Program Load (IPL). The preceding discussion describes how traditional UNIX operating systems use mounts of entire file systems to create file trees and how paths are followed in such a file tree. Such an implementation is restricted to mounting the entire file system which resides on a device. The invention described herein is based on an enhancement, embodying the concept of a virtual file system, which allows (1) mounting a portion of the file system which resides on a device by allowing the mounting of directories in addition to allowing mounting of devices, (2) mounting either remote or local directories over directories which are already part of the file tree, and (3) mounting of files (remote or local) over files which are already part of the file tree.

In the virtual file system, the operations which are performed on a particular device file system are clearly separated from those operations which deal with constructing and using the node's aggregate file tree. A node's virtual file system allows access to both local and remote files.

The management of local files is a simpler problem than management of remote files. For this reason, the discussion of the virtual file system is broken into two parts. The first part describes only local operations. This part provides a base from which to discuss remote operations. The same data structures and operations are used for both remote and local operations. The discussion on local operations describes those aspects of the data and procedures which are relevant to standalone operations. The discussion on remote operations adds information pertinent to remote operations without, however, reiterating what was discussed in the local operations section.

FIG. 4 shows the relationship that exists among the data structures of the virtual file system. Every mount operation creates a new virtual file system (vfs) data structure. The essential elements in this structure are (a) a pointer to the root vnode (virtual node) of this virtual file system (e.g., the arrow from block 21 to block 23), and (b) a pointer to the vnode which was mounted over when this virtual file system was created (e.g., the arrow from block 25 to block 24).

Whenever an inode needs to be represented in the file system independent portion of the system, it is represented by a vnode. The essential elements in this structure are the following:

a) a pointer to the vfs which contains the vnode (e.g., the arrow from block 22 to block 21);

b) a pointer to the vfs which is mounted over this vnode (e.g., the arrow from block 24 to block 25; but note however that not all vnodes are the mount point for a virtual file system, i.e., a null pointer indicates that this vnode is not a mount point);

c) a pointer to either a surrogate inode or a real inode (e.g., the arrow from block 26 to block 32); and

d) a pointer to a node table entry (this is non-null only when the file is a remote file).
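The vfs and vnode structures listed above can be pictured with a short Python sketch. The class and field names are illustrative assumptions, as are the inode numbers; only the four kinds of pointer come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Inode:
    number: int


@dataclass(eq=False)
class Vfs:
    root_vnode: Optional["Vnode"] = None          # (a) root vnode of this vfs
    mounted_over_vnode: Optional["Vnode"] = None  # (b) vnode covered by this vfs


@dataclass(eq=False)
class Vnode:
    vfs: Vfs                                   # a) vfs which contains this vnode
    inode: Inode                               # c) real or surrogate inode
    mounted_vfs: Optional[Vfs] = None          # b) vfs mounted over this vnode
    node_table_entry: Optional[object] = None  # d) non-null only for remote files

    def is_mount_point(self):
        # A null pointer means this vnode is not a mount point.
        return self.mounted_vfs is not None


# A root vfs with a directory "/u" that is covered by a second vfs.
root_vfs = Vfs()
root_vnode = Vnode(root_vfs, Inode(2))
root_vfs.root_vnode = root_vnode

u_vnode = Vnode(root_vfs, Inode(15))
mounted_vfs = Vfs(mounted_over_vnode=u_vnode)
mounted_vfs.root_vnode = Vnode(mounted_vfs, Inode(45))
u_vnode.mounted_vfs = mounted_vfs
```

The structures refer to each other in both directions (a vnode points at the vfs mounted over it, and that vfs points back at the covered vnode), so the sketch disables generated equality with `eq=False` and compares objects by identity.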

The AIX operating system, in common with other UNIX operating systems, keeps a memory resident table which contains information about each inode that is being used by the system. For instance, when a file is opened, its inode is read from the disk and a subset of this inode information, together with some additional information, is stored in the inode table. The essential elements of an inode table entry are (a) a pointer to the head of a file access structure list and (b) information from the disk inode, the details of which are not relevant here.

The file access structure records information about which nodes have the file open, and about the modes (read_only or read_write) of these opens. There is a separate file access structure for each node which has the file open. This state information enables the server to know how each client is using the server file.
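As a rough illustration of this bookkeeping, the sketch below keeps one file access structure per node on a list hanging off the inode table entry. All names (`FileAccess`, `InodeTableEntry`, `record_open`, `forget_node`) are hypothetical, and a plain Python list stands in for the linked list.

```python
from dataclasses import dataclass, field


@dataclass
class FileAccess:
    node_id: str          # which node has the file open
    read_only: int = 0    # number of read_only opens from that node
    read_write: int = 0   # number of read_write opens from that node


@dataclass
class InodeTableEntry:
    inode_number: int
    access_list: list = field(default_factory=list)  # one FileAccess per node

    def access_for(self, node_id):
        """Return this node's file access structure, creating it if needed."""
        for fa in self.access_list:
            if fa.node_id == node_id:
                return fa
        fa = FileAccess(node_id)
        self.access_list.append(fa)
        return fa

    def record_open(self, node_id, write):
        fa = self.access_for(node_id)
        if write:
            fa.read_write += 1
        else:
            fa.read_only += 1

    def writing_nodes(self):
        return [fa.node_id for fa in self.access_list if fa.read_write > 0]

    def forget_node(self, node_id):
        """Discard accumulated state for a client the server has lost contact with."""
        self.access_list = [fa for fa in self.access_list if fa.node_id != node_id]


entry = InodeTableEntry(12)
entry.record_open("client_a", write=False)
entry.record_open("client_b", write=True)
entry.record_open("client_b", write=False)
```

Here `entry.writing_nodes()` reports that only `client_b` holds the file open for writing, which is the kind of question a stateful server needs to answer when choosing a cache strategy.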

The file system supports a set of operations which may be performed on it. A process interacts with a file system by performing a file system operation as follows:

1. The user calls one of the operations providing (perhaps) some input parameters.

2. The file system logic performs the operation, which may alter the internal data state of the file system.

3. The file system logic returns to the calling user, perhaps returning some return parameters.

The operations which can be performed on a file system are referred to as "vn_operations" or "vn_ops". There are several vn_ops, but the ones which are important to this discussion are described below:

VN_LOOKUP

In the vn_lookup operation, the essential iterative step in following a path in a file system is to locate the name of a path component in a directory file and use the associated inode number to locate the next directory in the chain. The pseudo code for the vn_lookup operation is listed below:

______________________________________
function lookup
  input:  directory vnode pointer,
          name to be looked up in directory
  output: vnode pointer to named file/dir.

  convert directory vnode pointer to an inode pointer;
      -- use private data pointer of vnode
  lock directory's inode;
  if( we don't have search permission in directory )
      unlock directory inode;
      return error;
  search directory for name;
  if( found )
      create file handle for name;
          -- use inode found in directory entry
      get pointer to vnode for file handle;
      unlock directory inode;
      return pointer to vnode;
  else -- not found
      unlock directory inode;
      return error;
______________________________________
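The pseudo code above might be rendered in runnable form roughly as follows. This is a sketch under stated assumptions rather than the AIX implementation: the `Inode` and `Vnode` classes, the `searchable` flag, and the use of the inode number as the file handle are all illustrative choices.

```python
import threading


class Inode:
    def __init__(self, number, entries=None, searchable=True):
        self.number = number
        self.entries = entries        # {name: Inode} for a directory, None for a file
        self.searchable = searchable  # stands in for the search-permission check
        self.lock = threading.Lock()


class Vnode:
    def __init__(self, inode):
        self.inode = inode            # the vnode's "private data pointer"


_vnode_table = {}                     # file handle (here: inode number) -> Vnode


def vnode_for(inode):
    """Return the vnode for an inode, creating it on first use."""
    if inode.number not in _vnode_table:
        _vnode_table[inode.number] = Vnode(inode)
    return _vnode_table[inode.number]


def vn_lookup(dir_vnode, name):
    """Look up name in a directory; raise on missing permission or missing name."""
    dir_inode = dir_vnode.inode
    with dir_inode.lock:              # the lock is released on every exit path
        if not dir_inode.searchable:
            raise PermissionError(name)
        found = (dir_inode.entries or {}).get(name)
        if found is None:
            raise FileNotFoundError(name)
        return vnode_for(found)


status = Inode(12)
dept54 = Inode(71, {"status": status})
dept54_vnode = Vnode(dept54)
```

The `with` statement takes the place of the three explicit "unlock directory inode" lines, since it releases the lock whether the function returns or raises.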

VN_OPEN

The function vn_open creates a file access structure (or modifies an existing one) to record the modes (READ/WRITE or READ_ONLY) in which a file is opened. The pseudo code for the vn_open operation is listed below:

______________________________________
function vn_open
  inputs: vnode pointer for file to be opened
          open flags (e.g., read-only, read/write)
          create mode -- file mode bits if creating
  output: return code indicating success or failure

  get pointer to file's inode from vnode;
  lock inode;
  if( not permitted access )
      unlock inode;
      return( error );
  get the file access structure for this client;
      -- if there is no file access structure, allocate one
  if( couldn't allocate file access structure )
      unlock inode;
      return( error );
  update file access structure read-only, read/write, and text counts;
  if( truncate mode is set )
      truncate file;
  unlock the inode;
______________________________________
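A runnable approximation of vn_open is sketched below. The flag values, the `readable`/`writable` permission fields, and the dictionary used as the file access structure are assumptions made for the example; the text count and the allocation-failure path are omitted, and the function takes the inode directly rather than a vnode.

```python
import threading

# Hypothetical flag values for this sketch only.
O_RDONLY, O_RDWR, O_TRUNC = 0, 1, 2


class Inode:
    def __init__(self, data=b"", readable=True, writable=True):
        self.data = bytearray(data)
        self.readable = readable
        self.writable = writable
        self.lock = threading.Lock()
        self.access = {}   # client id -> {"read_only": n, "read_write": n}


def vn_open(inode, client, flags):
    """Record an open by `client`; return 0 on success, 1 on an access error."""
    want_write = bool(flags & O_RDWR)
    with inode.lock:                       # lock inode; unlocked on every return
        if not inode.readable or (want_write and not inode.writable):
            return 1                       # not permitted access
        # Get the file access structure for this client, allocating if absent.
        fa = inode.access.setdefault(client, {"read_only": 0, "read_write": 0})
        fa["read_write" if want_write else "read_only"] += 1
        if flags & O_TRUNC:
            del inode.data[:]              # truncate the file
        return 0


f = Inode(b"abc")
first = vn_open(f, "client_a", O_RDONLY)
second = vn_open(f, "client_a", O_RDWR | O_TRUNC)
```

Both opens by `client_a` land in the same file access structure, one counted as read-only and one as read/write, and the second open empties the file.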

LOOKUPPN

The lookuppn operation is the function which follows paths. Its input is a path (e.g., "/dir1/dir2/file"), and its return is a pointer to the vnode which represents the file. Lookuppn calls vn_lookup to read one directory, then it checks to see if the vnode returned by vn_lookup has been mounted over. If the vnode is not mounted over, then lookuppn calls vn_lookup in the same file system. If the vnode has been mounted over, then lookuppn follows the pointer from the mounted over vnode (e.g., block 24 in FIG. 4) to the vfs of the mounted file system (e.g., block 25 in FIG. 4). From the vfs, it follows the pointer to the root vnode (e.g., block 26 in FIG. 4) and issues a new vn_lookup giving as input the vfs's root vnode and the name which constitutes the next element in the path. The pseudo code for the lookuppn function is listed below:

______________________________________
function lookuppn
  input:  pathname
  output: pointer to vnode for named file

  if( first character of path is `/` )
      current vnode for search is user's root directory vnode;
  else
      current vnode for search is user's current directory vnode;
  repeat
      if( next component of path is ".." )
          while( current vnode is root of a virtual file system )
              current vnode becomes the vnode that the virtual file system is mounted over;
              if( there is no mounted over vnode )
                  return( error ); -- ".." past root of file system
      use vn_lookup to look up path component in current vnode;
      if( vn_lookup found component )
          current vnode becomes the vnode returned by vn_lookup;
          while( current vnode is mounted over )
              follow current vnode's pointer to vfs structure that represents the mounted virtual file system;
              current vnode becomes root vnode of the mounted vfs;
      else -- vn_lookup couldn't find component
          return( error ); -- search failed
  until( there are no additional path components );
  return( current vnode );
______________________________________
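The lookuppn pseudo code can be exercised with a small in-memory model. The sketch below is an approximation under stated assumptions: directory entries are held directly on the vnode as a dictionary (including a ".." entry where one exists), vn_lookup is reduced to a dictionary lookup with no locking, and `None` stands in for the error return.

```python
class Vfs:
    def __init__(self):
        self.root = None      # root vnode of this virtual file system
        self.covered = None   # vnode this vfs is mounted over (None for the root vfs)


class Vnode:
    def __init__(self, vfs, entries=None):
        self.vfs = vfs
        self.entries = entries if entries is not None else {}  # name -> Vnode
        self.mounted = None   # vfs mounted over this vnode, if any


def vn_lookup(dir_vnode, name):
    """Simplified stand-in: return the named vnode, or None if absent."""
    return dir_vnode.entries.get(name)


def lookuppn(path, root, cwd):
    current = root if path.startswith("/") else cwd
    for name in [c for c in path.split("/") if c]:
        if name == "..":
            # Step back across mount points before looking up "..".
            while current is current.vfs.root:
                current = current.vfs.covered
                if current is None:
                    return None               # ".." past root of file system
        current = vn_lookup(current, name)
        if current is None:
            return None                       # search failed
        while current.mounted is not None:    # descend into mounted file systems
            current = current.mounted.root
    return current


# Root vfs containing "/u", which is covered by a second vfs.
root_vfs = Vfs()
root = Vnode(root_vfs)
root_vfs.root = root
u = Vnode(root_vfs, {"..": root})
root.entries["u"] = u

mnt = Vfs()
mnt_root = Vnode(mnt)
mnt.root, mnt.covered = mnt_root, u
u.mounted = mnt

dept54 = Vnode(mnt, {"..": mnt_root})
mnt_root.entries["dept54"] = dept54
status = Vnode(mnt)
dept54.entries["status"] = status
```

In this model, looking up ".." from the root of the mounted file system first steps back to the covered vnode for "/u" and only then consults its ".." entry, so the result is the root directory rather than anything inside the mounted tree.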

The operation will be illustrated by describing the scenarios of following a path to a file and mounting a directory. First, in following a path to a file, suppose an application process issues a system call (e.g., open) for file "/u/dept54/status". This request is accomplished by the operating system in the following manner with reference to FIG. 4 (operations which are basically unchanged from the UNIX operating system are not explained here in any detail). The following assumptions are made: First, the vfs represented by block 21 is the root virtual file system. Second, the file "/u" is represented by vnode block 24 and inode block 31. Third, a previous mount operation has mounted a device onto the directory "/u". This mount created the vfs represented by block 25. Fourth, all of the directories and files involved are on the same device. Fifth, the following directory entries exist in the indicated directories:

______________________________________
DIRECTORY
INODE NUMBER    NAME        INODE NUMBER
______________________________________
2               "u"         15
45              "dept54"    71
71              "status"    12
______________________________________

The code which implements the system call calls lookuppn to follow the path. Lookuppn starts at the root vnode (block 23) of the root virtual file system (block 21) and calls vn_lookup to look up the name "u" in the directory file represented by this vnode. Vn_lookup finds in the directory that the name "u" is associated with ino