WikiPatents - Community Patent Review
Journalling optimization system and method for distributed computations    
United States Patent 5371889
Link to this page: http://www.wikipatents.com/5371889.html
Inventor(s): Klein; Johannes (San Francisco, CA)
Abstract: A protocol analysis system is provided with data specifying the defined states of processes participating in a distributed computation. State transitions between states are specified as being enabled by (A) receiving a message, (B) unreliably sending a message, or (C) performing an external action such as reliably sending a message. The specification data also identifies process states known to be final states, and all other states are initially denoted as intermediate states. The protocol analysis system determines if any intermediate states can be re-categorized as final states. Then it determines if any state transitions initially identified as unreliable send operations must be treated as derived external actions, and thus made reliable. Thirdly, for each derived external action, the states of the affected application process must be re-evaluated so as to determine if derived final states need to be converted into intermediate states. The resulting determinations as to which states are final states and which messages must be reliably sent are recorded and used to govern execution of the application process. When executing the application process, state transitions entering and leaving intermediate states are normally recorded on stable storage before the state transition is carried out and reliably sent messages are normally recorded on stable storage before being sent. A number of run-time journal optimization techniques reduce the number of state transitions and messages that need to be stored on stable storage.

Title Information
Inventor     Klein; Johannes (San Francisco, CA)
Owner/Assignee     Digital Equipment Corporation (Maynard, MA)
Publication Date     December 6, 1994
Application Number     08/051,523
Filing Date     April 22, 1993
US Classification     718/106 718/108 719/313
Int'l Classification     G06F 015/16 G06F 009/40
Examiner     Kriess; Kevin A.
Assistant Examiner     Katbab; A.
Attorney/Law Firm     Cefalo; Albert P., Young; Barry N.
USPTO Field of Search     395/650 364/DIG. 1 364/285 364/285.1 364/285.2 364/285.3
Patent Tags     journalling optimization distributed computations
   
References
U.S. References
5214780     Ingoglia     718/106     May 1993
5157779     Washburn     714/37     Oct 1992
5086386     Islam     Feb 1992
Claims


What is claimed is:

1. In a computer system executing application processes that interactively perform a distributed computation, a method performed by said computer system of executing the distributed computation, the method comprising:

identifying, using a first set of journalling criteria, for each application process first states of said application process that require journalling of state information and events causing a transition from one of said first states to another state of said application process, said first set of journalling criteria providing criteria for synchronous journalling wherein a message is sent to indicate a transition into an intermediate state after receiving confirmation that said journalling has been successfully completed;

identifying, while executing each application process, and using a second set of journalling criteria that supersede said first set of journalling criteria, second states of each application process that can be journalled asynchronously and events enabling transitions between ones of said states that can be journalled asynchronously, said second set of journalling criteria utilizing state and event information determined during said executing; and

synchronously journalling, while executing each application process, said first states of said application process which have not been determined as being one of said second states; and asynchronously journalling said second states of said application process.

2. The method of claim 1 further comprising:

identifying while executing each application process a sequence of states using predefined criteria, said sequence comprising a first state and a last state and said predefined criteria specifying when journalling may be avoided for all other states of said sequence; and

journalling while executing said application process only said first and last states of said identified sequence of states.

3. The method of claim 2 wherein said sequence of states includes at least one state transition enabled by reliably sending a message, and the method further includes delaying journalling of said last state until receipt of said reliably sent message has been acknowledged by its recipient.

4. The method of claim 3 wherein said sequence of states includes at least one state transition caused by receiving a reliably sent message from another application process, and the method further includes delaying sending an acknowledgement of receiving said reliably sent message to said other application until said last state has been journalled.

5. In a computer system executing processes that interactively perform a distributed computation, a method performed by said computer system of executing the distributed computation, the method comprising:

identifying using a first set of journalling criteria, for each application process first states of said application process that require journalling of state information and events causing a transition from one of said first states to another state of said application process, said first set of journalling criteria providing criteria for synchronous journalling wherein a message is sent to indicate a transition into an intermediate state after receiving confirmation that said journalling has been successfully completed;

identifying, while executing each application, using a second set of journalling criteria that supersede said first set of journalling criteria, a sequence of states starting with a first final state and having only one state transition enabled by a reliable send operation and ending in a second final state, said second set of journalling criteria utilizing state and event information to identify said sequence of states that may be reordered to avoid journalling;

reordering, while executing each application process, said sequence of states to produce a reordered sequence of states such that said reliable send operation is executed first and other state transition events are executed later, and when executing said reordered sequence of states avoiding journalling of states and state transitions which follow said first identified final state in said reordered sequence, and delaying execution of said other state transition events until said one reliable send operation is acknowledged by a recipient.

6. The method of claim 1, wherein each state of said sequence of states of each application process is one of an initial state representing a start state of said application process, an intermediate state representing a state from which said application process is unable to terminate its execution, or a final state representing a state from which said application process is able to terminate its execution.

7. The method of claim 6, wherein said first set of journalling criteria comprises journalling to stable storage intermediate state information, final state information for a first final state entered after a sequence of intermediate states, and a message used for process communication prior to being sent if said message is sent repeatedly until a receiver process has acknowledged receipt of said message.

8. The method of claim 1, wherein each of said events causing a transition comprises one of receiving a message, reliably sending a message until received by another process, or unreliably sending a message which is not guaranteed to be received by another process.

9. The method of claim 1, wherein a first of said application processes shares a common journal with another of said application processes, and said first of said application processes uses asynchronous journalling.

10. The method of claim 2, wherein said sequence of states contains no conditional branches from a state belonging to said sequence.

11. The method of claim 2, wherein said predefined criteria uses runtime information obtained during said executing to determine when journalling may be avoided.

12. The method of claim 5, wherein said second set of journalling criteria uses runtime information obtained during said executing to identify said sequence of states that may be reordered to avoid journalling.
Description


The present invention relates generally to distributed processing computer systems in which a distributed application is performed by multiple processes that coordinate their computations by exchanging messages. More particularly, the present invention relates to a system and method for automatically determining which states of each process in a distributed application must be journalled in order to ensure recovery of the distributed application from system failures at any point in time.

BACKGROUND OF THE INVENTION

Referring to FIG. 1, the present invention concerns interactions and interdependencies of agents 102-1 through 102-N cooperating in a distributed processing computer system 100. Depending on the operating system used, each agent 102 may be a thread or process, and thus is a unit that executes a computation or program. Some of the agents 102-1 through 102-N may be executing on one data processing unit while others are executing at remote sites on other data processing units. More generally, agents can be hosted on different computer systems using different operating systems.

In addition to application processes executed by agents 102, the distributed system 100 also includes "external" devices 104 (i.e., external to the agents 102) with which messages are exchanged, journal processes 106 that record state information on stable, non-volatile, storage 108, and at least one restart manager process 109 that restarts other processes in the system after a failure. Application processes use journal services to record state information on stable storage. To further protect from failures, journal processes 106 often store data on two or more non-volatile storage media to compensate for the unreliability of storage devices such as magnetic disks. Typically, data written to a journal service is recorded on stable storage in the order received, and the data stored on stable storage cannot be modified, making write operations to stable storage irrevocable operations.

In this document the terms "journal," "journal process" and "journalling process" are used interchangeably. All refer to a process for storing information on stable storage to enable consistent recovery and completion of a distributed computation after a failure of the computer system, a part of the computer system, or any process running on that computer system.

The restart manager process 109 is used when a computer system is powered on or reset after a system failure. It uses information stored on stable storage 108 to determine the state in which each application process 102 is to be restarted. A communications path or bus 110 interconnects the various processes 102, 106, 109 and devices 104 in the system 100.

Each agent's application program is, in the context of the present invention, considered to be a finite state machine which progresses through a sequence of internal states. Complex computations are mapped into simpler sets of states suitable for synchronization with other computations.

Application processes execute user-defined programs and synchronize their execution by exchanging messages. In any particular application process, a set of protocols defines the types of messages sent, as well as the applicable constraints thereon--i.e., the circumstances under which each message type is to be sent and/or received. Such constraints define order and coexistence requirements between messages.

Computer processes can fail due to software errors or hardware failures. Failures can cause messages and process state information stored in a computer's volatile memory to be corrupted, lost, or otherwise unusable. However, state information recorded on external devices such as disks, terminals, etc. remains in existence independent of process failures. As a result, state transitions are called external actions if they cause information to be recorded on external devices.

If state information has been recorded on external devices, execution of an agent may have to continue after a process or system failure if the computation being performed by the agent was interrupted and the agent has not already entered a final state. To continue processing and consistently complete protocols in the presence of failures, the "process state" of each agent typically needs to be stored on stable (nonvolatile) storage. To compensate for lost messages and to ensure protocol termination, a message may need to be recorded on stable storage and sent repeatedly until received.

It is a premise of the present invention, as well as a premise of most distributed computer processing systems, that processes have to continue or resume execution even after a failure if resources are left in an intermediate state. Such situations arise when processes are interrupted while performing multiple related external actions such as dispensing money at a teller machine, updating secondary storage, or setting machinery. Premature termination of such processes would potentially leave devices in an intermediate, usually inconsistent state, cause machinery to be blocked, allow money to be withdrawn incorrectly, or cause other kinds of inconsistencies. Extensive studies of these types of scenarios have been made in the area of transaction processing systems and database systems.

To ensure that an interrupted process can continue execution, it is common practice to use a "journalling process" to store state information regarding each intermediate state of each constituent process. The problem addressed by the present invention concerns the high cost of journalling state information for the intermediate states of an application process. In particular, each journalling operation uses scarce system resources, and also slows down the progress of the application process because of the requirement that state information be stored on stable storage before the actions associated with a subsequent state transition are performed.

It has been recognized in the past that many protocols can be modified so as to reduce the associated journalling requirements. For instance, there are a number of variations on the so-called "two phase commit" protocol used in transaction processing, designed to avoid journalling one or more states that would otherwise have been considered to require such journalling. For systems handling millions of transactions, avoiding one journalling step per transaction is a significant savings.

In the past, such adjustments to protocols to avoid journalling have been performed manually on an ad hoc basis. The present invention provides an automated system and method for identifying the states of each agent participating in a distributed computation that must be journalled and the states that do not need to be journalled.

In contrast to other techniques used to ensure correct execution of protocols in the presence of failures, the present invention does not require a process state to be checkpointed on each send operation nor does it require processes to execute a snapshot protocol. Rather, the present invention assumes that the behavior derived for each finite state machine ensures correct execution.

SUMMARY OF THE INVENTION

In summary, the present invention is a system and method for automatically determining recoverable states of processes participating in the execution of a distributed application. The protocol analysis system of the present invention is provided with an initial set of data specifying all the states of the finite state machines used in a particular application process, and data specifying whether the state transition from each state to another state is caused by (A) receiving a message, (B) unreliably sending a message, or (C) performing an external action, which is equivalent to reliably sending a message. For each finite state machine, the initial specification data also indicates which state is the initial state, and which other states are known to be final states. Final states are states from which the finite state machine can immediately terminate its execution. All states not initially denoted as final states are initially denoted as intermediate (i.e., non-final) states.

From the initial specification, the protocol analysis system first determines if any intermediate states can be re-categorized as final states. Secondly, the protocol analysis determines if any state transitions initially identified as unreliable message send operations must be treated as "derived external actions", and thus made reliable. Derived external actions are messages that must be sent reliably in order to ensure that local protocols eventually terminate. Thirdly, after derived external actions have been identified, all newly derived final states have to be checked to ensure that they still satisfy the criteria for a final state. If not, the state is removed from the set of final states and added to the set of intermediate states. The resulting determinations as to which states are final states and which messages must be reliably sent are recorded and used to govern execution of the application process.

In accordance with normal application recovery requirements, when executing an agent (i.e., application process) whose behavior is defined by a finite state machine, intermediate states are recorded on stable storage as well as the first final state entered after a sequence of intermediate states. Messages which must be sent reliably are journalled (i.e., recorded on stable storage) before being sent and are re-sent until it is known that the receiving agent terminated or that the receiving agent recorded receipt of that message on stable storage. Before an agent sends a message to another agent, all associated message journalling and state transition journalling has to be completed, which means that the agent must await acknowledgement from the journalling process that the journalling operation was successful.

Upon recovery from a system failure, an agent must continue execution if the last state of the agent recorded on stable storage is an intermediate state. Otherwise execution of the agent is considered to be terminated. Once an agent has reached a final state and all acknowledgments expected by the agent have been received, then all information journalled for that agent can be discarded (i.e., archived).
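
The restart rule above can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the journal is modelled as an in-memory list of `(agent, state)` records in the order written, and the function name `agents_to_restart`, the agent names, and the state names are hypothetical rather than taken from the patent.

```python
def agents_to_restart(journal, final_states):
    """Return {agent: state} for agents that must resume after a failure.

    journal      -- list of (agent, state) records in the order journalled
    final_states -- dict mapping each agent to its set of final states
    """
    last_state = {}
    for agent, state in journal:
        last_state[agent] = state          # later records override earlier ones
    # An agent resumes only if its last journalled state is intermediate.
    return {
        agent: state
        for agent, state in last_state.items()
        if state not in final_states.get(agent, set())
    }


# Hypothetical journal contents, used only to exercise the sketch.
journal = [
    ("coordinator", "collecting"),
    ("participant", "prepared"),
    ("coordinator", "committed"),
]
finals = {
    "coordinator": {"committed", "aborted"},
    "participant": {"committed", "aborted"},
}
resume = agents_to_restart(journal, finals)
```

In this example only the participant is resumed, because the last state journalled for the coordinator is a final state and its execution is therefore considered terminated.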

A number of these normal application recovery requirements are modified by the journalling optimization procedures of the present invention. These optimization procedures detect situations in which journalling actions that would otherwise be required can be avoided or modified, thereby enabling the affected application processes to run faster and more efficiently. The final/intermediate state re-categorization procedure of the present invention and the journalling optimization procedures of the present invention are defined and performed in a general fashion so as to be applicable to any distributed computation, making it possible to optimize journalling operations for any specified distributed computation.

BRIEF DESCRIPTION OF THE DRAWINGS

Additional objects and features of the invention will be more readily apparent from the following detailed description and appended claims when taken in conjunction with the drawings, in which:

FIG. 1 is a block diagram of a distributed data processing system with a number of interdependent agents.

FIGS. 2A and 2B show state transition diagrams demonstrating how intermediate states can be re-categorized as final states.

FIG. 3 depicts a finite state machine specification table used as the starting point of the protocol analysis performed by the present invention.

FIG. 4 is a block diagram of a data processing system for performing the protocol analysis procedures of the present invention and for executing local protocols associated with one or more agents participating in an application process.

FIG. 5 depicts a state transition diagram of a two-phase commit protocol prior to application of the protocol analysis methodology of the present invention.

FIGS. 6 and 7 depict two modified versions of the state transition diagram of FIG. 5 for "presumed commit" and "presumed abort" versions of the two-phase commit protocol, respectively.

FIG. 8A shows the schema of a version of the finite state machine specification table shown in FIG. 3, modified to indicate when asynchronous journalling should be performed. FIG. 8B is a block diagram of a journal process that is utilized or shared by more than one application process.

FIGS. 8C and 8D are flow charts of a procedure for performing a state transition in a system that utilizes both synchronous and asynchronous journalling of state transitions and reliably sent messages.

FIGS. 9A and 9B depict a sequence of state transitions of an application process before and after application of a journalling optimization procedure of the present invention.

FIG. 10 schematically represents an optimization technique for avoiding journalling of all but the first and last of a sequence of states.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the context of the present invention, one "finite state machine" is associated with each process or agent participating in a protocol. Each finite state machine is defined in terms of a set of states and state transitions caused by events (such as sending and receiving messages). The states of the finite state machine always include an initial state and a set of final states, and usually include a set of intermediate states. Execution of a state machine starts in an initial state and can be terminated after reaching any of the final states. The occurrence of a specified event causes a state transition and determines a subsequent state. Each "state" represents the status of one finite state machine (i.e., agent).
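
A minimal sketch of how such a finite state machine might be represented is given below in Python. The names `EventKind`, `Transition` and `StateMachine` are illustrative assumptions, not part of the patent; the three event kinds mirror the categories (A), (B) and (C) named in the Summary.

```python
from dataclasses import dataclass, field
from enum import Enum


class EventKind(Enum):
    RECEIVE = "receive"            # (A) receiving a message
    UNRELIABLE_SEND = "send"       # (B) unreliably sending a message
    EXTERNAL_ACTION = "external"   # (C) external action, i.e. a reliable send


@dataclass(frozen=True)
class Transition:
    source: str
    kind: EventKind
    message: str
    target: str


@dataclass
class StateMachine:
    initial: str
    final: set
    transitions: list = field(default_factory=list)

    def states(self):
        found = {self.initial} | set(self.final)
        for t in self.transitions:
            found.update((t.source, t.target))
        return found

    def intermediate(self):
        # Every state not denoted as final is treated as intermediate.
        return self.states() - set(self.final)

    def step(self, state, message):
        # Return the next state for an event, or None if it is not enabled.
        for t in self.transitions:
            if t.source == state and t.message == message:
                return t.target
        return None


# Hypothetical three-state machine used only to exercise the sketch.
demo = StateMachine(
    initial="A",
    final={"C"},
    transitions=[
        Transition("A", EventKind.RECEIVE, "request", "B"),
        Transition("B", EventKind.EXTERNAL_ACTION, "commit", "C"),
    ],
)
```

Keeping the event kind on each transition, rather than on the state, matches the way the specification data is described: it is the transitions that are classified as receive, unreliable send, or external action.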

The term "external action" is used herein to refer to any action that causes state information to be recorded on an external device. For example, storing data on stable storage is an external action, as is displaying a message on a terminal or setting machinery to a particular state.

States are characterized in the present invention as intermediate or final. Intermediate states (which may or may not include the initial state) require the occurrence of further events prior to termination of the process, while final states do not require the occurrence of any further events. To ensure that further events and state transitions occur, despite any failures that may occur, intermediate states must be recoverable, which means that sufficient status information must be stored on stable storage to restart the associated process at that state.

Events identify send or receive operations. As will be discussed in more detail below, certain send operations must be executed reliably such that at least one message is delivered successfully. Other send operations need not be reliable.

For example, referring to FIG. 2A, consider a process which controls operation of a machine. The four states of the process are labelled S1, S2, S3 and S4. The process moves from an initial state S1 to state S2 after a start message is sent to the machine, thereby initiating execution of the machine. Execution of the machine is stopped after receiving an "off" command from a human operator or another process, which moves the process to state S3. To turn off the machine, the process initiates a "stop" operation, which is an external action. Execution of the protocol for this process must continue until the machine has been stopped and the protocol has reached a final state, S4. Since the machine must be switched off by the process, the stop message must be sent reliably. As a consequence the process must receive the off command and therefore the off command must also be sent reliably.

In FIGS. 2A, 2B, 5-7 and 9-11, intermediate states are identified by nodes comprising empty circles, final states are identified by nodes having two concentric circles, and state transitions are identified by directed lines between nodes. In FIGS. 5-7 and 9-11, state transitions associated with external actions (including reliable send operations) are identified by bold directed lines.

Consider an alternate situation, shown in FIG. 2B, in which the above mentioned machine is designed to stop automatically due to a time-out. In that case the stop command does not need to be sent reliably, allowing the process to terminate in state S3 without causing serious problems. Even the off command may be omitted without causing problems, thus allowing the process to terminate in state S2 as well as S4. This also relaxes the requirements on the human operator who may now omit the off command without causing a serious problem. From the viewpoint of someone external to the process running the protocol of FIG. 2B, if the machine is off, it is impossible to determine whether the machine was turned off by a timeout or by a stop message. Furthermore, to reach a consistent final state, it is no longer necessary to be able to recover states S2 and S3. Therefore, fewer states must be recorded on stable storage to correctly execute the protocol in the presence of failures.

The above example shows several aspects of recoverable processes that are important to the present invention. External actions identify state transitions whose occurrence, and sometimes also whose non-occurrence, must be known by some real world process to ensure termination of the associated real world process such as a machine performing a particular process. Starting from these operations, other operations, some only visible inside a computing system, may be identified as "derived" external actions. In the above example, the receipt of the "off" command identifies a derived external action because it forces the receiving process to change its stable state to ensure correct execution in the presence of failures. Derived external actions will be explained in more detail below.

Reliable Send Operations

Protocol specifications for each finite state machine must indicate which state transitions are caused by messages sent and which state transitions are caused by messages received. Send operations and external actions identify messages to be sent by the process executing a local protocol. External actions are considered to be reliable send operations. Conceptually, external actions are implemented by repeatedly sending a message until an acknowledgement has been received. More generally, there is a guarantee that at least one message concerning the external action will be received.

Also, a process that reliably sends at least one message cannot terminate until acknowledgments have been received for all its reliably sent messages. It should be noted, however, that the requirement that an acknowledgement must be received in response to a reliably sent message can sometimes be satisfied in a somewhat backhanded manner. In certain circumstances, the receiving process may terminate without sending an acknowledgement of the received message. In that case, when the sending process attempts to re-send the message because it has not been acknowledged, the operating system of the computer on which the terminated process resided will be unable to deliver the re-sent message and will respond to receipt of the re-sent message with a reply message that indicates that the receiving process has already terminated. A reply that the receiving process has terminated is treated by the sending process as an acknowledgement because it indicates that the message was received by the receiving process (which allowed the process to terminate).

In the preferred embodiment, a message status inquiry facility is provided so that any process awaiting receipt of a reliable message may also inquire regarding its status. A negative response to that inquiry indicates that the message will never be sent. Furthermore, repeatedly sending a message either requires the operation caused by the message to be idempotent (i.e., performing it more than once has the same effect as performing it once), or that it is possible to test or infer whether or not the operation caused by the message has been performed. The receipt of a reliable message causes the stable state of the receiving process to change. An acknowledgment is only returned after the state change has been successfully recorded on stable storage.
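
One way to picture the receiving side is sketched below in Python. It assumes that each reliable message carries a unique identifier so that a repeated delivery can be detected; the `Receiver` class, the `msg_id` parameter, and the list standing in for stable storage are assumptions made for illustration only.

```python
class Receiver:
    def __init__(self):
        self.stable_log = []   # stands in for stable storage (assumption)
        self.seen = set()      # ids already journalled; rebuildable from the log
        self.applied = []      # operations this process has performed

    def on_reliable_message(self, msg_id, payload):
        """Handle one delivery and return the acknowledgement to send back."""
        if msg_id not in self.seen:
            # Record the state change first; only then act on the message.
            self.stable_log.append((msg_id, payload))
            self.seen.add(msg_id)
            self.applied.append(payload)
        # A duplicate is acknowledged again without repeating the operation.
        return ("ack", msg_id)


r = Receiver()
first = r.on_reliable_message(7, "stop")
again = r.on_reliable_message(7, "stop")   # sender retried after a lost ack
```

Checking the identifier before acting is one way to "test or infer" whether the operation has already been performed, so the operation itself need not be idempotent.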

It is an assumption of the present invention that any individual message sent to a specified agent may be lost. This may be due to failures or other decisions local to the communication service. However, it is assumed that a message sent sufficiently often is eventually received. In particular, after recording on stable storage the attempt to send a message and re-sending the message until acknowledged, there is a guarantee that the message is eventually received.

In order to ensure that a message is reliably sent, the message must be recorded on stable storage prior to sending the message.
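The reliable send discipline described above (record first, then resend until acknowledged) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the preferred embodiment: the names `reliable_send` and `make_lossy_channel`, the in-memory list standing in for stable storage, and the `max_attempts` bound are all hypothetical.

```python
from typing import Callable, List


def reliable_send(message: str,
                  journal: List[str],
                  transmit: Callable[[str], bool],
                  max_attempts: int = 100) -> int:
    """Journal the message, then resend it until it is acknowledged.

    `journal` stands in for stable storage (assumption). `transmit`
    returns True when an acknowledgement, or a reply that the receiver
    has already terminated, comes back. Returns the attempts used.
    """
    journal.append(message)  # recorded on stable storage before the first send
    for attempt in range(1, max_attempts + 1):
        if transmit(message):
            return attempt
    # The text assumes eventual delivery; the bound only keeps the sketch finite.
    raise TimeoutError("no acknowledgement after %d attempts" % max_attempts)


def make_lossy_channel(drops: int) -> Callable[[str], bool]:
    """Hypothetical channel that loses the first `drops` transmissions."""
    state = {"remaining": drops}

    def transmit(_message: str) -> bool:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return False  # message lost, no acknowledgement
        return True       # message delivered and acknowledged

    return transmit
```

With a channel that drops three transmissions, `reliable_send` journals the message once and succeeds on the fourth attempt.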

Determination of Final States

If a process must perform external actions after reaching a particular state, then that state is an intermediate state which must be journalled in order to ensure that the local protocol associated with the process will be continued after a failure. Thus, each intermediate state identifies a recoverable state whose occurrence must be recorded on stable storage before a message indicating the transition into that state can be sent.

Conversely, if a state machine has entered State A of a local protocol from which it may enter a final state without performing an external action, then State A can be re-categorized as a "derived final state". The reason for this is as follows. If no further external actions will be performed by the state machine, from the viewpoint of an external observer there is no way of distinguishing whether the state machine terminated in State A or terminated after entering the following final state. Thus, if from State A all enabled state transitions are message receiving operations (called receive operations) terminating in final states, State A can be re-categorized as a final state because receive operations do not reveal to an outside observer the current state of a local protocol.

In addition, if from an intermediate State A a final state can be reached by executing an unreliable send operation (i.e., by sending a message that may not reach its destination), State A is re-categorized as a final state because an outside observer cannot distinguish whether execution of the local protocol was terminated before or after sending the message.
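The two re-categorization rules above can be sketched as a small fixed-point computation. The sketch takes a conservative reading that is an assumption of this example: a state becomes a derived final state only when every enabled transition from it is a receive operation or an unreliable send operation ending in a final (or derived final) state. The function name and the dictionary representation of the state machine are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (event_type, next_state); event_type is "receive",
# "unreliable send" or "reliable send"
Transition = Tuple[str, str]


def derive_final_states(transitions: Dict[str, List[Transition]],
                        final_states: Set[str]) -> Set[str]:
    """Return the given final states plus all derived final states."""
    final = set(final_states)
    changed = True
    while changed:  # repeat until no further state can be re-categorized
        changed = False
        for state, outgoing in transitions.items():
            if state in final or not outgoing:
                continue
            # No external action is needed to reach a final state from here.
            if all(event in ("receive", "unreliable send") and nxt in final
                   for event, nxt in outgoing):
                final.add(state)
                changed = True
    return final


fsm = {
    "A": [("receive", "F")],          # receive into a final state
    "B": [("unreliable send", "A")],  # unreliable send into derived final A
    "C": [("reliable send", "F")],    # external action: stays intermediate
    "F": [],
}
derived = derive_final_states(fsm, {"F"})
```

Here `derived` contains F, A and B, while C remains an intermediate state because leaving it requires a reliable send.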

Basic Journalling

When executing the application process, a recovery manager can always determine whether or not a local protocol last entered an intermediate state if the following conditions are satisfied:

(1) Intermediate states entered are recorded on stable storage;

(2) The first final state entered after a sequence of intermediate states is recorded on stable storage;

(3) All reliably sent messages must be recorded on stable storage before being sent; and

(4) Messages indicating state transitions into and out of intermediate states are sent only after the next state entered has been recorded on stable storage and the journalling process has confirmed that the next state has been successfully recorded on stable storage.

The above requirements are the primary criteria for determining what state information concerning an application process needs to be journalled and when the journalling actions need to occur. In accordance with the present invention, some of these primary journalling criteria are superseded by one set of optimization criteria that define when certain information can be journalled with less restrictive ordering constraints than those listed above, and by another set of optimization criteria that determine when journalling of certain state information can be avoided altogether.

The fourth requirement listed above is herein called "synchronous journalling" because the sending of messages is synchronized with the completion of the journalling of the associated state transition by the sending process. For example, if an unreliable send message M1 causes a transition into intermediate state C2, the state C2 is recorded on stable storage prior to sending the message M1. Standard, basic journalling of state transitions is accomplished using synchronous journalling.
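The ordering constraint of synchronous journalling can be sketched as below, using the M1/C2 example from the text. The function `synchronous_transition` and the in-memory `journal_write` and `send` stand-ins are hypothetical names introduced for illustration; a real journal write would block until stable storage confirms it.

```python
from typing import Callable, List, Tuple


def synchronous_transition(next_state: str,
                           message: str,
                           journal_write: Callable[[str], bool],
                           send: Callable[[str], None]) -> None:
    """Send `message` only after `next_state` is confirmed on stable storage."""
    confirmed = journal_write(next_state)  # assumed to block until complete
    if not confirmed:
        raise IOError("state not journalled; message withheld")
    send(message)


# Hypothetical in-memory stand-ins that log the order of events.
trace: List[Tuple[str, str]] = []


def journal_write(state: str) -> bool:
    trace.append(("journal", state))
    return True


def send(message: str) -> None:
    trace.append(("send", message))


# Unreliable send of M1 causes a transition into intermediate state C2.
synchronous_transition("C2", "M1", journal_write, send)
```

After the call, `trace` shows the journal entry for C2 ahead of the send of M1, which is the ordering the fourth requirement demands.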

The above stated "rules" for journalling process states and state transition information in a recoverable distributed computation are modified by the run-time journalling optimizations of the present invention, discussed below. However, those optimizations all depend on the detection of special circumstances that allow these rules to be relaxed, while still ensuring that the distributed computation is recoverable regardless of when or where a failure may occur. Thus, the above stated rules are still the basis for the protocol analysis and optimization procedures of the present invention.

Derived External Actions

Making intermediate states recoverable ensures that execution of a local protocol (i.e., an application process) is always continued correctly. In order to ensure that the application process will be recoverable in a state consistent with external actions already performed by the application process, certain send operations must be made reliable and certain states of the application process must be journalled. It must also be guaranteed that a protocol eventually terminates after entering an intermediate state. As long as a local protocol can make progress without having to wait for messages to arrive, termination of the local protocol can be ensured by local scheduling algorithms. However, if messages must be received to continue, those messages must be sent reliably. These are recorded on stable storage and sent until acknowledged. Reliable send operations are considered to be external actions.

For these reasons, the initial specification of an application process's states and state transitions may need to be modified in order to ensure that the application process is always continued correctly and to ensure that the application process will eventually terminate. As will be described below, the present invention provides a procedure for modifying the initial state and state transition specification to satisfy these requirements.

In the preferred embodiment, agents or processes waiting for one of several possible messages to arrive may issue an inquiry to find out about the current state of a send operation. If the send operation is a reliable send operation, then the inquiring process will receive as a response either the actual message, or a negative acknowledgment (which means that the message will never be sent). If the send operation that is the subject of the inquiry is an unreliable send operation, the response will be either an acknowledgement (meaning that the message was sent), a negative acknowledgment (meaning that the message has not yet been sent), or an indication that the sending process has already terminated.

When receipt of a reliably sent message triggers an external action, the received message is acknowledged only after the triggered external action has been successfully initiated (i.e., after the reliably sent message associated with the external action has been stored on stable storage).

Based on the above criteria, derived external actions are determined as follows. When all enabled state transitions from an intermediate state are receive operations, at least one of which causes a transition to an intermediate state, then the unreliable send message operations which cause a transition (by the receiving process) into an intermediate state must be converted into reliable send operations. Those "derived" reliable send operations are sometimes herein called derived external actions.
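The rule just stated can be sketched as a scan over the state machine. This is a simplified illustration, not the procedure of Table 1: it assumes one message per state transition, and the function name and data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (event_type, message, next_state)
Transition = Tuple[str, str, str]


def derive_external_actions(transitions: Dict[str, List[Transition]],
                            final_states: Set[str]) -> Set[str]:
    """Return the messages whose send operations must be made reliable."""
    must_be_reliable: Set[str] = set()
    for state, outgoing in transitions.items():
        if state in final_states or not outgoing:
            continue  # only intermediate states with enabled transitions
        if all(event == "receive" for event, _, _ in outgoing):
            for _, message, nxt in outgoing:
                if nxt not in final_states:
                    # A receive leading into an intermediate state: the
                    # matching send becomes a derived external action.
                    must_be_reliable.add(message)
    return must_be_reliable


fsm = {
    "W": [("receive", "M1", "I"), ("receive", "M2", "F")],
    "I": [("reliable send", "M3", "F")],
    "X": [("receive", "M4", "I"), ("unreliable send", "M5", "F")],
    "F": [],
}
reliable = derive_external_actions(fsm, {"F"})
```

In this example only M1 is selected: state W can proceed solely by receiving, and M1 leads to intermediate state I. State X is skipped because it can also make progress by sending.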

As will be discussed below, deriving external actions may require that states formerly categorized as derived final states be re-categorized as intermediate (recoverable) states. As a result, the determination that some send operations need to be reliable send operations will cause additional journalling actions to be performed each time the application process is executed.

For convenience, receive operations that receive reliably sent messages will sometimes herein be called reliable receive operations. Thus, a reliable receive operation is one that corresponds to a reliable send operation of another finite state machine.

Note that a single state transition can have more than one receive operation associated with it, for instance when either of two received messages M1 or M2 will cause a state transition from state S1 to state S2. In such cases, all the receive operations associated with a single state transition are either converted into reliable receive operations or all are not converted.

An alternate version of the above rule concerning the generating of derived external actions is that the receive operation(s) of one state transition can be left as an unreliable receive operation. The reason that the receive operation(s) associated with one state transition can be left as unreliable receive operations is that the receiving process can send inquiries for the reliable receive operations of all the other state transitions, and can thereby determine indirectly if the remaining state transition will be the one that moves the finite state machine to a next state. If all inquiries to the processes associated with the reliable receive operations produce negative acknowledgements (meaning that the associated messages will never be sent), then the remaining state transition is guaranteed to occur.
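The indirect determination described in this alternate rule can be sketched as follows. The `InquiryReply` type and the function name are hypothetical, and the sketch assumes that each reliable receive operation has already been the subject of an inquiry.

```python
from enum import Enum
from typing import Dict


class InquiryReply(Enum):
    MESSAGE = "message"            # the reliably sent message itself
    NEGATIVE_ACK = "negative ack"  # the message will never be sent


def remaining_transition_guaranteed(replies: Dict[str, InquiryReply]) -> bool:
    """True when every reliable receive operation was answered negatively.

    `replies` maps each reliably sent message to the reply obtained by
    inquiring about it; the one transition left unreliable is not included.
    """
    return all(reply is InquiryReply.NEGATIVE_ACK
               for reply in replies.values())
```

If every inquiry returns a negative acknowledgement, the process can conclude that the transition left unreliable is the one that will occur.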

Since the receive operations associated with one state transition do not need to be converted into a derived external action, there needs to be a basis for choosing which receive operations to convert and which not to convert. At least three different selection mechanisms could be used: (1) a random choice, (2) presenting the options to a human operator or engineer for selection of the receive operations to be converted into derived external operations, or (3) determining which choice will cause the least number of external actions (e.g., input/output operations associated with subsequent intermediate states) and selecting that choice. The third option can be implemented by performing a tree search and counting intermediate states and external actions, including intermediate states and external actions by other processes participating in the application execution process. In most practical instances, the number of states in each state machine is not large and therefore the tree search will not be extensive. The second option allows a human to take into account factors such as knowledge regarding which state transitions occur most often, which may affect the average number of external actions that will be associated with each potential choice of receive operations to be converted into reliable receive operations.

Further, as will be understood by those skilled in the art, it is possible to define various strategies for identifying situations where messages that would normally be sent reliably do not need to be sent reliably, based on what the sending process knows about the status of the receiving process. One such situation is where it is known that the receiving process cannot yet be in the state where it waits for the message to be reliably sent, and another situation is where it is known that the receiving process will never be in the state where it waits for the message to be reliably sent.

OFF-LINE JOURNALLING OPTIMIZATION AND PROTOCOL ANALYSIS METHOD

The goal of the off-line protocol analysis procedure of the present invention is to generate either a "state transition control table" (to be used in conjunction with a predefined program that utilizes the generated table to control the journalling of state transitions) or a state transition control computer program that will control the journalling of state transitions by a process participating in a distributed computation.

The process of analyzing the protocols associated with the finite state machines used in a particular application process begins with generating (or providing) an initial state machine table 112 (also herein called a state transition control table), an example of which is shown diagrammatically in FIG. 3. The state machine table 112, at a minimum, must include for each finite state machine (FSM) that participates in the application process (A) a list of the states for the FSM, (B) data indicating which states are known from the outset to be final states, and (C) data concerning each of the state transitions from each state to another state, including (C1) the next state in the FSM after the state transition, (C2) an indication as to whether the state transition is the result of a receive operation, an unreliable send operation, or a reliable send operation, and (C3) the other FSM or External Device to which each sent message is directed or from which a message is to be received. The state machine table 112 may also indicate which state is the FSM's initial state (e.g., by listing that state first in the table 112).

For example, the initial specification of the set of finite state machines in the computer system, their states and state transitions can be represented by a database table. The format of each record of that database table can be represented as follows:

FSM ID, State ID, State Type, Next State, Transition Event Type, Action

where each record specifies one transition from a first state (identified by the State ID) of a first finite state machine (identified by the FSM ID) to a Next State. The Transition Event Type is equal to "receive", "reliable send", or "unreliable send". In some embodiments, the "receive" Transition Event Type can be divided into two event types: "unreliable receive" and "reliable receive". The Action identifies the nature of the message to be sent or received and the external device or finite state machine or process to which a message is to be sent or from which a message is to be received. The State Type is equal to "initial", "intermediate" or "final". If there are N different possible transitions from a particular state, then the database table will have N records having the same FSM ID, State ID and State Type. Of course, the database used to represent an initial state transition specification could be organized in other ways than the one represented here.
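The record format above can be sketched as a small data structure. The class name `TransitionRecord`, the helper `transitions_from`, and the sample rows are hypothetical and are given only to illustrate the six fields and the "N records for N transitions" property.

```python
from dataclasses import dataclass
from typing import List

STATE_TYPES = {"initial", "intermediate", "final"}
EVENT_TYPES = {"receive", "reliable send", "unreliable send"}


@dataclass(frozen=True)
class TransitionRecord:
    fsm_id: str
    state_id: str
    state_type: str   # one of STATE_TYPES
    next_state: str
    event_type: str   # one of EVENT_TYPES
    action: str       # message and the peer it is sent to or received from

    def __post_init__(self) -> None:
        if self.state_type not in STATE_TYPES:
            raise ValueError("unknown State Type: %r" % self.state_type)
        if self.event_type not in EVENT_TYPES:
            raise ValueError("unknown Transition Event Type: %r" % self.event_type)


def transitions_from(table: List[TransitionRecord],
                     fsm_id: str, state_id: str) -> List[TransitionRecord]:
    """All records describing transitions out of one state of one FSM."""
    return [r for r in table if r.fsm_id == fsm_id and r.state_id == state_id]


# Illustrative rows only; not taken from FIG. 3.
table = [
    TransitionRecord("FSM1", "S1", "intermediate", "S2", "receive", "M1 from FSM2"),
    TransitionRecord("FSM1", "S1", "intermediate", "S3", "unreliable send", "M2 to FSM2"),
    TransitionRecord("FSM1", "S2", "intermediate", "S3", "reliable send", "M3 to FSM2"),
]
```

State S1 of FSM1 has two possible transitions, so `transitions_from(table, "FSM1", "S1")` returns two records sharing the same FSM ID, State ID and State Type.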

From the initial specification, the protocol analysis system first determines if any intermediate states can be re-categorized as final states. Secondly, the protocol analysis determines if any state transitions initially identified as unreliable message send operations must be treated as "derived external actions", and thus made reliable. In particular, if all the enabled state transitions from a particular intermediate state are caused by receive message operations, at least one of which leads to an intermediate state, then a sufficient set of those receive operations must be made reliable to ensure that the finite state machine's process will eventually terminate. The determination of derived external actions was discussed in more detail above. Thirdly, after all derived external actions have been identified, a search is made for derived final states that must be re-categorized as intermediate states. The resulting determinations as to which states are final states and which messages must be reliably sent are recorded in the state machine table, which is then used to govern execution of the application process.

Table 1 provides a pseudocode representation of the protocol analysis process. Tables 1 and 2 contain pseudocode representations of software routines relevant to the present invention. The pseudocode used in those tables is, essentially, a computer language using universal computer language conventions. While the pseudocode employed here has been invented solely for the purposes of this description, it is designed to be easily understandable by any computer programmer skilled in the art.

The above described protocol analysis is an "off-line" process in that it is typically performed prior to execution of the application process. A second embodiment of the present invention, discussed below, performs similar optimizations (as well as some additional optimizations) during run time based on information obtainable only during run time about the other participants in a distributed transaction or computation.

FIG. 4 shows a data processing system 120 for performing the protocol analysis of the present invention and for executing local protocols associated with one or more agents participating in an application process. Data processing system 120 contains the standard computer system components, including a data processing unit (CPU) 122, system bus 124, primary memory 126 (i.e., high speed, random access memory), mass storage 128 (e.g., magnetic or optical disks), virtual memory manager 130, and user interface 132 (e.g., keyboard, monitor, pointer device, etc.). These physical computer components are not modified by the present invention and are therefore not described in detail herein.

TABLE 1

PSEUDOCODE FOR PROTOCOL ANALYSIS