New failure detector mechanisms particularly suitable for use in asynchronous distributed computing systems in which processes may crash and recover, and two crash-recovery consensus mechanisms, one requiring stable storage and the other not requiring it. Both consensus mechanisms tolerate link failures and are particularly efficient in the common runs with no failures or failure detector mistakes. Consensus is achieved in such runs within 3δ time and with 4n messages, where δ is the maximum message delay and n is the number of processes in the system.
CROSS-REFERENCE TO RELATED APPLICATIONS
This invention is a conversion of U.S. Provisional Application Serial No. 60/130,365, filed Apr. 21, 1999 and U.S. Provisional Application Serial No. 60/130,430, filed Apr. 21, 1999. This patent incorporates by reference the disclosures found in each of the related applications identified above.
The present invention provides a method for detecting a termination of a process within a plurality of processes in a data processing system. A monitoring policy is established, within the plurality of processes, wherein the monitoring policy assigns a first process within the plurality of processes to monitor a second process within the plurality of processes. Responsive to a termination of execution of the second process, a cause of the execution termination is determined by the first process. Responsive to a determination that the second process terminated execution in an abnormal manner, the first process attempts to restart the second process. Furthermore, the present invention provides a method for inserting a process within a plurality of processes containing a first process and a monitoring policy in a data processing system. A request is received from a second process to join the plurality of processes. Responsive to the second process joining the plurality of processes, the first process within the plurality of processes is selected to monitor the second process. The monitoring policy is modified, wherein the monitoring policy assigns the selected first process to monitor the second process for termination of execution.
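The monitoring policy described above can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `MonitoringPolicy`, `join`, and `handle_termination` are hypothetical, the rule of choosing the least-loaded member as monitor is an assumption, and treating a nonzero exit code as abnormal termination is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringPolicy:
    # Maps each monitored process name to the name of its monitor
    # (hypothetical layout, not taken from the source).
    monitor_of: dict = field(default_factory=dict)

    def assign(self, monitor, target):
        """Record that `monitor` watches `target` for termination."""
        self.monitor_of[target] = monitor

    def join(self, members, newcomer):
        """Select a monitor for a joining process and modify the policy.

        Assumed selection rule: the existing member that currently
        monitors the fewest processes.
        """
        load = {m: 0 for m in members}
        for mon in self.monitor_of.values():
            if mon in load:
                load[mon] += 1
        chosen = min(members, key=lambda m: load[m])
        self.assign(chosen, newcomer)
        return chosen


def handle_termination(policy, target, exit_code, restart):
    """Run on behalf of the monitor when `target` stops executing.

    Assumption: a nonzero exit code means abnormal termination.
    `restart` is a caller-supplied callback that attempts the restart.
    """
    monitor = policy.monitor_of[target]
    abnormal = exit_code != 0
    if abnormal:
        restart(target)
    return monitor, abnormal
```

In use, a caller would `assign` the initial pairs, call `join` whenever a process asks to enter the group, and call `handle_termination` from whatever mechanism reports that a monitored process has exited.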
A worker process is monitored by an executive process. The worker process periodically sends a signal to the executive process, such as via a call to a heartbeat interface, which receives the signal and determines whether the worker process is functioning improperly. If the worker process is functioning improperly, the executive process terminates the worker process. The executive process may also examine the worker process for diagnostic purposes before terminating, or returning control to, the worker process.
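The heartbeat arrangement can be sketched roughly as follows in Python. The source does not say how improper functioning is detected, so this sketch assumes a worker is at fault when no heartbeat has arrived within a timeout. The class name `Executive`, the `timeout` parameter, and the injected `clock` are all hypothetical, and termination is only recorded here, not performed.

```python
class Executive:
    """Tracks worker heartbeats and flags workers that go silent."""

    def __init__(self, timeout, clock):
        self.timeout = timeout    # assumed: max seconds between heartbeats
        self.clock = clock        # callable returning the current time
        self.last_seen = {}       # worker id -> time of last heartbeat
        self.terminated = []      # workers this executive has terminated

    def heartbeat(self, worker_id):
        """Heartbeat interface called periodically by each worker."""
        self.last_seen[worker_id] = self.clock()

    def check(self):
        """Terminate (here: just record) every worker that missed its deadline."""
        now = self.clock()
        for worker_id, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                # A diagnostic examination of the worker could happen here
                # before it is terminated.
                del self.last_seen[worker_id]
                self.terminated.append(worker_id)
        return self.terminated
```

Passing the clock in as a parameter keeps the sketch deterministic under test; a real deployment would more likely pass `time.monotonic`.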
An embodiment of a method of seeking consensus among computer processes begins with a first step of saving a new timestamp in a timestamp array for a particular process. The method continues with a second step of determining whether a most recent entry in a decision array includes a previously established consensus decision. In a third step, if the most recent entry does not include the previously established consensus decision, the method saves a proposed decision as a consensus decision. Otherwise, in a fourth step, the method saves the previously established consensus decision as the consensus decision. In a fifth step, if a most recent timestamp in the timestamp array continues to be the new timestamp, the method returns the consensus decision. Otherwise, in a sixth step, the method returns an abort indicator.
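The six steps above can be sketched in Python as a single-threaded, in-memory illustration. Several details are assumptions not stated in the source: the "most recent entry" in the decision array is taken to be the entry with the highest timestamp, `None` marks an entry with no decision (so `None` cannot itself be proposed), and the names `SharedState`, `propose`, and `ABORT` are hypothetical.

```python
ABORT = object()  # abort indicator (hypothetical sentinel)


class SharedState:
    """Timestamp and decision arrays shared by n processes (in-memory stand-in)."""

    def __init__(self, n):
        self.timestamps = [0] * n
        # One (timestamp, value) entry per process; None means "no decision".
        self.decisions = [(0, None)] * n


def propose(state, pid, new_ts, proposal):
    # Step 1: save the new timestamp for this process.
    state.timestamps[pid] = new_ts

    # Step 2: inspect the most recent decision entry
    # (assumed: the entry with the highest timestamp).
    _, prior = max(state.decisions, key=lambda entry: entry[0])

    # Steps 3 and 4: adopt the earlier decision if one exists,
    # otherwise adopt our own proposal.
    value = proposal if prior is None else prior
    state.decisions[pid] = (new_ts, value)

    # Step 5: succeed only if our timestamp is still the most recent.
    if max(state.timestamps) == new_ts:
        return value

    # Step 6: another process wrote a newer timestamp; abort.
    return ABORT
```

A caller that receives `ABORT` would typically retry with a larger timestamp; that retry policy is outside what the text describes.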
Epoch numbers are maintained in a pairwise fashion at a plurality of communication endpoints to provide communication consistency and recovery from a range of failure conditions, including total or partial node failure and subsequent recovery. Once an epoch state inconsistency is recognized, negotiation procedures provide an effective mechanism to reestablish valid communication links without the need for global variables, which inherently impose greater transmission and overhead requirements to maintain communications. Renegotiation of recognizably valid epoch numbers occurs on a pairwise basis.
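The pairwise epoch scheme can be sketched loosely in Python as follows. The source does not specify the negotiation procedure, so the rule used here (both sides adopt one more than the larger of their two stored epochs) is an assumption, as are the names `Endpoint` and `negotiate`, the choice to drop a message whose epoch does not match, and the direct method calls standing in for network transport.

```python
class Endpoint:
    """A communication endpoint that keeps one epoch number per peer."""

    def __init__(self, name):
        self.name = name
        self.epochs = {}  # peer name -> epoch agreed with that peer only

    def restart(self):
        """Simulate a node failure and recovery that loses epoch state."""
        self.epochs.clear()

    def send(self, peer, payload):
        """Deliver a message stamped with the epoch held for this peer."""
        return peer.receive(self, self.epochs.get(peer.name, 0), payload)

    def receive(self, sender, epoch, payload):
        if epoch != self.epochs.get(sender.name, 0):
            # Inconsistency recognized: renegotiate with this peer only
            # and drop the message (assumed behavior).
            negotiate(self, sender)
            return None
        return payload


def negotiate(a, b):
    """Assumed rule: both sides move to one past the larger stored epoch."""
    new_epoch = max(a.epochs.get(b.name, 0), b.epochs.get(a.name, 0)) + 1
    a.epochs[b.name] = new_epoch
    b.epochs[a.name] = new_epoch
    return new_epoch
```

Because each endpoint stores a separate epoch per peer, a restart of one node forces renegotiation only on the links that touch it; the epochs held between other pairs are left alone, which is the point of avoiding a global variable.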