A process state management scheme that acquires process states consistently under the synchronous checkpointing method, even when a new process is generated from an existing process. The scheme prohibits new process generation during process state acquisition: it judges whether a process generation request by a first process for generating a second process precedes a process state acquisition request, and generates the second process from the first process accordingly. The scheme also prohibits process state acquisition during new process generation: when the notice of the identifier of the second process sent by the second process arrives before the notice sent by the first process, the process state of each of the first and second processes is acquired only after the notice from the first process has been received.
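The ordering rule in the first half of this abstract can be illustrated with a minimal sketch. It is not the patented mechanism: the `Coordinator` class, its method names, and the single-threaded toy model are all assumptions made for illustration. A generation request that arrives while a state acquisition is in progress is held back and only carried out once the acquisition has finished.

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    """Toy model (hypothetical): orders spawn requests against checkpoint requests."""
    checkpoint_active: bool = False
    processes: list = field(default_factory=lambda: ["p1"])
    deferred: list = field(default_factory=list)    # spawns held back during a checkpoint
    snapshots: list = field(default_factory=list)   # one tuple of process names per checkpoint

    def request_spawn(self, parent: str, child: str) -> bool:
        # A spawn request that arrives after a checkpoint request is postponed.
        if self.checkpoint_active:
            self.deferred.append((parent, child))
            return False
        self.processes.append(child)
        return True

    def begin_checkpoint(self) -> None:
        self.checkpoint_active = True

    def end_checkpoint(self) -> None:
        # Record the set of processes seen by this checkpoint,
        # then carry out the spawns that were held back.
        self.snapshots.append(tuple(self.processes))
        self.checkpoint_active = False
        pending, self.deferred = self.deferred, []
        for parent, child in pending:
            self.request_spawn(parent, child)
```

In this sketch a spawn requested before `begin_checkpoint()` is included in the snapshot, while one requested afterwards appears only in the process list once `end_checkpoint()` has run.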
A process migration method includes copying a first process context indicative of first processing, transmitting the process context to a second computer, causing a first computer to start generation of a first execution record, causing the second computer to receive the process context, determining, from the first execution record, whether the first processing should be migrated, if it is determined that migration of the first processing should be postponed, finishing generation of the first execution record, starting generation of a second execution record, transmitting the first execution record to the second computer, reproducing the process context, and determining, from the second execution record, whether the first processing should be migrated, after reproduction of the process context is finished in the second computer.
A system and method for enabling execution stop and re-start of a test executive sequence or hierarchy of test executive sequences. Execution progress of a test executive sequence or test executive sequence hierarchy may be periodically stored. This may comprise performing or taking "snapshots" of the execution at various points during the execution. Performing a snapshot may comprise saving all data needed to restore and re-start the execution at the respective point. The criteria of when and where to perform the snapshots may be any of various criteria and may be specified in any of various ways.
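The snapshot idea described above can be sketched as follows. This is an illustrative sketch only: `run_sequence`, its parameters, the JSON snapshot layout, and the three example steps are hypothetical and do not come from the abstract. Each snapshot stores the index of the next step and the results so far, which is enough to restart this toy sequence at that point.

```python
import json


def run_sequence(steps, snapshot=None, snapshot_every=1, stop_at=None):
    """Run `steps` (callables that take and return a dict of results).

    `snapshot` is a JSON string from an earlier run, or None to start fresh.
    `stop_at` simulates an interruption before the step with that index.
    Returns (results, latest_snapshot, finished).
    """
    saved = json.loads(snapshot) if snapshot else {"next_step": 0, "results": {}}
    index, results = saved["next_step"], saved["results"]
    latest = snapshot
    while index < len(steps):
        if stop_at is not None and index == stop_at:
            return results, latest, False          # execution stopped early
        results = steps[index](results)
        index += 1
        if index % snapshot_every == 0:            # snapshot criterion (assumed)
            latest = json.dumps({"next_step": index, "results": results})
    return results, latest, True


# Hypothetical test steps used only for this example.
def measure_voltage(r):
    return {**r, "voltage": 5}


def measure_current(r):
    return {**r, "current": 2}


def compute_power(r):
    return {**r, "power": r["voltage"] * r["current"]}
```

A run stopped with `stop_at=2` returns a snapshot whose `next_step` is 2; passing that snapshot to a second call resumes at `compute_power` without repeating the two measurements.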
A method for synchronizing a program that is executed on one of a plurality of computers in a distributed computer system by using a reliable ordered multicast, comprising the steps of generating a new process comprising a program and the status in execution on a computer, and transferring the new process through the reliable ordered multicast to the computers, respectively.
One embodiment of the present invention provides a system for recovering a process that is multi-threaded from checkpoint information that was previously stored for the process. During a recovery operation, the system first retrieves the checkpoint information for the process. Next, the system extracts an identifier for a program being run by the process as well as parameters of the program from the checkpoint information. The system also extracts thread identifiers for threads associated with the process from the checkpoint information. Next, the system modifies the program so that executing the program will cause threads associated with the process to be restored. The system then creates a replacement process to replace the process, and causes the replacement process to execute the modified program so that the threads are reconstituted within the replacement process.
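The recovery steps above can be sketched roughly as follows. This is a simplified sketch under stated assumptions: the checkpoint is modelled as a plain dictionary with hypothetical keys `program_id`, `params`, and `thread_ids`, the "modified program" is a wrapper function, and the replacement process is stood in for by simply calling that wrapper. None of these names come from the abstract.

```python
import threading


def restore_from_checkpoint(checkpoint, programs):
    """Build a wrapped program that recreates one thread per saved thread id."""
    # Extract the program identifier, its parameters, and the thread ids.
    program = programs[checkpoint["program_id"]]
    params = checkpoint["params"]
    thread_ids = checkpoint["thread_ids"]

    def modified_program():
        outputs = []
        lock = threading.Lock()

        def worker(tid):
            out = program(tid, *params)
            with lock:                     # protect the shared result list
                outputs.append(out)

        threads = [
            threading.Thread(target=worker, args=(tid,), name=f"restored-{tid}")
            for tid in thread_ids
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sorted(outputs)

    return modified_program
```

Calling the returned function plays the role of the replacement process executing the modified program: every thread listed in the checkpoint is started again with the saved parameters.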
A method and apparatus for providing process-pair protection to complex applications is provided. The apparatus of the present invention includes a process-pair manager (PPM). The PPM is replicated so that a respective PPM is deployed on each of two computer systems. Each computer system also hosts a watchdog process that monitors and restarts the PPM in case of PPM failures. Each PPM communicates with a respective instance of an application. The application instances may include one or more processes along with associated resources. During normal operation the primary application provides service and periodically checkpoints its state to the backup application. The backup application functions in a standby mode. The two PPMs communicate with each other and exchange messages as state changes occur. The apparatus also includes in each computer system a node watcher that notifies the PPM of failures of the remote computer system. This way, each PPM monitors the state of the other application instance and the health of the computer system on which it is resident. If a failure of the primary application or of the computer system where it runs is detected, the PPM managing the backup application takes steps to cause its instance of the application to become primary. The failover operation is faster (between 5 and 20 seconds) than corresponding operations provided by other existing methods (between one and 40 minutes depending on the application initialization time) because the backup application does not need to be started and initialized to become primary. The failover is stateful because the backup application receives periodic updates of the state of the primary application.
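The checkpoint-and-failover behaviour described above can be illustrated with a small in-memory sketch. It is an assumption-laden toy, not the apparatus itself: `AppInstance`, `ProcessPair`, and the `alive` flag are hypothetical, a single object stands in for both PPMs, and the watchdog, node watcher, and network messaging are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class AppInstance:
    name: str
    role: str                                  # "primary", "backup", or "failed"
    state: dict = field(default_factory=dict)
    alive: bool = True


class ProcessPair:
    """Toy stand-in for the two cooperating process-pair managers."""

    def __init__(self, primary: AppInstance, backup: AppInstance):
        self.primary, self.backup = primary, backup

    def checkpoint(self) -> None:
        # The primary periodically copies its state to the standby backup.
        self.backup.state = dict(self.primary.state)

    def monitor(self) -> AppInstance:
        # On a detected primary failure, promote the already-running backup.
        if not self.primary.alive:
            self.primary.role = "failed"
            self.backup.role = "primary"
            self.primary, self.backup = self.backup, self.primary
        return self.primary
```

Because the backup is already running and holds the last checkpointed state, promotion in this sketch is only a role change. Any update made after the last checkpoint is not carried over, which matches the abstract's description of periodic state updates.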