A data processing system comprises an instruction control unit for controlling system operation, a storage control unit for storing data used for the system operation, and a machine check interruption portion for carrying out a machine check interruption when an error occurs during the system operation. The instruction control unit has a system control register group for storing system control data. The storage control unit has a copy register group for storing data copied from the system control data. The instruction control unit further comprises a register save state portion which, when an error occurs in the copy register group during the system operation, copies the contents of the system control register group into the copy register group during the machine check interruption. In this manner, a machine stop caused by an error confined to the copy register group is prevented.
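The refresh step described above can be illustrated with a minimal sketch. It assumes both register groups are modeled as plain Python lists; the function name `handle_machine_check` and the `error_in_copy_only` flag are hypothetical and do not come from the abstract.

```python
def handle_machine_check(system_regs, copy_regs, error_in_copy_only):
    """Refresh the copy registers from the system control registers.

    Returns True if operation can continue, False otherwise.
    """
    if error_in_copy_only:
        # The system control registers are still good, so overwrite the
        # damaged copies in place during the machine check interruption.
        copy_regs[:] = system_regs
        return True
    # Assumption: handling of any other error is outside this sketch.
    return False
```

For example, calling `handle_machine_check([1, 2, 3], copy, True)` with `copy = [1, 99, 3]` leaves `copy` equal to the system control values and reports that operation can continue.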
A method and apparatus for recovery from a fault occurring within a computing system using a hardware recovery module comprising a microprocessor dedicated to recovery control and a memory for storing system states. A recovery counter counts machine instructions executed since a previously recorded initial checkpoint. Each time the CPU transfers information directly from an I/O controller or the cache memory, the recovery module stores the data being transferred. Each time an interrupt is made to the CPU, the recovery module is notified of the interrupt and thereupon stores the count of machine instructions executed since the previously recorded initial checkpoint, together with information identifying the interrupt. When a fault is detected, the system is restored to the system state existing at the beginning of the checkpoint, and the processor synthetically executes the machine instructions originally executed after the initial checkpoint in a sequence substantially similar to the original sequence. During synthetic execution, the recovery module simulates the original inputs, suppresses outputs, and records completion of pre-fault I/O requests. When the instruction point at which the fault was detected is reached, synthetic execution is abandoned and true execution resumes; the recovery module thereafter simulates the completion of pre-fault I/O requests.
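The checkpoint-and-replay cycle can be sketched as a toy model. Everything here is an assumption made for illustration: the two-instruction "machine" (`in` adds an input to an accumulator, `out` emits it), the class `RecoveryModule`, and the function `run_with_recovery` are hypothetical, and interrupt logging and I/O completion tracking are left out.

```python
import copy


class RecoveryModule:
    """Hypothetical stand-in for the dedicated recovery hardware."""

    def __init__(self):
        self.checkpoint = None  # saved system state
        self.inputs = []        # data transferred to the CPU since the checkpoint
        self.count = 0          # recovery counter (instructions since checkpoint)

    def take_checkpoint(self, state):
        self.checkpoint = copy.deepcopy(state)
        self.inputs, self.count = [], 0


def step(state, op, value, outputs):
    """Apply one toy instruction; outputs=None means output is suppressed."""
    if op == "in":
        state["acc"] += value
    elif op == "out" and outputs is not None:
        outputs.append(state["acc"])


def run_with_recovery(program, device, fault_at):
    module, state, outputs = RecoveryModule(), {"acc": 0}, []
    module.take_checkpoint(state)
    device = iter(device)

    # True execution up to the instruction where the fault is detected.
    for op in program[:fault_at]:
        value = next(device) if op == "in" else None
        if op == "in":
            module.inputs.append(value)  # record the transferred data
        step(state, op, value, outputs)
        module.count += 1

    state["acc"] = None  # simulate state corrupted by the fault

    # Restore the checkpoint and re-execute synthetically.
    state = copy.deepcopy(module.checkpoint)
    logged = iter(module.inputs)
    for op in program[:module.count]:
        value = next(logged) if op == "in" else None  # simulated input
        step(state, op, value, None)                  # outputs suppressed

    # Fault point reached: resume true execution.
    for op in program[module.count:]:
        value = next(device) if op == "in" else None
        step(state, op, value, outputs)
    return state, outputs
```

With `program = ["in", "out", "in", "out"]`, `device = [3, 4]`, and `fault_at = 2`, the first output is emitted only once even though its instruction runs twice, because the replayed copy is suppressed.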
A before-image buffer controller is arranged separately from a memory controller and is connected to a system bus. When a CPU issues a write access request to its corresponding cache memory, the before-image buffer controller is automatically started in response to a command issued from this cache memory onto the system bus, and issues a command for reading the previous data from a main memory. Because the before-image buffer controller operates independently of the memory controller, a memory state restore function can be easily realized on an existing computer system as it is, without changing the memory controller.
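The before-image idea can be sketched in software as an undo log. This is an illustration under stated assumptions, not the hardware design: main memory is modeled as a Python list, the names `BeforeImageBuffer`, `on_write`, `restore`, and `write` are hypothetical, and restoring in reverse order of capture is an assumption the abstract does not spell out.

```python
class BeforeImageBuffer:
    """Records the previous contents of each location before it is overwritten."""

    def __init__(self, memory):
        self.memory = memory
        self.log = []  # (address, previous value) pairs, oldest first

    def on_write(self, address):
        # Triggered by the write request: read the previous data from main memory.
        self.log.append((address, self.memory[address]))

    def restore(self):
        # Undo the writes newest-first so the oldest before-image wins.
        while self.log:
            address, previous = self.log.pop()
            self.memory[address] = previous


def write(memory, buffer, address, value):
    """Perform a write, letting the buffer capture the before-image first."""
    buffer.on_write(address)
    memory[address] = value
```

Keeping the log outside the write path mirrors the point of the abstract: the code that updates memory does not need to know how restoration works.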
Recovery circuits react to errors in a processor core by waiting for an error-free completion of any pending store-conditional instruction or cache-inhibited load before ceasing to checkpoint, or back up, the progress of the processor core. The recovery circuits remove the processor core from the logical configuration of the symmetric multiprocessor system, potentially reducing propagation of errors to other parts of the system. The processor core is reset, and the checkpointed values may be restored to registers of the processor core. The processor core does not simply resume execution just prior to the instructions that failed to execute correctly the first time; it first operates in a reduced execution mode for a preprogrammed number of instruction groups. If the preprogrammed number of instruction groups executes without error, the processor core is allowed to resume normal execution.
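The reduced-execution probation period can be sketched as a small decision function. The name `recovery_outcome`, the string mode labels, and the treatment of an error during reduced mode are all assumptions; the abstract states only the success path.

```python
def recovery_outcome(group_ok, required_groups):
    """Decide the core's mode after a reset and checkpoint restore.

    group_ok: iterable of booleans, one per instruction group executed in
    reduced execution mode (True means the group completed without error).
    required_groups: the preprogrammed number of error-free groups.
    """
    clean = 0
    for ok in group_ok:
        if not ok:
            # Assumption: the source does not say what happens on an
            # error in reduced mode, so this sketch just reports it.
            return "error"
        clean += 1
        if clean == required_groups:
            return "normal"  # probation passed, resume normal execution
    return "reduced"  # still inside the probation window
```

For instance, three clean groups against a requirement of three yields `"normal"`, while two clean groups leaves the core in `"reduced"` mode.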