A method for managing a fault involves detecting an error, gathering data associated with the error to generate an error event, and categorizing the error event using a hierarchical organization of the error event.
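As a minimal sketch of the three steps in this abstract (detect, gather data into an error event, categorize hierarchically), the Python below assumes the hierarchy is encoded as a dotted class name. The abstract does not specify that encoding. The names `make_error_event`, `categories`, `route`, and the `HANDLERS` table are hypothetical and serve only as illustration.

```python
import time

# Hypothetical mapping from a category in the hierarchy to a diagnosis handler.
HANDLERS = {
    "io.disk": "disk-diagnosis",
    "io": "generic-io-diagnosis",
}


def make_error_event(error_class, **payload):
    """Gather data associated with a detected error into an error event."""
    return {"class": error_class, "time": time.time(), "payload": payload}


def categories(event):
    """List the event's categories from most general to most specific."""
    parts = event["class"].split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def route(event):
    """Pick the handler registered for the most specific matching category."""
    for category in reversed(categories(event)):
        if category in HANDLERS:
            return HANDLERS[category]
    return "unclassified"


disk_event = make_error_event("io.disk.timeout", device="sd0")
net_event = make_error_event("io.net.crc")
print(categories(disk_event))
print(route(disk_event), route(net_event))
```

Walking the hierarchy from the most specific category upward lets a general handler cover error classes that have no dedicated handler of their own.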
A method and system are provided for automated diagnosis of a system. In one embodiment, the method includes providing a fault tree representation of the system, the fault tree specifying how errors generated in the system by problems propagate to produce error reports. At least some of the propagations have timing information associated therewith. One or more error reports having timing information associated therewith are received and analysed using the fault tree representation to determine a suspect list of problems. The suspect list contains those problems that could have generated errors producing the received error reports in a manner compatible with the propagations in the fault tree and consistent with the timing information associated with both the propagations and the received error reports.
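The suspect-list idea can be illustrated with a small Python sketch. It rests on assumptions the abstract does not state: each propagation carries a single upper bound on its delay (`within`), each report carries one timestamp, and a problem is a suspect if some single onset time explains every received report. The function names and the example events are hypothetical.

```python
def max_delays(propagations, source):
    """Largest cumulative delay from source to each reachable event.

    propagations: list of (cause, effect, within) tuples; assumed acyclic.
    """
    best = {source: 0.0}
    stack = [source]
    while stack:
        node = stack.pop()
        for cause, effect, within in propagations:
            if cause == node and best[node] + within > best.get(effect, -1.0):
                best[effect] = best[node] + within
                stack.append(effect)
    return best


def suspects(problems, propagations, reports):
    """Return problems that could explain all reports ({name: timestamp})."""
    if not reports:
        return list(problems)
    result = []
    for problem in problems:
        reach = max_delays(propagations, problem)
        if not all(name in reach for name in reports):
            continue  # some report cannot be reached from this problem
        # The onset must precede every report, but not by more than the
        # longest allowed propagation delay to that report.
        earliest_onset = max(t - reach[name] for name, t in reports.items())
        latest_onset = min(reports.values())
        if earliest_onset <= latest_onset:
            result.append(problem)
    return result


propagations = [
    ("fan_failure", "overheat", 5.0),
    ("overheat", "cpu_throttle_report", 1.0),
    ("overheat", "thermal_trip_report", 2.0),
    ("psu_fault", "voltage_drop", 0.5),
    ("voltage_drop", "cpu_throttle_report", 0.5),
    ("voltage_drop", "thermal_trip_report", 0.5),
]
reports = {"cpu_throttle_report": 10.0, "thermal_trip_report": 14.0}
print(suspects(["fan_failure", "psu_fault"], propagations, reports))
```

In this example both problems can reach both reports, but the four-second gap between the reports is too wide for the short delays allowed under `psu_fault`, so only `fan_failure` remains on the suspect list.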
One embodiment of the invention provides apparatus including a data structure representing a fault tree for a system. The data structure comprises a plurality of events linked by propagations. Each event is classified as one of at least three possible event types. A first type of event is a problem event, which represents an underlying cause of misbehavior in the system. A second type of event is an error event, which represents an error in the system comprising an incorrect signal or datum. A third type of event is a report event, representing the formal detection by the system of an error. Each propagation in the fault tree denotes a cause and effect linkage from one event to another event. There are no propagations within the fault tree to a problem event.
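One possible in-memory shape for such a fault tree is sketched below in Python. The class and method names (`EventType`, `FaultTree`, `add_event`, `add_propagation`) and the example event names are assumptions made for illustration, not taken from the abstract. The sketch enforces the stated rule that no propagation may lead to a problem event.

```python
from dataclasses import dataclass, field
from enum import Enum


class EventType(Enum):
    PROBLEM = "problem"  # underlying cause of misbehavior
    ERROR = "error"      # incorrect signal or datum
    REPORT = "report"    # formal detection of an error by the system


@dataclass
class FaultTree:
    events: dict = field(default_factory=dict)        # name -> EventType
    propagations: list = field(default_factory=list)  # (cause, effect) pairs

    def add_event(self, name, event_type):
        self.events[name] = event_type

    def add_propagation(self, cause, effect):
        """Record a cause-and-effect link from one event to another."""
        if cause not in self.events or effect not in self.events:
            raise KeyError("both events must be declared before linking them")
        if self.events[effect] is EventType.PROBLEM:
            raise ValueError("no propagation may lead to a problem event")
        self.propagations.append((cause, effect))


tree = FaultTree()
tree.add_event("disk.media_fault", EventType.PROBLEM)
tree.add_event("disk.bad_read", EventType.ERROR)
tree.add_event("disk.read_error_report", EventType.REPORT)
tree.add_propagation("disk.media_fault", "disk.bad_read")
tree.add_propagation("disk.bad_read", "disk.read_error_report")
```

Rejecting links into problem events at insertion time keeps problems as the roots of every causal chain, so a diagnosis walk backward from a report always terminates at a problem.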