A system for performing autonomic monitoring in a computing grid is described. The system includes a plurality of modules which, when implemented in a computing grid, are operable to analyze objects of the grid and identify exception conditions associated with the objects. The system includes a configuration module for receiving information on specified objects to be monitored and exception conditions for the objects, an information collection module to collect job execution data associated with the objects, and an exception module to evaluate the job execution data associated with the objects and identify existing exception conditions. Related methods of performing autonomic monitoring in a grid system are also described.
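The three modules can be illustrated with a minimal sketch. Everything named here (`ExceptionRule`, `find_exceptions`, the record fields `type`, `run_seconds`, and `limit_seconds`) is hypothetical and chosen only for illustration; the abstract does not specify data formats or exception types.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ExceptionRule:
    """Configuration entry: which object type to watch and when it is in exception."""
    name: str
    applies_to: str                    # hypothetical object type, e.g. "job"
    predicate: Callable[[dict], bool]  # returns True when the condition exists


def find_exceptions(rules: List[ExceptionRule],
                    execution_data: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Evaluate collected job execution data against the configured rules."""
    found = []
    for object_id, record in execution_data.items():
        for rule in rules:
            if record.get("type") == rule.applies_to and rule.predicate(record):
                found.append((object_id, rule.name))
    return found


# Configuration step: one assumed exception condition (a job exceeding its run limit).
rules = [
    ExceptionRule("job_overrun", "job",
                  lambda r: r["run_seconds"] > r["limit_seconds"]),
]

# Collection step: execution data as it might be gathered from the grid.
execution_data = {
    "job-1": {"type": "job", "run_seconds": 120, "limit_seconds": 60},
    "job-2": {"type": "job", "run_seconds": 30, "limit_seconds": 60},
}

# Evaluation step: identify the objects currently in an exception condition.
exceptions = find_exceptions(rules, execution_data)
```

Keeping rules as data separate from the evaluation loop mirrors the split between the configuration module and the exception module.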
A self-updating grid mechanism using peer-to-peer platform protocols is described. A compute node may send another node information about its compute node configuration using peer-to-peer platform protocols. The other node may be a master node configured to manage a grid of one or more compute nodes, another compute node, or some other peer node. In one embodiment, the other node may be a node logically near the compute node. In one embodiment, the compute node may discover the other node using peer-to-peer platform protocols. The other node may determine from the compute node configuration information whether the compute node configuration needs to be updated. If the compute node configuration needs to be updated, the other node may send update information to the compute node using peer-to-peer platform protocols. The compute node may then update its compute node configuration according to the update information.
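The update decision can be sketched as a comparison between reported and desired configuration. This is a simplified illustration that assumes a configuration is a mapping from component name to version string; `plan_update`, `apply_update`, and the component names are hypothetical, and the peer-to-peer transport and discovery steps are omitted entirely.

```python
from typing import Dict


def plan_update(reported: Dict[str, str], desired: Dict[str, str]) -> Dict[str, str]:
    """On the other node: list components whose reported version differs from the desired one."""
    return {name: version
            for name, version in desired.items()
            if reported.get(name) != version}


def apply_update(config: Dict[str, str], update: Dict[str, str]) -> Dict[str, str]:
    """On the compute node: return a new configuration with the update applied."""
    updated = dict(config)
    updated.update(update)
    return updated


# The compute node reports its configuration (component names are assumptions).
node_config = {"os_patch": "1.0", "grid_agent": "2.1"}
desired_config = {"os_patch": "1.1", "grid_agent": "2.1"}

# The other node decides whether an update is needed; an empty dict means no update.
update_info = plan_update(node_config, desired_config)

# The compute node applies whatever update information it receives.
new_config = apply_update(node_config, update_info)
```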
When a new resource is allocated to a particular execution environment within a grid environment managed by a grid management system, a grid verification service automatically selects and runs at least one functionality test on the new resource as controlled by the grid management system. Responsive to a result of the functionality test, the grid verification service verifies whether the result meets an expected result before enabling routing of a grid job to the new resource, such that the functionality of the new resource is automatically verified before access to the new resource is allowed, thereby maintaining quality of service in processing grid jobs.
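A minimal sketch of the verify-before-routing gate follows. The `FunctionalityTest` and `Router` types, the test names, and the resource identifiers are all hypothetical; how tests are selected for a given resource is not modeled here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class FunctionalityTest:
    name: str
    run: Callable[[str], object]  # executes the test against a resource
    expected: object              # result the test must produce to pass


@dataclass
class Router:
    """Stand-in for the component that routes grid jobs to enabled resources."""
    enabled: Set[str] = field(default_factory=set)

    def enable(self, resource: str) -> None:
        self.enabled.add(resource)


def verify_and_enable(resource: str, tests: List[FunctionalityTest],
                      router: Router) -> bool:
    """Enable routing to the resource only if every test meets its expected result."""
    for test in tests:
        if test.run(resource) != test.expected:
            return False  # resource stays unreachable for grid jobs
    router.enable(resource)
    return True


# Two assumed tests; "node-bad" simulates a resource that fails the second one.
tests = [
    FunctionalityTest("echo", lambda r: "ok", "ok"),
    FunctionalityTest("disk", lambda r: r != "node-bad", True),
]
router = Router()
good = verify_and_enable("node-a", tests, router)
bad = verify_and_enable("node-bad", tests, router)
```

Because `router.enable` is called only after the loop completes, a resource that fails any test never becomes a routing target.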
A grid change controller within a particular grid environment detects an unintended change within that grid environment. In particular, the grid change controller monitors potential change indicators received from multiple disparate resource managers across the grid environment, where each resource manager manages a selection of resources within the grid environment. The grid change controller then determines a necessary response to the unintended change within the grid environment and communicates with at least one independent manager within the grid environment to resolve the unintended change, such that the grid environment maintains its performance requirements.
A job is submitted into a first selection of resources in a grid environment from among a hierarchy of discrete sets of resources accessible in the grid environment. Discrete sets of resources may include locally accessible resources, enterprise accessible resources, capacity on demand resources, and grid resources. The performance of the first selection of resources is monitored and compared with a required performance level for the job. If the required performance level is not met, then the discrete sets of resources are queried for available resources to meet the required performance level in an order designated by the hierarchy. Available resources in a next discrete set of resources from the hierarchy of discrete sets of resources are added to a virtual organization of resources handling the job within the grid environment.
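The escalation through the hierarchy can be sketched as follows. This is an illustration under simplifying assumptions: each resource is a hypothetical `(name, capacity)` pair, performance is modeled as the sum of capacities, and the tier names and `expand_virtual_org` function are invented for the example.

```python
from typing import Dict, List, Tuple

Resource = Tuple[str, int]  # (name, capacity); capacity units are an assumption

# Query order designated by the hierarchy of discrete resource sets.
HIERARCHY = ["local", "enterprise", "capacity_on_demand", "grid"]


def performance(virtual_org: List[Resource]) -> int:
    """Assumed performance model: total capacity of the resources handling the job."""
    return sum(capacity for _, capacity in virtual_org)


def expand_virtual_org(virtual_org: List[Resource],
                       available: Dict[str, List[Resource]],
                       required: int) -> List[Resource]:
    """Add available resources, tier by tier, until the required level is met."""
    org = list(virtual_org)
    for tier in HIERARCHY:
        for resource in available.get(tier, []):
            if performance(org) >= required:
                return org
            org.append(resource)
    return org  # may still fall short if every tier is exhausted


# The job starts on one local workstation and needs a capacity of 45.
initial = [("ws1", 10)]
available = {
    "local": [("ws2", 10)],
    "enterprise": [("srv1", 30)],
    "grid": [("g1", 100)],
}
final_org = expand_virtual_org(initial, available, required=45)
```

In this run the local and enterprise tiers are sufficient, so the grid tier is never drawn on.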