Three analytic tests are used to detect chaotic (power-tail) behavior in one or more computer system resources in a distributed computing environment. The tests determine whether data indicative of one or more parameters related to computer system resources exhibit large deviations from a mean, a high variance, and other properties consistent with large values in the tail portion of a power-tail distribution. The tests can be performed in any order, and fewer than three can be performed. If all three tests indicate the existence of power-tail behavior, chaotic behavior of the data is likely. If all three tests indicate the lack of power-tail behavior, chaotic behavior of the data is unlikely. If the results are mixed, then more data or analysis may be needed. The results may be used for modeling and/or altering the configuration of the distributed computing environment.
Managing performance metrics includes accessing a metric catalog comprising a number of metrics, where each metric is associated with a threshold value. A selection of a subset of metrics of the number of metrics is received, and a service is defined using the subset of metrics. Metric values describing performance of the service are determined, where each metric value corresponds to a threshold value associated with a metric of the subset of metrics. The metric values and the corresponding threshold values are compared, and the performance of the service is evaluated in accordance with the comparison.
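The catalog, selection, and comparison steps described above can be illustrated with a minimal Python sketch. The metric names, the threshold values, and the rule that a value at or below its threshold counts as acceptable are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    threshold: float  # threshold value associated with the metric


def define_service(catalog, selected_names):
    """Define a service as the selected subset of catalog metrics."""
    return {name: catalog[name] for name in selected_names}


def evaluate_service(service, metric_values):
    """Compare each metric value with its threshold.

    Assumption: a value at or below the threshold is acceptable.
    """
    return {
        name: metric_values[name] <= metric.threshold
        for name, metric in service.items()
    }


# Hypothetical catalog; names and thresholds are illustrative only.
catalog = {
    "latency_ms": Metric("latency_ms", 200.0),
    "error_rate": Metric("error_rate", 0.01),
    "cpu_util": Metric("cpu_util", 0.85),
}

service = define_service(catalog, ["latency_ms", "error_rate"])
results = evaluate_service(service, {"latency_ms": 150.0, "error_rate": 0.02})
service_ok = all(results.values())
```

In this sketch the service is judged acceptable only if every selected metric is within its threshold; a weighted or per-metric evaluation would fit the same structure.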
The present invention facilitates identifying applications based on packets communicated between applications. Characteristics of a communicated packet are used to identify the packet as being part of a communication between applications. Identification can be accomplished through the use of packet fingerprints or through a k-nearest-neighbor algorithm.
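As a rough illustration of the k-nearest-neighbor option, the sketch below labels a packet by majority vote among its closest labeled examples. The two features (payload size in bytes and inter-arrival time in milliseconds), the application labels, and the sample values are hypothetical; the abstract does not say which packet characteristics are used.

```python
import math
from collections import Counter


def knn_classify(sample, labeled, k=3):
    """Return the majority label among the k labeled points nearest to sample."""
    nearest = sorted(labeled, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (payload_bytes, inter_arrival_ms) -> application.
labeled_packets = [
    ((1400.0, 5.0), "web"),
    ((1300.0, 8.0), "web"),
    ((1450.0, 6.0), "web"),
    ((160.0, 20.0), "voip"),
    ((172.0, 20.0), "voip"),
    ((200.0, 21.0), "voip"),
]

label = knn_classify((180.0, 19.0), labeled_packets, k=3)
```

The features here are left unscaled for brevity, so payload size dominates the Euclidean distance; a practical classifier would normally normalize each feature first.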
A system for performing autonomic monitoring in a computing grid is described. The system includes a plurality of modules, which when implemented into a computing grid, are operable to analyze objects of the grid and identify exception conditions associated with the objects. The system includes a configuration module for receiving information on specified objects to be monitored and exception conditions for the objects, an information collection module to collect job execution data associated with the objects, and an exception module to evaluate the job execution data associated with the objects and identify existing exception conditions. Related methods of performing autonomic monitoring in a grid system are also described.
A method of analyzing a resource leak on a first Web server uses a second Web server. A first HTTP request message is received from a client at the first Web server and includes an identifier of an information component stored on the first Web server. The first Web server generates a reply to the first HTTP request message, including the information component, and sends the reply to the client. In response to receiving the first request, multiple duplicate HTTP request messages for the information component are generated at the first Web server for analyzing a resource leak on the first Web server. Each of the duplicate HTTP request messages includes the identifier of the information component. The duplicate HTTP request messages are transmitted to the second Web server, where they multiply any existing resource leak, thereby facilitating detection, diagnosis and/or analysis. Transmitting the duplicates to the second Web server keeps the first Web server free from receiving the multiple duplicate HTTP request messages.
A system for identifying chronic performance problems on data networks includes network monitoring devices that provide measurements of performance metrics such as latency, jitter, and throughput, and a processing system for analyzing measurement data to identify the onset of chronic performance problems. Network behavior is analyzed to develop a baseline performance level (e.g., computing a mean and variance of a baseline sample of performance metric measurements). An operating performance level is measured during operation of the network (e.g., computing a mean and variance of an operating sample of performance metric measurements). A chronic performance problem is identified if the operating performance level exceeds a performance threshold and a difference between the baseline performance level and the operating performance level is determined to be statistically significant. A t-test based on the baseline mean and variance and the operating mean and variance can be used to determine if this difference is significant.
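The two-part test described above (threshold exceeded, and difference statistically significant) might be sketched as follows. This is a minimal sketch using Welch's unequal-variance form of the t statistic, which is one possible choice; the abstract only says "a t-test". The function names, the sample data, the threshold, and the critical value `t_critical` are assumptions for illustration, and the metric is assumed to be one where higher is worse (such as latency).

```python
import math
import statistics


def welch_t(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Welch's t statistic for the difference of two sample means."""
    se2 = var_a / n_a + var_b / n_b
    if se2 == 0:
        # Degenerate case: no spread in either sample.
        return 0.0 if mean_a == mean_b else math.inf
    return (mean_b - mean_a) / math.sqrt(se2)


def is_chronic_problem(baseline, operating, threshold, t_critical):
    """Flag a problem if the operating mean exceeds the threshold AND
    differs significantly from the baseline mean."""
    mean_b, var_b = statistics.mean(baseline), statistics.variance(baseline)
    mean_o, var_o = statistics.mean(operating), statistics.variance(operating)
    t = welch_t(mean_b, var_b, len(baseline), mean_o, var_o, len(operating))
    return mean_o > threshold and abs(t) > t_critical


# Hypothetical latency samples in milliseconds.
baseline = [20, 22, 19, 21, 20, 23, 21, 20]
operating = [30, 32, 29, 31, 33, 30, 31, 32]

# threshold and t_critical are illustrative values, not from the source.
flagged = is_chronic_problem(baseline, operating, threshold=25.0, t_critical=2.0)
```

Passing the critical value in as a parameter keeps the sketch dependency-free; in practice it would be looked up from the t distribution for the chosen significance level and degrees of freedom.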