Semi-automated data center hotspot diagnosis
Suzanne McIntosh, Jeff Kephart, et al.
CNSM 2011
Top-k reports are compound metrics that provide useful information when diagnosing problems in a system, e.g., to identify persistent CPU usage by a process. In large systems, these reports are collected at regular intervals and need to be resampled to reduce the bitrate of the data, to answer user queries for different sampling periods, or to save space and make it possible to keep historical data for long-term performance analysis. However, resampling top-k reports, i.e., aggregating several reports collected for small time intervals into a single top-k report, can introduce inaccuracies. For example, a process that consistently uses CPU over the aggregation interval but is not included in the short-term top-k reports will be missing from the aggregated report. In this paper, we present an algorithm that collects top-k reports at regular intervals and can aggregate them with little or no error. This is accomplished by including the residual resource consumption of unreported but potentially significant entities in the top-k reports, and using these residual values during aggregation. We show different approaches to including residual resource consumption in individual top-k reports; analyze the error introduced and the parameters with which the algorithm's efficiency can be tuned; and demonstrate the effectiveness of the algorithm in a real-world scenario.
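The idea of carrying a residual alongside each top-k report can be illustrated with a minimal sketch. This is not the algorithm from the paper, whose details the abstract does not give. It assumes the simplest possible residual, a single sum of all unreported consumption per interval, and uses it to bound the aggregation error for entities that are missing from some reports. The names `Report`, `make_report`, and `aggregate` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Report:
    top: dict        # entity -> consumption of the k largest consumers
    residual: float  # assumed form: total consumption of all unlisted entities


def make_report(usage, k):
    """Build one top-k report and fold everything else into the residual."""
    ranked = sorted(usage.items(), key=lambda kv: kv[1], reverse=True)
    return Report(top=dict(ranked[:k]),
                  residual=sum(value for _, value in ranked[k:]))


def aggregate(reports, k):
    """Merge reports into one top-k list of (entity, lower, upper) tuples."""
    lower = Counter()
    for report in reports:
        lower.update(report.top)  # sum the reported values per entity

    result = []
    for entity, low in lower.items():
        slack = 0.0
        for report in reports:
            if entity not in report.top:
                # An unlisted entity consumed no more than the smallest
                # listed value and no more than the whole residual.
                cap = min(report.top.values()) if report.top else report.residual
                slack += min(cap, report.residual)
        result.append((entity, low, low + slack))

    # Rank by the guaranteed (lower-bound) consumption.
    result.sort(key=lambda item: item[1], reverse=True)
    return result[:k]
```

With two intervals `{"a": 50, "b": 30, "c": 20, "d": 5}` and `{"a": 10, "b": 40, "c": 35, "d": 5}` and k = 2, entity `a` is absent from the second report, so its aggregated value is reported as the range 50 to 65, which contains its true total of 60. An entity that never appears in any report remains invisible in this simplified form; only the summed residuals hint at it, which is the kind of case the per-entity residuals described in the abstract are meant to address.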