Identifying missed monitoring alerts based on unstructured incident tickets
Abstract
Automatic system monitoring is an efficient and reliable means of problem detection in enterprise IT infrastructures. The performance of a monitoring system depends on its configuration, which is specified by system administrators. In large, dynamic IT environments, the infrastructure changes frequently to meet business requirements, so the configuration may not always be consistent with the current state. Misconfigurations can lead to false positives (false alarms) and false negatives (missed alerts) for the system administrators. False negatives can cause serious system faults. This paper presents an automatic approach for discovering false negatives from incident tickets that are created by humans. The discovered results help system administrators correct the misconfigurations and minimize false negatives in the future. The approach applies a text classification model to analyze the descriptions of incident tickets and identify the corresponding system issues. Domain knowledge describing those issues can be incorporated to assist the model. Experiments are conducted on real system incident tickets from a large enterprise IT infrastructure. The experimental results demonstrate the effectiveness of the proposed approach. © 2013 IEEE.
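The general idea of classifying human-created tickets and flagging those that describe a monitorable issue can be illustrated with a minimal sketch. The abstract does not name the classifier or the form of the domain knowledge, so everything here is an assumption made for illustration: a small multinomial Naive Bayes model stands in for the text classification model, a hypothetical keyword set (`DOMAIN_KEYWORDS`) stands in for the domain knowledge, and the ticket fields `source` and `text`, the labels, and the sample tickets are invented.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical domain-knowledge vocabulary (illustrative, not from the paper).
DOMAIN_KEYWORDS = {"disk", "filesystem", "space", "cpu", "load", "processor"}


def features(text, keyword_weight=2):
    """Lowercase word tokens; domain keywords are repeated to weight them up."""
    out = []
    for tok in re.findall(r"[a-z]+", text.lower()):
        out.extend([tok] * (keyword_weight if tok in DOMAIN_KEYWORDS else 1))
    return out


class TicketClassifier:
    """Multinomial Naive Bayes with Laplace smoothing (a stand-in model)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            for tok in features(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        n_docs = sum(self.label_counts.values())
        toks = features(text)

        def score(label):
            # Log prior plus smoothed log likelihood of each token.
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            s = math.log(self.label_counts[label] / n_docs)
            for tok in toks:
                s += math.log((self.word_counts[label][tok] + 1) / total)
            return s

        return max(sorted(self.label_counts), key=score)


def find_missed_alerts(tickets, clf, non_issue_label="other"):
    """Return human-created tickets that the classifier maps to a system issue.

    Such tickets are candidate false negatives: the issue was reported by a
    person, which suggests the monitoring system did not raise an alert.
    """
    return [
        t for t in tickets
        if t["source"] == "human" and clf.predict(t["text"]) != non_issue_label
    ]


# Invented training data, for illustration only.
TRAIN = [
    ("filesystem /var is full, no disk space left", "disk"),
    ("disk usage above threshold on db server", "disk"),
    ("cpu load very high on app server", "cpu"),
    ("processor utilization at 100 percent", "cpu"),
    ("user cannot log in to portal", "other"),
    ("password reset requested by user", "other"),
]
clf = TicketClassifier().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])

tickets = [
    {"source": "human", "text": "no space left on filesystem"},
    {"source": "human", "text": "user requested password reset"},
    {"source": "monitor", "text": "cpu load high"},
]
missed = find_missed_alerts(tickets, clf)
```

In this sketch, only the first ticket is flagged: it was created by a human and is classified as a disk issue, so it is a candidate for a monitoring rule that failed to fire. A real deployment would need labeled tickets from the target environment and a more careful treatment of the domain vocabulary.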