LOGAN: Problem diagnosis in the cloud using log-based reference models
Abstract
Problem diagnosis is a crucial aspect of cloud operations that is becoming increasingly challenging. On the one hand, the volume of logs generated in today's clouds is overwhelmingly large. On the other hand, cloud architectures are becoming more distributed and complex, which makes failures harder to troubleshoot. To address these challenges, we have developed a tool, called LOGAN, that enables operators to quickly identify the log entries that potentially lead to the root cause of a problem. LOGAN constructs behavioral reference models from logs that represent normal execution patterns. When a problem occurs, the tool lets operators inspect how the current logs diverge from the reference model and highlights the logs likely to contain hints to the root cause. To support these capabilities, we have designed and developed several mechanisms. First, we developed log correlation algorithms that use the various IDs embedded in logs to identify and isolate the log entries that belong to the failed request. Second, we provide efficient log comparison to help operators understand the differences between executions. Finally, we designed mechanisms to highlight critical log entries that are likely to contain information pertaining to the root cause of the problem. We have implemented the proposed approach for a popular cloud management system, OpenStack, and through case studies we demonstrate that the tool can help operators perform problem diagnosis quickly and effectively.
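The correlation and comparison steps described above can be illustrated with a minimal sketch. This is not LOGAN's implementation: the `req-...` ID pattern, the masking rules in `to_template`, and the use of Python's `difflib` for comparison are all assumptions made for illustration, as are the function names.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

# Assumed ID format: a "req-" prefix followed by hex digits and dashes.
REQ_ID = re.compile(r"req-[0-9a-f-]+")


def group_by_request(lines):
    """Correlate log lines by the request ID embedded in each line."""
    groups = defaultdict(list)
    for line in lines:
        match = REQ_ID.search(line)
        if match:  # lines without a request ID are skipped in this sketch
            groups[match.group(0)].append(line)
    return dict(groups)


def to_template(line):
    """Mask variable parts (request IDs, numbers) so runs can be compared."""
    line = REQ_ID.sub("<REQ>", line)
    return re.sub(r"\d+", "<NUM>", line)


def divergence(reference, current):
    """Return (missing, extra): reference lines absent from the current
    run, and current lines that the reference run does not contain."""
    ref_templates = [to_template(l) for l in reference]
    cur_templates = [to_template(l) for l in current]
    matcher = SequenceMatcher(a=ref_templates, b=cur_templates, autojunk=False)
    missing, extra = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            missing.extend(reference[i1:i2])
        if tag in ("replace", "insert"):
            extra.extend(current[j1:j2])
    return missing, extra
```

In this sketch, a successful request's log sequence plays the role of the reference model, and the `extra` lines of a failed request are the candidates an operator would inspect first.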