Causal Modeling Based Fault Localization in Cloud Systems using Golden Signals
Abstract
In cloud-native applications, a large fraction of operational failures (or outages) result from violations of Service Level Objectives (SLOs) defined on either service errors or service latency (two of the so-called golden signals). Such failures are reported to Service Reliability Engineers (SREs) through monitoring of incoming service requests and the associated service responses (errors and/or latency), for example when the number of service errors exceeds the SLO threshold. However, isolating the cause of such failures is complicated by the complex and dynamic interactions among the application's micro-services. An SRE typically has to investigate the logs emitted by individual micro-services, along with the relevant operational metrics (e.g., resource utilization), to identify the underlying faulty micro-service and its associated components (e.g., pods). This manual fault localization process results in a substantially longer Mean Time To Resolution (MTTR) for outages. In this paper, we propose a lightweight fault localization system that can greatly reduce the human effort and the dependency on domain knowledge needed to localize such golden-signal-based operational failures. Our technique establishes causal relationships between the golden signal service errors and the error logs emitted by the constituent micro-services (all modeled as time series data). The proposed framework further leverages the PageRank centrality of the derived causal graph to generate a ranked list of faulty micro-services. Our experimental results show that our system can localize operational faults with high accuracy (F1=90.4%), underscoring the effectiveness of using golden signal error rates in fault localization.
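The ranking step described above can be illustrated with a minimal sketch. It assumes a causal graph has already been derived; the abstract does not state which causal inference method is used, so the edges below are hypothetical, as are the service names (`db`, `orders`, `cart`, `frontend`) and the choice to reverse the edges so that score flows from effects back toward their causes. The PageRank routine is a plain power iteration written for illustration, not the authors' implementation.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over directed (src, dst) edges."""
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical causal edges (cause, effect) between error time series.
# "golden_signal_errors" stands for the monitored service error rate.
nodes = ["db", "orders", "cart", "frontend", "golden_signal_errors"]
causal_edges = [
    ("db", "orders"),
    ("db", "cart"),
    ("orders", "frontend"),
    ("cart", "frontend"),
    ("frontend", "golden_signal_errors"),
]

# Assumption: reverse the edges so that PageRank mass accumulates on causes.
reversed_edges = [(effect, cause) for cause, effect in causal_edges]
scores = pagerank(nodes, reversed_edges)

# Rank micro-services only; the golden signal node itself is not a candidate.
candidates = [node for node in nodes if node != "golden_signal_errors"]
ranked = sorted(candidates, key=scores.get, reverse=True)
print(ranked)
```

In this toy graph the `db` node ranks first, because every causal path leading to the golden signal errors originates there. Whether to run PageRank on the causal graph as derived or on its reverse is a design choice that depends on the edge convention of the causal model.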