Haoran Qiu, Weichao Mao, et al.
ASPLOS 2024
Main Message: Integrating LLMs into log analytics for software applications introduces engineering challenges that affect performance, resource consumption, and cost. This presentation highlights these challenges, presents strategies to address them, and shares insights on optimizing LLM usage to improve efficiency and scale log analytics solutions in an enterprise setting.
Detailed Description: A Site Reliability Engineer's (SRE) responsibilities of ensuring availability, performance, efficiency, monitoring, incident response, and capacity planning are increasingly complex due to the scale of software systems and the vast volume of telemetry data (logs, metrics, traces). Even the most experienced SRE teams face challenges, especially with the rapidly proliferating data generated by cloud-native technologies. Artificial Intelligence for IT Operations (AIOps) platforms and tools mine huge volumes of telemetry data to derive insights that help SREs identify and resolve problems quickly and efficiently. Advances in Large Language Models (LLMs) over the past few years have been highly promising, and AIOps platforms stand to benefit significantly from their integration. However, integrating LLMs into AIOps platforms comes with its own set of challenges, particularly in terms of resource consumption, cost, scalability, and performance.

This presentation shares our experiences in building a Log Analytics tool for AIOps that leverages state-of-the-art LLMs to process huge volumes of log data by predicting log-specific insights: golden signals, fault categories, and entity recognition for anomaly detection. Most importantly, we explore running the LLM-based log analytics tool on CPUs instead of the typical GPU requirement, highlighting optimizations that ensure efficient processing with minimal execution time, even for large log volumes. We also examine key engineering challenges in developing such a tool, particularly those associated with log data, including its diversity, date-time variability, and the complexity of scaling across multiple software applications.

To address the primary challenge of running LLMs for log analytics on CPUs, we exploit an inherent property of log data: its repeatability. Similar log lines can be grouped into clusters/templates, which typically constitute only a fraction of the whole log dump. The proposed approach therefore uses log templatization to cluster similar logs and selects a representative from each cluster as input to the LLM. This significantly reduces the number of LLM calls compared to calling the LLM for each log line, optimizing both the performance and the cost of processing large log volumes with minimal execution time. However, the quality of the log data, particularly the purity of the log clusters, can affect the number of LLM calls, leading to higher turnaround time and cost and thereby affecting overall performance. Additionally, ablation studies demonstrate the benefits of broadcast strategies, which transmit predicted golden-signal values from cluster representatives to their constituents (the actual log lines), enhancing overall performance (see the sketch following this description).

The tool was launched for 10 products in March 2024 and scaled to 76 products over the next three months. By December 2024, it had successfully processed 1376 cases, handling 877 GB of data and 1.04 billion log lines. Feedback was received for 249 cases, of which 53.79% of respondents found the tool beneficial. Furthermore, 60.4% of the products reported saving at least 30 minutes per trigger, amounting to 8093 minutes of time savings.

Finally, this presentation includes a practical case study on leveraging LLMs for large-scale log analysis. Attendees will gain insights into building scalable, high-performance solutions and enhancing model quality with constrained resources, along with actionable strategies for addressing real-world challenges in building log analysis tools and beyond.
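The clustering-and-broadcast pipeline described above can be illustrated with a minimal Python sketch. The masking rules, the llm_classify stub, and the golden-signal labels below are illustrative assumptions for demonstration, not the production templatizer or model used by the tool; it only shows the general pattern of one LLM call per template with results broadcast back to all member lines.

```python
# Minimal sketch of template-based LLM call reduction for log analytics.
# Assumptions: the masking rules, llm_classify() stub, and signal labels
# are hypothetical placeholders, not the authors' implementation.

import re
from collections import defaultdict

# Masking rules that collapse variable parts (timestamps, hex IDs, numbers)
# so that structurally identical log lines share one template.
MASKS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?"), "<TS>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def templatize(line: str) -> str:
    """Reduce a raw log line to its template by masking variable tokens."""
    for pattern, token in MASKS:
        line = pattern.sub(token, line)
    return line

def llm_classify(representative: str) -> str:
    """Hypothetical LLM call predicting a golden signal for one log line.
    In practice this would invoke a served model; here it is a stub."""
    if "error" in representative.lower():
        return "error"
    return "latency" if "slow" in representative.lower() else "none"

def analyze(log_lines: list[str]) -> list[str]:
    """Cluster logs by template, call the LLM once per cluster, then
    broadcast the predicted golden signal back to every member line."""
    clusters: dict[str, list[int]] = defaultdict(list)
    for idx, line in enumerate(log_lines):
        clusters[templatize(line)].append(idx)

    predictions = [""] * len(log_lines)
    for members in clusters.values():
        representative = log_lines[members[0]]   # one LLM call per cluster
        signal = llm_classify(representative)
        for idx in members:                      # broadcast to constituents
            predictions[idx] = signal
    return predictions

if __name__ == "__main__":
    logs = [
        "2024-03-01 10:00:01 request 4821 completed in 35 ms",
        "2024-03-01 10:00:02 request 4822 completed in 37 ms",
        "2024-03-01 10:00:03 ERROR connection to 10.0.0.5 refused",
    ]
    print(analyze(logs))   # two LLM calls instead of three
```

In this sketch the first two lines collapse to the same template, so the LLM is invoked once for their representative and the prediction is broadcast to both, mirroring how cluster purity governs the number of LLM calls and hence cost and turnaround time.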
Takeaway Message: