Towards Automating the AI Operations Lifecycle
Matthew Arnold, Jeffrey Boston, et al.
MLSys 2020
For a cloud service provider, the goal is to proactively identify signals that can help reduce outages, mean-time-to-detect, and mean-time-to-resolve. After an incident is reported, Site Reliability Engineers diagnose the fault and search for a resolution by formulating a textual query to find similar historical incidents, an approach called text-based retrieval. However, it has been observed that the queries formulated this way are often short and inadequate. An alternative approach, presented in this paper, integrates information spread across heterogeneous and siloed datasets into a ready-to-use knowledge base for metadata-based resolution retrieval. Additionally, it exploits historical problem context to build metadata prediction models, which are used at run-time to automatically formulate queries from the log anomalies detected by the Log Anomaly Detection module. The query thus formed is run against a metadata-based index, rather than the text-based index used in text retrieval, which improves the relevance of the resolution documents retrieved. Through experiments on web application server applications deployed on the cloud, we show the efficacy of metadata-based retrieval: it returns more targeted results than text-based retrieval, and the relevant resolution document appears among the top 3 positions for 60% of the queries.
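The idea of querying a metadata-based index can be illustrated with a minimal sketch. It is not the paper's implementation: the document IDs, the metadata fields (`component`, `symptom`), and the match-count ranking are all hypothetical, and the query is hardcoded here where the paper describes it as predicted by models from detected log anomalies.

```python
from collections import defaultdict

# Hypothetical resolution documents, each tagged with assumed metadata fields.
DOCS = {
    "doc-1": {"component": "jdbc", "symptom": "connection_timeout"},
    "doc-2": {"component": "jvm", "symptom": "out_of_memory"},
    "doc-3": {"component": "jdbc", "symptom": "auth_failure"},
}


def build_index(docs):
    """Map each (field, value) pair to the set of documents carrying it."""
    index = defaultdict(set)
    for doc_id, metadata in docs.items():
        for field, value in metadata.items():
            index[(field, value)].add(doc_id)
    return index


def retrieve(index, query, top_k=3):
    """Rank documents by how many query metadata fields they match."""
    scores = defaultdict(int)
    for field, value in query.items():
        # .get avoids creating empty entries for unseen (field, value) pairs.
        for doc_id in index.get((field, value), ()):
            scores[doc_id] += 1
    # Highest match count first; ties broken by document ID for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


index = build_index(DOCS)
# Stand-in for a query that a metadata prediction model might produce.
predicted_query = {"component": "jdbc", "symptom": "connection_timeout"}
results = retrieve(index, predicted_query)
```

In this toy setup, a document matching both predicted fields ranks above one matching only the component, which is the sense in which a structured query can return more targeted results than a short free-text query.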
Genady Ya. Grabarnik, Filippo Poltronieri, et al.
CASCON 2023
Saurabh Pujar, Luca Buratti, et al.
DAC 2023
Shubhi Asthana, Ruchi Mahindru
Big Data 2022