CONAN: Diagnosing Batch Failures for Cloud Systems
- Liqun Li ,
- Xu Zhang ,
- Shilin He ,
- Yu Kang ,
- Hongyu Zhang ,
- Minghua Ma ,
- Yingnong Dang ,
- Zhangwei Xu ,
- Saravan Rajmohan ,
- Qingwei Lin 林庆维 ,
- Dongmei Zhang
ICSE 2023 SEIP |
Failure diagnosis is critical to the maintenance of large-scale cloud systems, which has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which occur to a batch of instances of the same subject (e.g., API requests, VMs, nodes, etc.), resulting in degraded service availability and performance. Manual investigation over a large volume of high dimensional telemetry data (e.g., logs, traces, and metrics) is labor-intensive and time-consuming, like finding a needle in a haystack. Meanwhile, existing proposed approaches are usually tailored for specific scenarios, which hinders their applications in diverse scenarios. According to our experience with Azure and Microsoft 365 – two world-leading cloud systems, when batch failures happen, the procedure of finding the root cause can be abstracted as looking for contrast patterns by comparing two groups of instances, such as failed vs. succeeded, slow vs. normal, or during vs. before an anomaly. We thus propose CONAN, an efficient and flexible framework that can automatically extract contrast patterns from contextual data. CONAN has been successfully integrated into multiple diagnostic tools for various products, which proves its usefulness in diagnosing real-world batch failures.