Large Language Models Can Provide Accurate and Interpretable Incident Triage
- Zexin Wang
- Minghua Ma
- Ze Li
- Chetan Bansal
- Saravan Rajmohan
- Qingwei Lin 林庆维
- Dongmei Zhang
2024 International Symposium on Software Reliability Engineering
Large-scale cloud services frequently experience incidents that can have a significant impact on their stability. Incident triage is a critical process that assigns incidents to dedicated teams for resolution. However, traditional rule-based methods, commonly employed in various systems, have limitations due to a finite set of rules that necessitate continuous updates, leading to suboptimal performance. Current state-of-the-art approaches primarily rely on textual information, utilizing classifiers or unsupervised clustering. Unfortunately, the abundance of textual information, combined with considerable noise, presents a significant challenge to the accuracy of these methods. To tackle these challenges, we introduce COMET, an innovative system that utilizes an AutoExtractor to filter out non-critical logs and employs a Large Language Model (LLM) for keyword extraction. This approach effectively mitigates the complexity arising from disordered textual information. Additionally, COMET incorporates significant domain knowledge during keyword extraction, enhancing the LLM’s comprehension of the text. We deployed COMET on multiple cloud services within Microsoft, where it has operated continuously for over six months. Offline and online evaluations have shown that COMET achieves enhanced accuracy and reduced Time to Mitigation (TTM).
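The pipeline described in the abstract (filter non-critical logs, extract keywords with domain knowledge, then route the incident) can be illustrated with a small sketch. This is not COMET's implementation: the abstract does not specify how AutoExtractor decides which logs are non-critical, what prompt or model the LLM step uses, or how keywords are mapped to teams. Here the filter is assumed to be a simple severity match, the LLM step is replaced by a glossary lookup, and routing is assumed to be keyword overlap with per-team profiles. All function names, log lines, and team names are hypothetical.

```python
import re
from collections import Counter

# Assumption: severity markers stand in for AutoExtractor's (unspecified) filtering logic.
CRITICAL_LEVELS = ("ERROR", "FATAL", "EXCEPTION")


def filter_logs(lines):
    """Keep only log lines that contain a critical severity marker."""
    return [ln for ln in lines if any(level in ln.upper() for level in CRITICAL_LEVELS)]


def extract_keywords(lines, domain_terms):
    """Stand-in for the LLM keyword-extraction step.

    Counts tokens that appear in a domain glossary; a real system would
    prompt an LLM with the filtered text and the domain knowledge instead.
    """
    tokens = re.findall(r"[a-z][a-z0-9_]+", " ".join(lines).lower())
    return Counter(t for t in tokens if t in domain_terms)


def triage(keywords, team_profiles):
    """Pick the team whose keyword profile overlaps most; None if nothing matches."""
    scores = {
        team: sum(keywords[k] for k in profile)
        for team, profile in team_profiles.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical incident data, for illustration only.
logs = [
    "INFO heartbeat ok",
    "ERROR disk latency exceeded threshold on storage node",
    "DEBUG cache refreshed",
    "FATAL storage replica unavailable",
]
domain_terms = {"disk", "storage", "replica", "latency", "dns", "gateway"}
team_profiles = {
    "storage-team": {"disk", "storage", "replica"},
    "network-team": {"dns", "gateway", "latency"},
}

critical = filter_logs(logs)
keywords = extract_keywords(critical, domain_terms)
team = triage(keywords, team_profiles)
print(team)  # prints "storage-team"
```

In this toy version, the extracted keywords double as the explanation for the routing decision, which is one way to read the "interpretable" claim in the title; the paper itself should be consulted for how COMET actually surfaces its reasoning.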