Towards intelligent incident management: why we need it and how we make it
- Zhuangbin Chen
- Yu Kang
- Liqun Li
- Xu Zhang
- Hongyu Zhang
- Hui Xu
- Yangfan Zhou
- Li Yang
- Jeffrey Sun
- Zhangwei Xu
- Yingnong Dang
- Feng Gao
- Pu Zhao
- Bo Qiao
- Qingwei Lin 林庆维
- Dongmei Zhang
- Michael R. Lyu
2020 Foundations of Software Engineering
Published by ACM
The management of cloud service incidents (unplanned interruptions or outages of a service/product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises are able to resolve most incidents automatically and in a timely manner. However, in practice, we still observe critical service incidents that occur unexpectedly and that orchestrated diagnosis workflows fail to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ the strategy of AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and an understanding of modern incident management systems, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate their underlying causes from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show the practical benefits it brings to the cloud services of Microsoft.