How Incidental are the Incidents? Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
- Junjie Chen
- Shu Zhang
- Xiaoting He
- Qingwei Lin 林庆维
- Hongyu Zhang
- Dan Hao
- Yu Kang
- Feng Gao
- Zhangwei Xu
- Yingnong Dang
- Dongmei Zhang
2020 Automated Software Engineering
Published by ACM
Although tremendous effort has been devoted to the quality assurance of online service systems, in practice these systems still encounter many incidents (i.e., unplanned interruptions and outages), which can decrease user satisfaction or cause economic loss. To better understand the characteristics of incidents and improve the incident management process, we perform the first large-scale empirical analysis of incidents collected from 18 real-world online service systems at Microsoft. Surprisingly, we find that although a large number of incidents can occur over a short period of time, many of them do not actually matter, i.e., engineers will not fix them with a high priority after manually identifying their root cause. We call these incidents incidental incidents. Our qualitative and quantitative analyses show that incidental incidents are significant in terms of both number and cost. Therefore, it is important to prioritize incidents by identifying incidental incidents in advance, so that incident management efforts can be optimized. To this end, we propose an approach, called DeepIP (Deep learning based Incident Prioritization), that prioritizes incidents based on a large amount of historical incident data. More specifically, we design an attention-based Convolutional Neural Network (CNN) to learn a prediction model that identifies incidental incidents. We then prioritize all incidents by ranking their predicted probabilities of being incidental. We evaluate the performance of DeepIP using real-world incident data. The experimental results show that DeepIP effectively prioritizes incidents by identifying incidental incidents and significantly outperforms all the compared approaches. For example, on average, DeepIP achieves an AUC of 0.808, while the best compared approach achieves only 0.624.
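The ranking and evaluation steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the DeepIP implementation: the attention-based CNN is omitted, and the predicted probabilities are assumed to be already available. The function names `prioritize` and `auc`, the incident IDs, and the choice to place the incidents least likely to be incidental first are all assumptions made for illustration only.

```python
from typing import List, Sequence


def prioritize(incident_ids: Sequence[str], p_incidental: Sequence[float]) -> List[str]:
    """Order incidents by predicted probability of being incidental, ascending.

    Assumption: incidents that are least likely to be incidental should be
    handled first. Ties keep their input order because sorted() is stable.
    """
    pairs = sorted(zip(incident_ids, p_incidental), key=lambda pair: pair[1])
    return [incident_id for incident_id, _ in pairs]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen incidental incident
    (label 1) receives a higher score than a randomly chosen
    non-incidental one (label 0); ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical incidents with made-up model outputs and ground-truth labels.
ids = ["INC-1", "INC-2", "INC-3", "INC-4"]
probs = [0.9, 0.2, 0.6, 0.1]   # predicted probability of being incidental
truth = [1, 0, 1, 0]           # 1 = incidental, 0 = not incidental

print(prioritize(ids, probs))
print(auc(truth, probs))
```

The pairwise AUC is quadratic in the number of incidents, which is fine for a small illustration; a rank-based formulation would be preferable for large data sets.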