Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning

Haozhe Li; Minghua Ma; Chetan Bansal; Saravan Rajmohan; Qingwei Lin 林庆维; Dongmei Zhang

2024 International Symposium on Software Reliability Engineering | October 2024

In the rapidly expanding domain of cloud computing, a variety of software services have been deployed in the cloud. To ensure the reliability of cloud services, prior studies focus on predicting failures of instances such as disks, nodes, and switches. Once the prediction output is positive, mitigation actions are initiated to resolve the underlying issue. However, our real-world practice in Microsoft Azure revealed a decline in prediction accuracy of approximately 9% after model retraining. The decrease is attributed to the mitigation actions, which can result in uncertain positive instances. Since these instances cannot be verified after mitigation, they may introduce additional noise into the model updating process. To the best of our knowledge, we are the first to identify this Uncertain Positive Learning (UPLearning) issue in the real-world cloud failure prediction scenario, and we design an Uncertain Positive Learning Risk Estimator (Uptake) approach to address this problem. Using two real-world datasets for disk failure prediction and node prediction experiments conducted in Azure, a top-tier cloud provider serving millions of users, we demonstrate that our Uptake method can significantly enhance failure prediction accuracy, by an average of 5%.