Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction
- Yudong Liu
- Minghua Ma
- Pu Zhao
- Chetan Bansal
- Saravan Rajmohan
- Qingwei Lin 林庆维
- Dongmei Zhang
2024 International Symposium on Software Reliability Engineering
As cloud services continue to dominate various sectors, the reliability of cloud infrastructure becomes crucial. Traditional failure prediction methods often leave too little time for preventative measures. This paper presents Early Bird, a failure prediction framework designed to address this challenge by integrating novel data handling and prediction strategies. Our approach combines enhanced sample generation techniques with an adaptive loss function in a unified prediction model, aiming for early and precise failure detection. We present a comprehensive analysis conducted at Microsoft, demonstrating that Early Bird can predict potential failures up to 20 minutes earlier than conventional methods while maintaining accuracy across various prediction models, including LSTM and Transformer.
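The abstract does not describe how Early Bird generates its samples, so the following is only a generic sketch of how training samples for early failure prediction are commonly labeled, not the paper's method. It assumes telemetry indexed by discrete time steps; the function name `label_windows` and the parameters `window`, `lead`, and `horizon` are hypothetical names introduced for illustration. A window of observations is labeled positive when a failure occurs at least `lead` steps after the window ends, which is what gives operators time to react.

```python
from typing import List, Sequence, Tuple


def label_windows(
    num_steps: int,
    failure_steps: Sequence[int],
    window: int,
    lead: int,
    horizon: int,
) -> List[Tuple[int, int]]:
    """Label sliding observation windows for early failure prediction.

    Illustrative assumption, not taken from the paper: a window ending at
    step `end` (covering steps end-window+1 .. end) is labeled 1 if any
    failure occurs in the target interval [end+lead, end+lead+horizon-1],
    and 0 otherwise.
    """
    failures = set(failure_steps)
    samples = []
    for end in range(window - 1, num_steps):
        target_start = end + lead
        target = range(target_start, target_start + horizon)
        label = int(any(step in failures for step in target))
        samples.append((end, label))
    return samples


# Example: 10 time steps, one failure at step 8.
# Windows of 3 steps, predicting 2 steps ahead over a 2-step target interval.
samples = label_windows(num_steps=10, failure_steps=[8], window=3, lead=2, horizon=2)
positive_ends = [end for end, label in samples if label == 1]
print(positive_ends)  # window end steps that receive a positive label
```

Increasing `lead` moves the positive labels further ahead of the failure, trading the strength of the predictive signal for more reaction time.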