Diffusion-Based Time Series Data Imputation for Cloud Failure Prediction at Microsoft 365

  • Fangkai Yang
  • Wenjie Yin
  • Lu Wang
  • Tianci Li
  • Pu Zhao
  • Bo Liu
  • Paul Wang
  • Bo Qiao
  • Yudong Liu
  • Mårten Björkman
  • S. Rajmohan
  • Dongmei Zhang

FSE'23 Industry


Ensuring reliability in large-scale cloud systems such as Microsoft 365 is crucial. Cloud failures, such as disk and node failures, threaten service reliability and cause service interruptions and financial loss. Existing work focuses on predicting failures so that action can be taken proactively before they happen. However, these approaches suffer from poor data quality, such as missing data during model training and prediction, which limits their performance. In this paper, we focus on enhancing data quality through data imputation. We propose Diffusion+, a sample-efficient diffusion model that imputes missing data efficiently, conditioned on the observed data. Experiments on industrial datasets and application in practice show that our model improves the performance of downstream failure prediction.