Mitigating Metastable Failures in Distributed Systems with Offline Reinforcement Learning
- Yueying Li,
- Daochen Zha,
- Tianjun Zhang,
- G. Edward Suh,
- Christina Delimitrou,
- Francis Y. Yan
International Conference on Learning Representations (ICLR ’23)
This paper introduces a load shedding mechanism that mitigates metastable failures through offline reinforcement learning (RL). Prior work has largely relied on heuristics that are reactive and generalize poorly, while online RL algorithms face challenges in accurately simulating system dynamics and collecting data with sufficient coverage. In contrast, our algorithm leverages offline RL to learn directly from existing log data. Through extensive empirical experiments, we demonstrate that it outperforms rule-based methods and supervised learning algorithms while being proactive, adaptive, generalizable, and safe. Deployed in a Java compute service with diverse execution times and configurations, our algorithm reacts faster to overload and attains the Pareto frontier between throughput and tail latency.
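For intuition on learning a load-shedding policy from logs alone, the sketch below trains a policy with tabular fitted Q-iteration over synthetic logged transitions. The state features, reward shaping, and discretization are illustrative assumptions; this is not the paper's actual algorithm or deployment code.

```python
import numpy as np

# Hypothetical logged transitions (state, action, reward, next_state).
# Assumed state features: [queue_length, cpu_util]; actions: 0 = admit, 1 = shed.
rng = np.random.default_rng(0)
N = 5000
states = rng.random((N, 2))
actions = rng.integers(0, 2, size=N)
# Assumed reward: admitting earns throughput credit but is penalized when the
# queue is long (a crude tail-latency proxy); shedding is neutral.
rewards = np.where(actions == 0, 1.0 - 2.0 * (states[:, 0] > 0.8), 0.0)
next_states = np.clip(states + rng.normal(0, 0.05, states.shape), 0.0, 1.0)

BINS = 10  # discretize each state feature into 10 buckets

def discretize(s):
    """Map a continuous state to a grid cell index."""
    return tuple(np.minimum((s * BINS).astype(int), BINS - 1))

Q = np.zeros((BINS, BINS, 2))
gamma = 0.95
for _ in range(50):
    # Bellman targets from the logged data only (no environment interaction).
    targets = rewards + gamma * np.array(
        [Q[discretize(ns)].max() for ns in next_states]
    )
    # Average targets into Q-values for each visited (state, action) cell.
    sums = np.zeros_like(Q)
    counts = np.zeros_like(Q)
    for s, a, t in zip(states, actions, targets):
        i, j = discretize(s)
        sums[i, j, a] += t
        counts[i, j, a] += 1
    Q = np.where(counts > 0, sums / np.maximum(counts, 1), Q)

def should_shed(state):
    """Greedy load-shedding decision from the learned Q-values."""
    return int(np.argmax(Q[discretize(np.asarray(state))]))

print(should_shed([0.9, 0.5]))  # long queue -> shedding tends to win
```

Because training consumes only recorded (state, action, reward, next-state) tuples, the same loop runs on historical service logs without ever risking live traffic, which is the core appeal of the offline RL setting described above.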