Mitigating Metastable Failures in Distributed Systems with Offline Reinforcement Learning
- Yueying Li,
- Daochen Zha,
- Tianjun Zhang,
- G. Edward Suh,
- Christina Delimitrou,
- Francis Y. Yan
International Conference on Learning Representations (ICLR ’23)
This paper introduces a load shedding mechanism that mitigates metastable failures through offline reinforcement learning (RL). Prior work has largely relied on heuristics that are reactive and generalize poorly, while online RL algorithms face challenges in accurately simulating system dynamics and collecting data with sufficient coverage. In contrast, our algorithm leverages offline RL to learn directly from existing log data. Through extensive empirical experiments, we demonstrate that it outperforms rule-based methods and supervised learning algorithms while being proactive, adaptive, generalizable, and safe. Deployed in a Java compute service with diverse execution times and configurations, our algorithm reacts faster to overload and attains the Pareto frontier between throughput and tail latency.
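For intuition on learning a load-shedding policy from logs alone, the sketch below trains a policy with tabular fitted Q-iteration over synthetic logged transitions. The state features, reward shaping, and discretization are illustrative assumptions; this is not the paper's actual algorithm or deployment code.

```python
import numpy as np

# Hypothetical logged transitions (state, action, reward, next_state).
# Assumed state features: [queue_length, cpu_util]; actions: 0 = admit, 1 = shed.
rng = np.random.default_rng(0)
N = 5000
states = rng.random((N, 2))
actions = rng.integers(0, 2, size=N)
# Assumed reward: admitting earns throughput credit but is penalized when the
# queue is long (a crude tail-latency proxy); shedding is neutral.
rewards = np.where(actions == 0, 1.0 - 2.0 * (states[:, 0] > 0.8), 0.0)
next_states = np.clip(states + rng.normal(0, 0.05, states.shape), 0.0, 1.0)

BINS = 10  # discretize each state feature into 10 buckets

def discretize(s):
    """Map a continuous state to a grid cell index."""
    return tuple(np.minimum((s * BINS).astype(int), BINS - 1))

Q = np.zeros((BINS, BINS, 2))
gamma = 0.95
for _ in range(50):
    # Bellman targets from the logged data only (no environment interaction).
    targets = rewards + gamma * np.array(
        [Q[discretize(ns)].max() for ns in next_states]
    )
    # Average targets into Q-values for each visited (state, action) cell.
    sums = np.zeros_like(Q)
    counts = np.zeros_like(Q)
    for s, a, t in zip(states, actions, targets):
        i, j = discretize(s)
        sums[i, j, a] += t
        counts[i, j, a] += 1
    Q = np.where(counts > 0, sums / np.maximum(counts, 1), Q)

def should_shed(state):
    """Greedy load-shedding decision from the learned Q-values."""
    return int(np.argmax(Q[discretize(np.asarray(state))]))

print(should_shed([0.9, 0.5]))  # long queue -> shedding tends to win
```

Because training consumes only recorded (state, action, reward, next-state) tuples, the same loop runs on historical service logs without ever risking live traffic, which is the core appeal of the offline RL setting described above.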