If At First You Don’t Succeed, Try, Try, Again…? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

ACM SOSP 2024

Retry, the re-execution of a task on failure, is a common mechanism for building resilient software systems. Yet, despite its prevalence and long history, retry remains difficult to implement and test in modern systems. Guided by our study of real-world retry issues, we propose a novel suite of static and dynamic techniques to detect retry problems in software systems. In particular, we find that the ad-hoc nature of retry implementations poses challenges for traditional program analysis but can be handled well by Large Language Models; we also find that carefully repurposing existing unit tests, together with fault injection, can expose various types of retry problems.
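
To make the idea of pairing unit tests with fault injection concrete, the following is a minimal sketch and not the tooling described in the paper. All names here (`TransientError`, `call_with_retry`, `FaultInjector`, `run_with_injected_faults`) are hypothetical and introduced only for illustration. It assumes a simple capped retry loop and shows how injecting a fixed number of failures lets a test check both that retry recovers and that it stops once the attempt budget is spent.

```python
import time


class TransientError(Exception):
    """Hypothetical recoverable failure, e.g. a dropped connection."""


def call_with_retry(task, max_attempts=3, delay_s=0.0):
    """Run task, retrying on TransientError up to max_attempts total tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            time.sleep(delay_s)  # fixed delay; real systems often back off


class FaultInjector:
    """Wraps a task so that its first `failures` calls raise TransientError."""

    def __init__(self, task, failures):
        self.task = task
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError("injected fault")
        return self.task()


def run_with_injected_faults(failures, max_attempts=3):
    """Return (outcome, number of calls made) under the injected faults."""
    injected = FaultInjector(lambda: "ok", failures)
    try:
        outcome = call_with_retry(injected, max_attempts=max_attempts)
    except TransientError:
        outcome = "gave up"
    return outcome, injected.calls
```

Counting calls is the useful part of this sketch: a retry loop with no cap, or one that never retries at all, would show up as an unexpected call count even when the final outcome looks plausible.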