If At First You Don’t Succeed, Try, Try, Again…? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems
- Bogdan Alexandru Stoica
- Utsav Sethi
- Yiming Su
- Cyrus Zhou
- Shan Lu
- Jonathan Mace
- Madan Musuvathi
- Suman Nath
ACM SOSP 2024
Retry – the re-execution of a task on failure – is a common mechanism to enable resilient software systems. Yet, despite its commonality and long history, retry remains difficult to implement and test in modern systems. Guided by our study of real-world retry issues, we propose a novel suite of static and dynamic techniques to detect retry problems in software systems. In particular, we find that the ad-hoc nature of retry implementation in software systems poses challenges for traditional program analysis but can be well handled by Large Language Models; we also find that carefully repurposing existing unit tests can, along with fault injection, expose various types of retry problems.
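
To make the idea of pairing a unit test with fault injection concrete, the following is a minimal sketch and not the tooling described in the paper. All names here (`TransientError`, `retry`, `FaultInjector`, `run_with_injected_faults`) are hypothetical and introduced only for illustration. It assumes a retry helper with an attempt cap and exponential backoff, and checks two properties a retry bug might violate: that the cap is honored and that a delay is applied between attempts.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type standing in for a recoverable failure."""


def retry(op: Callable[[], T], max_attempts: int = 3, base_delay: float = 0.1,
          sleep: Callable[[float], None] = lambda seconds: None) -> T:
    """Re-run `op` on TransientError, up to `max_attempts` total attempts.

    `sleep` is injectable so a test can record delays instead of waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return op()
        except TransientError:
            if attempt >= max_attempts:
                raise  # cap reached: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
            attempt += 1


class FaultInjector:
    """Wraps an operation and makes its first `failures` calls raise."""

    def __init__(self, op: Callable[[], T], failures: int) -> None:
        self.op = op
        self.remaining = failures
        self.calls = 0

    def __call__(self) -> T:
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise TransientError("injected fault")
        return self.op()


def run_with_injected_faults(failures: int, max_attempts: int) -> Tuple[str, int, List[float]]:
    """Run a trivial operation under fault injection and report what happened."""
    delays: List[float] = []
    flaky = FaultInjector(lambda: "ok", failures=failures)
    try:
        outcome = retry(flaky, max_attempts=max_attempts, sleep=delays.append)
    except TransientError:
        outcome = "gave up"
    return outcome, flaky.calls, delays
```

In this sketch, an existing test that calls the operation once could be repurposed by wrapping the operation in `FaultInjector` and asserting on the number of calls and the recorded delays. A missing cap would show up as more calls than `max_attempts`, and a missing delay as an empty `delays` list.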