If At First You Don’t Succeed, Try, Try, Again…? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems

Bogdan Alexandru Stoica; Utsav Sethi; Yiming Su; Cyrus Zhou; Shan Lu; Jonathan Mace; Madan Musuvathi; Suman Nath

ACM SOSP 2024 | November 2024

Retry, the re-execution of a task on failure, is a common mechanism for building resilient software systems. Yet, despite its ubiquity and long history, retry remains difficult to implement and test in modern systems. Guided by our study of real-world retry issues, we propose a novel suite of static and dynamic techniques to detect retry problems in software systems. In particular, we find that the ad hoc nature of retry implementations poses challenges for traditional program analysis but can be handled well by Large Language Models; we also find that carefully repurposing existing unit tests can, along with fault injection, expose various types of retry problems.
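
To make the idea of pairing unit tests with fault injection concrete, here is a minimal sketch. It is not the paper's tooling. Every name in it (`TransientError`, `call_with_retry`, `FaultInjector`, `run_checks`) is hypothetical and chosen only for illustration. It assumes a retry policy with a fixed attempt cap and exponential backoff, then injects faults to check two properties: the retry recovers from transient failures, and it gives up once the cap is reached.

```python
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a recoverable failure."""


def call_with_retry(task, max_attempts=3, base_delay=0.01, sleep=time.sleep):
    """Run task(); on TransientError, retry with exponential backoff.

    Illustrative policy only: at most max_attempts calls in total.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: the cap prevents unbounded retry
            # Back off before the next attempt; the delay doubles each time.
            sleep(base_delay * 2 ** (attempt - 1))


class FaultInjector:
    """Wraps a task and forces its first `failures` calls to raise."""

    def __init__(self, task, failures):
        self.task = task
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError(f"injected fault on call {self.calls}")
        return self.task()


def run_checks():
    """Exercise the retry logic under injected faults and report what happened."""
    delays = []  # record requested delays instead of actually sleeping

    # Two injected faults, then success: the retry should recover.
    flaky = FaultInjector(lambda: "ok", failures=2)
    result = call_with_retry(flaky, max_attempts=3, sleep=delays.append)

    # Persistent faults: the retry must stop at the cap and re-raise.
    broken = FaultInjector(lambda: "ok", failures=10)
    try:
        call_with_retry(broken, max_attempts=3, sleep=lambda _: None)
        gave_up = False
    except TransientError:
        gave_up = True

    return result, flaky.calls, delays, gave_up, broken.calls
```

Passing `sleep` as a parameter lets the test observe the backoff schedule without waiting. A retry implementation missing the cap or the delay would fail the corresponding check, which is the kind of problem such injected-fault tests are meant to surface.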