Finding Non-Terminating Executions in Distributed Asynchronous Programs
Programming distributed and reactive asynchronous systems is complex due to the lack of synchronization between concurrently executing tasks, and arbitrary delay of message-based communication. As even simple programming mistakes have the capability to introduce divergent behavior, a key liveness property is eventual quiescence: for any finite number of external stimuli (e.g., client-generated events), only a finite number of internal messages are ever created.
In this work we propose a practical three-step reduction-based approach for detecting divergent executions in asynchronous programs. As a first step, we give a code-to-code translation reducing divergence of an asynchronous program P to completed state-reachability, i.e., reachability to a given state with no pending synchronous tasks, of a polynomially-sized asynchronous program P’. In the second step, we give a code-to-code translation under-approximating completed state-reachability of P’ by state-reachability of a polynomially-sized recursive sequential program P”(K), for the given analysis parameter K. Following Emmi et al. [8]’s delay-bounding approach, P”(K) encodes a subset of P’, and thus of P, by limiting scheduling nondeterminism. As K is increased, more possibly divergent behaviors of P are considered, and in the limit as K approaches infinity, our reduction is complete for programs with finite data domains. As the final step we give the resulting state-reachability query to an of-the-shelf SMT-based sequential program verification tool.
We demonstrate the feasibility of our approach by implementing a prototype analysis tool called Alive, which detects divergent executions in several hand-coded variations of textbook distributed algorithms. As far as we are aware, our easy-to-implement prototype is the first tool which automatically detects divergence for distributed and reactive synchronous programs.