Reinforcing program correctness with reinforcement learning

Published April 14, 2021

By Suvam Mukherjee , Software Engineer

Share this page

Many of our online activities, from receiving and sending emails to searching for information to streaming movies, are driven behind the scenes by cloud-based distributed architectures. Writing concurrent software—programs with multiple logical threads of execution—is of paramount importance to scale to these growing computing needs. Unfortunately, writing correct concurrent software is challenging. Unit, integration, and even stress testing don’t provide reasonable guarantees about the correctness of a concurrent program. Thus, insidious defects can remain latent in software until the late stages of development, potentially adding cost and stress to already tight timelines designed to help ensure new software is still relevant upon release.

Controlled concurrency testing (CCT), an emerging approach in this space, aims to take over the concurrency in a program so that thread/process interleavings can be directly controlled. Using a variety of strategies, CCT attempts to identify buggy program executions, converting the testing problem into a search problem over the space of all possible program behaviors, which can typically be astronomical in number for concurrent programs.

Under the Coyote project, which has been used to build several mission-critical Microsoft Azure services, we’ve been working on providing effective CCT-based solutions to find complex defects arising from concurrency. Existing state-of-the-art CCT strategies are typically hand-tuned, making it nearly impossible to guarantee that a strategy that worked well in a previous application will work well with your application. This led us to ask an intriguing question: can we use machine learning, with no prior knowledge about an application, to figure out enough semantic details to expose bugs with high probability? We gave this a shot and designed QL. To the best of our knowledge, it’s the first reinforcement learning–based CCT search strategy. Our strategy, even without being taught anything about concurrent programs or interleavings, was able to beat the state-of-the-art human-designed CCT strategies.

Publication Learning-based Controlled Concurrency Testing

Toolkit Coyote

Microsoft research podcast

Collaborators: Silica in space with Richard Black and Dexter Greene

College freshman Dexter Greene and Microsoft research manager Richard Black discuss how technology that stores data in glass is supporting students as they expand earlier efforts to communicate what it means to be human to extraterrestrials.

Listen now

Coyote is a .NET library and tool designed to help ensure that your code is free of concurrency bugs.

QL was presented at the 2020 Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA) (opens in new tab), where it won the Distinguished Artifact Award, and is available open source for experimentation on GitHub as part of Coyote (opens in new tab).

A circular flow chart illustrating a high-level overview of the architecture of the controlled concurrency testing (CCT) framework. The input to CCT is a concurrent program, represented by a rectangle labeled “Concurrent Program under Test. — Figure 1: Controlled concurrency testing (CCT) aims to ease the challenge of writing correct concurrent software by taking over a program’s concurrency. Whenever a concurrent program executes an action, it transitions to a new state. The CCT framework comprises a scheduler, which observes the next state (s), and a set of enabled program actions (A). Using search strategies, the scheduler identifies the next action from a particular state that maximizes the likelihood of the program transitioning to a buggy state.

CCT and the trouble with finding concurrency bugs

The CCT framework (Figure 1) comprises a scheduler—which observes the next state of the concurrent program, such as the value of global variables or the set of inflight messages—and a set of enabled program actions, such as “thread T₁ writes to a global variable” or “thread T₂ sends a message to T₁.” Whenever the program executes an action, it transitions to a new state. The job of the scheduler is to simply select an enabled action from a given state.

Since the search space of program behaviors is often very large, we need effective ways of navigating this space. Schedulers employ search strategies, which identify the next action from a particular state that maximizes the likelihood of the program transitioning to a buggy state, one where some program correctness property is violated. Thus, CCT converts the testing problem into a search problem over the space of all interleavings, looking for buggy executions. As mentioned above, existing search strategies typically use human-tuned heuristics based on previously observed bug patterns, which often can’t provide guarantees that a strategy successful in one application will be as effective in another.

Let us consider an example to illustrate why concurrency bugs can be hard to find. Consensus algorithms, coordinating between multiple concurrently executing nodes to reach an agreement, lie at the heart of many modern distributed systems, and Raft (opens in new tab) is a popular consensus algorithm.

A node in the Raft protocol can assume one of three roles: leader, candidate, or follower. The leader receives client requests and replicates them among the remaining nodes. At any point, one or more candidates can initiate a leader-election round, in which nodes exchange voting messages among themselves. The candidate obtaining the most votes is deemed to be the new leader. Raft maintains an important invariant: there can be at most one leader at a time. Failure of this invariant can result in the system transitioning to a corrupted state and is a serious bug.

Akka Raft 2015 (opens in new tab), an implementation of this protocol, had a tricky bug. Candidate nodes lacked the necessary logic to identify and discard duplicate votes, resulting in a vote getting counted multiple times. Duplicate votes can occur because of delays in the network, with a node timing out for the acknowledgment and voting again. To expose this bug, specific events must occur in a definite order: there needs to be an election round with multiple candidates, a candidate \(A\) must receive the most votes, a follower who voted for candidate \(B\) has to time out and send duplicate votes, the vote counts for \(A\) and \(B\) need to match, and the system must end up with two leaders.

Out of the astronomical number of possible behaviors of Raft, it’s challenging for a testing framework such as CCT to drive the system through this specific one. We turned to RL to make searching that space more effective. Additionally, RL allows for a solution that is customized to the application under test, unlike strategies that are hand-tuned based on other scenarios.

A circular flow chart of the reinforcement learning problem. At the top, from a rectangle labeled “Agent,” an arrow labeled “Action Taken” points to a rectangle below labeled “Environment (unknown a priori).” From the “Environment” rectangle, an arrow labeled “Next State Reward/Penalty” points around to the “Agent” rectangle. — Figure 2: In the reinforcement learning problem, an agent interacts with an unfamiliar environment. Each action an agent chooses to take causes the environment to undergo a state transition, allowing the agent to observe a new state, and results in a reward/penalty for the choice. The agent’s goal is to learn a sequence of actions that maximizes its expected reward.

QL: CCT meets Q-learning

The reinforcement learning (RL) problem (Figure 2) consists of an agent interacting with an environment about which it has no prior knowledge. At each step, the agent chooses an action, which causes the environment to undergo a state transition. In response, the agent observes a new environment state and receives a reward/penalty as feedback for its choice. The agent’s goal is to learn a sequence of actions that maximizes its expected reward. This model has achieved spectacular success in domains such as robotics; game playing, including games like Go and backgammon; and autonomous driving.

A careful look at the CCT and RL architectures reveals some striking similarities. Both have entities (an agent in RL and the search strategy in CCT) that are navigating an unknown search space (the environment in RL and the concurrent program under test in CCT) with a specific objective (maximizing expected reward in RL and maximizing likelihood of finding a concurrency bug in CCT). This was our starting point. We mapped the search strategy component in CCT to an RL agent and the concurrent program under test to the unknown environment.

Our resulting search strategy leverages the classic Q-learning algorithm (opens in new tab) from the RL domain. For each state s and for each action a enabled from that state, QL maintains a quality metric Q(s,a) called the Q-value. QL decides which action to execute next using a Softmax selection function. In response, the program transitions to a new state and presents a penalty signal back to QL, which uses it to adjust the Q-values with the objective of maximizing coverage of the program’s state space.

The key advantage of QL is that instead of focusing on specific bug patterns, you can specify the state space of the program that is relevant to the logic being tested, and QL does the best job of maximizing the coverage of this state space.

Controlled testing of the Raft protocol

We compared QL to two existing state-of-the-art CCT search strategies, Random and probabilistic concurrency testing (PCT). With Raft, Random will, at each step, arbitrarily pick one of the nodes to execute. PCT will assign a set of priorities to all the participating nodes and select the node with the highest priority. It will then sporadically lower the priority of the currently executing node and pass the control to the node that had the second-highest priority. Note that both these strategies are geared toward empirically observed bug patterns.

How can we tell if a strategy is making progress toward uncovering the bug in Raft? Two reasonable metrics include the total number of elected leaders and the total number of election rounds with multiple candidates. A strategy maximizing these metrics is likely to observe more program behaviors in Raft and have a higher likelihood of exposing the bug.

Two line graphs showing the performance of QL, PCT, and Random on the Raft protocol. The x-axis of both denotes individual runs, beginning with 320 runs and ending with 10,240. The y-axis of the graph on the left shows the total number of elected leaders explored, from 0 to 10,000; the y-axis of the graph on the right shows the total number of election rounds with multiple candidates explored, from 0 to 5,000. In the first line graph, as the number of runs increases, QL (represented by a solid blue line with circles for plot points) increasingly explores the most number of leaders elected, followed by Random (represented by a dotted and dashed red line with x’s for plot points) and PCT (represented by a dashed green line with triangles for plot points). In the second line graph, QL increasingly explores the most number of election rounds with multiple candidates, followed by Random and then PCT. — Figure 4: QL outperforms two existing state-of-the-art CCT search strategies, Random and probabilistic concurrency testing (PCT), when run on Raft, a popular consensus algorithm used in modern distributed systems. QL maximizes the total number of elected leaders explored (left) and the total number of election rounds with multiple candidates explored (right), indicators of the amount of coverage. The more behaviors observed by a strategy, the more likely it is to expose a bug.

In the above line graphs, the x-axis denotes individual runs of the Raft protocol. The y-axis of the first graph shows the total number of elected leaders explored by the different strategies, and the y-axis of the second graph shows the total number of election rounds in which there are multiple candidates. QL significantly outperforms Random and PCT on both metrics. Does this mean QL is better at finding the Raft bug? You bet! If you invoke a CCT framework 100 times with the different strategies, QL finds the bug 95 times, Random finds it four times, and PCT never finds the bug. We observed this repeatedly, in benchmarks ranging from complex protocols to production Azure services: QL can glean application-specific semantic information during its exploration, allowing it to consistently outperform state-of-the-art CCT strategies.

A line graph comparing the state coverage of QL, Random, and PCT in Raft. On the x-axis is the number of iterations, beginning with 320 and ending with 10,240; on the y-axis, is the number of unique abstract states, from 0 to 500,000. As the number of iterations increases, the number of unique abstract states observed by each strategy increases, with QL (represented by a solid blue line with circles for plot points) observing the most and experiencing the biggest increase, followed by Random (represented by a dotted green line with squares for plot points) and then PCT (represented by a dashed red line with triangles for plot points). — Figure 5: In more than 10,000 executions of Raft, QL covers nearly three times more states compared with Random and five times more states compared with PCT, a testament to QL’s superior bug-finding ability.

We can gain additional insights into the superior bug-finding ability of QL by comparing its state coverage with Random and PCT. The state of the protocol relevant for testing includes the set of in-flight messages (network state) and the status of each node (its role and whom its voting for currently). Using appropriate APIs, this is the state that we exposed to QL.

As shown in the figure above, in around 10,000 executions of Raft, QL covers nearly three times more states compared with Random and five times more states compared with PCT. The QL strategy is geared toward performing a thorough coverage of the relevant state space, and the superior bug-finding ability is a byproduct.

Find out more

QL is joint work with Microsoft Research Senior Research Software Engineer Pantazis Deligiannis, Harvard University Postdoctoral Fellow Arpita Biswas, and Microsoft Research Senior Principal Researcher Akash Lal.

For more information about QL, read our paper “Learning-Based Controlled Concurrency Testing,” (opens in new tab) and to learn more about Coyote, watch the Tech Minutes: Project Coyote video (opens in new tab). You can also try out QL using this virtual machine with everything installed (opens in new tab). Happy bug finding!