Safe program merges at scale: A grand challenge for program repair research

已发布

作者 , Senior Principal Researcher

Since the computing world began embracing an open-source approach to programming, building software has become increasingly collaborative. Members of development teams with as few as two developers and as many as thousands are simultaneously editing different components in creating software systems and keeping them functioning optimally, and a three-way merge (opens in new tab) is the mechanism for integrating changes from these individual contributors. But with so many people independently altering code, it’s unsurprising that updates don’t always synchronize, resulting in bad merges.

Bad merges can take a range of forms. For example, textual merge conflicts occur when the changes from two branches can’t be integrated by the default text-based merge algorithms used by version control systems such as Git, Concurrent Versions System (CVS), and Subversion. Most such conflicts are spurious from the perspective of their effect on program execution and often originate from the use of a 40-year-old diff3 algorithm (opens in new tab) for merging text that is unaware of the syntax and semantics of programming languages. Such instances prevent developers from checking in their code, requiring them to manually fix the conflict or, if the solution is ambiguous, to consult with other developers. In other cases, bad program merges can be more subtle and costly, introducing semantic merge conflicts that may either fail the compiler, break a test, or—worse—introduce a regression. Bad merges constitute between 10 percent and 20 percent of all merges for large projects (opens in new tab) and collectively result in stalled pull requests, failed continuous integration runs, or bugs in deployment, including exploitable security vulnerabilities (opens in new tab). Coping with bad merges, which could delay development anywhere from hours to days or impact customer trust in a product, is one of the well-known pain points (opens in new tab) in collaborative software development and, additionally, often discourages less experienced developers from making meaningful contributions to large open-source projects.

Over the past few years, we at Microsoft Research—in collaboration with our academic colleagues and informed by recent large-scale studies of merge conflicts—have been revisiting the challenge, focusing on properties of safe program merges that allow harnessing the powers of program verification, program synthesis, and machine learning. First, the safety, or correctness, of a merge can be characterized by crisp formal specifications, making merges suited for verification with mathematical guarantees. Secondly, open-source software is a natural resource for merge conflict and resolution data, which can be leveraged by deep learning approaches. And third, there are project-specific patterns in how developers resolve bad merges that can be capitalized on by program synthesis. Our work in these spaces has produced several new techniques: an automatic, precise differential program verifier for ensuring a correct merge; a deep learning–based sequence-to-sequence model for synthesizing merge conflict resolutions; and a domain-specific language that can learn repeated resolution patterns for textual merge conflicts. Extending our work and finding ways to combine the strengths of these approaches with prior merge work holds the promise of real results—improvements for developer productivity around collaboration at scale.

Safe program merges—a long-standing research problem

Spotlight: Blog post

Research Focus: Week of September 9, 2024

Investigating vulnerabilities in LLMs; A novel total-duration-aware (TDA) duration model for text-to-speech (TTS); Generative expert metric system through iterative prompt priming; Integrity protection in 5G fronthaul networks.

The problem of ensuring safe program merges has been long studied and remains an open challenge in programming languages and software engineering research. Foundational work was done in the late ’80s using program dependence graphs (opens in new tab) to formalize the problem; however, it didn’t produce practical tools because of the complexity of real-world programs. Work in the last two decades has been around using variants of structured merge (opens in new tab), which lift the merge from a line-structured view of code to a tree-structured view of code (provided by abstract syntax trees). Such approaches, though, suffer from scalability issues inherent in tree matching algorithms and unfortunately can’t soundly resolve conflicts that involve program statements with side effects. Thus, a practical solution that can deal with the complexities of real-world merges has been elusive.

The below example illustrates the complexities of dealing with program semantics. The base program has redundant code, a duplicate test for null on the return value of an allocation function malloc. Removal of the test from either of the two locations in which it exists—as performed by Variant A and Variant B, respectively—leads to faster performance. However, the removal of both the checks exposes a null-dereference bug; interestingly, the default merge algorithm in Git creates such a program as a result of the merge! Such a regression may not always be caught by a test since the call to malloc may return a null value only when subjected to an extreme stress test.

An example of a semantic merge conflict produced by running the default merge algorithm in Git. The crossed-out text denotes deletion with respect to the base program. The merge exposes a null-dereference bug not present in the base program or either of the two variants.

Leveraging differential program verification

Program merge is one of the few widespread software engineering tasks where specifying the lack of regressions can be performed in a generic fashion in the form of a notion called semantic conflict-freedom. Intuitively, semantic conflict-freedom ensures that the merge incorporates precisely the behavioral changes introduced by the two branches over their most common ancestor.

A few years back, we developed a technique for formally verifying semantic conflict-freedom; the technique ensures that a merge doesn’t introduce a new regression. Although program verification offers mathematical guarantees, its applicability at scale is limited by a combination of several problems, including the need for formal specifications and complex program-specific invariants; it also scales poorly with the size of a program. Promisingly, we demonstrate that the combination of the generic semantic conflict-freedom specification and automatic inference of relatively simple classes of relational invariants—along with a compositional approach based on differential program verification—makes merge verification scale with the size of the edits instead of the absolute size of programs. This allowed for certifying many real-world merges as correct and finding bugs in merges.

While the approach showed the potential to prevent bad merges, it didn’t immediately solve the problem of synthesizing a safe merge—that is, not only detecting a bad merge but also fixing it. To that end, we started exploring data-driven approaches for the construction of good merges that can be potentially certified by merge verification tools. This naturally led to us exploring deep learning and program synthesis for merges.

Leveraging deep learning

We exploited the fact that millions of merge conflicts and their resolutions can be mined by replaying the version history of thousands of open-source code repositories on GitHub. With the key observation that most merge resolutions consist of rearranging lines and tokens from the conflicted regions, we developed a deep learning–based sequence-to-sequence model for synthesizing merge conflict resolutions. In the process, we shed light on the challenges of building such solutions based on machine learning, including the automatic creation of high-quality ground truth for a supervised machine learning problem and effective encoding of the merge problem for a deep learning algorithm. For example, we found it crucial to localize the changes the user made while resolving a conflict and to select merges that appeared to be semantically conflict-free. Similarly, we required an edit-aware encoding of the conflict, as well as restrictions on the output sequence with a pointer mechanism, to develop an effective deep learning model.

The technique has demonstrated the feasibility of developing practical merge resolution tools for a dynamic language such as JavaScript, for which structured merge algorithms are known to perform poorly. This paves the path for exploiting the recent advances in deep learning for code to the problem of merge conflict resolution at scale.

Leveraging program synthesis

Finally, it has been previously observed that projects with a sufficiently large number of conflicts often have many project-specific repeated resolution patterns. In a recent paper, we’ve demonstrated that one can design succinct domain-specific languages (DSLs) to encode such resolution patterns and leverage advances in program synthesis to learn such patterns from a small number of examples. We show the feasibility of the approach for learning merge resolution patterns for the Microsoft Edge browser code base using the PROSE program synthesis framework.

As a divergent fork of the upstream Chromium repository, Microsoft Edge nicely embodies many of the challenges of modern distributed software development that creates value-added differentiated services on top of open-source software. Engineers deal with several thousand bad merges, ranging from textual conflicts to compiler breaks to test failures, each month while absorbing changes from the upstream repository. The ability to synthesize safe merges will have a considerable impact on the engineering cost to develop and maintain Microsoft Edge and similar projects.  

A promising trifecta and a grand challenge

Each of the above three techniques comes with its own strengths and challenges. For example, the use of program verification can help verify the absence of regressions but is challenging to implement for dynamic languages such as JavaScript and Python. Deep learning–based approaches can be applied at scale with high automation but don’t provide any guarantees about lack of regressions. And, finally, program synthesis can learn project-specific repeated patterns for large projects but requires investment in designing DSLs for different patterns and can’t ensure semantic conflict-freedom. We hope to find synergies between these complementary approaches, in addition to leveraging prior works on structured merge, so that we may develop practical tools for helping developers deal with bad merges at scale. We also hope the wide applicability yet well-defined nature of program merge can serve as an important problem domain for the research community to drive advances in program verification, synthesis, and deep learning. We leave the research community with a grand challenge for program repair: automate the resolution of a million instances of merge conflicts such that the respective merges are safe enough to successfully compile and pass all the quality gates, including tests.

Acknowledgments

This research represents the work of researchers and developers at Microsoft, including Christian Bird (opens in new tab), Pallavi Choudhury (opens in new tab), Sumit Gulwani (opens in new tab), Mike Kaufman (opens in new tab), Vu Le (opens in new tab), Todd Mytkowicz (opens in new tab), former Microsoft researcher Nachi Nagappan (opens in new tab), Gustavo Soares (opens in new tab), Alexey Svyatkovskiy (opens in new tab), and Jessica Wolk (opens in new tab); interns Chungha Sung, (opens in new tab)Rangeet Pan (opens in new tab), and Elizabeth Dinella (opens in new tab); and external collaborators Marcelo Sousa (opens in new tab), Mayur Naik (opens in new tab), and Işil Dillig (opens in new tab).

相关论文与出版物

继续阅读

查看所有博客文章