Experimentation in Generative AI: C++ Team’s Practices for Continuous Improvement
By Sinem Akinci, Microsoft Developer Division and Cindy Chiu, Microsoft Experimentation Platform
Generative AI [1] leverages deep learning models to identify underlying patterns and generate original content, such as text, images, and videos. This technology has been applied to various industries, including customer service, marketing, and software development. A popular example is GitHub Copilot, which generates code based on open-source data.
The generative AI space is undergoing rapid transformation, with new updates and changes emerging daily. Products leveraging generative AI must constantly decide on the right set of parameters, models, and prompts to find the best combination. Experimentation plays a crucial role in navigating this dynamic landscape, enabling data-driven decision-making and the continuous refinement of generative AI features. As a case study, we will explore how the Microsoft C++ team puts experimentation into practice to develop and refine GitHub Copilot features.
In this blog post, we will first provide a general overview of best practices for experimenting with and evaluating generative AI features. Then we will highlight how the C++ team uses these practices to develop GitHub Copilot features and how they benefit the product. Lastly, we will conclude with an example of a new feature we shipped by leveraging these practices.
Methods for making data-driven decisions for generative AI products
What are qualitative methods?
Qualitative methods [2] offer valuable insights into the user experience through approaches such as usability studies, surveys, focus groups, interviews, and diary studies. These methods help uncover nuances that are hard for quantitative methods to capture, providing an initial understanding of user interactions. However, because qualitative findings typically come from smaller sample sizes, they may not provide a complete picture on their own. They are best used to identify gaps between features and user needs, which is particularly valuable for generative AI features that involve both model-generated content and a user interface.
What are quantitative methods?
Quantitative methods for evaluating generative AI features can be divided into two categories: offline evaluation and online evaluation.
Offline evaluation, which includes techniques like hyperparameter tuning and grid search, assesses model accuracy and feature performance before deployment. This approach works particularly well when there are known ground truth values and clean datasets. By using various datasets and predefined metrics, developers can compare models and benchmarks cost-effectively without exposing them to actual users.
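To make this concrete, here is a minimal sketch of what an offline grid search over model parameters might look like. It is illustrative only: the parameter grid and the scoreModel function (standing in for running a candidate configuration against a labeled dataset and computing a predefined metric such as accuracy) are hypothetical placeholders rather than any shipping system.

```cpp
// Illustrative offline evaluation sketch: grid search over hypothetical
// model/temperature parameters against a labeled dataset. No users are
// exposed at this stage.
#include <iostream>
#include <string>
#include <vector>

struct Params {
    std::string model;
    double temperature;
};

// Placeholder: run the configuration against the offline dataset and
// return a predefined metric (e.g., accuracy in [0, 1]). A real
// implementation would call the model and compare against ground truth.
double scoreModel(const Params& p) {
    return p.model == "model-large" ? 0.82 - 0.1 * p.temperature
                                    : 0.74 - 0.1 * p.temperature;
}

int main() {
    const std::vector<std::string> models = {"model-small", "model-large"};
    const std::vector<double> temperatures = {0.0, 0.2, 0.7};

    Params best{};
    double bestScore = -1.0;

    // Try every combination of parameters and keep the best-scoring one.
    for (const auto& model : models) {
        for (double temperature : temperatures) {
            const Params candidate{model, temperature};
            const double score = scoreModel(candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
    }

    std::cout << "Best offline configuration: " << best.model
              << " at temperature " << best.temperature
              << " (score " << bestScore << ")\n";
}
```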
Online evaluation, such as A/B testing, involves exposing the feature to actual customers. It verifies the results observed during offline testing in a real-world context, capturing true user interactions and ensuring the feature performs effectively in production.
Incorporating all methods into your product lifecycle
The generative AI product lifecycle [3] is an iterative approach to preparing, deploying, and improving a generative AI feature over time. During the experimentation and evaluation stage, offline evaluation is used to assess whether the model performs better than other baselines. Although offline evaluation provides an understanding of model accuracy, it does not represent user interactions, making online testing essential.
A/B testing helps validate the results by capturing real user interactions. Once the model is deployed, qualitative methods such as user studies can be used to collect user feedback, particularly for features designed for user interaction. This feedback is then incorporated to further refine and improve the feature, completing the product lifecycle.
Using progressive rollout to test your generative AI feature
What is progressive rollout?
Progressive rollout starts by exposing a new feature to a small percentage of users and gradually rolling it out to more users based on its performance. Traffic as small as a few thousand samples is used to verify that the feature works as expected and to observe any movement in user metrics, rather than to make a definitive decision on shipping.
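Conceptually, a progressive rollout can be thought of as a percentage gate keyed on a stable user identifier, so the same user consistently lands in the same group as exposure increases. The sketch below is a minimal, hypothetical version; a real experimentation platform handles randomization, persistence, and targeting far more carefully, and std::hash is used here only for brevity since it is not guaranteed to be stable across builds.

```cpp
// Illustrative progressive-rollout gate: bucket users by a hash of a
// stable ID and enable the feature only for buckets below the current
// exposure percentage.
#include <functional>
#include <iostream>
#include <string>

bool isFeatureEnabled(const std::string& userId, int rolloutPercent) {
    // Stable bucket in [0, 100) derived from the user ID.
    const std::size_t bucket = std::hash<std::string>{}(userId) % 100;
    return static_cast<int>(bucket) < rolloutPercent;
}

int main() {
    // Start with a small slice of traffic (e.g., 5%) and ramp up only
    // after the guardrail and user metrics look healthy.
    const int rolloutPercent = 5;
    for (const std::string user : {"user-001", "user-002", "user-003"}) {
        std::cout << user << ": "
                  << (isFeatureEnabled(user, rolloutPercent) ? "new feature"
                                                             : "existing experience")
                  << "\n";
    }
}
```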
What’s the benefit of progressive rollout?
Mitigating risk of errors or bias: Due to the non-deterministic nature of AI, generative AI features can sometimes produce unexpected or inappropriate content. By gradually rolling out the feature, developers can confirm that the work they have done to address unexpected output holds up broadly, safeguarding against potential harm. This approach also helps detect data quality issues, such as Sample Ratio Mismatch (SRM) or unexpected metric movements, ensuring a more reliable deployment (a minimal SRM check is sketched below).
Learning and improvement through performance management: Latency is a key component of performance, and it can significantly impact generative AI products. Users may abandon a feature if the response time is too long. Measuring performance and latency is essential to ensure that users get the intended value in a timely manner. By identifying regressions in performance metrics early, such as increased response times or higher crash rates, teams can address these issues promptly. Progressive rollout not only allows the product team to ship hotfixes while the feature is still exposed to a small percentage of users, but also helps predict capacity needs more accurately, ensuring the best user experience as capacity ramps up.
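As an example of the data quality guardrail mentioned above, an SRM check compares the observed control/treatment counts against the configured split. The sketch below is a minimal, illustrative version using a chi-square goodness-of-fit test with the commonly used p < 0.001 threshold; the counts are made up, and a real experimentation platform runs this kind of check automatically.

```cpp
// Illustrative Sample Ratio Mismatch (SRM) check: chi-square
// goodness-of-fit test of observed counts against the expected split
// (one degree of freedom).
#include <iostream>

bool hasSampleRatioMismatch(long controlCount, long treatmentCount,
                            double expectedControlShare = 0.5) {
    const double total = static_cast<double>(controlCount) + treatmentCount;
    const double expectedControl = total * expectedControlShare;
    const double expectedTreatment = total - expectedControl;

    const double chiSquare =
        (controlCount - expectedControl) * (controlCount - expectedControl) / expectedControl +
        (treatmentCount - expectedTreatment) * (treatmentCount - expectedTreatment) / expectedTreatment;

    // 10.828 is the chi-square critical value for p = 0.001 with 1 df,
    // a commonly used threshold for flagging an SRM.
    return chiSquare > 10.828;
}

int main() {
    // A 50/50 experiment that logged 50,000 control vs. 48,800 treatment
    // users: a small-looking imbalance that the test still flags.
    std::cout << (hasSampleRatioMismatch(50000, 48800)
                      ? "Possible SRM - investigate before trusting results\n"
                      : "Split looks healthy\n");
}
```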
Iterating experiments to optimize your feature
Why run multiple iterations? What are the benefits?
Developers frequently run multiple experiments on the same product. As highlighted in the generative AI product lifecycle, after collecting user feedback or analyzing experiment results, developers can iterate on the experiment to better incorporate the feedback and enhance the product. As generative AI models evolve, more models become available for production use. One key question is: which model is best for users? The answer varies by feature. For instance, AI-assisted rename functionality requires quick response times: renaming occurs during the natural developer flow, which demands a responsive interaction. If that responsiveness isn't achieved, the feature's benefit may decrease, as developers might prefer to continue their work rather than be delayed by latency. Conversely, features like a pull request reviewer benefit from models capable of more complex reasoning, where precision matters more than speed. A/B testing different models helps developers determine whether users prefer faster responses or higher-quality output.
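When comparing two models in an A/B test, the decision usually comes down to whether a metric movement is statistically significant rather than noise. Below is a minimal, illustrative two-proportion z-test on a hypothetical engagement metric (say, the fraction of sessions where the AI response was accepted); the counts are invented, and a real analysis would also consider multiple metrics, guardrails, and variance reduction.

```cpp
// Illustrative A/B comparison: two-proportion z-test on an engagement
// metric for a control model vs. a candidate (e.g., faster) model.
#include <cmath>
#include <iostream>

double twoProportionZ(long successA, long totalA, long successB, long totalB) {
    const double pA = static_cast<double>(successA) / totalA;
    const double pB = static_cast<double>(successB) / totalB;
    // Pooled proportion under the null hypothesis of no difference.
    const double pooled =
        static_cast<double>(successA + successB) / (totalA + totalB);
    const double standardError =
        std::sqrt(pooled * (1.0 - pooled) * (1.0 / totalA + 1.0 / totalB));
    return (pB - pA) / standardError;
}

int main() {
    // Hypothetical counts: 10,000 sessions per arm.
    const double z = twoProportionZ(/*control*/ 2100, 10000,
                                    /*treatment*/ 2240, 10000);
    std::cout << "z = " << z << "\n";
    // |z| > 1.96 corresponds to p < 0.05 (two-sided).
    std::cout << (std::abs(z) > 1.96 ? "Statistically significant difference\n"
                                     : "No statistically significant difference\n");
}
```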
When iterating over experiments, teams can refine hypotheses, modify treatments, and test new variations. For example, experiments can be conducted on different language models. This iterative method enables an experience in production that maximizes user engagement. Continuous iteration and refinement not only lead to more polished products but also ensure that the product evolves in alignment with user needs and preferences.
Combining best practices to help C++ users: Copilot in Quick Info case study
One example of a feature that the Microsoft C++ team developed using both qualitative and quantitative methods is the C++ Copilot integration in the Quick Info dialog in Visual Studio.
Copilot in Quick Info is an AI-based feature that provides users with an AI-generated summary of the symbol they are hovering over. Users select “Tell me more” on hover to invoke Copilot and get a summary of that particular symbol. The goal of the feature was to give users quick, accurate information about a symbol that may lack documentation, without requiring them to switch context.
Progressive rollout of initial design
After initial development, the C++ team ran an A/B experiment to measure the feature’s impact on a set of metrics. They defined these metrics to ensure that the feature would provide value to customers without introducing errors into the product. This first iteration of experimentation revealed that the functionality improved engagement with Copilot Chat for C++ users without regressing error metrics.
Qualitative studies of initial design
In tandem, they ran a user study to validate the design of the feature. The developers interviewed prioritized quick results and wanted an option to follow up on the response. This feedback was instrumental in shaping the subsequent iterative A/B experiments.
Iterative experimentation on feature
In response to this feedback, they ran two follow-up quantitative A/B experiments. First, to evaluate how quicker results affected user value, they ran an A/B experiment that swapped the model behind the feature for a lighter-weight, faster model. Second, to evaluate the follow-up prompt, they ran an A/B experiment that added a new “Continue in chat window…” option below the results, measuring how this affected product value and ensuring it did not introduce errors.
Iterative A/B experimentation on AI models can yield broader learnings about product behavior. For example, features that are invoked frequently and sit close to users’ workflows, such as this Quick Info feature, may benefit from models with faster response times. On the other hand, response time matters less for features that provide more in-depth information that users step out of their workflow to interpret, such as the Fix with Copilot feature. These features benefit more from models that provide more verbose and accurate responses.
Putting things together
Determining the effectiveness of our generative AI feature requires a blend of various evaluation methods. We begin by deciding whether to start with quantitative or qualitative approaches. These evaluations are integrated into our product lifecycle to continually enhance our generative AI product. Once our experiment is set up, we progressively roll out the feature to minimize unexpected behavior. We start by testing on a small group before expanding to a broader audience. After obtaining our experiment results, we use them to refine and improve the product through iterative experimentation.
By combining these best practices, we achieve a comprehensive understanding of our generative AI feature’s impact and effectiveness. This holistic approach ensures that our generative AI feature is both user-centric and performance-driven, providing a better user experience and achieving our business goals.
– Sinem Akinci (Microsoft Developer Division), Cindy Chiu (Microsoft Experimentation Platform)
References
[1] Reddington, C. (2024, May 14). How companies are boosting productivity with generative AI. The GitHub Blog. https://github.blog/ai-and-ml/generative-ai/how-companies-are-boosting-productivity-with-generative-ai/#what-is-generative-ai
[2] Stevenson, J., & Ostrowski, S. (2022, February 11). Measurably improve your product by combining qualitative and quantitative methods. Microsoft Research. https://www.microsoft.com/en-us/research/articles/measurably-improve-your-product-by-combining-qualitative-and-quantitative-methods/
[3] Peckham, S., & Day, J. (2024, July 1). Generative AI. Microsoft Learn. https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/