Authors:
Matthew Burruss and Shafiul “Jacky” Islam
In an ideal world, machine learning would be a straight path from defining the business use cases to operationalizing the model. In reality, the model lifecycle is a continuous loop: objectives are redefined, models are updated, and the world changes (i.e., concept and data drift).
The process of developing, testing, automating, monitoring, and continuously delivering models into production is known as machine learning operations, or MLOps.
AI models are used in Microsoft Dynamics 365 to provide business insights and drive business outcomes for our customers, including product recommendations, churn risk, and sentiment analysis, and to power applications like Microsoft Dynamics 365 Customer Insights and Microsoft Dynamics 365 Supply Chain Insights.
This document will act as a reference guide describing common challenges facing MLOps systems and the approaches our team at Microsoft has used to address them.
Challenge 1. Scaling to Support Many Models, Infrastructures, and Apps
This section will describe the challenge of standardizing the model lifecycle to support many model types and runtimes. For example, data scientists may use several analytic solutions like Azure HDInsight, Azure Synapse, and Azure Machine Learning during their experimentation phase while eventually finding the best solution for deployment based on their specific functional requirements (e.g., batch and real-time inference) and nonfunctional requirements (e.g., model selection, scalability, GPU acceleration, and latency requirements).
However, each ecosystem may have different ways to register datasets, register models, provision resources, etc., causing teams to divert money and time into resource management.
Standardization of the platform ecosystems provides a consistent user experience for the development and release of a model even when the underlying technologies or model type may change. This accelerates the model lifecycle by reducing confounding variables like compute configuration while also enabling cross-team and cross-compute development.
We address this challenge by relying heavily on compute abstraction, model abstraction, and declarative testing of the AI model. We will see additional advantages of these design decisions in Challenge 3. Traceability and Monitoring Model Improvements.
Compute abstraction is achieved by providing infra-as-code through Azure Pipelines, allowing customers to declaratively select and configure their compute target(s) for various tests. We deliberately hide the details of the compute deployment and treat the compute as a service to allow data scientists to focus on model development, while also enhancing the reproducibility of their experiments and decreasing costs across the organization.
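To illustrate the idea (the class and field names below are hypothetical sketches, not our internal API), a compute target can be described declaratively and resolved to a concrete backend by the pipeline:

```python
from dataclasses import dataclass

@dataclass
class ComputeConfig:
    """Declarative description of a compute target, resolved by the pipeline at run time."""
    backend: str                      # e.g. "aml", "hdinsight", "synapse"
    vm_size: str = "Standard_DS3_v2"  # placeholder SKU
    node_count: int = 1
    use_gpu: bool = False

def resolve_compute(config: ComputeConfig) -> dict:
    """Hide provisioning details: return a backend-specific spec the infra-as-code layer consumes."""
    if config.backend == "aml":
        return {"type": "AmlCompute", "vm_size": config.vm_size,
                "max_nodes": config.node_count, "gpu": config.use_gpu}
    if config.backend == "hdinsight":
        return {"type": "HDInsightSparkCluster", "workers": config.node_count}
    raise ValueError(f"Unsupported backend: {config.backend}")

# A model owner only declares what they need; the pipeline decides how to provision it.
spec = resolve_compute(ComputeConfig(backend="aml", node_count=4, use_gpu=True))
```

Because model owners only declare what they need, the provisioning details can evolve without changing any model code.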
Model abstraction is achieved by ensuring that models adhere to a common interface. While this abstraction can be high-level, such as a train and score interface, it enables a consistent development-to-release process and allows applications to be built around the models in a backwards and forwards-compatible way.
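A minimal sketch of such an interface, assuming a simple train/score contract (the class and method names are illustrative):

```python
from abc import ABC, abstractmethod
from typing import Any

class ModelInterface(ABC):
    """Hypothetical common contract every onboarded model implements."""

    @abstractmethod
    def train(self, data: Any, params: dict) -> None:
        """Fit the model on a prepared dataset."""

    @abstractmethod
    def score(self, data: Any) -> Any:
        """Produce predictions (batch or real-time) for new records."""

class ChurnModel(ModelInterface):
    """Example implementation; the platform only ever calls train() and score()."""

    def __init__(self):
        self.threshold = 0.5

    def train(self, data, params):
        self.threshold = params.get("threshold", 0.5)

    def score(self, data):
        return [1 if row.get("activity", 0) < self.threshold else 0 for row in data]
```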
Finally, declarative tests verify the behavior and performance of the model. We leverage build verification tests and performance verification tests (e.g., stress and load tests) which are orchestrated by Azure Pipelines. These tests leverage the compute and model abstractions to surface consistent signals to developers and to act as gates to judge whether a model is deployment-ready.
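A declarative test case might look like the following sketch (the schema, metric names, and thresholds are hypothetical); a verification engine evaluates each assertion against the metrics observed on the selected compute:

```python
# Hypothetical declarative test case: expected metric thresholds for a model build.
test_case = {
    "model": "churn_risk",
    "compute": {"backend": "aml", "node_count": 1},
    "dataset": "smoke_test_sample",
    "assertions": [
        {"metric": "f1_score", "operator": ">=", "value": 0.70},
        {"metric": "latency_ms_p95", "operator": "<=", "value": 200},
    ],
}

OPERATORS = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b, "==": lambda a, b: a == b}

def verify(observed_metrics: dict, case: dict) -> bool:
    """Return True only if every declared assertion holds for the observed run."""
    return all(
        OPERATORS[a["operator"]](observed_metrics[a["metric"]], a["value"])
        for a in case["assertions"]
    )

print(verify({"f1_score": 0.74, "latency_ms_p95": 150}, test_case))  # True
```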
Overall, as shown in Figure 2, incorporating MLOps helps scale delivery of models to multiple Dynamics 365 applications while accelerating the AI model development lifecycle, allowing fast onboarding and quick model iteration. The models shown are a small subset of the models in Dynamics and Power Platform applications overall; they represent the set onboarded to date to the MLOps platform discussed in this post.
Challenge 2. Reproducibility and Versioning
This section will describe the challenge of reproducibility and versioning. Randomness is inherent in many machine learning algorithms; one example is the random initialization of neuron weights in neural networks. However, reproducibility in the experimental design is important. For example, dataset transformations and feature engineering techniques like embeddings should be repeatable and reusable throughout the model lifecycle. Furthermore, it is important to capture the model runtime (model dependencies, Python version, etc.) between iterations for change control.
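For example, a minimal way to pin down algorithmic randomness is to seed every source of it at the start of a run (this sketch assumes NumPy; deep learning frameworks expose analogous seeding calls):

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness so experiment runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Deep learning frameworks (e.g., torch.manual_seed) would be seeded here as well.

set_global_seed(42)
weights = np.random.randn(3, 3)  # identical across runs with the same seed
```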
To make the model lifecycle reproducible, we rely on Azure Repos to provide version control tools that store a permanent snapshot of the model and its dataset transformation steps in a Git repository. We leverage Azure Blob Storage to house derived artifacts like model weights or explainability outcomes, which are checkpointed whenever our testing platform runs (see Challenge 3. Traceability and Monitoring Model Improvements).
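As a sketch of the checkpointing step, assuming the azure-storage-blob SDK and placeholder container and artifact names, a derived artifact can be uploaded to Azure Blob Storage like this:

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: in practice the connection string comes from a secret store, not source code.
service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="model-artifacts",
                               blob="churn_risk/v42/model_weights.pkl")

with open("model_weights.pkl", "rb") as fh:
    blob.upload_blob(fh, overwrite=True)  # checkpoint the artifact for this test run
```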
Once experiments are reproducible, different iterations of your model can be compared for improvements or regressions. This is where the challenge of versioning comes into play for the models, configurations, and datasets. To version our models, we rely on Azure Artifacts and Docker images pushed to Azure Container Registry to provide consistent snapshots of our model, configurations, and runtime environment. We also leverage an internal Data API to supply versioned artifacts (e.g., pretrained models), and we use hashing for dataset versioning when augmenting data. We also allow users to organize data to their liking. We have found that tagging datasets with summary statistics and descriptions also helps in experimentation efforts and dataset readability.
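A minimal sketch of content hashing for dataset versioning (the helper name and file path are illustrative):

```python
import hashlib
from pathlib import Path

def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a stable SHA-256 fingerprint of a dataset file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# The fingerprint can be stored as a tag alongside summary statistics and a description:
# version_tag = dataset_fingerprint("augmented_training_data.parquet")
```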
Challenge 3. Traceability and Monitoring Model Improvements
This section will describe the challenge of testing the model and evaluating its performance against a baseline or an existing model in production. The MLOps platform should weigh the tradeoffs of time, money, and resources when determining the number and types of tests used to evaluate the model code (e.g., unit, integration, load, and stress tests). It is also important to ensure high-quality, secure code through automatic linting and security scanning. For performance monitoring, it is important to practice reproducibility and ensure that metrics can be easily traced and compared against previous models. Traceability also ensures performance numbers, code changes, approvals, and new features can be tracked across the model’s lifecycle. In our platform, we track common metrics like memory consumption and execution time while allowing model owners to define their own metrics, such as F1 score and accuracy.
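As an illustration of the common metrics, a model step’s wall-clock time and peak (Python-allocated) memory can be captured with the standard library; our platform’s actual collection differs, so treat this as a sketch:

```python
import time
import tracemalloc

def run_with_metrics(step, *args, **kwargs):
    """Run a model step and return its result plus wall-clock time and peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    result = step(*args, **kwargs)
    elapsed_s = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, {"execution_time_s": elapsed_s, "peak_memory_mb": peak_bytes / 1e6}

result, metrics = run_with_metrics(sum, range(1_000_000))
print(metrics)
```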
For testing of the AI model, we have developed a common set of standard DevOps tools to orchestrate model development, including a verification engine to verify model behavior and performance. We start with high-speed, low-cost unit tests followed by slower endpoint integration tests. These endpoint tests, for example, verify that the train endpoint works as intended on a small dataset. Finally, we run more heavyweight end-to-end integration tests on DEV instances of our production platform to gather performance numbers and verify end-to-end runtime behaviors.
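A sketch of such an endpoint test in pytest, using a stand-in model that exposes the common train/score interface (all names and the tiny dataset are hypothetical):

```python
import pytest

class TinyModel:
    """Stand-in for an onboarded model exposing the common train/score interface."""
    def __init__(self):
        self.fitted = False
    def train(self, data, params):
        if not data:
            raise ValueError("empty dataset")
        self.fitted = True
    def score(self, data):
        return [0 for _ in data]

SMALL_DATASET = [{"activity": 0.1}, {"activity": 0.9}]

def test_train_endpoint_on_small_dataset():
    model = TinyModel()
    model.train(SMALL_DATASET, params={})
    assert model.fitted
    assert len(model.score(SMALL_DATASET)) == len(SMALL_DATASET)

def test_train_rejects_empty_dataset():
    with pytest.raises(ValueError):
        TinyModel().train([], params={})
```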
These integration tests use declarative test cases to verify the behavior of the model, allowing model owners to define tests that check metrics, SHAP explainability results, model error codes, etc. The declarative tests also enable model developers to create consistent test cases for different model types like batch vs. real-time as well as different compute types like HDInsight vs. Azure Machine Learning.
For each test, metrics and outputs are tracked, providing a historical snapshot of the model. We leverage MLflow to log metrics, parameters, and tags. We persist these artifacts in Azure Cosmos DB. Finally, we consume and visualize this data in Power BI, which becomes especially useful for quick model evaluation and comparison.
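A minimal MLflow logging sketch (the experiment, run, parameter, and metric names are placeholders; the Cosmos DB persistence and Power BI dashboards are configured outside this snippet):

```python
import mlflow

# Group runs under an experiment so builds of the same model can be compared over time.
mlflow.set_experiment("churn_risk_verification")

with mlflow.start_run(run_name="build_1234"):
    mlflow.log_param("learning_rate", 0.05)      # configuration used for this build
    mlflow.set_tag("compute_backend", "aml")     # trace which compute produced the numbers
    mlflow.log_metric("f1_score", 0.74)          # model-defined metric
    mlflow.log_metric("peak_memory_mb", 812.5)   # common platform metric
```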
Challenge 4. Model Packaging and Deployment
Once the model is tested and evaluated, the final challenge is packaging and deploying the model. First, after passing all the testing gates, a new version of the model is pushed to our PyPI and NuGet feed in Azure Artifacts. How the model is then deployed largely depends on the compute target requirements. For example, batch jobs running on HDInsight through Spark consume a Conda environment whereas our real-time/batch models running on Azure Machine Learning pull the model’s Docker image from an Azure Container Registry.
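As a sketch of the Azure Machine Learning path, assuming the v1 azureml-core Python SDK and placeholder file names and versions, a conda-based environment and a packaged model can be registered like this:

```python
from azureml.core import Environment, Model, Workspace

ws = Workspace.from_config()  # assumes a local config.json pointing at the workspace

# Build the runtime environment from a conda specification checked into the repo.
env = Environment.from_conda_specification(name="churn-risk-env", file_path="conda.yml")

# Register the packaged model; Azure ML stores it and can bake it into a Docker image
# pushed to the workspace's Azure Container Registry for real-time or batch deployment.
model = Model.register(workspace=ws, model_path="dist/churn_risk-1.2.0",
                       model_name="churn_risk")
```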
For operationalizing the deployment, it is best practice to automate as many parts of the process as possible while having a human in the loop to sign off on the deployment. Azure DevOps allows you to define gated releases so that a notification is sent to a group of approvers to manually review and approve the release. Once all the tests and approvals have passed, the model can be deployed to the customer-facing service. It is also important to determine when and how frequently to deploy a model. Common approaches include scheduled triggers, which may be required if data drift is frequent; for most of our cases, however, we perform manual deployments at ad-hoc intervals whenever a new feature or improvement is available.
Finally, it may be helpful to ease the model into production and evaluate its performance against an existing model to see whether the AI feature improves the end-user experience. Common approaches include A/B testing and shadow mode deployments. While such discussions are out of scope for this article, we encourage those interested to learn more about techniques for continuously evaluating a model in production.
Conclusions
Thank you for reading; we hope you have learned more about MLOps and can utilize these learnings to improve the continuous release of your models. If you want to talk more about this or join our team (we’re always looking for great people!), contact us at: [email protected]. Happy training!