EzPC: Increased data security in the AI model validation process


EzPC provides secure AI model validation. The diagram poses the following question: Is the accuracy of the AI model on the test dataset greater than 70%? First, an AI vendor provides its model weights along with the model structure, written in ONNX code for ML inference. A modular compiler takes this as input and automatically generates C-like code, which is then compiled into various MPC protocols. A suite of highly performant cryptographic protocols then securely computes complex ML functions on the organization’s test dataset. During the protocol, the parties exchange only random bits, keeping the data of both parties secure.

From manufacturing and logistics to agriculture and transportation, the expansion of artificial intelligence (AI) in the last decade has revolutionized a multitude of industries—examples include enhancing predictive analytics on the manufacturing floor and making microclimate predictions so that farmers can respond and save their crops in time. The adoption of AI is expected to accelerate in the coming years, underscoring the need for an efficient adoption process that preserves data privacy.

Currently, organizations that want to adopt AI into their workflow go through the process of model validation, in which they test, or validate, AI models from multiple vendors before selecting the one that best fits their needs. This is usually done with a test dataset that the organization provides. Unfortunately, the two options that are currently available for model validation are insufficient; both risk the exposure of data.

One of these options entails the AI vendor sharing their model with the organization, which can then validate the model on its test dataset. However, by doing this, the AI vendor risks exposing its intellectual property, which it undoubtedly wants to protect. The second option, equally risky, involves the organization sharing its test dataset with the AI vendor. This is problematic on two fronts. First, it risks exposing a dataset with sensitive information. Additionally, there’s the risk that the AI vendor will use the test dataset to train the AI model, thereby “overfitting” the model to the test dataset so that its results appear more credible than they are. To accurately assess how an AI model performs on a test dataset, it’s critical that the model not be trained on it. Currently, these concerns are addressed by complex legal agreements, often taking several months to draft and execute, creating a substantial delay in the AI adoption process.

The risk of data exposure and the need for legal agreements are compounded in the healthcare domain, where patient data, which makes up the test dataset, is incredibly sensitive and subject to strict privacy regulations with which both organizations must comply. Additionally, not only does the vendor’s AI model contain proprietary intellectual property, but it may also encode sensitive patient information from the training data used to develop it. This makes for a challenging predicament. On one hand, healthcare organizations want to adopt AI quickly because of its enormous potential in applications such as understanding health risks in patients, predicting and diagnosing diseases, and developing personalized health interventions. On the other hand, there’s a fast-growing list of AI vendors in the healthcare space to choose from (currently over 200), making the cumulative legal paperwork of AI validation daunting.

EzPC: Easy Secure Multi-Party Computation

We’re very interested in accelerating the AI model validation process while also ensuring dataset and model privacy. For this reason, we built Easy Secure Multi-party Computation (EzPC). This open-source framework is the result of a collaboration among researchers with backgrounds in cryptography, programming languages, machine learning (ML), and security. At its core, EzPC is based on secure multiparty computation (MPC), a suite of cryptographic protocols that enable multiple parties to collaboratively compute a function on their private data without revealing that data to one another or to any other party. This functionality makes AI model validation an ideal use case for MPC.
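To give intuition for how MPC keeps inputs hidden, here is a minimal, self-contained Python sketch of additive secret sharing, the textbook building block behind many MPC protocols. It computes only a sum of two private values; EzPC’s protocols extend the same principle to multiplications, comparisons, and entire neural networks. The code is illustrative and not part of EzPC.

```python
# A minimal sketch (not EzPC code) of additive secret sharing: each
# private value is split into random-looking shares, and the parties
# compute on shares without ever seeing each other's inputs.
import secrets

Q = 2**61 - 1  # all arithmetic is done modulo a fixed prime

def share(value: int) -> tuple[int, int]:
    """Split `value` into two additive shares that sum to it mod Q."""
    r = secrets.randbelow(Q)
    return r, (value - r) % Q

# Party A's private input is x; party B's is y.
x, y = 42, 100
x_a, x_b = share(x)  # A keeps x_a and sends x_b to B
y_b, y_a = share(y)  # B keeps y_b and sends y_a to A

# Each party adds the shares it holds; each result alone looks random.
s_a = (x_a + y_a) % Q
s_b = (x_b + y_b) % Q

# Only by combining both result shares is the sum revealed, never x or y.
assert (s_a + s_b) % Q == (x + y) % Q
```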

However, while MPC has been around for almost four decades, it’s rarely deployed because building scalable and efficient MPC protocols requires deep cryptography expertise. Additionally, while MPC performs well when computing small or simple stand-alone functions, combining several different kinds of functions, which is fundamental to ML applications, is much harder and far less efficient without specialized expertise.

EzPC solves these problems, making it easy for all developers, not just cryptography experts, to use MPC as a building block in their applications while providing high computational performance. Two innovations are at the core of EzPC. First, a modular compiler called CrypTFlow takes as input TensorFlow or Open Neural Network Exchange (ONNX) code for ML inference and automatically generates C-like code, which can then be compiled into various MPC protocols. This compiler is both “MPC-aware” and optimized, ensuring that the MPC protocols are efficient and scalable. The second innovation is a suite of highly performant cryptographic protocols for securely computing complex ML functions.
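To illustrate the compiler’s input, the sketch below exports a model to ONNX using the standard PyTorch and torchvision APIs; the compiler invocation at the end is a hypothetical placeholder, since the exact CrypTFlow command line may differ from what is shown.

```python
# A sketch of preparing the input that EzPC's CrypTFlow compiler
# consumes. The export uses standard PyTorch/torchvision APIs; the
# compiler step at the end is a hypothetical placeholder (see the
# EzPC repository for the real entry points).
import torch
import torchvision

# Any inference-only network works; DenseNet-121 mirrors the case
# study described later in this post.
model = torchvision.models.densenet121(weights=None).eval()

# MPC compilation targets a static graph, so a fixed input shape is used.
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)

# Hypothetical next step (placeholder, not the verified CLI):
#   $ python compile_onnx.py --model model.onnx --backend SCI
# CrypTFlow would translate the graph into C-like code and compile it
# into an MPC protocol so two parties can run inference jointly.
```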

The EzPC system provides usability, security, and performance. For usability, EzPC automatically compiles TensorFlow or ONNX code into MPC protocols, so no cryptography expertise is required. For security, mathematical guarantees ensure that only random bits are exchanged, keeping sensitive data demonstrably secure. For performance, real-world benchmarks show that EzPC can run networks with millions of parameters in minutes.

EzPC in practice: Multi-institution medical imaging AI validation

In a recent collaboration with researchers at Stanford University and the Centre for Advanced Research in Imaging, Neuroscience & Genomics (CARING), the EzPC team built a system using EzPC to address the need for secure and performant AI model validation. The team from Stanford University had developed a widely acclaimed 7-million-parameter DenseNet-121 AI model trained on the CheXpert dataset to predict certain lung diseases from chest X-rays, while a team from CARING created a labeled test dataset of 500 patient images. The goal was to test the accuracy of the CheXpert model on CARING’s test dataset while preserving the privacy of both the model and the test data.
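Conceptually, the function being secured is simple. The following plaintext sketch, with names of our own invention rather than EzPC’s API, shows what the two parties jointly evaluate; under MPC, neither the weights nor the images are revealed, only the final verdict.

```python
# A plaintext sketch of the end-to-end validation function; all names
# are illustrative, not EzPC's actual API. Under MPC, `predict` runs
# over the vendor's secret-shared weights and the organization's
# secret-shared images, and only the final verdict is disclosed.
from typing import Callable, Sequence

def validate(predict: Callable[[object], int],
             images: Sequence[object],
             labels: Sequence[int],
             threshold: float = 0.70) -> bool:
    correct = sum(predict(img) == lbl for img, lbl in zip(images, labels))
    accuracy = correct / len(labels)
    return accuracy > threshold  # the single disclosed bit
```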

With this test, EzPC enabled the first-ever secure validation of a production-grade AI model, proving that it’s not necessary to share data to accurately perform AI model validation. The performance overheads of the secure validation were also reasonable and practical for the application. In particular, secure inference on a single test image, run between two standard cloud virtual machines, took 15 minutes, which was about 3000x longer than the time needed to test an image without the added security that EzPC provides. Running all the images in the test data sequentially took a total of five days at a nominal overall cost (<$100); had we run all the test images in parallel, the entire evaluation would have taken 15 minutes. The details of this real-world case study are published in the paper, Multi-institution encrypted medical imaging AI validation without data sharing.
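As a quick sanity check, the figures above are mutually consistent; the short snippet below simply recomputes them.

```python
# Recomputing the performance figures quoted above (illustrative only).
SECURE_MIN_PER_IMAGE = 15   # secure inference time for one image
NUM_IMAGES = 500            # size of the labeled test dataset
OVERHEAD = 3000             # reported slowdown vs. plaintext inference

plaintext_sec = SECURE_MIN_PER_IMAGE * 60 / OVERHEAD           # ~0.3 s/image
sequential_days = SECURE_MIN_PER_IMAGE * NUM_IMAGES / 60 / 24  # ~5.2 days

print(f"plaintext inference: ~{plaintext_sec:.1f} s per image")
print(f"sequential secure run: ~{sequential_days:.1f} days")
# Fully parallel across images, wall-clock time collapses to ~15 minutes.
```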


Looking ahead: Standardizing privacy technology and applications beyond healthcare

With EzPC, MPC technology is now practical and accessible enough to run on complex AI workloads. This makes it a game-changer in data collaboration, enabling organizations in all industries, not only healthcare, to select the best AI models for their use cases while protecting both data and model confidentiality. We want to encourage the use of EzPC with the awareness that it’s possible to validate AI models without sharing data. In doing so, we can prevent the risk of data exposure and potentially overcome current barriers in data collaboration.

Moreover, this technology has the potential to impact the negotiation of complex legal agreements required for the AI model validation process. It’s our hope that these types of legal agreements as well as legislation that aims to protect sensitive and proprietary information can incorporate the understanding that—when using the latest privacy-preserving technology—it’s not necessary to share this type of information to compute functions on joint data.

In addition to AI model validation, EzPC can be applied to a number of different scenarios where it’s essential to maintain data privacy. We’ve successfully evaluated EzPC to securely compute a variety of algorithms across such domains as phishing detection, personalized radiotherapy, speech to keywords, and analytics.

EzPC is open source under the MIT license on GitHub. Discover the latest developments on the EzPC research project page, where you can read our publications and watch videos to learn more.
