Demonstration of Geyser: Provenance Extraction and Applications over Data Science Scripts
- Fotis Psallidas
- Megan Eileen Leszczynski
- Mohammad Hossein Namaki
- Avrilia Floratou
- Ashvin Agrawal
- Konstantinos Karanasos
- Subru Krishnan
- Pavle Subotić
- Markus Weimer
- Yinghui Wu
- Yiwen Zhu
ACM SIGMOD
As enterprises have started developing and deploying complicated data science workloads at scale, the need for mechanisms that enable enterprise-grade data science (e.g., compliance or auditing) has become more pronounced. In this paper, we present Geyser, an extensible provenance system for data science workloads that can be used as a foundation for enterprise-grade data science. Our system supports both static and dynamic provenance, over a wide range of data science scripts, driven by a knowledge base of data science APIs. We demonstrate the wide applicability of the system using various industrial applications: provenance extraction, model compliance, model linting, model versioning, and poisoning detection. A video of the demonstration is available at https://aka.ms/geyserdemo.
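To make the idea of static provenance driven by a knowledge base of APIs more concrete, here is a minimal sketch in Python. It is not Geyser's implementation: the `KNOWLEDGE_BASE` entries, the record layout, the helper names, and the sample script (including the `customers.csv` path) are all hypothetical and chosen only for illustration. The sketch parses a script with the standard `ast` module, without executing it, and tags calls that appear in the knowledge base with a provenance role.

```python
import ast

# Hypothetical mini knowledge base: maps the last component of a call target
# to a provenance role. Illustrative only; not Geyser's actual knowledge base.
KNOWLEDGE_BASE = {
    "read_csv": "dataset",
    "LogisticRegression": "model",
    "fit": "training",
}


def call_name(node):
    """Return the last component of a call target, e.g. 'read_csv' for pd.read_csv(...)."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def extract_static_provenance(source):
    """Walk the AST (the script is never run) and collect provenance records."""
    records = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = call_name(node)
        role = KNOWLEDGE_BASE.get(name)
        if role is None:
            continue  # API not described in the knowledge base
        # String literal arguments (e.g. file paths) and variable names passed in.
        literals = [
            a.value
            for a in node.args
            if isinstance(a, ast.Constant) and isinstance(a.value, str)
        ]
        inputs = [a.id for a in node.args if isinstance(a, ast.Name)]
        records.append(
            {"line": node.lineno, "api": name, "role": role,
             "literals": literals, "inputs": inputs}
        )
    return sorted(records, key=lambda r: r["line"])


# Hypothetical data science script, kept as a string so it is only parsed.
SCRIPT = '''
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("customers.csv")
X = df.drop(columns=["churn"])
y = df["churn"]
model = LogisticRegression()
model.fit(X, y)
'''

PROVENANCE = extract_static_provenance(SCRIPT)
for record in PROVENANCE:
    print(record)
```

Because the script is only parsed, the sketch captures what can be known statically (which dataset path is read, which model class is constructed, which variables flow into training). Dynamic provenance, such as the actual rows or hyperparameter values seen at run time, would require instrumenting execution, which this sketch does not attempt.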