Data Science Through the Looking Glass: Analysis of Millions of GitHub Notebooks and ML.NET Pipelines
- Fotis Psallidas
- Yiwen Zhu
- Bojan Karlaš
- Jordan Henkel
- Matteo Interlandi
- Subru Krishnan
- Brian Kroth
- Venkatesh Emani
- Wentao Wu
- Ce Zhang
- Markus Weimer
- Avrilia Floratou
- Carlo Curino
- Konstantinos Karanasos
SIGMOD Record
The recent success of machine learning (ML) has led to explosive growth in systems and applications built by an ever-growing community of system builders and data science (DS) practitioners. This quickly shifting panorama, however, is challenging for system builders and practitioners alike to follow. In this paper, we set out to capture this panorama through a wide-angle lens, performing the largest analysis of DS projects to date and focusing on questions that can advance our understanding of the field and determine investments. Specifically, we download and analyze (a) over 8M notebooks publicly available on GitHub and (b) over 2M enterprise ML pipelines developed within Microsoft. Our analysis includes coarse-grained statistical characterizations, fine-grained analysis of libraries and pipelines, and comparative studies across datasets and time. We report a large number of measurements for our readers to interpret and draw actionable conclusions on (a) what system builders should focus on to better serve practitioners and (b) which technologies practitioners should rely on.