Wentao Wu

Principal Researcher

About

I am with the Data Systems group at Microsoft Research. I received my Ph.D. from the Database Group at the University of Wisconsin-Madison, under the supervision of Prof. Jeffrey Naughton.

I have broad interests in database systems, data mining, and machine learning. I am currently working on query optimization, query processing, database system performance tuning, big data systems, distributed systems, data stream processing, and machine learning systems. In the past, I have worked on topics including graph data management, personal data management, knowledge base construction, social network analysis, data privacy, entity matching in data integration, and database as a service in the cloud.

Since I joined Microsoft Research, my primary focus has been the project "Autonomous Index Tuning for Database Systems", with an emphasis on using machine learning (ML) technologies to improve the efficiency and effectiveness of index tuning. More details can be found in [SIGMOD’24] [SIGMOD’24] [SIGMOD Record] [SIGMOD’22] [SIGMOD’22] [VLDB’22] [SIGMOD’19] [VLDB’18]. I have also worked with production teams on developing indexing technologies in Helios [VLDB’20] and Hyperspace [VLDB’21]. Helios is a system for inexpensive and flexible ingestion, indexing, and aggregation of large streams of real-time data at Microsoft, which combines the cloud and the edge as a single, holistic data processing platform. It has been featured in two blog entries of "the morning paper" series (part 1 and part 2). Hyperspace introduces an indexing subsystem for Apache Spark. It has been used by Azure Synapse Analytics and is also open-sourced on GitHub.

In addition, I have worked on the project "Query Optimization for Data Stream Processing Systems". More details can be found in [ICDE’22] [VLDB’21] [CIDR’19]. I have also worked in the area of "MLDev and MLOps", which aims to ease the development work of data scientists who use ML technologies. More details can be found in [ICLR’24] [ICDE’23] [SIGMOD Record] [IEEE Data Engineering Bulletin] [CIDR’21] [CIDR’21] [VLDB’21] [KDD’20] [VLDB’20] [VLDB’19] [SysML’19]. The [SysML’19] paper on "continuous integration of machine learning models" has also been featured in "the morning paper" series (blog post). Moreover, I am interested in various other aspects of ML systems, such as AutoML [KDD’21] [VLDB’21], large-scale ML training [VLDB Journal] [SIGMOD’21] [ICDE’20] [ICDE’19], in-database ML [SIGMOD’22] [VLDB’17], multi-tenancy [VLDB’18] [VLDB’18], and benchmarking [VLDB’18].

Before joining Microsoft, I worked on the project "Cost Modeling and Query Optimization for Database Systems", using sampling-based technologies. More details can be found in [SIGMOD’16] [VLDB’14] [VLDB’13] [ICDE’13]. I also worked on developing Probase [SIGMOD’12], a probabilistic knowledge graph for text understanding, which later became the Microsoft Concept Graph.