Hyperspace: The Indexing Subsystem of Azure Synapse
- Rahul Potharaju
- Terry Kim
- Eunjin Song
- Wentao Wu
- Lev Novik
- Apoorve Dave
- Andrew Fogarty
- Pouria Pirzadeh
- Vidip Acharya
- Gurleen Dhody
- Jiying Li
- Sinduja Ramanujam
- Nicolas Bruno
- Cesar Galindo-Legaria
- Vivek Narasayya
- Surajit Chaudhuri
- Anil K. Nori
- Tomas Talius
- Raghu Ramakrishnan
Proceedings of the VLDB Endowment (VLDB 2021)
Microsoft recently introduced Azure Synapse Analytics, which offers an integrated experience across data ingestion, storage, and querying in Apache Spark and T-SQL over data in the lake, including files and warehouse tables. In this paper, we present our experiences with designing and implementing Hyperspace, the indexing subsystem underlying Synapse. Hyperspace enables users to build multiple types of secondary indexes on their data, maintain them through a multi-user concurrency model, and leverage them automatically, without any change to their application code, for query and workload acceleration. Many requirements of Hyperspace are based on feedback from several enterprise customers. We present the details of Hyperspace's underlying design, its user-facing APIs, its concurrency control protocol for index access, its index-aware query processing techniques, and its maintenance mechanisms for handling index updates. Evaluations over standard industry benchmarks and real customer workloads show that Hyperspace can accelerate query execution by up to 10x and, in certain real-world workloads, by up to two orders of magnitude.
Publication Downloads
Hyperspace
August 11, 2021
An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads.