As businesses become more data-driven, large enterprises are increasingly interested in adopting data lakes (e.g., Microsoft Fabric). A data lake is a large repository that holds vast amounts of data in a variety of open data formats, making the data accessible to any use case (e.g., AI, data science, BI, reporting) that has arisen or could arise. These formats include text-based raw data formats such as CSV and JSON, row-wise binary formats such as Apache Avro, and batched column-wise formats such as Apache Parquet and ORC. In a data lake, data is ingested in its native open format, without expensive and time-consuming data preparation.
We are innovating on the storage tier of this emerging architecture to accelerate query processing on various open data formats. Our research has been commercialized and is widely used in several Microsoft products. Example techniques we developed include:
- Mison. Mison is a fast parser for raw data formats such as CSV and JSON. It is an order of magnitude faster than the traditional approach based on finite state machines. Our parsing technique allows query engines to push the projections and filters of a query down into the parser, which avoids a great deal of wasted work because only the fields relevant to the query are parsed. It also breaks the dependencies between state transitions that the traditional approach imposes, which enables the parser to process a vector of characters in parallel with SIMD instructions.
- Parquet-select. Parquet-select is an Apache Parquet reader that is up to one order of magnitude faster than the open-source Parquet reader. It enables predicate pushdown in Parquet and thus avoids expensive decompression of compressed column values that the query does not need. Our techniques make extensive use of Bit Manipulation Instructions (BMI), an instruction set extension of the x86 architecture that is widely available in Intel and AMD CPUs.
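
To make the idea of pushing a projection into the parser more concrete, here is a minimal Python sketch. It is not the Mison implementation: it is a simplified illustration that assumes a flat JSON object with no nesting and no escaped quotes, and the helper names (`char_bitmap`, `string_mask`, `structural_positions`, `project_field`) are hypothetical. It builds one bitmap per structural character, masks out characters that fall inside string literals, and then decodes only the requested field. Python integers stand in for the SIMD registers a real parser would use.

```python
import json


def char_bitmap(text: str, ch: str) -> int:
    """Return an integer whose bit i is set when text[i] == ch."""
    bits = 0
    for i, c in enumerate(text):
        if c == ch:
            bits |= 1 << i
    return bits


def string_mask(quote_bits: int, length: int) -> int:
    """Set bit i for every position from an opening quote up to, but not
    including, its closing quote (assumption: no escaped quotes)."""
    mask = 0
    inside = False
    for i in range(length):
        if (quote_bits >> i) & 1:
            inside = not inside
        if inside:
            mask |= 1 << i
    return mask


def structural_positions(text: str, ch: str) -> list[int]:
    """Positions of ch that lie outside string literals."""
    bits = char_bitmap(text, ch)
    bits &= ~string_mask(char_bitmap(text, '"'), len(text))
    positions = []
    while bits:
        lowest = bits & -bits          # isolate lowest set bit (BLSI on x86 BMI1)
        positions.append(lowest.bit_length() - 1)
        bits &= bits - 1               # clear lowest set bit (BLSR on x86 BMI1)
    return positions


def project_field(record: str, wanted: str):
    """Decode only the value of `wanted`; other values are never parsed.
    Assumption: `record` is a flat JSON object with scalar values."""
    colons = structural_positions(record, ":")
    commas = structural_positions(record, ",")
    for colon in colons:
        close = record.rfind('"', 0, colon)   # closing quote of the field name
        start = record.rfind('"', 0, close)   # opening quote of the field name
        if record[start + 1:close] != wanted:
            continue
        # The value ends at the next structural comma, or at the final brace.
        end = next((c for c in commas if c > colon), record.rindex("}"))
        return json.loads(record[colon + 1:end])
    return None
```

For a record such as `{"id": 7, "note": "a:b, c", "city": "Oslo"}`, calling `project_field(record, "city")` decodes only the last value; the colon and comma inside the `note` string are masked out and never mistaken for structure. The two bit tricks in `structural_positions` are the kind of operation that bit manipulation instructions perform in a single step, although which instructions the actual systems rely on is not stated here.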