Interacting with storage, whether main memory, local storage, or cloud storage, is one of the hardest challenges faced by application and platform developers. We have a “kitchen sink” of solutions available today, each optimized for a specific workload. The SimpleStore project aims to simplify the use of storage for modern cloud, edge, serverless, and big data applications. Our recent presentation at HPTS gives an overview of the broader research project. We tackle the problem under two broad umbrellas:
SimpleStore for Compute
We aim to simplify individual object access, update, and read-modify-write, for embedded edge and cloud applications, streaming, and auto-scaling serverless and actor-oriented compute frameworks. Towards this vision, we have been building systems, abstractions, and consistency models. The projects under this category include:
- FASTER: The FASTER project aims to provide an embedded key-value store and cache (FasterKV) and a log (FasterLog) abstraction over tiered storage, at very high performance.
- CPR: CPR is a new scalable recovery model that provides consistency across caches and storage, in a manner that is applicable to any database or key-value store. We have developed single- and multi-node versions of this model, and it is used for recovery in FASTER.
- Distribution and Scale-Out: We have built CRA, an open-source distributed virtual connection runtime for the modern cloud-edge. CRA has been used with systems like Ambrosia and FASTER to provide resilient and ephemeral storage capabilities. In the Shadowfax project, we are working on making it easier and more efficient to use FASTER in a distributed client-server environment. Finally, we are working on consistent storage and cache access in distributed serverless and actor environments, using a distributed version of CPR.
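To make the abstractions above concrete, here is a minimal in-memory sketch of the three per-object operations (read, update, read-modify-write) alongside an append-only log. The names `ToyKV`, `ToyLog`, and `count_clicks` are hypothetical and chosen for illustration; this is not the FASTER API, and it leaves out tiered storage, concurrency, and recovery entirely.

```python
from typing import Callable, Dict, List, Optional


class ToyKV:
    """Hypothetical in-memory stand-in for a key-value store (not FasterKV)."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def read(self, key: str) -> Optional[int]:
        # Point read: returns None when the key is absent.
        return self._data.get(key)

    def upsert(self, key: str, value: int) -> None:
        # Blind update: replaces any existing value without reading it.
        self._data[key] = value

    def rmw(self, key: str, initial: int, update: Callable[[int], int]) -> int:
        # Read-modify-write: apply `update` to the current value,
        # or store `initial` if the key is absent.
        if key in self._data:
            self._data[key] = update(self._data[key])
        else:
            self._data[key] = initial
        return self._data[key]


class ToyLog:
    """Hypothetical append-only log (not FasterLog)."""

    def __init__(self) -> None:
        self._entries: List[bytes] = []

    def append(self, entry: bytes) -> int:
        # Returns the address (here, simply the index) of the new entry.
        self._entries.append(entry)
        return len(self._entries) - 1

    def scan(self, start: int = 0) -> List[bytes]:
        # Returns a copy of all entries from `start` onward.
        return list(self._entries[start:])


def count_clicks(events: List[str]) -> ToyKV:
    """Per-key counting, a typical read-modify-write workload."""
    kv = ToyKV()
    for user in events:
        kv.rmw(user, initial=1, update=lambda v: v + 1)
    return kv
```

Per-key counters like `count_clicks` are a typical case where read-modify-write is more natural than a separate read followed by an upsert, because the update depends on the value already stored.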
SimpleStore for Analytics
We aim to simplify and accelerate access to storage for analytics and more complex querying patterns (beyond point reads) by both applications and database systems. The projects under this category include:
- Qd-tree: In the qd-tree project, we have developed new techniques that leverage workload information to optimize data layouts, with the goal of accelerating modern analytics systems and databases. We are currently looking into supporting a broader class of workloads and caching layers.
- FishStore: Modern data sources have fixed or flexible schemas. FishStore is a fast ingestion, storage, and retrieval system that supports fast time-based ingestion of data and allows users to impose a complex workload on storage, with no a priori index or data layout selection necessary. FishStore leverages Mison and simdjson for fast partial parsing of JSON data. As future work, we plan to generalize FishStore to arbitrary types of queries over rapidly ingested logs.
- Secondary Indexing: PSF indexing is a concept from FishStore that allows users to define arbitrary "predicated subsets" of data and make them easily accessible for future querying. We are adding this capability to FASTER C#. Further, based on our experience with FishStore, we are investigating the use of FASTER as the storage layer below a secondary range index such as RocksDB, in order to support range queries.
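The idea of predicated subsets can be illustrated with a small in-memory sketch. Here a user-defined function maps each ingested record to a grouping value, or to `None` when the record belongs to no subset, and the store keeps a list of log addresses per (function name, value) pair. The class `ToyPsfStore` and its methods are hypothetical names for illustration; this is not the FishStore or FASTER API and says nothing about how either system lays out its index.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List, Optional, Tuple

Record = Dict[str, Any]
# A subset function maps a record to a grouping value,
# or to None if the record belongs to no subset.
Psf = Callable[[Record], Optional[Hashable]]


class ToyPsfStore:
    """Hypothetical append-only store with user-defined subset indexing."""

    def __init__(self) -> None:
        self._log: List[Record] = []
        self._psfs: Dict[str, Psf] = {}
        self._index: Dict[Tuple[str, Hashable], List[int]] = defaultdict(list)

    def register_psf(self, name: str, psf: Psf) -> None:
        # In this sketch, only records ingested after registration are indexed.
        self._psfs[name] = psf

    def ingest(self, record: Record) -> int:
        # Append the record, then evaluate every registered function on it.
        address = len(self._log)
        self._log.append(record)
        for name, psf in self._psfs.items():
            value = psf(record)
            if value is not None:
                self._index[(name, value)].append(address)
        return address

    def query(self, name: str, value: Hashable) -> List[Record]:
        # Return the records in the subset, in ingestion order.
        addresses = self._index.get((name, value), [])
        return [self._log[a] for a in addresses]
```

A caller might register `lambda r: r.get("repo")` under the name `"repo"` and later call `query("repo", "FASTER")` to retrieve only the matching records, without scanning the whole log.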
Software Links
- https://github.com/microsoft/FASTER
- https://github.com/microsoft/CRA
- https://github.com/microsoft/FishStore