SPFresh: Incremental In-Place Update for Billion-Scale Vector Search
- Yuming Xu
- Hengyu Liang
- Jin Li
- Shuotao Xu
- Qi Chen
- Qianxi Zhang
- Cheng Li
- Ziyue Yang
- Fan Yang
- Yuqing Yang
- Peng Cheng
- Mao Yang
SOSP'23 | Organized by ACM
https://www.microsoft.com/en-us/research/blog/unified-database-laying-the-foundation-for-large-language-model-vertical-applications/
Microsoft Research Blog
Approximate Nearest Neighbor Search (ANNS) is now widely used in applications including information retrieval, question answering, and recommendation. As the amount of vector data grows continuously, it becomes important to support updates to the vector index, the enabling technique that allows for efficient and accurate ANNS on vectors.
Because of the curse of high dimensionality, it is often costly to identify the right neighbors of a single new vector, a necessary step in an index update. To amortize update costs, existing systems maintain a secondary index to accumulate updates, which are merged into the main index by periodically rebuilding the entire index globally. However, this approach causes large fluctuations in search latency and accuracy, not to mention that rebuilds require substantial resources and are extremely time-consuming.
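The accumulate-then-rebuild pattern described above can be illustrated with a minimal sketch. It is not taken from any particular system: the class `RebuildIndex`, the `merge_threshold` parameter, and the brute-force scan that stands in for a real ANNS index are all assumptions made for illustration.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class RebuildIndex:
    """Toy main index plus a secondary (delta) index; names are hypothetical."""

    def __init__(self, vectors, merge_threshold=4):
        self.main = list(vectors)   # stands in for the main vector index
        self.delta = []             # secondary index that accumulates updates
        self.merge_threshold = merge_threshold
        self.rebuilds = 0

    def insert(self, vec):
        # Updates are cheap: they only touch the secondary index.
        self.delta.append(vec)
        if len(self.delta) >= self.merge_threshold:
            self._global_rebuild()

    def _global_rebuild(self):
        # Placeholder for rebuilding the whole index: the work here is
        # proportional to the full dataset, not to the number of updates.
        self.main = sorted(self.main + self.delta)
        self.delta = []
        self.rebuilds += 1

    def search(self, query, k=1):
        # Every query must consult both indexes until the next merge.
        candidates = self.main + self.delta
        return sorted(candidates, key=lambda v: l2(v, query))[:k]
```

In this sketch the cost of `_global_rebuild` grows with the size of the whole dataset, which is the source of the resource and time overhead the abstract refers to.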
We introduce SPFresh, a system that supports in-place vector updates. At the heart of SPFresh is LIRE, a lightweight incremental rebalancing protocol that splits vector partitions and reassigns vectors in nearby partitions to adapt to shifts in data distribution. LIRE achieves low-overhead vector updates by reassigning only the vectors at the boundary between partitions; in a high-quality vector index, the number of such vectors is deemed small. With LIRE, SPFresh provides better query latency and accuracy than solutions based on global rebuild, while requiring only 1% of the DRAM and less than 10% of the cores at peak compared to the state of the art, on a billion-scale disk-based vector index with a daily vector update rate of 1%.
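The split-and-reassign idea can be sketched as follows. This is a simplified illustration, not the authors' LIRE implementation: the class `PartitionedIndex`, the parameters `max_size` and `neighbors`, the one-step two-way split, and the rule "move a vector if another centroid is now closer" are all assumptions chosen to keep the example small. Cascading splits triggered by reassignment are deliberately left out.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


class PartitionedIndex:
    """Toy partition-based index with local split and reassign (hypothetical)."""

    def __init__(self, centroids, max_size=4, neighbors=2):
        self.parts = {i: {"centroid": tuple(c), "vectors": []}
                      for i, c in enumerate(centroids)}
        self.next_id = len(self.parts)
        self.max_size = max_size
        self.neighbors = neighbors   # nearby partitions checked after a split
        self.reassigned = 0

    def nearest(self, vec):
        return min(self.parts, key=lambda p: dist(self.parts[p]["centroid"], vec))

    def insert(self, vec):
        pid = self.nearest(vec)
        self.parts[pid]["vectors"].append(tuple(vec))
        if len(self.parts[pid]["vectors"]) > self.max_size:
            self._split(pid)

    def _split(self, pid):
        vecs = self.parts[pid]["vectors"]
        a = vecs[0]
        b = max(vecs, key=lambda v: dist(v, a))   # seed far away from a
        if a == b:
            return                                # all vectors identical
        left = [v for v in vecs if dist(v, a) <= dist(v, b)]
        right = [v for v in vecs if dist(v, a) > dist(v, b)]
        old_centroid = self.parts.pop(pid)["centroid"]
        new_ids = []
        for group in (left, right):
            self.parts[self.next_id] = {"centroid": mean(group), "vectors": group}
            new_ids.append(self.next_id)
            self.next_id += 1
        self._reassign(new_ids, old_centroid)

    def _reassign(self, new_ids, old_centroid):
        # Only the split partitions and a few nearby ones are re-checked;
        # the rest of the index is untouched.
        others = sorted((p for p in self.parts if p not in new_ids),
                        key=lambda p: dist(self.parts[p]["centroid"], old_centroid))
        for pid in new_ids + others[: self.neighbors]:
            for v in list(self.parts[pid]["vectors"]):
                best = self.nearest(v)
                if best != pid:
                    self.parts[pid]["vectors"].remove(v)
                    self.parts[best]["vectors"].append(v)
                    self.reassigned += 1

    def search(self, query, k=1, nprobe=2):
        probe = sorted(self.parts,
                       key=lambda p: dist(self.parts[p]["centroid"], query))[:nprobe]
        cands = [v for p in probe for v in self.parts[p]["vectors"]]
        return sorted(cands, key=lambda v: dist(v, query))[:k]
```

The point of the design is that an insert touches at most one partition plus a bounded neighborhood around it, so update cost does not scale with the size of the full index.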