Deep learning-based embeddings are widely used for "dense retrieval" in information retrieval, computer vision, and NLP, amongst other areas, owing to their ability to capture diverse types of semantic information. This paradigm constructs embeddings so that semantically similar items are closer together in a high-dimensional metric space. The first step to enabling search and recommendation with such embeddings is to index the embeddings of the corpus and support approximate nearest-neighbor search (ANNS), a.k.a. vector search, for query embeddings. While ANNS is a fundamental problem that has been studied for decades, existing algorithms suffer from two main drawbacks: either their search accuracy is low, which degrades the quality of downstream results, or their memory (DRAM) footprint is enormous, making them hard to serve at web scale.
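To make the setting concrete, here is a minimal brute-force k-nearest-neighbor baseline in NumPy (the corpus size, dimensionality, and names are illustrative). This exhaustive scan computes the exact answer that ANNS algorithms approximate; its per-query cost grows linearly with the corpus, which is what makes it infeasible at web scale.

```python
import numpy as np

# Illustrative sizes: a real corpus may contain billions of vectors.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, 128)).astype(np.float32)  # indexed embeddings
query = rng.standard_normal(128).astype(np.float32)             # query embedding

def exact_knn(corpus: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact k-nearest-neighbor search under Euclidean distance.
    This is the ground truth that ANNS approximates: an exhaustive
    scan is O(corpus size) per query, too slow at web scale."""
    dists = np.linalg.norm(corpus - query, axis=1)
    return np.argsort(dists)[:k]

print(exact_knn(corpus, query))
```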
In this project, we are designing algorithms to address the challenges of scaling ANNS for web and enterprise search and recommendation systems. Our goal is to build systems that cost-effectively serve trillions of points in a streaming setting. Below is a summary of the associated research directions:
- DiskANN: an ANNS algorithm that achieves both high accuracy and a low DRAM footprint by suitably using auxiliary SSD storage, which is significantly more cost-effective than DRAM. Using DiskANN, we can index 5-10x more points per machine than state-of-the-art DRAM-based solutions: e.g., DiskANN can index up to a billion vectors while achieving 95% search accuracy with 5ms latencies, whereas existing DRAM-based algorithms peak at 100-200M points for similar latency and accuracy. (A sketch of the search-side idea appears after this list.)
- FreshDiskANN: We present the first graph-based ANNS index that reflects corpus updates in the index in real time without compromising search performance. Using the new graph update rules, we design a fresh-ANNS system that can index over a billion points on a workstation with an SSD and limited memory, and support thousands of concurrent real-time inserts, deletes, and searches per second each, while retaining >95% 5-recall@5. (See the streaming sketch after this list.)
- Filtered Search: We are also working on ANNS algorithms that support vector search with auxiliary filters based on structured information: for example, retrieving the closest vectors that satisfy certain criteria, such as a date range, region, or stock availability. (A baseline sketch appears after this list.)
- Out-of-distribution queries: Data-dependent ANNS algorithms overfit to the distribution of the indexed data and are inaccurate for queries drawn from a different distribution, as is often the case in multi-modal datasets. This work addresses how to use a sample query set to improve accuracy for such queries. (An illustration appears after this list.)
- Theoretical Understanding: While graph-based ANNS algorithms have seen tremendous success in real-world applications, we do not yet understand when and why they work so effectively. One of our ongoing efforts is to obtain a deeper understanding here.
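For DiskANN, here is a minimal sketch of the search-side idea, not the actual implementation: a toy random graph and float16 truncation stand in for the pruned graph and product-quantized vectors of the real system (which keeps compressed vectors in DRAM and the graph plus full-precision vectors on SSD). A greedy best-first traversal is steered by cheap approximate distances, and the visited shortlist is re-ranked with exact distances at the end.

```python
import heapq
import numpy as np

rng = np.random.default_rng(1)
N, D, DEG = 2_000, 64, 16
vectors = rng.standard_normal((N, D)).astype(np.float32)   # full vectors ("on SSD")
compressed = vectors.astype(np.float16)                    # toy proxy for PQ codes in DRAM
# Toy graph with random out-neighbors; DiskANN builds a carefully pruned graph.
graph = [rng.choice(N, size=DEG, replace=False) for _ in range(N)]

def beam_search(query, entry=0, beam=8, k=5):
    """Greedy best-first graph traversal steered by approximate distances."""
    q16 = query.astype(np.float16)
    approx = lambda i: float(np.linalg.norm(compressed[i] - q16))
    visited = {entry}
    frontier = [(approx(entry), entry)]       # min-heap on approximate distance
    candidates = [(-approx(entry), entry)]    # max-heap keeping the best `beam` nodes
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(candidates) >= beam and d > -candidates[0][0]:
            break                             # no closer candidates remain
        for nbr in graph[node]:
            if nbr in visited:
                continue
            visited.add(nbr)
            dn = approx(nbr)
            heapq.heappush(frontier, (dn, nbr))
            heapq.heappush(candidates, (-dn, nbr))
            if len(candidates) > beam:
                heapq.heappop(candidates)
    # Re-rank the shortlist with exact distances (full vectors fetched from SSD).
    shortlist = [i for _, i in candidates]
    exact = np.linalg.norm(vectors[shortlist] - query, axis=1)
    return [shortlist[j] for j in np.argsort(exact)[:k]]

print(beam_search(rng.standard_normal(D).astype(np.float32)))
```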
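For FreshDiskANN, the paper's core contribution is its graph update rules; the surrounding streaming pattern is easier to sketch. Below is a hedged illustration with hypothetical names (not FreshDiskANN's actual API): inserts take effect immediately, deletes are recorded as tombstones and filtered out at query time, and a periodic consolidation pass physically reclaims deleted points. A brute-force scan stands in for the graph index.

```python
import numpy as np

class StreamingIndex:
    """Toy streaming ANN wrapper: a real system replaces the brute-force
    scan with a graph index and applies in-place graph update rules."""
    def __init__(self, dim: int):
        self.ids: list = []
        self.vecs = np.empty((0, dim), np.float32)
        self.tombstones: set = set()   # ids deleted but not yet reclaimed

    def insert(self, id_, vec):
        self.ids.append(id_)
        self.vecs = np.vstack([self.vecs, vec[None, :]])

    def delete(self, id_):
        self.tombstones.add(id_)       # O(1): defer physical removal

    def search(self, query, k=5):
        dists = np.linalg.norm(self.vecs - query, axis=1)
        # Filter tombstoned points at query time.
        hits = [self.ids[i] for i in np.argsort(dists)
                if self.ids[i] not in self.tombstones]
        return hits[:k]

    def consolidate(self):
        """Periodic background pass that physically removes deleted points."""
        keep = [i for i, id_ in enumerate(self.ids) if id_ not in self.tombstones]
        self.ids = [self.ids[i] for i in keep]
        self.vecs = self.vecs[keep]
        self.tombstones.clear()

idx = StreamingIndex(8)
rng = np.random.default_rng(2)
for i in range(100):
    idx.insert(i, rng.standard_normal(8).astype(np.float32))
idx.delete(3)                          # takes effect on the very next search
print(idx.search(rng.standard_normal(8).astype(np.float32)))
idx.consolidate()
```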
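For filtered search, here is a sketch of the post-filtering baseline that filter-aware indices aim to improve on: over-fetch candidates from an unfiltered search, then apply the predicate. The attribute names and over-fetch factor are illustrative; when the predicate is selective, most candidates are discarded, which motivates pushing filters into the index itself.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 5_000, 32
vecs = rng.standard_normal((N, D)).astype(np.float32)
regions = rng.choice(["us", "eu", "asia"], size=N)   # structured attribute per vector
in_stock = rng.random(N) < 0.5

def post_filtered_search(query, predicate, k=5, overfetch=10):
    """Retrieve k*overfetch nearest points, then apply the filter.
    A selective predicate may still leave fewer than k survivors,
    which is why filter-aware ANNS indices are preferable."""
    dists = np.linalg.norm(vecs - query, axis=1)
    candidates = np.argsort(dists)[: k * overfetch]
    survivors = [i for i in candidates if predicate(i)]
    return survivors[:k]

query = rng.standard_normal(D).astype(np.float32)
print(post_filtered_search(query, lambda i: regions[i] == "eu" and in_stock[i]))
```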
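For out-of-distribution queries, the linked work uses sample queries to improve the index itself; as a simpler, hypothetical stand-in, the sketch below shows how a held-out query sample can expose and partially mitigate the problem, by measuring the recall of a toy IVF-style index on in-distribution versus shifted queries and tuning the search budget on the sample. All distributions and parameters here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, C = 5_000, 16, 64
corpus = rng.standard_normal((N, D)).astype(np.float32)
centroids = corpus[rng.choice(N, C, replace=False)]   # toy coarse quantizer
assign = np.argmin(((corpus[:, None] - centroids[None]) ** 2).sum(-1), axis=1)

def ivf_search(q, nprobe, k=10):
    """Scan only the `nprobe` clusters nearest to the query."""
    probes = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]
    cand = np.where(np.isin(assign, probes))[0]
    return cand[np.argsort(((corpus[cand] - q) ** 2).sum(-1))[:k]]

def recall(queries, nprobe, k=10):
    hits = 0
    for q in queries:
        truth = np.argsort(((corpus - q) ** 2).sum(-1))[:k]
        hits += len(set(truth) & set(ivf_search(q, nprobe, k)))
    return hits / (k * len(queries))

in_dist = rng.standard_normal((50, D)).astype(np.float32)
ood = in_dist + 3.0   # shifted query distribution (e.g., a different modality)

# The same search budget is typically less accurate on shifted queries...
print("in-dist recall:", recall(in_dist, nprobe=4))
print("OOD recall:   ", recall(ood, nprobe=4))
# ...so a sample of real queries can be used to pick a budget that restores accuracy.
for nprobe in (4, 8, 16, 32):
    print(nprobe, recall(ood, nprobe))
```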
The codebase is maintained at https://github.com/Microsoft/DiskANN.
See also the NeurIPS’21 challenge on billion-scale ANNS algorithms, and the report on winning ideas.