Commit

DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

Signed-off-by: Zhao Junwang <[email protected]>
zhjwpku committed Mar 28, 2024
1 parent 623ae0f commit e63f1ae
Showing 8 changed files with 52 additions and 1 deletion.
1 change: 1 addition & 0 deletions src/SUMMARY.md
@@ -28,6 +28,7 @@
- [vector db](./databases/vectordb/README.md)
- [hnsw](./databases/vectordb/hnsw.md)
- [ivf-hnsw](./databases/vectordb/ivf-hnsw.md)
- [diskann](./databases/vectordb/diskann.md)
- [product quantization](./databases/vectordb/pq.md)
- [citus](./databases/citus.md)
- [optimizer](./databases/optimizer/README.md)
Binary file added src/assets/images/graph_index_two_options.png
Binary file added src/assets/images/vamana_graph_generation.png
Binary file added src/assets/images/vamana_indexing_algorithm.png
Binary file added src/assets/pdfs/DiskANN_2019.pdf
Binary file not shown.
4 changes: 3 additions & 1 deletion src/databases/README.md
@@ -17,7 +17,8 @@
- **[Greenplum: A Hybrid Database for Transactional and Analytical Workloads][greenplum]**
- **[Vector DB](vectordb/index.html)**
- **[Hierarchical NSW][hnsw]**
- **[IVF-HNSW][[ivf-hnsw]]**
- **[IVF-HNSW][ivf-hnsw]**
- **[DiskANN][diskann]**
- **[Product Quantization][pq]**
- **[Citus: Distributed PostgreSQL for Data-Intensive Applications][citus]**
- **[Optimizer](optimizer/index.html)**
@@ -53,3 +54,4 @@
[pq]: vectordb/pq.md
[ivf-hnsw]: vectordb/ivf-hnsw.md
[wisckey]: kv/wisckey.md
[diskann]: vectordb/diskann.md
2 changes: 2 additions & 0 deletions src/databases/vectordb/README.md
@@ -18,6 +18,7 @@
- [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs][hnsw], 2016
- [Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors][ivf-hnsw], 2018
- [FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search][hnsw-finger], 2022 [HNSW-FINGER Explained!](https://www.youtube.com/watch?v=OsxZG2XfcZA)
- [DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node][diskann], 2019

### Quantization

@@ -35,3 +36,4 @@
[pq]: pq.md
[ivf-hnsw]: ivf-hnsw.md
[hnsw-finger]: https://arxiv.org/abs/2206.11408
[diskann]: diskann.md
46 changes: 46 additions & 0 deletions src/databases/vectordb/diskann.md
@@ -0,0 +1,46 @@
### [DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node](/assets/pdfs/DiskANN_2019.pdf)

> https://dl.acm.org/doi/abs/10.5555/3454287.3455520
> Current state-of-the-art approximate nearest neighbor search (ANNS) algorithms generate indices that must be stored in main memory for fast high-recall search. This makes them expensive and limits the size of the dataset.

For larger datasets, the data can be compressed with PQ, or the dataset can be sharded across multiple nodes. Both approaches have drawbacks: PQ lowers recall, and sharding requires more hardware resources:

1. their 1-recall@1 is rather low (around 0.5) since the data compression is lossy.
2. extending this to web scale data with hundreds of billions of points would require thousands of machines.
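
The recall figure in item 1 is a k-recall@k measurement: the fraction of the true k nearest neighbors that show up among the k results returned. A minimal sketch of that metric follows; the function names `recall_at_k` and `mean_recall_at_k` are hypothetical and chosen only for illustration.

```python
def recall_at_k(returned_ids, true_ids, k):
    """Fraction of the true top-k neighbors present in the returned top-k."""
    returned = set(returned_ids[:k])
    truth = set(true_ids[:k])
    return len(returned & truth) / k


def mean_recall_at_k(all_returned, all_true, k):
    """Average k-recall@k over a batch of queries."""
    scores = [recall_at_k(r, t, k) for r, t in zip(all_returned, all_true)]
    return sum(scores) / len(scores)
```

With k = 1, a mean value around 0.5 means that the exact nearest neighbor is returned first for only about half of the queries.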

#### DiskANN

DiskANN is an SSD-resident ANNS system based on a new **graph-based** indexing algorithm called **Vamana**.

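The graph index generated by Vamana has a smaller diameter than those of [HNSW](/databases/vectordb/hnsw.md) or NSG, which minimizes disk read I/O. In addition:

- The graph index generated by Vamana can also be used directly in memory, where its search performance matches or exceeds in-memory graph index algorithms such as HNSW and NSG
- Vamana can merge the indices built separately for multiple partitions of a dataset into a single index whose search performance is almost the same as that of a single index built over the whole dataset, which suits machines with little memory
- Vamana can be combined with PQ: the compressed data is kept in memory to speed up distance computations during search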
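
To illustrate the last point, the sketch below shows how PQ codes held in memory can give fast approximate distances through a per-query lookup table (often called asymmetric distance computation). It assumes the codebooks are already trained and that the vector length is divisible by the number of sub-quantizers; all function names are hypothetical, and this is not DiskANN's actual implementation.

```python
def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]


def distance_table(query, codebooks):
    """Squared distance from each query sub-vector to every centroid."""
    return [[sq_dist(sub, centroid) for centroid in cb]
            for sub, cb in zip(split(query, len(codebooks)), codebooks)]


def approx_sq_dist(table, code):
    """Approximate squared distance: one table lookup per sub-quantizer."""
    return sum(table[m][c] for m, c in enumerate(code))
```

The table is built once per query, so each candidate costs only a handful of lookups and additions instead of a full-precision distance computation.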

#### Vamana Graph Construction Algorithm

HNSW, NSG, and Vamana are all essentially graph algorithms, and each can be abstracted as two operations, GreedySearch and RobustPrune:

![GreedySearch and RobustPrune](/assets/images/graph_index_two_options.png)

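The differences are:

- HNSW and NSG have no tunable parameter α and implicitly use α = 1; this parameter is exactly what lets Vamana trade off graph degree against diameter
- Vamana and NSG use the set of all points visited by GreedySearch(s, p, 1, L) as the candidate set for RobustPrune, which adds long-range edges to the graph, whereas HNSW achieves this effect through its layered structure
- NSG starts from an approximate K-nearest neighbor graph of the dataset, which takes longer to build and consumes more memory, whereas HNSW and Vamana start from simpler graphs: the former from an empty graph, the latter from a random graph
- Vamana makes two passes over the dataset; the second pass improves the quality of the graph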

![Vamana Indexing algorithm](/assets/images/vamana_indexing_algorithm.png)
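
The construction can be sketched in a few dozen lines. This is a minimal in-memory sketch for a toy dataset, not the DiskANN implementation: it assumes Euclidean distance, computes the medoid exactly in O(n²), and uses hypothetical function names (`greedy_search`, `robust_prune`, `vamana_build`). `R` is the maximum out-degree, `L` the search list size, and `alpha` the pruning parameter α.

```python
import math
import random


def greedy_search(graph, points, start, query, k, L):
    """Return (k closest nodes found, set of all nodes visited)."""
    def d(i):
        return math.dist(points[i], query)

    candidates, visited = {start}, set()
    while candidates - visited:
        p = min(candidates - visited, key=d)    # closest unvisited candidate
        visited.add(p)
        candidates.update(graph[p])
        if len(candidates) > L:                 # keep only the L closest
            candidates = set(sorted(candidates, key=d)[:L])
    return sorted(candidates, key=d)[:k], visited


def robust_prune(graph, points, p, candidates, alpha, R):
    """Rebuild the out-neighbors of p from candidates, at most R of them."""
    pool = (set(candidates) | set(graph[p])) - {p}
    graph[p] = []
    while pool:
        best = min(pool, key=lambda i: math.dist(points[p], points[i]))
        graph[p].append(best)
        if len(graph[p]) == R:
            break
        # Drop every candidate that `best` already covers; a larger alpha
        # drops fewer candidates, so more long-range edges survive.
        pool = {i for i in pool
                if alpha * math.dist(points[best], points[i])
                > math.dist(points[p], points[i])}


def vamana_build(points, R, L, alpha, seed=0):
    """Two-pass construction: alpha = 1 first, then the given alpha."""
    rng = random.Random(seed)
    n = len(points)
    # Random initial graph with up to R out-neighbors per node.
    graph = {i: rng.sample([j for j in range(n) if j != i], min(R, n - 1))
             for i in range(n)}
    # Medoid as the fixed entry point (O(n^2); fine for a toy dataset).
    start = min(range(n),
                key=lambda i: sum(math.dist(points[i], q) for q in points))
    for a in (1.0, alpha):
        order = list(range(n))
        rng.shuffle(order)
        for p in order:
            _, visited = greedy_search(graph, points, start, points[p], 1, L)
            robust_prune(graph, points, p, visited, a, R)
            for j in list(graph[p]):            # add backward edges
                if p in graph[j]:
                    continue
                if len(graph[j]) + 1 > R:
                    robust_prune(graph, points, j, graph[j] + [p], a, R)
                else:
                    graph[j].append(p)
    return graph, start
```

A call such as `graph, start = vamana_build(points, R=4, L=10, alpha=1.2)` returns an adjacency dict whose out-degrees never exceed `R`, together with the entry point to pass to `greedy_search`.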

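As the figure below shows, the first row uses α = 1 and eliminates many unnecessary edges, while the second row uses α > 1 and adds some so-called long-range edges back into the graph: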

![Progression of the graph generated by the Vamana](/assets/images/vamana_graph_generation.png)

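DiskANN speeds up queries with BeamSearch (a beamwidth setting lets it read several disk blocks at once) and by caching the most frequently visited nodes (e.g., by caching all vertices that are C = 3 or 4 hops from the starting point s). In addition, DiskANN stores the vectors of neighbor nodes in the on-disk index file to improve search accuracy (Implicit Re-Ranking Using Full-Precision Vectors).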
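
The effect of the beamwidth and of the hop-based cache can be seen in a small simulation. The sketch below is a toy model, not DiskANN's I/O path: it keeps the whole graph in memory, uses full-precision Euclidean distances, and only counts how many batched read rounds a search would need. The names `nodes_within_hops` and `beam_search` and the parameter `W` (beamwidth) are hypothetical.

```python
import math
from collections import deque


def nodes_within_hops(graph, start, max_hops):
    """Breadth-first search: every node at most max_hops away from start."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue
        for nb in graph[node]:
            if nb not in hops:
                hops[nb] = hops[node] + 1
                queue.append(nb)
    return set(hops)


def beam_search(graph, points, start, query, k, L, W, cache=frozenset()):
    """Return (k closest nodes found, number of simulated read rounds)."""
    def d(i):
        return math.dist(points[i], query)

    candidates, visited, rounds = {start}, set(), 0
    while candidates - visited:
        # Expand the W closest unvisited candidates as one batch.
        frontier = sorted(candidates - visited, key=d)[:W]
        # A batch costs one read round unless every node in it is cached.
        if any(p not in cache for p in frontier):
            rounds += 1
        for p in frontier:
            visited.add(p)
            candidates.update(graph[p])
        candidates = set(sorted(candidates, key=d)[:L])
    return sorted(candidates, key=d)[:k], rounds
```

On a star-shaped graph with four leaves, `W=1` needs five rounds while `W=4` needs two, and passing `cache=nodes_within_hops(graph, start, 3)` removes the rounds spent near the entry point.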

#### References:

- [DiskANN, A Disk-based ANNS Solution with High Recall and High QPS on Billion-scale Dataset](https://milvus.io/blog/2021-09-24-diskann.md)
- [Vamana vs. HNSW - Exploring ANN algorithms Part 1](https://weaviate.io/blog/ann-algorithms-vamana-vs-hnsw)
