Edit ‘vector_indexing’

osmarks 2024-11-28 21:04:47 +00:00 committed by wikimind
parent f4ad7531bd
commit 3a60b03293


@ -1,9 +1,11 @@
[[Neural nets|Modern technology]] has allowed converting many [[things]] to [[vectors]], so that things related to other things can be found by retrieving the records with the highest dot product, highest cosine similarity, or lowest L2 distance relative to a query. This can be done exactly through brute force, but the cost grows linearly with record count, which is not particularly efficient. [[Algorithms]] allow sublinear runtime scaling with respect to record count, with some possibility of missing the best (as determined by brute force) match. The main techniques are:
* graph-based, e.g. HNSW, DiskANN/Vamana
* product quantization (lossy compression; also known as asymmetric hashing), e.g. ScaNN
* inverted lists (split vectors into clusters, search a subset of the clusters)
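For reference, the exact brute-force baseline that these techniques approximate can be sketched as follows. This is a minimal NumPy illustration, not taken from any particular library: the function name `brute_force_search` and its parameters are hypothetical, and it assumes nonzero vectors when using cosine similarity.

```python
import numpy as np

def brute_force_search(records, query, k=5, metric="dot"):
    """Exact top-k search: score every record against the query (O(n * d))."""
    if metric == "dot":
        scores = records @ query                          # higher is better
    elif metric == "cosine":
        # Assumes no zero-length vectors (otherwise this divides by zero).
        norms = np.linalg.norm(records, axis=1) * np.linalg.norm(query)
        scores = (records @ query) / norms                # higher is better
    elif metric == "l2":
        # Negate so that "higher is better" holds for all three metrics.
        scores = -np.linalg.norm(records - query, axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    k = min(k, len(records))
    top = np.argpartition(-scores, k - 1)[:k]             # unordered top-k
    return top[np.argsort(-scores[top])]                  # best match first
```

Every query touches every record, which is the linear scaling the approximate methods are trying to avoid.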
IVFADC (for some reason), which is just inverted lists combined with product quantization, was historically the most common way to search large vector datasets and remains popular via FAISS. However, recall is very bad in some circumstances, most notably when query and dataset vectors are drawn from significantly different distributions (see [[https://arxiv.org/abs/2305.04359]] and [[https://kay21s.github.io/RoarGraph-VLDB2024.pdf]]). The latter explains this phenomenon as resulting from the nearest neighbours being split across many more (and more widely distributed) clusters (cells) than with in-distribution queries.
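The inverted-list half of this scheme can be sketched roughly as below, without the product quantization step. This is an illustrative toy, not FAISS's implementation: `build_ivf`, `ivf_search`, `n_cells` and `n_probe` are names assumed for this example, and the clustering is a few iterations of plain k-means.

```python
import numpy as np

def build_ivf(records, n_cells=8, n_iter=10, seed=0):
    """Cluster records with a few k-means iterations; return centroids and per-cell id lists."""
    rng = np.random.default_rng(seed)
    centroids = records[rng.choice(len(records), n_cells, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = ((records[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(n_cells):
            members = records[assign == c]
            if len(members):                      # leave empty cells where they are
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((records[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    cells = [np.flatnonzero(assign == c) for c in range(n_cells)]
    return centroids, cells

def ivf_search(records, centroids, cells, query, k=5, n_probe=2):
    """Scan only the n_probe cells whose centroids are closest to the query (L2)."""
    cell_dists = ((centroids - query) ** 2).sum(-1)
    probe = np.argsort(cell_dists)[:n_probe]
    candidates = np.concatenate([cells[c] for c in probe])
    dists = ((records[candidates] - query) ** 2).sum(-1)
    return candidates[np.argsort(dists)[:k]]
```

With `n_probe` equal to `n_cells` this degenerates to brute force. The recall problem described above corresponds to the true neighbours sitting in cells that a small `n_probe` never visits.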
Graph-based approaches aim to create graphs such that a greedy search on the graph toward vertices closer to the query (by the vector distance metric) rapidly converges, most of the time, on the best-matching vertex. These generally offer better search time/recall tradeoffs but have worse build time and are in some sense more [[cursed]] algorithmically.
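The greedy search step can be sketched as follows. Note that the graph here is a plain brute-force k-nearest-neighbour graph, used only as a stand-in: it is not how HNSW or Vamana construct their graphs, and `build_knn_graph`, `greedy_search`, `degree` and `entry` are hypothetical names for this example.

```python
import numpy as np

def build_knn_graph(records, degree=8):
    """Stand-in graph: link each vertex to its `degree` nearest neighbours (O(n^2) build)."""
    dists = ((records[:, None, :] - records[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(dists, np.inf)               # no self-edges
    return np.argsort(dists, axis=1)[:, :degree]

def greedy_search(records, graph, query, entry=0):
    """Repeatedly hop to the neighbour closest to the query; stop at a local minimum."""
    current = entry
    current_dist = ((records[current] - query) ** 2).sum()
    while True:
        neighbours = graph[current]
        neighbour_dists = ((records[neighbours] - query) ** 2).sum(-1)
        best = neighbour_dists.argmin()
        if neighbour_dists[best] >= current_dist:
            return current, current_dist          # no neighbour improves on current
        current, current_dist = int(neighbours[best]), neighbour_dists[best]
```

The search ends at a vertex with no closer neighbour, which is not guaranteed to be the true nearest neighbour. Practical indexes reduce this risk with better-constructed graphs and by tracking a list of candidates rather than a single current vertex.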
Product quantization can be combined with reranking to produce better top-n search results; this is used in ScaNN and some FAISS configurations.
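A rough sketch of product quantization with reranking, assuming L2 distance and plain k-means codebooks, is given below. It is not ScaNN's or FAISS's actual implementation; all function and parameter names (`pq_train_encode`, `pq_search`, `n_sub`, `n_codes`, `shortlist`) are made up for illustration, and the vector dimension is assumed to be divisible by `n_sub`.

```python
import numpy as np

def kmeans(x, k, n_iter=10, seed=0):
    """A few iterations of plain k-means; returns k centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), k, replace=False)].astype(float)
    for _ in range(n_iter):
        assign = ((x[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = x[assign == c].mean(0)
    return centroids

def pq_train_encode(records, n_sub=4, n_codes=16):
    """Split each vector into n_sub chunks; store the nearest codebook entry per chunk."""
    subs = np.split(records, n_sub, axis=1)       # dimension must divide evenly
    codebooks = [kmeans(s, n_codes) for s in subs]
    codes = np.stack(
        [((s[:, None, :] - cb[None]) ** 2).sum(-1).argmin(1)
         for s, cb in zip(subs, codebooks)],
        axis=1,
    )
    return codebooks, codes.astype(np.uint8)      # assumes n_codes <= 256

def pq_search(records, codebooks, codes, query, k=5, shortlist=50):
    """Rank by approximate distance from the codes, then rerank a shortlist exactly."""
    q_subs = np.split(query, len(codebooks))
    # One lookup table per chunk: distance from the query chunk to each codebook entry.
    tables = [((cb - qs) ** 2).sum(-1) for cb, qs in zip(codebooks, q_subs)]
    approx = sum(t[codes[:, j]] for j, t in enumerate(tables))
    shortlist = min(shortlist, len(records))
    candidates = np.argsort(approx)[:shortlist]
    exact = ((records[candidates] - query) ** 2).sum(-1)   # needs the original vectors
    return candidates[np.argsort(exact)[:k]]
```

The query is left uncompressed while the records are quantized, which is where the "asymmetric" in asymmetric hashing comes from. Reranking only helps if the true neighbours make it into the shortlist, so `shortlist` trades extra exact distance computations for recall.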