documentation/vector_indexing.myco

[[Neural nets|Modern technology]] has allowed converting many [[things]] to [[vectors]], so that things related to other things can be found by retrieving the records with the highest dot product, highest cosine similarity, or lowest L2 distance to a query. This can be done exactly through brute force (a sketch follows the list below), but scoring every record is obviously not particularly efficient. [[Algorithms]] allow sublinear runtime scaling wrt. record count, with some possibility of missing the best (as determined by brute force) match. The main techniques are:
* graph-based
* product quantization (lossy compression)
* inverted lists (split vectors into clusters, search a subset of the clusters)
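For concreteness, here is the brute-force baseline that all of these techniques approximate, as a minimal NumPy sketch; the dataset size, dimensionality and k are made up for illustration. For unit-normalised vectors, dot product, cosine similarity and L2 distance all induce the same ranking, so one score suffices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up dataset: 10k unit-normalised 128-d vectors plus one query.
db = rng.standard_normal((10_000, 128)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
query = rng.standard_normal(128).astype(np.float32)
query /= np.linalg.norm(query)

# Brute force: score every record against the query, keep the top k.
# For unit vectors this dot product equals cosine similarity, and
# ranks identically to (ascending) L2 distance.
scores = db @ query
k = 10
top_k = np.argsort(-scores)[:k]   # indices of the k exact nearest records
```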
Inverted lists combined with product quantization were historically the most common way to search large vector datasets. However, recall is very bad in some circumstances, most notably when query and dataset vectors are drawn from significantly different distributions (see [[https://arxiv.org/abs/2305.04359]] and [[https://kay21s.github.io/RoarGraph-VLDB2024.pdf]]). The latter explains this phenomenon as resulting from the nearest neighbours being split across many more (and more widely distributed) clusters (cells) than with in-distribution queries.
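To make the cluster-then-probe idea concrete, here is a self-contained NumPy sketch of inverted lists: cluster the records with a toy k-means, then at query time brute-force only the few cells whose centroids are closest to the query. All sizes and names here are made up; a real system would use a library such as FAISS:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((10_000, 128)).astype(np.float32)   # made-up dataset
query = rng.standard_normal(128).astype(np.float32)
n_cells, n_probe, k = 64, 4, 10

def sq_dists(x, c):
    # squared L2 distances between rows of x and rows of c,
    # via ||x||^2 - 2 x.c + ||c||^2 to avoid a huge broadcast
    return (x**2).sum(1, keepdims=True) - 2 * x @ c.T + (c**2).sum(1)

# Train coarse centroids with a few Lloyd iterations of toy k-means.
centroids = db[rng.choice(len(db), n_cells, replace=False)]
for _ in range(10):
    assign = sq_dists(db, centroids).argmin(axis=1)
    for c in range(n_cells):
        members = db[assign == c]
        if len(members):
            centroids[c] = members.mean(axis=0)

# Build the inverted lists: final cell assignment -> member record ids.
assign = sq_dists(db, centroids).argmin(axis=1)
lists = [np.flatnonzero(assign == c) for c in range(n_cells)]

# Query time: pick the n_probe cells nearest the query, then brute-force
# only their members instead of the whole dataset.
nearest_cells = np.argsort(((centroids - query) ** 2).sum(axis=1))[:n_probe]
candidates = np.concatenate([lists[c] for c in nearest_cells])
top_k = candidates[np.argsort(((db[candidates] - query) ** 2).sum(axis=1))[:k]]
```

Recall is governed by n_probe: the out-of-distribution failure described above shows up as the true neighbours being spread across far more cells, so many more probes are needed to reach the same recall.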