From f4ad7531bd846833d9c0113e87be1b2170fef918 Mon Sep 17 00:00:00 2001
From: osmarks
Date: Thu, 28 Nov 2024 21:03:33 +0000
Subject: [PATCH] =?UTF-8?q?Edit=20=E2=80=98vector=5Findexing=E2=80=99?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 vector_indexing.myco | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vector_indexing.myco b/vector_indexing.myco
index 1aa4ed7..583f385 100644
--- a/vector_indexing.myco
+++ b/vector_indexing.myco
@@ -6,4 +6,4 @@
 
 IVF-DAC (for some reason), which is just inverted lists combined with product quantization, was historically the most common way to search large vector datasets. However, recall is very bad in some circumstances (most notably when query/dataset vectors are drawn from significantly different distributions: see [[https://arxiv.org/abs/2305.04359]] and [[https://kay21s.github.io/RoarGraph-VLDB2024.pdf]]). The latter explains this phenomenon as resulting from the nearest neighbours being split across many more (and more widely distributed) clusters (cells) than with in-distribution queries.
 
-Graph-based approaches aim to create graphs such that a greedy search on the graph toward closer (by the vector distance metric) points rapidly converges on (most of the time) the best-matching point.
\ No newline at end of file
+Graph-based approaches aim to create graphs such that a greedy search on the graph toward closer (by the vector distance metric) vertices rapidly converges on (most of the time) the best-matching vertex. These generally offer better search time/recall tradeoffs but have worse build time and are in some sense more [[cursed]] algorithmically.
\ No newline at end of file
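
Note on the added sentence: the greedy search it describes can be sketched roughly as follows. This is a minimal illustration, not the method of any particular index; the names `greedy_search`, `vectors`, `neighbours` and `entry`, the Euclidean metric, and the toy chain graph are all assumptions made for the example. The walk stops at the first vertex with no closer neighbour, which may be only a local minimum, hence "most of the time" in the patched text.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_search(vectors, neighbours, query, entry, dist=euclidean):
    """Walk the graph from `entry`, always moving to the neighbour closest to `query`.

    `vectors` maps vertex id -> vector and `neighbours` maps vertex id -> list of
    adjacent vertex ids (both hypothetical structures for this sketch).
    Returns (vertex id, distance) of the first vertex with no closer neighbour.
    """
    current = entry
    current_dist = dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for n in neighbours[current]:
            d = dist(vectors[n], query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:
            # No neighbour is closer: a local (hopefully global) minimum.
            return current, current_dist
        current, current_dist = best, best_dist


# Toy data: five points on a line, linked as a chain.
vectors = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = greedy_search(vectors, neighbours, query=(3.2,), entry=0)
```

On this toy chain the walk visits vertices 0, 1, 2 and 3 in turn and stops at vertex 3, the true nearest neighbour of the query. On a poorly connected graph the same loop could stop early at a vertex that is not the nearest, which is the failure case that graph construction tries to make rare.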