mirror of
https://github.com/osmarks/website
synced 2026-06-08 21:52:18 +00:00
copyedits, heavenbanning post
This commit is contained in:
@@ -2,7 +2,7 @@
|
||||
title: Scaling meme search to 230 million images
|
||||
description: Downloading and indexing everything* on Reddit on one computer.
|
||||
created: 24/01/2025
|
||||
# series: meme_search
|
||||
series: meme_search
|
||||
series_index: 3
|
||||
slug: memescale
|
||||
---
|
||||
@@ -50,7 +50,7 @@ There are better solutions available. [DiskANN](https://proceedings.neurips.cc/p
|
||||
|
||||
Benchmarking on a small dataset showed that the product quantization was a significant limit on OOD query performance (and ID performance, but this starts from a much higher point and seems to vary less). DiskANN's algorithm can compensate for this with more disk reads - since it retrieves full-precision vectors as it traverses the graph, it asymptotically approaches perfect recall - but disk reads are costly. I adopted a slight variant of [OOD-DiskANN](https://arxiv.org/abs/2211.12850)'s AOPQ algorithm, which adjusts quantization to adapt to the distribution of queries. I investigated [RaBitQ](https://github.com/gaoj0017/RaBitQ) briefly due to its better claims, but it did not seem good in practice, and the basic idea was considered and rejected by [some other papers](https://arxiv.org/abs/1806.03198) anyway. In principle, there are much more efficient ways to pack vectors into short codes - including, rather recently, some work on [using a neural network to produce an implicit codebook](https://arxiv.org/abs/2501.03078) - but the ~1μs/PQ comparison time budget at query time makes almost all of them impractical even with GPU offload[^14]. The one thing which can work is *additive* quantization, which is as cheap at query time as product quantization but much slower at indexing time. The slowdown is enough that it's not practical for DiskANN, unless GPU acceleration is used. Most additive quantizers rely on beam search, which parallelizes poorly, but [LSQ++](https://openaccess.thecvf.com/content_ECCV_2018/papers/Julieta_Martinez_LSQ_lower_runtime_ECCV_2018_paper.pdf) should be practical. This may be integrated later.
|
||||
|
||||
An interesting problem I also hadn't considered beforehand was sharding: DiskANN handles larger-than-memory datasets by clustering them using k-means or a similar algorithm, processing each shard separately and stitching the resulting graphs together. K-means itself, however, doesn't produce balanced clusters, which is important since it's necessary to bound the maximum cluster size to fit it in RAM. I fixed this in a moderately cursed way by using [simulated annealing](https://github.com/osmarks/meme-search-engine/blob/master/kmeans.py) to generate approximately-right clustering centroids on a subset of the data and adding a fudge factor at full dataset shard time to balance better than that. I tried several other things which did not work.
|
||||
An unexpected problem was sharding: DiskANN handles larger-than-memory datasets by clustering them using k-means or a similar algorithm, processing each shard separately and stitching the resulting graphs together. K-means itself, however, doesn't produce balanced clusters, which is important since it's necessary to bound the maximum cluster size to fit it in RAM. I fixed this in a moderately cursed way by using [simulated annealing](https://github.com/osmarks/meme-search-engine/blob/master/kmeans.py) to generate approximately-right clustering centroids on a subset of the data and adding a fudge factor at full dataset shard time to balance better than that. I tried several other things which did not work.
|
||||
|
||||
Full graph assembly took slightly over six days with 42 shards with about 13 million vectors each (each vector is spilled to two shards so that the shards are connected, and there are more vectors in the index than are shown in results because I did some filtering too late). There were some auxiliary processes like splitting and merging them and running the rating models and quantizers, but these were much quicker.
|
||||
|
||||
|
||||
Reference in New Issue
Block a user