commit 4c6f1e89e2 (parent 40f002cb7c)
Author: osmarks
Date: 2025-04-19 22:40:07 +01:00

5 changed files with 23 additions and 10 deletions


@@ -24,9 +24,9 @@ The concept for this project was developed in May, when I was pondering how to g
The [unsanctioned datasets distributed via BitTorrent](https://academictorrents.com/details/9c263fc85366c1ef8f5bb9da0203f4c8c8db75f4), widely used in research and diligently maintained by [PushShift](https://github.com/Watchful1/PushshiftDumps) and [Arctic Shift](https://github.com/ArthurHeitmann/arctic_shift), were pleasantly easy to download and use, and after the slow process of decompressing and streaming all 500GB of historical submissions through some basic analysis tools on my staging VPS (it has a mechanical hard drive and two Broadwell cores...) I ran some rough estimates and realized that it would be possible for me to process *all* the images (up to November 2024)[^6] rather than just a subset.
This may be unintuitive, since "all the images" was, based on my early estimates, about 250 million. Assuming a (slightly pessimistic) 1MB per image, I certainly don't have 250TB of storage. Usable thumbnails would occupy perhaps 50kB each with the best available compression, which would have been very costly to apply, but 12TB is still more than I have free. The trick is that it wasn't necessary to store any of that[^4]: to do search, only the embedding vectors, occupying about 2kB each, are needed (as well as some metadata for practicality). Prior work like [img2dataset](https://github.com/rom1504/img2dataset) retained resized images for later embedding: I avoided this by implementing the entire system as a monolithic minimal-buffering pipeline going straight from URLs to image buffers to embeddings to a very large compressed blob on disk, with backpressure to clamp download speed to the rate necessary to feed the GPU.
This may be unintuitive, since "all the images" was, based on my early estimates, about 250 million. Assuming a (slightly pessimistic) 1MB per image, I certainly don't have 250TB of storage. Usable thumbnails would occupy perhaps 50kB each with the best available compression, which would have been very costly to apply, but 12TB is still more than I have free. The trick is that it wasn't necessary to store any of that[^4]: to do search, only the embedding vectors[^16], occupying about 2kB each, are needed (as well as some metadata for practicality). Prior work like [img2dataset](https://github.com/rom1504/img2dataset) retained resized images for later embedding: I avoided this by implementing the entire system as a monolithic minimal-buffering pipeline going straight from URLs to image buffers to embeddings to a very large compressed blob on disk, with backpressure to clamp download speed to the rate necessary to feed the GPU.
I spent a day or two [implementing](https://github.com/osmarks/meme-search-engine/blob/master/src/reddit_dump.rs) this, with a mode to randomly sample a small fraction of the images for initial testing. This revealed some bottlenecks - notably, the inference server was slower than it theoretically could be and substantially CPU-hungry - which I was able to partly fix by [hackily rewriting](https://github.com/osmarks/meme-search-engine/blob/master/aitemplate/model.py) the model using [AITemplate](https://github.com/facebookincubator/AITemplate). I had anticipated running close to network bandwidth limits, but with my GPU fully loaded and the inference server improved I only hit 200Mbps down at first; a surprising and more binding limit was the CPU-based image preprocessing code, which I "fixed" by compromising image quality very slightly. I also had to increase a lot of resource limits (file descriptors and local DNS caching) to handle the unreasonable amount of parallel downloads. This more or less worked, but more detailed calculations showed that I'd need a month of runtime and significant additional storage for a full run, and the electricity/SSD costs were nontrivial so the project was shelved.
I spent a day or two [implementing](https://github.com/osmarks/meme-search-engine/blob/master/src/reddit_dump.rs) this, with a mode to randomly sample a small fraction of the images for initial testing. This revealed some bottlenecks - notably, the inference server (for embedding the images) was slower than it theoretically could be and substantially CPU-hungry - which I was able to partly fix by [hackily rewriting](https://github.com/osmarks/meme-search-engine/blob/master/aitemplate/model.py) the model using [AITemplate](https://github.com/facebookincubator/AITemplate). I had anticipated running close to network bandwidth limits, but with my GPU fully loaded and the inference server improved I only hit 200Mbps down at first; a surprising and more binding limit was the CPU-based image preprocessing code, which I "fixed" by compromising image quality very slightly. I also had to increase several resource limits (file descriptors and local DNS caching) to handle the unreasonable quantity of parallel downloads. This more or less worked, but more detailed calculations showed that I'd need a month of runtime and significant additional storage for a full run, and the electricity/SSD costs were nontrivial so the project was shelved.
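In outline, the backpressure structure works like this (a minimal sketch with stand-in stage functions, not the actual `reddit_dump.rs` code): each stage is connected to the next by a small bounded channel, so if the GPU stage lags, sends block and the downloaders slow down automatically.

```rust
use tokio::sync::mpsc;

// Hypothetical stand-ins for the real stages (download/decode, GPU embedding, blob writing).
async fn fetch_image(url: &str) -> Vec<u8> { url.as_bytes().to_vec() }
async fn embed(image: Vec<u8>) -> Vec<f32> { vec![0.0; image.len().min(4)] }
async fn write_record(_embedding: &[f32]) {}

#[tokio::main]
async fn main() {
    let urls: Vec<String> = (0..1000).map(|i| format!("https://example.com/{i}.jpg")).collect();

    // Small bounded channels: when a downstream stage falls behind, `send().await` blocks,
    // so backpressure propagates back to the downloaders and download speed is clamped
    // to whatever the GPU can actually consume.
    let (img_tx, mut img_rx) = mpsc::channel::<Vec<u8>>(256);
    let (emb_tx, mut emb_rx) = mpsc::channel::<Vec<f32>>(256);

    let downloader = tokio::spawn(async move {
        for url in urls {
            let image = fetch_image(&url).await;
            if img_tx.send(image).await.is_err() { break; } // blocks when the channel is full
        }
    });

    let embedder = tokio::spawn(async move {
        while let Some(image) = img_rx.recv().await {
            let embedding = embed(image).await; // the rate-limiting GPU stage
            if emb_tx.send(embedding).await.is_err() { break; }
        }
    });

    // Final stage: append embeddings (plus metadata, in the real thing) to a compressed blob on disk.
    while let Some(embedding) = emb_rx.recv().await {
        write_record(&embedding).await;
    }
    let _ = tokio::join!(downloader, embedder);
}
```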
Recently, some reprioritization, and the fact that I needed a lot of additional storage anyway, resulted in me resurrecting the project from the archives. I had to make a few final tweaks to integrate it with the metrics system, reduce network traffic by making it ignore probably-non-image URLs earlier, log some data I was missing and (slightly) handle links to things like Imgur galleries. After an early issue with miswritten concurrency code reading records in the wrong order, which meant it would not correctly recover from a restart, it ran very smoothly for a few days. There were, however, several unexplained discontinuities in the metrics, as well as some gradual changes over time which resulted in me using far too much CPU time. I had to actually think about optimization.
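The "ignore probably-non-image URLs earlier" filter is conceptually just a cheap string check before any network traffic happens; something along these lines (an illustrative guess, not the actual rules):

```rust
/// Heuristic pre-filter: skip URLs which almost certainly won't resolve to a
/// directly downloadable image, before spending a connection on them.
/// (Illustrative only - the real filter's rules will differ.)
fn probably_image_url(url: &str) -> bool {
    let url = url.to_ascii_lowercase();
    let path = url.split('?').next().unwrap_or("");
    let has_image_ext = [".jpg", ".jpeg", ".png", ".webp", ".gif"]
        .iter()
        .any(|ext| path.ends_with(ext));
    // i.redd.it and i.imgur.com serve images directly; gallery/album pages need separate handling.
    let direct_host = url.contains("//i.redd.it/") || url.contains("//i.imgur.com/");
    (has_image_ext || direct_host) && !url.contains("/gallery/")
}

fn main() {
    assert!(probably_image_url("https://i.redd.it/abc123.jpg"));
    assert!(!probably_image_url("https://imgur.com/gallery/xyz"));
    assert!(!probably_image_url("https://www.youtube.com/watch?v=abcdefghijk"));
}
```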
@@ -38,7 +38,7 @@ The metrics dashboard just after starting it up. The white stripe is due to plac
While not constantly maxing out CPU, it was bad enough to worsen GPU utilization.
:::
There were, conveniently, easy solutions. I reanalyzed some code and realized that I was using an inefficient `msgpack` library in the Python inference server for no particular reason, which was easy to swap out; that having the inference server client code send images as PNGs to reduce network traffic was not necessary for this and was probably using nontrivial CPU time for encode/decode (PNG uses outdated and slow compression); and that a fairly easy [replacement](https://lib.rs/crates/fast_image_resize) for the Rust image resizing code was available with significant speedups. This reduced load enough to keep things functioning stably at slightly less than 100% CPU for a while, but it crept up again later. [Further profiling](/assets/images/meme-search-perf.png) revealed no obvious low-hanging fruit other than [console-subscriber](https://github.com/tokio-rs/console/), a monitoring tool for tokio async code, using ~15% of runtime for no good reason - fixing this and switching again to slightly lower-quality image resizing fixed everything for the remainder of runtime. There was a later spike in network bandwidth which appeared to be due to there being many more large PNGs to download, which I worried would sink the project (or at least make it 20% slower), but this resolved itself after a few hours.
There were, conveniently, easy solutions. I reanalyzed some code and realized that I was using an inefficient `msgpack` (de)serialization library in the Python inference server for no particular reason, which was easy to swap out; that having the inference server client code send images as PNGs to reduce network traffic was not necessary for this and was probably using nontrivial CPU time for encode/decode (PNG uses outdated and slow compression); and that a fairly easy [replacement](https://lib.rs/crates/fast_image_resize) for the Rust image resizing code was available with significant speedups. This reduced load enough to keep things functioning stably at slightly less than 100% CPU for a while, but it crept up again later. [Further profiling](/assets/images/meme-search-perf.png) revealed no obvious low-hanging fruit other than [console-subscriber](https://github.com/tokio-rs/console/), a monitoring tool for `tokio` async code, using ~15% of runtime for no good reason - fixing this and switching again to slightly lower-quality image resizing fixed everything for the remainder of runtime. There was a later spike in network bandwidth which appeared to be due to there being many more large PNGs to download, which I worried would sink the project (or at least make it 20% slower), but this resolved itself after a few hours.
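On the PNG point, the fix is conceptually just changing what the client puts in the request: raw pixels and dimensions rather than a PNG blob, so neither side pays compression CPU. A hypothetical wire format (not the project's actual protocol), serialized with `rmp_serde`:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical request body: raw RGB pixels plus dimensions instead of a PNG blob.
// MessagePack stores the byte string verbatim (just a length header), so serialization
// is essentially a memcpy and the Python side only has to reshape it.
#[derive(Serialize, Deserialize)]
struct EmbedRequest {
    width: u32,
    height: u32,
    #[serde(with = "serde_bytes")]
    rgb_pixels: Vec<u8>, // width * height * 3 bytes
}

fn main() {
    let req = EmbedRequest {
        width: 384,
        height: 384,
        rgb_pixels: vec![0u8; 384 * 384 * 3],
    };
    let encoded = rmp_serde::to_vec_named(&req).expect("serialization failed");
    // ~440kB per image: more bytes in flight than a PNG, but no encode/decode CPU cost.
    println!("payload size: {} bytes", encoded.len());
}
```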
## Indexing
@@ -86,7 +86,7 @@ This all required more time in the data labelling mines and slightly different a
* I still think directly predicting winrates with a single model might be a good idea, but it would have been annoying to do, since I'd still have to train the score model, and I think most of the loss of information occurs elsewhere (rounding off preferences).
* Picking pairs with predicted winrate 0.5 would also pick mostly boring pairs the model is confident in. The variance of predictions across the ensemble is more a measure of meta-uncertainty, which I think is more relevant (a toy version of selection by ensemble disagreement is sketched after this list). I did add some code to have me rate the high-rated samples, though, since I worried that the unfiltered internet data was broadly too bad to make the model learn the high end.
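A toy version of that selection criterion (made-up scores, not the real training code): rank candidate pairs by how much the ensemble members disagree about the winner, rather than by how close the mean predicted winrate is to 0.5.

```rust
// Toy uncertainty-based pair selection: score how much a model ensemble disagrees
// about which of two memes wins, and prefer labelling high-disagreement pairs.
fn disagreement(scores_a: &[f32], scores_b: &[f32]) -> f32 {
    let n = scores_a.len() as f32;
    let prefers_a = scores_a.iter().zip(scores_b).filter(|(a, b)| a > b).count() as f32;
    let p = prefers_a / n;
    // 1.0 when the ensemble splits evenly, 0.0 when it is unanimous.
    1.0 - (2.0 * p - 1.0).abs()
}

fn main() {
    // Two candidate pairs, each scored by a four-model ensemble (made-up numbers).
    let confident = disagreement(&[0.9, 0.8, 0.85, 0.9], &[0.1, 0.2, 0.15, 0.1]);
    let contested = disagreement(&[0.6, 0.4, 0.7, 0.3], &[0.5, 0.5, 0.5, 0.6]);
    assert!(contested > confident); // the contested pair is the more informative one to label
}
```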
It turns out that SigLIP is tasteful enough on its own that I don't need to do that much given a fairly specific query, and the classifier is not that useful - the bitter lesson in action.
It turns out that the SigLIP model is tasteful enough on its own that I don't need to do that much given a fairly specific query, and the classifier is not that useful - the bitter lesson in action.
I previously worked on [SAEs](/memesae/) to improve querying, but this seems to be unnecessary with everything else in place. Training of a bigger one has been completed for general interest - it can be downloaded [here](https://datasets.osmarks.net/big_sae/), along with the pages of the samples most/least in each feature direction. Subjectively, the negative samples seem somewhat more consistent and the features are more specific (I used 262144 features, up from 65536, and about ten times as many embeddings to train, and only did one epoch).
@@ -142,3 +142,5 @@ The meme search master plan.
[^14]: Also, most work doesn't seem interested in testing performance in the ~64B code size regime even though the winning method varies significantly by size.
[^15]: Could it be (usefully) run on GPU instead? I don't *think* so, at least without major tweaks. You'd have less VRAM, so you'd need smaller graph shards, and the algorithm is quite branchy and does many random reads.
[^16]: Embeddings are (roughly) a lossy summary of some input, represented as a large, mostly-opaque vector.