1
0
mirror of https://github.com/osmarks/website synced 2025-09-04 19:37:56 +00:00
This commit is contained in:
osmarks
2025-01-28 14:43:48 +00:00
parent f583c2fd9a
commit 3ce96ec7d4
3 changed files with 3 additions and 3 deletions

View File

@@ -26,7 +26,7 @@ The [unsanctioned datasets distributed via BitTorrent](https://academictorrents.
This may be unintuitive, since "all the images" was, based on my early estimates, about 250 million. Assuming a (slightly pessimistic) 1MB per image, I certainly don't have 250TB of storage. Usable thumbnails would occupy perhaps 50kB each with the best available compression, which would have been very costly to apply, but 12TB is still more than I have free. The trick is that it wasn't necessary to store any of that[^4]: to do search, only the embedding vectors, occupying about 2kB each, are needed (as well as some metadata for practicality). Prior work like [img2dataset](https://github.com/rom1504/img2dataset) retained resized images for later embedding: I avoided this by implementing the entire system as a monolithic minimal-buffering pipeline going straight from URLs to image buffers to embeddings to a very large compressed blob on disk, with backpressure to clamp download speed to the rate necessary to feed the GPU.
I spent a day or two [implementing](https://github.com/osmarks/meme-search-engine/blob/master/src/reddit_dump.rs) this, with a mode to randomly sample a small fraction of the images for initial testing. This revealed some bottlenecks - notably, the inference server was slower than it theoretically could be and substantially CPU-hungry - which I was able to partly fix by [hackily rewriting](https://github.com/osmarks/meme-search-engine/blob/master/aitemplate/model.py) the model using [AITemplate](https://github.com/facebookincubator/AITemplate). I had anticipated running close to network bandwidth limits, but with my GPU fully loaded and the inference server improved I only hit 200Mbps down at first; a surprising and more binding limit was the CPU-based image preprocessing code, which I "fixed" by compromising image quality very slightly. I also had to increase a lot of resource limits (file descriptors and local DNS caching) to handle the unreasonable amount of parallel downloads. This more or less worked, but more detailed calculations showed that I'd need a month of runtme and significant additional storage for a full run, and the electricity/SSD costs were nontrivial so the project was shelved.
I spent a day or two [implementing](https://github.com/osmarks/meme-search-engine/blob/master/src/reddit_dump.rs) this, with a mode to randomly sample a small fraction of the images for initial testing. This revealed some bottlenecks - notably, the inference server was slower than it theoretically could be and substantially CPU-hungry - which I was able to partly fix by [hackily rewriting](https://github.com/osmarks/meme-search-engine/blob/master/aitemplate/model.py) the model using [AITemplate](https://github.com/facebookincubator/AITemplate). I had anticipated running close to network bandwidth limits, but with my GPU fully loaded and the inference server improved I only hit 200Mbps down at first; a surprising and more binding limit was the CPU-based image preprocessing code, which I "fixed" by compromising image quality very slightly. I also had to increase a lot of resource limits (file descriptors and local DNS caching) to handle the unreasonable amount of parallel downloads. This more or less worked, but more detailed calculations showed that I'd need a month of runtime and significant additional storage for a full run, and the electricity/SSD costs were nontrivial so the project was shelved.
Recently, some reprioritization and requiring a lot of additional storage anyway resulted in me resurrecting the project from the archives. I had to make a few final tweaks to integrate it with the metrics system, reduce network traffic by making it ignore probably-non-image URLs earlier, log some data I was missing and (slightly) handle links to things like Imgur galleries. After an early issue with miswritten concurrency code leading to records being read in the wrong order such that it would not correctly recover from a restart, it ran very smoothly for a few days. There were, however, several unexplained discontinuities in the metrics, as well as some gradual changes over time which resulted in me using far too much CPU time. I had to actually think about optimization.