From a3574674d0d53df79d1924c98e3701cbd050d931 Mon Sep 17 00:00:00 2001
From: osmarks
Date: Sat, 27 Apr 2024 17:33:24 +0100
Subject: [PATCH] "documentation"

---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index 129510e..b74299d 100644
--- a/README.md
+++ b/README.md
@@ -44,6 +44,20 @@ This is untested. It might work.
 * `backend_url` is the URL `mse.py` is exposed on (trailing slash probably optional).
 * If you want, configure Prometheus to monitor `mse.py` and `clip_server.py`.
 
+## MemeThresher
+
+See [here](https://osmarks.net/memethresher/) for information on MemeThresher, the new automatic meme acquisition/rating system (under `meme-rater`). Deploying it yourself is anticipated to be somewhat tricky, but should be roughly doable:
+
+1. Edit `crawler.py` with your own meme source and run it to collect an initial dataset.
+2. Run `mse.py` with a config file like the provided one to index it.
+3. Use `rater_server.py` to collect an initial dataset of rated pairs.
+4. Copy the dataset to a server with a GPU and use `train.py` to train a model. You might need to adjust the hyperparameters, since I have no idea which ones are good.
+5. Use `active_learning.py` on the best available checkpoint to select new pairs to rate.
+6. Use `copy_into_queue.py` to copy the new pairs into the `rater_server.py` queue.
+7. Rate the resulting pairs.
+8. Repeat steps 4 through 7 until you feel good enough about your model.
+9. Deploy `library_processing_server.py` and schedule `meme_pipeline.py` to run periodically.
+
 ## Scaling
 
 Meme Search Engine uses an in-memory FAISS index to hold its embedding vectors, because I was lazy and it works fine (~100MB total RAM used for my 8000 memes). If you want to store significantly more than that, you will have to switch to a more efficient/compact index (see [here](https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index)). As vector indices are held exclusively in memory, you will need to either persist them to disk or use ones which are fast to build/remove from/add to (presumably PCA/PQ indices). At some point, if you increase total traffic, the CLIP model may also become a bottleneck, as I have no batching strategy. Indexing is currently GPU-bound, since the new model appears somewhat slower at high batch sizes and I improved the image loading pipeline. You may also want to scale down displayed memes to cut bandwidth needs.
\ No newline at end of file
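
The pairwise training in step 4 is, at its core, a preference model over CLIP embeddings. Below is a minimal sketch, assuming a Bradley-Terry-style objective, a small MLP rater over precomputed embeddings, and pairs stored with the preferred meme first; the class names, dimensions, and hyperparameters are illustrative, not the actual `train.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBEDDING_DIM = 1152  # assumption: whatever your CLIP variant outputs

class Rater(nn.Module):
    """Small MLP mapping a CLIP embedding to a scalar meme score."""
    def __init__(self, dim=EMBEDDING_DIM, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, emb):
        return self.net(emb).squeeze(-1)  # (batch, dim) -> (batch,)

def pair_loss(model, emb_win, emb_lose):
    # Bradley-Terry/logistic loss: the preferred meme should score higher.
    return -F.logsigmoid(model(emb_win) - model(emb_lose)).mean()

model = Rater()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr is a guess; tune it

# Stand-in for a DataLoader over your rated pairs (preferred meme first).
pairs = [(torch.randn(32, EMBEDDING_DIM), torch.randn(32, EMBEDDING_DIM)) for _ in range(10)]
for emb_win, emb_lose in pairs:
    opt.zero_grad()
    loss = pair_loss(model, emb_win, emb_lose)
    loss.backward()
    opt.step()
```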
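For step 5, one plausible criterion (an assumption, not necessarily what `active_learning.py` actually does) is uncertainty sampling: propose candidate pairs whose scores under the current model are closest, since those are the ratings the model can learn the most from.

```python
import torch

@torch.no_grad()
def select_uncertain_pairs(model, embeddings, n_pairs=100, n_candidates=10_000):
    """Reusing the Rater sketch above: score everything once, then keep the
    candidate pairs with the smallest score gap (most ambiguous preference)."""
    scores = model(embeddings)                               # (N,) scalar ratings
    idx = torch.randint(len(embeddings), (n_candidates, 2))  # random candidate pairs
    gaps = (scores[idx[:, 0]] - scores[idx[:, 1]]).abs()
    keep = gaps.argsort()[:n_pairs]
    return idx[keep]                                         # (n_pairs, 2) indices to rate next
```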
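On the scaling paragraph: moving off the flat in-memory index mostly means picking a trained, compressed FAISS index per the linked guidelines. A sketch using IVF+PQ follows; all parameters here are assumptions to retune for your corpus size, and the embeddings are L2-normalized so that L2 ranking matches cosine similarity.

```python
import numpy as np
import faiss

dim = 1152                                           # assumption: your embedding dimensionality
xb = np.random.rand(100_000, dim).astype("float32")  # stand-in for real embeddings
faiss.normalize_L2(xb)                               # normalized: L2 ranking matches cosine

quantizer = faiss.IndexFlatL2(dim)                   # coarse quantizer for the IVF lists
index = faiss.IndexIVFPQ(quantizer, dim, 1024, 64, 8)  # nlist=1024, 64 subvectors x 8 bits
index.train(xb)                                      # IVF/PQ indices must be trained before use
index.add(xb)
index.nprobe = 32                                    # search-time recall/speed knob

distances, ids = index.search(xb[:5], 10)            # top-10 neighbours for 5 queries
faiss.write_index(index, "memes.index")              # persist to disk, per the paragraph above
```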