DeepSeek section, SAE

commit cc29039444 (parent 3ce96ec7d4)
@@ -2,7 +2,7 @@
title: So you want a cheap ML workstation
description: How to run local AI slightly more cheaply than with a prebuilt system. Somewhat opinionated.
created: 25/02/2024
updated: 14/04/2024
updated: 02/02/2025
slug: mlrig
---
::: emphasis

@@ -13,6 +13,7 @@ slug: mlrig
- Buy recent consumer Nvidia GPUs with lots of VRAM (*not* datacentre or workstation ones).
- Older or used parts are good to cut costs (not overly old GPUs).
- Buy a sufficiently capable PSU.
- For *specifically* big LLM inference, you probably want a server CPU (not a GPU) with lots of memory and memory bandwidth. See [this section](#cpu-inference).

:::

@@ -118,6 +119,14 @@ Apple M1 Ultra | 21 | 819 | 27 | 128 | Apple Silicon has a bizarrely good memory

One forward pass of an LLM with FP16 weights conveniently also requires loading two bytes per weight, so the FLOPS-per-byte ratio above is (approximately; I'm rounding off many, many details here) how many tokens can be processed in parallel without slowdown. Since sampling (generating outputs) is inherently serial, you don't benefit from the available parallelism (except when processing the prompt), so quantization (which reduces memory traffic and slightly increases compute costs) has lots of room to work. In principle the FLOP/byte ratio should be high enough on all of this hardware that performance is directly proportional to memory bandwidth. This does not appear to be true of older GPUs according to [user reports](https://www.reddit.com/r/LocalLLaMA/search?q=p40&restrict_sr=on&sort=relevance&t=all), probably due to overheads I ignored - notably, nobody reports more than about 15 tokens/second. Thus, despite somewhat better software support, CPU inference is usually going to be slower than old-datacentre-GPU inference, but it is at least the best way to get lots of memory capacity.
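
As a rough illustration of that arithmetic (a back-of-envelope sketch only; the 2-FLOPs-per-weight rule of thumb, the FP16 byte count, the 7B example model and the neglect of the KV cache and other overheads are all simplifying assumptions), here is the calculation for the Apple M1 Ultra row of the table above:

```python
# Back-of-envelope estimate of LLM inference limits from hardware specs.
# Figures are the Apple M1 Ultra row from the table above; treat them as approximate.
compute_flops = 21e12   # ~21 TFLOP/s
mem_bandwidth = 819e9   # ~819 GB/s
bytes_per_weight = 2    # FP16 weights
model_params = 7e9      # example: a 7B-parameter model (illustrative choice)

# A forward pass reads every weight once (~2 bytes each) and does ~2 FLOPs per weight
# per token, so the FLOP/byte ratio is roughly how many tokens can share one pass
# of weight loading before compute, rather than bandwidth, becomes the bottleneck.
flop_per_byte = compute_flops / mem_bandwidth
parallel_tokens = flop_per_byte  # ~26 tokens batched "for free"

# Serial decoding processes one token at a time, so it is bandwidth-bound:
tokens_per_second = mem_bandwidth / (model_params * bytes_per_weight)  # ~58 t/s

print(f"~{parallel_tokens:.0f} tokens can be batched before becoming compute-bound")
print(f"upper bound of ~{tokens_per_second:.0f} tokens/s decoding a 7B FP16 model")
```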

#### MoE models

I have seen a noticeable uptick in search traffic to this page recently, which I assume is because of interest in [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) and running it locally. If you're not aware, people often use "R1" to refer to either the original 671-billion-parameter [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3)-based model or the much smaller (<70B) finetunes of other open-source models. These are different in many ways - the former is much smarter but hard to run.

Mixture of Experts (MoE) models are a way to improve the compute efficiency of standard "dense" transformers by using only a subset of the parameters in each forward pass of the model. This is good for large-scale deployments, where one instance of a model is deployed across tens or hundreds of GPUs, because of lower per-token compute costs, but unfortunately quite bad for local users, because you need more total parameters for equivalent performance to a dense model, and thus more VRAM capacity for inference.
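
To make "only a subset of the parameters per forward pass" concrete, here is a minimal sketch of top-k expert routing of the general kind MoE transformers use; the sizes and the 2-of-8 expert choice are illustrative, not any real model's (and certainly not DeepSeek's) configuration:

```python
import numpy as np

# Minimal top-k MoE layer sketch: each token is routed to only k of the n_experts
# feed-forward blocks, so most expert parameters are never touched for that token.
d_model, d_ff, n_experts, k = 64, 256, 8, 2
rng = np.random.default_rng(0)

router = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02, rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]

def moe_forward(x):  # x: (d_model,) activations for a single token
    logits = x @ router
    top = np.argsort(logits)[-k:]                               # the k chosen experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over chosen experts
    out = np.zeros(d_model)
    for w, i in zip(weights, top):
        w_in, w_out = experts[i]
        out += w * (np.maximum(x @ w_in, 0) @ w_out)            # ReLU feed-forward block
    return out

y = moe_forward(rng.standard_normal(d_model))
# Only k/n_experts of the expert weights were read for this token, which is where
# the per-token compute and bandwidth savings come from.
print(y.shape, f"used {k}/{n_experts} experts")
```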

Most open-source language models (LLaMA, etc.) are either small enough to fit on one or two GPUs, or so large (LLaMA-3.1-405B, for instance) that home users could only feasibly fit them in CPU RAM, where they would then experience awful (~1 token per second) speeds. DeepSeek-V3 (and thus -R1), however, is an MoE model - only 37B of the 671B total parameters are needed for each token, so bandwidth requirements are much lower, and with enough RAM capacity it can theoretically run at slow-GPU-system speeds of ~10 tokens per second. You'll need a server platform to fit this much RAM, though, and the RAM itself is quite expensive. See [this article](https://digitalspaceport.com/how-to-run-deepseek-r1-671b-fully-locally-on-2000-epyc-rig/) on a $2000 single-socket AMD EPYC build doing ~4 tokens per second on the quantized model, and [this Twitter thread](https://x.com/carrigmat/status/1884244369907278106) for a $6000 dual-socket build doing ~8 tokens per second on the unquantized model. The latter is significantly below the theoretical limit (~25 t/s), so better software should be able to bring it up, but I don't think it's possible to do significantly better than that without much more expensive hardware.
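
Those throughput figures come from simple bandwidth arithmetic; a sketch, treating the ~900 GB/s dual-socket memory bandwidth and the roughly one byte per active weight for the 8-bit model as approximate assumptions rather than measured values:

```python
# Rough bandwidth-bound throughput estimates for CPU inference (all figures approximate).
# Decoding one token requires streaming every *active* parameter from RAM once.

def tokens_per_second(mem_bandwidth_gb_s, active_params_b, bytes_per_weight):
    """Upper bound on decode speed when memory bandwidth is the bottleneck."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_weight
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token

dual_socket_bw = 900  # assumed ~900 GB/s for a dual-socket DDR5 EPYC system

# DeepSeek-V3/R1: 37B active parameters per token at roughly one byte each.
print(tokens_per_second(dual_socket_bw, 37, 1))   # ~24 t/s - the "theoretical limit" above

# A dense 405B model at FP16 on the same machine, for contrast:
print(tokens_per_second(dual_socket_bw, 405, 2))  # ~1 t/s
```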

### Scaling up

It's possible to have more GPUs without going straight to an expensive "real" GPU server or large workstation and the concomitant costs, but this is very much off the beaten path. Standard consumer platforms do not have enough PCIe lanes for more than two GPUs (reasonably) or four (unreasonably), so <span class="hoverdefn" title="High-End DeskTop">HEDT</span> or server hardware is necessary. HEDT is mostly dead, and new server hardware is increasingly expensive and divergent from desktop platforms, so it's most feasible to buy older server hardware, for which automated compatibility checkers and convenient part choice lists aren't available. The first well-documented build I saw was [this one](https://nonint.com/2022/05/30/my-deep-learning-rig/), which uses 7 GPUs and an AMD EPYC Rome platform (~2019) in an open-frame case designed for miners, although I think [Tinyboxes](https://tinygrad.org/) are intended to be similar. Recently, [this](https://www.mov-axbx.com/wopr/wopr_concept.html) was published, which is roughly the same except for using 4090s and a newer server platform. They propose using server power supplies (but didn't do it themselves), which is a smart idea - I had not considered the fact that you could get adapter boards for their edge connectors. Also see [this](https://battle-blackberry-78e.notion.site/How-to-run-ML-Experiments-for-cheap-b1270491395747458ac6726515b323cc), which recommends using significantly older server hardware - I don't really agree with this due to physical fit/power supply compatibility challenges.

@@ -88,7 +88,7 @@ This all required more time in the data labelling mines and slightly different a

It turns out that SigLIP is tasteful enough on its own that I don't need to do that much given a fairly specific query, and the classifier is not that useful - the bitter lesson in action.

I previously worked on [SAEs](/memesae/) to improve querying, but this seems to be unnecessary with everything else in place. Training of a bigger one is ongoing for general interest.
I previously worked on [SAEs](/memesae/) to improve querying, but this seems to be unnecessary with everything else in place. Training of a bigger one has been completed for general interest - it can be downloaded [here](https://datasets.osmarks.net/big_sae/), along with pages of the samples most and least aligned with each feature direction. Subjectively, the negative samples seem somewhat more consistent and the features are more specific (I used 262144 features, up from 65536, trained on about ten times as many embeddings, and only did one epoch).
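
For reference, here is a minimal sketch of the general kind of sparse autoencoder involved, assuming a standard ReLU encoder with an L1 sparsity penalty over image embeddings; the embedding width, penalty and other details are guesses for illustration, not the actual training setup (only the 262144 feature count is taken from above):

```python
import torch
from torch import nn

class SparseAutoencoder(nn.Module):
    """Learns an overcomplete dictionary of sparse features over embeddings."""
    def __init__(self, d_embed=1152, n_features=262144):
        super().__init__()
        self.encoder = nn.Linear(d_embed, n_features)
        self.decoder = nn.Linear(n_features, d_embed)

    def forward(self, x):
        acts = torch.relu(self.encoder(x))  # sparse, nonnegative feature activations
        return self.decoder(acts), acts

def loss_fn(x, recon, acts, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty pushing most activations to zero.
    return ((recon - x) ** 2).mean() + l1_coeff * acts.abs().mean()

# Usage on a batch of (stand-in) image embeddings; n_features kept small here
# so the example runs cheaply, whereas the real model uses 262144.
sae = SparseAutoencoder(d_embed=1152, n_features=4096)
embeddings = torch.randn(8, 1152)
recon, acts = sae(embeddings)
loss_fn(embeddings, recon, acts).backward()
```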

As ever, logs and data are available from the [datasets server](https://datasets.osmarks.net/projects-formerly-codenamed-radius-tyrian-phase-ii/).

@@ -111,7 +111,7 @@ The meme search master plan.

* Data labelling makes me much more misanthropic. So much of the randomly sampled content used to bootstrap rating is inane advertising, extremely poorly photographed images of games on computer screens or nonsense political memes.
* There is currently no provision for updating the index. It [should be possible](https://arxiv.org/abs/2105.09613) - the algorithms are not terribly complex but there are some slightly tricky engineering considerations.
* I don't actually know how good the recall is because computing ground truth results is very expensive. Oh well.
* The build-time deduplication is still insufficient, so I added a hacky step to do it at query time. I may do another pass and rebuild to fix this.
* The build-time deduplication is still insufficient, so I added a hacky step to do it at query time, which uses entirely too much compute (a sketch of one way such a step can work is below). I may do another pass and rebuild to fix this.

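One plausible form for a query-time deduplication step (a guess at the approach, not the actual implementation): drop any result whose embedding is a near-duplicate of a higher-ranked result, which is quadratic in the number of candidates and therefore expensive at query time:

```python
import numpy as np

def dedupe_results(embeddings, threshold=0.95):
    """Keep results whose embedding is not a near-duplicate of a higher-ranked one.

    embeddings: (n, d) array of candidate result embeddings, sorted by score.
    The pairwise cosine check is O(n^2) in the candidate count, hence
    "entirely too much compute" on large result sets.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i, v in enumerate(normed):
        if all(v @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

results = np.random.randn(100, 1152)  # stand-in for ranked result embeddings
print(dedupe_results(results))
```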
[^1]: Yes, I know we count supercomputer power in FP64 and consumer hardware mostly won't do double-precision. I am ignoring that for the purposes of art.