## Addenda
### CPU inference
While I don't like this myself, you might be interested in slowly running very large language models interactively and nothing else. This is when old datacentre GPUs (still not K80s), or plain CPU inference, might actually be sane. To a first approximation, generating one token requires two FLOPs (one fused multiply-add) per parameter regardless of quantization, plus loading every weight from RAM into cache once (a worked example follows the table). Here is (roughly) the compute and memory bandwidth available with various hardware:
Hardware | TFLOP/s | Bandwidth (GB/s) | Ratio (FLOPS/B) | Capacity (GB) | Notes
---|---|---|---|---|---
Nvidia GeForce RTX 4090 | 165 | 1008 | 163 | 24 | FP16 dense tensor TFLOP/s from spec sheet (FP32 accumulate).
Nvidia GeForce RTX 3090 | 71 | 936 | 75 | 24 | As above.
Nvidia GeForce RTX 3060 (12GB) | 25 | 360 | 70 | 12 | As above.
Nvidia Tesla K80 (one GPU) | 4 | 240 | 16 | 12 | Each Tesla K80 card contains two individual GPU chips. They do not have FP16, so I'm using FP32 numbers.
Nvidia Tesla M40 | 7 | 288 | 24 | 24 | Still no FP16, but only one GPU per card. It has less aggregate bandwidth than a whole K80 card as a result.
Nvidia Tesla P40 | 12 | 347 | 34 | 24 | It has hardware FP16, but it's crippled, so I use FP32 figures.
AMD Ryzen 9 7950X | 2.5 | 83 | 30 | <=192 | TFLOP/s estimated from [AVX-512 figures here](https://www.mersenneforum.org/showthread.php?p=614191). Bandwidth is theoretical, assuming DDR5-5200 dual-channel (I think in practice Infinity Fabric links bottleneck this). Using four DIMMs will reduce rated RAM speed a lot.
AMD Ryzen 7 7700X | 1.3 | 83 | 16 | <=192 | Basically half a 7950X in terms of compute.
Intel Core i9-14900K | 2.5 | 90 | 27 | <=192 | No AVX-512, but the [same amount](https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove/) of floating point execution capacity as AMD on P-cores, I think. Each E-core ([Gracemont](https://chipsandcheese.com/2021/12/21/gracemont-revenge-of-the-atom-cores/)) provides half as much per cycle. I am assuming maximum turbo frequencies on all cores at once. Rated memory bandwidth is slightly higher than AMD's (on DDR5).
Intel Core i5-14600K | 1.5 | 90 | 16 | <=192 | As above.
Intel Xeon Platinum 8280 | 4.8 | 141 | 34 | <=1024 | Just for fun (these, and boards for them, are hard to get, though easier/cheaper than modern server CPUs). Compute is overestimated as these downclock badly in heavy AVX-512 loads.
Apple M1 Ultra | 21 | 819 | 27 | 128 | Apple Silicon has a bizarrely good memory subsystem. I'm counting its GPU TFLOP/s here.
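To make that arithmetic concrete, here is a minimal sketch of the cost model above. The 70B-parameter model at ~4.5 bits per weight and the 7950X figures are just illustrative values pulled from the table; real-world overheads (KV cache, activations, software inefficiency) will push the numbers down.

```python
# Back-of-the-envelope single-stream generation speed: ~2 FLOPs per parameter
# per token, and every weight streamed from RAM/VRAM once per token.
def tokens_per_second(n_params, bytes_per_weight, tflops, bandwidth_gb_s):
    compute_s = 2 * n_params / (tflops * 1e12)                       # time spent on FLOPs
    memory_s = n_params * bytes_per_weight / (bandwidth_gb_s * 1e9)  # time streaming weights
    return 1 / max(compute_s, memory_s)                              # the slower term dominates

# Example: 70B parameters at ~4.5 bits/weight on a Ryzen 9 7950X (2.5 TFLOP/s, 83 GB/s).
print(f"{tokens_per_second(70e9, 4.5 / 8, 2.5, 83):.1f} tokens/s")   # ~2.1, memory-bound
```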
One forward pass of an LLM with FP16 weights conveniently also requires loading two bytes per weight, so the FLOPS-per-byte ratio above is (approximately; I'm rounding off many, many details here) how many tokens can be processed in parallel without slowdown. Since sampling (generating outputs) is inherently serial, you don't benefit from that possible parallelism (except when processing the prompt), so quantization (which reduces memory traffic and slightly increases compute costs) has lots of room to work. In principle, the FLOP/byte ratio is high enough on all of this hardware that generation performance should be directly proportional to memory bandwidth. This does not appear to be true with older GPUs according to [user reports](https://www.reddit.com/r/LocalLLaMA/search?q=p40&restrict_sr=on&sort=relevance&t=all), probably due to overheads I ignored - notably, nobody reports more than about 15 tokens/second. Thus, despite somewhat better software support, CPU inference is usually going to be slower than old-datacentre-GPU inference, but it is at least the best way to get lots of memory capacity.
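As a sanity check on that claim, the batch size at which compute catches up with weight streaming works out to exactly the ratio column: with FP16 weights you load two bytes per weight and spend two FLOPs per weight per token in the batch. A tiny sketch, using hardware numbers from the table and ignoring everything else:

```python
# Critical batch size for FP16 weights: compute time (2 * params * batch / FLOPS)
# equals weight-streaming time (2 * params / bandwidth) when batch = FLOPS / bandwidth.
def critical_batch(tflops, bandwidth_gb_s):
    return (tflops * 1e12) / (bandwidth_gb_s * 1e9)

print(round(critical_batch(165, 1008)))  # RTX 4090: ~164 tokens in parallel "for free"
print(round(critical_batch(2.5, 83)))    # Ryzen 9 7950X: ~30
```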
### Scaling up
It's possible to have more GPUs without going straight to an expensive "real" GPU server or large workstation and the concomitant costs, but this is very much off the beaten path. Standard consumer platforms do not have enough PCIe lanes for more than two (reasonably) or four (unreasonably) GPUs, so HEDT or server hardware is necessary. HEDT is mostly dead, and new server hardware is increasingly expensive and divergent from desktop platforms, so it's most feasible to buy older server hardware, for which automated compatibility checkers and convenient part choice lists aren't available. The first well-documented build I saw was [this one](https://nonint.com/2022/05/30/my-deep-learning-rig/), which uses 7 GPUs and an AMD EPYC Rome platform (~2019) in an open-frame case designed for miners, although I think [Tinyboxes](https://tinygrad.org/) are intended to be similar. More recently, [this build](https://www.mov-axbx.com/wopr/wopr_concept.html) was published, which is roughly the same except for using 4090s and a newer server platform. They propose using server power supplies (but didn't do it themselves), which is a smart idea - I had not considered the fact that you can get adapter boards for their edge connectors.
They describe somewhat horrifying electrical engineering problems due to using several power supplies together, and custom cooling modifications. While doable, all this requires much more expertise than just assembling a standard desktop from a normal part list. Your other option is to take an entire old server and install GPUs in it, but most are not designed for consumer GPUs and will not easily fit or power them. I've also been told that some of them have inflexible firmware and might have issues running unexpected PCIe cards or different fan configurations.
[^1]: Not really.
[^2]: High-performance compute hardware is still not cheap in an absolute sense, and for infrequent loads you are likely better off with [cloud services](https://vast.ai/).
[^3]: I'm told it works fine on their latest datacentre cards. You are not getting those. You aren't even renting those, for some reason.
[^4]: Intel's is arguably better on consumer hardware than datacentre, as their datacentre hardware doesn't work.
[^5]: Especially since most LLM quantization schemes dequantize the weights to FP16 before doing the matrix multiplications, saving no compute but lots of bandwidth and VRAM.
[^6]: Tim Dettmers has a good [technical explanation](https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/) of this, though it is somewhat dated: many of its specific recommendations are outdated, Nvidia is now known to artificially limit FP16 tensor performance with FP32 accumulation on both Ada Lovelace and Ampere, and the structured sparsity feature has not seen any real adoption.
[^7]: Compare the RTX 3060 and RTX 4060, for instance. The latter is still faster for gaming because larger caches compensate for this and higher clocks provide more compute.
[^8]: I don't know the theoretical link rate, but it's [benchmarked here](https://www.boston.co.uk/blog/2021/03/09/boston-labs-tests-nvidia-nvlink.aspx).
[^9]: The AD102 chip in the RTX 4090 even appears to have had NVLink removed late in development (see the blank areas around the perimeter): ![AD102 die shot by Fritzchens Fritz](/assets/images/ad102.jpg) (image source: Fritzchens Fritz)