---
title: So You Want A Cheap ML Workstation
description: How to run local AI slightly more cheaply than with a prebuilt system. Somewhat opinionated.
created: 25/02/2024
slug: mlrig
---

## Summary

- Most of your workstation should be like a normal gaming desktop, but with less emphasis on single-threaded performance and more RAM. These are not hard to build yourself.
- Buy recent consumer Nvidia GPUs with lots of VRAM (*not* datacentre or workstation ones).
- Older or used parts are good for cutting costs (but not overly old GPUs).
- Buy a sufficiently capable PSU.

## Long version

Thanks to the osmarks.net crawlers scouring the web for bloggable information[^1], I've found out that many people are interested in having local hardware to run machine learning workloads (by which I refer to GPU-accelerated inference or training of large neural nets: anything else is [not real](http://www.incompleteideas.net/IncIdeas/BitterLesson.html)), but are doing it wrong, or not at all. There are superficially good part choices which are, in actuality, extremely bad for almost anything, and shiny [prebuilt options](https://lambdalabs.com/gpu-workstations/vector-one) which are far more expensive than necessary. In this article, I will outline what to do to get a useful system at somewhat less expense[^2].

## Do not fear hardware (much)

If you mostly touch software, you might be worried about interacting with the physical world, such as by buying and assembling computer hardware. Don't be. Desktop computer hardware is heavily standardized, and assembly of a computer from parts can easily be done in a few hours by anyone with functional fine motor control and a screwdriver (there are many free high-quality guides available). As long as you're not doing anything exotic, part selections can be automatically checked for compatibility [by PCPartPicker](https://pcpartpicker.com/), and many online communities offer free human review. Part selection is also not extremely complicated in the average case, though some knowledge of your workload and basic computer architecture is necessary. I am not, however, going to provide part lists, because these vary with your requirements and with local pricing. You may want to ask [r/buildapc](https://www.reddit.com/r/buildapc/) or similar communities to review your part list.

## GPU choice

The most important decision you will make in your build is your choice of GPU(s) - the GPU will be doing most of your compute, and it generally defines how capable the rest of your components need to be. You can, practically, run at most two on consumer hardware (see [Scaling up](#scaling-up) for more).

### Submit to Jensen

Unless you want to spend lots of your time messing around with drivers, Nvidia is your only practical choice for compute workloads. Optimized kernels[^12] such as [Flash Attention](https://github.com/Dao-AILab/flash-attention) are generally only written for CUDA, hampering effective compute performance on alternatives. AMD make capable GPUs for gaming which go underappreciated by many buyers, and Intel... make GPUs... but AMD does not appear to be taking their compute stack seriously on consumer hardware[^3] and Intel's is merely okay[^4].
AMD's CUDA competitor, ROCm, appears to only be officially supported on the [highest-end cards](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html), and (at least according to [geohot as of a few months ago](https://geohot.github.io/blog/jekyll/update/2023/06/07/a-dive-into-amds-drivers.html)) does not work very reliably even on those. AMD also lacks capable matrix multiplication acceleration, so its GPUs' effective AI compute performance is poor - even the latest RDNA 3 hardware only has [WMMA](https://gpuopen.com/learn/wmma_on_rdna3/), which reuses existing hardware slightly more efficiently, resulting in the top-end RX 7900 XTX being slower than Nvidia's last-generation RTX 3090 in theoretical matrix performance. Intel GPUs have good matrix multiplication accelerators, but their most powerful (consumer) GPU product is not very performant, and the software is problematic - [Triton](https://github.com/intel/intel-xpu-backend-for-triton) and [PyTorch](https://github.com/intel/intel-extension-for-pytorch) are supported, but not all tools will support Intel's integration code, and there is presently an issue with addressing more than 4GB of memory in one allocation, due to their iGPU heritage, which apparently causes many problems.

### Do not buy datacentre cards

Many unwary buyers have fallen for the siren song of increasingly cheap used Nvidia Tesla GPUs, since they offer very large VRAM pools at very low cost. However, these are a bad choice unless you *only* need that VRAM. The popular Tesla K80 is 9 years old, with poor driver support, no FP16, very weak general performance, high power consumption and no modern optimization efforts - and it's not actually one GPU but two on a single card, so you have to deal with parallelizing anything big across GPUs. The next-generation Tesla M40 has similar problems, although it is at least a single GPU, and the P40 is not much different, though instead of *no* FP16 it has *unusably slow* FP16[^14]. Even a Tesla P100 is short on compute performance compared to newer generations, and datacentre cards newer than that are not available cheaply. There's also some complexity with cooling: these cards are designed for a server chassis's airflow and have no fans of their own, unlike a consumer GPU.[^13]
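If you're tempted anyway, or already have one of these cards, it's worth checking its CUDA compute capability before planning around it. Here's a rough sketch assuming PyTorch is installed; the capability reference points in the comments are approximate.

```python
import torch

# Approximate reference points: 3.7 = K80, 5.2 = M40, 6.0 = P100, 6.1 = P40,
# 7.0+ = Volta/Turing (first tensor cores), 8.0+ = Ampere and newer.
for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    major, minor = torch.cuda.get_device_capability(i)
    print(f"{name}: compute capability {major}.{minor}")
    if (major, minor) < (7, 0):
        print("  - no tensor cores (and on many of these, FP16 is missing or unusably slow)")
    elif (major, minor) < (8, 0):
        print("  - has tensor cores, but many current kernels (e.g. FlashAttention 2) want 8.0+")
    else:
        print("  - recent enough for most optimized kernels")
```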
### Do not buy workstation cards

Nvidia has a range of workstation graphics cards. However, they are generally worse than their consumer GPU counterparts in every way except for VRAM capacity, sometimes compactness, and artificially gated features (PCIe P2P and ECC): the prices are drastically higher (the confusingly named RTX 6000 Ada Generation ("6000A") sells for about four times the price of the similar RTX 4090), the memory bandwidth is lower (consumer cards use GDDR6X, which generally offers higher bandwidth, while workstation hardware uses plain GDDR6 for power reasons), and performance in practice is actually worse even when on paper it should be better: the 6000A has an underpowered cooler and aggressively throttles back under high-power loads, resulting in drastically lower performance.[^11]

### Workload characteristics

As you can probably now infer, I recommend using recent consumer hardware, which offers better performance/$. Exactly which consumer hardware to buy depends on your intended workload. There are typically only three relevant metrics (which should be easy to find in spec sheets):

* Memory bandwidth.
* Compute performance (FP16 tensor TFLOP/s).
* VRAM capacity.

VRAM capacity doesn't affect performance until it runs out, at which point you will incur heavy penalties from swapping and/or moving part of your workload to the CPU. Memory bandwidth is generally the limit with large models and small batch sizes (e.g. online LLM inference for chatbots[^5]), and compute is the bottleneck for training and some inference (e.g. Stable Diffusion and some other vision models)[^6]. (There's a rough worked example of this arithmetic after the [Multi-GPU](#multi-gpu) section below.) Within a GPU generation these generally scale together, but between generations bandwidth usually grows more slowly than compute; between Ampere (RTX 3XXX) and Ada Lovelace (RTX 4XXX) it has in some cases gone *down*[^7].

As VRAM effectively upper-bounds practical workloads, it's best to get the cards Nvidia generously deigns to give outsized amounts of VRAM for their compute performance, unless you're sure of what you want to run. This usually means an RTX 3060 (12GB), RTX 3090 or RTX 4090. RTX 3090s are readily available used, far below the official retail prices, and are a good choice if you're mostly concerned with inference, since their memory bandwidth is almost the same as a 4090's; 4090s, however, have over twice as much compute on paper and (in non-memory-bound scenarios) also bear this out in practice.

### Multi-GPU

You can run two graphics cards in a consumer system without any particularly special requirements - just make sure your power supply [can handle it](#power-consumption) and that you get a mainboard with enough spacing between its PCIe slots. Each GPU will run with 8 PCIe lanes, via PCIe bifurcation. Any parallelizable workload which fits onto a single card should run at almost double speed with data parallelism, and larger models can be loaded across both cards via pipeline or tensor parallelism; note that the latter requires a fast interconnect between the GPUs. To spite users[^9], only the RTX 3090 has NVLink, which provides about 50GB/s (each direction) between GPUs[^8], and only workstation GPUs have PCIe P2P enabled, which reduces latency and increases bandwidth when using standard PCIe between two GPUs. However, you can get away without either of these if you don't need more than about 12GB/s (each direction) between GPUs, which I am told you usually don't. Technically, you *can* plug in more GPUs than this (up to 4), but each will get less bandwidth and messing around with riser cables is usually necessary.
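To make the last two sections concrete, here's the promised back-of-envelope sketch, for a hypothetical 7B-parameter LLM at FP16 on one or two RTX 3090s. The model shape, the 936GB/s bandwidth figure and the 12GB/s link figure are assumptions for illustration, not benchmarks.

```python
# Rough, illustrative arithmetic only - the model shape (7B parameters,
# hidden size 4096, 32 layers) and all bandwidth figures are assumptions.

params = 7e9
bytes_per_param = 2                      # FP16 weights

# 1. VRAM: weights plus headroom for activations and the KV cache.
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f}GB - fits in 24GB of VRAM, not in 12GB")

# 2. Memory-bandwidth-bound decoding: at batch size 1, every generated token
# streams all the weights through the GPU once, so tokens/s is at best roughly
# bandwidth / model size. ~936GB/s is a 3090's spec-sheet bandwidth.
bandwidth_gb_s = 936
print(f"decode upper bound: ~{bandwidth_gb_s / weights_gb:.0f} tokens/s")

# 3. Tensor parallelism across two GPUs: roughly two all-reduces of a
# hidden-size-wide FP16 activation per layer per generated token.
hidden, layers = 4096, 32
bytes_per_token = 2 * layers * hidden * bytes_per_param
link_gb_s = 12                           # usable PCIe bandwidth, each direction
print(f"~{bytes_per_token / 1e6:.1f}MB exchanged per token; a {link_gb_s}GB/s "
      f"link keeps up with ~{link_gb_s * 1e9 / bytes_per_token:.0f} tokens/s")
```

The point of the last number is that, for small-batch inference, a plain PCIe link has far more bandwidth than tensor parallelism needs; it's training and large-batch work where the interconnect starts to matter.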
### Power consumption

GPUs are pretty power-hungry. PCPartPicker will make a good estimate of maximum power draw in most cases, but Ampere GPUs can briefly spike far above their rated TDP[^10]. A good PSU may handle these spikes without tripping its overcurrent/overpower protection, but it's safer to just assume that an RTX 3090 has a maximum power draw of 600W and choose a power supply accordingly.

If you're concerned about reducing your power bill, Ada Lovelace GPUs are generally much more efficient than Ampere due to their newer manufacturing process. You can also power-limit your GPU using `nvidia-smi -pl [power limit in watts]` (note that this must be run on each boot in some way - one way is sketched at the end of this post): this does reduce performance, but less than proportionally.

## Other components

Obviously computers contain parts other than the GPU. For the purposes of a pure ML workstation, these don't really matter, as they won't usually be bottlenecks (if you intend to debase your nice GPU by also running *games* and other graphical tasks on it, then you will of course need more powerful ones). Any recent consumer CPU should be more than capable of driving a GPU for running models. For more intensive work involving heavy data preprocessing or compilation, you should prioritize core count over single-threaded performance (e.g. by buying a slightly older-generation, higher-core-count CPU). Any good-quality NVMe SSD is fast enough for almost anything you might want to do with it. Your build will not be very different from a standard gaming computer apart from these minor details, so it's easiest to take a good build for one of those and make the necessary tweaks.

One thing to be concerned about, however, is RAM. If you do anything novel, most of the code you will run will be "research-grade" and will consume far more RAM than it should. To work around this, make sure to buy plenty of RAM (at the very least, more CPU RAM than VRAM) or to use a very big swap file, as this is much more practical than fixing all the code. If possible, buy the biggest single DIMMs (memory modules) you can, as running more or fewer than two sticks will cut your CPU's memory bandwidth - while not as performance-critical as *GPU* memory bandwidth, there's no reason to incur this hit unnecessarily.

Also note that modern GPUs are very big. Make sure that your case supports the length and width of your GPU, as well as the height of your GPU plus its power cables.
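Finally, since the power limit from the [Power consumption](#power-consumption) section has to be reapplied on each boot, here's a minimal sketch of one way to do that. The GPU index and the 250W limit are placeholder assumptions, and `nvidia-smi` needs root privileges to change them.

```python
# Minimal sketch: reapply a GPU power limit at boot. Assumes the Nvidia driver
# is installed, nvidia-smi is on PATH, and this runs as root (hook it into a
# systemd unit, a cron @reboot entry, or similar). Values are placeholders.
import subprocess

GPU_INDEX = "0"
LIMIT_WATTS = "250"   # e.g. down from a 3090's stock 350W

# Persistence mode keeps the driver initialized so settings aren't dropped
# when no process is using the GPU.
subprocess.run(["nvidia-smi", "-i", GPU_INDEX, "-pm", "1"], check=True)
subprocess.run(["nvidia-smi", "-i", GPU_INDEX, "-pl", LIMIT_WATTS], check=True)
```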