---
title: DIYPC and server bifurcation
description: The many technologies modern servers have which your desktop doesn't.
slug: diypc
created: 27/05/2025
tags:
- hardware
- opinion
---

::: epigraph attribution="James Mickens" link=https://scholar.harvard.edu/files/mickens/files/theslowwinter.pdf Indeed, who wouldn't want an octavo-core machine with 73 virtual hyper-threads per physical processor? [...] John's massive parallelism strategy assumed that lay people use their computers to simulate hurricanes, decode monkey genomes, and otherwise multiply vast, unfathomably dimensioned matrices in a desperate attempt to unlock eigenvectors whose desolate grandeur could only be imagined by Edgar Allen Poe. :::

For much of computing history, "client" and "server" computers were quite different. Mainframes had their dumb terminals, and even once microcomputers took over, the proliferation of standards led to some architectures being positioned for servers and high-end workstations and others for personal computers. In the 2000s, returns on scale and dominance in personal computers led to x86 winning in servers. For a brief, glorious, approximately-10-year period, ending roughly in 2015 when Intel's 10nm stumbles[^1] broke their roadmaps, mainstream servers and personal computers were essentially the same except for some segmentation on core count, IO, manageability and reliability features like ECC memory. With the end of cheap transistor scaling[^2], "hyperscaler" companies like Google and Facebook seeking cost optimizations, and increasing compute demands, both commodity and specialized servers are now very different.

## Efficient end-to-end power delivery

Most commodity servers, and desktop systems based on ATX, contain a power supply which takes 110V/230V AC power from the "wall" and produces 12V/5V/3.3V DC outputs. This is only for historical reasons: changes in the voltage requirements of modern electronics mean that the vast majority of power draw is on 12V, and goes to VRMs which reduce voltages to around 1V to supply the digital logic in CPUs and GPUs. There is a standard (ATX12VO) which simplifies desktop PSUs and improves their efficiency by switching to 12V outputs only, but inertia means it has not been widely adopted.

When you're buying hundreds of thousands of servers at once, it is easier to change things. Open Compute Project (a project standardizing hyperscaler-friendly server infrastructure) racks replace power distribution with rack-level "power shelves" which convert AC to 12V/48V DC centrally and distribute it via a busbar. According to OCP documentation, 48V conversion is more efficient than 12V by several percentage points - important at scale - and distribution of course has lower resistive losses[^3]. 48V was probably chosen since it is nearly the standard limit of "safe" low voltage and for consistency with -48V telecoms equipment. This also provides reliability improvements (more redundancy for the power supplies) and easier battery backups.

::: captioned src=/assets/images/molex-orv3.jpg An Open Rack V3 busbar, via Molex. :::
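
A quick back-of-envelope calculation makes the resistive-loss argument concrete. This is purely illustrative - the 10kW load and 1mΩ path resistance below are assumptions of mine, not figures from any OCP specification - but it shows why quadrupling the voltage cuts distribution losses by a factor of sixteen:

```python
# I^2*R loss comparison for 12V vs 48V rack-level distribution.
# The load and path resistance are illustrative assumptions only.

def distribution_loss(power_w: float, voltage_v: float, resistance_ohm: float) -> float:
    """Resistive loss in the distribution path for a given load power and voltage."""
    current_a = power_w / voltage_v          # I = P / V
    return current_a ** 2 * resistance_ohm   # P_loss = I^2 * R

LOAD_W = 10_000   # assumed load on one busbar section
PATH_R = 0.001    # assumed end-to-end distribution resistance, ohms

for volts in (12, 48):
    loss = distribution_loss(LOAD_W, volts, PATH_R)
    print(f"{volts:>2}V: {LOAD_W / volts:7.1f}A, {loss:6.1f}W lost ({loss / LOAD_W:.2%})")
```

The same power at four times the voltage means a quarter of the current, and losses scale with the square of current, hence the jump straight from 12V to 48V rather than something incremental.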

## Cabled PCIe

The PCIe interconnect used for almost every high-speed link inside a modern computer (excepting RAM, cross-socket interconnect and inter-GPU connectivity) has roughly doubled in throughput every generation to keep up with demand. It's done this by using more complicated (less noise-resistant) modulation and higher signalling rates, both of which lead to suffering for electrical engineers tasked with routing the signals. PCIe signals used to be run cheaply over PCBs on mainboards, in riser cards and for drive backplanes, but now they frequently need concerning quantities of expensive retimer chips, or the cables now standardized as, for some reason, CopprLink.

::: captioned src=/assets/images/GENOAD8X-2TBCM.jpg The ASRock Rack GENOAD8X-2T/BCM motherboard, with several retimer chips (yellow) and MCIO connectors (green). :::
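
The doubling is visible directly in the per-lane rates. The sketch below uses the publicly documented transfer rates and line codings and ignores packet/FLIT overhead, so treat the numbers as approximate upper bounds:

```python
# Approximate usable per-lane PCIe bandwidth: transfer rate times line-coding
# efficiency. Packet/FLIT/FEC overheads are ignored, so real throughput is lower.

GENERATIONS = [
    # (generation, GT/s per lane, line-coding efficiency)
    ("1.0", 2.5,  8 / 10),     # 8b/10b
    ("2.0", 5.0,  8 / 10),     # 8b/10b
    ("3.0", 8.0,  128 / 130),  # 128b/130b
    ("4.0", 16.0, 128 / 130),
    ("5.0", 32.0, 128 / 130),
    ("6.0", 64.0, 1.0),        # PAM4, FLIT mode; FLIT/FEC overhead not modelled
]

for gen, gt_s, efficiency in GENERATIONS:
    gbytes_s = gt_s * efficiency / 8  # GT/s -> GB/s per lane, per direction
    print(f"PCIe {gen}: ~{gbytes_s:.2f} GB/s per lane per direction")
```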

Relatedly, power and interconnectivity difficulties with PCIe cards have led to the OAM standard[^4], which puts GPUs or GPU-likes onto separate "baseboards" with 48V power inputs and built-in cross-accelerator connections. These are cabled to the rest of the system. I think something like this is sorely necessary in DIY computers, where GPUs fit the PCIe card form factor badly for almost exactly the same reasons, but nobody has been able to coordinate the change.

## Fast networking

Consumers have mostly been stuck with Gigabit Ethernet for decades[^5]. 10 Gigabit Ethernet became available soon afterward but lacks consumer adoption: I think this is because of a combination of expensive and power-hungry initial products, more stringent cabling requirements, and lack of demand (because internet connections were slow and LANs became less important). A decade later, the more modest 2.5GbE and 5GbE standards were released, designed for cheaper implementations and cables[^6]. These have slowly made their way into desktop motherboards and higher-end consumer networking equipment[^7].

Servers, being in environments where fast core networks and backhaul are more easily available, and having more use for high throughput, moved more rapidly to 10, 40, 100(/25)[^8], 200, 400 and 800Gbps, with 1.6Tbps Ethernet now being standardized. The highest speeds are mostly for AI and specialized network applications, since most code is bad and is limited by some combination of CPU time and external API latency. Optimized code which does not do much work per request can handle millions of HTTP requests a second on 28 outdated cores, and with kernel bypass and hardware-offloaded cryptography DPDK can push 100 million: most software is not like that, and has to do more work per byte sent or received.
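
Some arithmetic on packet rates shows why: at the higher line rates the per-packet budget on a single core collapses to a few nanoseconds, less than one trip to DRAM. The frame size and clock speed below are illustrative assumptions:

```python
# Per-packet time and cycle budget at various Ethernet line rates, assuming
# minimum-size 64-byte frames plus 20 bytes of preamble/inter-frame gap and
# a single 3GHz core. Illustrative only.

FRAME_BYTES = 64 + 20   # minimum frame + preamble + inter-frame gap
CORE_GHZ = 3.0          # assumed clock of one core

for gbps in (10, 100, 400, 800):
    packets_per_s = gbps * 1e9 / 8 / FRAME_BYTES
    ns_per_packet = 1e9 / packets_per_s
    cycles = ns_per_packet * CORE_GHZ
    print(f"{gbps:>3}GbE: {packets_per_s / 1e6:7.1f} Mpps, "
          f"{ns_per_packet:5.1f} ns/packet, ~{cycles:4.0f} cycles/packet on one core")
```

Real traffic uses larger packets, but the point stands: the top speeds only help if the work per byte is tiny or offloaded.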

Energy per bit transferred is falling more slowly than data rates are rising, so high-performance switches are having to move to co-packaged optics and similar technologies.

## Expanded hardware acceleration

Many common workloads - compression, cryptographic key exchanges and now matrix multiplication - can run much faster and more cheaply on dedicated hardware than on general-purpose CPU cores. Many years ago, Intel released QAT, which initially sped up cryptography in cheap networking appliances using its CPUs, and which has been expanded and rolled out inconsistently since then. As of "Sapphire Rapids", their 2022/23 generation, these accelerators were finally brought to (most) mainstream server CPUs[^9], along with new capabilities - DLB, which provides hardware queue management for networking, and AMX, which multiplies matrices. By my estimations, recent parts with AMX are performance-competitive with recent consumer GPUs or old datacentre GPUs.
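
On Linux you can check whether a given machine exposes some of this via its CPU flags. A minimal sketch, under a couple of assumptions: the flag names are the ones current kernels report for recent Intel parts, and QAT and DLB are separate PCIe devices (visible with lspci) rather than CPU flags, so they don't appear here:

```python
# Report a few acceleration-related CPU feature flags from /proc/cpuinfo.
# Linux-only sketch; QAT/DLB are PCIe devices and won't show up here.

FLAGS_OF_INTEREST = {
    "amx_tile":    "AMX tile registers (matrix multiplication)",
    "amx_bf16":    "AMX bfloat16 matrix multiply",
    "amx_int8":    "AMX int8 matrix multiply",
    "avx512_vnni": "AVX-512 Vector Neural Network Instructions",
    "vaes":        "vectorized AES",
    "sha_ni":      "SHA extensions",
}

present = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            present = set(line.split(":", 1)[1].split())
            break

for flag, description in FLAGS_OF_INTEREST.items():
    print(f"{flag:<12} {'yes' if flag in present else 'no':<4} {description}")
```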

The closest things made available to consumers are in networking, as the most common and most readily accelerated high-throughput area around. Almost every network card can checksum packets, and assemble and disassemble sequences of them, in hardware, and cheap "routers" rely on hardware-offloaded NAT. Servers have gone much further: they now regularly use DPUs, full multicore Linux-based computers on a PCIe card with programmable routing/switching/packet-processing hardware in the data path. This was pioneered by cloud providers wanting to move management features off the CPU cores they rent out. Even simpler NICs can offload stateful firewalling and several remote storage and tunnelling protocols.
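
You can see how much of this an ordinary NIC already does with `ethtool -k`, which lists the offload features a driver exposes and whether they're enabled. A small wrapper, assuming a Linux machine with ethtool installed and an interface actually called eth0:

```python
# Print the offload features a NIC reports as enabled, via `ethtool -k`.
# Assumes Linux, ethtool installed, and that the interface name is correct.
import subprocess

INTERFACE = "eth0"  # change to your NIC's name

output = subprocess.run(
    ["ethtool", "-k", INTERFACE], capture_output=True, text=True, check=True
).stdout

for line in output.splitlines()[1:]:            # first line is a header
    feature, sep, state = line.partition(":")
    if sep and state.strip().startswith("on"):  # "on" or "on [fixed]"
        print(feature.strip())
```

Even cheap NICs will usually list checksumming and segmentation offloads; the DPU-class features are configured through different, largely vendor-specific interfaces.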

## Power density

People complain about the RTX 5090 having 600W of rated power draw and the "inefficiency" of modern client CPUs, but power density in servers has similarly been trending upwards. At the top end, Nvidia is pushing increasingly deranged 600kW racks, equivalent to roughly half the power draw of a small legacy datacentre, but we see a rough exponential trend in mainstream dual-socket CPUs, which now have maximum TDPs you would struggle to run your desktop at[^10]. Desktop chassis are roomy and permit large, quiet cooling systems: most servers are one or two rack units (1.75 inches each) tall, so they've historically used terrifying 10k-RPM fans which can use as much as 10% of a server's power budget. To mitigate this, high-performance systems are moving to liquid cooling. Unlike enthusiast liquid cooling systems, which exist to dump heat from power-dense CPUs into the probably-cool-enough air quickly, datacentres use liquid cooling to manage temperatures at the scale of racks and above, and might have facility-level water cooling.

::: captioned src=/assets/images/supermicro_water_cooling.jpg A SuperMicro GPU server with direct-to-chip liquid cooling, via ServeTheHome. Unlike consumer liquid cooling, this is designed for serviceability, with special quick-disconnect fittings. :::
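
To put the density gap in perspective, here is the same comparison in watts per rack unit. Apart from the 600kW figure above, every number here is a rough assumption of mine (including the standard 42U rack height and treating a tower case as roughly 4U of volume):

```python
# Rough power density per rack unit. All per-system figures are illustrative
# assumptions except the 600kW rack mentioned in the text.

systems = [
    # (name, assumed power draw in watts, height in rack units or equivalent)
    ("gaming desktop (tower, ~4U of volume)", 800,     4),
    ("dual-socket 1U server",                 1_000,   1),
    ("600kW accelerator rack",                600_000, 42),
]

for name, watts, units in systems:
    print(f"{name:<40} ~{watts / units:>8,.0f} W per rack unit")
```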

## Disaggregation

Even as individual servers grow more powerful, there is demand for pulling hardware out of them and sharing it between them to optimize utilization. This is an old idea for bulk storage (storage area networks), although there are some new ideas like directly Ethernet-connected SSDs. With the increased bandwidth of PCIe, and RAM making up an increasing fraction of server costs (about half for Azure), modern servers now have the CXL protocol for adding extra memory over PCIe (physical-layer) links. This is most important for cloud providers[^11], who deal with many VMs at once which may not perfectly fill the server they are on, and which need all the memory they're paying for to be "available" even though they may not actively use much of it at any one time. This creates inconsistent memory latency, but servers already had to deal with that - even single-socket servers now have multiple NUMA nodes because of their use of chiplets.

::: captioned src=/assets/images/cxl_memory_expander.jpg A CXL memory expander which can use older DDR4 DIMMs, via ServeTheHome. :::
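
To the operating system, CXL-attached memory typically appears as a memory-only NUMA node, i.e. one with capacity but no CPUs. A minimal Linux sketch that lists nodes this way - a CPU-less node is only a hint, not proof, that the memory behind it is CXL-attached:

```python
# List NUMA nodes from sysfs and flag memory-only (CPU-less) ones, which is
# how CXL memory expanders typically show up on Linux.
from pathlib import Path

for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
    cpulist = (node / "cpulist").read_text().strip()
    meminfo = (node / "meminfo").read_text()
    # The MemTotal line looks like: "Node 0 MemTotal:  131072000 kB"
    total_kb = int(next(l for l in meminfo.splitlines() if "MemTotal" in l).split()[3])
    kind = "memory-only (possibly CXL)" if not cpulist else f"CPUs {cpulist}"
    print(f"{node.name}: {total_kb / 2**20:.1f} GiB, {kind}")
```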

## Conclusion

The main consequences of this are:

- Somewhat less ability to transfer learning from personal homelabs to datacentres - it's much easier to run the fanciest datacentre software stacks at home than exotic expensive hardware which only makes sense at large scale.
- Yet more difficulty in getting the highest performance out of computers (through more complex memory topologies, more parallelism and the requirement to use more offloads).
- Increased advantages to consolidation in hosting (through more ability to use disaggregation technology and amortization of fixed costs of using more difficult technologies).

[^1]: This is a long story which is mostly not publicly known, but you can read about some of what happened here and here.

[^2]: Arguably the "end of Moore's law", but Moore's law is about leading-edge density and not costs. As far as I know, cost per transistor has plateaued or worsened recently, and we don't see the same rapid migration of volume to new processes we used to.

[^3]: With some people's fear of "melting" 12VHPWR connectors, this could be a major selling point if 48V ever makes it to consumer products.

[^4]: Nvidia GPUs ship in their own incompatible SXM form factors, of course.

[^5]: It was released in 1999, so it's now retro.

[^6]: Because of this timing, 10GbE devices may or may not be able to negotiate down to 2.5GbE or 5GbE. This is often not documented clearly, to provide excitement and chance to users.

[^7]: Adoption in consumer systems seems to track Realtek's product lineup, as apparently nobody else is competently trying. We began to see much 5GbE adoption only after the RTL8126 offered it cheaply. They have a 10GbE product now which will perhaps finally drive use.

[^8]: 40Gbps Ethernet is something of a technological dead end: it's four 10Gbps channels bonded together, and soon after it was widely available it became practical to upgrade each channel to 25Gbps for a total of 100Gbps, or use a single 25Gbps channel for a cheaper roughly-as-good link.

[^9]: Intel management being, presumably, insane, they still market-segment these despite them being one of few advantages they have over AMD.

[^10]: Desktop CPUs are still less efficient in normal operation, though - they clock higher on fewer cores, for cost and single-threaded performance.

[^11]: Though see this paper arguing against it. I think the ability to reuse older DRAM for less latency-sensitive memory contents is an important application it doesn't consider, however.