new graphics demos, fix a bug, minor post updates, add opengraph
@@ -21,7 +21,7 @@ Matrix multiplication uses $O(n^3)$ operations (wrt. matrix size)[^1], primarily
It is, of course, not really possible to do 300 useful operations on a single byte without interacting with anything else, so the GPU has registers, shared memory and caches with much higher bandwidth and lower latency than main memory - this allows keeping chunks of the input matrices closer to compute units and performing tiled matrix multiplications using those chunks[^5]. There's also dedicated hardware for asynchronously fetching data from memory without ever tying up compute generating memory addresses. Even with all this, H100 GPUs can usually only manage ~70% of their quoted FLOPS performing a matrix multiplication, and large-scale training runs only manage ~40%[^4].
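To make the tiling concrete, here is a minimal CUDA sketch of a shared-memory tiled matmul. The kernel name, the tile width, and the assumption of square row-major matrices with `N` divisible by `TILE` are all mine, and a real kernel would add tensor cores, double-buffering and the asynchronous copy hardware mentioned above; the point is only the memory-traffic structure:

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 32;

// C = A * B for square row-major float matrices; N assumed divisible by TILE.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    // Per-block staging buffers in fast on-chip shared memory.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each of the TILE*TILE threads fetches one
        // element of each input tile from global memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // The multiply-accumulates then read only shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Each input element is now fetched from DRAM N/TILE times rather than N times, so arithmetic intensity grows with the tile size - which is why the limited capacity of shared memory and registers matters so much.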
-The worse utiliation on real training runs is partly because of individual weight matrices and inputs not filling the GPU, and partly because of the aforementioned scalar/vector operations: naively, doing $\mathrm{ReLU}$ to a vector would waste the vast majority of FLOPS, because each value would be fetched, trivially operated on, and then written back, though kernel fusion (applying it before writing to memory during a matmul) mitigates this. There are also significant slowdowns introduced by multi-GPU operations. I don't know exactly what causes this, since communications and computation can mostly be overlapped in larger models, but one part is that all GPUs are forced to wait for all others at various points, and there is significant variance in performance between GPUs[^6]. Also, network communications tie up memory bandwidth and power budget.
+The worse utilization on real training runs is partly because of individual weight matrices and inputs not filling the GPU, and partly because of the aforementioned scalar/vector operations: naively, doing $\mathrm{ReLU}$ to a vector would waste the vast majority of FLOPS, because each value would be fetched, trivially operated on, and then written back, though kernel fusion (applying it before writing to memory during a matmul) mitigates this. There are also significant slowdowns introduced by multi-GPU operations. I don't know exactly what causes this, since communications and computation can mostly be overlapped in larger models, but one part is that all GPUs are forced to wait for all others at various points, and there is significant variance in performance between GPUs[^6]. Also, network communications tie up memory bandwidth and power budget.
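As a sketch of what fusion buys (continuing the hypothetical `tiled_matmul` above), compare a standalone ReLU pass with folding it into the matmul's final store:

```cuda
// Unfused: a whole kernel launch to do one fmaxf per element. Each value is
// fetched from DRAM, trivially operated on, and written back, so throughput
// is limited entirely by memory bandwidth, not FLOPS.
__global__ void relu(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = fmaxf(x[i], 0.0f);
}

// Fused: in tiled_matmul above, replace the final store
//     C[row * N + col] = acc;
// with
//     C[row * N + col] = fmaxf(acc, 0.0f);
// The value is already in a register, so the activation costs no extra
// memory traffic and no extra kernel launch.
```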
### DLRMs
@@ -119,4 +119,4 @@ The design space of AI accelerators built on digital logic is fairly tightly bou
[^12]: [https://arxiv.org/abs/1912.03413](https://arxiv.org/abs/1912.03413) page 76.
[^13]: Possibly it's something to do with graph neural networks, given the "graph" in the name.
|
||||
[^13]: Possibly it's something to do with graph neural networks, given the "graph" in the name.
|
||||
|
Reference in New Issue
Block a user