add and remove information

2026-07-15 16:22:45 +00:00 · 2026-02-23 17:02:56 +00:00
parent 697dfd3843
commit 9c49e226ab
2 changed files with 14 additions and 3 deletions
@@ -17,7 +17,7 @@ Graphcore IPU chips contain ~1000 independent cores ("tiles"), whereas modern CP
 IPU architecture diagram via [Graphcore docs](https://docs.graphcore.ai/projects/ipu-programmers-guide/en/latest/about_ipu.html).
 :::

-Why this architecture? Graphcore has made different claims about it over the years, being quite an old company by AI standards (their founding in 2016 predates transformers and I imagine they had the core ideas beforehand). The most obvious reason for their design is sparsity support and overfitting to contemporary RNNs/CNNs[^9], but there are better reasons. GPT-5.2-high found [a presentation](https://cdn2.hubspot.net/hubfs/729091/assets/ScaledML%20Stanford%2024mar18%20SK.pdf) from 2018 justifying their strategy. They correctly determined that power would be a binding constraint on future AI hardware, that direct-to-GPU interconnects would need to scale beyond a single node and that memory bandwidth would continue to be a bottleneck. Also, they added hardware-accelerated [stochastic rounding](https://shape-of-code.com/2022/11/20/stochastic-rounding-reemerges/) for low-precision training in their first generation, while Nvidia only integrated this in recent Blackwell GPUs. Later, they [talk about](https://hc33.hotchips.org/assets/program/conference/day2/HC2021.Graphcore.SimonKnowles.v04.pdf) the power and cost advantages of avoiding HBM, and how having enough SRAM allows using DRAM with lower bandwidth.
+Why this architecture? Graphcore has made different claims about it over the years, being quite an old company by AI standards (their founding in 2016 predates transformers and I imagine they had the core ideas beforehand). The most obvious reason for their design is sparsity support and overfitting to contemporary RNNs/CNNs[^9], but there are better reasons. GPT-5.2-high found [a presentation](https://cdn2.hubspot.net/hubfs/729091/assets/ScaledML%20Stanford%2024mar18%20SK.pdf) from 2018 justifying their strategy. They correctly determined that power would be a binding constraint on future AI hardware, that direct-to-GPU interconnects would need to scale beyond a single node and that memory bandwidth would continue to be a bottleneck. Also, they added hardware-accelerated [stochastic rounding](https://shape-of-code.com/2022/11/20/stochastic-rounding-reemerges/) for low-precision training in their first generation, while Nvidia only integrated this in recent Blackwell GPUs[^20]. Later, they [talk about](https://hc33.hotchips.org/assets/program/conference/day2/HC2021.Graphcore.SimonKnowles.v04.pdf) the power and cost advantages of avoiding HBM, and how having enough SRAM allows using DRAM with lower bandwidth.

 Most of these arguments and decisions are essentially correct, and very early: the overall Graphcore design was locked in a decade ago, but it's only in the past two or three years that datacentre buildouts became heavily power-constrained, Nvidia [started scaling NVLink to racks](https://www.nvidia.com/en-us/data-center/gb200-nvl72/), and HBM became supply-crunched (due to advanced packaging in ~2023 and memory production in ~2025[^6]) rather than merely costly. Some have blamed their lack of adoption on the architecture being difficult to program but this fails to distinguish them from competitors: efficient GPU kernels involve [all kinds of arcana](https://siboehm.com/articles/22/CUDA-MMM) even without newer sometimes-programming-model-breaking innovations such as tensor cores, [TMA](https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html#tensor-memory-accelerator) asynchronous loads, Blackwell's [async matrix multiplications](https://research.colfax-intl.com/cutlass-tutorial-writing-gemm-kernels-using-tensor-memory-for-nvidia-blackwell-gpus/), new low-precision floating point formats, partitioning SMs into compute and communication, and Hopper's [cursed swizzles](https://hazyresearch.stanford.edu/blog/2024-05-12-tk). Google TPUs used to require you to write TensorFlow code and have no public way to write low-level code for cases where the compiler isn't sufficient, and many were willing to put up with this agony because they were reasonably fast and [free](https://sites.research.google/trc/about/) for some hobbyists[^4], and they have a number of external customers these days. Graphcore IPUs lack a performant "eager mode" experience like GPUs, which puts off researchers, but this is also true of TPUs, as are the long compile times[^13]. TPUs and GPUs are (were) more accessible to hobbyists and consumers, but this feels an unreasonably self-serving explanation, IPUs were given to many researchers, and large B2B sales (which they had, or at least tried for) should have been less affected by this.

@@ -35,7 +35,7 @@ It failed to turn on when I installed it, but it turns out I had just forgotten

 Actual ML workloads were harder. I wanted to run the [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) image encoder model previously used for my [meme initiatives](/memescale/). In principle, with [0.7 model TFLOPS](https://github.com/mlfoundations/open_clip/blob/main/docs/model_profile.csv), 280TFLOP/s of FP16 and a reasonable 60% MFU, I should have been able to do 250 images per second. With 400 million parameters and 900MB of SRAM, I needed to use FP8, which the chip supports at double rate, so it should have been possible to go even faster. After spending several hours wrangling ONNX and [PopRT](https://github.com/graphcore/PopRT), since FP8 support was seemingly never added to their [PyTorch fork](https://github.com/graphcore/poptorch), I was able to execute the model, but only at an infeasibly low 100 images per second[^15], because I could only run at batch size 1, because at any higher batch size I got to experience the compiler spinning for ten minutes then producing "insufficient tile memory" errors. The profiling tool, which still worked after unpacking it and running it with a newer Electron version, helpfully broke down cycle count by kernel, showing that enormous amounts of time were spent in some kind of on-tile copy operation and presumably-low-utilization matrix multiplies. With all the layers of abstraction between the model and hardware, I did not know why, however.

-I suspected that it might have been due to inefficient attention computation. There is a [Flash Attention](https://github.com/graphcore-research/flash-attention-ipu) for IPU, but it's very unready and only works with Torch, which, as we established, does not work with FP8 in the outdated SDK. Poking at the open-source code further revealed nothing to me but enormous amounts of unpleasant C++ slop. In my hubris, I thought that with modern LLM technology it should be possible to simply replace all the inconvenient parts - the ML compiler and planner logic, but not the compiler and LLVM backend for individual tiles, which seems fine, and the low-level driver - with a cut-down pipeline for transformer inference only. Graphcore was going for training support also, and cared about complex mostly-convolutional models (in fact, matrix multiplies are handled as 1x1 convolutions). However, to generate useful code, you need to be able to operate across multiple tiles, and that requires exchange code generation[^16], and for some reason this is both closed-source and much more complicated than the "compute some timings, set four registers and trigger sync" I had anticipated. Reverse-engineering efforts are ongoing.
+I suspected that it might have been due to inefficient attention computation. There is a [Flash Attention](https://github.com/graphcore-research/flash-attention-ipu) for IPU, but it's very unready and only works with Torch, which, as we established, does not work with FP8 in the outdated SDK. Poking at the open-source code further revealed nothing to me but enormous amounts of unpleasant C++ slop. In my hubris, I thought that with modern LLM technology it should be possible to simply replace all the inconvenient parts - the ML compiler and planner logic, but not the compiler and LLVM backend for individual tiles, which seems fine, and the low-level driver - with a cut-down pipeline for transformer inference only. Graphcore was going for training support and multi-IPU operation, and cared about complex mostly-convolutional models (in fact, matrix multiplies are handled as 1x1 convolutions), which I can ignore. However, to generate useful code, you need to be able to operate across multiple tiles, and that requires exchange code generation[^16], and for some reason this is both closed-source and much more complicated than the "compute some timings, set four registers and trigger sync" I had anticipated. Reverse-engineering efforts are ongoing.

 Even without this, there are some possible applications which do work quite well. Small [GPT-2 training](https://github.com/graphcore-research/flash-attention-ipu/blob/main/demo/train.py) was perfectly operable when I tested it, and the stack seems good enough for my other very-small-model work. If anyone knows where the "Pod" hardware went (there used to be cloud offerings, but no more), I would like to try some out too.

@@ -63,7 +63,7 @@ Even without this, there are some possible applications which do work quite well

 [^12]: They wanted to focus on complete systems after their first generation, but apparently the Chinese market wanted more flexible PCIe cards, so they had to release C600. There might have been an export-controls reason, but I don't know of any which affected the pods and not the PCIe cards.

-[^13]: Possibly it's that Google's software is/was *less* annoying, or they were more willing to "eat bitterness" and make their engineers and researchers do more work to save money at scale, because TPUs avoided most external margins.
+[^13]: Possibly it's that Google's software is/was *less* annoying, or they were more willing to "eat bitterness" and make their engineers and researchers do more work to save money at scale, especially because TPUs avoided more external margins.

 [^14]: Unlike much of the rest of the stack, these do not have available source code.

@@ -76,3 +76,5 @@ Even without this, there are some possible applications which do work quite well
 [^18]: SIMD lanes in AVX-512 units are close to "CUDA cores". The GPU "core" number I used is SMSPs.

 [^19]: Not data latency, which is [250ns](https://www.graphcore.ai/posts/accelerating-resnet50-training-on-the-ipu-behind-our-mlperf-benchmark). There are separate cables for sync.
+
+[^20]: [Tenstorrent](https://tenstorrent.com/) had it earlier, but it has been bugged for generations: the functional model is [slightly defective](https://github.com/tenstorrent/tt-isa-documentation/blob/main/WormholeB0/TensixTile/TensixCoprocessor/SFPSTOCHRND_FloatFloat.md).
@@ -5242,5 +5242,14 @@
        "date": "2022-01-17T09:07:39.000Z",
        "website": "Graphcore",
        "auto": true
+    },
+    "https://github.com/tenstorrent/tt-isa-documentation/blob/main/WormholeB0/TensixTile/TensixCoprocessor/SFPSTOCHRND_FloatFloat.md": {
+        "inline": true,
+        "excerpt": "Contribute to tenstorrent/tt-isa-documentation development by creating an account on GitHub.",
+        "title": "tt-isa-documentation/WormholeB0/TensixTile/TensixCoprocessor/SFPSTOCHRND_FloatFloat.md at main · tenstorrent/tt-isa-documentation",
+        "author": "tenstorrent",
+        "date": null,
+        "website": "GitHub",
+        "auto": true
    }
 }