mirror of
https://github.com/osmarks/website
synced 2026-03-01 05:49:45 +00:00
misc edits
assets/misc/copenhagen_ethics.html (new file, 1790 lines): diff suppressed because one or more lines are too long.
@@ -17,7 +17,7 @@ Writing this has been a low priority for a while, but Elon Musk has forced my ha
## Superhuman niceness
::: epigraph attribution=Gwern link=https://www.reddit.com/r/MediaSynthesis/comments/1h2xy98/the_neruda_factory_jenn_failed_prompting/lzpzthu/
Nothing [Agreeable](https://en.wikipedia.org/wiki/Big_Five_personality_traits) makes it out of the near-future intact.
:::
@@ -31,7 +31,7 @@ They're probably not available in quantity, but at \$500 (plus shipping) per C60
The C600 on a desk before I installed it. It's slightly grubby from, presumably, prior use. I wonder what it was used for. Strangely, it came in a quantity-1 box with Graphcore branding and a decent amount of empty space: did they not care much about packaging efficiency, or were they being sold in extremely small quantities?
:::
It failed to turn on when I installed it, but it turns out I had just forgotten to connect one end of the power cable[^17]. Somewhat surprisingly, my [guesswork-based patch](https://github.com/osmarks/gc-kernel-module-patch/) to the kernel module worked fine, and the 2020-vintage [CLI tools](https://docs.graphcore.ai/projects/command-line-tools/en/latest/introduction.html) worked as expected[^14], except the FLOP/s benchmark `gc-flops`, which exited early for some reason and returned an infeasibly high result. I got [IPUpy](https://github.com/osmarks/IPUpy-patch), which runs 1472 Python interpreters concurrently, to work with some minor tweaks, but [IPUDOOM](https://github.com/jndean/IPUDOOM) failed with a mysterious linker error after I spent 30 minutes waiting for GCC 7 to compile. This turned out to be because it shipped an opaque precompiled binary (for the wrong IPU architecture) with code for JITing cross-tile communications (normally this is meant to be statically compiled on the host)[^21].
Actual ML workloads were harder. I wanted to run the [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) image encoder model previously used for my [meme initiatives](/memescale/). In principle, with [0.7 model TFLOPS](https://github.com/mlfoundations/open_clip/blob/main/docs/model_profile.csv), 280 TFLOP/s of FP16 and a reasonable 60% MFU, I should have been able to process 250 images per second. With 400 million parameters and 900MB of SRAM, I needed to use FP8, which the chip supports at double rate, so it should have been possible to go even faster. After spending several hours wrangling ONNX and [PopRT](https://github.com/graphcore/PopRT), since FP8 support was seemingly never added to their [PyTorch fork](https://github.com/graphcore/poptorch), I was able to execute the model, but only at an infeasibly low 100 images per second[^15]: I could only run at batch size 1, since at any higher batch size the compiler spun for ten minutes and then produced "insufficient tile memory" errors. The profiling tool, which still worked after unpacking it and running it with a newer Electron version, helpfully broke down cycle count by kernel, showing that enormous amounts of time were spent in some kind of on-tile copy operation and in presumably low-utilization matrix multiplies. With all the layers of abstraction between the model and the hardware, however, I could not tell why.
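As a sanity check, the back-of-envelope arithmetic above can be sketched in a few lines. The figures are the ones quoted in this paragraph; the helper functions are mine, not part of any Graphcore tooling:

```python
# Rough throughput and weight-memory estimates for SigLIP on the C600.
# Numbers are from the post; this is an estimate, not a benchmark.

def throughput_imgs_per_s(tflop_per_image: float, peak_tflops: float, mfu: float) -> float:
    """Achievable images/s given per-image FLOP cost, peak rate, and utilization."""
    return peak_tflops * mfu / tflop_per_image

def weight_mb(n_params: float, bytes_per_param: int) -> float:
    """Model weight footprint in MB at a given parameter width."""
    return n_params * bytes_per_param / 1e6

# 0.7 TFLOP/image, 280 TFLOP/s FP16 peak, 60% MFU -> ~240 images/s
fp16_rate = throughput_imgs_per_s(0.7, 280, 0.60)

# 400M params: 800 MB at FP16 barely fits 900 MB of SRAM once activations
# and code are counted, hence FP8 (400 MB, and double the compute rate).
fp16_mb = weight_mb(400e6, 2)
fp8_mb = weight_mb(400e6, 1)
```

At FP8's doubled rate the same arithmetic gives roughly 480 images per second, which is why 100 images per second was so disappointing.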
@@ -78,3 +78,5 @@ Even without this, there are some possible applications which do work quite well
[^19]: Not data latency, which is [250ns](https://www.graphcore.ai/posts/accelerating-resnet50-training-on-the-ipu-behind-our-mlperf-benchmark). There are separate cables for sync.
[^20]: [Tenstorrent](https://tenstorrent.com/) had it earlier than Nvidia, but it has been bugged for generations: the functional model is [slightly defective](https://github.com/tenstorrent/tt-isa-documentation/blob/main/WormholeB0/TensixTile/TensixCoprocessor/SFPSTOCHRND_FloatFloat.md).
[^21]: As the "IPU21" architecture in the chip is very close to the "IPU2" of the binary, it can be patched to operate anyway. It appeared to run properly after that, though I don't know how to play Doom.