Mirror of https://github.com/osmarks/website, synced 2024-10-30 00:56:15 +00:00

Commit 9801444628 (parent be6940f78c): New blog posts and minor tweaks

.gitignore (vendored): 3 lines changed

@@ -3,4 +3,5 @@ out
openring
draft
cache.json
cache.sqlite3
+strings.json

assets/images/pricecog.png.original: new binary file, not shown (1.9 MiB)

@@ -37,7 +37,7 @@ Better manufacturing processes make transistors smaller, faster and lower-power,

It's only possible to fit about 1GB of SRAM onto a die, even if you are using all the die area and the maximum single-die size. Obviously, modern models are larger than this, and it wouldn't be economical to do this anyway. The solution used by most accelerators is to use external DRAM (dynamic random access memory). This is much cheaper and more capacious, at the cost of worse bandwidth and greater power consumption. Generally this will be HBM (high-bandwidth memory, which is more expensive and integrated more closely with the logic via advanced packaging), or some GDDR/LPDDR variant.

Another major constraint is power use, which directly contributes to running costs and cooling system complexity. Transistors being present and powered consumes power (static/leakage power) and transistors switching on and off consumes power (dynamic/switching power). The latter scales superlinearly with clock frequency, which is inconvenient, since performance scales slightly sublinearly with clock frequency. A handy Google paper[^8] (extending work from 2014[^9]), worth reading in its own right, provides rough energy estimates per operation, though without much detail about e.g. clock frequency:

| Operation | | Picojoules per Operation |||
|-----------|---|-------------|-------|-------|

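On the dynamic-power point above: the usual first-order CMOS model is P_dyn ≈ α·C·V²·f, and since supply voltage generally has to rise along with clock frequency, dynamic power ends up growing roughly with the cube of frequency while throughput grows at most linearly. A minimal back-of-the-envelope sketch; the baseline figures and the "V scales linearly with f" assumption are illustrative, not numbers from the post or the table:

```python
# First-order CMOS dynamic power: P_dyn ~ activity * C * V^2 * f.
# Illustrative assumption: supply voltage must scale linearly with frequency,
# so dynamic power grows roughly as f^3 while throughput grows at most as f.

def dynamic_power_w(freq_ghz: float, base_freq_ghz: float = 1.0,
                    base_power_w: float = 100.0) -> float:
    """Dynamic power at a new frequency, relative to a 100 W baseline at 1 GHz."""
    scale = freq_ghz / base_freq_ghz
    return base_power_w * scale ** 3  # one factor of f, two factors from V^2

for f in (1.0, 1.5, 2.0):
    print(f"{f:.1f} GHz: ~{dynamic_power_w(f):.0f} W dynamic, "
          f"for at most {f:.1f}x the throughput")
```

This is one reason accelerators tend to go wide (more parallel units at moderate clocks) rather than chasing frequency.
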
@@ -85,7 +85,7 @@ A danger of these architectures is that they're easy to overfit to specific mode

DRAM is costly, slow and power-hungry, so why don't we get rid of it? Many startups have tried this or similar things, by making big die with many small cores with attached SRAM and arithmetic units. This eliminates the power cost and bottlenecks of DRAM, meaning code can run much faster... as long as its data fits into the <1GB of SRAM available on each accelerator. In the past, this would often have been sufficient; now, scaling has eaten the world, and LLMs run into the terabytes at the highest end and ~10GB at the lower.

-A good example of this is Graphcore "IPUs" (intelligence processing units). They're very good at convolutions[^12] but achieve low utilization on large matrix multiplications, though the high memory bandwidth makes them better at small batches than a GPU would be. It's not clear to me what their design intention was, since their architecture's main advantage seems to be exactly the kind of fine-grained control which AI did not need when it was designed and doesn't need now[^13].
+A good example of this is Graphcore "IPUs" (intelligence processing units). They're very good at convolutions[^12] but achieve low utilization on large matrix multiplications, though the high memory bandwidth makes them better at small batches than a GPU would be. It's not clear to me exactly what their design intention was, since their architecture's main advantage is fine-grained local control, which standard neural nets did not need at the time and require even less now. It may be intended for "graph neural networks", which are used in some fields where inputs have more structure than text, or sparse training, where speed benefits are derived from skipping zeroed weights or activations. Like GPUs, this flexibility does also make them useful for non-AI tasks.

[Groq](https://groq.com/) has done somewhat better recently; I don't know what they intended either, but they also use many small slices of matrix units and SRAM. They are more focused on splitting models across several chips, rely on deterministic scheduling and precompilation, and became widely known for running big LLMs at multiple hundreds of tokens per second, an order of magnitude faster than most other deployments. However, they need [hundreds of chips](https://www.semianalysis.com/p/groq-inference-tokenomics-speed-but) for this due to memory capacity limits, and it's not clear that this responsiveness improvement justifies the cost.

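To put rough numbers on the memory-capacity and bandwidth constraints discussed above: for batch-1 decoding of a dense transformer, every generated token has to stream essentially all the weights from memory and perform about 2 FLOPs per parameter, which gives two easy upper bounds. A back-of-the-envelope sketch; the model size, precision, bandwidth and peak-FLOPs numbers are illustrative placeholders, not measurements of any particular chip:

```python
# Rough upper bounds on tokens/second for batch-1 decoding of a dense LLM:
# every token reads all weights once (bandwidth bound) and performs roughly
# 2 FLOPs per parameter (compute bound). All numbers are illustrative.

def decode_bounds(params_billion: float, bytes_per_param: float,
                  mem_bw_gb_s: float, peak_tflops: float) -> tuple[float, float]:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    flops_per_token = 2 * params_billion * 1e9
    bw_bound = mem_bw_gb_s * 1e9 / weight_bytes        # tokens/s if only bandwidth matters
    compute_bound = peak_tflops * 1e12 / flops_per_token
    return bw_bound, compute_bound

# e.g. a 70B-parameter model at 1 byte/param on a hypothetical 3300 GB/s, 1000 TFLOP/s chip
bw, comp = decode_bounds(70, 1.0, 3300, 1000)
print(f"bandwidth-bound: ~{bw:.0f} tok/s, compute-bound: ~{comp:.0f} tok/s")
```

At small batch sizes the bandwidth bound is the binding one by a wide margin, which is consistent with the post's point that very high tokens-per-second deployments like Groq's get there by spreading the weights across many chips' worth of SRAM bandwidth rather than by specialization.
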
@@ -95,7 +95,7 @@ Finally, [Tenstorrent](https://tenstorrent.com/) - their architecture doesn't fi

## Conclusion

-The design space of AI accelerators built on digital logic is fairly tightly bounded by the practical limits of manufacturing and the current algorithms used for AI. The exciting headline figures quoted by many startups belie problematic tradeoffs, and doing much better without a radical physical overhaul is not possible. The most egregious instance of this I've seen, which caused me to write this, is [Etched](https://www.etched.com/announcing-etched), who claim that by "burning the transformer architecture into [their] chip" they can achieve orders-of-magnitude gains over Nvidia GPUs. I don't think this is at all feasible: transformer inference is limited by memory bandwidth (sometimes) and total compute (more so) for performing large matmuls, not lack of specialization. Future hardware gains will come from the slow grind of slightly better process technology and designs - or potentially a large algorithmic change which makes specialization more gainful. Part of transformers' advantage was running with more parallelism on existing GPUs, but there is enough money in the field for strong hardware/software/algorithmic codesign. I don't know what form this will take.
+The design space of AI accelerators built on digital logic - for mainstream workloads - is fairly tightly bounded by the practical limits of manufacturing and the current algorithms used for AI. The exciting headline figures quoted by many startups belie problematic tradeoffs, and doing much better without a radical physical overhaul is not possible. The most egregious instance of this I've seen, which caused me to write this, is [Etched](https://www.etched.com/announcing-etched), who claim that by "burning the transformer architecture into [their] chip" they can achieve orders-of-magnitude gains over Nvidia GPUs. I don't think this is at all feasible: transformer inference is limited by memory bandwidth (sometimes) and total compute (more so) for performing large matmuls, not lack of specialization. Future hardware gains will come from the slow grind of slightly better process technology and designs - or potentially a large algorithmic change which makes specialization more gainful. Part of transformers' advantage was running with more parallelism on existing GPUs, but there is enough money in the field for strong hardware/software/algorithmic codesign. I don't know what form this will take.

[^1]: Technically, some algorithms can do better asymptotically, but these are not widely used because they only work at unreachably large scales, are numerically unstable, or complicate control flow.

@@ -118,5 +118,3 @@ The design space of AI accelerators built on digital logic is fairly tightly bou

[^11]: [Power consumption of a test chip](https://www.youtube.com/watch?v=rsxCZAE8QNA&t=1067) and [instruction overhead](https://www.youtube.com/watch?v=rsxCZAE8QNA&t=646).

[^12]: [https://arxiv.org/abs/1912.03413](https://arxiv.org/abs/1912.03413) page 76.

-[^13]: Possibly it's something to do with graph neural networks, given the "graph" in the name.

@@ -269,6 +269,10 @@ The author, Zachary Mason, also wrote [The Lost Books of the Odyssey](https://ww

* [an important lesson in cryptography](https://archiveofourown.org/works/50491414).
* [Stargate Physics 101](https://archiveofourown.org/works/3673335).
* [Programmer at Large](https://archiveofourown.org/works/9233966/chapters/21827111) does futuristic and strange software and society worldbuilding.
+* [Economies of Force](https://apex-magazine.com/short-fiction/economies-of-force/).
+* [The Titanomachy](https://kishoto.wordpress.com/2015/09/09/the-titanomachy-rrational-challenge-defied-prophecy/).
+* [Wizard, Cabalist, Ascendant](/stuff/wizard-cabalist-ascendant.xhtml).
+* [Sekhmet Hunts The Dying: A Computation](http://www.beneath-ceaseless-skies.com/stories/sekhmet-hunts-the-dying-gnosis-a-computation/).
* More to be added when I feel like it.

## Games

blog/price-discrimination-by-cognitive-load.md (new file): 19 lines

@@ -0,0 +1,19 @@
---
title: Price discrimination by cognitive load
description: A slightly odd pattern I've observed.
created: 16/10/2024
slug: pricecog
---

Price discrimination is a practice where sellers of a product try to sell materially the same product to different customers at different prices, closer to each customer's willingness to pay. Economists probably have opinions on whether this is good or bad, but I'm going to focus on a specific mechanism for it here. Price discrimination requires some way to show different customers different prices: this might be through timing, very slightly different variants of the product, location, or selling directly to customers without a public price.

One perhaps more modern variant is to discriminate by making prices extremely confusing. I initially thought of this while discussing airline ticketing with one of my friends: the pricing of air fares is [famously](http://www.demarcken.org/carl/papers//ITA-software-travel-complexity/ITA-software-travel-complexity.html) so complex that it is literally impossible[^1] to determine the best price for a journey in some cases. One possible explanation is that the airlines want to price-discriminate by offering fares with strange restrictions to more price-sensitive buyers while charging more for convenient journeys. However, in some cases it's possible to pay less for exactly the same journey with the same features through a more complicated purchasing structure!

This can be explained away as an oversight - perhaps the airlines do not understand the implications of their actions - but this seems implausible given the ruthless cost-optimization of their other operations. A more egregious example is the UK's rail ticketing system: it is sometimes cheaper to buy tickets from A to B and B to C separately than to buy a single ticket from A to C directly, with no implications for the traveller (it's permitted to stay on the train) except slightly more complexity in booking, possible issues with seat reservations, and having to show tickets slightly more often. National Rail [knows about this](https://www.nationalrail.co.uk/tickets-railcards-and-offers/buying-a-ticket/split-train-tickets/), and clearly has for several years, but doesn't seem to care to fix it. This, too, can be explained by the patchwork and incoherent structure of UK rail, but it may also function as a strategy to maximize profits.

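The split-ticket case above is really a small discrete optimization problem: pick the set of intermediate stations at which to split the ticket so that the sum of fares is minimized. A minimal sketch with a made-up fare table; the stations, prices and the `fare`/`cheapest` helpers are all hypothetical, purely for illustration:

```python
# Find the cheapest way to ticket a journey by optionally splitting it at
# intermediate stations (listed in travel order). Fares are made-up data.
import itertools

fares = {  # hypothetical point-to-point ticket prices, in pounds
    ("A", "B"): 18.0, ("B", "C"): 21.0, ("A", "C"): 55.0,
}

def fare(x: str, y: str) -> float:
    return fares.get((x, y), float("inf"))

def cheapest(origin: str, dest: str, stops: list[str]) -> tuple[float, list[str]]:
    """Try every subset of intermediate stops; fine for a handful of them."""
    best_price, best_route = fare(origin, dest), [origin, dest]
    for r in range(1, len(stops) + 1):
        for combo in itertools.combinations(stops, r):
            route = [origin, *combo, dest]
            price = sum(fare(a, b) for a, b in zip(route, route[1:]))
            if price < best_price:
                best_price, best_route = price, route
    return best_price, best_route

print(cheapest("A", "C", ["B"]))  # (39.0, ['A', 'B', 'C']) versus 55.0 direct
```

This brute-force search is essentially what the split-ticketing services mentioned below automate.
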
Generally, the organizations are - deliberately or not - showing busy or price-insensitive customers one higher price while offering price-sensitive ones lower prices in exchange for more effort, exerted at sale time rather than at use time. Some other, more arguable examples include: mail-in rebates (a bizarre US practice where you are refunded part of a product's price for manually mailing in a voucher); online shopping platforms offering an expensive headline price for a product for quick purchases and a list of cheaper alternatives containing, somewhere, an identical or near-identical offering which is hard to find; and the horrors of American prescription drug pricing.

This is mostly relevant where resale is impractical or the price difference is low; otherwise, crafty organizations would simply arbitrage the pricing. But there's another route for external organizations to profit from this price structure: selling the service of automatically navigating the pricing for you. Services for this do in fact exist for the transport ticket cases, so it's reasonable to ask why the discriminatory structure still exists at all. I think this doesn't break the equilibrium so much as shift it: the discriminating organizations are often protective of their pricing data, so automatic solutions need their cooperation - which is often granted. Instead of selecting consumers on willingness to solve nonsensical discrete optimization problems, it selects them on willingness to go to some extra effort to use another service, and knowledge of the existence of clever tricks. Perhaps this is almost as good.

This may not be an especially important pattern, but I think it clarifies some things, it seems novel enough that I wasn't easily able to find preexisting work on it (though maybe I don't know the right terms, or it's buried in other things), and it permits me to feel vaguely annoyed that the world is not run by a perfect machine god executing zero-waste central planning.

[^1]: Uncomputable - apparently, there's some construction which lets you reduce Diophantine equations to fare search problems.