documentation/softmax_bottleneck.myco

Almost all modern [[large language model|LLMs]] map relatively low-dimensional hidden states to high-dimensional probability distributions over [[tokenizer|tokens]] using a single [[matrix]] followed by a [[softmax]] operation. The [[rank]] of this matrix is at most the hidden size, which is typically much smaller than the vocabulary size, so not every valid probability distribution can be represented. In particular, some mixtures of tokens cannot be represented without additional tokens also receiving higher probability, especially when that mixture is uncommon in the training data. This has a number of [[consequences]].
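
The rank limit can be checked numerically. The following is a minimal sketch, not taken from the references: the sizes, the random output matrix `W`, and all variable names are illustrative assumptions. It relies on the fact that softmax log-probabilities equal the logits minus a per-row constant, so every reachable log-probability vector lies in the span of the columns of `W` plus the all-ones vector, a subspace of dimension at most hidden size + 1.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
# Toy sizes (assumptions); real models have far larger vocabularies.
vocab_size, hidden_size, num_states = 50, 4, 1000

W = rng.normal(size=(vocab_size, hidden_size))   # hypothetical output matrix
H = rng.normal(size=(num_states, hidden_size))   # many sampled hidden states

logits = H @ W.T                  # shape (num_states, vocab_size)
log_probs = log_softmax(logits)

# Rank of the logits is bounded by the hidden size; the log-probabilities
# gain at most one extra dimension from the normalization constant.
logit_rank = np.linalg.matrix_rank(logits)
log_prob_rank = np.linalg.matrix_rank(log_probs)

# A distribution is exactly representable only if its log-probabilities
# lie in span([W, 1]). Project an arbitrary target and measure what is left.
basis = np.hstack([W, np.ones((vocab_size, 1))])
target_log_probs = np.log(rng.dirichlet(np.ones(vocab_size)))
coeffs, *_ = np.linalg.lstsq(basis, target_log_probs, rcond=None)
residual = np.linalg.norm(basis @ coeffs - target_log_probs)

print(logit_rank, log_prob_rank, residual)
```

A nonzero `residual` means no hidden state reproduces the target distribution exactly under this toy matrix; how closely it can be approximated depends on `W` and the target.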

References:

* https://x.com/kalomaze/status/1776341569542431150
* https://aclanthology.org/2022.acl-long.554/
Edit ‘softmax_bottleneck’ 2024-09-17 20:00:24 +00:00
Create ‘softmax_bottleneck’ 2024-09-17 19:59:45 +00:00