documentation/large_language_model.myco

A large language model (LLM) is a [[neural net]] model of [[language]] which is [[large]], usually in the sense of parameter count or total [[compute]], making it [[good]] at text prediction by [[scaling laws]]. Usually these are [[autoregressive]] and pretrained on general text data with a next token prediction [[loss function]], though this is not strictly required. The largest LLMs known have around 2 trillion parameters, though the smallest LLM is not known.
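
As a minimal sketch of the next token prediction loss mentioned above: the model emits one vector of logits per position, and the loss is the mean cross-entropy between the logits at position t and the actual token at position t+1. The function name and the plain-list representation are illustrative assumptions, not any particular library's API.

```python
import math

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t.

    logits_per_position: one list of vocabulary-sized logits per input position
    token_ids: the integer token ids of the same sequence
    (Both names are hypothetical, chosen for illustration.)
    """
    total = 0.0
    count = 0
    # The logits at the final position have no following token to predict.
    for logits, target in zip(logits_per_position[:-1], token_ids[1:]):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true next token.
        total += log_z - logits[target]
        count += 1
    return total / count

# A model with no information (uniform logits) over a 4-token vocabulary.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = next_token_loss(uniform_logits, [0, 1, 2])
```

With uniform logits the loss equals the natural log of the vocabulary size, which is the baseline any trained model should beat.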

== History

The era of LLMs is often taken to have begun with [[https://arxiv.org/abs/1706.03762|Attention Is All You Need]], published in 2017, though it draws heavily on previous work in [[machine translation]]. This paper introduced the [[Transformer]] architecture used in most modern LLMs, though with only ~1e8 parameters and supervised training data, its models are not central examples.

The modern decoder-only architecture and [[self-supervised]] pretraining derive from [[https://openai.com/index/language-unsupervised/|OpenAI's GPT-1]], which demonstrated that then-large amounts of compute (240 GPU-days on unspecified GPUs) could be very effective across tasks with unspecialized training data and architecture. [[https://openai.com/index/better-language-models/|GPT-2]], which changed little but scaled further, was widely regarded as a [[cool toy]] by those aware of it. Those within [[OpenAI]] apparently understood the promise of scaling, however, as it was followed up by [[https://arxiv.org/abs/2005.14165|GPT-3]], which used several orders of magnitude more compute, resulting in [[https://gwern.net/gpt-3#what-benchmarks-miss-demos|enough capability]] to be both [[useful]] and [[fearsome]], mostly via [[in-context learning]].

Development slowed at this point due to design and compute limits. In the [[https://arxiv.org/abs/2203.15556|Chinchilla paper]], [[Google DeepMind]] showed that existing [[scaling laws]] were [[wrong]]: for a fixed compute budget, models could be trained substantially more efficiently by reducing parameter count and using additional training data. [[https://arxiv.org/abs/2308.00951|Mixture of Experts architectures]] granted further compute efficiency. It was not until March 2023, by which time several organizations had produced GPT-3-level models, that OpenAI announced [[GPT-4]] (which had finished training in August 2022 but was kept secret for roughly seven months for "safety testing"). It represented a very significant advance and, combined with the [[instruction tuning]] which made [[ChatGPT]] a user-friendly product, crushed all competitors. Following this, however, was the [[GPT-4 Wall]].
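
The parameter/data tradeoff from the Chinchilla paper can be sketched numerically. This assumes two commonly cited approximations that are not stated on this page: training compute is roughly 6 × parameters × tokens FLOPs, and the compute-optimal ratio is roughly 20 training tokens per parameter. The function name and the GPT-3 figures (about 175 billion parameters, about 300 billion tokens) are supplied for illustration only.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumed rule of thumb: C ~ 6 * N * D
TOKENS_PER_PARAM = 20.0       # assumed approximate Chinchilla-optimal ratio

def compute_optimal(compute_flops):
    """Split a training FLOP budget into (parameters, tokens).

    Solves C = 6 * N * D subject to D = 20 * N, so N = sqrt(C / 120).
    This is a back-of-envelope estimate, not the paper's fitted law.
    """
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

# Approximate GPT-3 training budget under the same 6*N*D assumption.
gpt3_compute = FLOPS_PER_PARAM_TOKEN * 175e9 * 300e9
params, tokens = compute_optimal(gpt3_compute)
print(f"params ~ {params:.3g}, tokens ~ {tokens:.3g}")
```

Under these assumptions, the same budget would be better spent on a model several times smaller than GPT-3 trained on several times more tokens, which is the direction of the correction described above.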

== Applications

* [[Autogollark]]
* [[Code generation]]
* [[Email jobs]]
* [[Paperclip manufacturing]]
* [[Enterprise Resource Planning|ERP]]