Edit ‘large_language_model’

2024-10-03 09:15:51 +00:00
parent 4da1b9ea26
commit a93594ac3b
1 changed file with 1 addition and 1 deletion
--- a/large_language_model.myco
+++ b/large_language_model.myco
@@ -4,7 +4,7 @@ A large language model is a [[neural net]] model of [[language]] which is [[larg

 The era of LLMs is often taken to have begun with [[https://arxiv.org/abs/1706.03762|Attention Is All You Need]], published in 2017, though it draws heavily on previous work in [[machine translation]]. This paper introduced the [[Transformer]] architecture used in most modern LLMs, though with only ~1e8 parameters and supervised training data its models are not central examples.

-The modern decoder-only architecture and [[self-supervised]] pretraining used now derive from [[https://openai.com/index/language-unsupervised/|OpenAI's GPT-1]], which demonstrated that then-large amounts of compute (240 GPU-days with unspecified GPUs) could be very effective across tasks with unspecialized training data and architecture. [[https://openai.com/index/better-language-models/|GPT-2]], which changed little but scaled further, was widely regarded as a [[cool toy]] by those aware of it, though those within [[OpenAI]] apparently understood the promise of scaling, as it was followed up by [[https://arxiv.org/abs/2005.14165|GPT-3]], which added several orders of magnitude more compute, resulting in [[https://gwern.net/gpt-3#what-benchmarks-miss-demos|enough capability]] to be both [[useful]] and [[fearsome]].
+The modern decoder-only architecture and [[self-supervised]] pretraining used now derive from [[https://openai.com/index/language-unsupervised/|OpenAI's GPT-1]], which demonstrated that then-large amounts of compute (240 GPU-days with unspecified GPUs) could be very effective across tasks with unspecialized training data and architecture. [[https://openai.com/index/better-language-models/|GPT-2]], which changed little but scaled further, was widely regarded as a [[cool toy]] by those aware of it, though those within [[OpenAI]] apparently understood the promise of scaling, as it was followed up by [[https://arxiv.org/abs/2005.14165|GPT-3]], which added several orders of magnitude more compute, resulting in [[https://gwern.net/gpt-3#what-benchmarks-miss-demos|enough capability]] to be both [[useful]] and [[fearsome]], mostly via [[in-context learning]].

 Development slowed at this point due to design and compute limits. In the [[https://arxiv.org/abs/2203.15556|Chinchilla paper]], [[Google DeepMind]] showed that existing [[scaling laws]] were [[wrong]] and that models could be trained substantially more efficiently by reducing parameter count and using additional training data, while [[https://arxiv.org/abs/2308.00951|Mixture of Experts architectures]] granted further compute efficiency. It was not until March 2023 - by which time several organizations had produced GPT-3-level models - that OpenAI announced [[GPT-4]] (although it had been finished in August 2022 and kept secret for 8 months for "safety testing"). It represented a very significant advance and, combined with the [[instruction tuning]] that made [[ChatGPT]] a user-friendly product, crushed all competitors. Following this, however, was the [[GPT-4 Wall]].