From b0e0a5e914abfe00e8f2cafa81470373939c30da Mon Sep 17 00:00:00 2001
From: osmarks
Date: Fri, 6 Sep 2024 07:10:35 +0000
Subject: [PATCH] =?UTF-8?q?Edit=20=E2=80=98large=5Flanguage=5Fmodel?=
 =?UTF-8?q?=E2=80=99?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 large_language_model.myco | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/large_language_model.myco b/large_language_model.myco
index fb73502..931b9d5 100644
--- a/large_language_model.myco
+++ b/large_language_model.myco
@@ -4,7 +4,7 @@ A large language model is a [[neural net]] model of [[language]] which is [[larg

 The era of LLMs is often taken to have begun with [[https://arxiv.org/abs/1706.03762|Attention Is All You Need]], published in 2017, though it draws heavily on previous work in [[machine translation]]. This paper introduced the [[Transformer]] architecture used in most modern LLMs, though with only ~1e8 parameters and supervised training data its models are not central examples.

-The modern decoder-only architecture and [[self-supervised]] pretraining used now derives from [[https://openai.com/index/language-unsupervised/|OpenAI's GPT-1]], which demonstrated that then-large amounts of compute (240 GPU-days with unspecified GPUs) could be very effective across tasks with unspecialized training data and architecture. [[https://openai.com/index/better-language-models/|GPT-2]], which changed little but scaled further, was widely regarded as a [[cool toy]] by those aware of it, though those within [[OpenAI]] apparently understood the promise of scaling, as it was followed up by [[https://arxiv.org/abs/2005.14165|GPT-3]], which added several orders of magnitude more compute, resulting in enough capability to be both [[useful]] and [[fearsome]].
+The modern decoder-only architecture and [[self-supervised]] pretraining used now derives from [[https://openai.com/index/language-unsupervised/|OpenAI's GPT-1]], which demonstrated that then-large amounts of compute (240 GPU-days with unspecified GPUs) could be very effective across tasks with unspecialized training data and architecture. [[https://openai.com/index/better-language-models/|GPT-2]], which changed little but scaled further, was widely regarded as a [[cool toy]] by those aware of it, though those within [[OpenAI]] apparently understood the promise of scaling, as it was followed up by [[https://arxiv.org/abs/2005.14165|GPT-3]], which added several orders of magnitude more compute, resulting in [[https://gwern.net/gpt-3#what-benchmarks-miss-demos|enough capability]] to be both [[useful]] and [[fearsome]].

 Development slowed at this point due to design and compute limits. In the [[https://arxiv.org/abs/2203.15556|Chinchilla paper]], [[Google DeepMind]] showed that existing [[scaling laws]] were [[wrong]], and models could be trained substantially more efficiently by reducing parameter count and using additional training data, and [[https://arxiv.org/abs/2308.00951|Mixture of Experts architectures]] granted further compute efficiency. It was not until March 2023 - by which time several organizations had produced GPT-3-level models - that OpenAI announced [[GPT-4]] (though it was finished in August 2022 but kept secret for 8 months for "safety testing"). It represented a very significant advance, and combined with the [[instruction tuning]] which made [[ChatGPT]] a user-friendly product, crushed all competitors. Following this, however, was the [[GPT-4 Wall]].