From 6be8909f4a342999376dcf6dffe9bb433ee9d197 Mon Sep 17 00:00:00 2001
From: osmarks
Date: Thu, 5 Sep 2024 19:40:05 +0000
Subject: [PATCH] =?UTF-8?q?Create=20=E2=80=98large=5Flanguage=5Fmodel?=
 =?UTF-8?q?=E2=80=99?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 large_language_model.myco | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
 create mode 100644 large_language_model.myco

diff --git a/large_language_model.myco b/large_language_model.myco
new file mode 100644
index 0000000..59f564a
--- /dev/null
+++ b/large_language_model.myco
@@ -0,0 +1,17 @@
+A large language model is a [[neural net]] model of [[language]] which is [[large]], usually in the sense of parameter count or total [[compute]], making it [[good]] at text prediction by [[scaling laws]]. Usually these are [[autoregressive]] and pretrained on general text data with a next-token prediction [[loss function]], though this is not strictly required. The largest LLMs known have around 2 trillion parameters, though the smallest LLM is not known.
+
+== History
+
+The era of LLMs is often taken to have begun with [[https://arxiv.org/abs/1706.03762|Attention Is All You Need]], published in 2017, though that paper draws heavily on previous work in [[machine translation]]. It introduced the [[Transformer]] architecture used in most modern LLMs, though, with only ~1e8 parameters and supervised training data, its models are not central examples.
+
+The decoder-only architecture and [[self-supervised]] pretraining used now derive from [[https://openai.com/index/language-unsupervised/|OpenAI's GPT-1]], which demonstrated that then-large amounts of compute (240 GPU-days with unspecified GPUs) could be very effective across tasks with unspecialized training data and architecture.
[[https://openai.com/index/better-language-models/|GPT-2]], which changed little but scaled further, was widely regarded as a [[cool toy]] by those aware of it. Those within [[OpenAI]] apparently understood the promise of scaling, however, as it was followed up by [[https://arxiv.org/abs/2005.14165|GPT-3]], which added several orders of magnitude more compute, resulting in enough capability to be both [[useful]] and [[fearsome]].
+
+Development slowed at this point due to design and compute limits. In the [[https://arxiv.org/abs/2203.15556|Chinchilla paper]], [[Google DeepMind]] showed that existing [[scaling laws]] were [[wrong]] and that models could be trained substantially more efficiently by reducing parameter count and using additional training data; [[https://arxiv.org/abs/2308.00951|Mixture of Experts architectures]] granted further compute efficiency. It was not until March 2023 - by which time several organizations had produced GPT-3-level models - that OpenAI announced [[GPT-4]] (though it had been finished in August 2022 and kept secret for 8 months for "safety testing"). It represented a very significant advance and, combined with the [[instruction tuning]] which made [[ChatGPT]] a user-friendly product, crushed all competitors. Following this, however, was the [[GPT-4 Wall]].
+
+== Applications
+
+* [[Autogollark]]
+* [[Code generation]]
+* [[Email jobs]]
+* [[Paperclip manufacturing]]
+* [[ERP]]
\ No newline at end of file