nanogpt-experiments

mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-09-21 03:39:44 +00:00

Author	SHA1	Message	Date
Andrej	f08abb45bd	Merge pull request #274 from apivovarov/gelu: Use nn.GELU - 1.27x faster training	2023-06-14 16:25:15 -07:00
Alexander Pivovarov	39ae397a93	Remove pos unsqueeze(0)	2023-05-17 02:30:18 +00:00
Alexander Pivovarov	594068e7ae	Use nn.GELU	2023-05-17 00:53:35 +00:00
Andrej Karpathy	7fe4a099ad	simplify configure_optimizers by a lot	2023-05-06 14:40:28 +00:00
Andrej	01e48ec1ab	Merge pull request #240 from YassineYousfi/master: don't dropout in eval mode	2023-04-12 22:43:59 -07:00
Andrej	ad62003d7a	Merge pull request #142 from kovkev/patch-1: Fix the position of a comma	2023-04-12 22:24:06 -07:00
Yassine Yousfi	7399dfe39d	dont always dropout!	2023-04-10 22:56:22 -07:00
Kirill	c3f254844d	Fix GPT.crop_block_size when flash attention is available	2023-03-24 14:51:02 +03:00
Driss Guessous	6170531b8a	enable sdpa for nonzero dropout	2023-03-05 19:29:29 +00:00
kovkev	c2531159c7	Fix the position of a comma	2023-02-11 17:13:24 -08:00
Andrej Karpathy	8b1e43209e	small tweaks, make default WD be 0.1 as is often cited, and remove spurious init of LayerNorm, which is already initialized at 1,0	2023-02-06 23:07:25 +00:00
Andrej Karpathy	ab0718a7dd	add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving	2023-02-05 00:48:58 +00:00
Andrej Karpathy	34720df284	make more accurate the way in which we count parameters. previous count incorrectly included the positional encoding params, when typically only the number of weight parameters is reported for these models	2023-02-04 23:51:18 +00:00
Andrej Karpathy	25d95dbd65	mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags	2023-02-04 21:06:17 +00:00
Andrej Karpathy	77e7e04c26	padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count.	2023-02-04 16:06:18 +00:00
Andrej Karpathy	e170e40872	use the new fused AdamW from pytorch nightly, if available	2023-02-03 17:56:51 +00:00
Andrej Karpathy	1e87509e47	if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh	2023-02-02 23:22:56 +00:00
Andrej Karpathy	d995c22128	fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not	2023-02-01 02:05:34 +00:00
Andrej Karpathy	ae06d0b15a	add flash attention support, resolving last few issues but for now seems to work ok	2023-01-30 23:18:26 +00:00
Andrej Karpathy	e808a67149	bunch of plumbing of bias all around. measuring bias=False to be about 6% faster	2023-01-27 20:41:17 +00:00
Andrej Karpathy	cc5444e194	add the bias option to config, default it to True for now	2023-01-27 20:29:45 +00:00
Andrej Karpathy	2bf07a3fbf	rewrite model class so layernorm has an optional bias= parameter	2023-01-27 20:17:32 +00:00
Andrej Karpathy	2892858ce7	attempt a non-biased model, per few papers that cite this as working well	2023-01-27 18:54:08 +00:00
Andrej Karpathy	23a0bfac20	try bring back mingpt init	2023-01-27 16:52:18 +00:00
Andrej Karpathy	89da79eee1	add note of caution for the produced warning, investigate later	2023-01-14 20:38:22 +00:00
Andrej Karpathy	91d02510ce	fix bug... if topk > vocab_size, torch.topk will throw error	2023-01-14 03:57:00 +00:00
Andrej Karpathy	43b37fd568	reverse the order, making sure that the final layer init is preserved, and becomes the token embedding instead of the other way around. otherwise the loss can be all messed up from a bad init	2023-01-14 02:16:10 +00:00
Andrej Karpathy	7c8288552b	tie the weights of lm_head.weight and transformer.wte.weight, i.e. the last linear layer of decoder and the token embeddings.	2023-01-14 01:00:55 +00:00
Andrej Karpathy	8f85b83347	inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token	2023-01-12 06:02:50 +00:00
Andrej Karpathy	177d5f7dc5	disabling torch.jit.script here for massive performance boost when using torch.compile, our default. see issue #11 . thanks @vgoklani for flagging	2023-01-02 23:05:01 +00:00
Andrej Karpathy	2febf4463c	candidate changes to apis, have to think through more	2023-01-01 01:29:48 +00:00
ankandrew	7f0e6d9a71	Frozen GPTConfig	2022-12-29 17:07:19 -03:00
Andrej Karpathy	fe8042867c	first very bad commit	2022-12-28 00:58:19 +00:00

33 Commits
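
Two of the commits above describe small pieces of arithmetic that are easy to check by hand: 77e7e04c26 pads vocab_size from 50257 to 50304, and 91d02510ce guards against a top-k value larger than the vocabulary. The sketch below is only an illustration of those two ideas in plain Python, not the code from the repository. The function names `pad_vocab_size` and `clamp_top_k` are hypothetical, and the sketch assumes the padding always rounds up (a vocabulary cannot be shrunk), which is why 50257 becomes 50304 rather than the numerically closer 50240.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the next multiple of `multiple`.

    Hypothetical helper: illustrates the 50257 -> 50304 padding
    mentioned in commit 77e7e04c26, assuming round-up semantics.
    """
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Integer ceiling division, then scale back up.
    return ((vocab_size + multiple - 1) // multiple) * multiple


def clamp_top_k(top_k: int, vocab_size: int) -> int:
    """Limit top_k to the vocabulary size.

    Hypothetical helper: illustrates the guard described in commit
    91d02510ce, where asking for more candidates than there are
    vocabulary entries would otherwise raise an error.
    """
    return min(top_k, vocab_size)


if __name__ == "__main__":
    padded = pad_vocab_size(50257)
    print(padded)                       # padded vocabulary size
    print(padded - 50257)               # number of unused padding entries
    print(clamp_top_k(100_000, padded)) # top_k capped at the vocabulary size
```

The padded entries are never produced by the tokenizer, so under this assumption they only change the shape of the embedding and output matrices, not the set of tokens the model can be trained on.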