mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-09-21 03:39:44 +00:00

33 Commits

Author SHA1 Message Date
Andrej
f08abb45bd Merge pull request #274 from apivovarov/gelu: Use nn.GELU - 1.27x faster training 2023-06-14 16:25:15 -07:00
Alexander Pivovarov
39ae397a93 Remove pos unsqueeze(0) 2023-05-17 02:30:18 +00:00
Alexander Pivovarov
594068e7ae Use nn.GELU 2023-05-17 00:53:35 +00:00
Andrej Karpathy
7fe4a099ad simplify configure_optimizers by a lot 2023-05-06 14:40:28 +00:00
Andrej
01e48ec1ab Merge pull request #240 from YassineYousfi/master: don't dropout in eval mode 2023-04-12 22:43:59 -07:00
Andrej
ad62003d7a Merge pull request #142 from kovkev/patch-1: Fix the position of a comma 2023-04-12 22:24:06 -07:00
Yassine Yousfi
7399dfe39d dont always dropout! 2023-04-10 22:56:22 -07:00
Kirill
c3f254844d Fix GPT.crop_block_size when flash attention is available 2023-03-24 14:51:02 +03:00
Driss Guessous
6170531b8a enable sdpa for nonzero dropout 2023-03-05 19:29:29 +00:00
kovkev
c2531159c7 Fix the position of a comma 2023-02-11 17:13:24 -08:00
Andrej Karpathy
8b1e43209e small tweaks, make default WD be 0.1 as is often cited, and remove spurious init of LayerNorm, which is already initialized at 1,0 2023-02-06 23:07:25 +00:00
Andrej Karpathy
ab0718a7dd add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving 2023-02-05 00:48:58 +00:00
Andrej Karpathy
34720df284 make more accurate the way in which we count parameters. previous count incorrectly included the positional encoding params, when typically only the number of weight parameters is reported for these models 2023-02-04 23:51:18 +00:00
Andrej Karpathy
25d95dbd65 mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags 2023-02-04 21:06:17 +00:00
Andrej Karpathy
77e7e04c26 padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count. 2023-02-04 16:06:18 +00:00
Andrej Karpathy
e170e40872 use the new fused AdamW from pytorch nightly, if available 2023-02-03 17:56:51 +00:00
Andrej Karpathy
1e87509e47 if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh 2023-02-02 23:22:56 +00:00
Andrej Karpathy
d995c22128 fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not 2023-02-01 02:05:34 +00:00
Andrej Karpathy
ae06d0b15a add flash attention support, resolving last few issues but for now seems to work ok 2023-01-30 23:18:26 +00:00
Andrej Karpathy
e808a67149 bunch of plumbing of bias all around. measuring bias=False to be about 6% faster 2023-01-27 20:41:17 +00:00
Andrej Karpathy
cc5444e194 add the bias option to config, default it to True for now 2023-01-27 20:29:45 +00:00
Andrej Karpathy
2bf07a3fbf rewrite model class so layernorm has an optional bias= parameter 2023-01-27 20:17:32 +00:00
Andrej Karpathy
2892858ce7 attempt a non-biased model, per few papers that cite this as working well 2023-01-27 18:54:08 +00:00
Andrej Karpathy
23a0bfac20 try bring back mingpt init 2023-01-27 16:52:18 +00:00
Andrej Karpathy
89da79eee1 add note of caution for the produced warning, investigate later 2023-01-14 20:38:22 +00:00
Andrej Karpathy
91d02510ce fix bug... if topk > vocab_size, torch.topk will throw error 2023-01-14 03:57:00 +00:00
Andrej Karpathy
43b37fd568 reverse the order, making sure that the final layer init is preserved, and becomes the token embedding instead of the other way around. otherwise the loss can be all messed up from a bad init 2023-01-14 02:16:10 +00:00
Andrej Karpathy
7c8288552b tie the weights of lm_head.weight and transformer.wte.weight, i.e. the last linear layer of decoder and the token embeddings. 2023-01-14 01:00:55 +00:00
Andrej Karpathy
8f85b83347 inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token 2023-01-12 06:02:50 +00:00
Andrej Karpathy
177d5f7dc5 disabling torch.jit.script here for massive performance boost when using torch.compile, our default. see issue #11. thanks @vgoklani for flagging 2023-01-02 23:05:01 +00:00
Andrej Karpathy
2febf4463c candidate changes to apis, have to think through more 2023-01-01 01:29:48 +00:00
ankandrew
7f0e6d9a71 Frozen GPTConfig 2022-12-29 17:07:19 -03:00
Andrej Karpathy
fe8042867c first very bad commit 2022-12-28 00:58:19 +00:00
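
The vocab_size padding described in commit 77e7e04c26 (50257 -> 50304) amounts to rounding up to the next multiple of 64. A minimal sketch of that arithmetic follows; the helper name pad_to_multiple is hypothetical and is not taken from the repository, which may simply hard-code the padded value.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the next multiple of `multiple` (hypothetical helper)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Integer ceiling division, then scale back up.
    return ((n + multiple - 1) // multiple) * multiple


# GPT-2's vocabulary size, padded as described in the commit message.
padded_vocab_size = pad_to_multiple(50257, 64)
print(padded_vocab_size)  # 50304
```

A value that is already a multiple of 64 is returned unchanged, so the helper is safe to apply to an already padded size.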