Andrej Karpathy | 8b1e43209e | small tweaks: make the default weight decay 0.1, as is often cited, and remove the spurious init of LayerNorm, which is already initialized to 1 (weight) and 0 (bias) | 2023-02-06 23:07:25 +00:00

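A minimal sketch of what configuring AdamW with a 0.1 weight decay can look like, assuming the common convention of decaying only the 2D matmul/embedding weights; the helper name and hyperparameters are illustrative, not the repo's exact code:

```python
import torch

def configure_optimizer(model, weight_decay=0.1, learning_rate=6e-4, betas=(0.9, 0.95)):
    # decay only the >=2D tensors (matmul weights, embeddings);
    # biases and LayerNorm parameters are left un-decayed
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]
    no_decay = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=learning_rate, betas=betas)
```
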
Andrej Karpathy | ab0718a7dd | add estimation of model flops utilization (MFU), a commonly reported metric that expresses achieved token throughput as a fraction of A100 bfloat16 peak flops (312 TFLOPS); this gives us a sense of the hardware utilization we're achieving | 2023-02-05 00:48:58 +00:00

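A rough sketch of how such an MFU estimate can be computed, following the PaLM-style approximation of about 6N + 12·L·H·Q·T flops per token; the function and argument names are illustrative:

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 tokens_per_iter, iter_time_s, peak_flops=312e12):
    # ~6*N flops per token for the dense matmuls, plus the attention term
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_achieved = flops_per_token * tokens_per_iter / iter_time_s
    return flops_achieved / peak_flops  # fraction of A100 bfloat16 peak (312 TFLOPS)
```
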
Andrej Karpathy | 34720df284 | count parameters more accurately: the previous count incorrectly included the positional embedding params, whereas typically only the weight parameters are reported for these models | 2023-02-04 23:51:18 +00:00

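A sketch of a count that excludes the learned position embeddings, assuming a GPT-2-style module layout with a `transformer.wpe` embedding (the attribute names are assumptions for illustration):

```python
def get_num_params(model, non_embedding=True):
    # total parameter count, optionally minus the position embeddings;
    # token embeddings stay in the count because they are tied to lm_head
    n_params = sum(p.numel() for p in model.parameters())
    if non_embedding:
        n_params -= model.transformer.wpe.weight.numel()
    return n_params
```
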
Andrej Karpathy | 25d95dbd65 | mildly dramatic refactor for handling all the usage cases across all possible supported and unsupported devices and all the possible switches and flags | 2023-02-04 21:06:17 +00:00

Andrej Karpathy | 77e7e04c26 | pad the vocab_size from 50257 to 50304, the nearest multiple of 64. the smallest optimization with the biggest payoff I've made in the recent past: about 25% faster, because the last layer is a major latency bottleneck, consuming about 40% of latency due to its very high channel count | 2023-02-04 16:06:18 +00:00

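The padding arithmetic itself is simple rounding up to a multiple of 64, sketched here:

```python
def pad_vocab(vocab_size, multiple=64):
    # round up to the nearest multiple, e.g. 50257 -> 50304
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab(50257) == 50304
```
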
Andrej Karpathy | e170e40872 | use the new fused AdamW from pytorch nightly, if available | 2023-02-03 17:56:51 +00:00

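A hedged sketch of opting into the fused kernel only when the installed PyTorch exposes it, by inspecting the AdamW signature; the wrapper function is illustrative:

```python
import inspect
import torch

def make_adamw(optim_groups, learning_rate, betas, device_type):
    # fused AdamW only exists in newer PyTorch builds and mainly helps on CUDA
    fused_available = "fused" in inspect.signature(torch.optim.AdamW).parameters
    extra_args = dict(fused=True) if fused_available and device_type == "cuda" else dict()
    return torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
```
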
Andrej Karpathy | 1e87509e47 | if dropout > 0.0, disable Flash until the pytorch fix lands; fall back instead of failing an assert | 2023-02-02 23:22:56 +00:00

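The guard can be as small as a capability-plus-dropout check, sketched below; the issue at the time was scaled_dot_product_attention misbehaving with non-zero dropout, so the code falls back rather than asserting:

```python
import torch.nn.functional as F

def can_use_flash(dropout: float) -> bool:
    # use the fused scaled_dot_product_attention only when it exists and dropout is off
    return hasattr(F, "scaled_dot_product_attention") and dropout == 0.0
```
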
Andrej Karpathy | d995c22128 | fix a bug with loading GPT-2 parameters: an assert was incorrectly tripped because .bias can be missing, since it is now only optionally present depending on whether Flash is used | 2023-02-01 02:05:34 +00:00

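One way to make the checkpoint-loading check robust is to skip the causal-mask buffer names, which may or may not be registered depending on the attention path; a hedged sketch:

```python
def loadable_keys(state_dict):
    # ignore the causal-mask buffer: it is not a learned parameter and is absent
    # when the Flash path is used, so it should not participate in the key check
    return [k for k in state_dict.keys() if not k.endswith(".attn.bias")]
```
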
Andrej Karpathy | ae06d0b15a | add flash attention support; still resolving the last few issues, but for now it seems to work ok | 2023-01-30 23:18:26 +00:00

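A self-contained sketch of causal attention that uses PyTorch's fused scaled_dot_product_attention when available and otherwise falls back to the explicit materialized attention matrix:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (batch, n_head, seq_len, head_dim)
    if hasattr(F, "scaled_dot_product_attention"):
        # Flash path: fused kernel, causal masking handled internally
        return F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=True)
    # slow fallback: materialize the (T, T) attention matrix with a causal mask
    T, d = q.size(-2), q.size(-1)
    att = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = att.masked_fill(~mask, float("-inf"))
    return F.softmax(att, dim=-1) @ v
```
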
Andrej Karpathy | e808a67149 | a bunch of plumbing of the bias flag all around. measuring bias=False to be about 6% faster | 2023-01-27 20:41:17 +00:00

Andrej Karpathy | cc5444e194 | add the bias option to config, default it to True for now | 2023-01-27 20:29:45 +00:00

Andrej Karpathy | 2bf07a3fbf | rewrite model class so LayerNorm has an optional bias= parameter | 2023-01-27 20:17:32 +00:00

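A sketch of such a LayerNorm wrapper, since nn.LayerNorm at the time did not expose a bias=False flag; the functional form accepts bias=None, so no-bias comes for free:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    """LayerNorm with an optional bias term."""
    def __init__(self, ndim, bias=True):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
    def forward(self, x):
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
```
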
Andrej Karpathy | 2892858ce7 | attempt a bias-free model, per a few papers that report this working well | 2023-01-27 18:54:08 +00:00

Andrej Karpathy | 23a0bfac20 | try bringing back the minGPT init | 2023-01-27 16:52:18 +00:00

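For reference, a minGPT-style init is roughly normal(0, 0.02) on Linear and Embedding weights with zeroed biases; a hedged sketch:

```python
import torch.nn as nn

def init_weights(module):
    # GPT-2 / minGPT style initialization
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)

# usage: model.apply(init_weights)
```
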
Andrej Karpathy | 89da79eee1 | add a note of caution about the warning that gets produced; investigate later | 2023-01-14 20:38:22 +00:00

Andrej Karpathy | 91d02510ce | fix bug: if top_k > vocab_size, torch.topk will throw an error | 2023-01-14 03:57:00 +00:00

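The fix amounts to clamping k before the torch.topk call; a sketch of top-k filtering of sampling logits:

```python
import torch

def top_k_filter(logits, top_k):
    # logits: (batch, vocab_size); clamp k so torch.topk never sees k > vocab_size
    k = min(top_k, logits.size(-1))
    v, _ = torch.topk(logits, k)
    out = logits.clone()
    out[out < v[:, [-1]]] = float("-inf")  # keep only the top-k values per row
    return out
```
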
Andrej Karpathy | 43b37fd568 | reverse the order of the weight tying, making sure the final layer's init is preserved and becomes the token embedding, instead of the other way around; otherwise the loss can be all messed up from a bad init | 2023-01-14 02:16:10 +00:00

Andrej Karpathy | 7c8288552b | tie the weights of lm_head.weight and transformer.wte.weight, i.e. the last linear layer of the decoder and the token embeddings | 2023-01-14 01:00:55 +00:00

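A toy sketch of the weight tying referenced in the two commits above; the assignment direction decides which module's initialization survives, which is what the order-reversal commit is about (module and size names are illustrative):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=50304, n_embd=768):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)                # token embeddings
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)   # final projection
        # weight tying: both modules now share a single Parameter, so the
        # right-hand side's initialization is the one that is kept
        self.wte.weight = self.lm_head.weight
```
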
Andrej Karpathy | 8f85b83347 | inference-time mini-optimization, low-hanging fruit (thanks @jxtps for raising): when we are running inference we can apply lm_head to only the very last token | 2023-01-12 06:02:50 +00:00

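A sketch of the idea: during generation only the last position's logits are needed, so the expensive vocab-sized projection can be applied to a single token (the helper function is illustrative):

```python
import torch.nn as nn

def lm_logits(lm_head: nn.Linear, x, inference: bool):
    # x: (batch, seq_len, n_embd) hidden states from the transformer trunk
    if inference:
        # list indexing keeps the time dimension -> (batch, 1, vocab_size)
        return lm_head(x[:, [-1], :])
    # training: project every position so the loss can see all tokens
    return lm_head(x)
```
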
Andrej Karpathy | 177d5f7dc5 | disable torch.jit.script here for a massive performance boost when using torch.compile, our default; see issue #11. thanks @vgoklani for flagging | 2023-01-02 23:05:01 +00:00

Andrej Karpathy | 2febf4463c | candidate changes to the APIs; have to think them through more | 2023-01-01 01:29:48 +00:00

ankandrew | 7f0e6d9a71 | Frozen GPTConfig | 2022-12-29 17:07:19 -03:00

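A frozen dataclass config might look like the sketch below; the field names follow the GPT-2 conventions used in this repo, and the default values are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GPTConfig:
    # frozen=True makes instances hashable and prevents accidental mutation after construction
    block_size: int = 1024
    vocab_size: int = 50304
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    dropout: float = 0.0
    bias: bool = True
```
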
Andrej Karpathy | fe8042867c | first very bad commit | 2022-12-28 00:58:19 +00:00