mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-12-22 08:00:28 +00:00
Commit Graph

134 Commits

Author SHA1 Message Date
Andrej Karpathy
580902617c oops optimizer now demands to know device_type 2023-02-05 00:43:15 +00:00
Andrej Karpathy
34720df284 make more accurate the way in which we count parameters. previous count incorrectly included the positional encoding params, when typically only the number of weight parameters is reported for these models 2023-02-04 23:51:18 +00:00
Andrej Karpathy
3341b4cecc oops forgot to subtract embedding params, which don't enter the 6ND equation 2023-02-04 22:33:35 +00:00
Andrej Karpathy
5a162bc773 fix silly error, i don't want to confuse a future GPT training on this notebook in the future 2023-02-04 22:11:16 +00:00
Andrej Karpathy
0bb96d3fff add reference for 6ND to notebook too 2023-02-04 22:07:32 +00:00
Andrej Karpathy
eae986c2d2 new notebook with a bunch of calculations related to flops and memory of Transformer 2023-02-04 22:02:53 +00:00
Andrej Karpathy
a74e8363a2 clean up TODOs a bit, they are stale 2023-02-04 21:11:25 +00:00
Andrej Karpathy
25d95dbd65 mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags 2023-02-04 21:06:17 +00:00
Andrej Karpathy
e108ffb973 very slight refactor, bit cleaner 2023-02-04 19:34:24 +00:00
Andrej
dc149891b6
Merge pull request from nynyg/remove_cpu_pin_mem
Pin memory only when training on GPU
2023-02-04 11:28:08 -08:00
Nan Yang
b8286f343e Pin memory only when training on GPU 2023-02-04 11:16:26 -08:00
Andrej Karpathy
77e7e04c26 padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count. 2023-02-04 16:06:18 +00:00
Andrej Karpathy
b3c17c6c6a slight tweak compressing LOC 2023-02-04 15:57:29 +00:00
Andrej
53d56b82f1
Merge pull request from ramtingh/master
Minor change to allow using ddp with exclusive process mode
2023-02-04 07:42:32 -08:00
Ramtin Gharleghi
9da1627c7f
Explicitly set ddp device 2023-02-04 15:07:36 +11:00
Andrej Karpathy
3fd4c0c5ef who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol 2023-02-04 02:52:48 +00:00
Andrej
46428d3142
Merge pull request from akashmjn/akashmjn/fix-notebook-stats
add template .gitattributes that fixes language stats
2023-02-03 17:23:44 -08:00
Akash Mahajan
d9a73374ed
keep only what's needed 2023-02-03 15:13:13 -08:00
Andrej Karpathy
3969860ff5 include launch command too. anyone should be able to do this now 2023-02-03 22:17:05 +00:00
Andrej Karpathy
f9348f3f18 add gpt2 training config 2023-02-03 22:14:37 +00:00
Akash Mahajan
0e2c12b5ae
add template .gitattributes that fixes language stats 2023-02-03 13:36:36 -08:00
Andrej Karpathy
e170e40872 use the new fused AdamW from pytorch nightly, if available 2023-02-03 17:56:51 +00:00
Andrej
7d44bdf6b5
Merge pull request from YassineYousfi/master
use the ``enabled`` arg in GradScaler
2023-02-02 17:23:22 -08:00
Andrej Karpathy
1e87509e47 if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh 2023-02-02 23:22:56 +00:00
Andrej Karpathy
d8b1a94519 change grad accum to default off because i think it just confuses everyone 2023-02-02 18:38:49 +00:00
Andrej Karpathy
d01863ef01 small usability tweaks to bench 2023-02-02 17:23:46 +00:00
Yassine Yousfi
40f4d6ff70 use the enabled arg in GradScaler 2023-01-31 21:12:49 -08:00
Andrej Karpathy
d995c22128 fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not 2023-02-01 02:05:34 +00:00
Andrej Karpathy
038ce89438 rename iter to it, because iter is a concrete Python builtin 2023-01-31 23:34:02 +00:00
Andrej Karpathy
d2705bd92a tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap 2023-01-31 21:57:07 +00:00
Andrej Karpathy
4386bce1f4 adjust teaser figure with a more tuned result 2023-01-31 21:43:30 +00:00
Andrej Karpathy
924a0873eb merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training 2023-01-30 23:40:35 +00:00
Andrej Karpathy
ae06d0b15a add flash attention support, resolving last few issues but for now seems to work ok 2023-01-30 23:18:26 +00:00
Andrej Karpathy
0e90ee9d48 based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup 2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7 stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2 2023-01-27 20:51:50 +00:00
Andrej Karpathy
79dbe0086d let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality 2023-01-27 20:45:28 +00:00
Andrej Karpathy
e808a67149 bunch of plumbing of bias all around. measuring bias=False to be about 6% faster 2023-01-27 20:41:17 +00:00
Andrej Karpathy
cc5444e194 add the bias option to config, default it to True for now 2023-01-27 20:29:45 +00:00
Andrej Karpathy
2bf07a3fbf rewrite model class so layernorm has an optional bias= parameter 2023-01-27 20:17:32 +00:00
Andrej Karpathy
2892858ce7 attempt a non-biased model, per few papers that cite this as working well 2023-01-27 18:54:08 +00:00
Andrej Karpathy
f29a9ff5bf ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default. 2023-01-27 17:56:18 +00:00
Andrej Karpathy
23a0bfac20 try bring back mingpt init 2023-01-27 16:52:18 +00:00
Andrej Karpathy
3cb3fc059c grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains 2023-01-27 16:45:09 +00:00
Andrej Karpathy
e0c689cf38 allow the prompt to come from a file 2023-01-25 01:12:43 +00:00
Andrej Karpathy
21675d7755 allow sample.py to init from a pretrained gpt2 checkpoint as well, in similar style to train.py 2023-01-25 00:55:29 +00:00
johnwildauer
e0e94a1094 use GradScaler in model only if dtype is float16 2023-01-24 15:53:31 -07:00
Andrej
6c40a08b41
Merge pull request from danielgross/master
Missed two spots while relative pathing
2023-01-22 13:47:32 -08:00
DG
2f7fd0ac57 add relative import in shakespeare 2023-01-22 12:18:24 -08:00
DG
bf779456f3 add relative import in shakespeare_char 2023-01-22 11:11:25 -08:00
Andrej
3611338959
Merge pull request from cchan/patch-1
Zero-grad more aggressively to save memory
2023-01-20 14:38:10 -08:00