mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-12-18 22:20:29 +00:00
Commit Graph

132 Commits

Author SHA1 Message Date
Andrej Karpathy
3341b4cecc oops forgot to subtract embedding params, which don't enter the 6ND equation 2023-02-04 22:33:35 +00:00
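The 6ND rule of thumb counts only non-embedding parameters: training compute C ≈ 6·N·D FLOPs for N non-embedding parameters and D training tokens. A minimal sketch of the arithmetic (hypothetical helper, not the notebook's exact code):

```python
def flops_6nd(n_layer: int, n_embd: int, tokens: float) -> float:
    # non-embedding parameters of a GPT-style decoder:
    # per block roughly 4*n_embd^2 (attention) + 8*n_embd^2 (MLP) = 12*n_embd^2
    n_nonembedding = n_layer * 12 * n_embd ** 2
    return 6 * n_nonembedding * tokens  # C ~= 6 N D, embeddings deliberately excluded from N

# e.g. a GPT-2-small-shaped model (12 layers, n_embd=768) on 300e9 tokens
print(f"{flops_6nd(12, 768, 300e9):.3e} FLOPs")
```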
Andrej Karpathy
5a162bc773 fix silly error, i don't want to confuse a future GPT training on this notebook 2023-02-04 22:11:16 +00:00
Andrej Karpathy
0bb96d3fff add reference for 6ND to notebook too 2023-02-04 22:07:32 +00:00
Andrej Karpathy
eae986c2d2 new notebook with a bunch of calculations related to flops and memory of Transformer 2023-02-04 22:02:53 +00:00
Andrej Karpathy
a74e8363a2 clean up TODOs a bit, they are stale 2023-02-04 21:11:25 +00:00
Andrej Karpathy
25d95dbd65 mildly dramatic refactor for handling all these use cases across all possible supported and unsupported devices for all the possible switches and flags 2023-02-04 21:06:17 +00:00
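A hedged sketch of the kind of device/dtype plumbing this refactor handles (variable names assumed, not necessarily the exact code): pick a device type and an autocast context that degrades gracefully when CUDA or bf16 is unavailable.

```python
import torch
from contextlib import nullcontext

device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
# no autocast on plain CPU float32; otherwise autocast in the requested dtype
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
```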
Andrej Karpathy
e108ffb973 very slight refactor, bit cleaner 2023-02-04 19:34:24 +00:00
Andrej
dc149891b6
Merge pull request #120 from nynyg/remove_cpu_pin_mem
Pin memory only when training on GPU
2023-02-04 11:28:08 -08:00
Nan Yang
b8286f343e Pin memory only when training on GPU 2023-02-04 11:16:26 -08:00
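Pinned (page-locked) host memory only pays off for asynchronous host-to-CUDA copies, so the change gates it on the device. A small sketch under that assumption:

```python
import torch

def to_device(x: torch.Tensor, device: str) -> torch.Tensor:
    if 'cuda' in device:
        # pinned memory enables a non-blocking DMA copy to the GPU
        return x.pin_memory().to(device, non_blocking=True)
    # on cpu/mps, pinning is pointless overhead
    return x.to(device)
```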
Andrej Karpathy
77e7e04c26 padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest-deal smallest optimization i've made in the recent past, about 25% faster. this is because the last layer is a major bottleneck, consuming about 40% of latency due to the very high channel count. 2023-02-04 16:06:18 +00:00
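The padding is just rounding the vocabulary up to the next multiple of 64 (50257 → 786 × 64 = 50304) so the final projection runs on tensor-core-friendly shapes; the extra token ids are simply never produced by the tokenizer. A worked sketch:

```python
def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    # round up to the nearest multiple of `multiple`
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab(50257) == 50304  # 47 padding ids that are never used
```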
Andrej Karpathy
b3c17c6c6a slight tweak compressing LOC 2023-02-04 15:57:29 +00:00
Andrej
53d56b82f1
Merge pull request #116 from ramtingh/master
Minor change to allow using ddp with exclusive process mode
2023-02-04 07:42:32 -08:00
Ramtin Gharleghi
9da1627c7f
Explicitly set ddp device 2023-02-04 15:07:36 +11:00
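A hedged sketch of what "explicitly set ddp device" means in practice (environment variable and names assumed): bind each rank to its own GPU rather than relying on the default device, which matters when GPUs run in exclusive process mode.

```python
import os
import torch

ddp_local_rank = int(os.environ.get('LOCAL_RANK', 0))  # set by torchrun
device = f'cuda:{ddp_local_rank}'
if torch.cuda.is_available():
    torch.cuda.set_device(device)  # make this rank's GPU the default CUDA device
```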
Andrej Karpathy
3fd4c0c5ef who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol 2023-02-04 02:52:48 +00:00
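The trick, sketched below with assumed names (get_batch, model, optimizer, max_iters are hypothetical stand-ins): request the next batch right after the forward pass so the host-to-device transfer, issued from pinned memory with non_blocking copies, overlaps with the backward pass on the current batch.

```python
X, Y = get_batch('train')            # assumed helper; issues an async copy to the GPU
for step in range(max_iters):
    logits, loss = model(X, Y)
    X, Y = get_batch('train')        # prefetch the next batch while the GPU is still busy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```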
Andrej
46428d3142
Merge pull request #115 from akashmjn/akashmjn/fix-notebook-stats
add template .gitattributes that fixes language stats
2023-02-03 17:23:44 -08:00
Akash Mahajan
d9a73374ed
keep only what's needed 2023-02-03 15:13:13 -08:00
Andrej Karpathy
3969860ff5 include launch command too. anyone should be able to do this now 2023-02-03 22:17:05 +00:00
Andrej Karpathy
f9348f3f18 add gpt2 training config 2023-02-03 22:14:37 +00:00
Akash Mahajan
0e2c12b5ae
add template .gitattributes that fixes language stats 2023-02-03 13:36:36 -08:00
Andrej Karpathy
e170e40872 use the new fused AdamW from pytorch nightly, if available 2023-02-03 17:56:51 +00:00
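A hedged sketch of the feature probe (model and hyperparameters assumed): only pass fused=True when this PyTorch build actually exposes the argument.

```python
import inspect
import torch

fused_ok = 'fused' in inspect.signature(torch.optim.AdamW).parameters and torch.cuda.is_available()
extra_args = dict(fused=True) if fused_ok else dict()
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), **extra_args)
```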
Andrej
7d44bdf6b5
Merge pull request #106 from YassineYousfi/master
use the ``enabled`` arg in GradScaler
2023-02-02 17:23:22 -08:00
Andrej Karpathy
1e87509e47 if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh 2023-02-02 23:22:56 +00:00
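Roughly, the gate looks like the snippet below (a fragment that would sit inside the attention module's __init__; names assumed): prefer flash attention only when the kernel exists and dropout is off, and warn instead of assert-failing otherwise.

```python
import torch

# self.flash decides between F.scaled_dot_product_attention and the manual path
self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention') and dropout == 0.0
if not self.flash:
    print("WARNING: using slow attention; flash attention needs PyTorch >= 2.0 and dropout == 0.0")
```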
Andrej Karpathy
d8b1a94519 change grad accum to default off because i think it just confuses everyone 2023-02-02 18:38:49 +00:00
Andrej Karpathy
d01863ef01 small usability tweaks to bench 2023-02-02 17:23:46 +00:00
Yassine Yousfi
40f4d6ff70 use the enabled arg in GradScaler 2023-01-31 21:12:49 -08:00
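The point of the enabled flag, sketched: construct the scaler unconditionally but make it a no-op unless the training dtype is float16, so the rest of the loop never has to branch.

```python
import torch

dtype = 'float16'  # assumed config value; could also be 'bfloat16' or 'float32'
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
# with enabled=False, scale()/unscale_()/step()/update() all degrade to pass-throughs
```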
Andrej Karpathy
d995c22128 fix bug with loading GPT-2 parameters: the assert gets incorrectly tripped because .bias may be missing, since it is now only optionally present depending on whether flash is used 2023-02-01 02:05:34 +00:00
Andrej Karpathy
038ce89438 rename iter to it, because iter is a concrete Python builtin 2023-01-31 23:34:02 +00:00
Andrej Karpathy
d2705bd92a tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap 2023-01-31 21:57:07 +00:00
Andrej Karpathy
4386bce1f4 adjust teaser figure with a more tuned result 2023-01-31 21:43:30 +00:00
Andrej Karpathy
924a0873eb merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training 2023-01-30 23:40:35 +00:00
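The careful part: with fp16 + GradScaler the gradients are still multiplied by the loss scale after backward, so they must be unscaled before clipping or the threshold applies to the wrong magnitudes. A sketch with assumed names (scaler, loss, optimizer, model, grad_clip):

```python
import torch

scaler.scale(loss).backward()
if grad_clip != 0.0:
    scaler.unscale_(optimizer)                                     # bring grads back to true scale
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
scaler.step(optimizer)                                             # skips the step if grads overflowed
scaler.update()
optimizer.zero_grad(set_to_none=True)
```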
Andrej Karpathy
ae06d0b15a add flash attention support, resolving the last few issues; for now it seems to work ok 2023-01-30 23:18:26 +00:00
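The fast path boils down to one call to PyTorch 2.0's fused kernel; a sketch assuming q, k, v shaped (B, n_head, T, head_dim) and a dropout_p/training pair coming from the module:

```python
import torch.nn.functional as F

y = F.scaled_dot_product_attention(
    q, k, v,
    attn_mask=None,
    dropout_p=dropout_p if training else 0.0,
    is_causal=True,   # the causal mask is applied inside the kernel, no explicit tril needed
)
```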
Andrej Karpathy
0e90ee9d48 based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup 2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7 stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2 2023-01-27 20:51:50 +00:00
Andrej Karpathy
79dbe0086d let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality 2023-01-27 20:45:28 +00:00
Andrej Karpathy
e808a67149 bunch of plumbing of bias all around. measuring bias=False to be about 6% faster 2023-01-27 20:41:17 +00:00
Andrej Karpathy
cc5444e194 add the bias option to config, default it to True for now 2023-01-27 20:29:45 +00:00
Andrej Karpathy
2bf07a3fbf rewrite model class so layernorm has an optional bias= parameter 2023-01-27 20:17:32 +00:00
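At the time nn.LayerNorm exposed no bias=False switch, so the natural move is to wrap F.layer_norm. A sketch in that spirit (not necessarily the commit's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    """LayerNorm with an optional bias term."""

    def __init__(self, ndim: int, bias: bool):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
```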
Andrej Karpathy
2892858ce7 attempt a non-biased model, per few papers that cite this as working well 2023-01-27 18:54:08 +00:00
Andrej Karpathy
f29a9ff5bf ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default. 2023-01-27 17:56:18 +00:00
Andrej Karpathy
23a0bfac20 try bringing back the mingpt init 2023-01-27 16:52:18 +00:00
Andrej Karpathy
3cb3fc059c grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think it becomes more necessary as the model gets larger. practitioners may consider turning it off for minor efficiency gains 2023-01-27 16:45:09 +00:00
Andrej Karpathy
e0c689cf38 allow the prompt to come from a file 2023-01-25 01:12:43 +00:00
Andrej Karpathy
21675d7755 allow sample.py to init from a pretrained gpt2 checkpoint as well, in a similar style to train.py 2023-01-25 00:55:29 +00:00
johnwildauer
e0e94a1094 use GradScaler in model only if dtype is float16 2023-01-24 15:53:31 -07:00
Andrej
6c40a08b41
Merge pull request #82 from danielgross/master
Missed two spots while relative pathing
2023-01-22 13:47:32 -08:00
DG
2f7fd0ac57 add relative import in shakespeare 2023-01-22 12:18:24 -08:00
DG
bf779456f3 add relative import in shakespeare_char 2023-01-22 11:11:25 -08:00
Andrej
3611338959
Merge pull request #71 from cchan/patch-1
Zero-grad more aggressively to save memory
2023-01-20 14:38:10 -08:00
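The memory saving here comes from optimizer.zero_grad(set_to_none=True): instead of filling the .grad tensors with zeros it drops them entirely, so their storage can be reclaimed between iterations. Sketch (optimizer assumed):

```python
optimizer.zero_grad(set_to_none=True)  # .grad becomes None rather than a zeroed tensor
```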
Andrej Karpathy
1f77d03024 make mentions of mps in docs. ty good people in issue #28 2023-01-20 21:28:20 +00:00
Andrej
a6bffeee59
Merge pull request #73 from danielgross/master
Use relative paths
2023-01-20 12:21:33 -08:00