Andrej Karpathy | eae986c2d2 | new notebook with a bunch of calculations related to flops and memory of Transformer | 2023-02-04 22:02:53 +00:00
Andrej Karpathy | a74e8363a2 | clean up TODOs a bit, they are stale | 2023-02-04 21:11:25 +00:00
Andrej Karpathy | 25d95dbd65 | mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags | 2023-02-04 21:06:17 +00:00
Andrej Karpathy | e108ffb973 | very slight refactor, bit cleaner | 2023-02-04 19:34:24 +00:00
Andrej | dc149891b6 | Merge pull request #120 from nynyg/remove_cpu_pin_mem: Pin memory only when training on GPU | 2023-02-04 11:28:08 -08:00
Nan Yang | b8286f343e | Pin memory only when training on GPU | 2023-02-04 11:16:26 -08:00
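The pin-memory change (b8286f343e) can be sketched as follows. This is a minimal illustration in the spirit of the commit, not the repo's actual code; `fetch_batch` is a hypothetical helper loosely modeled on a nanoGPT-style batch sampler.

```python
import torch

def fetch_batch(data, batch_size, block_size, device):
    # Sample a random batch of contiguous token windows from a 1D tensor.
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    if device.startswith('cuda'):
        # Pinning page-locked host memory only speeds up host->GPU copies;
        # on a CPU-only run it just wastes time and memory, hence the guard.
        x = x.pin_memory().to(device, non_blocking=True)
        y = y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y
```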
Andrej Karpathy | 77e7e04c26 | padding vocab_size 50257 -> 50304, the nearest multiple of 64. the biggest-deal smallest optimization i've made in the recent past: about 25% faster. this is because the last layer is a major latency bottleneck, consuming about 40% of the latency due to the very high channel count | 2023-02-04 16:06:18 +00:00
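The padding arithmetic behind 77e7e04c26 is simple rounding up; a minimal sketch (function name is illustrative, not from the repo):

```python
def pad_vocab(vocab_size, multiple=64):
    # Round the vocabulary size up to the nearest multiple of 64, so the
    # final projection's GEMM dimensions align with tensor-core tile sizes.
    return ((vocab_size + multiple - 1) // multiple) * multiple
```

For GPT-2's tokenizer this maps 50257 to 50304; the extra rows of the embedding/unembedding matrices are simply never indexed.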
Andrej Karpathy | b3c17c6c6a | slight tweak compressing LOC | 2023-02-04 15:57:29 +00:00
Andrej | 53d56b82f1 | Merge pull request #116 from ramtingh/master: Minor change to allow using ddp with exclusive process mode | 2023-02-04 07:42:32 -08:00
Ramtin Gharleghi | 9da1627c7f | Explicitly set ddp device | 2023-02-04 15:07:36 +11:00
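The idea behind 9da1627c7f: under exclusive-process compute mode each process may only touch the GPU it was assigned, so the DDP device must be pinned explicitly per rank rather than left to the default. A hedged sketch, assuming torchrun-style environment variables (`LOCAL_RANK`); the function name is hypothetical:

```python
import os
import torch

def setup_ddp_device():
    # Bind this DDP process to exactly one GPU, derived from its local rank,
    # instead of relying on the default device (which breaks under
    # exclusive-process mode). Falls back to CPU when CUDA is absent.
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    device = f'cuda:{local_rank}' if torch.cuda.is_available() else 'cpu'
    if device.startswith('cuda'):
        torch.cuda.set_device(device)
    return device
```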
Andrej Karpathy | 3fd4c0c5ef | who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol | 2023-02-04 02:52:48 +00:00
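The overlap trick in 3fd4c0c5ef: issue the asynchronous host-to-device copy of the *next* batch before handing the *current* one to the model, so the copy runs concurrently with the GPU compute. A minimal sketch of the pattern, not the repo's code (`batches` is a hypothetical generator):

```python
import torch

def batches(data, batch_size, block_size, device, steps):
    # Dataloader-free prefetching: kick off the next batch's non_blocking
    # copy (from pinned memory) before yielding the current batch, hiding
    # the transfer behind the model's forward/backward pass.
    def sample():
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i + block_size] for i in ix])
        if device.startswith('cuda'):
            return x.pin_memory().to(device, non_blocking=True)
        return x.to(device)

    nxt = sample()
    for _ in range(steps):
        cur, nxt = nxt, sample()  # start the next copy before compute runs
        yield cur
```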
Andrej | 46428d3142 | Merge pull request #115 from akashmjn/akashmjn/fix-notebook-stats: add template .gitattributes that fixes language stats | 2023-02-03 17:23:44 -08:00
Akash Mahajan | d9a73374ed | keep only what's needed | 2023-02-03 15:13:13 -08:00
Andrej Karpathy | 3969860ff5 | include launch command too. anyone should be able to do this now | 2023-02-03 22:17:05 +00:00
Andrej Karpathy | f9348f3f18 | add gpt2 training config | 2023-02-03 22:14:37 +00:00
Akash Mahajan | 0e2c12b5ae | add template .gitattributes that fixes language stats | 2023-02-03 13:36:36 -08:00
Andrej Karpathy | e170e40872 | use the new fused AdamW from pytorch nightly, if available | 2023-02-03 17:56:51 +00:00
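Commit e170e40872 opts into the fused CUDA AdamW kernel only when the installed PyTorch exposes it. One way to feature-detect is to probe the constructor signature; a sketch with illustrative hyperparameters (`make_adamw` is a hypothetical name):

```python
import inspect
import torch

def make_adamw(params, lr=6e-4, betas=(0.9, 0.95)):
    # Pass fused=True only if this PyTorch build's AdamW accepts it and a
    # GPU is present; older versions silently fall back to the default path.
    fused_ok = 'fused' in inspect.signature(torch.optim.AdamW).parameters
    extra = dict(fused=True) if fused_ok and torch.cuda.is_available() else dict()
    return torch.optim.AdamW(params, lr=lr, betas=betas, **extra)
```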
Andrej | 7d44bdf6b5 | Merge pull request #106 from YassineYousfi/master: use the ``enabled`` arg in GradScaler | 2023-02-02 17:23:22 -08:00
Andrej Karpathy | 1e87509e47 | if dropout > 0.0 disable Flash until pytorch fix. don't assert-fail, sigh | 2023-02-02 23:22:56 +00:00
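The selection logic of 1e87509e47, roughly: when the fused flash path could not be used with nonzero dropout (pending a PyTorch fix), degrade gracefully to the manual attention implementation instead of assert-failing. A hedged sketch; `pick_attention` and its return values are illustrative, not the repo's API:

```python
import torch

def pick_attention(dropout):
    # Use flash attention (scaled_dot_product_attention) only when it is
    # available and dropout is zero; otherwise fall back to the manual
    # implementation rather than crashing on an assert.
    has_flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
    if has_flash and dropout == 0.0:
        return 'flash'
    return 'manual'
```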
Andrej Karpathy | d8b1a94519 | change grad accum to default off because i think it just confuses everyone | 2023-02-02 18:38:49 +00:00
Andrej Karpathy | d01863ef01 | small usability tweaks to bench | 2023-02-02 17:23:46 +00:00
Yassine Yousfi | 40f4d6ff70 | use the enabled arg in GradScaler | 2023-01-31 21:12:49 -08:00
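The point of 40f4d6ff70 (and of e0e94a1094 further down): gradient scaling is only needed for fp16, whose narrow exponent range can underflow gradients. `GradScaler(enabled=False)` turns every scaler call into a no-op, so one training loop serves fp16, bf16, and fp32 alike. A minimal sketch (`make_scaler` is a hypothetical name):

```python
import torch

def make_scaler(dtype):
    # With enabled=False, scale()/step()/update() all become pass-throughs,
    # so the caller never needs to branch on dtype in the training loop.
    return torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
```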
Andrej Karpathy | d995c22128 | fix bug with loading GPT-2 parameters: the assert gets incorrectly tripped because .bias is now only optionally present, depending on flash or not | 2023-02-01 02:05:34 +00:00
Andrej Karpathy | 038ce89438 | rename iter to it, because iter is a concrete Python builtin | 2023-01-31 23:34:02 +00:00
Andrej Karpathy | d2705bd92a | tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap | 2023-01-31 21:57:07 +00:00
Andrej Karpathy | 4386bce1f4 | adjust teaser figure with a more tuned result | 2023-01-31 21:43:30 +00:00
Andrej Karpathy | 924a0873eb | merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training | 2023-01-30 23:40:35 +00:00
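The "careful" part of 924a0873eb: with fp16 the stored gradients are scaled, so `clip_grad_norm_` must see the true gradients. The standard order is unscale first, then clip, then let the scaler step (it skips the step if it finds infs/NaNs). A sketch of that ordering, not the repo's code (`step` is a hypothetical helper):

```python
import torch

def step(model, loss, optimizer, scaler, grad_clip=1.0):
    # Backward on the scaled loss, unscale so clipping operates on the
    # real gradient norm, clip, then step via the scaler so overflow
    # batches are skipped rather than applied.
    scaler.scale(loss).backward()
    if grad_clip > 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad(set_to_none=True)
```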
Andrej Karpathy | ae06d0b15a | add flash attention support, resolving the last few issues; for now seems to work ok | 2023-01-30 23:18:26 +00:00
Andrej Karpathy | 0e90ee9d48 | based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup | 2023-01-30 08:07:58 +00:00
Andrej Karpathy | 001c1e7be7 | stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and reproduces gpt2 | 2023-01-27 20:51:50 +00:00
Andrej Karpathy | 79dbe0086d | let me set bias=True until I validate it properly, but this should be ok to merge to master for now, it is equivalent to previous functionality | 2023-01-27 20:45:28 +00:00
Andrej Karpathy | e808a67149 | bunch of plumbing of bias all around. measuring bias=False to be about 6% faster | 2023-01-27 20:41:17 +00:00
Andrej Karpathy | cc5444e194 | add the bias option to config, default it to True for now | 2023-01-27 20:29:45 +00:00
Andrej Karpathy | 2bf07a3fbf | rewrite model class so layernorm has an optional bias= parameter | 2023-01-27 20:17:32 +00:00
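The bias plumbing in 2bf07a3fbf needs a custom LayerNorm because `nn.LayerNorm` did not expose a `bias=False` switch at the time. A minimal sketch of such a module, in the spirit of the commit:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    """LayerNorm with an optional bias, since nn.LayerNorm had no
    bias=False switch when this was written."""

    def __init__(self, ndim, bias):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        # With bias=False the parameter is simply absent, so the layer
        # does one fewer add per call (and stores fewer parameters).
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x):
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
```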
Andrej Karpathy | 2892858ce7 | attempt a non-biased model, per a few papers that cite this as working well | 2023-01-27 18:54:08 +00:00
Andrej Karpathy | f29a9ff5bf | ok i tried bringing back the original init again and this time it makes a ton of difference and works much better than the default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later; for now, merged the original mingpt init (following the gpt-2 paper) as default | 2023-01-27 17:56:18 +00:00
Andrej Karpathy | 23a0bfac20 | try bringing back mingpt init | 2023-01-27 16:52:18 +00:00
Andrej Karpathy | 3cb3fc059c | grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think it becomes more necessary as the model grows larger. practitioners may consider turning it off for minor efficiency gains | 2023-01-27 16:45:09 +00:00
Andrej Karpathy | e0c689cf38 | allow the prompt to come from a file | 2023-01-25 01:12:43 +00:00
Andrej Karpathy | 21675d7755 | allow sample.py to init from a pretrained gpt2 checkpoint as well, in a similar style to train.py | 2023-01-25 00:55:29 +00:00
johnwildauer | e0e94a1094 | use GradScaler in model only if dtype is float16 | 2023-01-24 15:53:31 -07:00
Andrej | 6c40a08b41 | Merge pull request #82 from danielgross/master: Missed two spots while relative pathing | 2023-01-22 13:47:32 -08:00
DG | 2f7fd0ac57 | add relative import in shakespeare | 2023-01-22 12:18:24 -08:00
DG | bf779456f3 | add relative import in shakespeare_char | 2023-01-22 11:11:25 -08:00
venusatuluri | f9d8020f48 | Fix decode fn in shakespeare_char/prepare.py | 2023-01-21 06:14:16 +00:00
Andrej | 3611338959 | Merge pull request #71 from cchan/patch-1: Zero-grad more aggressively to save memory | 2023-01-20 14:38:10 -08:00
Andrej Karpathy | 1f77d03024 | make mentions of mps in docs. ty good people in issue #28 | 2023-01-20 21:28:20 +00:00
Andrej | a6bffeee59 | Merge pull request #73 from danielgross/master: Use relative paths | 2023-01-20 12:21:33 -08:00
DG | edb7a7eab0 | use relative paths so that running the data prep scripts always creates files in the local folder, no matter where they are run from | 2023-01-20 10:39:45 -08:00
Clive Chan | 67166079c9 | Zero-grad more aggressively to save memory | 2023-01-19 22:10:44 -08:00
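The memory saving in 67166079c9 comes from `zero_grad(set_to_none=True)`: instead of filling `.grad` buffers with zeros, it drops them outright, so the allocator can reuse that memory during the next forward pass. A sketch of where the call sits in a training step (the helper and its loss are illustrative, not the repo's code):

```python
import torch

def train_step(model, x, y, optimizer):
    # Forward, backward, step, then free the gradients immediately:
    # set_to_none=True releases the .grad tensors rather than zeroing them.
    logits = model(x)
    loss = torch.nn.functional.mse_loss(logits, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```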