Andrej Karpathy
|
ab0718a7dd
|
add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving
|
2023-02-05 00:48:58 +00:00 |
|
Andrej Karpathy
|
580902617c
|
oops optimizer now demands to know device_type
|
2023-02-05 00:43:15 +00:00 |
|
Andrej Karpathy
|
34720df284
|
make more accurate the way in which we count parameters. previous count incorrectly included the positional encoding params, when typically only the number of weight parameters is reported for these models
|
2023-02-04 23:51:18 +00:00 |
|
Andrej Karpathy
|
3341b4cecc
|
oops forgot to subtract embedding params, which don't enter the 6ND equation
|
2023-02-04 22:33:35 +00:00 |
|
Andrej Karpathy
|
5a162bc773
|
fix silly error, i don't want to confuse a future GPT training on this notebook
|
2023-02-04 22:11:16 +00:00 |
|
Andrej Karpathy
|
0bb96d3fff
|
add reference for 6ND to notebook too
|
2023-02-04 22:07:32 +00:00 |
|
Andrej Karpathy
|
eae986c2d2
|
new notebook with a bunch of calculations related to flops and memory of Transformer
|
2023-02-04 22:02:53 +00:00 |
|
Andrej Karpathy
|
a74e8363a2
|
clean up TODOs a bit, they are stale
|
2023-02-04 21:11:25 +00:00 |
|
Andrej Karpathy
|
25d95dbd65
|
mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags
|
2023-02-04 21:06:17 +00:00 |
|
Andrej Karpathy
|
e108ffb973
|
very slight refactor, bit cleaner
|
2023-02-04 19:34:24 +00:00 |
|
Andrej
|
dc149891b6
|
Merge pull request #120 from nynyg/remove_cpu_pin_mem
Pin memory only when training on GPU
|
2023-02-04 11:28:08 -08:00 |
|
Nan Yang
|
b8286f343e
|
Pin memory only when training on GPU
|
2023-02-04 11:16:26 -08:00 |
|
Andrej Karpathy
|
77e7e04c26
|
padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count.
|
2023-02-04 16:06:18 +00:00 |
|
Andrej Karpathy
|
b3c17c6c6a
|
slight tweak compressing LOC
|
2023-02-04 15:57:29 +00:00 |
|
Andrej
|
53d56b82f1
|
Merge pull request #116 from ramtingh/master
Minor change to allow using ddp with exclusive process mode
|
2023-02-04 07:42:32 -08:00 |
|
Ramtin Gharleghi
|
9da1627c7f
|
Explicitly set ddp device
|
2023-02-04 15:07:36 +11:00 |
|
Andrej Karpathy
|
3fd4c0c5ef
|
who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol
|
2023-02-04 02:52:48 +00:00 |
|
Andrej
|
46428d3142
|
Merge pull request #115 from akashmjn/akashmjn/fix-notebook-stats
add template .gitattributes that fixes language stats
|
2023-02-03 17:23:44 -08:00 |
|
Akash Mahajan
|
d9a73374ed
|
keep only what's needed
|
2023-02-03 15:13:13 -08:00 |
|
Andrej Karpathy
|
3969860ff5
|
include launch command too. anyone should be able to do this now
|
2023-02-03 22:17:05 +00:00 |
|
Andrej Karpathy
|
f9348f3f18
|
add gpt2 training config
|
2023-02-03 22:14:37 +00:00 |
|
Akash Mahajan
|
0e2c12b5ae
|
add template .gitattributes that fixes language stats
|
2023-02-03 13:36:36 -08:00 |
|
Andrej Karpathy
|
e170e40872
|
use the new fused AdamW from pytorch nightly, if available
|
2023-02-03 17:56:51 +00:00 |
|
Andrej
|
7d44bdf6b5
|
Merge pull request #106 from YassineYousfi/master
use the ``enabled`` arg in GradScaler
|
2023-02-02 17:23:22 -08:00 |
|
Andrej Karpathy
|
1e87509e47
|
if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh
|
2023-02-02 23:22:56 +00:00 |
|
Andrej Karpathy
|
d8b1a94519
|
change grad accum to default off because i think it just confuses everyone
|
2023-02-02 18:38:49 +00:00 |
|
Andrej Karpathy
|
d01863ef01
|
small usability tweaks to bench
|
2023-02-02 17:23:46 +00:00 |
|
Yassine Yousfi
|
40f4d6ff70
|
use the enabled arg in GradScaler
|
2023-01-31 21:12:49 -08:00 |
|
Andrej Karpathy
|
d995c22128
|
fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not
|
2023-02-01 02:05:34 +00:00 |
|
Andrej Karpathy
|
038ce89438
|
rename iter to it, because iter is a concrete Python builtin
|
2023-01-31 23:34:02 +00:00 |
|
Andrej Karpathy
|
d2705bd92a
|
tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap
|
2023-01-31 21:57:07 +00:00 |
|
Andrej Karpathy
|
4386bce1f4
|
adjust teaser figure with a more tuned result
|
2023-01-31 21:43:30 +00:00 |
|
Andrej Karpathy
|
924a0873eb
|
merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training
|
2023-01-30 23:40:35 +00:00 |
|
Andrej Karpathy
|
ae06d0b15a
|
add flash attention support, resolving last few issues but for now seems to work ok
|
2023-01-30 23:18:26 +00:00 |
|
Andrej Karpathy
|
0e90ee9d48
|
based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup
|
2023-01-30 08:07:58 +00:00 |
|
Andrej Karpathy
|
001c1e7be7
|
stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2
|
2023-01-27 20:51:50 +00:00 |
|
Andrej Karpathy
|
79dbe0086d
|
let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality
|
2023-01-27 20:45:28 +00:00 |
|
Andrej Karpathy
|
e808a67149
|
bunch of plumbing of bias all around. measuring bias=False to be about 6% faster
|
2023-01-27 20:41:17 +00:00 |
|
Andrej Karpathy
|
cc5444e194
|
add the bias option to config, default it to True for now
|
2023-01-27 20:29:45 +00:00 |
|
Andrej Karpathy
|
2bf07a3fbf
|
rewrite model class so layernorm has an optional bias= parameter
|
2023-01-27 20:17:32 +00:00 |
|
Andrej Karpathy
|
2892858ce7
|
attempt a non-biased model, per few papers that cite this as working well
|
2023-01-27 18:54:08 +00:00 |
|
Andrej Karpathy
|
f29a9ff5bf
|
ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default.
|
2023-01-27 17:56:18 +00:00 |
|
Andrej Karpathy
|
23a0bfac20
|
try bring back mingpt init
|
2023-01-27 16:52:18 +00:00 |
|
Andrej Karpathy
|
3cb3fc059c
|
grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains
|
2023-01-27 16:45:09 +00:00 |
|
Andrej Karpathy
|
e0c689cf38
|
allow the prompt to come from a file
|
2023-01-25 01:12:43 +00:00 |
|
Andrej Karpathy
|
21675d7755
|
allow sample.py to init from pretrained gpt2 checkpoints as well, in similar style to train.py
|
2023-01-25 00:55:29 +00:00 |
|
johnwildauer
|
e0e94a1094
|
use GradScaler in model only if dtype is float16
|
2023-01-24 15:53:31 -07:00 |
|
Andrej
|
6c40a08b41
|
Merge pull request #82 from danielgross/master
Missed two spots while relative pathing
|
2023-01-22 13:47:32 -08:00 |
|
DG
|
2f7fd0ac57
|
add relative import in shakespeare
|
2023-01-22 12:18:24 -08:00 |
|
DG
|
bf779456f3
|
add relative import in shakespeare_char
|
2023-01-22 11:11:25 -08:00 |
|
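Two of the calculations mentioned in the log above (padding vocab_size up to a multiple of 64, and estimating MFU against the A100 bfloat16 peak of 312 TFLOPS) can be sketched roughly as follows. This is a minimal illustration, not the code from the commits: the function names `pad_vocab_size` and `estimate_mfu` are hypothetical, and the MFU estimate assumes the simplified 6N FLOPs-per-token rule implied by the 6ND equation, ignoring any attention-specific term.

```python
A100_BF16_PEAK_FLOPS = 312e12  # 312 TFLOPS, the peak figure cited in the log


def pad_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the next multiple of `multiple` (hypothetical helper)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


def estimate_mfu(num_params: float, tokens_per_second: float,
                 peak_flops: float = A100_BF16_PEAK_FLOPS) -> float:
    """Rough MFU estimate (hypothetical helper).

    Assumes about 6 FLOPs per parameter per token for a forward plus
    backward pass (the 6ND rule, with N = non-embedding parameters),
    and divides the achieved FLOPs per second by the hardware peak.
    """
    flops_per_token = 6.0 * num_params
    achieved_flops_per_second = flops_per_token * tokens_per_second
    return achieved_flops_per_second / peak_flops


if __name__ == "__main__":
    print(pad_vocab_size(50257))         # GPT-2 vocab padded to a multiple of 64
    print(estimate_mfu(124e6, 100_000))  # illustrative numbers, not measurements
```

The parameter count and throughput passed to `estimate_mfu` here are placeholder values chosen for illustration, not results reported in any of the commits.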