mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2025-01-18 13:12:53 +00:00
155 Commits

Author SHA1 Message Date
Andrej
ea24604b29
Merge pull request #220 from python273/patch-1
Fix GPT.crop_block_size when flash attention is available
2023-04-12 22:13:01 -07:00
Andrej
8aeea6d970
Merge pull request #224 from SnehalRaj/patch-1
fix small typo
2023-04-12 22:12:26 -07:00
Andrej
2457471c9c
Merge pull request #236 from ymurenko/master
fix "cuda out of memory" when resuming training
2023-04-12 22:09:42 -07:00
Andrej Karpathy
553f949f46 fix minor bug where we have to scale the loss to account for gradient accumulation, which sums before backprop. note that this is not a major bug because AdamW is scale invariant. however, this did affect gradient clipping 2023-04-13 04:59:11 +00:00
ymurenko
4ac2e8ce3a fix "cuda out of memory" when resuming training 2023-04-05 17:28:55 -04:00
Snehal Raj
c58fc4605c
fix small typo 2023-03-25 20:36:46 +01:00
Kirill
c3f254844d
Fix GPT.crop_block_size when flash attention is available 2023-03-24 14:51:02 +03:00
Andrej
a82b33b525
Merge pull request #199 from ChristianOrr/patch-1
bugfix in decode function
2023-03-12 13:40:20 -07:00
Christian Orr
36c7db8c44
bugfix in decode function
Return was left out of the decoder, so it didn't work.
2023-03-08 10:16:19 +02:00
Andrej
0d8fbd11ae
Merge pull request #195 from drisspg/enable_sdpa_with_nonzero_dropout
Enable sdpa for nonzero dropout
2023-03-06 21:47:20 -08:00
Driss Guessous
6170531b8a enable sdpa for nonzero dropout 2023-03-05 19:29:29 +00:00
Andrej
ae3a8d5fdd
Merge pull request #145 from otaviogood/gradAccumStability
fix for training stability on single GPU
2023-02-14 18:48:54 -08:00
Otavio Good
086ebe1822 fix for training stability on single GPU 2023-02-13 10:42:44 -08:00
Andrej Karpathy
55c5069696 fix misinformation in readme 2023-02-10 16:34:46 +00:00
Andrej Karpathy
e58f0cfa94 oops i should not be needing or multiplying by world_size to calculate mfu 2023-02-07 21:38:39 +00:00
Andrej Karpathy
8b1e43209e small tweaks, make default WD be 0.1 as is often cited, and remove spurious init of LayerNorm, which is already initialized at 1,0 2023-02-06 23:07:25 +00:00
Andrej Karpathy
ab21d6c15d bugfix we have to call the raw_model's estimate_mfu ty @jprobichaud for original PR 2023-02-06 19:55:35 +00:00
Andrej Karpathy
f83dd034e1 also add a sampling/inference section 2023-02-05 21:02:30 +00:00
Andrej Karpathy
23a8e701d2 revamp the readme file to be a bit better and more accessible, i hope 2023-02-05 19:31:32 +00:00
Andrej Karpathy
fce706cbe6 tune the hyperparams a bit, in configs 2023-02-05 19:31:18 +00:00
Andrej Karpathy
ab0718a7dd add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving 2023-02-05 00:48:58 +00:00
Andrej Karpathy
580902617c oops optimizer now demands to know device_type 2023-02-05 00:43:15 +00:00
Andrej Karpathy
34720df284 make more accurate the way in which we count parameters. previous count incorrectly included the positional encoding params, when typically only the number of weight parameters is reported for these models 2023-02-04 23:51:18 +00:00
Andrej Karpathy
3341b4cecc oops forgot to subtract embedding params, which don't enter the 6ND equation 2023-02-04 22:33:35 +00:00
Andrej Karpathy
5a162bc773 fix silly error, i don't want to confuse a future GPT training on this notebook in the future 2023-02-04 22:11:16 +00:00
Andrej Karpathy
0bb96d3fff add reference for 6ND to notebook too 2023-02-04 22:07:32 +00:00
Andrej Karpathy
eae986c2d2 new notebook with a bunch of calculations related to flops and memory of Transformer 2023-02-04 22:02:53 +00:00
Andrej Karpathy
a74e8363a2 clean up TODOs a bit, they are stale 2023-02-04 21:11:25 +00:00
Andrej Karpathy
25d95dbd65 mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags 2023-02-04 21:06:17 +00:00
Andrej Karpathy
e108ffb973 very slight refactor, bit cleaner 2023-02-04 19:34:24 +00:00
Andrej
dc149891b6
Merge pull request #120 from nynyg/remove_cpu_pin_mem
Pin memory only when training on GPU
2023-02-04 11:28:08 -08:00
Nan Yang
b8286f343e Pin memory only when training on GPU 2023-02-04 11:16:26 -08:00
Andrej Karpathy
77e7e04c26 padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count. 2023-02-04 16:06:18 +00:00
Andrej Karpathy
b3c17c6c6a slight tweak compressing LOC 2023-02-04 15:57:29 +00:00
Andrej
53d56b82f1
Merge pull request #116 from ramtingh/master
Minor change to allow using ddp with exclusive process mode
2023-02-04 07:42:32 -08:00
Ramtin Gharleghi
9da1627c7f
Explicitly set ddp device 2023-02-04 15:07:36 +11:00
Andrej Karpathy
3fd4c0c5ef who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol 2023-02-04 02:52:48 +00:00
Andrej
46428d3142
Merge pull request #115 from akashmjn/akashmjn/fix-notebook-stats
add template .gitattributes that fixes language stats
2023-02-03 17:23:44 -08:00
Akash Mahajan
d9a73374ed
keep only what's needed 2023-02-03 15:13:13 -08:00
Andrej Karpathy
3969860ff5 include launch command too. anyone should be able to do this now 2023-02-03 22:17:05 +00:00
Andrej Karpathy
f9348f3f18 add gpt2 training config 2023-02-03 22:14:37 +00:00
Akash Mahajan
0e2c12b5ae
add template .gitattributes that fixes language stats 2023-02-03 13:36:36 -08:00
Andrej Karpathy
e170e40872 use the new fused AdamW from pytorch nightly, if available 2023-02-03 17:56:51 +00:00
Andrej
7d44bdf6b5
Merge pull request #106 from YassineYousfi/master
use the ``enabled`` arg in GradScaler
2023-02-02 17:23:22 -08:00
Andrej Karpathy
1e87509e47 if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh 2023-02-02 23:22:56 +00:00
Andrej Karpathy
d8b1a94519 change grad accum to default off because i think it just confuses everyone 2023-02-02 18:38:49 +00:00
Andrej Karpathy
d01863ef01 small usability tweaks to bench 2023-02-02 17:23:46 +00:00
Yassine Yousfi
40f4d6ff70 use the enabled arg in GradScaler 2023-01-31 21:12:49 -08:00
Andrej Karpathy
d995c22128 fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not 2023-02-01 02:05:34 +00:00
Andrej Karpathy
038ce89438 rename iter to it, because iter is a concrete Python builtin 2023-01-31 23:34:02 +00:00
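
Note on commit 77e7e04c26 (vocab_size padding): the arithmetic behind 50257 -> 50304 can be checked with a few lines of Python. This is only a sketch of the rounding step, not code from the repository. The helper name `pad_to_multiple` is hypothetical, and it assumes the padding rounds up to the next multiple of 64, since a vocabulary cannot be shrunk below the tokenizer's size.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple of `multiple`.

    Hypothetical helper for illustration; not taken from the repository.
    """
    return ((n + multiple - 1) // multiple) * multiple


# GPT-2 tokenizer vocabulary size, as given in the commit message.
gpt2_vocab_size = 50257

padded_vocab_size = pad_to_multiple(gpt2_vocab_size)
extra_rows = padded_vocab_size - gpt2_vocab_size

print(padded_vocab_size)  # 50304
print(extra_rows)         # 47
```

The 47 extra entries correspond to token ids the tokenizer never produces, so under this assumption they only affect the shape of the embedding and output layers, not which tokens can appear in the data.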