mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-09-21 11:49:46 +00:00
Commit Graph

50 Commits

Author SHA1 Message Date
Andrej Karpathy
8b1e43209e small tweaks, make default WD be 0.1 as is often cited, and remove spurious init of LayerNorm, which is already initialized at 1,0 2023-02-06 23:07:25 +00:00
Andrej Karpathy
ab21d6c15d bugfix we have to call the raw_model's estimate_mfu ty @jprobichaud for original PR 2023-02-06 19:55:35 +00:00
Andrej Karpathy
ab0718a7dd add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving 2023-02-05 00:48:58 +00:00
Andrej Karpathy
a74e8363a2 clean up TODOs a bit, they are stale 2023-02-04 21:11:25 +00:00
Andrej Karpathy
25d95dbd65 mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags 2023-02-04 21:06:17 +00:00
Andrej Karpathy
e108ffb973 very slight refactor, bit cleaner 2023-02-04 19:34:24 +00:00
Nan Yang
b8286f343e Pin memory only when training on GPU 2023-02-04 11:16:26 -08:00
Andrej Karpathy
77e7e04c26 padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count. 2023-02-04 16:06:18 +00:00
Andrej Karpathy
b3c17c6c6a slight tweak compressing LOC 2023-02-04 15:57:29 +00:00
Ramtin Gharleghi
9da1627c7f Explicitly set ddp device 2023-02-04 15:07:36 +11:00
Andrej Karpathy
3fd4c0c5ef who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol 2023-02-04 02:52:48 +00:00
Andrej
7d44bdf6b5 Merge pull request #106 from YassineYousfi/master: use the ``enabled`` arg in GradScaler 2023-02-02 17:23:22 -08:00
Andrej Karpathy
d8b1a94519 change grad accum to default off because i think it just confuses everyone 2023-02-02 18:38:49 +00:00
Yassine Yousfi
40f4d6ff70 use the enabled arg in GradScaler 2023-01-31 21:12:49 -08:00
Andrej Karpathy
038ce89438 rename iter to it, because iter is a concrete Python builtin 2023-01-31 23:34:02 +00:00
Andrej Karpathy
924a0873eb merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training 2023-01-30 23:40:35 +00:00
Andrej Karpathy
0e90ee9d48 based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup 2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7 stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2 2023-01-27 20:51:50 +00:00
Andrej Karpathy
79dbe0086d let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality 2023-01-27 20:45:28 +00:00
Andrej Karpathy
e808a67149 bunch of plumbing of bias all around. measuring bias=False to be about 6% faster 2023-01-27 20:41:17 +00:00
Andrej Karpathy
3cb3fc059c grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains 2023-01-27 16:45:09 +00:00
johnwildauer
e0e94a1094 use GradScaler in model only if dtype is float16 2023-01-24 15:53:31 -07:00
Andrej
3611338959 Merge pull request #71 from cchan/patch-1: Zero-grad more aggressively to save memory 2023-01-20 14:38:10 -08:00
Andrej Karpathy
1f77d03024 make mentions of mps in docs. ty good people in issue #28 2023-01-20 21:28:20 +00:00
Clive Chan
67166079c9 Zero-grad more aggressively to save memory 2023-01-19 22:10:44 -08:00
Andrej Karpathy
46ce9971df small tweaks to docs and variable names stylistically 2023-01-16 16:56:05 +00:00
Andrej Karpathy
684800dd87 clarify that these should be run on two separate machines 2023-01-16 06:02:46 +00:00
Andrej Karpathy
9352df23de docs for multinode ddp 2023-01-16 05:57:33 +00:00
Andrej Karpathy
c3dddbff3d get rid of gpu_id, the world is more complicated than that when world_size > 8 2023-01-16 05:44:50 +00:00
Andrej Karpathy
f5e6ac8b02 local rank -> rank 2023-01-16 05:13:13 +00:00
Andrej Karpathy
cf99914886 add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR 2023-01-15 17:49:55 +00:00
Andrej Karpathy
57735f532d correctly propagate the vocab_size from the rendered dataset into the model args 2023-01-14 02:26:44 +00:00
Andrej Karpathy
8f85b83347 inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token 2023-01-12 06:02:50 +00:00
Andrej Karpathy
d17350a31d add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders. 2023-01-11 05:27:19 +00:00
Andrej Karpathy
c2a402f7f7 guess the config from globals() and log all of it with wandb 2023-01-11 01:00:22 +00:00
Andrej Karpathy
a855d316fd add device and dtype support to train.py args 2023-01-08 19:20:38 +00:00
Luca Antiga
09f1f458e8 Move conditional import 2023-01-08 15:51:50 +01:00
Luca Antiga
aba47f0a35 Make wandb import conditioned to wandb_log=True 2023-01-08 15:42:08 +01:00
Andrej Karpathy
9629093e53 minor args re-arranging and removing some spurious ones like wandb entity ty @tcapelle 2023-01-05 01:14:02 +00:00
Andrej Karpathy
d562b3e550 shuttling the poor mans configurator aside into its own file and adding it to all of train,sample,bench. because i am leaving args in globals() so i can avoid having to prepend every single variable with an args., i have to exec the configurator and the optional configs. so we're left with something very gross by standard convention but also quite simple and functional. *ducks* 2023-01-05 00:44:35 +00:00
Andrej Karpathy
9f95aca93e better hyperparams for gpt2 124M model on A100 40GB. still uncertain about max_iters especially, and a bit about weight decay, betas 2023-01-03 17:45:49 +00:00
Andrej Karpathy
ec9b1f8182 add a patch to fix mysterious unwanted prefix in state dict? maybe remove later 2023-01-02 01:25:02 +00:00
Andrej Karpathy
35f51974c4 rename to compile it's shorter 2023-01-02 01:14:46 +00:00
Andrej Karpathy
2febf4463c candidate changes to apis, have to think through more 2023-01-01 01:29:48 +00:00
Andrej Karpathy
5a725d9098 add torch.compile by default, shows almost 1.8X improvement in throughput nice 2022-12-30 00:07:13 +00:00
Andrej Karpathy
682a0ac8f1 properly resume training, also loading iter_num and best_val_loss from checkpoints 2022-12-29 18:23:15 +00:00
Andrej Karpathy
dea1507252 add support for DDP training. the scaling timings right now do not look good by default, have to dig more into 2022-12-29 05:06:07 +00:00
Andrej Karpathy
5d2b4807bf adding a lightweight configurator that may be a terrible mistake lol. also adding configs to evaluate the baseline GPT2 versions released by OpenAI on OWT. we have some ways to go to match those numbers atm 2022-12-28 23:31:23 +00:00
Andrej Karpathy
c9fe00c0e9 small readme clarification and training script defaults changes 2022-12-28 01:45:55 +00:00
Andrej Karpathy
fe8042867c first very bad commit 2022-12-28 00:58:19 +00:00
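
Sketch for commit 77e7e04c26 (vocab_size padding). The commit message says the vocabulary is padded from 50257 to 50304, the nearest multiple of 64. The rounding itself might look like the following; the helper name pad_to_multiple is hypothetical and not taken from the repository.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple of `multiple` (hypothetical helper)."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((n + multiple - 1) // multiple) * multiple


gpt2_vocab_size = 50257  # figure quoted in the commit message
padded_vocab_size = pad_to_multiple(gpt2_vocab_size, 64)
print(padded_vocab_size)  # 50304, matching the commit message
```

The extra 47 rows of the embedding and output matrices would simply go unused; the commit message attributes the speedup to the final layer's very high channel count.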
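
Sketch for commit ab0718a7dd (model flops utilization). The commit message describes MFU as token throughput expressed in units of A100 bfloat16 peak flops (312 TFLOPS). One way to compute such a ratio is sketched below. The per-token FLOP estimate (6N for the weight matmuls plus 12*L*H*Q*T for attention) is an assumed approximation, and the function name, parameter names, and example numbers are all hypothetical rather than taken from the repository or from any measurement.

```python
A100_BF16_PEAK_FLOPS = 312e12  # peak figure quoted in the commit message


def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 tokens_per_iter, seconds_per_iter,
                 peak_flops=A100_BF16_PEAK_FLOPS):
    """Return achieved FLOPs per second as a fraction of peak_flops."""
    # Assumed approximation: 6*N FLOPs per token for forward + backward
    # through the weights, plus 12*L*H*Q*T for the attention matmuls.
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * tokens_per_iter
    achieved_flops_per_second = flops_per_iter / seconds_per_iter
    return achieved_flops_per_second / peak_flops


# Illustrative inputs only; these are not measurements from the repository.
mfu = estimate_mfu(n_params=124e6, n_layer=12, n_head=12, head_dim=64,
                   seq_len=1024, tokens_per_iter=12 * 1024,
                   seconds_per_iter=0.1)
print(f"MFU: {mfu:.2%}")
```

A value near 1.0 would mean the run is close to the hardware's quoted peak; on devices other than an A100 the peak_flops argument would need to change.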
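
Sketch for commit cf99914886 (gradient accumulation). The commit message says accumulation is used to simulate larger batch sizes. The toy example below, in plain Python with a one-parameter linear model, illustrates the general idea: summing micro-batch gradients, each scaled by 1/accum_steps, reproduces the gradient of the mean loss over the full batch. The function names and data are hypothetical, and whether the repository scales the loss in exactly this way is an assumption.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accum_steps):
    """Accumulate gradients over equal-sized micro-batches."""
    if len(xs) % accum_steps != 0:
        raise ValueError("this sketch assumes equal-sized micro-batches")
    micro = len(xs) // accum_steps
    grad = 0.0
    for step in range(accum_steps):
        lo, hi = step * micro, (step + 1) * micro
        # Scale each micro-batch gradient so the sum equals the full-batch mean.
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
full = mse_grad(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, accum_steps=3)
print(abs(full - accum) < 1e-9)
```

Only one micro-batch needs to be held in memory at a time, which is what lets a small device stand in for a larger batch.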
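
Sketch for commit 3cb3fc059c (gradient clipping). The commit message does not say which clipping scheme is used; the example below assumes the common global-norm variant, where all gradients are rescaled together when their combined L2 norm exceeds a threshold. The function name and the flat-list representation are hypothetical simplifications.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Scale a flat list of gradients so their L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        # Already within the limit: return an unchanged copy.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm)     # original norm of the gradient vector
print(clipped)  # same direction, rescaled to norm 1.0
```

Computing the norm touches every gradient value, which is consistent with the commit's remark that clipping has a non-negligible cost. With fp16 loss scaling (commit 924a0873eb), gradients would need to be unscaled before the norm is computed; that step is omitted here.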
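
Sketch for commit 8f85b83347 (inference-time lm_head on the last token). The commit message notes that during inference the output projection only needs to be applied to the very last token. The plain-Python toy below shows why that is safe: the logits computed for the final position alone equal the last row of the logits computed for every position. The names matvec, logits_all_positions, and logits_last_position, and the tiny matrices, are hypothetical.

```python
def matvec(weights, vec):
    """Multiply a (vocab x dim) weight matrix by a dim-sized vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def logits_all_positions(hidden, lm_head):
    """Project every position's hidden state to vocabulary logits."""
    return [matvec(lm_head, h) for h in hidden]


def logits_last_position(hidden, lm_head):
    """Project only the final position, as needed for next-token sampling."""
    return matvec(lm_head, hidden[-1])


# 3 positions, hidden size 2, vocabulary of 4 (toy numbers).
hidden = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]

full = logits_all_positions(hidden, lm_head)
last = logits_last_position(hidden, lm_head)
print(full[-1] == last)
```

For a sequence of T positions this skips T-1 of the T projections, which matters most when the vocabulary is large.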