Andrej Karpathy
|
a74e8363a2
|
clean up TODOs a bit, they are stale
|
2023-02-04 21:11:25 +00:00 |
|
Andrej Karpathy
|
25d95dbd65
|
mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags
|
2023-02-04 21:06:17 +00:00 |
|
Andrej Karpathy
|
e108ffb973
|
very slight refactor, bit cleaner
|
2023-02-04 19:34:24 +00:00 |
|
Nan Yang
|
b8286f343e
|
Pin memory only when training on GPU
|
2023-02-04 11:16:26 -08:00 |
|
Andrej Karpathy
|
77e7e04c26
|
padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count.
|
2023-02-04 16:06:18 +00:00 |
|
Andrej Karpathy
|
b3c17c6c6a
|
slight tweak compressing LOC
|
2023-02-04 15:57:29 +00:00 |
|
Ramtin Gharleghi
|
9da1627c7f
|
Explicitly set ddp device
|
2023-02-04 15:07:36 +11:00 |
|
Andrej Karpathy
|
3fd4c0c5ef
|
who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol
|
2023-02-04 02:52:48 +00:00 |
|
Andrej
|
7d44bdf6b5
|
Merge pull request #106 from YassineYousfi/master
use the ``enabled`` arg in GradScaler
|
2023-02-02 17:23:22 -08:00 |
|
Andrej Karpathy
|
d8b1a94519
|
change grad accum to default off because i think it just confuses everyone
|
2023-02-02 18:38:49 +00:00 |
|
Yassine Yousfi
|
40f4d6ff70
|
use the enabled arg in GradScaler
|
2023-01-31 21:12:49 -08:00 |
|
Andrej Karpathy
|
038ce89438
|
rename iter to it, because iter is a concrete Python builtin
|
2023-01-31 23:34:02 +00:00 |
|
Andrej Karpathy
|
924a0873eb
|
merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training
|
2023-01-30 23:40:35 +00:00 |
|
Andrej Karpathy
|
0e90ee9d48
|
based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup
|
2023-01-30 08:07:58 +00:00 |
|
Andrej Karpathy
|
001c1e7be7
|
stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2
|
2023-01-27 20:51:50 +00:00 |
|
Andrej Karpathy
|
79dbe0086d
|
let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality
|
2023-01-27 20:45:28 +00:00 |
|
Andrej Karpathy
|
e808a67149
|
bunch of plumbing of bias all around. measuring bias=False to be about 6% faster
|
2023-01-27 20:41:17 +00:00 |
|
Andrej Karpathy
|
3cb3fc059c
|
grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains
|
2023-01-27 16:45:09 +00:00 |
|
johnwildauer
|
e0e94a1094
|
use GradScaler in model only if dtype is float16
|
2023-01-24 15:53:31 -07:00 |
|
Andrej
|
3611338959
|
Merge pull request #71 from cchan/patch-1
Zero-grad more aggressively to save memory
|
2023-01-20 14:38:10 -08:00 |
|
Andrej Karpathy
|
1f77d03024
|
make mentions of mps in docs. ty good people in issue #28
|
2023-01-20 21:28:20 +00:00 |
|
Clive Chan
|
67166079c9
|
Zero-grad more aggressively to save memory
|
2023-01-19 22:10:44 -08:00 |
|
Andrej Karpathy
|
46ce9971df
|
small tweaks to docs and variable names stylistically
|
2023-01-16 16:56:05 +00:00 |
|
Andrej Karpathy
|
684800dd87
|
clarify that these should be run on two separate machines
|
2023-01-16 06:02:46 +00:00 |
|
Andrej Karpathy
|
9352df23de
|
docs for multinode ddp
|
2023-01-16 05:57:33 +00:00 |
|
Andrej Karpathy
|
c3dddbff3d
|
get rid of gpu_id, the world is more complicated than that when world_size > 8
|
2023-01-16 05:44:50 +00:00 |
|
Andrej Karpathy
|
f5e6ac8b02
|
local rank -> rank
|
2023-01-16 05:13:13 +00:00 |
|
Andrej Karpathy
|
cf99914886
|
add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR
|
2023-01-15 17:49:55 +00:00 |
|
Andrej Karpathy
|
57735f532d
|
correctly propagate the vocab_size from the rendered dataset into the model args
|
2023-01-14 02:26:44 +00:00 |
|
Andrej Karpathy
|
8f85b83347
|
inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token
|
2023-01-12 06:02:50 +00:00 |
|
Andrej Karpathy
|
d17350a31d
|
add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders.
|
2023-01-11 05:27:19 +00:00 |
|
Andrej Karpathy
|
c2a402f7f7
|
guess the config from globals() and log all of it with wandb
|
2023-01-11 01:00:22 +00:00 |
|
Andrej Karpathy
|
a855d316fd
|
add device and dtype support to train.py args
|
2023-01-08 19:20:38 +00:00 |
|
Luca Antiga
|
09f1f458e8
|
Move conditional import
|
2023-01-08 15:51:50 +01:00 |
|
Luca Antiga
|
aba47f0a35
|
Make wandb import conditioned to wandb_log=True
|
2023-01-08 15:42:08 +01:00 |
|
Andrej Karpathy
|
9629093e53
|
minor args re-arranging and removing some spurious ones like wandb entity ty @tcapelle
|
2023-01-05 01:14:02 +00:00 |
|
Andrej Karpathy
|
d562b3e550
|
shuttling the poor mans configurator aside into its own file and adding it to all of train,sample,bench. because i am leaving args in globals() so i can avoid having to prepend every single variable with an args., i have to exec the configurator and the optional configs. so we're left with something very gross by standard convention but also quite simple and functional. *ducks*
|
2023-01-05 00:44:35 +00:00 |
|
Andrej Karpathy
|
9f95aca93e
|
better hyperparams for gpt2 124M model on A100 40GB. still uncertain about max_iters especially, and a bit about weight decay, betas
|
2023-01-03 17:45:49 +00:00 |
|
Andrej Karpathy
|
ec9b1f8182
|
add a patch to fix mysterious unwanted prefix in state dict? maybe remove later
|
2023-01-02 01:25:02 +00:00 |
|
Andrej Karpathy
|
35f51974c4
|
rename to compile it's shorter
|
2023-01-02 01:14:46 +00:00 |
|
Andrej Karpathy
|
2febf4463c
|
candidate changes to apis, have to think through more
|
2023-01-01 01:29:48 +00:00 |
|
Andrej Karpathy
|
5a725d9098
|
add torch.compile by default, shows almost 1.8X improvement in throughput nice
|
2022-12-30 00:07:13 +00:00 |
|
Andrej Karpathy
|
682a0ac8f1
|
properly resume training, also loading iter_num and best_val_loss from checkpoints
|
2022-12-29 18:23:15 +00:00 |
|
Andrej Karpathy
|
dea1507252
|
add support for DDP training. the scaling timings right now do not look good by default, have to dig more into
|
2022-12-29 05:06:07 +00:00 |
|
Andrej Karpathy
|
5d2b4807bf
|
adding a lightweight configurator that may be a terrible mistake lol. also adding configs to evaluate the baseline GPT2 versions released by OpenAI on OWT. we have some ways to go to match those numbers atm
|
2022-12-28 23:31:23 +00:00 |
|
Andrej Karpathy
|
c9fe00c0e9
|
small readme clarification and training script defaults changes
|
2022-12-28 01:45:55 +00:00 |
|
Andrej Karpathy
|
fe8042867c
|
first very bad commit
|
2022-12-28 00:58:19 +00:00 |
|
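As a worked check of the arithmetic in commit 77e7e04c26 (padding vocab_size from 50257 to 50304, the nearest multiple of 64), here is a minimal sketch. The helper name `pad_to_multiple` is hypothetical and is not taken from the repository; it only illustrates the rounding, not how the training script applies it.

```python
def pad_to_multiple(n: int, multiple: int = 64) -> int:
    """Round n up to the next multiple of `multiple` (hypothetical helper)."""
    # Integer ceiling division, then scale back up.
    return ((n + multiple - 1) // multiple) * multiple


# GPT-2 BPE vocabulary size quoted in the commit message.
gpt2_vocab_size = 50257
padded_vocab_size = pad_to_multiple(gpt2_vocab_size, 64)
print(padded_vocab_size)  # 50304
```

The extra 47 rows are unused token slots; per the commit message, the benefit comes from the last layer's very high channel count landing on a multiple of 64.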