f3118fe74d | 2024-06-24 20:13:10 +01:00
Add note about fix

0194d45e43 | 2024-06-24 19:10:15 +00:00
experiments

5156fef93c | Kevin Slagle | 2024-01-25 11:41:01 -08:00
fix np.memmap memory leak
np.memmap doesn't free the memory it accesses, so once the dataset has been read in full, the entire dataset ends up resident in RAM. The simplest workaround (from Stack Overflow) is to just recreate the memmap for each batch; the extra overhead is negligible.
https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once/61472122#61472122

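The workaround in code, roughly as nanoGPT's get_batch applies it (the file path and batch/block sizes below are illustrative, not the project's defaults):

    import numpy as np
    import torch

    block_size, batch_size = 1024, 12  # illustrative values

    def get_batch(split):
        # recreate the memmap every batch so pages touched by earlier batches
        # can be reclaimed; this avoids the leak described above
        data = np.memmap(f'data/{split}.bin', dtype=np.uint16, mode='r')
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
        y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
        return x, y
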
1eaceae193 | 2023-06-19 18:05:09 -04:00
Fix AssertionError on macOS - need to check CUDA availability for bf16

7339b904ef | Andrej Karpathy | 2023-06-14 23:33:07 +00:00
use WORLD_SIZE instead of device_count; supports both the case where the number of gpus we train on is smaller than the gpus available, and also multinode training. may be a bugfix

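A minimal sketch of the env-var-driven DDP setup these commits converge on, assuming torchrun-style RANK/LOCAL_RANK/WORLD_SIZE environment variables (names follow nanoGPT's train.py, but treat the details as approximate):

    import os
    import torch
    from torch.distributed import init_process_group

    ddp = int(os.environ.get('RANK', -1)) != -1  # is this a ddp run?
    if ddp:
        init_process_group(backend='nccl')
        ddp_rank = int(os.environ['RANK'])
        ddp_local_rank = int(os.environ['LOCAL_RANK'])
        ddp_world_size = int(os.environ['WORLD_SIZE'])  # not torch.cuda.device_count()
        device = f'cuda:{ddp_local_rank}'
        torch.cuda.set_device(device)  # explicitly pin this process to its GPU
    else:
        ddp_world_size = 1
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
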
eb33b8bf1c | Alexander Pivovarov | 2023-05-17 03:26:48 +00:00
Use bf16 only if supported

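The check boils down to one line; a sketch:

    import torch

    # fall back to float16 where bfloat16 isn't available (pre-Ampere GPUs, macOS)
    dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
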
a6a708c7f1 | Andrej | 2023-04-17 20:11:00 -07:00
Merge branch 'master' into grad_accum

2457471c9c | Andrej | 2023-04-12 22:09:42 -07:00
Merge pull request #236 from ymurenko/master
fix "cuda out of memory" when resuming training

553f949f46 | Andrej Karpathy | 2023-04-13 04:59:11 +00:00
fix minor bug where we have to scale the loss to account for gradient accumulation, which sums gradients before backprop. note that this is not a major bug because AdamW is scale invariant; however, it did affect gradient clipping

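A self-contained sketch of the corrected accumulation loop (the Linear model and the values are stand-ins, not nanoGPT's):

    import torch

    model = torch.nn.Linear(10, 10)  # stand-in for the GPT model
    optimizer = torch.optim.AdamW(model.parameters())
    gradient_accumulation_steps = 5

    optimizer.zero_grad(set_to_none=True)
    for micro_step in range(gradient_accumulation_steps):
        loss = model(torch.randn(4, 10)).pow(2).mean()
        # scale the loss so the summed gradients match one large batch;
        # AdamW is scale-invariant, but gradient clipping is not
        (loss / gradient_accumulation_steps).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
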
4ac2e8ce3a | ymurenko | 2023-04-05 17:28:55 -04:00
fix "cuda out of memory" when resuming training

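A sketch of the idea behind the fix: drop the reference to the loaded checkpoint dict once its tensors are copied in, so the duplicate weights can be garbage-collected (paths and model are stand-ins):

    import torch

    device = 'cpu'
    model = torch.nn.Linear(10, 10)  # stand-in model
    optimizer = torch.optim.AdamW(model.parameters())
    torch.save({'model': model.state_dict(), 'optimizer': optimizer.state_dict()}, 'ckpt.pt')

    checkpoint = torch.load('ckpt.pt', map_location=device)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    checkpoint = None  # drop the reference so the duplicate state can be freed
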
978d4fe538 | Otavio Good | 2023-03-25 00:04:45 -07:00
Fix for gradient_accumulation_steps training slow

086ebe1822 | Otavio Good | 2023-02-13 10:42:44 -08:00
fix for training stability on single GPU

e58f0cfa94 | Andrej Karpathy | 2023-02-07 21:38:39 +00:00
oops, i should not need to be multiplying by world_size to calculate mfu

8b1e43209e | Andrej Karpathy | 2023-02-06 23:07:25 +00:00
small tweaks: make default WD be 0.1 as is often cited, and remove spurious init of LayerNorm, which is already initialized to 1 and 0

ab21d6c15d | Andrej Karpathy | 2023-02-06 19:55:35 +00:00
bugfix: we have to call the raw_model's estimate_mfu. ty @jprobichaud for the original PR

ab0718a7dd | Andrej Karpathy | 2023-02-05 00:48:58 +00:00
add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving

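Roughly how the estimate works, as a standalone sketch of nanoGPT's estimate_mfu (the argument names here are invented; the 6*N + 12*L*H*Q*T flops-per-token estimate follows the PaLM paper's appendix):

    def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len, fwdbwd_per_iter, dt):
        # flops per token for a fwd+bwd pass: 6*N for the parameters,
        # plus 12*L*H*Q*T for attention
        flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
        flops_per_iter = flops_per_token * seq_len * fwdbwd_per_iter
        flops_achieved = flops_per_iter / dt  # flops per second
        flops_promised = 312e12               # A100 bfloat16 peak: 312 TFLOPS
        return flops_achieved / flops_promised
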
a74e8363a2 | Andrej Karpathy | 2023-02-04 21:11:25 +00:00
clean up TODOs a bit, they are stale

25d95dbd65 | Andrej Karpathy | 2023-02-04 21:06:17 +00:00
mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags

e108ffb973 | Andrej Karpathy | 2023-02-04 19:34:24 +00:00
very slight refactor, bit cleaner

b8286f343e | Nan Yang | 2023-02-04 11:16:26 -08:00
Pin memory only when training on GPU

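A sketch of the conditional transfer (tensor shape and values are illustrative):

    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    x_cpu = torch.randint(0, 50304, (12, 1024))  # a stand-in batch

    if device == 'cuda':
        # pinned (page-locked) host memory allows a truly asynchronous copy
        x = x_cpu.pin_memory().to(device, non_blocking=True)
    else:
        x = x_cpu.to(device)  # pinning buys nothing off-GPU
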
77e7e04c26 | Andrej Karpathy | 2023-02-04 16:06:18 +00:00
padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest-deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck, consuming about 40% of latency due to the very high channel count.

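The round-up arithmetic, as a sketch:

    gpt2_vocab = 50257
    padded = ((gpt2_vocab + 63) // 64) * 64  # round up to the nearest multiple of 64
    assert padded == 50304
    # the extra embedding / lm_head rows are simply never indexed by real tokens
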
b3c17c6c6a | Andrej Karpathy | 2023-02-04 15:57:29 +00:00
slight tweak compressing LOC

9da1627c7f | Ramtin Gharleghi | 2023-02-04 15:07:36 +11:00
Explicitly set ddp device

3fd4c0c5ef | Andrej Karpathy | 2023-02-04 02:52:48 +00:00
who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol

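A pseudocode-level sketch of the overlap trick, assuming a get_batch like the one sketched earlier, extended to issue pinned non_blocking copies; model and optimizer are presumed to exist:

    # fetch the very first batch, then always prefetch the next one while
    # the current batch's forward/backward is queued on the GPU
    X, Y = get_batch('train')
    while True:
        logits, loss = model(X, Y)   # async kernel launches, returns quickly
        X, Y = get_batch('train')    # next batch's H2D copy overlaps compute
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
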
7d44bdf6b5 | Andrej | 2023-02-02 17:23:22 -08:00
Merge pull request #106 from YassineYousfi/master
use the ``enabled`` arg in GradScaler

d8b1a94519 | Andrej Karpathy | 2023-02-02 18:38:49 +00:00
change grad accum to default off because i think it just confuses everyone

40f4d6ff70 | Yassine Yousfi | 2023-01-31 21:12:49 -08:00
use the enabled arg in GradScaler

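The change amounts to constructing the scaler with enabled tied to the dtype; a sketch:

    import torch

    dtype = 'float16'  # or 'bfloat16' / 'float32'
    # loss scaling is only needed for fp16; for bf16/fp32 the scaler becomes a no-op
    scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
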
038ce89438 | Andrej Karpathy | 2023-01-31 23:34:02 +00:00
rename iter to it, because iter is a concrete Python builtin

924a0873eb | Andrej Karpathy | 2023-01-30 23:40:35 +00:00
merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training

0e90ee9d48 | Andrej Karpathy | 2023-01-30 08:07:58 +00:00
based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup

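This is roughly the optional-bias LayerNorm nanoGPT ended up with (at the time, nn.LayerNorm offered no bias=False switch); shown as a sketch:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LayerNorm(nn.Module):
        """LayerNorm with an optional bias (nn.LayerNorm had no bias=False switch)."""
        def __init__(self, ndim, bias):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(ndim))
            self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None
        def forward(self, x):
            return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
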
001c1e7be7 | Andrej Karpathy | 2023-01-27 20:51:50 +00:00
stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M tokens and reproduces gpt2

79dbe0086d | Andrej Karpathy | 2023-01-27 20:45:28 +00:00
let me set bias=True until I validate it properly, but this should be ok to merge to master for now; it is equivalent to the previous functionality

e808a67149 | Andrej Karpathy | 2023-01-27 20:41:17 +00:00
bunch of plumbing of bias all around. measuring bias=False to be about 6% faster

3cb3fc059c | Andrej Karpathy | 2023-01-27 16:45:09 +00:00
grad clipping seems to slightly speed up training in the beginning, but i can't see a big difference later in training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think it becomes more necessary as the model gets larger. practitioners may consider turning it off for minor efficiency gains

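A sketch of clipping alongside a GradScaler: gradients must be unscaled before the norm is measured, which is the "careful" part mentioned a few commits up (stand-in model; enabled=False so this also runs on CPU):

    import torch

    model = torch.nn.Linear(10, 10)  # stand-in for the GPT model
    optimizer = torch.optim.AdamW(model.parameters())
    scaler = torch.cuda.amp.GradScaler(enabled=False)  # enabled=True for fp16
    grad_clip = 1.0

    loss = model(torch.randn(4, 10)).pow(2).mean()
    scaler.scale(loss).backward()
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)  # undo fp16 loss scaling BEFORE measuring norms
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    scaler.step(optimizer)
    scaler.update()
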
e0e94a1094 | johnwildauer | 2023-01-24 15:53:31 -07:00
use GradScaler in model only if dtype is float16

3611338959 | Andrej | 2023-01-20 14:38:10 -08:00
Merge pull request #71 from cchan/patch-1
Zero-grad more aggressively to save memory

1f77d03024 | Andrej Karpathy | 2023-01-20 21:28:20 +00:00
make mentions of mps in docs. ty good people in issue #28

67166079c9 | Clive Chan | 2023-01-19 22:10:44 -08:00
Zero-grad more aggressively to save memory

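A sketch of the aggressive zeroing (stand-in model): zeroing right after the step, with set_to_none=True, releases the gradient tensors instead of carrying them as zeroed buffers into the next iteration, which lowers peak memory:

    import torch

    model = torch.nn.Linear(10, 10)  # stand-in model
    optimizer = torch.optim.AdamW(model.parameters())

    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()
    # zero immediately after stepping; set_to_none=True frees the gradients
    optimizer.zero_grad(set_to_none=True)
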
46ce9971df | Andrej Karpathy | 2023-01-16 16:56:05 +00:00
small tweaks to docs and variable names stylistically

684800dd87 | Andrej Karpathy | 2023-01-16 06:02:46 +00:00
clarify that these should be run on two separate machines

9352df23de | Andrej Karpathy | 2023-01-16 05:57:33 +00:00
docs for multinode ddp

c3dddbff3d | Andrej Karpathy | 2023-01-16 05:44:50 +00:00
get rid of gpu_id, the world is more complicated than that when world_size > 8

f5e6ac8b02 | Andrej Karpathy | 2023-01-16 05:13:13 +00:00
local rank -> rank

cf99914886 | Andrej Karpathy | 2023-01-15 17:49:55 +00:00
add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR

57735f532d | Andrej Karpathy | 2023-01-14 02:26:44 +00:00
correctly propagate the vocab_size from the rendered dataset into the model args

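A sketch of the propagation, assuming the dataset's prepare.py wrote a meta.pkl carrying a vocab_size key (the path is illustrative):

    import os
    import pickle

    model_args = {}  # stand-in for train.py's model_args dict
    meta_path = os.path.join('data', 'shakespeare_char', 'meta.pkl')  # illustrative
    meta_vocab_size = None
    if os.path.exists(meta_path):
        with open(meta_path, 'rb') as f:
            meta = pickle.load(f)
        meta_vocab_size = meta['vocab_size']  # written by the dataset's prepare.py
    # fall back to the padded GPT-2 vocab when the dataset doesn't specify one
    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
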
8f85b83347 | Andrej Karpathy | 2023-01-12 06:02:50 +00:00
inference-time mini-optimization, low-hanging fruit, ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token

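A standalone sketch of the shape difference (dimensions are illustrative; indexing with [-1] as a list keeps the time dimension):

    import torch
    import torch.nn as nn

    lm_head = nn.Linear(768, 50304, bias=False)  # illustrative sizes
    x = torch.randn(2, 1024, 768)  # transformer output: (batch, time, n_embd)

    logits_train = lm_head(x)              # training: (2, 1024, 50304), every position
    logits_infer = lm_head(x[:, [-1], :])  # inference: (2, 1, 50304), last token only
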
d17350a31d | Andrej Karpathy | 2023-01-11 05:27:19 +00:00
add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders.

c2a402f7f7 | Andrej Karpathy | 2023-01-11 01:00:22 +00:00
guess the config from globals() and log all of it with wandb

a855d316fd | Andrej Karpathy | 2023-01-08 19:20:38 +00:00
add device and dtype support to train.py args

09f1f458e8 | Luca Antiga | 2023-01-08 15:51:50 +01:00
Move conditional import