Yassine Yousfi
40f4d6ff70
use the enabled arg in GradScaler
2023-01-31 21:12:49 -08:00
Andrej Karpathy
d995c22128
fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not
2023-02-01 02:05:34 +00:00
Andrej Karpathy
038ce89438
rename iter to it, because iter is a concrete Python builtin
2023-01-31 23:34:02 +00:00
Andrej Karpathy
d2705bd92a
tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap
2023-01-31 21:57:07 +00:00
Andrej Karpathy
4386bce1f4
adjust teaser figure with a more tuned result
2023-01-31 21:43:30 +00:00
Andrej Karpathy
924a0873eb
merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training
2023-01-30 23:40:35 +00:00
Andrej Karpathy
ae06d0b15a
add flash attention support, resolving last few issues but for now seems to work ok
2023-01-30 23:18:26 +00:00
Andrej Karpathy
0e90ee9d48
based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup
2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7
stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2
2023-01-27 20:51:50 +00:00
Andrej Karpathy
79dbe0086d
let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality
2023-01-27 20:45:28 +00:00
Andrej Karpathy
e808a67149
bunch of plumbing of bias all around. measuring bias=False to be about 6% faster
2023-01-27 20:41:17 +00:00
Andrej Karpathy
cc5444e194
add the bias option to config, default it to True for now
2023-01-27 20:29:45 +00:00
Andrej Karpathy
2bf07a3fbf
rewrite model class so layernorm has an optional bias= parameter
2023-01-27 20:17:32 +00:00
Andrej Karpathy
2892858ce7
attempt a non-biased model, per few papers that cite this as working well
2023-01-27 18:54:08 +00:00
Andrej Karpathy
f29a9ff5bf
ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default.
2023-01-27 17:56:18 +00:00
Andrej Karpathy
23a0bfac20
try bring back mingpt init
2023-01-27 16:52:18 +00:00
Andrej Karpathy
3cb3fc059c
grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains
2023-01-27 16:45:09 +00:00
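One common form of the gradient clipping discussed in the commit above rescales all gradients together when their global L2 norm exceeds a threshold. The following is a plain-Python sketch of that idea only; `clip_by_global_norm` is a hypothetical helper, the commit does not say which clipping variant is used, and PyTorch code would normally call `torch.nn.utils.clip_grad_norm_` instead.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the limit: leave the gradients untouched.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Illustrative values, not taken from any training run.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

The extra pass over every gradient to compute the norm is where the compute cost mentioned in the commit message would come from.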
Andrej Karpathy
e0c689cf38
allow the prompt to come from a file
2023-01-25 01:12:43 +00:00
Andrej Karpathy
21675d7755
allow sample.py to init from pretrained gpt2 checkpoints as well, in similar style to train.py
2023-01-25 00:55:29 +00:00
johnwildauer
e0e94a1094
use GradScaler in model only if dtype is float16
2023-01-24 15:53:31 -07:00
Andrej
6c40a08b41
Merge pull request #82 from danielgross/master
...
Missed two spots while relative pathing
2023-01-22 13:47:32 -08:00
DG
2f7fd0ac57
add relative import in shakespeare
2023-01-22 12:18:24 -08:00
DG
bf779456f3
add relative import in shakespeare_char
2023-01-22 11:11:25 -08:00
Andrej
3611338959
Merge pull request #71 from cchan/patch-1
...
Zero-grad more aggressively to save memory
2023-01-20 14:38:10 -08:00
Andrej Karpathy
1f77d03024
make mentions of mps in docs. ty good people in issue #28
2023-01-20 21:28:20 +00:00
Andrej
a6bffeee59
Merge pull request #73 from danielgross/master
...
Use relative paths
2023-01-20 12:21:33 -08:00
DG
edb7a7eab0
use relative paths so that running the data prep scripts always creates files in the local folder, no matter where they are run from
2023-01-20 10:39:45 -08:00
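One common way to get the behaviour described in the commit above is to resolve file names against the script's own location rather than the current working directory. A minimal sketch, where `data_path`, the script path, and the file name are hypothetical examples rather than code from the repository:

```python
import os

def data_path(script_file, filename):
    """Resolve filename next to the given script, independent of the cwd."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, filename)

# Inside a real script one would pass __file__; a literal path is used here
# only so the sketch runs on its own.
example = data_path("/repo/data/shakespeare/prepare.py", "input.txt")
print(example)
```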
Clive Chan
67166079c9
Zero-grad more aggressively to save memory
2023-01-19 22:10:44 -08:00
Andrej Karpathy
2c7806db6e
for consistency with previous commit
2023-01-19 23:10:51 +00:00
Andrej
c1c20a0311
Merge pull request #57 from ryouze/patch-1
...
Improve readability of huge numbers
2023-01-19 15:08:35 -08:00
Andrej
9e150b808e
Merge pull request #66 from PWhiddy/patch-1
...
fix typo ( params -> tokens)
2023-01-18 22:29:51 -08:00
Peter Whidden
ff9085d0bc
fix typo ( params -> tokens)
2023-01-18 21:17:15 -05:00
Andrej Karpathy
8dd2061e4d
fix temperature comment, slightly wrong
2023-01-18 16:10:05 +00:00
Andrej Karpathy
2b083fbfde
the badge is a bit ugly, move it down to troubleshooting section
2023-01-18 03:16:59 +00:00
Andrej Karpathy
aa8e4c2546
screwed up the link, fix
2023-01-18 03:11:31 +00:00
Andrej Karpathy
6dab32c003
experimenting with badges, and discord link to start specifically. issues sometimes feel a little too heavy
2023-01-18 03:09:42 +00:00
リョウゼ
be571fff2c
Improve readability of huge numbers
...
Before:
length of dataset in characters: 1115394
all the unique characters:
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1003854 tokens
val has 111540 tokens
After:
length of dataset in characters: 1,115,394
all the unique characters:
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
2023-01-16 22:05:32 +01:00
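The before/after output in the commit above is consistent with Python's thousands-separator format specifier. A minimal sketch, assuming the counts are plain integers; the function and variable names are illustrative and not taken from the repository:

```python
def describe(n_chars, n_train, n_val):
    """Format dataset statistics with comma thousands separators."""
    return [
        f"length of dataset in characters: {n_chars:,}",
        f"train has {n_train:,} tokens",
        f"val has {n_val:,} tokens",
    ]

# Values copied from the sample output in the commit message.
lines = describe(1115394, 1003854, 111540)
for line in lines:
    print(line)
```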
Andrej Karpathy
7f74652843
add docs on multinode training to main README too
2023-01-16 17:11:02 +00:00
Andrej Karpathy
46ce9971df
small tweaks to docs and variable names stylistically
2023-01-16 16:56:05 +00:00
Andrej Karpathy
684800dd87
clarify that these should be run on two separate machines
2023-01-16 06:02:46 +00:00
Andrej Karpathy
9352df23de
docs for multinode ddp
2023-01-16 05:57:33 +00:00
Andrej Karpathy
c3dddbff3d
get rid of gpu_id, the world is more complicated than that when world_size > 8
2023-01-16 05:44:50 +00:00
Andrej Karpathy
f5e6ac8b02
local rank -> rank
2023-01-16 05:13:13 +00:00
Andrej Karpathy
cf99914886
add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR
2023-01-15 17:49:55 +00:00
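Gradient accumulation, as named in the commit above, sums gradients over several micro-batches before a single optimizer step, scaling each micro-batch contribution so the result matches one large batch. The toy sketch below checks that equivalence on a one-parameter least-squares model in plain Python; the data, names, and model are invented for illustration, and the actual training loop relies on PyTorch autograd.

```python
def grad_mse(w, xs, ys):
    """Gradient with respect to w of mean((w * x - y) ** 2) over the batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, accum_steps):
    """Accumulate gradients over equal-sized micro-batches.

    Assumes len(xs) is divisible by accum_steps. Each micro-batch gradient is
    divided by accum_steps so the sum equals the full-batch mean gradient.
    """
    micro = len(xs) // accum_steps
    total = 0.0
    for i in range(accum_steps):
        lo, hi = i * micro, (i + 1) * micro
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps
    return total

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
accum = accumulated_grad(0.5, xs, ys, accum_steps=2)
print(full, accum)
```

Dividing by the number of accumulation steps is the detail that is easiest to forget; without it the effective learning rate grows with the number of micro-batches.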
Andrej Karpathy
89da79eee1
add note of caution for the produced warning, investigate later
2023-01-14 20:38:22 +00:00
Andrej Karpathy
7d7ded25ce
a bit better settings... for a single gpu at least. these settings would fry a simple cpu though i think
2023-01-14 03:59:53 +00:00
Andrej Karpathy
91d02510ce
fix bug... if topk > vocab_size, torch.topk will throw error
2023-01-14 03:57:00 +00:00
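The bug fixed above arises when more top entries are requested than the vocabulary holds. A usual guard is to clamp k to the vocabulary size before selecting. The sketch below shows that guard in plain Python as a stand-in for the tensor code; `top_k_filter` is a hypothetical helper, and `heapq.nlargest` itself would not fail on an oversized k, so the clamp here only mirrors what `torch.topk` requires.

```python
import heapq
import math

def top_k_filter(logits, top_k):
    """Keep the top_k largest logits and set the rest to -inf.

    Assumes top_k >= 1 and a non-empty list. Ties at the threshold are all kept.
    """
    k = min(top_k, len(logits))  # clamp so k never exceeds the vocabulary size
    threshold = heapq.nlargest(k, logits)[-1]
    return [x if x >= threshold else -math.inf for x in logits]

print(top_k_filter([1.0, 3.0, 2.0], 2))
print(top_k_filter([1.0, 3.0, 2.0], 10))
```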
Andrej Karpathy
57735f532d
correctly propagate the vocab_size from the rendered dataset into the model args
2023-01-14 02:26:44 +00:00
Andrej Karpathy
43b37fd568
reverse the order, making sure that the final layer init is preserved, and becomes the token embedding instead of the other way around. otherwise the loss can be all messed up from a bad init
2023-01-14 02:16:10 +00:00
Andrej Karpathy
7c8288552b
tie the weights of lm_head.weight and transformer.wte.weight, i.e. the last linear layer of decoder and the token embeddings.
2023-01-14 01:00:55 +00:00