Yassine Yousfi
40f4d6ff70
use the enabled arg in GradScaler
2023-01-31 21:12:49 -08:00
Andrej Karpathy
d995c22128
fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not
2023-02-01 02:05:34 +00:00
Andrej Karpathy
038ce89438
rename iter to it, because iter is a concrete Python builtin
2023-01-31 23:34:02 +00:00
Andrej Karpathy
d2705bd92a
tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap
2023-01-31 21:57:07 +00:00
Andrej Karpathy
4386bce1f4
adjust teaser figure with a more tuned result
2023-01-31 21:43:30 +00:00
Andrej Karpathy
924a0873eb
merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training
2023-01-30 23:40:35 +00:00
Andrej Karpathy
ae06d0b15a
add flash attention support, resolving last few issues but for now seems to work ok
2023-01-30 23:18:26 +00:00
Andrej Karpathy
0e90ee9d48
based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup
2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7
stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2
2023-01-27 20:51:50 +00:00
Andrej Karpathy
79dbe0086d
let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality
2023-01-27 20:45:28 +00:00
Andrej Karpathy
e808a67149
bunch of plumbing of bias all around. measuring bias=False to be about 6% faster
2023-01-27 20:41:17 +00:00
Andrej Karpathy
cc5444e194
add the bias option to config, default it to True for now
2023-01-27 20:29:45 +00:00
Andrej Karpathy
2bf07a3fbf
rewrite model class so layernorm has an optional bias= parameter
2023-01-27 20:17:32 +00:00
Andrej Karpathy
2892858ce7
attempt a non-biased model, per few papers that cite this as working well
2023-01-27 18:54:08 +00:00
Andrej Karpathy
f29a9ff5bf
ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default.
2023-01-27 17:56:18 +00:00
Andrej Karpathy
23a0bfac20
try bring back mingpt init
2023-01-27 16:52:18 +00:00
Andrej Karpathy
3cb3fc059c
grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains
2023-01-27 16:45:09 +00:00
Andrej Karpathy
e0c689cf38
allow the prompt to come from a file
2023-01-25 01:12:43 +00:00
Andrej Karpathy
21675d7755
allow sample.py to init from pretrained gpt2 checkpoints as well, in similar style to train.py
2023-01-25 00:55:29 +00:00
johnwildauer
e0e94a1094
use GradScaler in model only if dtype is float16
2023-01-24 15:53:31 -07:00
Andrej
6c40a08b41
Merge pull request #82 from danielgross/master
Missed two spots while relative pathing
2023-01-22 13:47:32 -08:00
DG
2f7fd0ac57
add relative import in shakespeare
2023-01-22 12:18:24 -08:00
DG
bf779456f3
add relative import in shakespeare_char
2023-01-22 11:11:25 -08:00
venusatuluri
f9d8020f48
Fix decode fn in shakespeare_char/prepare.py
2023-01-21 06:14:16 +00:00
Andrej
3611338959
Merge pull request #71 from cchan/patch-1
Zero-grad more aggressively to save memory
2023-01-20 14:38:10 -08:00
Andrej Karpathy
1f77d03024
make mentions of mps in docs. ty good people in issue #28
2023-01-20 21:28:20 +00:00
Andrej
a6bffeee59
Merge pull request #73 from danielgross/master
Use relative paths
2023-01-20 12:21:33 -08:00
DG
edb7a7eab0
use relative paths so that running the data prep scripts always creates files in the local folder, no matter where they are run from
2023-01-20 10:39:45 -08:00
Clive Chan
67166079c9
Zero-grad more aggressively to save memory
2023-01-19 22:10:44 -08:00
Andrej Karpathy
2c7806db6e
for consistency with previous commit
2023-01-19 23:10:51 +00:00
Andrej
c1c20a0311
Merge pull request #57 from ryouze/patch-1
Improve readability of huge numbers
2023-01-19 15:08:35 -08:00
Andrej
9e150b808e
Merge pull request #66 from PWhiddy/patch-1
fix typo ( params -> tokens)
2023-01-18 22:29:51 -08:00
Peter Whidden
ff9085d0bc
fix typo ( params -> tokens)
2023-01-18 21:17:15 -05:00
Andrej Karpathy
8dd2061e4d
fix temperature comment, slightly wrong
2023-01-18 16:10:05 +00:00
Andrej Karpathy
2b083fbfde
the badge is a bit ugly, move it down to troubleshooting section
2023-01-18 03:16:59 +00:00
Andrej Karpathy
aa8e4c2546
screwed up the link, fix
2023-01-18 03:11:31 +00:00
Andrej Karpathy
6dab32c003
experimenting with badges, and discord link to start specifically. issues sometimes feel a little too heavy
2023-01-18 03:09:42 +00:00
リョウゼ
be571fff2c
Improve readability of huge numbers
Before:
length of dataset in characters: 1115394
all the unique characters:
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1003854 tokens
val has 111540 tokens
After:
length of dataset in characters: 1,115,394
all the unique characters:
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
2023-01-16 22:05:32 +01:00
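The Before/After output in this commit is consistent with printing counts using a thousands separator. A minimal sketch, assuming Python's standard `,` format-spec option; `format_count` is a hypothetical helper name for illustration and is not taken from the repository:

```python
def format_count(n: int) -> str:
    """Render an integer with comma thousands separators (hypothetical helper)."""
    return f"{n:,}"


if __name__ == "__main__":
    # Token counts taken from the commit message above.
    print(f"train has {format_count(1003854)} tokens")
    print(f"val has {format_count(111540)} tokens")
```

Small values such as the vocab size (65) pass through unchanged, which matches the After output.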
Andrej Karpathy
7f74652843
add docs on multinode training to main README too
2023-01-16 17:11:02 +00:00
Andrej Karpathy
46ce9971df
small tweaks to docs and variable names stylistically
2023-01-16 16:56:05 +00:00
Andrej Karpathy
684800dd87
clarify that these should be run on two separate machines
2023-01-16 06:02:46 +00:00
Andrej Karpathy
9352df23de
docs for multinode ddp
2023-01-16 05:57:33 +00:00
Andrej Karpathy
c3dddbff3d
get rid of gpu_id, the world is more complicated than that when world_size > 8
2023-01-16 05:44:50 +00:00
Andrej Karpathy
f5e6ac8b02
local rank -> rank
2023-01-16 05:13:13 +00:00
MicroPanda123
d5ee965974
Update README.md
2023-01-15 20:29:15 +00:00
Andrej Karpathy
cf99914886
add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR
2023-01-15 17:49:55 +00:00
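The commit message does not show how accumulation is implemented, so the following is only a toy illustration of the idea: summing micro-batch gradients, each scaled by one over the number of micro-steps, reproduces the gradient of one larger batch. The scalar model `w * x`, the mean-squared-error loss, and the names `mse_grad` and `accumulated_grad` are all assumptions made for this sketch, not code from train.py:

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over equal-sized micro-batches.

    Each micro-batch gradient is divided by the number of micro-steps so the
    running sum equals the gradient of the mean loss over the whole batch.
    """
    assert len(xs) % micro_batch_size == 0, "sketch assumes equal-sized micro-batches"
    num_steps = len(xs) // micro_batch_size
    grad = 0.0
    for step in range(num_steps):
        start = step * micro_batch_size
        end = start + micro_batch_size
        grad += mse_grad(w, xs[start:end], ys[start:end]) / num_steps
    return grad


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
    print(mse_grad(0.5, xs, ys), accumulated_grad(0.5, xs, ys, micro_batch_size=2))
```

The two printed values agree up to floating-point rounding, which is why accumulation can stand in for a larger batch when memory is limited.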
Andrej Karpathy
89da79eee1
add note of caution for the produced warning, investigate later
2023-01-14 20:38:22 +00:00
Andrej Karpathy
7d7ded25ce
a bit better settings... for a single gpu at least. these settings would fry a simple cpu though i think
2023-01-14 03:59:53 +00:00
Andrej Karpathy
91d02510ce
fix bug... if topk > vocab_size, torch.topk will throw error
2023-01-14 03:57:00 +00:00
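One common way to avoid this error is to clamp k to the vocabulary size before selecting the top entries. The commit message does not say how the fix was made, so this is a sketch under that assumption, written in plain Python as a stand-in for the `torch.topk` call; `top_k_filter` is a hypothetical name:

```python
import math


def top_k_filter(logits, top_k):
    """Keep the top_k largest logits and set the rest to -inf.

    k is clamped to len(logits), so a top_k larger than the vocabulary
    keeps everything instead of failing. Ties at the threshold are kept.
    Assumes top_k >= 1 and a non-empty list of logits.
    """
    k = min(top_k, len(logits))
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


if __name__ == "__main__":
    print(top_k_filter([1.0, 3.0, 2.0], 2))    # smallest logit is masked
    print(top_k_filter([1.0, 3.0, 2.0], 200))  # k clamped to 3, nothing masked
```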
Andrej Karpathy
57735f532d
correctly propagate the vocab_size from the rendered dataset into the model args
2023-01-14 02:26:44 +00:00