mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-12-18 14:10:28 +00:00
Commit Graph

110 Commits

Author SHA1 Message Date
Andrej Karpathy
1e87509e47 if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh 2023-02-02 23:22:56 +00:00
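A minimal sketch of the guard this commit likely introduces; the helper name flash_available is hypothetical, but the condition mirrors the message: only use the fused kernel when it exists and dropout is 0.0, pending the upstream PyTorch fix.

  import torch

  def flash_available(dropout: float) -> bool:
      # hypothetical helper: use flash attention only when the fused kernel
      # exists (PyTorch >= 2.0) AND dropout is 0.0, pending the pytorch fix
      return (hasattr(torch.nn.functional, 'scaled_dot_product_attention')
              and dropout == 0.0)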
Andrej Karpathy
d8b1a94519 change grad accum to default off because i think it just confuses everyone 2023-02-02 18:38:49 +00:00
Andrej Karpathy
d01863ef01 small usability tweaks to bench 2023-02-02 17:23:46 +00:00
Andrej Karpathy
d995c22128 fix bug with loading GPT-2 parameters: assert gets incorrectly tripped due to .bias missing, since it is now only optionally present depending on whether flash is used 2023-02-01 02:05:34 +00:00
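The fix plausibly amounts to skipping the causal-mask buffer when matching state-dict keys, since it is a non-learnable buffer that is absent in the flash path. A toy sketch (the key names are assumptions based on the GPT-2 module layout):

  import torch

  # toy state dict standing in for the checkpoint being loaded
  sd = {'h.0.attn.bias': torch.ones(4, 4),
        'h.0.attn.c_attn.weight': torch.randn(4, 4)}
  # '.attn.bias' is the causal-mask buffer, not a parameter; it may be absent
  # when flash attention is used, so drop it before comparing key sets
  sd_keys = [k for k in sd if not k.endswith('.attn.bias')]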
Andrej Karpathy
038ce89438 rename iter to it, because iter is a Python builtin 2023-01-31 23:34:02 +00:00
Andrej Karpathy
d2705bd92a tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap 2023-01-31 21:57:07 +00:00
Andrej Karpathy
4386bce1f4 adjust teaser figure with a more tuned result 2023-01-31 21:43:30 +00:00
Andrej Karpathy
924a0873eb merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training 2023-01-30 23:40:35 +00:00
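The care needed here: with fp16 training, the GradScaler holds gradients in scaled form, so they must be unscaled before clipping or the threshold is applied to the wrong magnitudes. A sketch of the ordering, where scaler, optimizer, model, loss, and grad_clip are assumed training-loop names:

  scaler.scale(loss).backward()
  if grad_clip != 0.0:
      scaler.unscale_(optimizer)  # bring grads back to true magnitude first
      torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
  scaler.step(optimizer)
  scaler.update()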
Andrej Karpathy
ae06d0b15a add flash attention support, resolving last few issues but for now seems to work ok 2023-01-30 23:18:26 +00:00
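A runnable sketch of the fused kernel this commit adopts: PyTorch >= 2.0 provides F.scaled_dot_product_attention, and is_causal=True applies the causal mask internally (shapes here are purely illustrative).

  import torch
  import torch.nn.functional as F

  q = torch.randn(2, 4, 8, 16)  # (batch, heads, seq, head_dim)
  k = torch.randn(2, 4, 8, 16)
  v = torch.randn(2, 4, 8, 16)
  y = F.scaled_dot_product_attention(q, k, v, attn_mask=None,
                                     dropout_p=0.0, is_causal=True)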
Andrej Karpathy
0e90ee9d48 based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup 2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7 stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2 2023-01-27 20:51:50 +00:00
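For scale, assuming the repo's defaults of batch_size=12 and block_size=1024 on an 8-GPU node, grad accum of 5 gives 12 * 1024 * 5 * 8 = 491,520 ≈ 0.5M tokens per optimizer step, matching GPT-2's reported batch size.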
Andrej Karpathy
79dbe0086d let me set bias=True until I validate it properly, but this should be ok to merge to master for now, as it is equivalent to the previous functionality 2023-01-27 20:45:28 +00:00
Andrej Karpathy
e808a67149 bunch of plumbing of bias all around. measuring bias=False to be about 6% faster 2023-01-27 20:41:17 +00:00
Andrej Karpathy
cc5444e194 add the bias option to config, default it to True for now 2023-01-27 20:29:45 +00:00
Andrej Karpathy
2bf07a3fbf rewrite model class so layernorm has an optional bias= parameter 2023-01-27 20:17:32 +00:00
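A sketch of the rewritten module: at the time nn.LayerNorm exposed no bias=False switch, so the class holds its own parameters and calls the functional form. This mirrors the repo's model.py, but treat it as illustrative:

  import torch
  import torch.nn as nn
  from torch.nn import functional as F

  class LayerNorm(nn.Module):
      """LayerNorm with an optional bias."""
      def __init__(self, ndim, bias):
          super().__init__()
          self.weight = nn.Parameter(torch.ones(ndim))
          self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

      def forward(self, x):
          return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)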
Andrej Karpathy
2892858ce7 attempt a non-biased model, per few papers that cite this as working well 2023-01-27 18:54:08 +00:00
Andrej Karpathy
f29a9ff5bf ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default. 2023-01-27 17:56:18 +00:00
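A hedged sketch of the GPT-2 style init being merged back in: normal(0, 0.02) for weights, zeros for biases, plus the paper's 1/sqrt(2*n_layer) scaling on residual projections (the helper name and c_proj attribute are assumptions):

  import math
  import torch.nn as nn

  def _init_weights(module):
      # GPT-2 style init, applied via model.apply(_init_weights)
      if isinstance(module, nn.Linear):
          nn.init.normal_(module.weight, mean=0.0, std=0.02)
          if module.bias is not None:
              nn.init.zeros_(module.bias)
      elif isinstance(module, nn.Embedding):
          nn.init.normal_(module.weight, mean=0.0, std=0.02)

  # residual projections (c_proj.weight) additionally get
  # std = 0.02 / math.sqrt(2 * n_layer), per the GPT-2 paper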
Andrej Karpathy
23a0bfac20 try bring back mingpt init 2023-01-27 16:52:18 +00:00
Andrej Karpathy
3cb3fc059c grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains 2023-01-27 16:45:09 +00:00
Andrej Karpathy
e0c689cf38 allow the prompt to come from a file 2023-01-25 01:12:43 +00:00
Andrej Karpathy
21675d7755 allow sample.py to init from a pretrained gpt2 checkpoint as well, in similar style to train.py 2023-01-25 00:55:29 +00:00
johnwildauer
e0e94a1094 use GradScaler in model only if dtype is float16 2023-01-24 15:53:31 -07:00
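The idea in two lines: construct the scaler unconditionally but enable it only for fp16, where loss scaling is actually needed; when disabled it degrades to a no-op (the dtype string is an assumption about the config):

  import torch

  dtype = 'float16'  # 'bfloat16' and 'float32' need no loss scaling
  scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))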
Andrej
6c40a08b41
Merge pull request #82 from danielgross/master
Missed two spots while converting to relative paths
2023-01-22 13:47:32 -08:00
DG
2f7fd0ac57 add relative import in shakespeare 2023-01-22 12:18:24 -08:00
DG
bf779456f3 add relative import in shakespeare_char 2023-01-22 11:11:25 -08:00
Andrej
3611338959
Merge pull request #71 from cchan/patch-1
Zero-grad more aggressively to save memory
2023-01-20 14:38:10 -08:00
Andrej Karpathy
1f77d03024 make mentions of mps in docs. ty good people in issue #28 2023-01-20 21:28:20 +00:00
Andrej
a6bffeee59
Merge pull request #73 from danielgross/master
Use relative paths
2023-01-20 12:21:33 -08:00
DG
edb7a7eab0 use relative paths so that running the data prep scripts always creates files in the local folder, no matter where they are run from 2023-01-20 10:39:45 -08:00
Clive Chan
67166079c9
Zero-grad more aggressively to save memory 2023-01-19 22:10:44 -08:00
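A runnable sketch of the memory saving: set_to_none=True frees the gradient tensors rather than zero-filling them in place, so their memory is released as soon as the optimizer step is done.

  import torch

  model = torch.nn.Linear(4, 2)
  optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
  loss = model(torch.randn(8, 4)).sum()
  loss.backward()
  optimizer.step()
  optimizer.zero_grad(set_to_none=True)  # drop grad tensors instead of zeroing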
Andrej Karpathy
2c7806db6e for consistency with previous commit 2023-01-19 23:10:51 +00:00
Andrej
c1c20a0311
Merge pull request #57 from ryouze/patch-1
Improve readability of huge numbers
2023-01-19 15:08:35 -08:00
Andrej
9e150b808e
Merge pull request #66 from PWhiddy/patch-1
fix typo (params -> tokens)
2023-01-18 22:29:51 -08:00
Peter Whidden
ff9085d0bc
fix typo (params -> tokens) 2023-01-18 21:17:15 -05:00
Andrej Karpathy
8dd2061e4d fix temperature comment, slightly wrong 2023-01-18 16:10:05 +00:00
Andrej Karpathy
2b083fbfde the badge is a bit ugly, move it down to troubleshooting section 2023-01-18 03:16:59 +00:00
Andrej Karpathy
aa8e4c2546 screwed up the link, fix 2023-01-18 03:11:31 +00:00
Andrej Karpathy
6dab32c003 experimenting with badges, and discord link to start specifically. issues sometimes feel a little too heavy 2023-01-18 03:09:42 +00:00
リョウゼ
be571fff2c
Improve readability of huge numbers
Before:
  length of dataset in characters:  1115394
  all the unique characters: 
   !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
  vocab size: 65
  train has 1003854 tokens
  val has 111540 tokens

After:
  length of dataset in characters: 1,115,394
  all the unique characters: 
   !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
  vocab size: 65
  train has 1,003,854 tokens
  val has 111,540 tokens
2023-01-16 22:05:32 +01:00
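The change presumably boils down to Python's thousands-separator format spec, e.g.:

  n = 1115394
  print(f"length of dataset in characters: {n:,}")  # -> 1,115,394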
Andrej Karpathy
7f74652843 add docs on multinode training to main README too 2023-01-16 17:11:02 +00:00
Andrej Karpathy
46ce9971df small tweaks to docs and variable names stylistically 2023-01-16 16:56:05 +00:00
Andrej Karpathy
684800dd87 clarify that these should be run on two separate machines 2023-01-16 06:02:46 +00:00
Andrej Karpathy
9352df23de docs for multinode ddp 2023-01-16 05:57:33 +00:00
Andrej Karpathy
c3dddbff3d get rid of gpu_id, the world is more complicated than that when world_size > 8 2023-01-16 05:44:50 +00:00
Andrej Karpathy
f5e6ac8b02 local rank -> rank 2023-01-16 05:13:13 +00:00
Andrej Karpathy
cf99914886 add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR 2023-01-15 17:49:55 +00:00
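A minimal sketch of the loop shape this adds; get_batch, model, optimizer, and gradient_accumulation_steps are the repo's names, used here illustratively. The loss is scaled down per micro-step so the accumulated gradient matches one large batch.

  for micro_step in range(gradient_accumulation_steps):
      X, Y = get_batch('train')
      logits, loss = model(X, Y)
      (loss / gradient_accumulation_steps).backward()  # average over micro-steps
  optimizer.step()
  optimizer.zero_grad(set_to_none=True)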
Andrej Karpathy
89da79eee1 add note of caution for the produced warning, investigate later 2023-01-14 20:38:22 +00:00
Andrej Karpathy
7d7ded25ce a bit better settings... for a single gpu at least. these settings would fry a simple cpu though i think 2023-01-14 03:59:53 +00:00
Andrej Karpathy
91d02510ce fix bug... if topk > vocab_size, torch.topk will throw error 2023-01-14 03:57:00 +00:00
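torch.topk(logits, k) raises an error when k exceeds the last dimension, so the fix clamps k to the vocab size. A runnable sketch:

  import torch

  logits = torch.randn(1, 65)  # e.g. a char-level model with vocab_size = 65
  top_k = 200                  # larger than the vocab
  v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
  logits[logits < v[:, [-1]]] = -float('inf')  # keep only the top-k logits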
Andrej Karpathy
57735f532d correctly propagate the vocab_size from the rendered dataset into the model args 2023-01-14 02:26:44 +00:00
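A hedged sketch of the plumbing, assuming each dataset's prepare.py writes a meta.pkl holding vocab_size next to the binary shards (the path layout and key name are assumptions):

  import os
  import pickle

  meta_path = os.path.join('data', 'shakespeare_char', 'meta.pkl')
  meta_vocab_size = None
  if os.path.exists(meta_path):
      with open(meta_path, 'rb') as f:
          meta = pickle.load(f)
      meta_vocab_size = meta['vocab_size']  # flows into the model args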