nanogpt-experiments

mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2025-09-04 11:57:58 +00:00

Author	SHA1	Message	Date
Andrej Karpathy	924a0873eb	merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training	2023-01-30 23:40:35 +00:00
Andrej Karpathy	ae06d0b15a	add flash attention support, resolving last few issues but for now seems to work ok	2023-01-30 23:18:26 +00:00
Andrej Karpathy	0e90ee9d48	based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup	2023-01-30 08:07:58 +00:00
Andrej Karpathy	001c1e7be7	stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2	2023-01-27 20:51:50 +00:00
Andrej Karpathy	79dbe0086d	let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality	2023-01-27 20:45:28 +00:00
Andrej Karpathy	e808a67149	bunch of plumbing of bias all around. measuring bias=False to be about 6% faster	2023-01-27 20:41:17 +00:00
Andrej Karpathy	cc5444e194	add the bias option to config, default it to True for now	2023-01-27 20:29:45 +00:00
Andrej Karpathy	2bf07a3fbf	rewrite model class so layernorm has an optional bias= parameter	2023-01-27 20:17:32 +00:00
Andrej Karpathy	2892858ce7	attempt a non-biased model, per few papers that cite this as working well	2023-01-27 18:54:08 +00:00
Andrej Karpathy	f29a9ff5bf	ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default.	2023-01-27 17:56:18 +00:00
Andrej Karpathy	23a0bfac20	try bring back mingpt init	2023-01-27 16:52:18 +00:00
Andrej Karpathy	3cb3fc059c	grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains	2023-01-27 16:45:09 +00:00
Andrej Karpathy	e0c689cf38	allow the prompt to come from a file	2023-01-25 01:12:43 +00:00
Andrej Karpathy	21675d7755	allow sample.py to init from a pretrained gpt2 checkpoints as well, in similar style to train.py	2023-01-25 00:55:29 +00:00
johnwildauer	e0e94a1094	use GradScaler in model only if dtype is float16	2023-01-24 15:53:31 -07:00
Andrej	6c40a08b41	Merge pull request #82 from danielgross/master Missed two spots while relative pathing	2023-01-22 13:47:32 -08:00
DG	2f7fd0ac57	add relative import in shakespeare	2023-01-22 12:18:24 -08:00
DG	bf779456f3	add relative import in shakespeare_char	2023-01-22 11:11:25 -08:00
venusatuluri	f9d8020f48	Fix decode fn in shakespeare_char/prepare.py	2023-01-21 06:14:16 +00:00
Andrej	3611338959	Merge pull request #71 from cchan/patch-1 Zero-grad more aggressively to save memory	2023-01-20 14:38:10 -08:00
Andrej Karpathy	1f77d03024	make mentions of mps in docs. ty good people in issue #28	2023-01-20 21:28:20 +00:00
Andrej	a6bffeee59	Merge pull request #73 from danielgross/master Use relative paths	2023-01-20 12:21:33 -08:00
DG	edb7a7eab0	use relative paths so that running the data prep scripts always create files in local folder, no matter where run from	2023-01-20 10:39:45 -08:00
Clive Chan	67166079c9	Zero-grad more aggressively to save memory	2023-01-19 22:10:44 -08:00
Andrej Karpathy	2c7806db6e	for consistency with previous commit	2023-01-19 23:10:51 +00:00
Andrej	c1c20a0311	Merge pull request #57 from ryouze/patch-1 Improve readability of huge numbers	2023-01-19 15:08:35 -08:00
Andrej	9e150b808e	Merge pull request #66 from PWhiddy/patch-1 fix typo ( params -> tokens)	2023-01-18 22:29:51 -08:00
Peter Whidden	ff9085d0bc	fix typo ( params -> tokens)	2023-01-18 21:17:15 -05:00
Andrej Karpathy	8dd2061e4d	fix temperature comment, slightly wrong	2023-01-18 16:10:05 +00:00
Andrej Karpathy	2b083fbfde	the badge is a bit ugly, move it down to troubleshooting section	2023-01-18 03:16:59 +00:00
Andrej Karpathy	aa8e4c2546	screwed up the link, fix	2023-01-18 03:11:31 +00:00
Andrej Karpathy	6dab32c003	experimenting with badges, and discord link to start specifically. issues sometimes feel a little too heavy	2023-01-18 03:09:42 +00:00
リョウゼ	be571fff2c	Improve readability of huge numbers Before: length of dataset in characters: 1115394 all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz vocab size: 65 train has 1003854 tokens val has 111540 tokens After: length of dataset in characters: 1,115,394 all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz vocab size: 65 train has 1,003,854 tokens val has 111,540 tokens	2023-01-16 22:05:32 +01:00
Andrej Karpathy	7f74652843	add docs on multinode training to main README too	2023-01-16 17:11:02 +00:00
Andrej Karpathy	46ce9971df	small tweaks to docs and variable names stylistically	2023-01-16 16:56:05 +00:00
Andrej Karpathy	684800dd87	clarify that these should be run on two separate machines	2023-01-16 06:02:46 +00:00
Andrej Karpathy	9352df23de	docs for multinode ddp	2023-01-16 05:57:33 +00:00
Andrej Karpathy	c3dddbff3d	get rid of gpu_id, the world is more complicated than that when world_size > 8	2023-01-16 05:44:50 +00:00
Andrej Karpathy	f5e6ac8b02	local rank -> rank	2023-01-16 05:13:13 +00:00
MicroPanda123	d5ee965974	Update README.md	2023-01-15 20:29:15 +00:00
Andrej Karpathy	cf99914886	add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR	2023-01-15 17:49:55 +00:00
Andrej Karpathy	89da79eee1	add note of caution for the produced warning, investigate later	2023-01-14 20:38:22 +00:00
Andrej Karpathy	7d7ded25ce	a bit better settings... for a single gpu at least. these settings would fry a simple cpu though i think	2023-01-14 03:59:53 +00:00
Andrej Karpathy	91d02510ce	fix bug... if topk > vocab_size, torch.topk will throw error	2023-01-14 03:57:00 +00:00
Andrej Karpathy	57735f532d	correctly propagate the vocab_size from the rendered dataset into the model args	2023-01-14 02:26:44 +00:00
Andrej Karpathy	43b37fd568	reverse the order, making sure that the final layer init is preserved, and becomes the token embedding instead of the other way around. otherwise the loss can be all messed up from a bad init	2023-01-14 02:16:10 +00:00
Andrej Karpathy	7c8288552b	tie the weights of lm_head.weight and transformer.wte.weight, i.e. the last linear layer of decoder and the token embeddings.	2023-01-14 01:00:55 +00:00
Andrej Karpathy	32b4f08d9d	it's true	2023-01-13 23:43:00 +00:00
Andrej Karpathy	3e0fd42579	more scaling laws, clarification, and add simple interpolation of Approach 2	2023-01-13 00:57:15 +00:00
Andrej Karpathy	8f85b83347	inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token	2023-01-12 06:02:50 +00:00
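
Commit 91d02510ce above notes that torch.topk throws an error when topk > vocab_size. The guard it describes can be sketched in plain Python without PyTorch. This is an illustration only, not the repository's code: the function name `top_k_filter` and the list-based logits are assumptions made for the example.

```python
import math

def top_k_filter(logits, top_k):
    """Keep the top_k largest logits and mask the rest with -inf.

    Hypothetical helper for illustration; assumes top_k >= 1 and a
    non-empty list of logits.
    """
    # Clamp k so it can never exceed the number of logits (the vocab size).
    k = min(top_k, len(logits))
    # The k-th largest value is the cutoff; ties at the cutoff are all kept.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

filtered = top_k_filter([1.0, 3.0, 2.0], top_k=2)
oversized = top_k_filter([1.0, 3.0, 2.0], top_k=200)  # k is clamped to 3
```

Without the `min(...)` clamp, the second call would index past the end of the sorted list, which is the plain-Python analogue of the error the commit message describes.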

206 Commits
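
Commit 001c1e7be7 sets grad accum to 5 so that the default batch size is about 0.5M. The arithmetic behind a figure like that can be sketched as below. Only the accumulation factor of 5 comes from the log; the micro-batch size of 12, the block size of 1024, and the 8 GPUs are assumed values chosen for illustration, and `tokens_per_step` is a hypothetical helper name.

```python
def tokens_per_step(micro_batch_size, block_size, grad_accum_steps, num_gpus):
    """Tokens consumed per optimizer step across all devices (hypothetical helper)."""
    return micro_batch_size * block_size * grad_accum_steps * num_gpus

# Assumed settings: only grad_accum_steps=5 is taken from the commit message.
total = tokens_per_step(micro_batch_size=12, block_size=1024,
                        grad_accum_steps=5, num_gpus=8)

# The "," format spec adds thousands separators, as in commit be571fff2c.
print(f"{total:,}")  # 491,520
```

Under these assumptions, the product is 491,520 tokens per step, which is consistent with "about 0.5M".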