nanogpt-experiments

mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2025-09-08 13:56:02 +00:00

Author	SHA1	Message	Date
Andrej Karpathy	2bf07a3fbf	rewrite model class so layernorm has an optional bias= parameter	2023-01-27 20:17:32 +00:00
Andrej Karpathy	2892858ce7	attempt a non-biased model, per few papers that cite this as working well	2023-01-27 18:54:08 +00:00
Andrej Karpathy	f29a9ff5bf	ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default.	2023-01-27 17:56:18 +00:00
Andrej Karpathy	23a0bfac20	try bring back mingpt init	2023-01-27 16:52:18 +00:00
Andrej Karpathy	3cb3fc059c	grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains	2023-01-27 16:45:09 +00:00
Andrej Karpathy	e0c689cf38	allow the prompt to come from a file	2023-01-25 01:12:43 +00:00
Andrej Karpathy	21675d7755	allow sample.py to init from a pretrained gpt2 checkpoint as well, in similar style to train.py	2023-01-25 00:55:29 +00:00
Andrej	6c40a08b41	Merge pull request #82 from danielgross/master Missed two spots while relative pathing	2023-01-22 13:47:32 -08:00
DG	2f7fd0ac57	add relative import in shakespeare	2023-01-22 12:18:24 -08:00
DG	bf779456f3	add relative import in shakespeare_char	2023-01-22 11:11:25 -08:00
Andrej	3611338959	Merge pull request #71 from cchan/patch-1 Zero-grad more aggressively to save memory	2023-01-20 14:38:10 -08:00
Andrej Karpathy	1f77d03024	make mentions of mps in docs. ty good people in issue #28	2023-01-20 21:28:20 +00:00
Andrej	a6bffeee59	Merge pull request #73 from danielgross/master Use relative paths	2023-01-20 12:21:33 -08:00
DG	edb7a7eab0	use relative paths so that running the data prep scripts always create files in local folder, no matter where run from	2023-01-20 10:39:45 -08:00
Clive Chan	67166079c9	Zero-grad more aggressively to save memory	2023-01-19 22:10:44 -08:00
Andrej Karpathy	2c7806db6e	for consistency with previous commit	2023-01-19 23:10:51 +00:00
Andrej	c1c20a0311	Merge pull request #57 from ryouze/patch-1 Improve readability of huge numbers	2023-01-19 15:08:35 -08:00
Andrej	9e150b808e	Merge pull request #66 from PWhiddy/patch-1 fix typo ( params -> tokens)	2023-01-18 22:29:51 -08:00
Peter Whidden	ff9085d0bc	fix typo ( params -> tokens)	2023-01-18 21:17:15 -05:00
Andrej Karpathy	8dd2061e4d	fix temperature comment, slightly wrong	2023-01-18 16:10:05 +00:00
Andrej Karpathy	2b083fbfde	the badge is a bit ugly, move it down to troubleshooting section	2023-01-18 03:16:59 +00:00
Andrej Karpathy	aa8e4c2546	screwed up the link, fix	2023-01-18 03:11:31 +00:00
Andrej Karpathy	6dab32c003	experimenting with badges, and discord link to start specifically. issues sometimes feel a little too heavy	2023-01-18 03:09:42 +00:00
リョウゼ	be571fff2c	Improve readability of huge numbers Before: length of dataset in characters: 1115394 all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz vocab size: 65 train has 1003854 tokens val has 111540 tokens After: length of dataset in characters: 1,115,394 all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz vocab size: 65 train has 1,003,854 tokens val has 111,540 tokens	2023-01-16 22:05:32 +01:00
Andrej Karpathy	7f74652843	add docs on multinode training to main README too	2023-01-16 17:11:02 +00:00
Andrej Karpathy	46ce9971df	small tweaks to docs and variable names stylistically	2023-01-16 16:56:05 +00:00
Andrej Karpathy	684800dd87	clarify that these should be run on two separate machines	2023-01-16 06:02:46 +00:00
Andrej Karpathy	9352df23de	docs for multinode ddp	2023-01-16 05:57:33 +00:00
Andrej Karpathy	c3dddbff3d	get rid of gpu_id, the world is more complicated than that when world_size > 8	2023-01-16 05:44:50 +00:00
Andrej Karpathy	f5e6ac8b02	local rank -> rank	2023-01-16 05:13:13 +00:00
Andrej Karpathy	cf99914886	add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR	2023-01-15 17:49:55 +00:00
Andrej Karpathy	89da79eee1	add note of caution for the produced warning, investigate later	2023-01-14 20:38:22 +00:00
Andrej Karpathy	7d7ded25ce	a bit better settings... for a single gpu at least. these settings would fry a simple cpu though i think	2023-01-14 03:59:53 +00:00
Andrej Karpathy	91d02510ce	fix bug... if topk > vocab_size, torch.topk will throw error	2023-01-14 03:57:00 +00:00
Andrej Karpathy	57735f532d	correctly propagate the vocab_size from the rendered dataset into the model args	2023-01-14 02:26:44 +00:00
Andrej Karpathy	43b37fd568	reverse the order, making sure that the final layer init is preserved, and becomes the token embedding instead of the other way around. otherwise the loss can be all messed up from a bad init	2023-01-14 02:16:10 +00:00
Andrej Karpathy	7c8288552b	tie the weights of lm_head.weight and transformer.wte.weight, i.e. the last linear layer of decoder and the token embeddings.	2023-01-14 01:00:55 +00:00
Andrej Karpathy	32b4f08d9d	it's true	2023-01-13 23:43:00 +00:00
Andrej Karpathy	3e0fd42579	more scaling laws, clarification, and add simple interpolation of Approach 2	2023-01-13 00:57:15 +00:00
Andrej Karpathy	8f85b83347	inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token	2023-01-12 06:02:50 +00:00
Andrej Karpathy	e21cbf887f	meant to set always_save_checkpoint to False instead, so we only write when val improves	2023-01-12 05:47:34 +00:00
Andrej Karpathy	c1ac2d58f1	including transformers as a dependency of the repo as well	2023-01-12 02:42:38 +00:00
Andrej Karpathy	7f51d17977	add note about windows and pytorch 2.0 and torch compile in general	2023-01-12 02:17:52 +00:00
Andrej Karpathy	bb49751439	oh no nanoGPT is trending quickly explain the character-level functionality I added late last night	2023-01-11 17:11:15 +00:00
Andrej Karpathy	d17350a31d	add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders.	2023-01-11 05:27:19 +00:00
Andrej Karpathy	c2a402f7f7	guess the config from globals() and log all of it with wandb	2023-01-11 01:00:22 +00:00
Andrej Karpathy	8b2e622b27	adjust the readme to reflect changes in the autocast branch	2023-01-08 19:40:46 +00:00
Andrej Karpathy	b77c2e86d3	copy pasting what seems to work to bench,sample as well. ty @lantiga	2023-01-08 19:32:13 +00:00
Andrej Karpathy	a855d316fd	add device and dtype support to train.py args	2023-01-08 19:20:38 +00:00
Andrej	e7cd674ce7	Merge pull request #20 from lantiga/wandb-optional-import Make wandb import conditioned to wandb_log=True	2023-01-08 10:19:40 -08:00
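
Commit 91d02510ce above notes that torch.topk throws an error when topk > vocab_size. The usual remedy is to clamp k to the vocabulary size before selecting. What follows is a minimal sketch of that idea in plain Python rather than PyTorch; the function name top_k_filter and its list-based interface are assumptions made for illustration and are not taken from the repository.

```python
import math

def top_k_filter(logits, top_k):
    """Keep the top_k largest logits and set the rest to -inf.

    Hypothetical helper: `logits` is a plain list of floats standing in
    for one row of model output, not a torch tensor.
    """
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    # Clamp k so it never exceeds the vocabulary size; without this,
    # indexing the sorted list below would fail for top_k > len(logits).
    k = min(top_k, len(logits))
    # The k-th largest value is the cutoff. Ties at the cutoff are all kept,
    # so more than k entries can survive when values repeat.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]
```

With a three-entry vocabulary, top_k_filter([1.0, 3.0, 2.0], 10) returns the logits unchanged instead of raising, which is the behavior the clamp is meant to guarantee.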


95 Commits