mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-12-18 22:20:29 +00:00
Commit Graph

132 Commits

Author SHA1 Message Date
Andrej Karpathy
3341b4cecc oops forgot to subtract embedding params, which don't enter the 6ND equation 2023-02-04 22:33:35 +00:00
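The 6ND rule of thumb counts only non-embedding parameters: training compute C ≈ 6·N·D FLOPs for N non-embedding parameters and D training tokens. A minimal sketch of the arithmetic (hypothetical helper, not the notebook's exact code):

```python
def flops_6nd(n_layer: int, n_embd: int, tokens: float) -> float:
    # non-embedding parameters of a GPT-style decoder:
    # per block roughly 4*n_embd^2 (attention) + 8*n_embd^2 (MLP) = 12*n_embd^2
    n_nonembedding = n_layer * 12 * n_embd ** 2
    return 6 * n_nonembedding * tokens  # C ~= 6 N D, embeddings deliberately excluded from N

# e.g. a GPT-2-small-shaped model (12 layers, n_embd=768) on 300e9 tokens
print(f"{flops_6nd(12, 768, 300e9):.3e} FLOPs")
```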
Andrej Karpathy
5a162bc773 fix silly error, i don't want to confuse a future GPT training on this notebook 2023-02-04 22:11:16 +00:00
Andrej Karpathy
0bb96d3fff add reference for 6ND to notebook too 2023-02-04 22:07:32 +00:00
Andrej Karpathy
eae986c2d2 new notebook with a bunch of calculations related to flops and memory of Transformer 2023-02-04 22:02:53 +00:00
Andrej Karpathy
a74e8363a2 clean up TODOs a bit, they are stale 2023-02-04 21:11:25 +00:00
Andrej Karpathy
25d95dbd65 mildly dramatic refactor for handling all these use cases across all possible supported and unsupported devices for all the possible switches and flags 2023-02-04 21:06:17 +00:00
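A hedged sketch of the kind of device/dtype plumbing this refactor handles (variable names assumed, not necessarily the exact code): pick a device type and an autocast context that degrades gracefully when CUDA or bf16 is unavailable.

```python
import torch
from contextlib import nullcontext

device_type = 'cuda' if torch.cuda.is_available() else 'cpu'
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
# no autocast on plain CPU float32; otherwise autocast in the requested dtype
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)
```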
Andrej Karpathy
e108ffb973 very slight refactor, bit cleaner 2023-02-04 19:34:24 +00:00
Andrej
dc149891b6
Merge pull request #120 from nynyg/remove_cpu_pin_mem
Pin memory only when training on GPU
2023-02-04 11:28:08 -08:00
Nan Yang
b8286f343e Pin memory only when training on GPU 2023-02-04 11:16:26 -08:00
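Pinned (page-locked) host memory only pays off for asynchronous host-to-CUDA copies, so the change gates it on the device. A small sketch under that assumption:

```python
import torch

def to_device(x: torch.Tensor, device: str) -> torch.Tensor:
    if 'cuda' in device:
        # pinned memory enables a non-blocking DMA copy to the GPU
        return x.pin_memory().to(device, non_blocking=True)
    # on cpu/mps, pinning is pointless overhead
    return x.to(device)
```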
Andrej Karpathy
77e7e04c26 padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest-deal smallest optimization i've made in the recent past, about 25% faster. this is because the last layer is a major bottleneck, consuming about 40% of latency due to the very high channel count. 2023-02-04 16:06:18 +00:00
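The padding is just rounding the vocabulary up to the next multiple of 64 (50257 → 786 × 64 = 50304) so the final projection runs on tensor-core-friendly shapes; the extra token ids are simply never produced by the tokenizer. A worked sketch:

```python
def pad_vocab(vocab_size: int, multiple: int = 64) -> int:
    # round up to the nearest multiple of `multiple`
    return ((vocab_size + multiple - 1) // multiple) * multiple

assert pad_vocab(50257) == 50304  # 47 padding ids that are never used
```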
Andrej Karpathy
b3c17c6c6a slight tweak compressing LOC 2023-02-04 15:57:29 +00:00
Andrej
53d56b82f1
Merge pull request #116 from ramtingh/master
Minor change to allow using ddp with exclusive process mode
2023-02-04 07:42:32 -08:00
Ramtin Gharleghi
9da1627c7f
Explicitly set ddp device 2023-02-04 15:07:36 +11:00
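A hedged sketch of what "explicitly set ddp device" means in practice (environment variable and names assumed): bind each rank to its own GPU rather than relying on the default device, which matters when GPUs run in exclusive process mode.

```python
import os
import torch

ddp_local_rank = int(os.environ.get('LOCAL_RANK', 0))  # set by torchrun
device = f'cuda:{ddp_local_rank}'
if torch.cuda.is_available():
    torch.cuda.set_device(device)  # make this rank's GPU the default CUDA device
```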
Andrej Karpathy
3fd4c0c5ef who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol 2023-02-04 02:52:48 +00:00
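The trick, sketched below with assumed names (get_batch, model, optimizer, max_iters are hypothetical stand-ins): request the next batch right after the forward pass so the host-to-device transfer, issued from pinned memory with non_blocking copies, overlaps with the backward pass on the current batch.

```python
X, Y = get_batch('train')            # assumed helper; issues an async copy to the GPU
for step in range(max_iters):
    logits, loss = model(X, Y)
    X, Y = get_batch('train')        # prefetch the next batch while the GPU is still busy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```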
Andrej
46428d3142
Merge pull request #115 from akashmjn/akashmjn/fix-notebook-stats
add template .gitattributes that fixes language stats
2023-02-03 17:23:44 -08:00
Akash Mahajan
d9a73374ed
keep only what's needed 2023-02-03 15:13:13 -08:00
Andrej Karpathy
3969860ff5 include launch command too. anyone should be able to do this now 2023-02-03 22:17:05 +00:00
Andrej Karpathy
f9348f3f18 add gpt2 training config 2023-02-03 22:14:37 +00:00
Akash Mahajan
0e2c12b5ae
add template .gitattributes that fixes language stats 2023-02-03 13:36:36 -08:00
Andrej Karpathy
e170e40872 use the new fused AdamW from pytorch nightly, if available 2023-02-03 17:56:51 +00:00
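A hedged sketch of the feature probe (model and hyperparameters assumed): only pass fused=True when this PyTorch build actually exposes the argument.

```python
import inspect
import torch

fused_ok = 'fused' in inspect.signature(torch.optim.AdamW).parameters and torch.cuda.is_available()
extra_args = dict(fused=True) if fused_ok else dict()
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), **extra_args)
```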
Andrej
7d44bdf6b5
Merge pull request #106 from YassineYousfi/master
use the ``enabled`` arg in GradScaler
2023-02-02 17:23:22 -08:00
Andrej Karpathy
1e87509e47 if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh 2023-02-02 23:22:56 +00:00
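Roughly, the gate looks like the snippet below (a fragment that would sit inside the attention module's __init__; names assumed): prefer flash attention only when the kernel exists and dropout is off, and warn instead of assert-failing otherwise.

```python
import torch

# self.flash decides between F.scaled_dot_product_attention and the manual path
self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention') and dropout == 0.0
if not self.flash:
    print("WARNING: using slow attention; flash attention needs PyTorch >= 2.0 and dropout == 0.0")
```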
Andrej Karpathy
d8b1a94519 change grad accum to default off because i think it just confuses everyone 2023-02-02 18:38:49 +00:00
Andrej Karpathy
d01863ef01 small usability tweaks to bench 2023-02-02 17:23:46 +00:00
Yassine Yousfi
40f4d6ff70 use the enabled arg in GradScaler 2023-01-31 21:12:49 -08:00
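The point of the enabled flag, sketched: construct the scaler unconditionally but make it a no-op unless the training dtype is float16, so the rest of the loop never has to branch.

```python
import torch

dtype = 'float16'  # assumed config value; could also be 'bfloat16' or 'float32'
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
# with enabled=False, scale()/unscale_()/step()/update() all degrade to pass-throughs
```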
Andrej Karpathy
d995c22128 fix bug with loading GPT-2 parameters: the assert gets incorrectly tripped because .bias may be missing, since it is now only optionally present depending on whether flash is used 2023-02-01 02:05:34 +00:00
Andrej Karpathy
038ce89438 rename iter to it, because iter is a concrete Python builtin 2023-01-31 23:34:02 +00:00
Andrej Karpathy
d2705bd92a tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap 2023-01-31 21:57:07 +00:00
Andrej Karpathy
4386bce1f4 adjust teaser figure with a more tuned result 2023-01-31 21:43:30 +00:00
Andrej Karpathy
924a0873eb merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training 2023-01-30 23:40:35 +00:00
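The careful part: with fp16 + GradScaler the gradients are still multiplied by the loss scale after backward, so they must be unscaled before clipping or the threshold applies to the wrong magnitudes. A sketch with assumed names (scaler, loss, optimizer, model, grad_clip):

```python
import torch

scaler.scale(loss).backward()
if grad_clip != 0.0:
    scaler.unscale_(optimizer)                                     # bring grads back to true scale
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
scaler.step(optimizer)                                             # skips the step if grads overflowed
scaler.update()
optimizer.zero_grad(set_to_none=True)
```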
Andrej Karpathy
ae06d0b15a add flash attention support, resolving the last few issues; for now it seems to work ok 2023-01-30 23:18:26 +00:00
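The fast path boils down to one call to PyTorch 2.0's fused kernel; a sketch assuming q, k, v shaped (B, n_head, T, head_dim) and a dropout_p/training pair coming from the module:

```python
import torch.nn.functional as F

y = F.scaled_dot_product_attention(
    q, k, v,
    attn_mask=None,
    dropout_p=dropout_p if training else 0.0,
    is_causal=True,   # the causal mask is applied inside the kernel, no explicit tril needed
)
```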
Andrej Karpathy
0e90ee9d48 based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup 2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7 stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2 2023-01-27 20:51:50 +00:00
Andrej Karpathy
79dbe0086d let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality 2023-01-27 20:45:28 +00:00
Andrej Karpathy
e808a67149 bunch of plumbing of bias all around. measuring bias=False to be about 6% faster 2023-01-27 20:41:17 +00:00
Andrej Karpathy
cc5444e194 add the bias option to config, default it to True for now 2023-01-27 20:29:45 +00:00
Andrej Karpathy
2bf07a3fbf rewrite model class so layernorm has an optional bias= parameter 2023-01-27 20:17:32 +00:00
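At the time nn.LayerNorm exposed no bias=False switch, so the natural move is to wrap F.layer_norm. A sketch in that spirit (not necessarily the commit's exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerNorm(nn.Module):
    """LayerNorm with an optional bias term."""

    def __init__(self, ndim: int, bias: bool):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ndim))
        self.bias = nn.Parameter(torch.zeros(ndim)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.layer_norm(x, self.weight.shape, self.weight, self.bias, 1e-5)
```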
Andrej Karpathy
2892858ce7 attempt a non-biased model, per few papers that cite this as working well 2023-01-27 18:54:08 +00:00
Andrej Karpathy
f29a9ff5bf ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default. 2023-01-27 17:56:18 +00:00
Andrej Karpathy
23a0bfac20 try bringing back the mingpt init 2023-01-27 16:52:18 +00:00
Andrej Karpathy
3cb3fc059c grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think it becomes more necessary as the model gets larger. practitioners may consider turning it off for minor efficiency gains 2023-01-27 16:45:09 +00:00
Andrej Karpathy
e0c689cf38 allow the prompt to come from a file 2023-01-25 01:12:43 +00:00
Andrej Karpathy
21675d7755 allow sample.py to init from a pretrained gpt2 checkpoint as well, in a similar style to train.py 2023-01-25 00:55:29 +00:00
johnwildauer
e0e94a1094 use GradScaler in model only if dtype is float16 2023-01-24 15:53:31 -07:00
Andrej
6c40a08b41
Merge pull request #82 from danielgross/master
Missed two spots while relative pathing
2023-01-22 13:47:32 -08:00
DG
2f7fd0ac57 add relative import in shakespeare 2023-01-22 12:18:24 -08:00
DG
bf779456f3 add relative import in shakespeare_char 2023-01-22 11:11:25 -08:00
Andrej
3611338959
Merge pull request #71 from cchan/patch-1
Zero-grad more aggressively to save memory
2023-01-20 14:38:10 -08:00
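The memory saving here comes from optimizer.zero_grad(set_to_none=True): instead of filling the .grad tensors with zeros it drops them entirely, so their storage can be reclaimed between iterations. Sketch (optimizer assumed):

```python
optimizer.zero_grad(set_to_none=True)  # .grad becomes None rather than a zeroed tensor
```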
Andrej Karpathy
1f77d03024 make mentions of mps in docs. ty good people in issue #28 2023-01-20 21:28:20 +00:00
Andrej
a6bffeee59
Merge pull request #73 from danielgross/master
Use relative paths
2023-01-20 12:21:33 -08:00