mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-12-18 06:00:29 +00:00
Commit Graph

149 Commits

Author SHA1 Message Date
Andrej Karpathy
553f949f46 fix minor bug where we have to scale the loss to account for gradient accumulation, which sums before backprop. note that this is not a major bug because AdamW is scale invariant. however, this did affect gradient clipping 2023-04-13 04:59:11 +00:00
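A minimal, self-contained sketch of the fix this commit describes (the toy model, optimizer, and variable names are illustrative, not the repo's exact code): dividing the loss by the number of micro-steps turns the accumulated gradient into an average rather than a sum, which is what gradient clipping expects.

```python
import torch

# toy stand-ins for the real model/optimizer (illustrative only)
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
grad_accum_steps = 4

optimizer.zero_grad(set_to_none=True)
for micro_step in range(grad_accum_steps):
    x, y = torch.randn(16, 8), torch.randn(16, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    # backward() sums gradients across micro-steps, so scale the loss
    # to make the accumulated gradient an average rather than a sum
    (loss / grad_accum_steps).backward()
# gradient clipping now sees correctly scaled gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
```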
Andrej
a82b33b525
Merge pull request #199 from ChristianOrr/patch-1
bugfix in decode function
2023-03-12 13:40:20 -07:00
Christian Orr
36c7db8c44
bugfix in decode function
Return was left out of the decoder, so it didn't work.
2023-03-08 10:16:19 +02:00
Andrej
0d8fbd11ae
Merge pull request #195 from drisspg/enable_sdpa_with_nonzero_dropout
Enable sdpa for nonzero dropout
2023-03-06 21:47:20 -08:00
Driss Guessous
6170531b8a enable sdpa for nonzero dropout 2023-03-05 19:29:29 +00:00
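A sketch of what this change enables, assuming PyTorch 2.0+ and toy tensor sizes: the attention dropout probability can be passed straight into the fused `scaled_dot_product_attention` path instead of forcing a fallback to the manual implementation.

```python
import torch
import torch.nn.functional as F

B, H, T, D = 2, 4, 16, 32  # batch, heads, sequence length, head dim (toy sizes)
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
dropout_p = 0.1  # nonzero attention dropout, now usable with the fused kernel

# causal self-attention via the fused kernel, with dropout applied
y = F.scaled_dot_product_attention(q, k, v, attn_mask=None,
                                   dropout_p=dropout_p, is_causal=True)
print(y.shape)  # torch.Size([2, 4, 16, 32])
```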
Andrej
ae3a8d5fdd
Merge pull request #145 from otaviogood/gradAccumStability
fix for training stability on single GPU
2023-02-14 18:48:54 -08:00
Otavio Good
086ebe1822 fix for training stability on single GPU 2023-02-13 10:42:44 -08:00
Andrej Karpathy
55c5069696 fix misinformation in readme 2023-02-10 16:34:46 +00:00
Andrej Karpathy
e58f0cfa94 oops, i should not be multiplying by world_size to calculate mfu 2023-02-07 21:38:39 +00:00
Andrej Karpathy
8b1e43209e small tweaks: make the default WD 0.1, as is often cited, and remove the spurious init of LayerNorm, which is already initialized to weight=1, bias=0 2023-02-06 23:07:25 +00:00
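A quick check of why the explicit LayerNorm init was spurious: PyTorch already initializes `nn.LayerNorm` with weight 1 and bias 0, so re-initializing it to those values is a no-op.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(768)
# default elementwise affine parameters: weight = 1, bias = 0
print(torch.all(ln.weight == 1).item(), torch.all(ln.bias == 0).item())  # True True
```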
Andrej Karpathy
ab21d6c15d bugfix: we have to call the raw_model's estimate_mfu. ty @jprobichaud for the original PR 2023-02-06 19:55:35 +00:00
Andrej Karpathy
f83dd034e1 also add a sampling/inference section 2023-02-05 21:02:30 +00:00
Andrej Karpathy
23a8e701d2 revamp the readme file to be a bit better and more accessible, i hope 2023-02-05 19:31:32 +00:00
Andrej Karpathy
fce706cbe6 tune the hyperparams a bit, in configs 2023-02-05 19:31:18 +00:00
Andrej Karpathy
ab0718a7dd add the estimation of model flops utilization (MFU), a commonly reported metric that expresses achieved throughput as a fraction of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving 2023-02-05 00:48:58 +00:00
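A back-of-the-envelope version of the idea, assuming the ~6 FLOPs per parameter per token approximation (a fuller estimate would also count attention FLOPs); the 312 TFLOPS constant is the A100 bf16 peak from the commit, the throughput number below is made up for illustration.

```python
# rough MFU estimate: achieved FLOPS / A100 bf16 peak FLOPS
def estimate_mfu(n_params, tokens_per_sec, peak_flops=312e12):
    flops_per_token = 6 * n_params          # ~6 FLOPs per param per token
    achieved_flops = flops_per_token * tokens_per_sec
    return achieved_flops / peak_flops

# e.g. a 124M-param model pushing ~200k tokens/sec (illustrative number)
print(f"{estimate_mfu(124e6, 200_000):.1%}")
```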
Andrej Karpathy
580902617c oops optimizer now demands to know device_type 2023-02-05 00:43:15 +00:00
Andrej Karpathy
34720df284 make more accurate the way in which we count parameters. previous count incorrectly included the positional encoding params, when typically only the number of weight parameters is reported for these models 2023-02-04 23:51:18 +00:00
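The counting convention, sketched on a toy module; the class is a stand-in, but it mirrors the repo's approach of subtracting the position-embedding table (`wpe`) from the total, while the token embedding stays in the count because it is weight-tied with the output head in the real model.

```python
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self, vocab=50304, block=1024, dim=768):
        super().__init__()
        self.wte = nn.Embedding(vocab, dim)   # token embeddings (kept in the count)
        self.wpe = nn.Embedding(block, dim)   # position embeddings (excluded)
        self.mlp = nn.Linear(dim, dim)

    def get_num_params(self, non_embedding=True):
        n = sum(p.numel() for p in self.parameters())
        if non_embedding:
            n -= self.wpe.weight.numel()  # report only "weight" params, per convention
        return n

m = Toy()
print(m.get_num_params(), m.get_num_params(non_embedding=False))
```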
Andrej Karpathy
3341b4cecc oops forgot to subtract embedding params, which don't enter the 6ND equation 2023-02-04 22:33:35 +00:00
Andrej Karpathy
5a162bc773 fix silly error, i don't want to confuse a future GPT trained on this notebook 2023-02-04 22:11:16 +00:00
Andrej Karpathy
0bb96d3fff add reference for 6ND to notebook too 2023-02-04 22:07:32 +00:00
Andrej Karpathy
eae986c2d2 new notebook with a bunch of calculations related to flops and memory of Transformer 2023-02-04 22:02:53 +00:00
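For reference, the 6ND rule of thumb these notebook commits lean on: total training compute is roughly 6 × N (non-embedding parameters) × D (training tokens). A worked example with illustrative numbers:

```python
# training FLOPs ≈ 6 * N * D (scaling-laws rule of thumb)
N = 124e6   # non-embedding parameters (GPT-2 small scale, illustrative)
D = 300e9   # training tokens (illustrative)
print(f"{6 * N * D:.2e} FLOPs")  # ~2.23e+20
```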
Andrej Karpathy
a74e8363a2 clean up TODOs a bit, they are stale 2023-02-04 21:11:25 +00:00
Andrej Karpathy
25d95dbd65 mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags 2023-02-04 21:06:17 +00:00
Andrej Karpathy
e108ffb973 very slight refactor, bit cleaner 2023-02-04 19:34:24 +00:00
Andrej
dc149891b6
Merge pull request #120 from nynyg/remove_cpu_pin_mem
Pin memory only when training on GPU
2023-02-04 11:28:08 -08:00
Nan Yang
b8286f343e Pin memory only when training on GPU 2023-02-04 11:16:26 -08:00
Andrej Karpathy
77e7e04c26 padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest-impact, smallest-change optimization i've made in the recent past, about 25% faster. this is because the last layer is a major latency bottleneck, consuming about 40% of latency due to the very high channel count. 2023-02-04 16:06:18 +00:00
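The padding itself is one line of arithmetic: round the GPT-2 vocab size up to the next multiple of 64 so the final projection's dimensions are friendlier to the GPU kernels (the speedup figure is the commit's claim, not reproduced here).

```python
vocab_size = 50257                       # GPT-2 BPE vocab
padded = ((vocab_size + 63) // 64) * 64  # round up to the nearest multiple of 64
print(padded)                            # 50304
```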
Andrej Karpathy
b3c17c6c6a slight tweak compressing LOC 2023-02-04 15:57:29 +00:00
Andrej
53d56b82f1
Merge pull request #116 from ramtingh/master
Minor change to allow using ddp with exclusive process mode
2023-02-04 07:42:32 -08:00
Ramtin Gharleghi
9da1627c7f
Explicitly set ddp device 2023-02-04 15:07:36 +11:00
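A sketch of the DDP device setup this refers to, assuming CUDA GPUs and a torchrun launch (which sets `LOCAL_RANK`); pinning each process to its own GPU is what makes exclusive process mode (one process per GPU) behave.

```python
import os
import torch

# assumes launch via torchrun, which provides RANK / LOCAL_RANK / WORLD_SIZE
ddp_local_rank = int(os.environ.get('LOCAL_RANK', 0))
device = f'cuda:{ddp_local_rank}'
torch.cuda.set_device(device)  # explicitly bind this process to its own GPU
```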
Andrej Karpathy
3fd4c0c5ef who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol 2023-02-04 02:52:48 +00:00
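A self-contained sketch of the DataLoader-free batching the two data-pipeline commits above describe (pin host memory only when a GPU is the target, and fetch the next batch while the GPU is busy); the dataset, sizes, and names are illustrative.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
data = torch.randint(0, 50304, (100_000,))  # stand-in for the tokenized dataset

def get_batch(batch_size=8, block_size=64):
    ix = torch.randint(len(data) - block_size, (batch_size,)).tolist()
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    if device == 'cuda':
        # pin host memory only when moving to GPU, so the copy below can be
        # asynchronous and overlap with already-queued GPU compute
        x = x.pin_memory().to(device, non_blocking=True)
        y = y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

x, y = get_batch()        # first batch
for _ in range(3):
    # ... forward/backward on x, y would be launched here (async on the GPU) ...
    x, y = get_batch()    # immediately prefetch the next batch; the async
                          # host-to-device copy hides behind the compute above
```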
Andrej
46428d3142
Merge pull request #115 from akashmjn/akashmjn/fix-notebook-stats
add template .gitattributes that fixes language stats
2023-02-03 17:23:44 -08:00
Akash Mahajan
d9a73374ed
keep only what's needed 2023-02-03 15:13:13 -08:00
Andrej Karpathy
3969860ff5 include launch command too. anyone should be able to do this now 2023-02-03 22:17:05 +00:00
Andrej Karpathy
f9348f3f18 add gpt2 training config 2023-02-03 22:14:37 +00:00
Akash Mahajan
0e2c12b5ae
add template .gitattributes that fixes language stats 2023-02-03 13:36:36 -08:00
Andrej Karpathy
e170e40872 use the new fused AdamW from pytorch nightly, if available 2023-02-03 17:56:51 +00:00
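The feature-detection pattern the commit describes, sketched with a stand-in model: check whether this PyTorch build's AdamW accepts a `fused` argument and only pass it when running on CUDA.

```python
import inspect
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model
device_type = 'cuda' if torch.cuda.is_available() else 'cpu'

# use the fused AdamW kernel if this PyTorch build exposes it and we're on GPU
fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
extra_args = dict(fused=True) if fused_available and device_type == 'cuda' else dict()
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.1, **extra_args)
```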
Andrej
7d44bdf6b5
Merge pull request #106 from YassineYousfi/master
use the ``enabled`` arg in GradScaler
2023-02-02 17:23:22 -08:00
Andrej Karpathy
1e87509e47 if dropout > 0.0, disable Flash until the pytorch fix lands; don't assert-fail, sigh 2023-02-02 23:22:56 +00:00
Andrej Karpathy
d8b1a94519 change grad accum to default off because i think it just confuses everyone 2023-02-02 18:38:49 +00:00
Andrej Karpathy
d01863ef01 small usability tweaks to bench 2023-02-02 17:23:46 +00:00
Yassine Yousfi
40f4d6ff70 use the enabled arg in GradScaler 2023-01-31 21:12:49 -08:00
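What the `enabled` flag buys, sketched (the `dtype` string is an illustrative config value): construct the scaler unconditionally and let it become a no-op unless training is actually in float16, so the rest of the loop needs no branches.

```python
import torch

dtype = 'bfloat16'  # or 'float16' / 'float32'
# the scaler is a no-op unless we're really in fp16, so the training loop
# can call scaler.scale()/step()/update() unconditionally
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))
```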
Andrej Karpathy
d995c22128 fix bug with loading GPT-2 parameters: the assert gets incorrectly tripped because .bias is missing, since it is now only optionally present depending on whether flash is used 2023-02-01 02:05:34 +00:00
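The shape of the fix, as a hedged sketch: when copying Hugging Face GPT-2 weights across, ignore keys that are just the registered causal-mask buffers (the `.attn.bias` / `.attn.masked_bias` names below stand in for the checkpoint's keys), since the local model may or may not register an equivalent buffer depending on whether flash attention is available.

```python
# toy list of state-dict keys standing in for the HF GPT-2 checkpoint
sd_keys_hf = [
    'transformer.h.0.attn.c_attn.weight',
    'transformer.h.0.attn.bias',         # causal-mask buffer, not a learnable param
    'transformer.h.0.attn.masked_bias',  # likewise
    'transformer.h.0.mlp.c_fc.weight',
]
# drop the mask buffers before copying weights, so the key sets can be compared
sd_keys_hf = [k for k in sd_keys_hf
              if not k.endswith('.attn.bias') and not k.endswith('.attn.masked_bias')]
print(sd_keys_hf)
```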
Andrej Karpathy
038ce89438 rename iter to it, because iter is a Python builtin 2023-01-31 23:34:02 +00:00
Andrej Karpathy
d2705bd92a tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap 2023-01-31 21:57:07 +00:00
Andrej Karpathy
4386bce1f4 adjust teaser figure with a more tuned result 2023-01-31 21:43:30 +00:00
Andrej Karpathy
924a0873eb merge, make cleaner; be careful with gradient clipping when using the grad scaler for fp16 training 2023-01-30 23:40:35 +00:00
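The care needed here, sketched with a toy model: gradients must be unscaled before clipping, otherwise the clip threshold is compared against loss-scaled values.

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = torch.nn.Linear(8, 1).to(device)           # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))  # no-op on CPU

x, y = torch.randn(16, 8, device=device), torch.randn(16, 1, device=device)
loss = torch.nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.unscale_(optimizer)                         # bring grads back to true scale first...
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # ...then clip against the real threshold
scaler.step(optimizer)
scaler.update()
```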
Andrej Karpathy
ae06d0b15a add flash attention support, resolving the last few issues; for now it seems to work ok 2023-01-30 23:18:26 +00:00
Andrej Karpathy
0e90ee9d48 based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option only because removing them deviates from the gpt-2 setup 2023-01-30 08:07:58 +00:00
Andrej Karpathy
001c1e7be7 stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M tokens, reproducing gpt2 2023-01-27 20:51:50 +00:00
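For completeness, the arithmetic behind the "about 0.5M" default: with the GPT-2 reproduction settings the README describes, tokens per optimizer step are micro-batch × block size × grad accum × number of GPUs (the micro-batch of 12 and the 8-GPU node below are assumptions based on that setup).

```python
batch_size = 12          # micro-batch per GPU (assumed config value)
block_size = 1024        # sequence length
grad_accum = 5           # accumulation steps per GPU, per this commit
n_gpus = 8               # 8xA100 node assumed from the README
print(batch_size * block_size * grad_accum * n_gpus)  # 491520, i.e. ~0.5M tokens
```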