nanogpt-experiments

mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-09-21 11:49:46 +00:00

Author	SHA1	Message	Date
Andrej Karpathy	8b1e43209e	small tweaks, make default WD be 0.1 as is often cited, and remove spurious init of LayerNorm, which is already initialized at 1,0	2023-02-06 23:07:25 +00:00
Andrej Karpathy	ab21d6c15d	bugfix we have to call the raw_model's estimate_mfu ty @jprobichaud for original PR	2023-02-06 19:55:35 +00:00
Andrej Karpathy	ab0718a7dd	add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving	2023-02-05 00:48:58 +00:00
Andrej Karpathy	a74e8363a2	clean up TODOs a bit, they are stale	2023-02-04 21:11:25 +00:00
Andrej Karpathy	25d95dbd65	mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags	2023-02-04 21:06:17 +00:00
Andrej Karpathy	e108ffb973	very slight refactor, bit cleaner	2023-02-04 19:34:24 +00:00
Nan Yang	b8286f343e	Pin memory only when training on GPU	2023-02-04 11:16:26 -08:00
Andrej Karpathy	77e7e04c26	padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count.	2023-02-04 16:06:18 +00:00
Andrej Karpathy	b3c17c6c6a	slight tweak compressing LOC	2023-02-04 15:57:29 +00:00
Ramtin Gharleghi	9da1627c7f	Explicitly set ddp device	2023-02-04 15:07:36 +11:00
Andrej Karpathy	3fd4c0c5ef	who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol	2023-02-04 02:52:48 +00:00
Andrej	7d44bdf6b5	Merge pull request #106 from YassineYousfi/master use the ``enabled`` arg in GradScaler	2023-02-02 17:23:22 -08:00
Andrej Karpathy	d8b1a94519	change grad accum to default off because i think it just confuses everyone	2023-02-02 18:38:49 +00:00
Yassine Yousfi	40f4d6ff70	use the enabled arg in GradScaler	2023-01-31 21:12:49 -08:00
Andrej Karpathy	038ce89438	rename iter to it, because iter is a concrete Python builtin	2023-01-31 23:34:02 +00:00
Andrej Karpathy	924a0873eb	merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training	2023-01-30 23:40:35 +00:00
Andrej Karpathy	0e90ee9d48	based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup	2023-01-30 08:07:58 +00:00
Andrej Karpathy	001c1e7be7	stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2	2023-01-27 20:51:50 +00:00
Andrej Karpathy	79dbe0086d	let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality	2023-01-27 20:45:28 +00:00
Andrej Karpathy	e808a67149	bunch of plumbing of bias all around. measuring bias=False to be about 6% faster	2023-01-27 20:41:17 +00:00
Andrej Karpathy	3cb3fc059c	grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains	2023-01-27 16:45:09 +00:00
johnwildauer	e0e94a1094	use GradScaler in model only if dtype is float16	2023-01-24 15:53:31 -07:00
Andrej	3611338959	Merge pull request #71 from cchan/patch-1 Zero-grad more aggressively to save memory	2023-01-20 14:38:10 -08:00
Andrej Karpathy	1f77d03024	make mentions of mps in docs. ty good people in issue #28	2023-01-20 21:28:20 +00:00
Clive Chan	67166079c9	Zero-grad more aggressively to save memory	2023-01-19 22:10:44 -08:00
Andrej Karpathy	46ce9971df	small tweaks to docs and variable names stylistically	2023-01-16 16:56:05 +00:00
Andrej Karpathy	684800dd87	clarify that these should be run on two separate machines	2023-01-16 06:02:46 +00:00
Andrej Karpathy	9352df23de	docs for multinode ddp	2023-01-16 05:57:33 +00:00
Andrej Karpathy	c3dddbff3d	get rid of gpu_id, the world is more complicated than that when world_size > 8	2023-01-16 05:44:50 +00:00
Andrej Karpathy	f5e6ac8b02	local rank -> rank	2023-01-16 05:13:13 +00:00
Andrej Karpathy	cf99914886	add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR	2023-01-15 17:49:55 +00:00
Andrej Karpathy	57735f532d	correctly propagate the vocab_size from the rendered dataset into the model args	2023-01-14 02:26:44 +00:00
Andrej Karpathy	8f85b83347	inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token	2023-01-12 06:02:50 +00:00
Andrej Karpathy	d17350a31d	add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders.	2023-01-11 05:27:19 +00:00
Andrej Karpathy	c2a402f7f7	guess the config from globals() and log all of it with wandb	2023-01-11 01:00:22 +00:00
Andrej Karpathy	a855d316fd	add device and dtype support to train.py args	2023-01-08 19:20:38 +00:00
Luca Antiga	09f1f458e8	Move conditional import	2023-01-08 15:51:50 +01:00
Luca Antiga	aba47f0a35	Make wandb import conditioned to wandb_log=True	2023-01-08 15:42:08 +01:00
Andrej Karpathy	9629093e53	minor args re-arranging and removing some spurious ones like wandb entity ty @tcapelle	2023-01-05 01:14:02 +00:00
Andrej Karpathy	d562b3e550	shuttling the poor mans configurator aside into its own file and adding it to all of train,sample,bench. because i am leaving args in globals() so i can avoid having to prepend every single variable with an args., i have to exec the configurator and the optional configs. so we're left with something very gross by standard convention but also quite simple and functional. ducks	2023-01-05 00:44:35 +00:00
Andrej Karpathy	9f95aca93e	better hyperparams for gpt2 124M model on A100 40GB. still uncertain about max_iters especially, and a bit about weight decay, betas	2023-01-03 17:45:49 +00:00
Andrej Karpathy	ec9b1f8182	add a patch to fix mysterious unwanted prefix in state dict? maybe remove later	2023-01-02 01:25:02 +00:00
Andrej Karpathy	35f51974c4	rename to compile it's shorter	2023-01-02 01:14:46 +00:00
Andrej Karpathy	2febf4463c	candidate changes to apis, have to think through more	2023-01-01 01:29:48 +00:00
Andrej Karpathy	5a725d9098	add torch.compile by default, shows almost 1.8X improvement in throughput nice	2022-12-30 00:07:13 +00:00
Andrej Karpathy	682a0ac8f1	properly resume training, also loading iter_num and best_val_loss from checkpoints	2022-12-29 18:23:15 +00:00
Andrej Karpathy	dea1507252	add support for DDP training. the scaling timings right now do not look good by default, have to dig more into	2022-12-29 05:06:07 +00:00
Andrej Karpathy	5d2b4807bf	adding a lightweight configurator that may be a terrible mistake lol. also adding configs to evaluate the baseline GPT2 versions released by OpenAI on OWT. we have some ways to go to match those numbers atm	2022-12-28 23:31:23 +00:00
Andrej Karpathy	c9fe00c0e9	small readme clarification and training script defaults changes	2022-12-28 01:45:55 +00:00
Andrej Karpathy	fe8042867c	first very bad commit	2022-12-28 00:58:19 +00:00

50 Commits