Andrej Karpathy | 77e7e04c26 | padding 50257 -> 50304 vocab_size, the nearest multiple of 64. the biggest deal smallest optimization i've made in recent past, about 25% faster. this is because the last layer is a major latency bottleneck consuming about 40% of latency due to the very high channel count. | 2023-02-04 16:06:18 +00:00
Andrej Karpathy | b3c17c6c6a | slight tweak compressing LOC | 2023-02-04 15:57:29 +00:00
Andrej | 53d56b82f1 | Merge pull request #116 from ramtingh/master: Minor change to allow using ddp with exclusive process mode | 2023-02-04 07:42:32 -08:00
Ramtin Gharleghi | 9da1627c7f | Explicitly set ddp device | 2023-02-04 15:07:36 +11:00
Andrej Karpathy | 3fd4c0c5ef | who needs a dataloader? overlap the prefetching of the next batch with GPU compute, hiding the data loading latency entirely. this saves about 1ms lol | 2023-02-04 02:52:48 +00:00
Andrej | 46428d3142 | Merge pull request #115 from akashmjn/akashmjn/fix-notebook-stats: add template .gitattributes that fixes language stats | 2023-02-03 17:23:44 -08:00
Akash Mahajan | d9a73374ed | keep only what's needed | 2023-02-03 15:13:13 -08:00
Andrej Karpathy | 3969860ff5 | include launch command too. anyone should be able to do this now | 2023-02-03 22:17:05 +00:00
Andrej Karpathy | f9348f3f18 | add gpt2 training config | 2023-02-03 22:14:37 +00:00
Akash Mahajan | 0e2c12b5ae | add template .gitattributes that fixes language stats | 2023-02-03 13:36:36 -08:00
Andrej Karpathy | e170e40872 | use the new fused AdamW from pytorch nightly, if available | 2023-02-03 17:56:51 +00:00
Andrej | 7d44bdf6b5 | Merge pull request #106 from YassineYousfi/master: use the ``enabled`` arg in GradScaler | 2023-02-02 17:23:22 -08:00
Andrej Karpathy | 1e87509e47 | if dropout > 0.0 disable Flash until pytorch fix. don't assert fail sigh | 2023-02-02 23:22:56 +00:00
Andrej Karpathy | d8b1a94519 | change grad accum to default off because i think it just confuses everyone | 2023-02-02 18:38:49 +00:00
Andrej Karpathy | d01863ef01 | small usability tweaks to bench | 2023-02-02 17:23:46 +00:00
Yassine Yousfi | 40f4d6ff70 | use the enabled arg in GradScaler | 2023-01-31 21:12:49 -08:00
Andrej Karpathy | d995c22128 | fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not | 2023-02-01 02:05:34 +00:00
Andrej Karpathy | 038ce89438 | rename iter to it, because iter is a concrete Python builtin | 2023-01-31 23:34:02 +00:00
Andrej Karpathy | d2705bd92a | tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap | 2023-01-31 21:57:07 +00:00
Andrej Karpathy | 4386bce1f4 | adjust teaser figure with a more tuned result | 2023-01-31 21:43:30 +00:00
Andrej Karpathy | 924a0873eb | merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training | 2023-01-30 23:40:35 +00:00
Andrej Karpathy | ae06d0b15a | add flash attention support, resolving last few issues but for now seems to work ok | 2023-01-30 23:18:26 +00:00
Andrej Karpathy | 0e90ee9d48 | based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup | 2023-01-30 08:07:58 +00:00
Andrej Karpathy | 001c1e7be7 | stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2 | 2023-01-27 20:51:50 +00:00
Andrej Karpathy | 79dbe0086d | let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality | 2023-01-27 20:45:28 +00:00
Andrej Karpathy | e808a67149 | bunch of plumbing of bias all around. measuring bias=False to be about 6% faster | 2023-01-27 20:41:17 +00:00
Andrej Karpathy | cc5444e194 | add the bias option to config, default it to True for now | 2023-01-27 20:29:45 +00:00
Andrej Karpathy | 2bf07a3fbf | rewrite model class so layernorm has an optional bias= parameter | 2023-01-27 20:17:32 +00:00
Andrej Karpathy | 2892858ce7 | attempt a non-biased model, per few papers that cite this as working well | 2023-01-27 18:54:08 +00:00
Andrej Karpathy | f29a9ff5bf | ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default. | 2023-01-27 17:56:18 +00:00
Andrej Karpathy | 23a0bfac20 | try bring back mingpt init | 2023-01-27 16:52:18 +00:00
Andrej Karpathy | 3cb3fc059c | grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains | 2023-01-27 16:45:09 +00:00
Andrej Karpathy | e0c689cf38 | allow the prompt to come from a file | 2023-01-25 01:12:43 +00:00
Andrej Karpathy | 21675d7755 | allow sample.py to init from pretrained gpt2 checkpoints as well, in similar style to train.py | 2023-01-25 00:55:29 +00:00
johnwildauer | e0e94a1094 | use GradScaler in model only if dtype is float16 | 2023-01-24 15:53:31 -07:00
Andrej | 6c40a08b41 | Merge pull request #82 from danielgross/master: Missed two spots while relative pathing | 2023-01-22 13:47:32 -08:00
DG | 2f7fd0ac57 | add relative import in shakespeare | 2023-01-22 12:18:24 -08:00
DG | bf779456f3 | add relative import in shakespeare_char | 2023-01-22 11:11:25 -08:00
Andrej | 3611338959 | Merge pull request #71 from cchan/patch-1: Zero-grad more aggressively to save memory | 2023-01-20 14:38:10 -08:00
Andrej Karpathy | 1f77d03024 | make mentions of mps in docs. ty good people in issue #28 | 2023-01-20 21:28:20 +00:00
Andrej | a6bffeee59 | Merge pull request #73 from danielgross/master: Use relative paths | 2023-01-20 12:21:33 -08:00
DG | edb7a7eab0 | use relative paths so that running the data prep scripts always creates files in local folder, no matter where run from | 2023-01-20 10:39:45 -08:00
Clive Chan | 67166079c9 | Zero-grad more aggressively to save memory | 2023-01-19 22:10:44 -08:00
Andrej Karpathy | 2c7806db6e | for consistency with previous commit | 2023-01-19 23:10:51 +00:00
Andrej | c1c20a0311 | Merge pull request #57 from ryouze/patch-1: Improve readability of huge numbers | 2023-01-19 15:08:35 -08:00
Andrej | 9e150b808e | Merge pull request #66 from PWhiddy/patch-1: fix typo ( params -> tokens) | 2023-01-18 22:29:51 -08:00
Peter Whidden | ff9085d0bc | fix typo ( params -> tokens) | 2023-01-18 21:17:15 -05:00
Andrej Karpathy | 8dd2061e4d | fix temperature comment, slightly wrong | 2023-01-18 16:10:05 +00:00
Andrej Karpathy | 2b083fbfde | the badge is a bit ugly, move it down to troubleshooting section | 2023-01-18 03:16:59 +00:00
Andrej Karpathy | aa8e4c2546 | screwed up the link, fix | 2023-01-18 03:11:31 +00:00