	tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap
		@@ -3,7 +3,7 @@
-The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education. Still under active development, but currently the file `train.py` reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in 38 hours of training. The code itself is plain and readable: `train.py` is a ~300-line boilerplate training loop and `model.py` a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it.
+The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education. Still under active development, but currently the file `train.py` reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training. The code itself is plain and readable: `train.py` is a ~300-line boilerplate training loop and `model.py` a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it.
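
On the "optionally load the GPT-2 weights from OpenAI" part: the repo does this inside `model.py`. The sketch below only illustrates where such weights can come from, via the HuggingFace `transformers` package, and is not the repo's own loading code:

```python
# Sketch only: pull the released GPT-2 (124M) weights and count parameters.
# Assumes the HuggingFace `transformers` package; model.py does its own loading.
from transformers import GPT2LMHeadModel

hf_model = GPT2LMHeadModel.from_pretrained("gpt2")  # the 124M checkpoint
n_params = sum(p.numel() for p in hf_model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ~124M

# A nanoGPT-style loader would then copy these tensors into its own GPT module,
# transposing the Conv1D weight matrices that HF uses where nn.Linear is expected.
```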
@@ -59,8 +59,6 @@ By default checkpoints are periodically written to the `--out_dir` (`./out` by default)
```
$ python sample.py
```
-Training on 1 A100 40GB GPU overnight currently gets loss ~3.74, training on 4 gets ~3.60. Training on an 8 x A100 40GB node for ~500,000 iters (~1 day) atm gets down to ~3.1. Random chance at init is -ln(1/50257) = 10.82. Which brings us to baselines.
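
The -ln(1/50257) figure is just the cross-entropy of a uniform guess over the 50257-token GPT-2 BPE vocabulary, easy to sanity-check:

```python
import math

vocab_size = 50257  # GPT-2 BPE vocabulary size
print(f"{-math.log(1 / vocab_size):.2f}")  # 10.82, loss of a uniform (random) prediction
```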
## baselines
OpenAI GPT-2 checkpoints allow us to get some baselines in place for openwebtext. We can get the numbers as follows:
@@ -81,7 +79,7 @@ and observe the following losses on train and val:
| gpt2-large | 774M   | 2.66  | 2.67     |
| gpt2-xl | 1558M     | 2.56  | 2.54     |
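
The hunk only shows the bottom two rows of the baseline table; each number is the average next-token cross-entropy of a released checkpoint on the OWT train/val split. A rough standalone way to estimate the val column, using the HuggingFace checkpoint rather than the repo's own evaluation code, and assuming OWT has been tokenized into a flat uint16 `val.bin` by `data/openwebtext/prepare.py`:

```python
# Illustration only: estimate GPT-2 (124M) val loss on OpenWebText.
# Assumes data/openwebtext/prepare.py has written val.bin as flat uint16 GPT-2
# BPE token ids; this uses HuggingFace directly, not the repo's evaluation path.
import numpy as np
import torch
from transformers import GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

data = np.memmap("data/openwebtext/val.bin", dtype=np.uint16, mode="r")
block_size, n_batches = 1024, 20
rng = torch.Generator().manual_seed(1337)
losses = []

with torch.no_grad():
    for _ in range(n_batches):
        i = torch.randint(len(data) - block_size, (1,), generator=rng).item()
        x = torch.from_numpy(data[i:i + block_size].astype(np.int64))[None].to(device)
        losses.append(model(x, labels=x).loss.item())  # mean next-token cross-entropy

print(f"approx val loss: {sum(losses) / len(losses):.2f}")  # compare against the table above
```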
-I briefly tried finetuning gpt2 a bit more on our OWT and didn't notice dramatic improvements, suggesting that OWT is not much different from WT in terms of the data distribution, but this needs a more thorough attempt once the code is in a better place.
+However, we have to note that GPT-2 was trained on (closed, never released) WebText, while OpenWebText is just a best-effort open reproduction of this dataset. This means there is a dataset domain gap. Indeed, taking the GPT-2 (124M) checkpoint and finetuning on OWT directly for a while brings the loss down to ~2.9. This then becomes the more appropriate baseline w.r.t. reproduction.
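
The repo handles that finetuning through `train.py`. Purely as an illustration of what "take the GPT-2 (124M) checkpoint and finetune on OWT" amounts to, here is a standalone sketch with made-up hyperparameters, again assuming the uint16 `train.bin` token file from `data/openwebtext/prepare.py`:

```python
# Standalone sketch (not train.py): finetune pretrained GPT-2 (124M) on OWT
# to close the WebText -> OpenWebText domain gap. Hyperparameters are
# illustrative; train.bin is assumed to be the flat uint16 token file
# written by data/openwebtext/prepare.py.
import numpy as np
import torch
from transformers import GPT2LMHeadModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.1)

data = np.memmap("data/openwebtext/train.bin", dtype=np.uint16, mode="r")
block_size, batch_size = 1024, 8

def get_batch():
    ix = torch.randint(len(data) - block_size, (batch_size,)).tolist()
    x = [torch.from_numpy(data[i:i + block_size].astype(np.int64)) for i in ix]
    return torch.stack(x).to(device)

for step in range(1000):  # short illustrative run; the ~2.9 above came from finetuning "for a while"
    x = get_batch()
    loss = model(x, labels=x).loss  # mean next-token cross-entropy on OWT
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, round(loss.item(), 3))
```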
## finetuning