openwebtext dataset
after running prepare.py (the preprocessing script) we get:
- train.bin is ~17GB, val.bin ~8.5MB
- train has ~9B tokens (9,035,582,198)
- val has ~4M tokens (4,434,897)
this came from 8,013,769 documents in total.
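
as a rough sanity check, the file sizes and token counts above are consistent with each token taking 2 bytes on disk. the sketch below assumes that (uint16 token ids), which this readme does not state; the constant and helper names are hypothetical and only for illustration.

```python
import os

# Assumption (not stated in this readme): each token id is stored as a
# 2-byte unsigned integer (uint16) in train.bin / val.bin.
BYTES_PER_TOKEN = 2

# Figures quoted in this readme.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
NUM_DOCUMENTS = 8_013_769


def size_in_bytes(num_tokens, bytes_per_token=BYTES_PER_TOKEN):
    """Expected on-disk size for a flat array of token ids."""
    return num_tokens * bytes_per_token


def tokens_in_file(path, bytes_per_token=BYTES_PER_TOKEN):
    """Token count implied by a .bin file's size (hypothetical helper)."""
    return os.path.getsize(path) // bytes_per_token


train_gib = size_in_bytes(TRAIN_TOKENS) / 2**30
val_mib = size_in_bytes(VAL_TOKENS) / 2**20
val_fraction = VAL_TOKENS / (TRAIN_TOKENS + VAL_TOKENS)
tokens_per_doc = (TRAIN_TOKENS + VAL_TOKENS) / NUM_DOCUMENTS

print(f"train.bin: {train_gib:.2f} GiB")
print(f"val.bin:   {val_mib:.2f} MiB")
print(f"val share of all tokens: {val_fraction:.4%}")
print(f"average tokens per document: {tokens_per_doc:.0f}")
```

under that assumption this works out to roughly 16.8 GiB for train.bin and 8.5 MiB for val.bin, matching the sizes listed above, with an average of about 1,128 tokens per document. `tokens_in_file("train.bin")` can be used the other way round, to recover a token count from a file on disk.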
references:
- OpenAI's WebText dataset is discussed in the GPT-2 paper
- OpenWebText dataset
