mirror of
https://github.com/osmarks/nanogpt-experiments.git
synced 2024-12-21 15:40:28 +00:00
openwebtext dataset
after running `prepare.py` (preprocess) we get:
- train.bin is ~17GB, val.bin ~8.5MB
- train has ~9B tokens (9,035,582,198)
- val has ~4M tokens (4,434,897)
this came from 8,013,769 documents in total.
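
as a rough sanity check, the file sizes above are consistent with each token taking 2 bytes. the sketch below assumes the `.bin` files are flat arrays of unsigned 16-bit token ids (this readme does not state the dtype, so treat it as an assumption); `expected_size_bytes` and `count_tokens` are hypothetical helper names used only for illustration.

```python
import os

# Assumption: each token is stored as an unsigned 16-bit integer (2 bytes).
BYTES_PER_TOKEN = 2

# Counts quoted in this readme.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
NUM_DOCUMENTS = 8_013_769


def expected_size_bytes(num_tokens, bytes_per_token=BYTES_PER_TOKEN):
    """Size of a flat token file holding num_tokens tokens."""
    return num_tokens * bytes_per_token


def count_tokens(path, bytes_per_token=BYTES_PER_TOKEN):
    """Infer the token count of an existing .bin file from its size on disk."""
    size = os.path.getsize(path)
    if size % bytes_per_token != 0:
        raise ValueError(f"{path}: size {size} is not a multiple of {bytes_per_token}")
    return size // bytes_per_token


train_gib = expected_size_bytes(TRAIN_TOKENS) / 2**30
val_mib = expected_size_bytes(VAL_TOKENS) / 2**20
total_tokens = TRAIN_TOKENS + VAL_TOKENS
avg_tokens_per_doc = total_tokens / NUM_DOCUMENTS
val_fraction = VAL_TOKENS / total_tokens

print(f"train.bin ~ {train_gib:.1f} GiB")  # 16.8 GiB
print(f"val.bin   ~ {val_mib:.1f} MiB")    # 8.5 MiB
print(f"avg tokens per document ~ {avg_tokens_per_doc:.0f}")
print(f"val fraction of all tokens ~ {val_fraction:.5f}")
```

if the files exist locally, `count_tokens("train.bin")` should return the train token count listed above, provided the 2-byte assumption holds.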
references:
- OpenAI's WebText dataset is discussed in the GPT-2 paper
- OpenWebText dataset