mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-12-18 14:10:28 +00:00

openwebtext dataset

after running prepare.py (preprocess) we get:

  • train.bin is ~17GB, val.bin ~8.5MB
  • train has ~9B tokens (9,035,582,198)
  • val has ~4M tokens (4,434,897)
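the file sizes line up with the token counts if each token takes 2 bytes on disk. that dtype is an assumption here (this readme does not state it); a quick sanity check of the arithmetic under that assumption:

```python
# Assumption (not stated in this readme): each token is stored as a
# 2-byte unsigned integer, e.g. uint16.
BYTES_PER_TOKEN = 2

# Token counts copied from the list above.
train_tokens = 9_035_582_198
val_tokens = 4_434_897


def size_bytes(n_tokens, bytes_per_token=BYTES_PER_TOKEN):
    """Expected on-disk size of a flat token file, in bytes."""
    return n_tokens * bytes_per_token


# Convert to binary units (GiB / MiB) for comparison with the sizes above.
train_gib = size_bytes(train_tokens) / 2**30
val_mib = size_bytes(val_tokens) / 2**20
print(f"train.bin ~ {train_gib:.1f} GiB, val.bin ~ {val_mib:.1f} MiB")
```

with 2 bytes per token this gives roughly 16.8 GiB and 8.5 MiB, which matches the ~17GB and ~8.5MB figures if those are read as binary units.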

this came from 8,013,769 documents in total.
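for a rough sense of scale, the counts above also give the average document length and the share of tokens that ended up in val. this is only arithmetic on the numbers in this readme; the average includes whatever per-document separator tokens the prepare script wrote, if any:

```python
# Numbers copied from this readme.
train_tokens = 9_035_582_198
val_tokens = 4_434_897
n_documents = 8_013_769

total_tokens = train_tokens + val_tokens

# Mean tokens per document across train and val combined.
avg_tokens_per_doc = total_tokens / n_documents

# Share of all tokens that landed in the validation split.
val_fraction = val_tokens / total_tokens

print(f"avg tokens per document: {avg_tokens_per_doc:.0f}")
print(f"val share of tokens: {val_fraction:.5f}")
```

that works out to a bit over 1,100 tokens per document, with val holding roughly 0.05% of all tokens.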

references: