nanogpt-experiments/data/openwebtext at 0bb96d3fff31046eea294c30d73aedafe0b23fd3 - nanogpt-experiments - osmarks projects hosting

mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2026-03-19 14:19:45 +00:00

DG edb7a7eab0 use relative paths so that running the data prep scripts always create files in local folder, no matter where run from

2023-01-20 10:39:45 -08:00

readme.md

openwebtext dataset

after running prepare.py (the preprocessing script) we get:

train.bin is ~17GB, val.bin ~8.5MB
train has ~9B tokens (9,035,582,198)
val has ~4M tokens (4,434,897)
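
as a rough sanity check, the file sizes above can be reproduced from the token counts. this is only a sketch: it assumes each token id is stored as an unsigned 16-bit integer (2 bytes), which is enough for the 50,257-entry GPT-2 BPE vocabulary, and it reads the "GB"/"MB" figures as binary units (GiB/MiB). neither assumption is stated in this readme.

```python
# Assumption (not stated in the readme): each token id occupies 2 bytes
# (unsigned 16-bit), which fits the GPT-2 BPE vocabulary of 50,257 entries.
BYTES_PER_TOKEN = 2

# Token counts quoted in the readme.
train_tokens = 9_035_582_198
val_tokens = 4_434_897

train_bytes = train_tokens * BYTES_PER_TOKEN
val_bytes = val_tokens * BYTES_PER_TOKEN

# Convert to binary units (GiB = 2**30 bytes, MiB = 2**20 bytes).
train_gib = train_bytes / 2**30
val_mib = val_bytes / 2**20

print(f"train.bin: {train_bytes:,} bytes = {train_gib:.2f} GiB")
print(f"val.bin:   {val_bytes:,} bytes = {val_mib:.2f} MiB")
```

under those assumptions the results come out near 16.8 GiB and 8.5 MiB, in line with the ~17GB and ~8.5MB quoted above.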

this came from 8,013,769 documents in total.
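
for a sense of scale, the counts above imply an average document length and a validation share. a small sketch of that arithmetic follows; it assumes the 8,013,769 documents cover both the train and val splits, which is how "in total" is read here.

```python
# Figures quoted in the readme.
train_tokens = 9_035_582_198
val_tokens = 4_434_897
documents = 8_013_769  # assumed to cover train and val together

total_tokens = train_tokens + val_tokens

# Mean number of tokens per document across both splits.
avg_tokens_per_doc = total_tokens / documents

# Share of all tokens that ended up in the validation split.
val_fraction = val_tokens / total_tokens

print(f"total tokens:        {total_tokens:,}")
print(f"avg tokens per doc:  {avg_tokens_per_doc:.0f}")
print(f"val share of tokens: {val_fraction:.4%}")
```

this gives roughly 1,100 tokens per document on average, with the validation split holding about 0.05% of all tokens.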

references:

OpenAI's WebText dataset is discussed in the GPT-2 paper
OpenWebText dataset