nanogpt-experiments

mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2025-11-01 07:43:01 +00:00

Files

Oleksandr Kuvshynov 542ac51d1f nanogpt: fix multiprocessing in load_dataset on os x

The issue seems to be that _fixup_main_from_path in Python's multiprocessing
module is unable to find the entry point, thus adding
```
if __name__ == '__main__':
```

2023-06-17 20:35:38 -04:00
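As a rough illustration of why this guard matters, here is a minimal sketch using only the standard-library `multiprocessing` module. It is not the code from prepare.py: `count_words`, `main`, and the sample documents are hypothetical stand-ins for the real per-document work. On platforms where worker processes are started by re-importing the main module (the default on macOS and Windows), anything that launches workers needs to sit behind the guard.

```python
import multiprocessing as mp


def count_words(text):
    """Toy stand-in for per-document work such as tokenization."""
    return len(text.split())


def main():
    docs = ["hello world", "a b c", ""]  # hypothetical sample documents
    # Worker processes may re-import this module when they start, so the
    # pool is only created from main(), which runs behind the guard below.
    with mp.Pool(processes=2) as pool:
        counts = pool.map(count_words, docs)
    print(counts)


if __name__ == '__main__':
    main()
```

Without the guard, each re-imported worker would try to run the pool setup again at import time, which Python reports as an error instead of doing the work.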

prepare.py    nanogpt: fix multiprocessing in load_dataset on os x    2023-06-17 20:35:38 -04:00
readme.md     first very bad commit                                    2022-12-28 00:58:19 +00:00

readme.md

openwebtext dataset

after running prepare.py (preprocess) we get:

- train.bin is ~17GB, val.bin ~8.5MB
- train has ~9B tokens (9,035,582,198)
- val has ~4M tokens (4,434,897)

this came from 8,013,769 documents in total.
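
the sizes and token counts above can be cross-checked with a little arithmetic. this is only a sketch: it assumes each token is stored as a 2-byte integer (uint16), which is not stated in this readme, and the helper names are made up for illustration. under that assumption the reported sizes line up with the token counts when read as binary units (GiB / MiB).

```python
BYTES_PER_TOKEN = 2  # assumption: tokens are stored as uint16

# token and document counts copied from the readme
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
NUM_DOCUMENTS = 8_013_769


def expected_size_bytes(num_tokens, bytes_per_token=BYTES_PER_TOKEN):
    """File size implied by a token count at a fixed width per token."""
    return num_tokens * bytes_per_token


def to_gib(num_bytes):
    return num_bytes / 2**30


def to_mib(num_bytes):
    return num_bytes / 2**20


train_bytes = expected_size_bytes(TRAIN_TOKENS)
val_bytes = expected_size_bytes(VAL_TOKENS)
tokens_per_doc = (TRAIN_TOKENS + VAL_TOKENS) / NUM_DOCUMENTS

print(f"train.bin: {to_gib(train_bytes):.1f} GiB")
print(f"val.bin:   {to_mib(val_bytes):.1f} MiB")
print(f"average tokens per document: {tokens_per_doc:.0f}")
```

the last line averages over train and val together, since the document count covers the whole dataset.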

references:

- OpenAI's WebText dataset is discussed in the GPT-2 paper
- the OpenWebText dataset