mirror of https://github.com/osmarks/nanogpt-experiments.git, synced 2025-02-07 14:40:03 +00:00
Oleksandr Kuvshynov
The issue seems to be that `_fixup_main_from_path` in Python's `multiprocessing` module is unable to find the entry point, so the fix is to guard the script body with `if __name__ == '__main__':`.
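A minimal sketch of the guard in question (the `square` worker is a made-up example, not from the repo). On platforms that spawn worker processes, `multiprocessing` re-imports the main script in each worker; without the guard, the pool-creating code runs again on import and the entry-point fixup can fail.

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    # Only the parent process runs this block; spawned workers merely
    # re-import the module to find `square`, avoiding the
    # _fixup_main_from_path failure.
    with Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```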
openwebtext dataset

After running prepare.py (preprocessing) we get:
- train.bin is ~17GB, val.bin ~8.5MB
- train has ~9B tokens (9,035,582,198)
- val has ~4M tokens (4,434,897)
This came from 8,013,769 documents in total.
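A small self-contained sketch of the on-disk format these numbers imply (the file name and token ids here are made up for illustration): prepare.py stores token ids as a flat array of uint16, i.e. 2 bytes per token, which is why ~9B train tokens land at roughly 17 GB, and the file is read back lazily with `np.memmap` so it never has to fit in RAM.

```python
import os
import tempfile

import numpy as np

# Illustrative token ids (made up); the real train.bin holds ~9B of these.
tokens = np.array([50256, 11, 262, 995], dtype=np.uint16)

path = os.path.join(tempfile.mkdtemp(), 'example.bin')
tokens.tofile(path)  # raw uint16 dump, 2 bytes per token

# Lazy read: only pages that are touched get loaded into memory.
data = np.memmap(path, dtype=np.uint16, mode='r')
print(len(data))  # 4
```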
references:
- OpenAI's WebText dataset is discussed in the GPT-2 paper
- OpenWebText dataset