mirror of
https://github.com/osmarks/nanogpt-experiments.git
synced 2024-12-18 06:00:29 +00:00
542ac51d1f
The issue seems to be that _fixup_main_from_path in multiprocessing module in python is unable to find entry point, thus, adding ``` if __name__ == '__main__' ``` |
||
---|---|---|
.. | ||
prepare.py | ||
readme.md |
openwebtext dataset
after running prepare.py
(preprocess) we get:
- train.bin is ~17GB, val.bin ~8.5MB
- train has ~9B tokens (9,035,582,198)
- val has ~4M tokens (4,434,897)
this came from 8,013,769 documents in total.
references:
- OpenAI's WebText dataset is discussed in GPT-2 paper
- OpenWebText dataset