mirror of
https://github.com/osmarks/nanogpt-experiments.git
synced 2025-09-04 11:57:58 +00:00
first very bad commit
15
data/openwebtext/readme.md
Normal file
@@ -0,0 +1,15 @@
## openwebtext dataset

after running `prepare.py` (preprocess) we get:

- train.bin is ~17GB, val.bin ~8.5MB
- train has ~9B tokens (9,035,582,198)
- val has ~4M tokens (4,434,897)

this came from 8,013,769 documents in total.
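
as a sanity check, the file sizes above can be reproduced from the token counts. This is a rough sketch, not part of `prepare.py`: it assumes each token id is stored as a 2-byte unsigned integer (uint16), which this readme does not state, so treat `BYTES_PER_TOKEN` as an assumption. All names in the snippet are illustrative.

```python
# Token and document counts as reported in this readme.
TRAIN_TOKENS = 9_035_582_198
VAL_TOKENS = 4_434_897
NUM_DOCUMENTS = 8_013_769

# ASSUMPTION: each token id occupies 2 bytes (uint16) on disk.
BYTES_PER_TOKEN = 2


def size_in_bytes(num_tokens, bytes_per_token=BYTES_PER_TOKEN):
    """Return the expected on-disk size of a flat token file."""
    return num_tokens * bytes_per_token


train_bytes = size_in_bytes(TRAIN_TOKENS)
val_bytes = size_in_bytes(VAL_TOKENS)

# Convert to binary units (GiB / MiB) for comparison with the sizes above.
train_gib = train_bytes / 2**30
val_mib = val_bytes / 2**20

# Average document length and the share of tokens held out for validation.
total_tokens = TRAIN_TOKENS + VAL_TOKENS
tokens_per_doc = total_tokens / NUM_DOCUMENTS
val_fraction = VAL_TOKENS / total_tokens

print(f"train.bin: {train_gib:.2f} GiB")
print(f"val.bin:   {val_mib:.2f} MiB")
print(f"average tokens per document: {tokens_per_doc:.0f}")
print(f"validation share of tokens:  {val_fraction:.4%}")
```

under that assumption the sizes come out to about 16.8 GiB and 8.5 MiB, which agrees with the ~17GB and ~8.5MB figures if they are read as binary units. The average document works out to a little over 1,100 tokens.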

## references

- OpenAI's WebText dataset is discussed in the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset