Mirror of https://github.com/osmarks/nanogpt-experiments.git (synced 2025-05-15 21:54:06 +00:00)

Before:
length of dataset in characters: 1115394
all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1003854 tokens
val has 111540 tokens

After:
length of dataset in characters: 1,115,394
all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
tiny shakespeare, character-level
Tiny Shakespeare, of the good old char-rnn fame :) Treated at the character level.
After running prepare.py:
- train.bin has 1,003,854 tokens
- val.bin has 111,540 tokens
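
For a rough idea of where these numbers come from, here is a minimal sketch of character-level encoding followed by a train/val split. It is not the actual prepare.py: the function name `char_level_split` and the `train_fraction=0.9` default are assumptions for illustration. A 90/10 split is at least consistent with the counts above, since int(1,115,394 * 0.9) = 1,003,854 and the remainder is 111,540.

```python
def char_level_split(text, train_fraction=0.9):
    """Map each character to an integer id, then split into train/val.

    Hypothetical helper for illustration; the 0.9 default is an assumption.
    """
    # The vocabulary is every distinct character, in sorted order.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # character -> id
    itos = {i: ch for i, ch in enumerate(chars)}  # id -> character

    # The first train_fraction of the text is train, the rest is val.
    cut = int(len(text) * train_fraction)
    train_ids = [stoi[ch] for ch in text[:cut]]
    val_ids = [stoi[ch] for ch in text[cut:]]
    return stoi, itos, train_ids, val_ids


sample = "To be, or not to be"
stoi, itos, train_ids, val_ids = char_level_split(sample)

# Comma-separated counts, in the style of the numbers listed above.
print(f"vocab size: {len(stoi):,}")
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")
```

At the character level one token is one character, so the train and val token counts always add up to the dataset length in characters.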