osmarks/nanogpt-experiments

mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-11-10 20:09:58 +00:00

リョウゼ be571fff2c Improve readability of huge numbers	2023-01-16 22:05:32 +01:00

Before:
length of dataset in characters: 1115394
all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1003854 tokens
val has 111540 tokens

After:
length of dataset in characters: 1,115,394
all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
prepare.py	Improve readability of huge numbers	2023-01-16 22:05:32 +01:00
readme.md	add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders.	2023-01-11 05:27:19 +00:00

readme.md

tiny shakespeare, character-level

Tiny shakespeare, of the good old char-rnn fame :) Treated at the character level.

After running prepare.py:

train.bin has 1,003,854 tokens
val.bin has 111,540 tokens
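
The token counts above are consistent with a 90/10 train/val split of the 1,115,394 characters, with one token per character. A minimal sketch of that kind of character-level preparation follows. It is not the repository's prepare.py: the helper names (`build_vocab`, `encode`, `decode`, `split_ids`), the 0.9 split fraction, and the sample string are assumptions made for illustration, and writing the .bin files is left out.

```python
# Hypothetical sketch of character-level tokenization; not the actual prepare.py.

def build_vocab(text):
    """Map each unique character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    """Turn a string into a list of integer token ids, one per character."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """Turn a list of token ids back into a string."""
    return "".join(itos[i] for i in ids)

def split_ids(ids, train_fraction=0.9):
    """Split ids into train and val parts; the 0.9 fraction is an assumption."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Stand-in for the real dataset, which is not downloaded here.
sample = "First Citizen:\nBefore we proceed any further, hear me speak.\n"

stoi, itos = build_vocab(sample)
ids = encode(sample, stoi)
train_ids, val_ids = split_ids(ids)

# The ',' format specifier inserts thousands separators, e.g. 1,003,854.
print(f"length of dataset in characters: {len(sample):,}")
print(f"vocab size: {len(stoi):,}")
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")
```

With the real dataset in place of `sample`, the same split arithmetic gives int(1115394 * 0.9) = 1003854 train tokens and 111540 val tokens, matching the counts listed above.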