mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-11-10 20:09:58 +00:00
Commit Graph

18 Commits

Author SHA1 Message Date
Andrej
325be85d9b
Merge pull request #420 from vinjn/fix-371-enc-is-not-defined
Move enc to global namespace to fix #371
2024-02-27 09:27:01 -08:00
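A sketch of what the fix amounts to, modeled loosely on the repo's prepare.py (the process function body here is illustrative):
```
import tiktoken

# enc lives at module scope so that worker processes spawned by
# dataset.map(..., num_proc=...) can resolve it after re-importing the
# module; a definition inside the __main__ guard is invisible to them
enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example['text'])  # BPE-encode, ignoring special tokens
    ids.append(enc.eot_token)                   # append the end-of-text token
    return {'ids': ids, 'len': len(ids)}
```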
Adam Isakov
f35dc82437 fix: prepare.py - added input file opening in UTF-8 encoding 2024-01-26 01:34:44 +03:00
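The change boils down to passing an explicit encoding when opening the input text; a minimal sketch, with the file path assumed:
```
input_file_path = 'input.txt'  # assumed location of the raw text
# open explicitly as UTF-8 rather than the platform default codec
# (e.g. cp1252 on Windows), which can break on non-ASCII text
with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()
```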
vinjn
dccf362c2b Move enc to global namespace 2024-01-12 12:53:20 -08:00
Oleksandr Kuvshynov
542ac51d1f nanogpt: fix multiprocessing in load_dataset on OS X
The issue seems to be that _fixup_main_from_path in the multiprocessing
module in Python is unable to find the entry point; guarding the top-level
code fixes it:
```
if __name__ == '__main__':
```
2023-06-17 20:35:38 -04:00
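A minimal sketch of the guarded layout, folding in the num_proc setting from the commit below (dataset name and process count are illustrative):
```
from datasets import load_dataset  # HuggingFace datasets

def main():
    # on macOS the default multiprocessing start method is spawn, which
    # re-imports this module in every worker; work done at import time
    # would therefore run again (or crash) in each child process
    dataset = load_dataset("openwebtext", num_proc=8)
    print(dataset)

if __name__ == '__main__':
    main()
```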
Oleksandr Kuvshynov
bb7e96754a nanogpt: allow multithreading in load dataset 2023-06-16 20:00:17 -04:00
Laiho
6649b299eb np.sum overflows on Windows 2023-05-09 16:36:59 +03:00
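A small repro-and-fix sketch; on Windows NumPy's default integer is 32-bit, so the explicit accumulator dtype matters:
```
import numpy as np

lens = np.array([1_500_000_000, 1_500_000_000])
# on Windows NumPy's default integer is 32-bit (C long), so summing large
# token counts silently wraps; an explicit 64-bit accumulator avoids it
total = np.sum(lens, dtype=np.uint64)
print(total)  # 3000000000
```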
Andrej
d9f4735f5e
Merge pull request #10 from LaihoE/master
batch file write
2023-04-13 00:39:41 -07:00
Christian Orr
36c7db8c44
bugfix in decode function
The return statement was left out of the decode function, so it didn't work.
2023-03-08 10:16:19 +02:00
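A toy sketch of the shape of the bug and the fix, with a made-up vocabulary:
```
chars = sorted(set("hello world"))            # toy character vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> char

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    # the bug: this join was computed but never returned, so decode()
    # always produced None
    return ''.join(itos[i] for i in l)

print(decode(encode("hello world")))  # hello world
```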
DG
2f7fd0ac57 add relative import in shakespeare 2023-01-22 12:18:24 -08:00
DG
bf779456f3 add relative import in shakespeare_char 2023-01-22 11:11:25 -08:00
DG
edb7a7eab0 use relative paths so that running the data prep scripts always creates files in the local folder, no matter where they are run from 2023-01-20 10:39:45 -08:00
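A sketch of the idea, with illustrative file names:
```
import os

# resolve everything relative to this script's own directory, not the
# current working directory, so outputs land next to the script
base_dir = os.path.dirname(__file__)
input_file_path = os.path.join(base_dir, 'input.txt')
train_file_path = os.path.join(base_dir, 'train.bin')
```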
Andrej Karpathy
2c7806db6e for consistency with previous commit 2023-01-19 23:10:51 +00:00
リョウゼ
be571fff2c
Improve readability of huge numbers
Before:
  length of dataset in characters:  1115394
  all the unique characters: 
   !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
  vocab size: 65
  train has 1003854 tokens
  val has 111540 tokens

After:
  length of dataset in characters: 1,115,394
  all the unique characters: 
   !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
  vocab size: 65
  train has 1,003,854 tokens
  val has 111,540 tokens
2023-01-16 22:05:32 +01:00
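The formatting change is Python's thousands-separator format spec; for example:
```
n_train, n_val = 1003854, 111540
# the "," format spec inserts thousands separators
print(f"train has {n_train:,} tokens")  # train has 1,003,854 tokens
print(f"val has {n_val:,} tokens")      # val has 111,540 tokens
```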
Andrej Karpathy
d17350a31d add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders. 2023-01-11 05:27:19 +00:00
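A rough sketch of the two-way decoding choice the message describes; the path and the meta.pkl fields are assumptions about the dataset layout:
```
import os
import pickle
import tiktoken

# the "hacky" two-possibility check: a dataset that ships a meta.pkl with a
# character vocabulary is decoded with it, anything else is assumed GPT-2 BPE
meta_path = os.path.join('data', 'shakespeare_char', 'meta.pkl')  # assumed path
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    itos = meta['itos']  # assumed field: integer id -> character table
    decode = lambda l: ''.join(itos[i] for i in l)
else:
    enc = tiktoken.get_encoding("gpt2")
    decode = lambda l: enc.decode(l)
```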
Laiho
0a2ea95338 batch file write 2023-01-02 17:49:21 +02:00
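A sketch of batched writing via a memmap; get_token_batch is a hypothetical stand-in for producing one tokenized shard:
```
import numpy as np

def get_token_batch(i):
    # hypothetical stand-in for one tokenized shard of the dataset
    return np.full(10_000, i % 50257, dtype=np.uint16)

total_batches = 1000
arr_len = total_batches * 10_000
# memmap the output file and fill it shard by shard, instead of holding
# the whole token stream in memory and writing it in one go
arr = np.memmap('train.bin', dtype=np.uint16, mode='w+', shape=(arr_len,))
idx = 0
for batch_idx in range(total_batches):
    batch = get_token_batch(batch_idx)
    arr[idx : idx + len(batch)] = batch
    idx += len(batch)
arr.flush()
```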
Andrej Karpathy
2febf4463c candidate changes to apis, have to think through more 2023-01-01 01:29:48 +00:00
Andrej Karpathy
7c6ea8409e simplify the prepare script a lot, write only using one process, seems sufficient for now. ty @LaihoE for suggestion and @proger for flagging 2022-12-30 22:18:20 +00:00
Andrej Karpathy
fe8042867c first very bad commit 2022-12-28 00:58:19 +00:00