mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-11-10 20:09:58 +00:00
Commit Graph

95 Commits

Author SHA1 Message Date
Andrej Karpathy
2bf07a3fbf rewrite model class so layernorm has an optional bias= parameter 2023-01-27 20:17:32 +00:00
Andrej Karpathy
2892858ce7 attempt a non-biased model, per few papers that cite this as working well 2023-01-27 18:54:08 +00:00
Andrej Karpathy
f29a9ff5bf ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default. 2023-01-27 17:56:18 +00:00
Andrej Karpathy
23a0bfac20 try bring back mingpt init 2023-01-27 16:52:18 +00:00
Andrej Karpathy
3cb3fc059c grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains 2023-01-27 16:45:09 +00:00
Andrej Karpathy
e0c689cf38 allow the prompt to come from a file 2023-01-25 01:12:43 +00:00
Andrej Karpathy
21675d7755 allow sample.py to init from pretrained gpt2 checkpoints as well, in similar style to train.py 2023-01-25 00:55:29 +00:00
Andrej
6c40a08b41
Merge pull request #82 from danielgross/master
Missed two spots while relative pathing
2023-01-22 13:47:32 -08:00
DG
2f7fd0ac57 add relative import in shakespeare 2023-01-22 12:18:24 -08:00
DG
bf779456f3 add relative import in shakespeare_char 2023-01-22 11:11:25 -08:00
Andrej
3611338959
Merge pull request #71 from cchan/patch-1
Zero-grad more aggressively to save memory
2023-01-20 14:38:10 -08:00
Andrej Karpathy
1f77d03024 make mentions of mps in docs. ty good people in issue #28 2023-01-20 21:28:20 +00:00
Andrej
a6bffeee59
Merge pull request #73 from danielgross/master
Use relative paths
2023-01-20 12:21:33 -08:00
DG
edb7a7eab0 use relative paths so that running the data prep scripts always creates files in the local folder, no matter where they are run from 2023-01-20 10:39:45 -08:00
Clive Chan
67166079c9
Zero-grad more aggressively to save memory 2023-01-19 22:10:44 -08:00
Andrej Karpathy
2c7806db6e for consistency with previous commit 2023-01-19 23:10:51 +00:00
Andrej
c1c20a0311
Merge pull request #57 from ryouze/patch-1
Improve readability of huge numbers
2023-01-19 15:08:35 -08:00
Andrej
9e150b808e
Merge pull request #66 from PWhiddy/patch-1
fix typo ( params -> tokens)
2023-01-18 22:29:51 -08:00
Peter Whidden
ff9085d0bc
fix typo ( params -> tokens) 2023-01-18 21:17:15 -05:00
Andrej Karpathy
8dd2061e4d fix temperature comment, slightly wrong 2023-01-18 16:10:05 +00:00
Andrej Karpathy
2b083fbfde the badge is a bit ugly, move it down to troubleshooting section 2023-01-18 03:16:59 +00:00
Andrej Karpathy
aa8e4c2546 screwed up the link, fix 2023-01-18 03:11:31 +00:00
Andrej Karpathy
6dab32c003 experimenting with badges, and discord link to start specifically. issues sometimes feel a little too heavy 2023-01-18 03:09:42 +00:00
リョウゼ
be571fff2c
Improve readability of huge numbers
Before:
  length of dataset in characters:  1115394
  all the unique characters: 
   !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
  vocab size: 65
  train has 1003854 tokens
  val has 111540 tokens

After:
  length of dataset in characters: 1,115,394
  all the unique characters: 
   !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
  vocab size: 65
  train has 1,003,854 tokens
  val has 111,540 tokens
2023-01-16 22:05:32 +01:00
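The Before/After output in the commit above can be reproduced with Python's `,` format specifier, which inserts thousands separators. This is a minimal sketch, not the code from the commit itself; the `describe` helper is a hypothetical name used only for illustration.

```python
def describe(label: str, count: int) -> str:
    """Format a count with thousands separators (hypothetical helper)."""
    # The "," format specifier groups digits in threes: 1115394 -> "1,115,394".
    return f"{label}: {count:,}"


print(describe("length of dataset in characters", 1115394))
print(f"train has {1003854:,} tokens")
print(f"val has {111540:,} tokens")
```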
Andrej Karpathy
7f74652843 add docs on multinode training to main README too 2023-01-16 17:11:02 +00:00
Andrej Karpathy
46ce9971df small tweaks to docs and variable names stylistically 2023-01-16 16:56:05 +00:00
Andrej Karpathy
684800dd87 clarify that these should be run on two separate machines 2023-01-16 06:02:46 +00:00
Andrej Karpathy
9352df23de docs for multinode ddp 2023-01-16 05:57:33 +00:00
Andrej Karpathy
c3dddbff3d get rid of gpu_id, the world is more complicated than that when world_size > 8 2023-01-16 05:44:50 +00:00
Andrej Karpathy
f5e6ac8b02 local rank -> rank 2023-01-16 05:13:13 +00:00
Andrej Karpathy
cf99914886 add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR 2023-01-15 17:49:55 +00:00
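A rough illustration of why gradient accumulation can simulate a larger batch: if each micro-batch gradient is scaled by the number of accumulation steps, the accumulated sum matches the gradient of the full batch. The sketch below uses a toy one-parameter model in plain Python rather than the repository's training loop; all names (`grad_mse`, `accumulated_grad`, `micro_batch_size`) are assumptions for illustration, and it assumes the batch divides evenly into micro-batches.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of the model y = w * x, w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate scaled micro-batch gradients (assumes an even split)."""
    steps = len(xs) // micro_batch_size
    total = 0.0
    for i in range(steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        # Scale each micro-batch gradient by 1/steps so the sum is an average.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
full = grad_mse(0.5, xs, ys)
acc = accumulated_grad(0.5, xs, ys, micro_batch_size=2)
print(full, acc)
```

In a real training loop the same scaling is usually applied to the loss before the backward pass, and the optimizer step runs once per group of micro-batches.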
Andrej Karpathy
89da79eee1 add note of caution for the produced warning, investigate later 2023-01-14 20:38:22 +00:00
Andrej Karpathy
7d7ded25ce a bit better settings... for a single gpu at least. these settings would fry a simple cpu though i think 2023-01-14 03:59:53 +00:00
Andrej Karpathy
91d02510ce fix bug... if topk > vocab_size, torch.topk will throw error 2023-01-14 03:57:00 +00:00
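The fix described above amounts to clamping the requested k to the vocabulary size before selecting the top-k logits. The following is a plain-Python sketch of that idea, not the repository's code; `top_k_filter` is a hypothetical name, `top_k` is assumed to be at least 1, and ties at the threshold are all kept.

```python
import math


def top_k_filter(logits: list[float], top_k: int) -> list[float]:
    """Keep the top_k largest logits and mask the rest with -inf."""
    # Clamp so k never exceeds the number of available logits (vocab size).
    k = min(top_k, len(logits))
    # The k-th largest value is the cutoff; anything below it is masked out.
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]


print(top_k_filter([1.0, 3.0, 2.0], top_k=200))  # k is clamped to 3
print(top_k_filter([1.0, 3.0, 2.0], top_k=2))
```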
Andrej Karpathy
57735f532d correctly propagate the vocab_size from the rendered dataset into the model args 2023-01-14 02:26:44 +00:00
Andrej Karpathy
43b37fd568 reverse the order, making sure that the final layer init is preserved, and becomes the token embedding instead of the other way around. otherwise the loss can be all messed up from a bad init 2023-01-14 02:16:10 +00:00
Andrej Karpathy
7c8288552b tie the weights of lm_head.weight and transformer.wte.weight, i.e. the last linear layer of decoder and the token embeddings. 2023-01-14 01:00:55 +00:00
Andrej Karpathy
32b4f08d9d it's true 2023-01-13 23:43:00 +00:00
Andrej Karpathy
3e0fd42579 more scaling laws, clarification, and add simple interpolation of Approach 2 2023-01-13 00:57:15 +00:00
Andrej Karpathy
8f85b83347 inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token 2023-01-12 06:02:50 +00:00
Andrej Karpathy
e21cbf887f meant to set always_save_checkpoint to False instead, so we only write when val improves 2023-01-12 05:47:34 +00:00
Andrej Karpathy
c1ac2d58f1 including transformers as a dependency of the repo as well 2023-01-12 02:42:38 +00:00
Andrej Karpathy
7f51d17977 add note about windows and pytorch 2.0 and torch compile in general 2023-01-12 02:17:52 +00:00
Andrej Karpathy
bb49751439 oh no nanoGPT is trending quickly explain the character-level functionality I added late last night 2023-01-11 17:11:15 +00:00
Andrej Karpathy
d17350a31d add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders. 2023-01-11 05:27:19 +00:00
Andrej Karpathy
c2a402f7f7 guess the config from globals() and log all of it with wandb 2023-01-11 01:00:22 +00:00
Andrej Karpathy
8b2e622b27 adjust the readme to reflect changes in the autocast branch 2023-01-08 19:40:46 +00:00
Andrej Karpathy
b77c2e86d3 copy pasting what seems to work to bench,sample as well. ty @lantiga 2023-01-08 19:32:13 +00:00
Andrej Karpathy
a855d316fd add device and dtype support to train.py args 2023-01-08 19:20:38 +00:00
Andrej
e7cd674ce7
Merge pull request #20 from lantiga/wandb-optional-import
Make wandb import conditioned to wandb_log=True
2023-01-08 10:19:40 -08:00