mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-12-18 06:00:29 +00:00

Commit Graph

  • d8b1a94519 change grad accum to default off because i think it just confuses everyone Andrej Karpathy 2023-02-02 18:38:49 +0000
  • d01863ef01 small usability tweaks to bench Andrej Karpathy 2023-02-02 17:23:46 +0000
  • 40f4d6ff70 use the enabled arg in GradScaler Yassine Yousfi 2023-01-31 21:12:49 -0800
  • d995c22128 fix bug with loading GPT-2 parameters, assert gets incorrectly tripped due to .bias missing since it is now optionally present depending on flash or not Andrej Karpathy 2023-02-01 02:05:34 +0000
  • 038ce89438 rename iter to it, because iter is a concrete Python builtin Andrej Karpathy 2023-01-31 23:34:02 +0000
  • d2705bd92a tune cited numbers and reproductions and more explicitly point out the problems w.r.t. the OWT vs WT domain gap Andrej Karpathy 2023-01-31 21:57:07 +0000
  • 4386bce1f4 adjust teaser figure with a more tuned result Andrej Karpathy 2023-01-31 21:43:30 +0000
  • 924a0873eb merge, make cleaner, careful with gradient clipping when using grad scaler fp16 training Andrej Karpathy 2023-01-30 23:40:35 +0000
  • ae06d0b15a add flash attention support, resolving last few issues but for now seems to work ok Andrej Karpathy 2023-01-30 23:18:26 +0000
  • 0e90ee9d48 based on my experiments these biases are indeed not needed. code runs faster, identical results. keeping the option just because it deviates from the gpt-2 setup Andrej Karpathy 2023-01-30 08:07:58 +0000
  • 001c1e7be7 stay true to the README file and set grad accum to 5, so the default batch size is about 0.5M and is reproducing gpt2 Andrej Karpathy 2023-01-27 20:51:50 +0000
  • 79dbe0086d let me set bias=True until I validate it properly, but this should be ok to merge to master for now, is equivalent to previous functionality Andrej Karpathy 2023-01-27 20:45:28 +0000
  • e808a67149 bunch of plumbing of bias all around. measuring bias=False to be about 6% faster Andrej Karpathy 2023-01-27 20:41:17 +0000
  • cc5444e194 add the bias option to config, default it to True for now Andrej Karpathy 2023-01-27 20:29:45 +0000
  • 2bf07a3fbf rewrite model class so layernorm has an optional bias= parameter Andrej Karpathy 2023-01-27 20:17:32 +0000
  • 2892858ce7 attempt a non-biased model, per few papers that cite this as working well Andrej Karpathy 2023-01-27 18:54:08 +0000
  • f29a9ff5bf ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default. Andrej Karpathy 2023-01-27 17:56:18 +0000
  • 23a0bfac20 try bring back mingpt init Andrej Karpathy 2023-01-27 16:52:18 +0000
  • 3cb3fc059c grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in the training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think more necessary as the model becomes larger. practitioners may consider turning it off for minor efficiency gains Andrej Karpathy 2023-01-27 16:45:09 +0000
  • e0c689cf38 allow the prompt to come from a file Andrej Karpathy 2023-01-25 01:12:43 +0000
  • 21675d7755 allow sample.py to init from pretrained gpt2 checkpoints as well, in similar style to train.py Andrej Karpathy 2023-01-25 00:55:29 +0000
  • e0e94a1094 use GradScaler in model only if dtype is float16 johnwildauer 2023-01-24 15:53:31 -0700
  • 6c40a08b41 Merge pull request #82 from danielgross/master Andrej 2023-01-22 13:47:32 -0800
  • 2f7fd0ac57 add relative import in shakespeare DG 2023-01-22 12:18:24 -0800
  • bf779456f3 add relative import in shakespeare_char DG 2023-01-22 11:11:25 -0800
  • f9d8020f48 Fix decode fn in shakespeare_char/prepare.py venusatuluri 2023-01-21 06:14:16 +0000
  • 3611338959 Merge pull request #71 from cchan/patch-1 Andrej 2023-01-20 14:38:10 -0800
  • 1f77d03024 make mentions of mps in docs. ty good people in issue #28 Andrej Karpathy 2023-01-20 21:28:20 +0000
  • a6bffeee59 Merge pull request #73 from danielgross/master Andrej 2023-01-20 12:21:33 -0800
  • edb7a7eab0 use relative paths so that running the data prep scripts always creates files in the local folder, no matter where they are run from DG 2023-01-20 10:39:45 -0800
  • 67166079c9 Zero-grad more aggressively to save memory Clive Chan 2023-01-19 22:10:44 -0800
  • 2c7806db6e for consistency with previous commit Andrej Karpathy 2023-01-19 23:10:51 +0000
  • c1c20a0311 Merge pull request #57 from ryouze/patch-1 Andrej 2023-01-19 15:08:35 -0800
  • 9e150b808e Merge pull request #66 from PWhiddy/patch-1 Andrej 2023-01-18 22:29:51 -0800
  • ff9085d0bc fix typo (params -> tokens) Peter Whidden 2023-01-18 21:17:15 -0500
  • 8dd2061e4d fix temperature comment, slightly wrong Andrej Karpathy 2023-01-18 16:10:05 +0000
  • 2b083fbfde the badge is a bit ugly, move it down to troubleshooting section Andrej Karpathy 2023-01-18 03:16:59 +0000
  • aa8e4c2546 screwed up the link, fix Andrej Karpathy 2023-01-18 03:11:31 +0000
  • 6dab32c003 experimenting with badges, and discord link to start specifically. issues sometimes feel a little too heavy Andrej Karpathy 2023-01-18 03:09:42 +0000
  • be571fff2c Improve readability of huge numbers リョウゼ 2023-01-16 22:05:32 +0100
  • 7f74652843 add docs on multinode training to main README too Andrej Karpathy 2023-01-16 17:11:02 +0000
  • 46ce9971df small tweaks to docs and variable names stylistically Andrej Karpathy 2023-01-16 16:56:05 +0000
  • 684800dd87 clarify that these should be run on two separate machines Andrej Karpathy 2023-01-16 06:02:46 +0000
  • 9352df23de docs for multinode ddp Andrej Karpathy 2023-01-16 05:57:33 +0000
  • c3dddbff3d get rid of gpu_id, the world is more complicated than that when world_size > 8 Andrej Karpathy 2023-01-16 05:44:50 +0000
  • f5e6ac8b02 local rank -> rank Andrej Karpathy 2023-01-16 05:13:13 +0000
  • d5ee965974 Update README.md MicroPanda123 2023-01-15 20:29:15 +0000
  • cf99914886 add gradient accumulation support to simulate larger batch sizes. ty @VHellendoorn for original PR Andrej Karpathy 2023-01-15 17:49:55 +0000
  • 89da79eee1 add note of caution for the produced warning, investigate later Andrej Karpathy 2023-01-14 20:38:22 +0000
  • 7d7ded25ce a bit better settings... for a single gpu at least. these settings would fry a simple cpu though i think Andrej Karpathy 2023-01-14 03:59:53 +0000
  • 91d02510ce fix bug... if topk > vocab_size, torch.topk will throw error Andrej Karpathy 2023-01-14 03:57:00 +0000
  • 57735f532d correctly propagate the vocab_size from the rendered dataset into the model args Andrej Karpathy 2023-01-14 02:26:44 +0000
  • 43b37fd568 reverse the order, making sure that the final layer init is preserved, and becomes the token embedding instead of the other way around. otherwise the loss can be all messed up from a bad init Andrej Karpathy 2023-01-14 02:16:10 +0000
  • 7c8288552b tie the weights of lm_head.weight and transformer.wte.weight, i.e. the last linear layer of decoder and the token embeddings. Andrej Karpathy 2023-01-14 01:00:55 +0000
  • 32b4f08d9d it's true Andrej Karpathy 2023-01-13 23:43:00 +0000
  • 3e0fd42579 more scaling laws, clarification, and add simple interpolation of Approach 2 Andrej Karpathy 2023-01-13 00:57:15 +0000
  • 8f85b83347 inference time mini-optimization low-hanging fruit ty @jxtps for raising: when we are running inference we can apply lm_head on only the very last token Andrej Karpathy 2023-01-12 06:02:50 +0000
  • e21cbf887f meant to set always_save_checkpoint to False instead, so we only write when val improves Andrej Karpathy 2023-01-12 05:47:34 +0000
  • c1ac2d58f1 including transformers as a dependency of the repo as well Andrej Karpathy 2023-01-12 02:42:38 +0000
  • 7f51d17977 add note about windows and pytorch 2.0 and torch compile in general Andrej Karpathy 2023-01-12 02:17:52 +0000
  • bb49751439 oh no nanoGPT is trending, quickly explain the character-level functionality I added late last night Andrej Karpathy 2023-01-11 17:11:15 +0000
  • d17350a31d add support for character-level language models, a new character-level shakespeare dataset, a new config file that shows how to train a character-level baby GPT on it, and adjust the sample function to figure out if it should decode with characters or GPT2 bpe tokens. The current implementation is a bit hacky and basically assumes just these two possibilities. In the future we may want to support more general encoders or decoders. Andrej Karpathy 2023-01-11 05:27:19 +0000
  • c2a402f7f7 guess the config from globals() and log all of it with wandb Andrej Karpathy 2023-01-11 01:00:22 +0000
  • 8b2e622b27 adjust the readme to reflect changes in the autocast branch Andrej Karpathy 2023-01-08 19:40:46 +0000
  • b77c2e86d3 copy pasting what seems to work to bench,sample as well. ty @lantiga Andrej Karpathy 2023-01-08 19:32:13 +0000
  • a855d316fd add device and dtype support to train.py args Andrej Karpathy 2023-01-08 19:20:38 +0000
  • e7cd674ce7 Merge pull request #20 from lantiga/wandb-optional-import Andrej 2023-01-08 10:19:40 -0800
  • 09f1f458e8 Move conditional import Luca Antiga 2023-01-08 15:51:50 +0100
  • aba47f0a35 Make wandb import conditioned to wandb_log=True Luca Antiga 2023-01-05 09:09:22 +0100
  • e53b9d28ff ran readme through spellchecker heh Andrej Karpathy 2023-01-08 01:46:54 +0000
  • df3b8a57ab tune the readme with new header image and the loss curve for 124M Andrej Karpathy 2023-01-08 00:41:14 +0000
  • d56bdf05a6 progress! based on chinchilla author correspondence Andrej Karpathy 2023-01-07 02:42:30 +0000
  • 27fc6a4112 small tweaks to notebook Andrej Karpathy 2023-01-06 02:13:04 +0000
  • 69d1a5f1af update scaling laws. basically i can't reproduce any of params, flops, or scaling laws of the Chinchilla paper atm... Andrej Karpathy 2023-01-06 02:01:08 +0000
  • 9629093e53 minor args re-arranging and removing some spurious ones like wandb entity ty @tcapelle Andrej Karpathy 2023-01-05 01:14:02 +0000
  • 529c967a65 Merge pull request #19 from nat/patch-1 Andrej 2023-01-04 16:46:32 -0800
  • d562b3e550 shuttling the poor man's configurator aside into its own file and adding it to all of train,sample,bench. because i am leaving args in globals() so i can avoid having to prepend every single variable with an args., i have to exec the configurator and the optional configs. so we're left with something very gross by standard convention but also quite simple and functional. *ducks* Andrej Karpathy 2023-01-05 00:44:35 +0000
  • 2b9e168736 Strip unwanted prefix from state keys when loading model Nat Friedman 2023-01-04 16:34:00 -0800
  • ab04701f9f mention current 8GPU SOTA and shuffle sections a bit Andrej Karpathy 2023-01-04 18:59:10 +0000
  • 1eefbb2520 Merge pull request #16 from jorahn/patch-1 Andrej 2023-01-04 09:08:50 -0800
  • 26aa5f3ead Update README.md Jonathan Rahn 2023-01-04 10:28:13 +0100
  • c72ecf5d93 add a notebook trying to reproduce chinchilla scaling laws. I can't get the numbers to be exactly right, have to look at more Andrej Karpathy 2023-01-04 00:59:34 +0000
  • 5acba4b005 ty lambda labs Andrej Karpathy 2023-01-03 21:16:07 +0000
  • 97fc42616e adding few more dependencies Andrej Karpathy 2023-01-03 17:54:48 +0000
  • 9f95aca93e better hyperparams for gpt2 124M model on A100 40GB. still uncertain about max_iters especially, and a bit about weight decay, betas Andrej Karpathy 2023-01-03 17:45:49 +0000
  • b45eec3e4b flesh out the remaining TODOs in readme a bit more Andrej Karpathy 2023-01-03 07:41:28 +0000
  • 177d5f7dc5 disabling torch.jit.script here for massive performance boost when using torch.compile, our default. see issue #11. thanks @vgoklani for flagging Andrej Karpathy 2023-01-02 23:05:01 +0000
  • 0a2ea95338 batch file write Laiho 2023-01-02 17:49:21 +0200
  • ea4de192e0 reshuffle args inside sample.py Andrej Karpathy 2023-01-02 02:11:39 +0000
  • ec9b1f8182 add a patch to fix mysterious unwanted prefix in state dict? maybe remove later Andrej Karpathy 2023-01-02 01:25:02 +0000
  • 41184a27f5 rename compile_model to compile, shorter, version 2 stragglers Andrej Karpathy 2023-01-02 01:15:55 +0000
  • 35f51974c4 rename to compile it's shorter Andrej Karpathy 2023-01-02 01:14:46 +0000
  • 2febf4463c candidate changes to apis, have to think through more Andrej Karpathy 2023-01-01 01:29:48 +0000
  • 7c6ea8409e simplify the prepare script a lot, write only using one process, seems sufficient for now. ty @LaihoE for suggestion and @proger for flagging Andrej Karpathy 2022-12-30 22:18:20 +0000
  • d8abd21258 typo fix in readme Andrej Karpathy 2022-12-30 00:07:58 +0000
  • 5a725d9098 add torch.compile by default, shows almost 1.8X improvement in throughput nice Andrej Karpathy 2022-12-30 00:07:13 +0000
  • fb52554ca8 Merge pull request #1 from ankandrew/master Andrej 2022-12-29 13:45:20 -0800
  • 7f0e6d9a71 Frozen GPTConfig ankandrew 2022-12-29 17:07:19 -0300
  • 682a0ac8f1 properly resume training, also loading iter_num and best_val_loss from checkpoints Andrej Karpathy 2022-12-29 18:23:15 +0000
  • f88aa2c2fe add link to mingpt Andrej Karpathy 2022-12-29 17:38:33 +0000
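Commits 001c1e7be7 and cf99914886 above set gradient accumulation to 5 steps so that "the default batch size is about 0.5M" tokens, matching GPT-2 training. A minimal arithmetic sketch of how that number comes about, assuming the nanoGPT defaults of the period (batch_size=12, block_size=1024) and an 8-GPU node; the exact values are illustrative assumptions, not a statement of the repo's current config:

```python
# Sketch: effective tokens per optimizer step under gradient accumulation.
# Assumed values: batch_size=12 sequences, block_size=1024 tokens,
# grad_accum_steps=5 (commit 001c1e7be7), world_size=8 GPUs.

def effective_tokens_per_iter(batch_size: int, block_size: int,
                              grad_accum_steps: int, world_size: int) -> int:
    """Tokens consumed per optimizer step, summed across all GPUs."""
    return batch_size * block_size * grad_accum_steps * world_size

tokens = effective_tokens_per_iter(batch_size=12, block_size=1024,
                                   grad_accum_steps=5, world_size=8)
print(tokens)  # 491520, i.e. roughly the ~0.5M tokens/step cited for GPT-2
```

Accumulation simply sums gradients over 5 micro-batches before each optimizer step, so a single GPU can simulate the large batch at the cost of more forward/backward passes per step.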