ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default.
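
For reference, a minimal sketch of the init scheme the message refers to. The specifics here are assumptions, not taken from this commit: weights drawn from a zero-mean normal with std 0.02, and the residual projection weights scaled down by `1/sqrt(2 * n_layer)` as commonly done in minGPT-style code following the GPT-2 paper. The helper names `gpt2_init_std` and `init_weight` are hypothetical, and plain Python lists stand in for tensors.

```python
import math
import random

def gpt2_init_std(n_layer, base_std=0.02, residual_projection=False):
    """Std of the zero-mean normal used to initialize one weight matrix (assumed scheme)."""
    if residual_projection:
        # Assumption: each block adds to the residual stream twice (attention and MLP),
        # so the projections feeding it are scaled by 1/sqrt(2 * n_layer).
        return base_std / math.sqrt(2 * n_layer)
    return base_std

def init_weight(rows, cols, std, rng):
    """Sample a rows x cols matrix from N(0, std^2) as nested lists."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_layer = 12             # hypothetical depth, chosen only for illustration
std_plain = gpt2_init_std(n_layer)
std_resid = gpt2_init_std(n_layer, residual_projection=True)
w_resid = init_weight(4, 8, std_resid, rng)
```

The PyTorch default for linear layers is a different scheme (a Kaiming-style uniform init), which is the "default" being compared against.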

2025-07-27 21:12:49 +00:00 · 2023-01-27 17:56:18 +00:00 · f29a9ff5bf
commit f29a9ff5bf
parent 23a0bfac20
1 changed file with 0 additions and 1 deletion
--- a/README.md
+++ b/README.md
@@ -157,7 +157,6 @@ Features / APIs

 Suspiciousness

- Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right?
 - I am still not 100% confident that my GPT-2 small reproduction hyperparameters are good, if someone has reproduced GPT-2 I'd be eager to exchange notes ty
 - I keep seeing different values cited for weight decay and AdamW betas, look into
 - I can't exactly reproduce Chinchilla paper results, see [scaling_laws.ipynb](scaling_laws.ipynb) notebook