mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-12-18 06:00:29 +00:00

ok i tried bringing back the original init again and this time it makes a ton of difference and works much better than the PyTorch default. i'm not sure what was different in my earlier experiment, where i saw a slight regression. may try to bisect commits later; for now merged the original minGPT init (following the GPT-2 paper) as the default.

Andrej Karpathy 2023-01-27 17:56:18 +00:00
parent 23a0bfac20
commit f29a9ff5bf


@@ -157,7 +157,6 @@ Features / APIs
Suspiciousness
- Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right?
- I am still not 100% confident that my GPT-2 small reproduction hyperparameters are good. If someone has reproduced GPT-2, I'd be eager to exchange notes, ty
- I keep seeing different values cited for weight decay and AdamW betas, look into this
- I can't exactly reproduce the Chinchilla paper results, see the [scaling_laws.ipynb](scaling_laws.ipynb) notebook
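
For context, here is a rough sketch of the kind of GPT-2-style init the commit message refers to. The model code is not part of the hunk shown here, so everything below is an assumption rather than this commit's actual change: a zero-mean normal with base std 0.02, a `c_proj.weight` naming convention for the residual output projections, and two residual additions per block (attention and MLP), which gives a scale of 1/sqrt(2 * n_layer). `init_std` is a hypothetical helper that only computes the std for each parameter; in PyTorch the resulting value would typically be passed to `torch.nn.init.normal_`.

```python
import math

# Assumed base std for all weights; not taken from this diff.
BASE_STD = 0.02


def init_std(param_name: str, n_layer: int) -> float:
    """Return the std of the zero-mean normal used to initialize a weight.

    Hypothetical helper: residual output projections (assumed here to be
    named '*c_proj.weight') are scaled down so that the variance added to
    the residual stream does not grow with depth.
    """
    if n_layer <= 0:
        raise ValueError("n_layer must be positive")
    # Each block is assumed to add to the residual stream twice
    # (once after attention, once after the MLP), hence 2 * n_layer.
    if param_name.endswith("c_proj.weight"):
        return BASE_STD / math.sqrt(2 * n_layer)
    return BASE_STD


if __name__ == "__main__":
    # Illustrative parameter names only; real names depend on the model code.
    for name in (
        "transformer.h.0.attn.c_attn.weight",
        "transformer.h.0.attn.c_proj.weight",
        "transformer.h.0.mlp.c_proj.weight",
    ):
        print(f"{name}: std={init_std(name, n_layer=12):.5f}")
```

The point of the scaling is that deeper models sum more residual contributions, so shrinking each projection's initial std keeps the residual stream's initial magnitude roughly independent of depth.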