mirror of
https://github.com/osmarks/nanogpt-experiments.git
synced 2024-12-18 06:00:29 +00:00
ok i tried bringing back original init again and this time it makes a ton of difference and works much better than default. i'm not sure what was different with my earlier experiment where i saw a slight regression. may try to dissect commits later, for now merged the original mingpt init (following gpt-2 paper) as default.
parent 23a0bfac20
commit f29a9ff5bf
@@ -157,7 +157,6 @@ Features / APIs
Suspiciousness
- Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right?
- I am still not 100% confident that my GPT-2 small reproduction hyperparameters are good, if someone has reproduced GPT-2 I'd be eager to exchange notes ty
- I keep seeing different values cited for weight decay and AdamW betas, look into
- I can't exactly reproduce Chinchilla paper results, see [scaling_laws.ipynb](scaling_laws.ipynb) notebook
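
For readers unfamiliar with the initialization the commit message refers to, here is a minimal sketch of the GPT-2 style scheme as I understand it: weights drawn from a zero-mean normal with std 0.02, and the residual-stream output projections scaled down by 1/sqrt(2 * n_layer). It is not taken from this commit's diff (only the README hunk is shown here). The `c_proj.weight` suffix check, the function names, and the pure-Python sampling are illustrative assumptions; in a PyTorch model the same standard deviations would typically be passed to `torch.nn.init.normal_`.

```python
import math
import random

BASE_STD = 0.02  # assumed base std for GPT-2 style init


def init_std(param_name: str, n_layer: int) -> float:
    """Return the std to use for a weight matrix (hypothetical helper).

    Residual output projections get a smaller std so that the variance of
    the residual stream does not grow with depth. Each transformer block
    adds two residual branches (attention and MLP), hence 2 * n_layer.
    """
    if param_name.endswith("c_proj.weight"):  # assumed naming convention
        return BASE_STD / math.sqrt(2 * n_layer)
    return BASE_STD


def sample_weights(param_name, shape, n_layer, seed=0):
    """Draw a rows x cols matrix from N(0, std^2) as nested lists."""
    rng = random.Random(seed)
    std = init_std(param_name, n_layer)
    rows, cols = shape
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    for name in ("h.0.attn.c_attn.weight", "h.0.mlp.c_proj.weight"):
        print(name, init_std(name, n_layer=12))
```

Biases would be zeroed under the same scheme; they are omitted here to keep the sketch short.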