Mirror of https://github.com/osmarks/nanogpt-experiments.git, synced 2024-11-10 20:09:58 +00:00
add a notebook trying to reproduce Chinchilla scaling laws. I can't get the numbers to be exactly right; have to look into it more
This commit is contained in: parent 5acba4b005, commit c72ecf5d93
@@ -109,7 +109,10 @@ Features / APIs
Suspiciousness
- Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right?
- I don't currently seem to need gradient clipping but it is very often used (?). Nothing is exploding so far at these scales but maybe I'm leaving performance on the table. Evaluate with/without.
- I am still not 100% confident that my GPT-2 small reproduction hyperparameters are good; if someone has reproduced GPT-2 I'd be eager to exchange notes, ty
- I keep seeing different values cited for weight decay and AdamW betas; look into it
- I can't exactly reproduce Chinchilla paper results, see [scaling_laws.ipynb](scaling_laws.ipynb) notebook
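On the initialization item: the GPT-2 code release draws weights from N(0, 0.02) and additionally scales residual-projection weights down with depth so the residual stream's variance doesn't grow layer by layer. A minimal numpy sketch of that scheme (function and argument names here are illustrative, not the repo's actual modules):

```python
import numpy as np

# GPT-2 style init: weights ~ N(0, 0.02); the GPT-2 release additionally
# scales residual-projection weights by 1/sqrt(n_layer) so that summing
# n_layer residual branches doesn't blow up the activation variance.
# Names/shapes are illustrative, not this repo's actual model code.
def gpt2_init(shape, n_layer=12, residual_proj=False, rng=None):
    rng = rng or np.random.default_rng(0)
    std = 0.02
    if residual_proj:
        std /= np.sqrt(n_layer)  # depth-dependent shrink for residual projections
    return rng.normal(0.0, std, size=shape)

w = gpt2_init((768, 768))                              # ordinary weight matrix
w_proj = gpt2_init((768, 768), residual_proj=True)     # residual projection
```

Comparing this against the PyTorch default (uniform, fan-in scaled) side by side would make the "can't be right?" question concrete.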
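The gradient-clipping item above is cheap to evaluate because the operation itself is simple. This is the math behind `torch.nn.utils.clip_grad_norm_`, sketched in numpy: compute the global norm over all gradients, and if it exceeds `max_norm` (1.0 is a commonly cited setting for GPT training), rescale every gradient by the same factor.

```python
import numpy as np

# Global-norm gradient clipping: treat all gradients as one concatenated
# vector; if its L2 norm exceeds max_norm, scale every gradient so the
# global norm equals max_norm. Direction is preserved, only magnitude shrinks.
def clip_grads(grads, max_norm=1.0):
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = sqrt(9+16+144) = 13
clipped, norm = clip_grads(grads)
```

Logging `total_norm` per step (with clipping on or off) is the usual way to see whether anything would have exploded.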
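For the Chinchilla item: the paper's parametric loss law is L(N, D) = E + A/N^α + B/D^β. As a sanity check for the notebook's fitting code, a simplified single-variable version (drop the data term and the irreducible loss E) becomes linear in log space, so a least-squares line fit should recover the exponent exactly on noiseless synthetic data. The constants below are of the rough magnitude reported in the paper but are used here purely as illustrative ground truth:

```python
import numpy as np

# Simplified Chinchilla-style law: L(N) = A / N**alpha (ignoring B/D**beta
# and the irreducible loss E). In log space: log L = log A - alpha * log N,
# a straight line, so np.polyfit recovers (A, alpha) from synthetic data.
A_true, alpha_true = 406.4, 0.34        # illustrative ground-truth constants

N = np.logspace(7, 10, 20)              # model sizes, 1e7 .. 1e10 parameters
L = A_true / N ** alpha_true            # noiseless synthetic losses

slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_fit = -slope
A_fit = np.exp(intercept)
```

If a fitter can't recover these exactly in the noiseless case, the discrepancy with the paper is in the code, not the data; if it can, the mismatch lives in the full three-term fit (the paper fits it with a Huber loss, which matters for the result).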
Results
scaling_laws.ipynb (new file, 442 lines)
File diff suppressed because one or more lines are too long