Mirror of https://github.com/osmarks/nanogpt-experiments.git (synced 2024-12-18 14:10:28 +00:00)
grad clipping seems to slightly speed up training in the beginning but i can't see a big difference later in training. it costs non-negligible compute to clip. adding it for now because it is standard, and i think it becomes more necessary as the model gets larger. practitioners may consider turning it off for minor efficiency gains
This commit is contained in:
parent e0c689cf38
commit 3cb3fc059c
@@ -158,7 +158,6 @@ Features / APIs
 - Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right?
-- I don't currently seem to need gradient clipping but it is very often used (?). Nothing is exploding so far at these scales but maybe I'm leaving performance on the table. Evaluate with/without.
 - I am still not 100% confident that my GPT-2 small reproduction hyperparameters are good, if someone has reproduced GPT-2 I'd be eager to exchange notes ty
 - I keep seeing different values cited for weight decay and AdamW betas, look into
 - I can't exactly reproduce Chinchilla paper results, see [scaling_laws.ipynb](scaling_laws.ipynb) notebook
train.py (3 changed lines)
@@ -59,6 +59,7 @@ max_iters = 600000 # total number of training iterations
 weight_decay = 1e-2
 beta1 = 0.9
 beta2 = 0.95
+grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
 # learning rate decay settings
 decay_lr = True # whether to decay the learning rate
 warmup_iters = 2000 # how many steps to warm up for
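For orientation, a minimal sketch of how these config values typically map onto `torch.optim.AdamW`. This is not the repository's actual optimizer construction (which may use parameter groups, e.g. excluding biases and LayerNorm weights from weight decay); the `learning_rate` value and the stand-in model are assumptions for illustration.

```python
import torch

# Hypothetical stand-in for the model; the real script builds a GPT from model.py.
model = torch.nn.Linear(768, 768)

# Config values from the hunk above; learning_rate here is a placeholder assumption.
weight_decay = 1e-2
beta1, beta2 = 0.9, 0.95
learning_rate = 6e-4

# One plausible way these values feed AdamW.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    betas=(beta1, beta2),
    weight_decay=weight_decay,
)
```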
@@ -270,6 +271,8 @@ while True:
     with ctx:
         logits, loss = model(X, Y)
     loss.backward()
+    if grad_clip != 0:
+        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
     optimizer.step()
     optimizer.zero_grad(set_to_none=True)
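To see the added lines in context outside the repository, here is a hedged, self-contained toy version of the pattern this commit introduces: backward pass, optional global-norm clipping, then the optimizer step. The model, data, and loss below are placeholders, not the repository's code; only the clipping logic mirrors the diff.

```python
import torch

# Toy stand-ins for the real model and batch.
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.95), weight_decay=1e-2
)
grad_clip = 1.0  # set to 0.0 to disable clipping, matching the config flag above

X = torch.randn(8, 16)
Y = torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(X), Y)
loss.backward()
if grad_clip != 0:
    # Rescales all gradients so their global L2 norm is at most grad_clip.
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```

Setting `grad_clip = 0.0` skips the clipping call entirely, recovering the pre-commit behaviour and the small compute saving mentioned in the commit message.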