mirror of
https://github.com/osmarks/nanogpt-experiments.git
synced 2024-12-30 12:00:29 +00:00
flesh out the remaining TODOs in readme a bit more
parent 177d5f7dc5
commit b45eec3e4b
31
README.md
@@ -84,12 +84,31 @@ Code by default now uses [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2
 
 ## todos
 
-A few that I'm aware of, other than the ones mentioned in code:
+A few todos I'm aware of:
+
+Optimizations
 
 - Additional optimizations to the running time
-- Report and track other metrics e.g. PPL
+- Investigate need for an actual Data Loader with a dedicated worker process for data
+- Look into more efficient fused optimizers (e.g. apex)
+- Re-evaluate use of flash attention (previously I wasn't able to get the forward pass to match up so I took it out)
+- CUDA Graphs?
+- Investigate potential speedups from Lightning or huggingface Accelerate
+
+Features / APIs
+
+- Add back fp16 support? (would need to also add back gradient scaler)
+- Add CPU support
+- Finetune the finetuning script, I think the hyperparams are not great
+- Replace poor man's configurator, and make sample.py configurable...
+- Report and track other metrics e.g. perplexity, num_tokens, MFU, ...
 - Eval zero-shot perplexities on PTB, WikiText, other related benchmarks
-- Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right
-- Currently fp16 is much faster than bf16. Potentially revert back to using fp16 and re-introduce the gradient scaler?
-- Add some finetuning dataset and guide on some dataset for demonstration.
-- Reproduce GPT-2 results. It was estimated ~3 years ago that the training cost of 1.5B model was ~$50K
+
+Suspiciousness
+
+- Current initialization (PyTorch default) departs from GPT-2. In a very quick experiment I found it to be superior to the one suggested in the papers, but that can't be right?
+- I am still not 100% confident that my GPT-2 small reproduction hyperparameters are good, if someone has reproduced GPT-2 I'd be eager to exchange notes ty
+
+Results
+
+- Actually reproduce GPT-2 results and have clean configs that reproduce the result. It was estimated ~3 years ago that the training cost of 1.5B model was ~$50K (?). Sounds a bit too high.
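The "Report and track other metrics e.g. perplexity, num_tokens, MFU" todo could be approached roughly as below. This is only a sketch, not code from this repository: the function names `perplexity` and `model_flops_utilization` are hypothetical, the FLOPs-per-token formula is the common `6*N + 12*L*H*Q*T` approximation for transformer training, and the peak throughput used in the usage lines (312e12 FLOP/s, the dense bf16 figure usually quoted for an A100) is an assumption that needs adjusting for other hardware.

```python
import math


def perplexity(mean_loss_nats: float) -> float:
    """Perplexity is exp() of the mean per-token cross-entropy, measured in nats."""
    return math.exp(mean_loss_nats)


def model_flops_utilization(
    n_params: float,
    n_layer: int,
    n_head: int,
    head_dim: int,
    seq_len: int,
    tokens_per_sec: float,
    peak_flops_per_sec: float,
) -> float:
    """Fraction of the hardware's peak FLOP/s that training actually achieves.

    Assumption: training costs about 6*N FLOPs per token for the weight
    matmuls (forward plus backward) and 12*L*H*Q*T for attention.
    """
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    achieved_flops_per_sec = flops_per_token * tokens_per_sec
    return achieved_flops_per_sec / peak_flops_per_sec


# Hypothetical numbers for a GPT-2-small-sized model; the throughput is made up.
ppl = perplexity(3.0)
mfu = model_flops_utilization(
    n_params=124e6, n_layer=12, n_head=12, head_dim=64, seq_len=1024,
    tokens_per_sec=100_000, peak_flops_per_sec=312e12,
)
print(f"perplexity {ppl:.2f}, MFU {mfu:.1%}")
```

Both values are cheap to compute at each logging step from the loss and the measured iteration time, so they can be reported alongside the existing loss output.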