This downloads and tokenizes the OpenWebText dataset, creating a train.bin and a val.bin that hold the GPT-2 BPE token ids in one massive sequence. Then we're ready to kick off training. First open up train.py, read through it, and make sure the settings look ok. Then:
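As a quick sanity check before training, you can peek at the token stream directly. A minimal sketch, assuming (as the prepare step here does) that the bin files store the token ids as raw uint16 values:

```python
import numpy as np

def load_tokens(path):
    # memory-map the file so the full multi-GB sequence is never
    # pulled into RAM at once; each element is one GPT-2 BPE token id
    return np.memmap(path, dtype=np.uint16, mode="r")

if __name__ == "__main__":
    tokens = load_tokens("train.bin")
    print(f"{len(tokens):,} tokens; first ten ids: {tokens[:10].tolist()}")
```

If the dtype or filename differs in your setup, adjust accordingly; the point is just that the bin file is a flat array of token ids with no framing or metadata.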
$ python train.py
Once some checkpoints are written to the output directory out, we're ready to sample from the model:
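Sampling means repeatedly drawing the next token from the model's output distribution rather than always taking the argmax. The sampling script handles this for you, but a minimal NumPy sketch of the usual temperature + top-k scheme (an illustration, not the repo's actual code) looks like:

```python
import numpy as np

def sample_next(logits, temperature=0.8, top_k=50, rng=None):
    # draw one token id from softmax(logits / temperature),
    # restricted to the top_k highest-scoring tokens
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]          # k-th largest logit
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())         # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

Lower temperature or smaller top_k makes generations more conservative; higher values make them more diverse (and eventually incoherent).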