nanogpt-experiments/data/shakespeare/prepare.py

import os
import requests
import tiktoken
import numpy as np
# download the tiny shakespeare dataset
input_file_path = os.path.join(os.path.dirname(__file__), 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w') as f:
        f.write(requests.get(data_url).text)
with open(input_file_path, 'r') as f:
    data = f.read()
n = len(data)
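# create a 90%/10% train/val split of the raw text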
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]
# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
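# encode_ordinary encodes plain BPE tokens and ignores any special tokens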
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")
# export to bin files
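# uint16 is sufficient since the GPT-2 BPE vocab's max token id is 50256 < 2**16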
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))
# train.bin has 301,966 tokens
# val.bin has 36,059 tokens
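# the bin files are raw uint16 token streams; to read them back later, e.g.:
# m = np.memmap('train.bin', dtype=np.uint16, mode='r')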