Author | Commit | Message | Date
Alexander Pivovarov | 594068e7ae | Use nn.GELU | 2023-05-17 00:53:35 +00:00
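Note: this commit swaps a hand-written GELU activation for PyTorch's built-in module. A minimal sketch of a GPT-style MLP block using `nn.GELU` (illustrative only, not the repo's exact code):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Feed-forward block using the built-in GELU activation."""
    def __init__(self, n_embd: int, dropout: float = 0.0):
        super().__init__()
        self.c_fc   = nn.Linear(n_embd, 4 * n_embd)
        self.gelu   = nn.GELU()          # replaces a hand-rolled GELU approximation
        self.c_proj = nn.Linear(4 * n_embd, n_embd)
        self.drop   = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.drop(self.c_proj(self.gelu(self.c_fc(x))))

x = torch.randn(2, 8, 64)
print(MLP(64)(x).shape)  # torch.Size([2, 8, 64])
```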
Andrej Karpathy | 7fe4a099ad | simplify configure_optimizers by a lot | 2023-05-06 14:40:28 +00:00
Andrej | 196160b849 | Merge pull request #247 from gnobre/macbook-run-instructions: Macbook run instructions | 2023-04-17 20:16:31 -07:00
Andrej | 21f9bff7e4 | Merge pull request #225 from otaviogood/grad_accum: Fix for gradient_accumulation_steps training slow | 2023-04-17 20:11:25 -07:00
Andrej | a6a708c7f1 | Merge branch 'master' into grad_accum | 2023-04-17 20:11:00 -07:00
Guilherme Nobre | e30c8fda23 | Merge branch 'karpathy:master' into macbook-run-instructions | 2023-04-15 09:50:58 +01:00
Guilherme | 4732c43af3 | add macbook specific instructions to generate samples | 2023-04-15 09:49:38 +01:00
Andrej | d9f4735f5e | Merge pull request #10 from LaihoE/master: batch file write | 2023-04-13 00:39:41 -07:00
Andrej | b288f4cfb2 | Merge pull request #146 from lutzroeder/master: Add .gitignore | 2023-04-12 22:48:37 -07:00
Andrej | 079df20748 | Merge pull request #74 from venusatuluri/fix_decode: Small fix to decode fn in shakespeare_char/prepare.py | 2023-04-12 22:45:01 -07:00
Andrej | 01e48ec1ab | Merge pull request #240 from YassineYousfi/master: don't dropout in eval mode | 2023-04-12 22:43:59 -07:00
Andrej | 7840a66859 | Merge pull request #54 from MicroPanda123/luv: Give tqdm some love :) | 2023-04-12 22:25:18 -07:00
Andrej | 8abe215fba | Merge pull request #128 from abrahamsangha/fix-typo: fix typo | 2023-04-12 22:24:41 -07:00
Andrej | ad62003d7a | Merge pull request #142 from kovkev/patch-1: Fix the position of a comma | 2023-04-12 22:24:06 -07:00
Andrej | ea24604b29 | Merge pull request #220 from python273/patch-1: Fix GPT.crop_block_size when flash attention is available | 2023-04-12 22:13:01 -07:00
Andrej | 8aeea6d970 | Merge pull request #224 from SnehalRaj/patch-1: fix small typo | 2023-04-12 22:12:26 -07:00
Andrej | 2457471c9c | Merge pull request #236 from ymurenko/master: fix "cuda out of memory" when resuming training | 2023-04-12 22:09:42 -07:00
Andrej Karpathy | 553f949f46 | fix minor bug where we have to scale the loss to account for gradient accumulation, which sums before backprop. note that this is not a major bug because AdamW is scale invariant. however, this did affect gradient clipping | 2023-04-13 04:59:11 +00:00
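Note: because `.backward()` sums gradients across micro-batches, the loss should be divided by the number of accumulation steps so the accumulated gradient matches that of one large batch; otherwise gradient clipping sees an inflated norm. A minimal sketch of the idea, using a toy model and hypothetical loop names rather than the repo's training script:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
gradient_accumulation_steps = 4  # micro-batches whose gradients are summed

for it in range(10):
    optimizer.zero_grad(set_to_none=True)
    for micro_step in range(gradient_accumulation_steps):
        x = torch.randn(8, 16)
        y = torch.randn(8, 1)
        loss = nn.functional.mse_loss(model(x), y)
        # backward() accumulates (sums) gradients, so divide the loss by the
        # number of micro-steps to recover the mean-over-the-big-batch gradient
        (loss / gradient_accumulation_steps).backward()
    # clipping now operates on the correctly scaled gradient norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```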
Yassine Yousfi | 7399dfe39d | dont always dropout! | 2023-04-10 22:56:22 -07:00
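Note: this fix (together with "enable sdpa for nonzero dropout" below) is about applying attention dropout only while training, so that `model.eval()` really disables it. A simplified single-head sketch of the pattern, assuming PyTorch's `scaled_dot_product_attention`; the repo's actual attention block is multi-head and differs in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Toy single-head causal attention illustrating training-only dropout."""
    def __init__(self, n_embd: int, dropout: float = 0.1):
        super().__init__()
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # apply attention dropout only in training mode; pass 0 in eval mode
        y = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=True,
        )
        return self.proj(y)

attn = SelfAttention(32)
attn.eval()                          # eval() should switch dropout off everywhere
out = attn(torch.randn(2, 5, 32))
```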
ymurenko | 4ac2e8ce3a | fix "cuda out of memory" when resuming training | 2023-04-05 17:28:55 -04:00
Snehal Raj | c58fc4605c | fix small typo | 2023-03-25 20:36:46 +01:00
Otavio Good | 978d4fe538 | Fix for gradient_accumulation_steps training slow | 2023-03-25 00:04:45 -07:00
Kirill | c3f254844d | Fix GPT.crop_block_size when flash attention is available | 2023-03-24 14:51:02 +03:00
Andrej | a82b33b525 | Merge pull request #199 from ChristianOrr/patch-1: bugfix in decode function | 2023-03-12 13:40:20 -07:00
Christian Orr | 36c7db8c44 | bugfix in decode function: Return was left out of the decoder, so it didn't work | 2023-03-08 10:16:19 +02:00
Andrej | 0d8fbd11ae | Merge pull request #195 from drisspg/enable_sdpa_with_nonzero_dropout: Enable sdpa for nonzero dropout | 2023-03-06 21:47:20 -08:00
Driss Guessous | 6170531b8a | enable sdpa for nonzero dropout | 2023-03-05 19:29:29 +00:00
Andrej | ae3a8d5fdd | Merge pull request #145 from otaviogood/gradAccumStability: fix for training stability on single GPU | 2023-02-14 18:48:54 -08:00
Lutz Roeder | 10046a2ec0 | Add .gitignore | 2023-02-13 13:57:20 -08:00
Otavio Good | 086ebe1822 | fix for training stability on single GPU | 2023-02-13 10:42:44 -08:00
kovkev | c2531159c7 | Fix the position of a comma | 2023-02-11 17:13:24 -08:00
Andrej Karpathy | 55c5069696 | fix misinformation in readme | 2023-02-10 16:34:46 +00:00
Andrej Karpathy | e58f0cfa94 | oops i should not be needing or multiplying by world_size to calculate mfu | 2023-02-07 21:38:39 +00:00
Abraham Sangha | 27a5d6f123 | fix typos | 2023-02-07 11:02:20 -07:00
Andrej Karpathy | 8b1e43209e | small tweaks, make default WD be 0.1 as is often cited, and remove spurious init of LayerNorm, which is already initialized at 1,0 | 2023-02-06 23:07:25 +00:00
Andrej Karpathy | ab21d6c15d | bugfix we have to call the raw_model's estimate_mfu ty @jprobichaud for original PR | 2023-02-06 19:55:35 +00:00
Andrej Karpathy | f83dd034e1 | also add a sampling/inference section | 2023-02-05 21:02:30 +00:00
Andrej Karpathy | 23a8e701d2 | revamp the readme file to be a bit better and more accessible, i hope | 2023-02-05 19:31:32 +00:00
Andrej Karpathy | fce706cbe6 | tune the hyperparams a bit, in configs | 2023-02-05 19:31:18 +00:00
Andrej Karpathy | ab0718a7dd | add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving | 2023-02-05 00:48:58 +00:00
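Note: MFU compares the flops a run actually achieves per second against the accelerator's peak (312 TFLOPS for A100 bfloat16). A rough sketch of that kind of estimate, assuming the common PaLM-style flops-per-token accounting (hypothetical function and argument names, not the repo's `estimate_mfu`):

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, seq_len,
                 tokens_per_iter, dt, peak_flops=312e12):
    """Rough model-flops-utilization estimate against A100 bf16 peak (312 TFLOPS).

    Fwd+bwd flops per token ~ 6 * n_params for the dense parameters, plus an
    attention term ~ 12 * n_layer * n_head * head_dim * seq_len.
    """
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * seq_len
    flops_per_iter = flops_per_token * tokens_per_iter
    flops_achieved = flops_per_iter / dt      # flops per second over one iteration
    return flops_achieved / peak_flops

# e.g. a GPT-2-small-ish config processing ~0.5M tokens per iteration in 4 seconds
print(estimate_mfu(124e6, 12, 12, 64, 1024, tokens_per_iter=491_520, dt=4.0))
```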
Andrej Karpathy | 580902617c | oops optimizer now demands to know device_type | 2023-02-05 00:43:15 +00:00
Andrej Karpathy | 34720df284 | make more accurate the way in which we count parameters. previous count incorrectly included the positional encoding params, when typically only the number of weight parameters is reported for these models | 2023-02-04 23:51:18 +00:00
Andrej Karpathy | 3341b4cecc | oops forgot to subtract embedding params, which don't enter the 6ND equation | 2023-02-04 22:33:35 +00:00
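Note: the two commits above concern the C ≈ 6ND compute estimate, where N is the count of weight parameters (token and positional embedding parameters excluded) and D is the number of training tokens. A back-of-the-envelope sketch of that arithmetic, with hypothetical config numbers roughly in the GPT-2 small range:

```python
# hypothetical config roughly matching GPT-2 small; real numbers come from the model
n_layer, n_embd, vocab_size, block_size = 12, 768, 50304, 1024

# token + positional embedding parameters (excluded from N in the 6ND estimate)
emb_params = (vocab_size + block_size) * n_embd
# rough count of remaining weight parameters: ~12 * n_layer * n_embd^2
# (4*n_embd^2 for attention, 8*n_embd^2 for the MLP, per layer)
weight_params = 12 * n_layer * n_embd ** 2

D = 300e9                         # training tokens
C = 6 * weight_params * D         # estimated training compute in FLOPs
print(f"embeddings: {emb_params/1e6:.1f}M, N: {weight_params/1e6:.1f}M, C: {C:.2e} FLOPs")
```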
Andrej Karpathy | 5a162bc773 | fix silly error, i don't want to confuse a future GPT training on this notebook in the future | 2023-02-04 22:11:16 +00:00
Andrej Karpathy | 0bb96d3fff | add reference for 6ND to notebook too | 2023-02-04 22:07:32 +00:00
Andrej Karpathy | eae986c2d2 | new notebook with a bunch of calculations related to flops and memory of Transformer | 2023-02-04 22:02:53 +00:00
Andrej Karpathy | a74e8363a2 | clean up TODOs a bit, they are stale | 2023-02-04 21:11:25 +00:00
Andrej Karpathy | 25d95dbd65 | mildly dramatic refactor for handing all these usage cases across all possible supported and unsupported devices for all the possible switches and flags | 2023-02-04 21:06:17 +00:00
Andrej Karpathy | e108ffb973 | very slight refactor, bit cleaner | 2023-02-04 19:34:24 +00:00
Andrej | dc149891b6 | Merge pull request #120 from nynyg/remove_cpu_pin_mem: Pin memory only when training on GPU | 2023-02-04 11:28:08 -08:00
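Note: page-locked (pinned) host memory only pays off for asynchronous host-to-GPU copies; on a CPU-only run it is wasted work, hence making it conditional on the device. A minimal sketch of the pattern (hypothetical helper, not the repo's get_batch):

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def to_device(x: torch.Tensor) -> torch.Tensor:
    if device == 'cuda':
        # pinned memory enables a non-blocking, asynchronous copy to the GPU
        return x.pin_memory().to(device, non_blocking=True)
    # on CPU there is no transfer, so skip pinning entirely
    return x.to(device)

batch = to_device(torch.randint(0, 50304, (8, 1024)))
```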