mirror of https://github.com/osmarks/nanogpt-experiments.git synced 2024-09-21 03:39:44 +00:00
Commit Graph

177 Commits

Author SHA1 Message Date
Alexander Pivovarov
594068e7ae Use nn.GELU 2023-05-17 00:53:35 +00:00
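A hedged sketch of the change this commit describes: replacing a hand-rolled tanh-approximation GELU with PyTorch's built-in `nn.GELU`. The `MLP` layout below is illustrative, not a quote of the repo's code.

```python
import torch.nn as nn

class MLP(nn.Module):
    # illustrative feed-forward block; only the activation line is the point
    def __init__(self, n_embd):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)
        self.gelu = nn.GELU()  # built-in module replaces a hand-written approximation
        self.c_proj = nn.Linear(4 * n_embd, n_embd)

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```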
Andrej Karpathy
7fe4a099ad simplify configure_optimizers by a lot 2023-05-06 14:40:28 +00:00
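A sketch of the usual way this simplification goes (the mechanism is inferred, names are illustrative): instead of walking every module to build decay/no-decay sets, split parameters by dimensionality, so matmul and embedding weights (2-D and up) get weight decay while biases and norm gains (1-D) do not.

```python
import torch

def configure_optimizers(model, weight_decay, learning_rate, betas):
    params = [p for p in model.parameters() if p.requires_grad]
    # 2-D+ tensors are matmul/embedding weights -> decayed;
    # 1-D tensors are biases and layernorm gains -> not decayed
    decay = [p for p in params if p.dim() >= 2]
    no_decay = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=learning_rate, betas=betas)
```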
Andrej
196160b849
Merge pull request #247 from gnobre/macbook-run-instructions
Macbook run instructions
2023-04-17 20:16:31 -07:00
Andrej
21f9bff7e4
Merge pull request #225 from otaviogood/grad_accum
Fix for gradient_accumulation_steps training slow
2023-04-17 20:11:25 -07:00
Andrej
a6a708c7f1
Merge branch 'master' into grad_accum 2023-04-17 20:11:00 -07:00
Guilherme Nobre
e30c8fda23
Merge branch 'karpathy:master' into macbook-run-instructions 2023-04-15 09:50:58 +01:00
Guilherme
4732c43af3 add macbook specific instructions to generate samples 2023-04-15 09:49:38 +01:00
Andrej
d9f4735f5e
Merge pull request #10 from LaihoE/master
batch file write
2023-04-13 00:39:41 -07:00
Andrej
b288f4cfb2
Merge pull request #146 from lutzroeder/master
Add .gitignore
2023-04-12 22:48:37 -07:00
Andrej
079df20748
Merge pull request #74 from venusatuluri/fix_decode
Small fix to decode fn in shakespeare_char/prepare.py
2023-04-12 22:45:01 -07:00
Andrej
01e48ec1ab
Merge pull request #240 from YassineYousfi/master
don't dropout in eval mode
2023-04-12 22:43:59 -07:00
Andrej
7840a66859
Merge pull request #54 from MicroPanda123/luv
Give tqdm some love :)
2023-04-12 22:25:18 -07:00
Andrej
8abe215fba
Merge pull request #128 from abrahamsangha/fix-typo
fix typo
2023-04-12 22:24:41 -07:00
Andrej
ad62003d7a
Merge pull request #142 from kovkev/patch-1
Fix the position of a comma
2023-04-12 22:24:06 -07:00
Andrej
ea24604b29
Merge pull request #220 from python273/patch-1
Fix GPT.crop_block_size when flash attention is available
2023-04-12 22:13:01 -07:00
Andrej
8aeea6d970
Merge pull request #224 from SnehalRaj/patch-1
fix small typo
2023-04-12 22:12:26 -07:00
Andrej
2457471c9c
Merge pull request #236 from ymurenko/master
fix "cuda out of memory" when resuming training
2023-04-12 22:09:42 -07:00
Andrej Karpathy
553f949f46 fix minor bug where we have to scale the loss to account for gradient accumulation, which sums before backprop. note that this is not a major bug because AdamW is scale invariant. however, this did affect gradient clipping 2023-04-13 04:59:11 +00:00
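The mechanism this commit message describes, as a minimal sketch: with gradient accumulation, `backward()` sums gradients across micro-batches, so each micro-loss must be divided by the number of micro-steps to keep gradient magnitudes, and therefore clipping, consistent with one large batch. `get_batch` and the `(logits, loss)` return shape are assumptions for illustration.

```python
import torch

def train_step(model, optimizer, get_batch, gradient_accumulation_steps, grad_clip=1.0):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(gradient_accumulation_steps):
        X, Y = get_batch('train')   # illustrative data-loader hook
        _, loss = model(X, Y)       # assumes the model returns (logits, loss)
        # scale so the summed gradients match a single large-batch gradient;
        # AdamW is scale-invariant, but gradient clipping is not
        (loss / gradient_accumulation_steps).backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
```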
Yassine Yousfi
7399dfe39d don't always dropout! 2023-04-10 22:56:22 -07:00
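The eval-mode issue behind this fix (per the PR title above, "don't dropout in eval mode") is the classic functional-dropout trap: `nn.Dropout` is switched off by `model.eval()`, but `F.dropout` fires unconditionally unless gated on `self.training`. A sketch of both correct forms, with an illustrative module:

```python
import torch.nn as nn
import torch.nn.functional as F

class Blockish(nn.Module):
    # illustrative module, not the repo's code
    def __init__(self, p):
        super().__init__()
        self.p = p
        self.drop = nn.Dropout(p)  # module form: disabled automatically by model.eval()

    def forward(self, x):
        x = self.drop(x)
        # functional form must be gated explicitly, or it also fires in eval mode
        return F.dropout(x, p=self.p, training=self.training)
```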
ymurenko
4ac2e8ce3a fix "cuda out of memory" when resuming training 2023-04-05 17:28:55 -04:00
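A plausible reading of this fix (the mechanism is inferred, not quoted from the diff): the loaded checkpoint dict keeps a second full copy of the weights alive next to the model and optimizer state, which can tip a resume over the memory limit; dropping the reference lets it be freed. Checkpoint key names below are illustrative.

```python
import torch

def resume(model, optimizer, ckpt_path, device):
    checkpoint = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(checkpoint['model'])          # illustrative key names
    optimizer.load_state_dict(checkpoint['optimizer'])
    iter_num = checkpoint['iter_num']
    checkpoint = None  # drop the duplicate copy of the weights so it can be freed
    return iter_num
```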
Snehal Raj
c58fc4605c
fix small typo 2023-03-25 20:36:46 +01:00
Otavio Good
978d4fe538 Fix for gradient_accumulation_steps training slow 2023-03-25 00:04:45 -07:00
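One way gradient accumulation gets slow under DDP, and a likely shape of this fix (hedged; the actual diff may differ): every micro-step's `backward()` triggers a gradient all-reduce, when only the last one needs it. PyTorch's documented `no_sync()` context suppresses the redundant communication:

```python
import contextlib

def accumulate(ddp_model, optimizer, get_batch, steps):
    # assumes ddp_model is wrapped in torch.nn.parallel.DistributedDataParallel
    optimizer.zero_grad(set_to_none=True)
    for micro_step in range(steps):
        last = (micro_step == steps - 1)
        # skip the gradient all-reduce on all but the final micro-step
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            X, Y = get_batch('train')   # illustrative data-loader hook
            _, loss = ddp_model(X, Y)
            (loss / steps).backward()
    optimizer.step()
```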
Kirill
c3f254844d
Fix GPT.crop_block_size when flash attention is available 2023-03-24 14:51:02 +03:00
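The likely shape of the fix (assuming the usual nanoGPT module layout with `transformer.wpe` and per-block `attn`): when flash attention is in use, the causal-mask `bias` buffer is never registered, so cropping has to check for it before slicing.

```python
import torch.nn as nn

def crop_block_size(model, block_size):
    assert block_size <= model.config.block_size
    model.config.block_size = block_size
    # shrink the learned position embeddings to the new context length
    wpe = model.transformer.wpe
    wpe.weight = nn.Parameter(wpe.weight[:block_size])
    for block in model.transformer.h:
        # the mask buffer only exists on the slow attention path;
        # with flash attention there is nothing to crop
        if hasattr(block.attn, 'bias'):
            block.attn.bias = block.attn.bias[:, :, :block_size, :block_size]
```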
Andrej
a82b33b525
Merge pull request #199 from ChristianOrr/patch-1
bugfix in decode function
2023-03-12 13:40:20 -07:00
Christian Orr
36c7db8c44
bugfix in decode function
Return was left out of the decoder, so it didn't work.
2023-03-08 10:16:19 +02:00
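The fix is exactly what the message says: the character-level `decode` built its output but never returned it. A sketch of the corrected function, assuming the usual `itos` int-to-character table from `prepare.py`:

```python
def decode(ids, itos):
    # map a list of integer token ids back to a string;
    # the bug was simply a missing return
    return ''.join(itos[i] for i in ids)
```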
Andrej
0d8fbd11ae
Merge pull request #195 from drisspg/enable_sdpa_with_nonzero_dropout
Enable sdpa for nonzero dropout
2023-03-06 21:47:20 -08:00
Driss Guessous
6170531b8a enable sdpa for nonzero dropout 2023-03-05 19:29:29 +00:00
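Presumably along these lines: `F.scaled_dot_product_attention` takes a `dropout_p` argument, so nonzero attention dropout no longer has to fall back to the manual path; it just needs zeroing at eval time. A hedged sketch:

```python
import torch.nn.functional as F

def attend(q, k, v, dropout_p, training):
    # fused SDPA path; apply attention dropout only while training
    return F.scaled_dot_product_attention(
        q, k, v,
        attn_mask=None,
        dropout_p=dropout_p if training else 0.0,
        is_causal=True,
    )
```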
Andrej
ae3a8d5fdd
Merge pull request #145 from otaviogood/gradAccumStability
fix for training stability on single GPU
2023-02-14 18:48:54 -08:00
Lutz Roeder
10046a2ec0 Add .gitignore 2023-02-13 13:57:20 -08:00
Otavio Good
086ebe1822 fix for training stability on single GPU 2023-02-13 10:42:44 -08:00
kovkev
c2531159c7
Fix the position of a comma 2023-02-11 17:13:24 -08:00
Andrej Karpathy
55c5069696 fix misinformation in readme 2023-02-10 16:34:46 +00:00
Andrej Karpathy
e58f0cfa94 oops, i should not be multiplying by world_size to calculate mfu 2023-02-07 21:38:39 +00:00
Abraham Sangha
27a5d6f123 fix typos 2023-02-07 11:02:20 -07:00
Andrej Karpathy
8b1e43209e small tweaks, make default WD be 0.1 as is often cited, and remove spurious init of LayerNorm, which is already initialized at 1,0 2023-02-06 23:07:25 +00:00
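On the LayerNorm half of this commit: `nn.LayerNorm` already initializes its affine parameters to weight 1 and bias 0, so an explicit re-init is dead code. A minimal check:

```python
import torch.nn as nn

ln = nn.LayerNorm(768)
# default init is already gain=1, bias=0; no manual init needed
assert (ln.weight == 1).all() and (ln.bias == 0).all()
```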
Andrej Karpathy
ab21d6c15d bugfix we have to call the raw_model's estimate_mfu ty @jprobichaud for original PR 2023-02-06 19:55:35 +00:00
Andrej Karpathy
f83dd034e1 also add a sampling/inference section 2023-02-05 21:02:30 +00:00
Andrej Karpathy
23a8e701d2 revamp the readme file to be a bit better and more accessible, i hope 2023-02-05 19:31:32 +00:00
Andrej Karpathy
fce706cbe6 tune the hyperparams a bit, in configs 2023-02-05 19:31:18 +00:00
Andrej Karpathy
ab0718a7dd add the estimation of model flops utilization (MFU), a very commonly looked at metric that estimates the token throughput in units of A100 bfloat16 peak flops (312 TFLOPS). this gives us a sense of the hardware utilization we're achieving 2023-02-05 00:48:58 +00:00
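The estimate described here follows the standard PaLM-appendix accounting: roughly 6N FLOPs per token for the weights plus a 12·L·H·Q·T attention term, compared against one A100's 312 TFLOPS bf16 peak (per process, consistent with the world_size fix above). A hedged sketch with illustrative parameter names:

```python
def estimate_mfu(n_params, n_layer, n_head, head_dim, block_size,
                 fwdbwd_per_iter, dt):
    # PaLM-appendix flop accounting: 6N per token plus the attention term
    flops_per_token = 6 * n_params + 12 * n_layer * n_head * head_dim * block_size
    flops_per_iter = flops_per_token * block_size * fwdbwd_per_iter
    # fraction of one A100's bf16 peak achieved in a dt-second iteration
    return (flops_per_iter / dt) / 312e12
```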
Andrej Karpathy
580902617c oops optimizer now demands to know device_type 2023-02-05 00:43:15 +00:00
Andrej Karpathy
34720df284 make more accurate the way in which we count parameters. previous count incorrectly included the positional encoding params, when typically only the number of weight parameters is reported for these models 2023-02-04 23:51:18 +00:00
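A sketch of the counting convention the commit describes, assuming the usual layout with `transformer.wpe`: subtract the learned position embeddings, which don't act as weights; token embeddings stay because weight tying reuses them as the output projection.

```python
def get_num_params(model, non_embedding=True):
    n_params = sum(p.numel() for p in model.parameters())
    if non_embedding:
        # position embeddings are not weights; token embeddings are kept
        # because weight tying reuses them as the output projection
        n_params -= model.transformer.wpe.weight.numel()
    return n_params
```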
Andrej Karpathy
3341b4cecc oops forgot to subtract embedding params, which don't enter the 6ND equation 2023-02-04 22:33:35 +00:00
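The 6ND rule referenced here: training compute C ≈ 6·N·D FLOPs for N non-embedding parameters and D training tokens (about 2N multiply-add FLOPs forward and 4N backward per token), which is why embedding params are excluded. A worked example at GPT-2-small scale:

```python
N = 124e6          # non-embedding parameters (GPT-2-small scale)
D = 300e9          # training tokens
C = 6 * N * D      # forward ~2N + backward ~4N FLOPs per token
print(f"C = {C:.2e} FLOPs")  # C = 2.23e+20 FLOPs
```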
Andrej Karpathy
5a162bc773 fix silly error, i don't want to confuse a future GPT trained on this notebook 2023-02-04 22:11:16 +00:00
Andrej Karpathy
0bb96d3fff add reference for 6ND to notebook too 2023-02-04 22:07:32 +00:00
Andrej Karpathy
eae986c2d2 new notebook with a bunch of calculations related to flops and memory of Transformer 2023-02-04 22:02:53 +00:00
Andrej Karpathy
a74e8363a2 clean up TODOs a bit, they are stale 2023-02-04 21:11:25 +00:00
Andrej Karpathy
25d95dbd65 mildly dramatic refactor for handling all these usage cases across all possible supported and unsupported devices for all the possible switches and flags 2023-02-04 21:06:17 +00:00
Andrej Karpathy
e108ffb973 very slight refactor, bit cleaner 2023-02-04 19:34:24 +00:00
Andrej
dc149891b6
Merge pull request #120 from nynyg/remove_cpu_pin_mem
Pin memory only when training on GPU
2023-02-04 11:28:08 -08:00