NanoGPT Slowrun


NanoGPT Slowrun is a new benchmark for language modeling algorithms in the infinite-compute, fixed-data regime: 100M tokens from FineWeb, no compute or time limit, lowest validation loss wins.1 We call it a Slowrun because the goal is to spend as much time with the data as needed to maximize learning from it. We deliberately choose this setting in contrast to speedruns like modded-nanogpt, which assume infinite data and optimize for wall-clock time on fixed hardware. Loved by @karpathy himself!

When speed is not the binding constraint, the space of promising algorithms changes dramatically: large models trained with heavy regularization, expensive optimizers, and evolutionary search all become fair game. We want leaps like GPT-3, where previously unimaginable compute led to better generalization. That doesn't happen when wall-clock time is your constraint.

The baseline trains in ~47 minutes on 8xH100 (~$12) and achieves 3.402 val loss. There are three tracks:

  1. a limited compute track capped at a single 8xH100 node for 1 hour (this is 100x the compute used by the Nanochat 1-epoch baseline),
  2. a tiny compute track capped at a single 8xH100 node for 15 minutes,
  3. and an unlimited compute track with minimal restrictions on hardware or time.

For now the limited track lives in the root directory, the tiny track lives at tiny/, and the unlimited track lives at unlimited/. Submit an entry by opening a PR.

Running the current record

You can reproduce the limited-compute record by running the following commands:

```bash
git clone https://github.com/qlabs-eng/slowrun.git && cd slowrun
pip install -r requirements.txt
python prepare_data.py
torchrun --standalone --nproc_per_node=8 train.py
```

World Record History

We accept PRs that achieve a new World Record validation loss within the track's time limit, and add an entry below for each improvement.

Limited Compute Track (1 hour)

The limited-compute track caps runs at a single 8xH100 node for at most 1 hour.

| # | Val Loss | Description | Date | Time | Script | Contributors |
|---|----------|-------------|------|------|--------|--------------|
| 1 | 3.402 | Baseline: 2.7B transformer, Muon, dropout 0.1, weight decay 1.6 | 02/26/26 | ~47 mins | Script | @akshayvegesna |
| 2 | 3.376 | Add shuffling every epoch | 02/27/26 | ~47 mins | Script | @kvegesna |
| 3 | 3.349 | Change value embed tables to projections from x0 | 03/01/26 | ~47 mins | Script | @ms337 |
| 4 | 3.335 | Use SwiGLU activation | 03/01/26 | 52.1 mins | Script | @akshayvegesna |
| 5 | 3.314 | Add U-Net architecture | 03/03/26 | 52.3 mins | Script | @em-see-squared |
| 6 | 3.295 | Add gating per attention head | 03/03/26 | 53.3 mins | Script | @akshayvegesna |
| 7 | 3.285 | Repeat layers 15-20 for last 3 epochs, reduce warmdown | 03/11/26 | 53.3 mins (training time only) | Script | @shmublu |
| 8 | 3.278 | Run layers 15-20 3 times before layers 21-29 for the last 3 epochs | 03/11/26 | 55.7 mins | Script | @akshayvegesna |
| 9 | 3.276 | Add exclusive self attention (XSA) | 03/12/26 | 57.7 mins | Script | @not-nonymous |
| 10 | 3.270 | LR tuning, warmdown tuning | 03/16/26 | 55.5 mins | Script | @zhiweixx |
| 11 | 3.252 | EMA of weights, hyperparameter tuning | 03/18/26 | 59.2 mins | Script | @ChinmayK0607, @ms337 |
| 12 | 3.248 | Use weighted average of last 3 epoch checkpoints | 03/23/26 | 58.2 mins | Script | @not-nonymous |
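The two most recent records (11 and 12) both rely on parameter averaging. As a minimal sketch, assuming checkpoints stored as plain name-to-value dicts (real code would walk `torch` state dicts) and illustrative weights rather than the record's actual ones, a weighted checkpoint average looks like:

```python
def average_checkpoints(checkpoints, weights):
    """Weighted average of parameter dicts; weights are normalized to sum to 1."""
    total = float(sum(weights))
    norm = [w / total for w in weights]
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = sum(w * ckpt[name] for w, ckpt in zip(norm, checkpoints))
    return averaged

# Toy example: last 3 epoch checkpoints, later epochs weighted more heavily.
ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 4.0}]
avg = average_checkpoints(ckpts, weights=[1, 2, 3])  # avg["w"] == 17/6
```

Averaging checkpoints from nearby epochs gives an ensemble-like effect with no extra inference cost, which is why it pairs well with the EMA approach in record 11.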

Tiny Track (15 minutes)

The tiny track caps runs at a single 8xH100 node for at most 15 minutes.

| # | Val Loss | Description | Date | Time | Script | Contributors |
|---|----------|-------------|------|------|--------|--------------|
| 1 | 3.428 | Baseline: 300M transformer, weight decay 0.8, dropout 0.1 | 03/02/26 | 13.7 mins | Script | @akshayvegesna |
| 2 | 3.410 | Add SwiGLU activation | 03/02/26 | 14.4 mins | Script | @ChinmayK0607 |
| 3 | 3.395 | Add U-Net architecture | 03/03/26 | 14.5 mins | Script | @em-see-squared, @akshayvegesna |
| 4 | 3.385 | Add gating per attention head | 03/04/26 | 14.6 mins | Script | @ChinmayK0607 |
| 5 | 3.383 | Update warmdown ratio | 03/06/26 | 14.6 mins | Script | @not-nonymous |
| 6 | 3.374 | Half truncated RoPE, partial key offset, residual lambdas to 1.1 | 03/06/26 | 14.8 mins | Script | @ChinmayK0607 |
| 7 | 3.365 | Add weight decay schedule | 03/15/26 | 14.8 mins | Script | @shmublu |
| 8 | 3.353 | Add EMA parameter averaging | 03/21/26 | 14.9 mins | Script | @clarkkev |
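The final tiny-track record adds EMA parameter averaging. A minimal sketch, assuming plain scalar parameters and an illustrative decay value (the record's actual decay and update cadence are in its script):

```python
def ema_update(ema_params, params, decay=0.99):
    """One EMA step: ema <- decay * ema + (1 - decay) * current parameters."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

# Toy example: the EMA trails the raw parameters, smoothing over noisy steps.
ema = {"w": 0.0}
ema_update(ema, {"w": 1.0}, decay=0.9)  # ema["w"] ≈ 0.1
ema_update(ema, {"w": 1.0}, decay=0.9)  # ema["w"] ≈ 0.19
```

Evaluating with the EMA weights rather than the raw weights is the same idea behind record 11 on the limited track.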

Unlimited Compute Track

| # | Val Loss | Description | Date | Time | Script | Contributors |
|---|----------|-------------|------|------|--------|--------------|
| 1 | 3.402 | Baseline: 2.7B transformer, Muon, dropout 0.1, weight decay 1.6 | 02/26/26 | ~47 mins | Script | @akshayvegesna |
| 2 | 3.264 | Baseline: 8 × 2.7B transformer, Muon, dropout 0.1, weight decay 1.6, logit averaging | 02/27/26 | 6h 44m | Script | @akshayvegesna |
| 3 | 3.218 | Use value projections and SwiGLU activation | 03/02/26 | 6h 54m | Script | @akshayvegesna |
| 4 | 3.185 | Add U-Net and attention gating | 03/04/26 | 7h 8m | Script | @akshayvegesna, @em-see-squared |
| 5 | 3.166 | Train each model for 1.5x longer | 03/05/26 | 10h 35m | Script | @akshayvegesna |
| 6 | 3.126 | Train each model in ensemble to distill previous model + usual CE loss | 03/07/26 | 16h 1m | Script | @not-nonymous |
| 7 | 3.089 | Ensemble of 10 models, looping of layers 15-20, tuned epoch counts, loss weight | 03/13/26 | 19h 18m (2 nodes, 8xH100) | Script | @akshayvegesna |
| 8 | 3.081 | Ensemble of 12 models, distill alpha 0.5 | 03/18/26 | 42h 35m (1 node, 8xH100) | Script | @not-nonymous |
| 9 | 3.045 | More looping, hyperparam tuning, model size increase | 03/19/26 | ~44h (2 nodes, 8xH100) | Script | @akshayvegesna |
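The unlimited-track ensembles (record 2 onward) combine members by logit averaging: raw logits are averaged across models before a single softmax. A minimal sketch with a hypothetical 3-token vocabulary and made-up logit values:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_next_token_probs(per_model_logits):
    """Average raw logits across ensemble members, then take one softmax."""
    n = len(per_model_logits)
    vocab = len(per_model_logits[0])
    avg_logits = [sum(model[v] for model in per_model_logits) / n
                  for v in range(vocab)]
    return softmax(avg_logits)

# Two hypothetical ensemble members voting over a 3-token vocabulary.
probs = ensemble_next_token_probs([[2.0, 0.0, -1.0], [1.0, 1.0, -2.0]])
```

Averaging in logit space before the softmax is a design choice; averaging post-softmax probabilities is the other common option and gives slightly different results.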

Why limited data, unlimited compute?

The bitter lesson tells us to strongly prefer algorithms that scale with compute alone. As long as performance is bottlenecked by data, models cannot improve at the rate compute scales.

This repo builds on Nanochat, which took many ideas from the modded-nanogpt speedrun contest. To be fair, the speedrun contest did provide real data efficiency gains: using less data is one way to train faster. But because it sets speed as the binding constraint, it filters out an entire class of algorithms that yield learning gains.

Baseline Approach

Following Kim et al. (2025),2 we developed the baseline in three steps:

  1. Optimizer selection. We tested popular optimizers in the data-limited regime, training for multiple epochs on the 100M tokens. Muon outperforms AdamW, SOAP, and MAGMA.

  2. Scaling up. We increased model size but found diminishing returns due to the limited data. Without appropriate regularization, a 1.4B parameter model outperforms a 2.7B parameter model.

  3. Regularization. When we scale up model size while also applying heavy weight decay, we recover monotonic improvements with scale. We further find that dropout improves performance on top of weight decay. Our final model is a 2.7B parameter transformer, with 1.2B parameters in the transformer trunk and the heavy embedding defaults from Nanochat. It is trained with dropout 0.1 and weight decay 1.6. This weight decay is very large by traditional standards, but consistent with Kim et al. (2025), who find that optimal weight decay is up to 30× larger than standard practice in the data-constrained regime.
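To see why weight decay 1.6 is so aggressive, consider the decay term in isolation. A minimal sketch, assuming decoupled (AdamW-style) decay and an illustrative learning rate of 0.01 (the actual schedules and the optimizer's handling of decay live in the training script):

```python
def apply_decoupled_weight_decay(param, lr, weight_decay):
    # Decoupled decay shrinks each weight toward zero every step,
    # independently of the gradient-based part of the update.
    return param * (1.0 - lr * weight_decay)

# With weight decay 1.6 and an illustrative lr of 0.01, each step keeps
# 98.4% of a weight's magnitude; compounded over 100 steps, only about
# 20% remains unless gradients actively push the weight back up.
w = 1.0
for _ in range(100):
    w = apply_decoupled_weight_decay(w, lr=0.01, weight_decay=1.6)
```

Under multi-epoch training on only 100M tokens, this constant pull toward small-norm solutions is what lets the larger model keep improving instead of overfitting.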

Given the strong performance of well-regularized large models, we speculate that larger models have a strong simplicity bias, which regularization amplifies.

Figure: overparametrization, taken from Andrew Gordon Wilson, "Deep Learning is Not So Mysterious or Different."

Why 100M tokens?

We choose 100M tokens because it is small enough to affordably try radically different learning algorithms, while large enough that the winning techniques may work at a larger scale, though the scaling behavior is an open empirical question.

Footnotes

  1. For practical purposes, we begin by providing an upper bound on time of 64 H100s for 7 days. For reference, nanogpt can be trained for 1 epoch in 30 seconds, so this amount of compute is roughly 100,000x the compute used by that baseline.

  2. Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. "Pre-training under infinite compute." arXiv:2509.14786, 2025.
