TinyGPT

in-browser playground

Train a small GPT from scratch, right here in this tab. It is the same kind of model behind ChatGPT, only with ~0.8 million parameters instead of something on the order of a trillion. No server, no install. Watch the loss curve fall, then ask it to write you a sentence.

Train

Corpus: picks load automatically, or paste your own text below.
Hyperparameters: the preset above sets these. Click to edit individually.

Want to go further?
4 Resources
Curated rabbit holes: 3Blue1Brown, Karpathy, papers, and this repo's own docs.

Start here (visual + intuitive)

Go deeper (code + papers)

This project's own write-ups

Related projects

5 Diagnostics & how to go faster
WebGPU matmul benchmark, the Python CLI for 10M+ models, and what makes training fast.

Making it faster — what works, in order of impact

  1. Train locally with Python (50–100× over in-browser). The python_ref/train.py path uses PyTorch with CUDA / Apple MPS and runs on every core. A 10M model trains in ~24 s / 1k steps on an M5 Pro, a comfortable iteration speed.
  2. WebGPU backend (≈3–10× expected on real hardware, not yet measured). The shaders are correct end-to-end (24/24 kernels parity-checked). Real-GPU speedup is unmeasured here because the project's CI ran on a software adapter (swiftshader); see docs/notes.md §10. Try it on your machine and watch tokens/sec.
  3. WASM SIMD (1.6× over scalar WASM). Already on if your browser supports it; the green pill at top confirms. The matmul inner loop processes four floats per instruction instead of one.
  4. Bigger batch (sub-linear). Step time grows more slowly than batch size: better cache utilisation per step and fewer kernel dispatches per token. Memory grows linearly in batch × ctx. Because the matmul cost dominates, the remaining per-step overhead is near-free up to your RAM limit.
  5. Smaller model (linear). Throughput scales roughly as 1/params for transformer training, because the matmuls grow as params × batch × ctx. Drop d_model from 96 to 48 and per-step time falls to roughly a quarter (the inner kernel cost goes as d_model²).
  6. Multi-threaded WASM (4–8×, not implemented here). Would need SharedArrayBuffer and worker threads. Status: open.
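
The d_model claim in item 5 can be checked with a back-of-the-envelope count. This is a minimal sketch, not the project's own code: it assumes a standard GPT block layout (4·d² attention weights plus 8·d² for a 4×-wide MLP) and ignores biases, layer norms, and embeddings. The function name and the n_layer value are hypothetical and chosen only for illustration.

```python
def block_matmul_params(d_model: int, n_layer: int) -> int:
    """Approximate weight count in the transformer blocks.

    Assumes a standard GPT block: Q, K, V and output projections
    (4 * d^2) plus a 4x-wide MLP (8 * d^2). Biases, layer norms and
    embeddings are ignored, so this is an estimate only.
    """
    return 12 * n_layer * d_model * d_model


# n_layer = 4 is an illustrative value, not read from the project.
big = block_matmul_params(d_model=96, n_layer=4)
small = block_matmul_params(d_model=48, n_layer=4)

# Per-step matmul work scales with params * batch * ctx, so at a fixed
# batch and context the cost ratio equals the parameter ratio.
print(big, small, big / small)
```

Under these assumptions the ratio is 4 regardless of n_layer, which matches the "quarter" figure. Embedding tables scale only linearly in d_model, so the real ratio for a whole model will be somewhat below 4.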

Full speed write-up: docs/performance.md · The journey of every lever (shipped, blocked, open, and why): the performance journey

Runs the same matmul on the WebGPU compute kernel and the WASM kernel, checks they agree, and reports the speedup. Needs Chrome / Edge 113+.
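
The shape of such a parity-and-speedup check can be sketched in plain Python. This is an illustration under stated assumptions, not the page's benchmark: two pure-Python kernels stand in for the WebGPU and WASM kernels, and every name, the matrix size, and the tolerance are hypothetical.

```python
import random
import time


def matmul_naive(a, b):
    """Reference kernel: plain triple loop over row-major lists."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out


def matmul_transposed(a, b):
    """Candidate kernel: transpose b once, then take row-by-column dot products."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def max_abs_diff(x, y):
    """Largest element-wise difference between two equally shaped matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


def benchmark(n=48, tol=1e-9, seed=0):
    """Run both kernels on the same inputs, check parity, return the speedup."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
    b = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]

    t0 = time.perf_counter()
    ref = matmul_naive(a, b)
    t1 = time.perf_counter()
    got = matmul_transposed(a, b)
    t2 = time.perf_counter()

    diff = max_abs_diff(ref, got)
    if diff > tol:
        raise AssertionError(f"kernels disagree: max abs diff {diff}")
    # Speedup of the candidate kernel relative to the reference kernel.
    return (t1 - t0) / (t2 - t1), diff


speedup, diff = benchmark()
print(f"parity ok (max abs diff {diff:.2e}), speedup {speedup:.2f}x")
```

The comparison uses a tolerance rather than exact equality because two kernels may accumulate floating-point sums in different ways. The reported speedup depends on the machine and varies from run to run.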


Train larger models locally

In-browser training runs on single-threaded WASM, which is comfortable up to ~1M params. For 5–25M+ params (a few minutes on a laptop), run the Python reference, which uses your GPU (Apple MPS / CUDA).

git clone https://github.com/sarthakagrawal927/tinygpt
cd tinygpt
python -m venv python_ref/.venv && source python_ref/.venv/bin/activate
pip install -r python_ref/requirements.txt

# measure how fast your machine trains:
python python_ref/bench.py

# train a ~10.8M model on your own text:
python python_ref/train.py --model-config configs/model.small.json \
    --data your-text.txt --out checkpoints/run

# generate from it:
python python_ref/sample.py --checkpoint checkpoints/run --prompt "Once "

Keyboard shortcuts

  ?                   Show this sheet
  Esc                 Close any popover / dialog
  ⌘ / Ctrl + Enter    Start training
  ⌘ / Ctrl + G        Generate from the model
  T                   Take the tour
  S                   Share this setup
  P                   Pause / resume training
  1–5                 Pick a size preset (Tiny → XL)

Welcome to TinyGPT

A complete transformer that trains from scratch, right here in this tab, with no server. The model has ~0.8M parameters and is byte-level. Every layer was written by hand. The first time you run it, it helps to have someone walk you through. Want a 90-second tour?