TinyGPT
Open-source transformer lab · 2.6× → 12.1× faster on WebGPU
TinyGPT trains a GPT-2 from scratch in your browser tab.
Hand-written WebGPU kernels run 2.6× → 12.1× faster
than the multi-thread WASM SIMD baseline as d_model climbs
from 96 to 256. Parity-tested to within 2.5% loss drift. Every modern
technique — MoE, MTP, ALiBi,
differential attention, distillation, DoRA,
StreamingLLM — ships as a single clean Swift or
TypeScript file, with a paper link next to it.
Open source under MIT. Source: github/tinygpt.
Single-machine roadmap: 7 parts, 10 shipping phases.
Architecture menu
Every block is one file. The CLI accepts each as an opt-in flag
(--moe-experts, --mtp-horizons,
--diff-attn, --yoco, …) so
configurations compose — you can train a Huge model with
MoE and MTP and differential attention
simultaneously, validated in a single shipped run.
- Pre-train on raw text (byte-level or BPE via any HuggingFace tokenizer)
- SFT with response-only loss masking (ChatML / Alpaca / Llama / plain)
- DPO, SimPO, ORPO, and KTO: preference-optimisation variants in one trainer
- Distillation — KL + NLL mix, temperature, alpha (Hinton et al., 2015)
- Evolution Strategies — gradient-free, antithetic sampling (Salimans et al., 2017)
- LoRA + LoRA+ + DoRA + NEFTune + grad-clip + cosine LR
- KV cache with optional fp16/bf16 quantisation
- Prefix caching — save prompt KV state to disk, skip prefill on reload
- Speculative decoding (greedy) — draft + verify (Leviathan et al., 2023)
- StreamingLLM sink + window for unbounded context
- int4 / int8 weight quantisation via MLX, plus HQQ half-quadratic solver (Badri & Shaji, 2023) and AWQ reader (Lin et al., 2023)
- LASER selective rank reduction (Sharma et al., 2024)
- Logit lens — per-layer next-token predictions (Nostalgebraist, 2020)
- Tuned lens — trained per-layer probes (Belrose et al., 2023)
- Attention heatmap — per-head weights, "watch the model think"
- Per-layer ablation — zero out attn / mlp / whole block, observe
- Activation patching — intervene on residual stream at (layer, position) (Meng et al., 2022)
- Any HuggingFace dataset via streaming importer (python_ref/fetch_hf_corpus.py)
- Wikipedia article fetcher (browser-side, in the playground)
- Project Gutenberg corpus (19 books, ~34 MB — bundled)
- Magpie synthetic SFT data — bootstrap from a chat-tuned base (Xu et al., 2024)
- Persistent tokenized corpus cache (skips re-tokenisation across runs)
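The distillation entry in the list above names a KL + NLL mix with a temperature and an alpha. As a minimal sketch of how such a loss is commonly assembled for a single token position (the function names, the default values, the T² scaling, and the convention that alpha weights the KL term are all assumptions for illustration, not TinyGPT's actual code):

```typescript
// Numerically stable log-softmax over one row of logits, with optional temperature.
function logSoftmax(logits: number[], temperature = 1): number[] {
  const scaled = logits.map((z) => z / temperature);
  const max = Math.max(...scaled);
  const logSumExp =
    max + Math.log(scaled.reduce((acc, z) => acc + Math.exp(z - max), 0));
  return scaled.map((z) => z - logSumExp);
}

// KL(teacher || student) on temperature-softened distributions.
// Multiplying by T^2 keeps gradient magnitudes comparable across temperatures,
// as suggested in Hinton et al. (2015).
function softTargetKL(
  studentLogits: number[],
  teacherLogits: number[],
  temperature: number,
): number {
  const logP = logSoftmax(teacherLogits, temperature);
  const logQ = logSoftmax(studentLogits, temperature);
  let kl = 0;
  for (let i = 0; i < logP.length; i++) {
    kl += Math.exp(logP[i]) * (logP[i] - logQ[i]);
  }
  return kl * temperature * temperature;
}

// Hard-label negative log-likelihood, computed at temperature 1.
function nll(studentLogits: number[], target: number): number {
  return -logSoftmax(studentLogits)[target];
}

// Assumed convention: alpha weights the soft (KL) term, 1 - alpha the hard (NLL) term.
function distillLoss(
  studentLogits: number[],
  teacherLogits: number[],
  target: number,
  temperature = 2,
  alpha = 0.5,
): number {
  return (
    alpha * softTargetKL(studentLogits, teacherLogits, temperature) +
    (1 - alpha) * nll(studentLogits, target)
  );
}
```

When student and teacher logits agree, the KL term vanishes and only the weighted NLL remains, which makes a convenient sanity check for any real implementation.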
Run it
The browser playground builds a transformer in WebAssembly + WebGPU, trains
it in a Web Worker so the UI never freezes, and lets you watch the loss curve
live as the model picks up structure. Every interpretability surface above is
wired into a single Sample card — benchmark, logit lens, ablation, all
under one collapsible “Inspect & evaluate” panel.
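To illustrate the Web Worker arrangement described above, here is a rough sketch of how the UI thread might fold progress messages from a training worker into a loss curve. The message shapes, `reduceCurve`, and the worker file name in the comment are hypothetical and are not taken from the TinyGPT source.

```typescript
// Hypothetical messages a training worker could post back to the UI thread.
type WorkerMessage =
  | { kind: "step"; step: number; loss: number }
  | { kind: "done"; finalLoss: number };

interface CurveState {
  points: Array<{ step: number; loss: number }>;
  finished: boolean;
}

// Pure reducer: the UI thread only appends points, so each message is cheap
// to handle and the heavy training loop stays inside the worker.
function reduceCurve(state: CurveState, msg: WorkerMessage): CurveState {
  switch (msg.kind) {
    case "step":
      return {
        ...state,
        points: [...state.points, { step: msg.step, loss: msg.loss }],
      };
    case "done":
      return { ...state, finished: true };
  }
}

// Browser wiring (commented out so the sketch also runs outside a browser):
// const worker = new Worker(new URL("./train.worker.ts", import.meta.url), { type: "module" });
// let state: CurveState = { points: [], finished: false };
// worker.onmessage = (e: MessageEvent<WorkerMessage>) => {
//   state = reduceCurve(state, e.data);
//   // redraw the loss chart from state.points here
// };
```

Keeping the reducer free of DOM and worker references means it can be unit-tested without a browser.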
Open the playground →
The Mac binary uses MLX-Swift to train preset sizes up to Titan (1.3B
parameters) on Apple Silicon’s unified memory. The same architecture
ships as a CLI (tinygpt train, sft, dpo,
distill, es, laser, hqq,
tuned-lens, magpie) and a SwiftUI app for
point-and-click fine-tuning.
Build it from source →
Learn how it’s built
Phase-by-phase guides cover every shipped feature with paper
references and reproducible commands.
- The model from scratch: byte-level GPT, attention, MLP, LayerNorm, explained in code
- Three phases of training: pretrain → SFT → DPO, with commands you can paste
- Knowledge distillation: teacher → student with the KL+NLL mix loss
- Mixture-of-Experts: router, experts, load balance, save/load, and the upstream gap that blocks real sparse compute
- Multi-Token Prediction: training-only extra heads; checkpoints stay drop-in compatible
- Evolution Strategies: gradient-free training (Salimans et al., 2017)
- Interpretability tools: logit lens, tuned lens, ablation, activation patching
- Memory tradeoffs: bf16, gradient accumulation, sliding window, and what fits in 48 GB
- Validation report: end-to-end runs with actual loss numbers and the bugs the validation caught
- Phase 9 + 10 status: quantisation + architecture menu (what shipped, what’s queued)
- Single-machine roadmap: every technique that runs on one Mac, ROI-ranked
- Leaderboard guide: how the three launch benchmarks are scored + how to submit
Numbers
Concrete loss curves and benchmark scores live on the
leaderboard. The
validation report walks
through an end-to-end pipeline that exercises every shipped
feature with real corpus + real numbers.
- Phase 1 validation, Huge + DiffAttn + MoD on FineWeb-Edu: loss 11.22 → 6.46 over 500 steps (val 6.49, no overfit, no NaN; commit 6cbe693)
- MoE distillation: tiny MoE student (4 experts × top-2) distilled from the above teacher; loss 1.93 → 0.21 in 30 steps; full MoE structure round-tripped through save/load/sample
- Tuned-lens probes: 4 layer probes trained on a frozen base, sidecar file format verified
- Two real bugs caught by the autonomous validation loop (tuned-lens crash, Astro build break), fixed in commits a64de95 + 9877bb7. Static analysis passed; runtime didn’t.
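The MoE student above uses 4 experts with top-2 routing. A minimal sketch of top-k routing for one token follows. It assumes softmax gating with renormalisation over the selected experts, which is one common choice; the source does not say which variant TinyGPT uses, and the names `routeTopK` and `moeForward` are hypothetical.

```typescript
interface Routed {
  expert: number; // index of the chosen expert
  weight: number; // gate weight after renormalising over the chosen experts
}

function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pick the k most probable experts and rescale their weights to sum to 1.
function routeTopK(routerLogits: number[], k = 2): Routed[] {
  const top = softmax(routerLogits)
    .map((p, expert) => ({ expert, weight: p }))
    .sort((a, b) => b.weight - a.weight)
    .slice(0, k);
  const norm = top.reduce((acc, r) => acc + r.weight, 0);
  return top.map((r) => ({ expert: r.expert, weight: r.weight / norm }));
}

// Weighted sum of the selected experts' outputs for one token vector.
// Each expert is assumed to map a vector to a vector of the same length.
function moeForward(
  x: number[],
  routerLogits: number[],
  experts: Array<(x: number[]) => number[]>,
  k = 2,
): number[] {
  const out = new Array<number>(x.length).fill(0);
  for (const { expert, weight } of routeTopK(routerLogits, k)) {
    const y = experts[expert](x);
    for (let i = 0; i < out.length; i++) out[i] += weight * y[i];
  }
  return out;
}
```

This sketch calls only the k selected experts per token, which is where sparse-compute savings would come from in principle; whether a given backend realises them depends on its kernel support.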