tinygpt

A transformer playground, in the browser and on your Mac.

Train an LLM from scratch in your browser via WebGPU, or run the same architecture natively on Apple Silicon via MLX. Every modern technique — MoE, MTP, ALiBi, differential attention, distillation, DoRA, StreamingLLM — ships as a single clean Swift or TypeScript file, with a paper link next to it.

Train in your browser: WebGPU · works on M-series Macs + modern Chrome
Run on your Mac: CLI + SwiftUI app · MLX-Swift, Apple-native

Open source under MIT. Source: github/tinygpt. Single-machine roadmap: 7 parts, 10 shipping phases.

Architecture menu

Every block is one file. The CLI accepts each as an opt-in flag (--moe-experts, --mtp-horizons, --diff-attn, --yoco, …) so configurations compose — you can train a Huge model with MoE and MTP and differential attention simultaneously, validated in a single shipped run.

Attention variants

Plug-in alternatives to standard causal MHA.

RoPE + GQA — HF-Llama-compatible baseline (RoPE, GQA)
ALiBi — linear position bias (Press et al., 2021)
Sliding window — Mistral / GPT-OSS recipe
Differential attention — subtract two softmaxes (Ye et al., 2024)
YOCO — cross-layer KV sharing (Sun et al., 2024)
StreamingLLM — attention sink for infinite context (Xiao et al., 2024)
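
To make the "linear position bias" in the ALiBi entry concrete, here is a minimal sketch of how the additive bias could be built. It assumes a power-of-two head count, for which Press et al. use the geometric slope sequence 2^(-8/n), 2^(-16/n), and so on. The function names are hypothetical and the code is not taken from the tinygpt source.

```typescript
// Sketch only: names are illustrative, not tinygpt's API.
// ALiBi slopes for a power-of-two head count n: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
function alibiSlopes(numHeads: number): number[] {
  return Array.from({ length: numHeads }, (_, h) =>
    Math.pow(2, (-8 * (h + 1)) / numHeads)
  );
}

// Additive attention bias for one head, added to the raw scores before softmax.
// Row i is the query position, column j is the key position.
// The bias is zero on the diagonal, falls linearly with distance i - j,
// and is -Infinity above the diagonal so attention stays causal.
function alibiBias(seqLen: number, slope: number): number[][] {
  return Array.from({ length: seqLen }, (_, i) =>
    Array.from({ length: seqLen }, (_, j) =>
      j > i ? -Infinity : -slope * (i - j)
    )
  );
}
```

Because the bias depends only on the distance between query and key, no learned position embedding is needed, which is what lets ALiBi extrapolate to sequences longer than those seen in training.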

Sparsity & scale

Capacity per byte of memory.

MoE — router + experts + load-balance loss (Switch Transformer, Mixtral)
Multi-Token Prediction — H output heads (Gloeckle et al., 2024, used by DeepSeek-V3)
Mixture of Depths — per-token sigmoid gate per block (Raposo et al., 2024)
See docs/moe.md for save/load + the scatter_add upstream gap that limits sparse compute today.
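
The "router + experts + load-balance loss" recipe can be sketched for a single batch of tokens as follows. This is a hedged illustration, not tinygpt's implementation: the names routeToken and loadBalanceLoss are hypothetical, the gate weights are assumed to be renormalised over the selected experts, and the auxiliary loss follows the Switch Transformer form (number of experts times the sum over experts of dispatch fraction times mean router probability), extended here to top-k by counting every assignment.

```typescript
// Sketch only: hypothetical names, plain arrays instead of tensors.
function softmax(xs: number[]): number[] {
  const m = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

interface Routing {
  experts: number[]; // indices of the k selected experts
  weights: number[]; // gate weights over the selected experts, summing to 1
  probs: number[];   // full router distribution, kept for the auxiliary loss
}

// Pick the top-k experts for one token from its router logits.
function routeToken(logits: number[], k: number): Routing {
  const probs = softmax(logits);
  const experts = probs
    .map((_, i) => i)
    .sort((a, b) => probs[b] - probs[a])
    .slice(0, k);
  const kept = experts.map((e) => probs[e]);
  const norm = kept.reduce((a, b) => a + b, 0);
  return { experts, weights: kept.map((p) => p / norm), probs };
}

// Switch-style auxiliary loss: E * sum_e (dispatch fraction_e * mean prob_e).
function loadBalanceLoss(routings: Routing[], numExperts: number): number {
  const counts = new Array<number>(numExperts).fill(0);
  const meanProb = new Array<number>(numExperts).fill(0);
  let assignments = 0;
  for (const r of routings) {
    for (const e of r.experts) {
      counts[e] += 1;
      assignments += 1;
    }
    r.probs.forEach((p, e) => {
      meanProb[e] += p / routings.length;
    });
  }
  return (
    numExperts *
    counts.reduce((acc, c, e) => acc + (c / assignments) * meanProb[e], 0)
  );
}
```

The loss equals 1 when the router distribution is uniform and grows as the router concentrates both probability mass and token assignments on the same few experts, which is the collapse it is meant to discourage.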

Training

Pre-train, instruction-tune, preference-tune.

Pre-train on raw text (byte-level or BPE via any HuggingFace tokenizer)
SFT with response-only loss masking (ChatML / Alpaca / Llama / plain)
DPO + SimPO / ORPO / KTO — preference variants in one trainer (DPO, SimPO, ORPO, KTO)
Distillation — KL + NLL mix, temperature, alpha (Hinton et al., 2015)
Evolution Strategies — gradient-free, antithetic sampling (Salimans et al., 2017)
LoRA + LoRA+ + DoRA + NEFTune + grad-clip + cosine LR
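
The distillation entry's "KL + NLL mix, temperature, alpha" can be illustrated for a single token position. The sketch below assumes one common convention: alpha weights the temperature-softened KL term (scaled by T squared, as in Hinton et al., 2015) and 1 - alpha weights the ordinary next-token NLL. Which term alpha multiplies in tinygpt is an assumption here, and distillLoss is a hypothetical name.

```typescript
// Sketch only: one token position, plain arrays, hypothetical names.
function logSoftmax(logits: number[], temperature = 1): number[] {
  const scaled = logits.map((x) => x / temperature);
  const m = Math.max(...scaled);
  const logZ =
    m + Math.log(scaled.reduce((acc, x) => acc + Math.exp(x - m), 0));
  return scaled.map((x) => x - logZ);
}

// loss = alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * NLL(student, target)
function distillLoss(
  studentLogits: number[],
  teacherLogits: number[],
  target: number,
  temperature: number,
  alpha: number
): number {
  const logPs = logSoftmax(studentLogits, temperature);
  const logPt = logSoftmax(teacherLogits, temperature);
  // KL divergence from the softened teacher to the softened student.
  const kl = logPt.reduce(
    (acc, lt, i) => acc + Math.exp(lt) * (lt - logPs[i]),
    0
  );
  // Hard-label term uses the student's unsoftened distribution.
  const nll = -logSoftmax(studentLogits)[target];
  return alpha * temperature * temperature * kl + (1 - alpha) * nll;
}
```

The T squared factor keeps the gradient magnitude of the soft term roughly comparable across temperatures, so alpha retains the same meaning when the temperature is changed.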

Inference

Fast, small, long-context.

KV cache with optional fp16/bf16 quantisation
Prefix caching — save prompt KV state to disk, skip prefill on reload
Speculative decoding (greedy) — draft + verify (Leviathan et al., 2023)
StreamingLLM sink + window for unbounded context
int4 / int8 weight quantisation via MLX, plus HQQ half-quadratic solver (Badri & Shaji, 2023) and AWQ reader (Lin et al., 2023)
LASER selective rank reduction (Sharma et al., 2024)
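
The greedy "draft + verify" loop behind speculative decoding can be sketched as one step over token IDs. This is an illustration under stated assumptions, not tinygpt's code: NextToken and speculativeStep are hypothetical names, each model is reduced to a function returning its greedy next token, and verification is written as sequential calls for readability, whereas a real implementation would score all proposed positions in a single batched forward pass of the target model.

```typescript
// Sketch only: models are stand-in functions from a context to a greedy next token.
type NextToken = (context: number[]) => number;

// One speculative step: returns the tokens to append to the context.
function speculativeStep(
  context: number[],
  draft: NextToken,
  target: NextToken,
  k: number
): number[] {
  // 1. The cheap draft model proposes k tokens autoregressively.
  const proposed: number[] = [];
  for (let i = 0; i < k; i++) {
    proposed.push(draft([...context, ...proposed]));
  }

  // 2. The target model checks each proposal against its own greedy choice.
  const accepted: number[] = [];
  for (const tok of proposed) {
    const want = target([...context, ...accepted]);
    if (want !== tok) {
      // First mismatch: keep the target's token and discard the rest.
      accepted.push(want);
      return accepted;
    }
    accepted.push(tok);
  }

  // 3. Every proposal matched, so the target's next token is a bonus.
  accepted.push(target([...context, ...accepted]));
  return accepted;
}
```

Each step emits between 1 and k + 1 tokens, and the output is identical to what greedy decoding with the target model alone would produce; the draft model only changes how many target passes are needed.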

Interpretability

What is this thing actually doing?

Logit lens — per-layer next-token predictions (Nostalgebraist, 2020)
Tuned lens — trained per-layer probes (Belrose et al., 2023)
Attention heatmap — per-head weights, "watch the model think"
Per-layer ablation — zero out attn / mlp / whole block, observe
Activation patching — intervene on residual stream at (layer, position) (Meng et al., 2022)
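
As an illustration of the logit lens entry, the sketch below projects each layer's residual-stream vector through the unembedding matrix and reports the token the model would predict if it stopped at that layer. It is a simplified sketch: logitLens is a hypothetical name, the inputs are plain arrays, and the final normalisation layer that is usually applied before unembedding is omitted for brevity.

```typescript
// Sketch only: hypothetical name, no final-norm step, plain arrays for tensors.
// residuals: one hidden vector per layer, each of length d.
// unembed: vocab x d matrix mapping a hidden vector to vocabulary logits.
function logitLens(residuals: number[][], unembed: number[][]): number[] {
  return residuals.map((h) => {
    // Logit for each vocabulary entry is the dot product of its row with h.
    const logits = unembed.map((row) =>
      row.reduce((acc, w, i) => acc + w * h[i], 0)
    );
    // Report the arg-max token ID for this layer.
    return logits.indexOf(Math.max(...logits));
  });
}
```

Reading the returned IDs from first layer to last shows at which depth the prediction settles on the final answer.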

Data

Where the bytes come from.

Any HuggingFace dataset via streaming importer (python_ref/fetch_hf_corpus.py)
Wikipedia article fetcher (browser-side, in the playground)
Project Gutenberg corpus (19 books, ~34 MB — bundled)
Magpie synthetic SFT data — bootstrap from a chat-tuned base (Xu et al., 2024)
Persistent tokenized corpus cache (skips re-tokenisation across runs)

Run it

In your browser

WebGPU · in-tab training · works on M-series Macs and modern Chrome

The browser playground builds a transformer in WebAssembly + WebGPU, trains it in a Web Worker so the UI never freezes, and lets you watch the loss curve live as the model picks up structure. Every interpretability surface above is wired into a single Sample card — benchmark, logit lens, ablation, all under one collapsible “Inspect & evaluate” panel.

Open the playground →

On your Mac

MLX-Swift · CLI + SwiftUI app · trains models the browser can’t fit

The Mac binary uses MLX-Swift to train preset sizes up to Titan (1.3B parameters) on Apple Silicon’s unified memory. The same architecture ships as a CLI (tinygpt train, sft, dpo, distill, es, laser, hqq, tuned-lens, magpie) and a SwiftUI app for point-and-click fine-tuning.

Build it from source →

Learn how it’s built

Phase-by-phase guides cover every shipped feature with paper references and reproducible commands.

The model from scratch: byte-level GPT, attention, MLP, LayerNorm, explained in code
Three phases of training: pretrain → SFT → DPO, with commands you can paste
Knowledge distillation: teacher → student with the KL+NLL mix loss
Mixture-of-Experts: router, experts, load balance, save/load, and the upstream gap that blocks real sparse compute
Multi-Token Prediction: training-only extra heads; checkpoints stay drop-in compatible
Evolution Strategies: gradient-free training (Salimans et al., 2017)
Interpretability tools: logit lens, tuned lens, ablation, activation patching
Memory tradeoffs: bf16, gradient accumulation, sliding window, and what fits in 48 GB
Validation report: end-to-end runs with actual loss numbers and the bugs the validation caught
Phase 9 + 10 status: quantisation + architecture menu (what shipped, what’s queued)
Single-machine roadmap: every technique that runs on one Mac, ROI-ranked
Leaderboard guide: how the three launch benchmarks are scored + how to submit

Numbers

Concrete loss curves and benchmark scores live on the leaderboard. The validation report walks through an end-to-end pipeline that exercises every shipped feature on a real corpus and reports the actual numbers.

Phase 1 validation, Huge + DiffAttn + MoD on FineWeb-Edu: loss 11.22 → 6.46 over 500 steps (val 6.49, no overfit, no NaN; commit 6cbe693)
MoE distillation: tiny MoE student (4 experts × top-2) distilled from the above teacher; loss 1.93 → 0.21 in 30 steps; full MoE structure round-tripped through save/load/sample
Tuned-lens probes: 4 layer probes trained on a frozen base, sidecar file format verified
Two real bugs caught by the autonomous validation loop (tuned-lens crash, Astro build break) — commits a64de95 + 9877bb7. Static analysis passed; runtime didn’t.