tinygpt

A transformer playground, in the browser and on your Mac.

Train an LLM from scratch in your browser via WebGPU, or run the same architecture natively on Apple Silicon via MLX. Each technique on the menu (MoE, MTP, ALiBi, differential attention, distillation, DoRA, StreamingLLM) ships as a single clean Swift or TypeScript file, with a paper link next to it.

Open source under MIT. Source: github/tinygpt. Single-machine roadmap: 7 parts, 10 shipping phases.

Architecture menu

Every block is one file. The CLI accepts each as an opt-in flag (--moe-experts, --mtp-horizons, --diff-attn, --yoco, …) so configurations compose: a Huge model can train with MoE, MTP, and differential attention enabled together, a combination validated in a single shipped run.

Attention variants

Plug-in alternatives to standard causal MHA.

Sparsity & scale

Capacity per byte of memory.

Training

Pre-train, instruction-tune, prefer.

  • Pre-train on raw text (byte-level or BPE via any HuggingFace tokenizer)
  • SFT with response-only loss masking (ChatML / Alpaca / Llama / plain)
  • Preference optimisation in one trainer: DPO, SimPO, ORPO, KTO
  • Distillation — KL + NLL mix, temperature, alpha (Hinton et al., 2015)
  • Evolution Strategies — gradient-free, antithetic sampling (Salimans et al., 2017)
  • LoRA + LoRA+ + DoRA + NEFTune + grad-clip + cosine LR
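
The distillation bullet above (KL + NLL mix, temperature, alpha) can be illustrated with a small per-token loss. This is a minimal sketch rather than tinygpt's trainer: the names logSoftmax and distillLoss are hypothetical, and it assumes alpha weights the soft KL term, (1 - alpha) weights the hard NLL term, and the KL term is scaled by T² as in Hinton et al. (2015). The project may use a different convention.

```typescript
// Numerically stable log-softmax over logits divided by a temperature.
function logSoftmax(logits: number[], temperature = 1): number[] {
  const scaled = logits.map((z) => z / temperature);
  const max = Math.max(...scaled);
  const logSumExp =
    max + Math.log(scaled.reduce((acc, z) => acc + Math.exp(z - max), 0));
  return scaled.map((z) => z - logSumExp);
}

// Distillation loss for a single token position (hypothetical helper).
// Assumption: alpha weights the KL term, (1 - alpha) the NLL term.
function distillLoss(
  studentLogits: number[],
  teacherLogits: number[],
  target: number,
  temperature: number,
  alpha: number,
): number {
  const logP = logSoftmax(teacherLogits, temperature); // softened teacher
  const logQ = logSoftmax(studentLogits, temperature); // softened student
  // KL(teacher || student) over the softened distributions.
  let kl = 0;
  for (let i = 0; i < logP.length; i++) {
    kl += Math.exp(logP[i]) * (logP[i] - logQ[i]);
  }
  // Hard-label negative log-likelihood at temperature 1.
  const nll = -logSoftmax(studentLogits)[target];
  return alpha * temperature * temperature * kl + (1 - alpha) * nll;
}
```

When the student already matches the teacher, the KL term vanishes and only the weighted NLL remains, which makes a convenient sanity check for any implementation of this mix.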

Inference

Fast, small, long-context.

  • KV cache with optional fp16/bf16 quantisation
  • Prefix caching — save prompt KV state to disk, skip prefill on reload
  • Speculative decoding (greedy) — draft + verify (Leviathan et al., 2023)
  • StreamingLLM sink + window for unbounded context
  • int4 / int8 weight quantisation via MLX, plus HQQ half-quadratic solver (Badri & Shaji, 2023) and AWQ reader (Lin et al., 2023)
  • LASER selective rank reduction (Sharma et al., 2024)
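
The draft + verify loop behind greedy speculative decoding can be sketched as follows. This is an illustration under simplifying assumptions, not tinygpt's implementation: GreedyModel, speculativeStep, toyTarget, and toyDraft are hypothetical names, a "model" is reduced to a function returning its greedy next token, and verification calls the target once per position where a real implementation would score all draft tokens in one batched forward pass.

```typescript
// A model here is any function mapping a token prefix to its greedy next token.
type GreedyModel = (prefix: number[]) => number;

// One speculative step: returns between 1 and k + 1 new tokens, always
// identical to what the target model alone would have produced greedily.
function speculativeStep(
  target: GreedyModel,
  draft: GreedyModel,
  prefix: number[],
  k: number,
): number[] {
  // 1. The cheap draft model proposes k tokens autoregressively.
  const proposed: number[] = [];
  for (let i = 0; i < k; i++) {
    proposed.push(draft([...prefix, ...proposed]));
  }
  // 2. The target verifies. Matching tokens are accepted; at the first
  //    mismatch the target's own token is emitted and the step ends.
  //    If all k match, the target contributes one bonus token.
  const accepted: number[] = [];
  for (let i = 0; i <= k; i++) {
    const want = target([...prefix, ...accepted]);
    accepted.push(want);
    if (i === k || proposed[i] !== want) break;
  }
  return accepted;
}

// Toy models for illustration: the target counts upward modulo 10,
// and the draft agrees except right after token 3.
const toyTarget: GreedyModel = (p) => (p[p.length - 1] + 1) % 10;
const toyDraft: GreedyModel = (p) =>
  p[p.length - 1] === 3 ? 0 : (p[p.length - 1] + 1) % 10;
```

Because every emitted token is the target's own greedy choice, the output is unchanged from plain greedy decoding; the gain comes only from verifying several draft tokens per target pass.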

Interpretability

What is this thing actually doing?

  • Logit lens — per-layer next-token predictions (nostalgebraist, 2020)
  • Tuned lens — trained per-layer probes (Belrose et al., 2023)
  • Attention heatmap — per-head weights, "watch the model think"
  • Per-layer ablation — zero out attn / mlp / whole block, observe
  • Activation patching — intervene on residual stream at (layer, position) (Meng et al., 2022)
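
As an illustration of the logit lens idea listed above, the sketch below projects each layer's residual-stream vector through the output embedding and reports the top token per layer. The names argmax and logitLens are hypothetical, and the sketch assumes plain number[] vectors and omits the final layer norm that is usually applied before unembedding.

```typescript
// Index of the largest value in a vector.
function argmax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// hiddenByLayer[l] is the residual-stream vector at one position after layer l.
// unembed[v] is the output-embedding row for vocabulary item v.
// Returns the top predicted token id at each layer.
function logitLens(hiddenByLayer: number[][], unembed: number[][]): number[] {
  return hiddenByLayer.map((hidden) => {
    // Logit for each vocabulary item is a dot product with the hidden state.
    const logits = unembed.map((row) =>
      row.reduce((acc, w, i) => acc + w * hidden[i], 0),
    );
    return argmax(logits);
  });
}
```

Reading the returned ids from first layer to last shows at which depth the model's guess settles on its final prediction.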

Data

Where the bytes come from.

  • Any HuggingFace dataset via streaming importer (python_ref/fetch_hf_corpus.py)
  • Wikipedia article fetcher (browser-side, in the playground)
  • Project Gutenberg corpus (19 books, ~34 MB — bundled)
  • Magpie synthetic SFT data — bootstrap from a chat-tuned model (Xu et al., 2024)
  • Persistent tokenized corpus cache (skips re-tokenisation across runs)
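
The tokenized corpus cache can be pictured as a lookup keyed by tokenizer and corpus content. The sketch below is an assumption about the general shape, not the project's code: fnv1a and makeTokenCache are hypothetical names, an in-memory Map stands in for files on disk, and a real cache would likely use a stronger hash than 32-bit FNV-1a.

```typescript
// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Hypothetical cache: tokenizes a corpus once per (tokenizer, content) pair.
function makeTokenCache(
  tokenizerId: string,
  tokenize: (text: string) => number[],
) {
  const store = new Map<string, number[]>(); // stand-in for on-disk storage
  let misses = 0;
  return {
    get(corpus: string): number[] {
      // The key changes when either the tokenizer or the corpus changes.
      const key = `${tokenizerId}:${corpus.length}:${fnv1a(corpus)}`;
      let tokens = store.get(key);
      if (tokens === undefined) {
        misses++;
        tokens = tokenize(corpus);
        store.set(key, tokens);
      }
      return tokens;
    },
    // Number of times the tokenizer actually ran.
    misses: () => misses,
  };
}
```

Including the tokenizer identifier in the key matters: the same text tokenized by a different vocabulary must not reuse stale token ids.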

Run it

In your browser

WebGPU · in-tab training · works on M-series Macs and modern Chrome

The browser playground builds a transformer in WebAssembly + WebGPU, trains it in a Web Worker so the UI never freezes, and lets you watch the loss curve live as the model picks up structure. Every interpretability surface above is wired into a single Sample card — benchmark, logit lens, ablation, all under one collapsible “Inspect & evaluate” panel.

Open the playground →

On your Mac

MLX-Swift · CLI + SwiftUI app · trains models the browser can’t fit

The Mac binary uses MLX-Swift to train preset sizes up to Titan (1.3B parameters) on Apple Silicon’s unified memory. The same architecture ships as a CLI (tinygpt train, sft, dpo, distill, es, laser, hqq, tuned-lens, magpie) and a SwiftUI app for point-and-click fine-tuning.

Build it from source →

Learn how it’s built

Phase-by-phase guides cover every shipped feature with paper references and reproducible commands.

Numbers

Concrete loss curves and benchmark scores live on the leaderboard. The validation report walks through an end-to-end pipeline that exercises every shipped feature on a real corpus and reports the measured numbers.