TinyGPT

Open-source transformer lab · 2.6× → 12.1× faster on WebGPU

TinyGPT trains a GPT-2 from scratch in your browser tab.

Hand-written WebGPU kernels run from 2.6× to 12.1× faster than the multi-threaded WASM SIMD baseline as d_model climbs from 96 to 256, and they are parity-tested to within 2.5% loss drift. Every modern technique (MoE, MTP, ALiBi, differential attention, distillation, DoRA, StreamingLLM) ships as a single clean Swift or TypeScript file, with a paper link next to it.
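
A parity check of this kind can be sketched as follows. This is a minimal illustration, not the project's test harness: it assumes "loss drift" means the relative difference between the baseline loss and the WebGPU loss at each training step, and the names maxRelativeLossDrift and withinParity are hypothetical.

```typescript
// Hypothetical helper: largest per-step relative gap between two loss curves.
// Assumes both curves were recorded at the same steps and that the reference
// loss is never exactly zero.
function maxRelativeLossDrift(reference: number[], candidate: number[]): number {
  if (reference.length === 0 || reference.length !== candidate.length) {
    throw new Error("loss curves must be non-empty and the same length");
  }
  let worst = 0;
  for (let i = 0; i < reference.length; i++) {
    const drift = Math.abs(candidate[i] - reference[i]) / Math.abs(reference[i]);
    if (drift > worst) {
      worst = drift;
    }
  }
  return worst;
}

// Hypothetical gate: true when the candidate stays within the tolerance.
// The default of 0.025 mirrors the 2.5% figure quoted above.
function withinParity(
  reference: number[],
  candidate: number[],
  tolerance: number = 0.025
): boolean {
  return maxRelativeLossDrift(reference, candidate) <= tolerance;
}
```

Comparing step by step rather than only at the final step catches a kernel that diverges early and happens to recover later.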

Open source under MIT. Source: github/tinygpt. Single-machine roadmap: 7 parts, 10 shipping phases.

Architecture menu

Every block is one file. The CLI accepts each as an opt-in flag (--moe-experts, --mtp-horizons, --diff-attn, --yoco, …) so configurations compose: you can train a Huge model with MoE, MTP, and differential attention enabled simultaneously, a combination validated in a single shipped run.

Attention variants

Plug-in alternatives to standard causal MHA.

Sparsity & scale

Capacity per byte of memory.

Training

Pre-train, instruction-tune, prefer.

  • Pre-train on raw text (byte-level or BPE via any HuggingFace tokenizer)
  • SFT with response-only loss masking (ChatML / Alpaca / Llama / plain)
  • DPO, SimPO, ORPO, and KTO: preference-optimisation variants in one trainer
  • Distillation — KL + NLL mix, temperature, alpha (Hinton et al., 2015)
  • Evolution Strategies — gradient-free, antithetic sampling (Salimans et al., 2017)
  • LoRA + LoRA+ + DoRA + NEFTune + grad-clip + cosine LR
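
The distillation entry in the list above names a KL + NLL mix with a temperature and an alpha. One common formulation, following Hinton et al. (2015), is sketched below for a single token position. The function names, the T² scaling of the KL term, and the choice that alpha weights the KL term (with 1 − alpha on the NLL term) are assumptions made for illustration; TinyGPT's trainer may define them differently.

```typescript
// Numerically stable log-softmax over logits divided by a temperature.
function logSoftmax(logits: number[], temperature: number = 1): number[] {
  const scaled = logits.map((x) => x / temperature);
  let maxVal = -Infinity;
  for (let i = 0; i < scaled.length; i++) {
    if (scaled[i] > maxVal) {
      maxVal = scaled[i];
    }
  }
  let sumExp = 0;
  for (let i = 0; i < scaled.length; i++) {
    sumExp += Math.exp(scaled[i] - maxVal);
  }
  const logZ = maxVal + Math.log(sumExp);
  return scaled.map((x) => x - logZ);
}

// Hypothetical single-position distillation loss:
//   alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * NLL(student, target)
function distillationLoss(
  studentLogits: number[],
  teacherLogits: number[],
  target: number,
  temperature: number,
  alpha: number
): number {
  const studentLogP = logSoftmax(studentLogits, temperature);
  const teacherLogP = logSoftmax(teacherLogits, temperature);
  let kl = 0;
  for (let i = 0; i < teacherLogP.length; i++) {
    kl += Math.exp(teacherLogP[i]) * (teacherLogP[i] - studentLogP[i]);
  }
  // The hard-label term uses the student's untempered distribution.
  const nll = -logSoftmax(studentLogits)[target];
  return alpha * temperature * temperature * kl + (1 - alpha) * nll;
}
```

With alpha = 1 the student only imitates the teacher's softened distribution; with alpha = 0 the loss reduces to ordinary next-token cross-entropy.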

Inference

Fast, small, long-context.

  • KV cache with optional fp16/bf16 quantisation
  • Prefix caching — save prompt KV state to disk, skip prefill on reload
  • Speculative decoding (greedy) — draft + verify (Leviathan et al., 2023)
  • StreamingLLM sink + window for unbounded context
  • int4 / int8 weight quantisation via MLX, plus HQQ half-quadratic solver (Badri & Shaji, 2023) and AWQ reader (Lin et al., 2023)
  • LASER selective rank reduction (Sharma et al., 2024)
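
The StreamingLLM entry in the list above keeps a few initial "sink" positions plus a sliding window of recent positions in the KV cache. The eviction rule can be sketched as below. This is an illustration only: streamingKeepIndices and its parameters are hypothetical names, and the sketch covers just the choice of which cache slots survive, not how positions are re-encoded afterwards.

```typescript
// Hypothetical helper: which KV-cache slots to keep under a sink + window policy.
// Keeps the first `sinkCount` slots and the last `windowSize` slots; everything
// in between is evicted once the cache grows past sinkCount + windowSize.
function streamingKeepIndices(
  cacheLength: number,
  sinkCount: number,
  windowSize: number
): number[] {
  if (cacheLength < 0 || sinkCount < 0 || windowSize < 0) {
    throw new Error("arguments must be non-negative");
  }
  const keep: number[] = [];
  if (cacheLength <= sinkCount + windowSize) {
    // Nothing to evict yet: keep every slot.
    for (let i = 0; i < cacheLength; i++) {
      keep.push(i);
    }
    return keep;
  }
  for (let i = 0; i < sinkCount; i++) {
    keep.push(i);
  }
  for (let i = cacheLength - windowSize; i < cacheLength; i++) {
    keep.push(i);
  }
  return keep;
}
```

Because at most sinkCount + windowSize slots survive, cache memory stays constant no matter how many tokens have been generated.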

Interpretability

What is this thing actually doing?

  • Logit lens — per-layer next-token predictions (nostalgebraist, 2020)
  • Tuned lens — trained per-layer probes (Belrose et al., 2023)
  • Attention heatmap — per-head weights, "watch the model think"
  • Per-layer ablation — zero out attn / mlp / whole block, observe
  • Activation patching — intervene on residual stream at (layer, position) (Meng et al., 2022)
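
The logit lens entry in the list above projects each layer's hidden state through the output embedding to see which token the model would predict at that depth. A minimal sketch, assuming plain arrays for one token position: logitLensTopTokens is a hypothetical name, unembed is assumed to be a vocab × d_model matrix, and the final layer norm that a full implementation would typically apply before the projection is omitted for brevity.

```typescript
// Hypothetical helper: for each layer's hidden state (one token position),
// return the index of the vocabulary entry with the highest logit.
// hiddenByLayer: [numLayers][dModel]; unembed: [vocabSize][dModel].
function logitLensTopTokens(hiddenByLayer: number[][], unembed: number[][]): number[] {
  const topTokens: number[] = [];
  for (let layer = 0; layer < hiddenByLayer.length; layer++) {
    const hidden = hiddenByLayer[layer];
    let bestToken = -1;
    let bestLogit = -Infinity;
    for (let v = 0; v < unembed.length; v++) {
      // Logit for token v is the dot product of its unembedding row with the hidden state.
      let logit = 0;
      for (let d = 0; d < hidden.length; d++) {
        logit += unembed[v][d] * hidden[d];
      }
      if (logit > bestLogit) {
        bestLogit = logit;
        bestToken = v;
      }
    }
    topTokens.push(bestToken);
  }
  return topTokens;
}
```

Reading the result from the first layer to the last shows at which depth the prediction settles on the final answer.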

Data

Where the bytes come from.

  • Any HuggingFace dataset via streaming importer (python_ref/fetch_hf_corpus.py)
  • Wikipedia article fetcher (browser-side, in the playground)
  • Project Gutenberg corpus (19 books, ~34 MB — bundled)
  • Magpie synthetic SFT data — bootstrap from a chat-tuned base (Xu et al., 2024)
  • Persistent tokenized corpus cache (skips re-tokenisation across runs)

Run it

In your browser

WebGPU · in-tab training · works on M-series Macs and modern Chrome

The browser playground builds a transformer in WebAssembly + WebGPU, trains it in a Web Worker so the UI never freezes, and lets you watch the loss curve live as the model picks up structure. Every interpretability surface above is wired into a single Sample card — benchmark, logit lens, ablation, all under one collapsible “Inspect & evaluate” panel.

Open the playground →

On your Mac

MLX-Swift · CLI + SwiftUI app · trains models the browser can’t fit

The Mac binary uses MLX-Swift to train preset sizes up to Titan (1.3B parameters) on Apple Silicon’s unified memory. The same architecture ships as a CLI (tinygpt train, sft, dpo, distill, es, laser, hqq, tuned-lens, magpie) and a SwiftUI app for point-and-click fine-tuning.

Build it from source →

Learn how it’s built

Phase-by-phase guides cover every shipped feature with paper references and reproducible commands.

Numbers

Concrete loss curves and benchmark scores live on the leaderboard. The validation report walks through an end-to-end pipeline that exercises every shipped feature on a real corpus and reports real numbers.