An open-source LLM factory for one Mac.
Build routed specialists that earn their keep —
on a laptop you already own.
TinyGPT is a Swift codebase + 30+ CLI subcommands for the full
post‑training stack on Apple Silicon: train, distill from a local
teacher, fine‑tune with LoRA / DoRA / QLoRA, gate with real evals
(BFCL, τ‑bench, lm‑eval), serve OpenAI‑compatible,
and open the model up with sparse autoencoders, MEMIT, activation
patching. Same model also trains and runs in a browser tab via
hand‑written WebGPU kernels. The first packaged specialist is
honest by design: frontier‑level on file‑ops, routed only because
it regresses outside that domain.
frontier gate 100%
file ops route armed
oos guard caveated
qwen3-4b-file-ops-distilled HF artifact -> MLX adapter -> OpenAI-compatible socket
7.5 GB 4B params BFCL gated MIT code
distill eval serve route-only
First package: qwen3‑4B file‑ops specialist · 7.5 GB local HF/MLX artifact ·
routed specialist, not a general planner.
The strongest measured claim
A 4B routed file‑ops specialist, locally distilled, matches the
frontier on its multi‑turn hard gate.
BFCL multi‑turn hard gate, n=12 GorillaFileSystem agentic tasks,
task‑completion rate. Distillation: ~99 frontier trajectories ×
LoRA SFT, single Mac, no cloud train. Cheap routed specialist > bigger general
on one domain is the project thesis — not "4B > 12B in general."
The same file‑ops specialist regresses on out‑of‑domain breadth
(60 → 42%), so TinyGPT packages it with the caveat instead of
pretending it is a general planner.
Full writeup, including the honest tradeoffs →
Measured, on one Mac.
Numbers below are from a stock M5 Pro / 48 GB. Reproducible via
tinygpt bench and the linked artefacts. Where a number
applies only to a specific preset or build, the preset is named.
- Decode throughput
- 696tok/s
Huge preset, 221M params. 293 tok/s on the 960M Mega pilot.
- TTFT warm
- 5.8msp99
Cold start: 24 ms on a 1B model. ITL p99: 4.9 ms.
- Training step
- 42ms
Huge preset, 17.2× the same model in‑browser via WebGPU.
- WebGPU vs WASM SIMD
- 12.1×
At d_model=256. Scales 2.6×→12.1× as d_model grows 96→256.
- ANE M8 decode
- 17tok/s
28‑block Qwen3 chain on the Apple Neural Engine, layer‑chunked Core ML.
- Largest fit
- 960M
Params trainable end‑to‑end on unified memory; 473M in a browser tab via Memory64.
- Spec‑decode speedup
- 1.4×
serve ‑‑draft‑model, Qwen3 0.6B draft → 4B target. Lossless greedy; content‑dependent, 1.0–1.4×.
The cheapest token is the one you don’t rent.
Cloud serverless $0.20/ 1M tokens A 4B model on Fireworks serverless (4B–16B tier). Every prompt and completion leaves the device.
The same model, on your Mac $0marginal After a one‑time download: no metering, runs offline, and the data never leaves the laptop.
Cloud rate: Fireworks serverless pricing, 4B–16B tier, June 2026. Local marginal cost excludes electricity.
The loop the runtime closes.
Each surface emits the input the next one needs. The agent runtime
records token‑preserving trajectories; those trajectories become
SFT data; that data trains the next specialist; the eval gate decides
whether it ships.
- 01 Data tinygpt download‑dataset 22 curated sets. HF streaming, Magpie synth, GitHub issue→PR, MS‑MARCO, the‑stack‑smol.
- 02 Distill tinygpt distill KL + NLL teacher→student. Local teacher (Qwen3, Gemma, DeepSeek via API or Codex CLI). Rejection‑sample on a checker.
- 03 Train / SFT / DPO tinygpt sft · dpo · es Full PEFT bundle: LoRA, LoRA+, DoRA, VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA. DPO/SimPO/KTO/ORPO in one trainer.
- 04 Eval gate tinygpt eval‑gate BFCL, τ‑bench, lm‑eval (MLX adapter), HumanEval+sandbox, judge shim. Exit‑non‑zero on regression.
- 05 Serve tinygpt serve OpenAI + Ollama surfaces on the same socket. Continue.dev / any OpenAI client plugs in unchanged.
- 06 Agent tinygpt agent Multi‑turn, tool dispatch, persistent KV, FSM‑constrained JSON, optional
‑‑cloud‑escalate. - 07 Traces → data tinygpt traces‑to‑data Every rollout is a token‑preserving
.atraj. Filter, dedupe (MinHash), emit ChatML SFT JSONL. Then loop to step 02.
What ships, audited against the code.
PLAN.md is the canonical shipped/skipped/TODO ledger. Each pillar
below is one subset; clicking through gets you to specific files
and the papers each technique cites.
Train Every modern pre/post‑train, one trainer.
- Pretrain — byte‑level or BPE; WSD schedule, gradient checkpointing, spike recovery
- SFT — response‑only masking, ChatML / Alpaca / Llama / plain
- Preference — DPO, SimPO, KTO, ORPO (one trainer, flags)
- PEFT — LoRA, LoRA+, DoRA (TGLA v2 on disk), VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA, LoRA‑FA
- Optimisers — AdamW, Lion, Sophia, Muon, Adafactor, GaLore
- Architecture — RoPE+GQA, sliding window, ALiBi, MoE, MoD, differential attn, YOCO, MTP
Distill Frontier teacher → small student.
- KL + NLL mix loss with temperature and α (Hinton 2015)
- Rejection‑sampled trajectories — keep only what a checker passes
- Gold‑clone fallback — on verifiable tasks the gold IS the trajectory; teacher‑free reproduces the same model
- Trace replay — render in the student's own chat template via
render_sft_from_traj.py - Headline result — 4B at frontier‑parity, documented honestly
- Shared schema — every eval emits the same JSONL row (E0)
- tinygpt eval‑bfcl — 10 BFCL categories, OpenAI‑compat shim into
tinygpt serve - tinygpt eval‑tau‑bench — retail + airline, configurable user simulator
- tinygpt run‑lm‑eval — MLX adapter for
lm‑evaluation‑harness, two modes - tinygpt eval‑humaneval — Rust +
sandbox‑exec code execution - tinygpt eval‑gate — CI‑grade regression gate, exits non‑zero
Serve · Agent OpenAI‑compatible. Locally.
- tinygpt serve — OpenAI and Ollama on the same socket; Continue.dev provider compat
- tinygpt agent — multi‑turn loop, tool dispatch, persistent KV, FSM JSON
- ‑‑cloud‑escalate — defer to Anthropic / OpenAI only when the local model wants to
- Speculative decoding in
serve — ‑‑draft‑model draft+verify, ~1.4× decode, lossless greedy; vanilla/Medusa/EAGLE‑2 heads in the CLI - StreamingLLM sink, KV‑cache quant (KIVI), prefix caching
- Token‑preserving traces —
.atraj rollouts feed next round of SFT
Interp Open the model up.
- Logit lens · tuned lens (trainable per‑layer probes)
- Activation patching — zero + donor‑swap, on (layer, position)
- Per‑layer ablation · attention heatmap
- Linear probes — trainable per‑layer classifiers,
.lp sidecar - ROME · MEMIT — rank‑1 and rank‑K fact editing, multi‑layer with key‑norm weighting
- Sparse autoencoders —
tinygpt sae, group‑SAE, SAELens / Neuronpedia export
In a browser tab The same model, no install.
Hand‑written WGSL kernels train a GPT‑2 in your tab.
Blocked 4×4 matmul (5.18× kernel speedup at 2048³), FA2
forward + backward in WGSL, Memory64 lifts the 4 GB tab
heap so a 473M‑param model allocates cleanly.
End‑to‑end parity vs. the WASM reference: ≤ 2.5%
loss drift.
Open the playground → · See the speedup curve →
Run it.
Two paths. The Mac path is the primary one; the browser path is for
when you want everything in‑tab, no toolchain, no install.
# build the CLI
cd native‑mac
export DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer
xcodebuild -scheme tinygpt -destination 'platform=macOS,arch=arm64' \
-derivedDataPath .xcode-build build
# one‑command quickstart: data → specialist → chat
tinygpt quickstart --data my.jsonl
# or distill from a local teacher
tinygpt distill --teacher qwen3-4b --student huge \
--data ./traces --out ./student.tinygpt
# then serve it OpenAI‑compatible
tinygpt serve --model ./student.tinygpt --port 8080
Full build instructions →
The playground builds a transformer in WebAssembly + WebGPU,
trains it in a Web Worker so the UI never freezes, and lets you
watch the loss curve live as the model picks up structure.
Every interpretability surface above is wired in —
attention heatmap, logit lens, ablation, patching — under
one "Inspect & evaluate" panel.
- Parity‑tested against the WASM reference to ≤ 2.5% loss drift
- OPFS checkpoint persistence — a run survives a tab refresh
- Capability detection picks the right preset for your machine
Open the playground →
Honest scope.
What this is
- A single‑developer project, shipping in public, MIT.
- Mac‑first — M‑series, unified memory, MLX‑Swift.
- A factory for specialists, not a general assistant.
- An OpenAI‑compatible runtime any client already speaks.
- A research substrate for interp, eval, and distillation.
What this isn’t (yet)
-
A general‑skill specialist. The frontier‑parity 4B
dropped on out‑of‑domain BFCL (60 → 42%) —
real catastrophic forgetting. Mixed‑backend distillation
is the documented fix.
- Multi‑GPU or distributed training. One device, single Mac.
- A cloud product. Nothing leaves the laptop unless
‑‑cloud‑escalate is set. - An enterprise platform. No SSO, no tenancy, no SLA.
- A finished story. The roadmap shows what’s shipped and what’s next.
Read the journey.
Decision logs and per‑technique explainers; each ties back to
the paper it cites and the file where it lives.
Tool‑calling: frontier‑parity at 4B The headline result, the distillation recipe, and the honest tradeoffs PLAN.md — shipped · skipped · TODO The canonical ledger, audited against the code Knowledge distillation KL + NLL mix loss, temperature, α, rejection sampling Three phases of training Pretrain → SFT → DPO with paste‑ready commands Interpretability tools Logit lens, tuned lens, ablation, patching, SAE, ROME, MEMIT Agent runtime Multi‑turn loop, tool dispatch, traces, cloud escalation Eval leaderboard viewer Drag‑drop a JSONL; compare by step / model / task Lessons from the build The bugs and surprises that were worth more than the kernels