An open-source LLM factory for one Mac.

Build routed specialists that earn their keep — on a laptop you already own.

TinyGPT is a Swift codebase + 30+ CLI subcommands for the full post‑training stack on Apple Silicon: train, distill from a local teacher, fine‑tune with LoRA / DoRA / QLoRA, gate with real evals (BFCL, τ‑bench, lm‑eval), serve OpenAI‑compatible, and open the model up with sparse autoencoders, MEMIT, activation patching. Same model also trains and runs in a browser tab via hand‑written WebGPU kernels. The first packaged specialist is honest by design: frontier‑level on file‑ops, routed only because it regresses outside that domain.

Build the Mac CLI xcodebuild · macOS 14+ · MLX‑Swift Open the browser playground WebGPU · trains a GPT‑2 in your tab Read the public artifacts model packages · benchmark reports · blockers

First package: qwen3‑4B file‑ops specialist · 7.5 GB local HF/MLX artifact · routed specialist, not a general planner.

The strongest measured claim

A 4B routed file‑ops specialist, locally distilled, matches the frontier on its multi‑turn hard gate.

Qwen3‑4B file‑ops specialist

DeepSeek‑V4‑pro frontier 100%
Qwen3‑4B file‑ops specialist tinygpt 100%
Gemma‑4‑12B‑qat 83%
Qwen3‑4B (stock + plan prompt) 75%
Gemma‑3‑12B 33%

BFCL multi‑turn hard gate, n=12 GorillaFileSystem agentic tasks, task‑completion rate. Distillation: ~99 frontier trajectories × LoRA SFT, single Mac, no cloud train. Cheap routed specialist > bigger general on one domain is the project thesis — not "4B > 12B in general." The same file‑ops specialist regresses on out‑of‑domain breadth (60 → 42%), so TinyGPT packages it with the caveat instead of pretending it is a general planner. Full writeup, including the honest tradeoffs →

Measured, on one Mac.

Numbers below are from a stock M5 Pro / 48 GB. Reproducible via tinygpt bench and the linked artefacts. Where a number applies only to a specific preset or build, the preset is named.

Decode throughput: 696tok/s
TTFT warm: 5.8msp99
Training step: 42ms
WebGPU vs WASM SIMD: 12.1×
ANE M8 decode: 17tok/s
Largest fit: 960M
Spec‑decode speedup: 1.4×

The cheapest token is the one you don’t rent.

Cloud serverless $0.20/ 1M tokens

A 4B model on Fireworks serverless (4B–16B tier). Every prompt and completion leaves the device.

The same model, on your Mac $0marginal

After a one‑time download: no metering, runs offline, and the data never leaves the laptop.

Cloud rate: Fireworks serverless pricing, 4B–16B tier, June 2026. Local marginal cost excludes electricity.

The loop the runtime closes.

Each surface emits the input the next one needs. The agent runtime records token‑preserving trajectories; those trajectories become SFT data; that data trains the next specialist; the eval gate decides whether it ships.

Data tinygpt download‑dataset 22 curated sets. HF streaming, Magpie synth, GitHub issue→PR, MS‑MARCO, the‑stack‑smol.
Distill tinygpt distill KL + NLL teacher→student. Local teacher (Qwen3, Gemma, DeepSeek via API or Codex CLI). Rejection‑sample on a checker.
Train / SFT / DPO tinygpt sft · dpo · es Full PEFT bundle: LoRA, LoRA+, DoRA, VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA. DPO/SimPO/KTO/ORPO in one trainer.
Eval gate tinygpt eval‑gate BFCL, τ‑bench, lm‑eval (MLX adapter), HumanEval+sandbox, judge shim. Exit‑non‑zero on regression.
Serve tinygpt serve OpenAI + Ollama surfaces on the same socket. Continue.dev / any OpenAI client plugs in unchanged.
Agent tinygpt agent Multi‑turn, tool dispatch, persistent KV, FSM‑constrained JSON, optional ‑‑cloud‑escalate.
Traces → data tinygpt traces‑to‑data Every rollout is a token‑preserving .atraj. Filter, dedupe (MinHash), emit ChatML SFT JSONL. Then loop to step 02.

What ships, audited against the code.

PLAN.md is the canonical shipped/skipped/TODO ledger. Each pillar below is one subset; clicking through gets you to specific files and the papers each technique cites.

Train

Every modern pre/post‑train, one trainer.

Pretrain — byte‑level or BPE; WSD schedule, gradient checkpointing, spike recovery
SFT — response‑only masking, ChatML / Alpaca / Llama / plain
Preference — DPO, SimPO, KTO, ORPO (one trainer, flags)
PEFT — LoRA, LoRA+, DoRA (TGLA v2 on disk), VeRA, LoftQ, AdaLoRA, RsLoRA, PISSA, LoRA‑FA
Optimisers — AdamW, Lion, Sophia, Muon, Adafactor, GaLore
Architecture — RoPE+GQA, sliding window, ALiBi, MoE, MoD, differential attn, YOCO, MTP

Distill

Frontier teacher → small student.

KL + NLL mix loss with temperature and α (Hinton 2015)
Rejection‑sampled trajectories — keep only what a checker passes
Gold‑clone fallback — on verifiable tasks the gold IS the trajectory; teacher‑free reproduces the same model
Trace replay — render in the student's own chat template via render_sft_from_traj.py
Headline result — 4B at frontier‑parity, documented honestly

Eval

Where the moat is.

Shared schema — every eval emits the same JSONL row (E0)
tinygpt eval‑bfcl — 10 BFCL categories, OpenAI‑compat shim into tinygpt serve
tinygpt eval‑tau‑bench — retail + airline, configurable user simulator
tinygpt run‑lm‑eval — MLX adapter for lm‑evaluation‑harness, two modes
tinygpt eval‑humaneval — Rust + sandbox‑exec code execution
tinygpt eval‑gate — CI‑grade regression gate, exits non‑zero

Serve · Agent

OpenAI‑compatible. Locally.

tinygpt serve — OpenAI and Ollama on the same socket; Continue.dev provider compat
tinygpt agent — multi‑turn loop, tool dispatch, persistent KV, FSM JSON
‑‑cloud‑escalate — defer to Anthropic / OpenAI only when the local model wants to
Speculative decoding in serve — ‑‑draft‑model draft+verify, ~1.4× decode, lossless greedy; vanilla/Medusa/EAGLE‑2 heads in the CLI
StreamingLLM sink, KV‑cache quant (KIVI), prefix caching
Token‑preserving traces — .atraj rollouts feed next round of SFT

Interp

Open the model up.

Logit lens · tuned lens (trainable per‑layer probes)
Activation patching — zero + donor‑swap, on (layer, position)
Per‑layer ablation · attention heatmap
Linear probes — trainable per‑layer classifiers, .lp sidecar
ROME · MEMIT — rank‑1 and rank‑K fact editing, multi‑layer with key‑norm weighting
Sparse autoencoders — tinygpt sae, group‑SAE, SAELens / Neuronpedia export

In a browser tab

The same model, no install.

Hand‑written WGSL kernels train a GPT‑2 in your tab. Blocked 4×4 matmul (5.18× kernel speedup at 2048³), FA2 forward + backward in WGSL, Memory64 lifts the 4 GB tab heap so a 473M‑param model allocates cleanly. End‑to‑end parity vs. the WASM reference: ≤ 2.5% loss drift.

Open the playground → See the speedup curve →

Run it.

Two paths. The Mac path is the primary one; the browser path is for when you want everything in‑tab, no toolchain, no install.

On a Mac

macOS 14+ · Xcode · Apple Silicon · MLX‑Swift

# build the CLI
cd native‑mac
export DEVELOPER_DIR=/Applications/Xcode.app/Contents/Developer
xcodebuild -scheme tinygpt -destination 'platform=macOS,arch=arm64' \
  -derivedDataPath .xcode-build build

# one‑command quickstart: data → specialist → chat
tinygpt quickstart --data my.jsonl

# or distill from a local teacher
tinygpt distill --teacher qwen3-4b --student huge \
  --data ./traces --out ./student.tinygpt

# then serve it OpenAI‑compatible
tinygpt serve --model ./student.tinygpt --port 8080

Full build instructions →

In a browser tab

WebGPU · Chrome 113+ / Safari 18+ · no install

The playground builds a transformer in WebAssembly + WebGPU, trains it in a Web Worker so the UI never freezes, and lets you watch the loss curve live as the model picks up structure. Every interpretability surface above is wired in — attention heatmap, logit lens, ablation, patching — under one "Inspect & evaluate" panel.

Parity‑tested against the WASM reference to ≤ 2.5% loss drift
OPFS checkpoint persistence — a run survives a tab refresh
Capability detection picks the right preset for your machine

Open the playground →

Honest scope.

What this is

A single‑developer project, shipping in public, MIT.
Mac‑first — M‑series, unified memory, MLX‑Swift.
A factory for specialists, not a general assistant.
An OpenAI‑compatible runtime any client already speaks.
A research substrate for interp, eval, and distillation.

What this isn’t (yet)

A general‑skill specialist. The frontier‑parity 4B dropped on out‑of‑domain BFCL (60 → 42%) — real catastrophic forgetting. Mixed‑backend distillation is the documented fix.
Multi‑GPU or distributed training. One device, single Mac.
A cloud product. Nothing leaves the laptop unless ‑‑cloud‑escalate is set.
An enterprise platform. No SSO, no tenancy, no SLA.
A finished story. The roadmap shows what’s shipped and what’s next.

Read the journey.

Decision logs and per‑technique explainers; each ties back to the paper it cites and the file where it lives.

Tool‑calling: frontier‑parity at 4B The headline result, the distillation recipe, and the honest tradeoffs PLAN.md — shipped · skipped · TODO The canonical ledger, audited against the code Knowledge distillation KL + NLL mix loss, temperature, α, rejection sampling Three phases of training Pretrain → SFT → DPO with paste‑ready commands Interpretability tools Logit lens, tuned lens, ablation, patching, SAE, ROME, MEMIT Agent runtime Multi‑turn loop, tool dispatch, traces, cloud escalation Eval leaderboard viewer Drag‑drop a JSONL; compare by step / model / task Lessons from the build The bugs and surprises that were worth more than the kernels