a one-machine research lab · MIT

The LLM factory
that fits on one Mac.

posttrainllm turns a stock open model into a routed specialist that earns its keep — then shows you exactly where it fails. Distill, fine-tune, gate on real evals, package for MLX, and open the model up with interpretability. One Apple Silicon machine. No cloud. No hand-waving.

See the specialists Train one in your browser →

losshuge-base-v1 · 200k steps · 0 spikes · loss 4.16

0.920SQL execution, on-device
100%file-ops hard gate — from 58%
100+CLI subcommands, one binary
0data leaves the laptop

the thesis

Not trying to win the frontier. Trying to reach frontier capability at a fraction of the compute — and to understand the whole machine while doing it.

Win on the Mac. Be best-in-class at what one Apple Silicon machine can actually do — train, post-train, evaluate, serve, inspect — with no cluster and no cloud dependency.

Learn the whole space. From y = mx + b to a self-improving factory. Every technique is built from scratch, anchored to the paper, and mapped to the file where it lives.

Show the scars. Failed runs are first-class. Every attempt is logged with the decision it forced. A number without its regression is marketing, not a result.

Build everything buildable here. If it fits on this Mac, it gets built — and packaged so that when compute arrives, the lab scales from a running start.

The loop the runtime closes.

Each surface emits the input the next one needs. Every run ends in a schema-valid folder — config, dataset, eval, decision, report — that factory-run validates.

01targetfrozen
02data5,675 rows
03post-traindpo · 200 steps
04evalexec 0.920
05packagetgla · mlx
06decideretry-data

latest · 2026-07-11 · qwen06-sql-hygiene-dpo · reference-anchored DPO fixed an earlier collapse, execution 0.860→0.920 with no regression — but output hygiene is a base-model prior, so the decision is retry-data. the one proof still missing is a run whose decision is ship.

Specialists, honest by design.

A specialist beats a generalist on its target and routes away when it shouldn't answer. Every result ships with the regression it costs.

released · weights4B · MLX

Qwen3-4B, file-ops distilled

Distilled to 100% on the multi-turn file-ops hard gate (BFCL), up from 58%. The honest cost: out-of-domain breadth dropped 59.6% → 42.3% — real catastrophic forgetting. It ships routed, never as a general planner.

file-ops gate: 58 → 100%
OOD breadth: 59.6 → 42.3%

The distillation write-up →

report-only candidate0.6B · routed

Qwen3-0.6B, routed SQL

Two adapters behind a router. Two reference-anchored DPO retries cured a policy collapse and pushed execution to 0.920 — but output hygiene is a base-model prior a small adapter can't strip. Decision: retry-data.

synthetic exec: 0.860 → 0.920
clean-SQL: 0.000

Full artifact + blockers →

Every artifact, with evidence and blockers →

Measured, on one machine.

Every figure below was recorded on a single Apple M5 Pro / 48 GB. No estimates, no extrapolation.

76tok/s4B decode, MLX

12.1×speedupWebGPU vs WASM, XL

<45sper pass50-row SQL eval

0spikes200k pretrain steps

½memoryref-free vs DPO

The whole stack — audited against the code.

Not a wrapper. Every capability is a real subcommand with tests behind it.

train

Pretrain, SFT, DPO / SimPO / KTO / ORPO, distillation, ES. Full PEFT — LoRA, LoRA+, DoRA, VeRA, LoftQ, AdaLoRA, PISSA. WSD schedules, spike recovery, z-loss.

eval

BFCL, τ-bench, lm-eval (MLX adapter), HumanEval + sandbox, SQL execution, router, MILU, MTEB. Frozen baselines, slice metrics, non-zero-exit gates.

serve

OpenAI- and Ollama-compatible on one socket. Agent loop, tool dispatch, FSM-constrained JSON, persistent KV cache, speculative decoding, optional cloud escalation.

package

Export to MLX, safetensors, CoreML. Quantize (GGUF / AWQ / GPTQ / HQQ), prune, merge, bake-LoRA with DoRA magnitudes. Specialist model cards.

inspect

SAE features, ROME, MEMIT, tuned / logit lens, activation patching, linear probes, attention heatmaps. Know where the model decides.

browser

The same model trains in a browser tab via hand-written WebGPU kernels — Memory64, FlashAttention-2, blocked matmul. A from-scratch learning track, honest negative results included.

The lab keeps a paper trail.

Every experiment, lesson, and recipe is written down — with the decision it forced and the file it lives in.

learning log

Honest scope.

A single-developer project, shipping in public, MIT.
Mac-first — M-series, unified memory, MLX-Swift.
A factory for specialists, not a general assistant.
An OpenAI-compatible runtime any client already speaks.

Not a proven ship loop yet — the missing proof is a run that decides ship.
Not multi-GPU or distributed. One device, one Mac.
Not a cloud product. Nothing leaves the laptop unless you ask.

run it — one Mac, from source

# build the native factory
git clone …/tinygpt && cd native-mac
swift build --product posttrainllm

# distill a specialist, gate it, serve it
posttrainllm distill --teacher qwen3 --student …
posttrainllm eval-gate --spec sql.json --candidate …
posttrainllm serve --port 8080  # OpenAI-compatible

Full quickstart →

The LLM factory
that fits on one Mac.

Not trying to win the frontier. Trying to reach frontier capability at a fraction of the compute — and to understand the whole machine while doing it.

The loop the runtime closes.

Specialists, honest by design.

Qwen3-4B, file-ops distilled

Qwen3-0.6B, routed SQL

Measured, on one machine.

The whole stack — audited against the code.

The lab keeps a paper trail.

Ground-up, in 10 modules

What we tried

How it was built

Method → recipe

Where it decides

Honest scope.

The LLM factorythat fits on one Mac.

Not trying to win the frontier. Trying to reach frontier capability at a fraction of the compute — and to understand the whole machine while doing it.

Qwen3-4B, file-ops distilled

Qwen3-0.6B, routed SQL

Ground-up, in 10 modules

What we tried

How it was built

Method → recipe

Where it decides

Honest scope.

The LLM factory
that fits on one Mac.