Phase 1-10 validation + end-to-end workflows
This doc captures the validation runs and workflows that exercise the features shipped in Phases 1-10. The goal isn’t peer-review rigour — it’s enough evidence that each surface RUNS end-to-end on real data, so a future user knows the pieces actually compose.
1. End-to-end training with the new architecture knobs ✓ RUN
Run a Huge model on FineWeb-Edu with --diff-attn and --mod
enabled, plus the standard stability stack (--grad-clip 1.0,
cosine LR, val split, atomic save).
Actual run (29 May 2026, 22 min wall-clock)
preset: huge (12L · d=256 · ctx=512)
features: --diff-attn --mod
params: 26,931,736 (2× attn projections vs vanilla Huge)
dtype: bfloat16
batch / accum: 4 × 4 = 16 effective
steps: 500
tokenizer: SmolLM2 BPE (vocab=49152)
Loss curve:
| Step | Train loss | Val loss | Notes |
|---|---|---|---|
| 1 | 11.222 | — | Initial (worse than uniform = log(49152) ≈ 10.8) |
| 50 | 7.596 | — | End of warmup, LR at peak 6e-4 |
| 100 | 6.850 | 7.086 | First val eval |
| 200 | 6.831 | 6.556 | Val improving |
| 300 | 6.243 | 6.557 | LR cosine decay, ~half-way |
| 400 | 6.460 | 6.561 | LR ≈ 1.2e-4 |
| 500 | 6.464 | 6.494 | Final: train ≈ val ≈ 6.5 |
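Δ initial → final: −4.76 nats. Both --diff-attn (2× attention
projections + λ) and --mod (sigmoid gate per token per block) were
active simultaneously. Train loss falls steeply through warmup, then
plateaus between 6.2 and 6.9 during cosine decay; it is not strictly
monotonic (6.243 at step 300, 6.460 at step 400). Val loss improves
from 7.086 to 6.494. No spikes, no NaNs. Train and val track each
other closely, so there is no sign of overfitting.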
Sample after training
$ tinygpt sample /tmp/validation-huge.tinygpt --prompt "Once upon a time" --tokens 40 --temperature 0.8
Once upon a time, has a last, for what it over the the kids and I 2.
the guide's areas at the we have place, a biodiversity healthy has
information and the kind of routine.
500 steps is far short of convergence on 2 GB of text, but the output is recognisably English (real words, grammatical fragments, vocabulary coherent with the FineWeb-Edu domain). No NaN garbage, no repetition loops.
caffeinate -di tinygpt train \
--preset huge \
--tokenizer /tmp/smollm2 \
--corpus /tmp/fineweb-edu-500M.txt \
--dtype bfloat16 --batch 4 --accum 4 --ctx 512 \
--steps 500 \
--diff-attn --mod \
--lr-schedule cosine --warmup 50 --max-lr 6e-4 --min-lr 6e-5 \
--save-every 100 --val-split 0.005 --val-every 100 \
--sample-every 999999 \
--out /tmp/validation-huge.tinygpt
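The LR values quoted in the loss table can be sanity-checked with a small schedule sketch. This assumes linear warmup followed by a single cosine decay over the remaining steps; the exact shape tinygpt uses is not shown in this doc, and lr_at is a hypothetical helper, not part of the CLI.

```python
import math

def lr_at(step, total_steps=500, warmup=50, max_lr=6e-4, min_lr=6e-5):
    """Assumed schedule: linear warmup to max_lr, then cosine decay to min_lr."""
    if step <= warmup:
        return max_lr * step / warmup
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Print the LR at the steps that appear in the loss table.
for step in (50, 300, 400, 500):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the value at step 400 comes out close to the 1.2e-4 noted in the table, and step 500 lands on --min-lr.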
Expected runtime (M-series, bf16):
- BPE tokenize: 20-30 min on 2 GB corpus (cached on subsequent runs)
- Train: ~500 steps × ~0.5 s/step = 4-5 min
Success criteria:
- Training completes without OOM, NaN, or kernel errors
- Final loss decreases relative to step 0
- Checkpoint loads + samples successfully via tinygpt sample
The combination of DiffAttn (2× attention projections + λ) and MoD (sigmoid gate per token per block) puts both new architectural surfaces under load simultaneously. If either had a wiring bug, this run would crash or NaN.
2. MoE end-to-end (the Phase 5 deliverable)
Now that MoE save/load works, this is the workflow that bookends Phase 5: train a tiny MoE on BPE-tokenized data, save it, reload, sample.
# Step 1 — train a tiny MoE
tinygpt train \
--preset tiny \
--tokenizer /tmp/smollm2 \
--corpus /tmp/smoke-corpus.txt \
--moe-experts 4 --moe-topk 2 \
--steps 200 \
--out /tmp/moe-tiny.tinygpt
# Step 2 — sample from the saved MoE
tinygpt sample /tmp/moe-tiny.tinygpt \
--prompt "Once upon a time" --tokens 80 --temperature 0.8
Success criteria:
- Step 1 trains without router collapse (loss decreases)
- Step 2 loads the MoE blocks correctly and produces coherent text
- tinygpt inspect /tmp/moe-tiny.tinygpt shows moe.router.weight and moe.experts.0..3.{fc_in,fc_out}.weight entries per layer
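For readers new to --moe-experts 4 --moe-topk 2, here is a minimal sketch of top-k routing for a single token. It assumes the gate weights are a softmax over the selected experts' router logits only; tinygpt's actual normalisation is not documented here, and route_top_k / moe_forward are illustrative names.

```python
import math

def route_top_k(router_logits, top_k=2):
    """Return [(expert_index, weight)] for the top_k experts of one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Softmax restricted to the chosen experts (an assumption, see lead-in).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

def moe_forward(token, router_logits, experts, top_k=2):
    """Weighted sum of the chosen experts' outputs for one token vector."""
    out = [0.0] * len(token)
    for index, weight in route_top_k(router_logits, top_k):
        expert_out = experts[index](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
routing = route_top_k([0.1, 2.0, -1.0, 1.0])
mixed = moe_forward([1.0, 1.0], [0.1, 2.0, -1.0, 1.0], experts)
```

Router collapse, in these terms, is the router sending nearly every token to the same pair of experts.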
3. MoE distillation pipeline ✓ RUN
The original Phase 5 headline was “distill from a big teacher into our smaller MoE”. The mechanics:
# Step 1 — initialise a small MoE student with the teacher's tokenizer.
tinygpt train \
--preset tiny \
--tokenizer /tmp/smollm2 \
--corpus /tmp/smoke-corpus.txt \
--moe-experts 4 --moe-topk 2 \
--steps 100 \
--out /tmp/moe-student-init.tinygpt
# Step 2 — distill from SmolLM2 (HF, dense) into the MoE student.
tinygpt distill /tmp/moe-student-init.tinygpt \
--teacher /tmp/smollm2 \
--corpus /tmp/smoke-corpus.txt \
--tokenizer /tmp/smollm2 \
--steps 200 --temperature 4 --alpha 0.7 \
--out /tmp/moe-distilled.tinygpt
The student is a from-scratch MoE; the teacher is an HF dense model (SmolLM2). Both share the SmolLM2 tokenizer, so their softmax distributions range over the same vocabulary and the divergence between them is well-defined.
Actual run (teacher: the Section 1 checkpoint instead of SmolLM2; 30 steps instead of 200)
student: tiny (4L · d=128 · 4 experts top-2 · 8,683,776 params)
teacher: /tmp/validation-huge.tinygpt (12L · d=256 · 26,931,736 params,
trained with --diff-attn --mod)
vocab: 49152 (shared SmolLM2 BPE)
loss: α·T²·KL + (1−α)·NLL [α=0.7 T=4.0]
steps: 30
step 1/30 loss 1.927
step 30/30 loss 0.213
done — 30 steps in 2.3s (13.2 step/s)
Loss dropped 1.93 → 0.21 in 30 steps — fast because the student is much smaller than the teacher and learning from soft labels on a tiny corpus. The DISTILLED MoE sampled:
$ tinygpt sample /tmp/moe-distilled.tinygpt --prompt "The quick brown fox" --tokens 20 --temperature 0.8
The quick brown fox jumps over the lazy dog. Lorem ($ ipsum dolor sit amet, consect
The sample reproduces the smoke-corpus seed text almost exactly: the student overfit the small corpus, as expected. It still shows that the distillation pipeline closes the loop: a Phase 10 teacher (with DiffAttn + MoD) was distilled into a Phase 5 student (MoE), saved via the new manifest schema, reloaded, and sampled cleanly.
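The loss line in the run log, α·T²·KL + (1−α)·NLL, can be written out for a single token position as below. This is a sketch under the usual assumptions (KL taken from teacher to student on temperature-softened distributions, NLL on the student's untempered logits); the function names are illustrative and not tinygpt's API.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]

def distill_loss(student_logits, teacher_logits, target, alpha=0.7, temperature=4.0):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * NLL(target)."""
    s_soft = log_softmax(student_logits, temperature)
    t_soft = log_softmax(teacher_logits, temperature)
    kl = sum(math.exp(t) * (t - s) for t, s in zip(t_soft, s_soft))
    nll = -log_softmax(student_logits)[target]
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * nll

teacher = [2.0, 0.5, -1.0]
loss_matched = distill_loss(teacher, teacher, target=0)        # KL term is zero
loss_uniform = distill_loss([0.0, 0.0, 0.0], teacher, target=0)
```

The T² factor keeps the soft-label gradient on a comparable scale to the hard-label term when the temperature changes.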
tinygpt inspect /tmp/moe-distilled.tinygpt confirms the full MoE
structure round-tripped through distillation:
blocks.0.moe.router.weight [4, 128] 512
blocks.0.moe.experts.0.fc_in.weight [512, 128] 65,536
blocks.0.moe.experts.0.fc_in.bias [512] 512
blocks.0.moe.experts.0.fc_out.weight [128, 512] 65,536
…etc, 4 experts per block × 4 blocks
Note: distilling FROM an HF MoE teacher (Mixtral, DeepSeek) is the
next step but is blocked on the HF MoE safetensors loader — see
docs/phase_9_10_status.md.
4. Interpretability — tuned-lens pipeline
Trained probes give better-calibrated per-layer predictions than the raw final-LN + LM-head lens.
# Train the probes — base frozen, only the lens probes update.
tinygpt tuned-lens /tmp/validation-huge.tinygpt \
--corpus /tmp/smoke-corpus.txt \
--steps 300 --lr 1e-3 \
--out /tmp/huge.lenses
Browser side: open the playground, load a model, click the 🎯 button
next to “Logit lens”, select the .lenses file. The worker parses it
and uses the trained probes on the next “Logit lens” click — the
ASCII table shows per-layer predictions that are SHARPER than the
raw lens (the trained probes are layer-calibrated; the raw lens is
not).
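To make the raw-versus-tuned distinction concrete, here is a toy sketch. It assumes each probe is an affine map applied to the intermediate hidden state before the final LayerNorm and unembedding, and it omits LayerNorm's learned gain and bias; where tinygpt actually inserts its probes is not stated in this doc, so treat the function names and shapes as hypothetical.

```python
import math

def layer_norm(h, eps=1e-5):
    """LayerNorm without learned gain/bias (simplification)."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def raw_lens(h, unembed):
    """Raw lens: final LN + LM head applied directly to a mid-layer state."""
    return matvec(unembed, layer_norm(h))

def tuned_lens(h, probe_w, probe_b, unembed):
    """Tuned lens: a per-layer affine probe first, then the same LN + head."""
    translated = [y + b for y, b in zip(matvec(probe_w, h), probe_b)]
    return matvec(unembed, layer_norm(translated))

h = [0.5, -1.0, 2.0]
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]            # toy 2-token vocab
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
raw_logits = raw_lens(h, unembed)
tuned_logits = tuned_lens(h, identity, [0.0, 0.0, 0.0], unembed)
```

With an identity probe the two lenses coincide, so any difference after training comes from what the probe learned about that layer.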
5. Activation patching + ablation in the browser
After loading any gallery model:
- “Ablate & sample” — pick a layer index + a target (attn, mlp, or whole layer) → that component is zeroed at every position during generation. Reveals how load-bearing the block is.
- Patch button (via worker patch message; UI in next iteration): zero out one (layer, position) pair in the residual stream. Pinpoints whether THAT token’s representation at THAT depth was load-bearing.
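The ablation targets above can be pictured with a toy residual block. This sketch assumes a plain x + attn(x) + mlp(·) residual layout with no normalisation, and that a whole-layer ablation leaves the residual stream untouched; the real TinyGPT block and worker messages are not reproduced here.

```python
def block_forward(x, attn, mlp, ablate=None):
    """Toy residual block. `ablate` is None, "attn", "mlp" or "layer"."""
    attn_out = attn(x)
    if ablate in ("attn", "layer"):
        attn_out = [0.0] * len(x)      # zero the attention contribution
    h = [a + b for a, b in zip(x, attn_out)]
    mlp_out = mlp(h)
    if ablate in ("mlp", "layer"):
        mlp_out = [0.0] * len(h)       # zero the MLP contribution
    return [a + b for a, b in zip(h, mlp_out)]

# Stand-in sublayers, purely for illustration.
attn = lambda v: [0.5 * t for t in v]
mlp = lambda v: [t + 1.0 for t in v]

x = [1.0, 2.0]
full = block_forward(x, attn, mlp)
no_mlp = block_forward(x, attn, mlp, ablate="mlp")
no_layer = block_forward(x, attn, mlp, ablate="layer")
```

Comparing samples from the full and ablated forward passes is what shows how load-bearing the component is.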
6. LASER + HQQ — post-hoc weight surgery
Post-training operations on a finished .tinygpt file:
# Drop the bottom 30% of singular components from the late layers'
# MLP outputs. Sometimes improves downstream accuracy by removing
# the "noise tail" that the higher components had to fight.
tinygpt laser /tmp/validation-huge.tinygpt \
--target mlp.fc_out --layers 8-11 \
--rank-fraction 0.7 \
--out /tmp/huge-lasered.tinygpt
# Quantise-then-dequantise via HQQ's IRLS solver. Stores the
# DEQUANTISED weights as dense fp32 (the inference-time memory win
# would require a packed-int4 matmul kernel).
tinygpt hqq /tmp/validation-huge.tinygpt \
--bits 4 --group-size 64 --p 0.7 \
--layers 0-11 \
--out /tmp/huge-hqq.tinygpt
Both operate at the .tinygpt file level — load, modify the weight tensors, write a new file. The rest of the toolchain (sample, eval, finetune) treats the output identically to the input.
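As a rough picture of what a quantise-then-dequantise round trip does to a weight tensor, the sketch below uses plain min-max round-to-nearest quantisation per group of 64 values at 4 bits. It is deliberately not HQQ: the solver that tunes the quantisation parameters is omitted, and the function names are hypothetical.

```python
import math

def quant_dequant_group(values, bits=4):
    """Round-trip one group through a uniform `bits`-bit grid."""
    levels = (1 << bits) - 1            # 15 steps for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0.0:                    # constant group: nothing to quantise
        return list(values), 0.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return [lo + c * scale for c in codes], scale

def quant_dequant(weights, bits=4, group_size=64):
    """Apply the round trip group by group over a flat weight list."""
    restored = []
    for start in range(0, len(weights), group_size):
        group, _ = quant_dequant_group(weights[start:start + group_size], bits)
        restored.extend(group)
    return restored

weights = [math.sin(i) for i in range(128)]
restored = quant_dequant(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The output has the same shape and dtype as the input, which is why downstream tools can treat the rewritten file like any other checkpoint.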
7. ES — gradient-free training
A separate trainer for non-differentiable rewards or as an educational counterpoint to AdamW:
tinygpt es /tmp/validation-huge.tinygpt \
--corpus /tmp/smoke-corpus.txt \
--steps 50 --population 40 --sigma 0.02 --lr 0.01 \
--out /tmp/huge-es.tinygpt
Per step: K=40 forward passes (no backward). Slower per step than SGD but fully parallelisable and works on rewards that aren’t differentiable.
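A single ES update can be sketched in a few lines. This version perturbs the parameters with Gaussian noise, scores each perturbation with a forward-only reward, and moves along the reward-weighted noise. It standardises rewards within the population, which is one common variant and an assumption here; tinygpt's exact estimator is not shown in this doc. The quadratic toy_reward is purely illustrative.

```python
import random
import statistics

def es_step(theta, reward_fn, population=40, sigma=0.02, lr=0.01, rng=random):
    """One evolution-strategies update: `population` forward evaluations, no backward."""
    noises = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(population)]
    rewards = [reward_fn([t + sigma * e for t, e in zip(theta, eps)])
               for eps in noises]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:                      # flat reward landscape: no signal
        return list(theta)
    advantages = [(r - mean) / std for r in rewards]
    return [
        t + lr * sum(a * eps[i] for a, eps in zip(advantages, noises)) / population
        for i, t in enumerate(theta)
    ]

def toy_reward(theta):
    """Illustrative reward: negative squared distance to a fixed target."""
    target = [1.0, -1.0, 0.5]
    return -sum((t - g) ** 2 for t, g in zip(theta, target))

rng = random.Random(0)
theta = [0.0, 0.0, 0.0]
initial_reward = toy_reward(theta)
for _ in range(50):
    theta = es_step(theta, toy_reward, rng=rng)
final_reward = toy_reward(theta)
```

Each of the K reward evaluations is independent of the others, which is where the parallelism comes from.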
8. Magpie — synthetic SFT data
Bootstrap an SFT dataset from any chat-format base:
tinygpt magpie /path/to/chat-tuned-model \
--count 500 --template chatml --temperature 0.9 \
--out /tmp/magpie-sft.jsonl
Each line: {"instruction": "...", "response": "..."}. Pipe this
into tinygpt sft for fine-tuning, applying user-side quality
filters (length ≥ 20, no repeating loops) first.
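The user-side filters mentioned above might look like the following sketch. It assumes "length ≥ 20" means response length in characters and treats any 4-word sequence repeated more than three times as a loop; both thresholds are illustrative guesses rather than tinygpt defaults.

```python
import json

def has_repeating_loop(text, ngram=4, max_repeats=3):
    """True if any `ngram`-word window occurs more than `max_repeats` times."""
    words = text.split()
    counts = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False

def keep(record, min_len=20):
    """Keep records with a non-empty instruction and a usable response."""
    instruction = record.get("instruction", "")
    response = record.get("response", "")
    return (bool(instruction.strip())
            and len(response) >= min_len
            and not has_repeating_loop(response))

def filter_jsonl(lines):
    """Yield parsed records from JSONL lines that pass the filters."""
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            if keep(record):
                yield record

sample_lines = [
    '{"instruction": "Explain photosynthesis.", "response": "Plants convert light into chemical energy stored as sugar."}',
    '{"instruction": "Say yes.", "response": "Yes."}',
    '{"instruction": "Loop.", "response": "the the the the the the the the the the"}',
]
kept = list(filter_jsonl(sample_lines))
```

Only the first sample line survives: the second is too short and the third is a repetition loop.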
What this validation does NOT cover
- Leaderboard scoring of the validation artifact. The benchmark scorers (browser/score_gallery.mjs) target byte-level models in the gallery format; scoring a BPE-tokenized Huge would need a parallel scoring path.
- Long-context behaviour: YOCO’s KV-cache memory win only shows up at ctx ≥ 1024 with autoregressive decode. The 500-step validation runs at ctx=512.
- Quality A/B between configurations. We’re validating that things RUN, not that the new features improve quality on a specific task. Doing the A/Bs is the next round of experiments.
This is a “smoke” report: strong evidence that the pieces compose, weak evidence on RELATIVE performance. Establishing the latter needs its own focused experiment, plus leaderboard scorers extended to BPE models.
Bugs the validation actually caught
The point of running this report is not to feel good about the work — it’s to catch the gaps between “compiles” and “works.” Two real bugs landed during the validation that wouldn’t have surfaced otherwise:
- tinygpt tuned-lens crashed at the first gradient step with Fatal error: [grad] Must specify at least one argument. The probe Linears were being attached to TinyGPTModel via post-init assignment to an Optional @ModuleInfo field; MLX-Swift’s parameter discovery wasn’t picking them up as trainable through that path. Fix in commit a64de95: probes live in a standalone TunedLensProbesModule; valueAndGrad targets that module directly while the base model is closure-captured.
- npm run build (production Astro) failed with a Vite parse error at src/pages/index.astro:3315:16: Expected ";" but found "tinygpt". Astro / esbuild’s JSX-ish parser interprets backticks inside HTML comments as template-literal delimiters; the comment content then fails to tokenise as JS. Fix in commit 9877bb7: replaced backticks with plain quotes in the comment around the lensUploadLabel.
Both passed every prior check (Swift swift build was green; TypeScript
tsc --noEmit was green) — only END-TO-END execution caught them.
Lesson for future cycles: ship + validate is not the same as ship +
compile.
Appendix — evaluation & safety
(Merged from the former docs/evaluation.md.)
Required correctness tests
The full table is in ../tests/README.md. Summary:
| Test | Purpose |
|---|---|
| Tokenizer roundtrip | bytes → text → bytes is lossless |
| Shape tests | every layer returns the expected shape |
| Loss sanity | random model loss near ln(256) ≈ 5.55 |
| Tiny overfit | model overfits 1–10 KB repeated text |
| Gradient check | finite-difference check on a tiny layer |
| PyTorch parity | ported forward matches the PyTorch reference |
| Checkpoint reload | same loss after save + reload |
| Sampling fixed seed | deterministic generation for a fixed seed |
| Browser refresh | run resumes after a page reload |
The most important test: can it overfit a tiny repeated dataset? If not, do not scale — the model, backprop, or data pipeline is broken.
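The loss-sanity row rests on one piece of arithmetic: a model that spreads probability uniformly over V tokens has cross-entropy ln(V). A minimal check, with illustrative function names:

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy (nats) of a uniform prediction over vocab_size tokens."""
    return math.log(vocab_size)

def cross_entropy(logits, target):
    """Cross-entropy of one prediction, computed stably from raw logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]

# All-equal logits are a uniform prediction, whatever the target token is.
byte_level = cross_entropy([0.0] * 256, target=65)
bpe_level = uniform_loss(49152)
print(f"{byte_level:.3f} {bpe_level:.3f}")
```

A freshly initialised model should start near this value; starting far above it points at an initialisation or logit-scaling problem, and starting far below it points at label leakage.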
The evaluation matrix (LoRA)
Always run four comparisons:
A. Base model only
B. Base model + prompt examples (few-shot)
C. Base model + LoRA
D. Base model + LoRA + retrieval
Typical outcome:
| Setup | Behaviour |
|---|---|
| Base only | generic |
| Few-shot prompt | immediate style improvement |
| LoRA | stronger tone/style adaptation |
| Retrieval + LoRA | best practical quality |
LoRA teaches style. Retrieval supplies context/facts.
If LoRA does not beat few-shot prompting, the adapter was not worth training.
Memorization test
Tiny models memorize; LoRA adapters can memorize too.
Take the first 30–50 tokens of a training example.
Ask the model to continue.
Check whether it reproduces the rest verbatim.
If it copies too much: train fewer steps, lower rank, lower learning rate, add dropout, deduplicate data, increase dataset size, avoid repeated text.
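The three steps of the memorization test can be scripted roughly as follows. The generate callable is a stand-in for whatever sampling entry point is in use (an assumption, not a tinygpt API), and the score is simply the fraction of the held-back remainder reproduced verbatim from the start of the continuation.

```python
def verbatim_overlap(generated, reference):
    """Fraction of `reference` matched token-for-token from the start."""
    if not reference:
        return 0.0
    matched = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        matched += 1
    return matched / len(reference)

def memorization_score(example_tokens, generate, prefix_len=40):
    """Prompt with the first prefix_len tokens, compare against the rest."""
    prefix = example_tokens[:prefix_len]
    rest = example_tokens[prefix_len:]
    continuation = generate(prefix, len(rest))
    return verbatim_overlap(continuation, rest)

# Stub generators for illustration: one parrots the training example,
# the other produces unrelated tokens.
example = list(range(100))
parrot = lambda prefix, n: list(range(len(prefix), len(prefix) + n))
fresh = lambda prefix, n: [-1] * n

parrot_score = memorization_score(example, parrot)
fresh_score = memorization_score(example, fresh)
```

Scores near 1.0 across many training examples are the signal to apply the mitigations listed above.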
Qualitative questions
For a style-adaptation project, ask:
- Does it sound more like the corpus?
- Does it copy exact training text?
- Does it hallucinate facts?
- Does retrieval improve grounding?
- Does LoRA actually beat few-shot prompting?
Backend parity (Phase 4–5)
- WASM vs PyTorch — the WASM forward must match the Python reference within tolerance before you trust WASM training.
- WebGPU vs WASM — each WebGPU kernel must match the WASM kernel within tolerance before it joins the pipeline. Start with matmul only.
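"Within tolerance" in both parity checks can be pinned down with a combined absolute and relative bound, in the spirit of the usual allclose test. The tolerance values below are placeholders to be tuned per backend and dtype, and assert_parity is a hypothetical helper.

```python
def assert_parity(candidate, reference, atol=1e-4, rtol=1e-3):
    """Raise AssertionError if any element differs beyond atol + rtol * |ref|."""
    if len(candidate) != len(reference):
        raise AssertionError(
            f"length mismatch: {len(candidate)} vs {len(reference)}")
    for i, (c, r) in enumerate(zip(candidate, reference)):
        if abs(c - r) > atol + rtol * abs(r):
            raise AssertionError(
                f"index {i}: {c} vs {r} (diff {abs(c - r):.3e})")

# Flattened outputs of the backend under test and of the reference forward.
assert_parity([1.00001, -2.0], [1.0, -2.0])   # passes silently
```

Checking one kernel at a time, starting with matmul, keeps a failure attributable to a single change.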
Safety notes
- Style ≠ intelligence. LoRA on blogs learns tone, format, phrasing, argument rhythm, vocabulary — not truth, judgment, current beliefs, or reliable reasoning.
- Watch for data leakage and copying risk, especially with tiny models.
- Do not redistribute adapters trained on a living author’s writing.
Evaluation deliverable
A small evaluation suite producing: base output, prompt-only output, LoRA output, and LoRA + retrieval output — for the same held-out prompts.