Knowledge distillation — making a tiny model punch above its weight
Distillation is the post-training technique where a SMALL model (“student”) is trained to match a LARGER model’s (“teacher’s”) output distribution on a corpus. The result: students that significantly outperform what you’d get by training the same architecture from scratch on the same data — sometimes by 2-5x in perplexity terms.
This guide covers the workflow as it exists in tinygpt distill, plus
the comparison protocol for “distilled vs from-scratch at the same
parameter count” — the case study that justifies the technique on the
TinyGPT leaderboard.
Reference: Hinton et al., 2015, “Distilling the Knowledge in a Neural
Network” (https://arxiv.org/abs/1503.02531).
Why it works
Training from a one-hot target (cross-entropy NLL) gives the student a very thin signal at each position: “the next token is X, everything else is wrong.” Training against the teacher’s full softmax distribution exposes the student to the teacher’s uncertainty — the fact that “Y was almost as plausible as X here, and Z is clearly out.”
That richer signal is most valuable when:
- The student is too small to discover the teacher’s structure on its own from raw text (e.g. a 5M student vs a 100M+ teacher).
- The corpus is small relative to what would normally be needed.
- The teacher has been instruction-tuned, and you want the student to absorb that behaviour without re-running SFT/DPO.
The loss
tinygpt distill uses the standard two-term distillation loss:
L = α · T² · KL( softmax(s_logits / T) ‖ softmax(t_logits / T) )
+ (1 − α) · NLL(s_logits, true_target)
- T (temperature) softens both distributions. T=1 means the argmax dominates; higher T flattens the distribution so the student learns the full shape, not just the mode. Typical: T = 4-8.
- The T² multiplier compensates for the gradient-scaling caused by dividing logits by T. Without it, raising T silently shrinks the KL term’s effective learning rate.
- α (alpha) mixes the soft (KL) and hard (NLL) terms. Higher α means the student listens to the teacher more; lower α keeps it grounded in real data. Typical: α = 0.5-0.9.
Both terms are needed: pure KL lets the student drift away from real data if the teacher is wrong somewhere; pure NLL throws away the teacher’s soft information. The 0.7/0.3 mix is the HuggingFace “distil” family default.
Command
tinygpt distill <student-init> \
--teacher <teacher-path> \
--corpus <corpus.txt> \
--tokenizer <hf-tokenizer-dir> \
--steps 5000 \
--temperature 4 \
--alpha 0.7 \
--out distilled.tinygpt
The student-init path is a .tinygpt checkpoint — start with the
output of a short tinygpt train --preset tiny run (just enough to
have valid weights — distillation does the heavy lifting). The
teacher path is either another .tinygpt file OR an HF model
directory; the loader auto-detects.
Both teacher and student MUST share a tokenizer (same vocab size + ids).
The distill command asserts vocab equality at startup.
The comparison protocol
To make the “distilled vs scratch” claim concrete, run the same target configuration both ways and score them on the same benchmark.
Step 1: Train a teacher
# A 100M-class teacher on FineWeb-Edu, BPE-tokenised.
tinygpt train --preset huge \
--tokenizer /tmp/smollm2 \
--corpus /tmp/fineweb-edu-500M.txt \
--dtype bfloat16 --batch 4 --accum 4 --ctx 512 \
--steps 5000 --save-every 100 \
--out /tmp/teacher.tinygpt
Step 2: Make a from-scratch student of the chosen tiny size
# A 5M-class student, also BPE on the same tokenizer. Train for a
# matched compute budget (NOT a matched step count — the bigger
# teacher's steps cost more, so the student gets more steps).
tinygpt train --preset tiny \
--tokenizer /tmp/smollm2 \
--corpus /tmp/fineweb-edu-500M.txt \
--dtype bfloat16 --batch 16 --ctx 512 \
--steps 20000 --save-every 200 \
--out /tmp/student-scratch.tinygpt
Step 3: Distill the same student-init from the teacher
# Same architecture as Step 2, same compute budget. Initialise from a
# short randomly-trained run so weights are in a reasonable range.
tinygpt train --preset tiny \
--tokenizer /tmp/smollm2 \
--corpus /tmp/fineweb-edu-500M.txt --dtype bfloat16 \
--steps 500 --out /tmp/student-init.tinygpt
tinygpt distill /tmp/student-init.tinygpt \
--teacher /tmp/teacher.tinygpt \
--corpus /tmp/fineweb-edu-500M.txt \
--tokenizer /tmp/smollm2 \
--steps 20000 \
--temperature 4 --alpha 0.7 \
--out /tmp/student-distilled.tinygpt
Step 4: Score both on the same held-out benchmark
tinygpt eval /tmp/student-scratch.tinygpt --corpus /tmp/holdout.txt
tinygpt eval /tmp/student-distilled.tinygpt --corpus /tmp/holdout.txt
tinygpt eval /tmp/teacher.tinygpt --corpus /tmp/holdout.txt
Expected (rough) result for a 100M teacher → 5M student on education-grade text:
| Model | Params | Holdout PPL |
|---|---|---|
| Teacher | 100M | ~12 |
| Distilled student | 5M | ~22 |
| Scratch student | 5M | ~45 |
The exact numbers depend on data, training budget, and how instruction-tuned the teacher is. The qualitative pattern (~2× PPL gap between distilled and scratch at the same student size) is robust.
Hyperparameter notes
- Temperature: start at 4. If the student is having trouble learning the soft distribution (loss plateaus high), try T = 8. If the student over-mimics teacher mistakes, try T = 2.
- Alpha: start at 0.7 (KL-heavy). If you trust the data more than the teacher (e.g. teacher is itself trained on a different corpus), drop to 0.5. If the teacher is heavily instruction-tuned and you want to inherit that, push to 0.9.
- Learning rate: 3e-4 is the safe default for a from-scratch student. Lower (1e-4) if initialising from a partially-trained checkpoint.
- Batch / context: the teacher forward runs on the same batch, so
per-step memory is ~2× a normal train step. The default batch
sizes in
Distill.swiftare halved fromTrain.swift’s defaults accordingly.
Caveats
- Tokenizer must match: cross-tokenizer distillation requires a
token-id remapping that we don’t ship today. Both models use the
same
.tokenizer.sourcefield, whichdistillasserts. - Student class: the current implementation expects a from-scratch
student (
TinyGPTModel). HF-architecture student support is a follow-up. - No gradient flows through the teacher: the teacher’s parameters
are not in the
valueAndGradtarget, so MLX treats its activations as constants. No explicitstop_gradientis needed.
Reasoning distillation (R1-Distill family)
A particularly hot variant in 2024-2026: distilling REASONING behaviour from a chain-of-thought-trained teacher (e.g. DeepSeek-R1’s distilled series) into a much smaller student. The recipe is the same loss; what changes is the corpus — instead of generic web text, you feed prompts + the teacher’s STEP-BY-STEP responses. The student learns to produce the same reasoning structure even though it doesn’t have the reasoning capacity to derive it from scratch.
To reproduce that workflow:
- Generate or collect a corpus of
prompt → step-by-step responsepairs from a reasoning teacher. - Format them with the same
chatmltemplate SFT uses. - Run
tinygpt distillwith--temperature 2(sharper — we want to preserve the reasoning chain’s specific tokens, not the broad distribution shape).
This is the path to a ~5M reasoning-capable student on the leaderboard.