The performance journey
roadmap
A live ledger of every lever that would make in-browser training of TinyGPT
faster — what's shipped, what's blocked, what's open, and the honest reason
each is in the state it's in. ~80% of training time is matmul;
most of these levers attack it from a different angle.
Shipped — running today
Partial — exists but unverified
Blocked — external constraint
Open — not started
Benchmark log
Measured
Machine: Apple M-series
Build: emcc -O3 -msimd128
Driver: tests/bench_wasm.mjs
Every number reported on this page was measured on this machine, not extrapolated.
The numbers below are ms per training step at batch 16/12/8 on the
multi-threaded WASM-SIMD build, the current shipped baseline.
Current shipped build — multi-threaded WASM SIMD:
| Preset | Params | d_model | ctx | ms/step | tok/s |
| Small | 0.37M | 96 | 64 | 101 | 10,116 |
| Medium | 0.84M | 128 | 96 | 357 | 4,305 |
| Large | 2.74M | 192 | 128 | 1,191 | 1,289 |
| XL | 6.42M | 256 | 128 | 1,851 | 553 |
(Previously, single-threaded SIMD: ~2× slower across the board. See lever 3.)
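The tok/s column follows from ms/step once the tokens per step (batch × ctx) are known. A minimal sketch that inverts the table to recover the batch size each row implies; the batch values are inferred from the published numbers rather than read from the project config, and `implied_batch` is a name made up for this example.

```python
# Rows copied from the table above: (preset, ctx, ms_per_step, tok_per_s).
ROWS = [
    ("Small", 64, 101, 10_116),
    ("Medium", 96, 357, 4_305),
    ("Large", 128, 1_191, 1_289),
    ("XL", 128, 1_851, 553),
]

def implied_batch(ctx, ms_per_step, tok_per_s):
    """tokens per step = tok/s * seconds per step; batch = tokens per step / ctx."""
    tokens_per_step = tok_per_s * ms_per_step / 1000.0
    return round(tokens_per_step / ctx)

batches = {name: implied_batch(ctx, ms, tps) for name, ctx, ms, tps in ROWS}
print(batches)
```

The rows work out to batch 16, 16, 12 and 8, which lines up with the "batch 16/12/8" note above.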
For WebGPU on the same hardware, the first user-measured datapoint is
~7× faster than WASM SIMD: a run estimated at 7 minutes
on WASM finished in ~1 minute on WebGPU. WebGPU benchmarks across
multiple machines are still TBD — see lever 2.
How to reproduce
bash wasm/build_wasm.sh && node tests/bench_wasm.mjs
from the repo root. Reports ms/step per preset, both forward and backward.
Speed evolution — Small preset, normalized to scalar baseline
Measured + extrapolated
Baseline: 1× = single-threaded scalar WASM
Reading: each bar is the cumulative speedup over baseline
| Bar | Cumulative speedup | Status |
| scalar baseline | 1.0× | measured |
| + WASM SIMD | 1.6× | measured |
| + multi-thread (4 workers) | 3.2× | measured |
| + WebGPU naive | ~7× | end-to-end measured (Small preset) |
| + WebGPU blocked matmul | ~12× | end-to-end measured · 7.6× over WASM SIMD on Small preset · 0.5% loss drift |
| + blocked on big-matmul presets (Mega+) | ~30× | kernel measured · 5.2× kernel speedup at 2048³ |
| + WebGPU subgroups (softmax, layernorm) | ~36× | projected |
| + Flash Attention 2 (ctx 512) | ~60–80× | projected |
Solid teal bars are measured end-to-end on this codebase:
tests/test_webgpu_train.mjs compares WASM vs WebGPU final loss
after 50 steps on the same seed — the WebGPU + blocked-matmul run finishes
in 0.9 s vs 6.8 s for WASM, with 0.5% loss drift (pure float-reorder noise,
model trains identically). Bigger presets see more benefit because the
blocked kernel's win grows with matmul size — at 2048³ the standalone
kernel is 5.18× faster than naive, vs ~1.5–2× on Small-preset shapes.
Striped bars are projected from per-lever impact estimates.
orthogonal lever
Memory64 doesn't appear as a bar because it lifts the
model-size ceiling, not training throughput. At fixed Small-preset
size it's a no-op — but it's the only thing that lets the whole optimised
pipeline run on a 473M-param model in the first place (a config that
hard-OOMs the 32-bit WASM build).
1
WebAssembly SIMD in the matmul inner loop
Shipped
Impact: ~1.6× per project notes
Lives in: wasm/src/matmul.cpp
The C++ matmul is compiled twice — once scalar, once with -msimd128.
With SIMD on, each inner-loop multiply processes four f32 lanes per instruction instead of one.
docs/performance.md reports ~1.6×; current build is SIMD by default
(the numbers in the Benchmark log above are SIMD-on).
The page's "WASM SIMD" pill at top shows whether your browser actually loaded
the SIMD build. All Chromium-family browsers and Safari 16.4+ do.
Why now
Smallest cost / biggest immediate win.
Doesn't change any maths, just generates better machine code.
See docs/performance.md.
2
WebGPU forward + backward + AdamW
Shipped · ~7× measured on M-series
Measured: ~7× on Apple M-series · others TBD
Lives in: webgpu/
The full training loop runs on the GPU — all 24 kernels written in WGSL,
every one finite-difference and parity-checked against the WASM reference.
Correct end to end.
First real-hardware datapoint: a run that took ~1 min on WebGPU
on Apple M-series was estimated at ~7 min on the WASM SIMD path for the
same config — roughly 7× faster. Earlier numbers from this
project were withdrawn because they came from swiftshader (software adapter,
see docs/notes.md §10);
this is the first honestly measured speedup on real silicon.
What's next
Benchmark on NVIDIA + Intel iGPU + Snapdragon to build a per-hardware
table. Then make WebGPU the default backend when available.
3
Multi-threaded WebAssembly
Shipped · ~2× measured
Measured: ~2× across all preset sizes
Lives in: wasm/src/matmul.cpp · wasm/build_wasm.sh
matmul_forward and matmul_backward now split the
M dimension across CPU cores via std::thread. Each thread takes
a contiguous row slice; outputs don't overlap so no locks. The dB path is
the exception — it accumulates over M, so we use per-thread scratch and a
final reduction. Threading only kicks in when M ≥ 64.
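The row-slice scheme can be sketched in a few lines. This is an illustration of the partitioning only, not the code in wasm/src/matmul.cpp: the function names are hypothetical, the M ≥ 64 threshold is left out, and Python threads will not actually run the slices in parallel.

```python
import threading

def row_slices(M, workers):
    """Split range(M) into contiguous, nearly equal slices."""
    step = max(1, -(-M // workers))  # ceiling division
    return [(r, min(r + step, M)) for r in range(0, M, step)]

def matmul_rows(A, B, C, r0, r1):
    """Rows r0..r1 of C = A @ B. Slices of C are disjoint, so no locks."""
    K, N = len(B), len(B[0])
    for i in range(r0, r1):
        for n in range(N):
            C[i][n] = sum(A[i][k] * B[k][n] for k in range(K))

def grad_b_rows(A, dC, scratch, r0, r1):
    """Partial dB = A^T @ dC over rows r0..r1, into a private scratch buffer."""
    K, N = len(scratch), len(scratch[0])
    for i in range(r0, r1):
        for k in range(K):
            for n in range(N):
                scratch[k][n] += A[i][k] * dC[i][n]

def run(jobs):
    threads = [threading.Thread(target=fn, args=args) for fn, args in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def threaded_forward(A, B, workers=4):
    M, N = len(A), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    run([(matmul_rows, (A, B, C, r0, r1)) for r0, r1 in row_slices(M, workers)])
    return C

def threaded_grad_b(A, dC, workers=4):
    K, N = len(A[0]), len(dC[0])
    slices = row_slices(len(A), workers)
    scratch = [[[0.0] * N for _ in range(K)] for _ in slices]
    run([(grad_b_rows, (A, dC, s, r0, r1)) for s, (r0, r1) in zip(scratch, slices)])
    # Final reduction: every thread summed over its own rows, so add the partials.
    return [[sum(s[k][n] for s in scratch) for n in range(N)] for k in range(K)]
```

The forward pass needs no synchronisation beyond the join because each thread writes its own rows of C. dB has no such disjoint ownership (every row of A contributes to every entry), which is why it gets per-thread scratch and a reduction.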
The pthread WASM build requires SharedArrayBuffer, which requires
cross-origin isolation. The _headers file sets COOP/COEP for
Cloudflare Pages; vite.config.ts mirrors it for the dev server.
| Config | d_model | 1-thread | Threaded | Δ |
| Small | 96 | 190 ms | 101 ms | +88% |
| Medium | 128 | 693 ms | 357 ms | +94% |
| Large | 192 | 2397 ms | 1191 ms | +101% |
| XL | 256 | 3797 ms | 1851 ms | +105% |
Why only 2×, not 4-8×: the workload is memory-bandwidth
bound past ~2 threads. Each matmul reads the entire B matrix; that's the
shared bottleneck. Adding cores past the bandwidth limit gives diminishing
returns. The measurements in the table above are consistent with this explanation.
4
Tiled blocked matmul (cache-aware)
Tried · reverted (no measured win)
Measured: net wash across tested sizes
Lives in: wasm/src/matmul.cpp
Tiled matmul (Tm=32, Tn=64, Tk=32 blocks) was implemented and benchmarked
against the baseline on the same single-threaded WASM-SIMD build. The result:
| Config | d_model | Baseline | Tiled | Δ |
| Small | 96 | 190 ms | 196 ms | -3% |
| Medium | 128 | 693 ms | 690 ms | ±0% |
| Large | 192 | 2397 ms | 2248 ms | +7% |
| XL | 256 | 3797 ms | 3990 ms | -5% |
Why the theoretical prediction (1.5-2×) didn't materialise here:
the baseline matmul's inner loop is a fixed-bound for n in 0..N
that emcc -O3 -msimd128 aggressively autovectorises into f32x4
FMA chains. The tiled variant introduces variable-bound inner loops
(for n in n0..n1) that the autovectoriser handles less cleanly,
so the SIMD win shrinks just as the cache win arrives. Net: wash.
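To make the loop-bound difference concrete, here is a sketch of both loop nests. The loop order and function names are assumptions for illustration (the real kernels are C++); only the tile sizes come from the text above.

```python
def matmul_naive(A, B):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for k in range(K):
            a = A[m][k]
            for n in range(N):              # fixed bounds 0..N
                C[m][n] += a * B[k][n]
    return C

def matmul_tiled(A, B, Tm=32, Tn=64, Tk=32):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, Tm):
        for n0 in range(0, N, Tn):
            for k0 in range(0, K, Tk):
                # Clamp each tile at the matrix edge.
                m1, n1, k1 = min(m0 + Tm, M), min(n0 + Tn, N), min(k0 + Tk, K)
                for m in range(m0, m1):
                    for k in range(k0, k1):
                        a = A[m][k]
                        for n in range(n0, n1):   # variable bounds n0..n1
                            C[m][n] += a * B[k][n]
    return C
```

Both versions add the same products in the same order for every output element, so the results are identical; the only thing that changes is the order in which memory is visited and the shape of the innermost loop the compiler has to vectorise.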
What would change this
A hand-written SIMD inner kernel with statically known
tile sizes (32×4 SIMD micro-kernel + scalar epilogue) — the BLIS approach.
That's ~2 days of careful work, vs the 50-line tiled patch tried here.
5
Mixed-precision weights (fp16 / bfloat16)
Open
Potential: 2× memory, ~1.3–1.8× speed
Store weights and activations in fp16, keep gradient accumulators in fp32.
Halves memory bandwidth — which is the actual bottleneck for most matmuls
once they exceed L1.
Why not yet: the entire kernel set assumes fp32 today. Every
op (forward, backward, AdamW) would need an fp16 variant. Loss scaling would also
have to be added so small gradients don't underflow to zero in fp16. Multi-week refactor.
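A small sketch of why loss scaling matters, using Python's half-precision `struct` format to emulate an fp16 store. The gradient value and the scale factor are made up for the example; a real implementation would pick and adjust the scale dynamically.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE half precision and back (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8                  # hypothetical small gradient
LOSS_SCALE = 65536.0         # hypothetical scale factor (2**16)

unscaled = to_fp16(grad)                  # too small for fp16, becomes 0.0
scaled = to_fp16(grad * LOSS_SCALE)       # survives the fp16 round-trip
recovered = scaled / LOSS_SCALE           # unscale in higher precision

print(unscaled, recovered)
```

Stored directly, the gradient is lost entirely. Scaled before the fp16 store and unscaled afterwards in higher precision, it comes back within fp16's rounding error.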
When
After tiled matmul and after WebGPU is verified.
The complexity tax is too high to pay before the simpler levers.
6
Flash Attention
Open · increasingly relevant
Standard attention materialises an N×N score matrix in memory; Flash Attention
computes it in tiles so the full matrix never exists — saving memory and beating
naïve attention on speed by avoiding HBM round-trips.
What changed: with the new Huge/Massive/Mega presets, ctx now goes
to 256–512 — the regime where attention's share of step time goes from ~12% (ctx 64)
to ~40% (ctx 256) to ~55% (ctx 512). At Mega (ctx 512), the score matrix at
B=2, H=12, fp32 is ~25 MB per attention call — starting to hit WebGPU buffer pressure.
Estimated impact, today: ~1.18× on Massive, ~1.7× on Mega. The
ctx=512 preset is the first where Flash Attention becomes the highest-ROI open lever.
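The ~25 MB figure is plain arithmetic on the score-tensor shape. A quick check, with `score_matrix_bytes` as a hypothetical helper name:

```python
def score_matrix_bytes(batch, heads, ctx, bytes_per_elem=4):
    """Bytes needed to hold the full (batch, heads, ctx, ctx) attention scores."""
    return batch * heads * ctx * ctx * bytes_per_elem

mega = score_matrix_bytes(batch=2, heads=12, ctx=512)
print(f"{mega / 1e6:.1f} MB")   # 25.2 MB
```

Because the cost is quadratic in ctx, the same B and H at ctx 64 would need 64× less memory, which is why this lever only starts to matter at the longer-context presets.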
Cost
A new kernel from scratch — tiled forward + tiled backward in
WGSL, plus finite-difference parity tests against the existing naïve attention.
~1–2 weeks of focused work.
7
Kernel fusion (forward + loss + backward + AdamW)
Open — design tension
Potential: 1.3–1.6× for small models
Separate kernels for forward, backward, and AdamW each write their outputs
to memory and read them back. Fusing them — one big kernel that does
forward + loss + backward + weight update without ever round-tripping through
memory — kills the traffic that dominates for small models.
The tension: the project's stated principle is "every layer
can be understood." Fused kernels are notoriously unreadable. The right move
is probably a second build target — a "fast" variant alongside the "readable"
one — but that doubles maintenance.
Honest take
Worth doing only after the simpler structural wins
(multi-threading, tiled matmul, real WebGPU numbers).
8
Local Python with CUDA / Apple MPS
Shipped — escape hatch
Impact: 50–100×
Lives in: python_ref/
The Python reference runs the same model on PyTorch with full GPU support.
On an M5 Pro: 10M-param model trains at ~24 s per 1,000 steps. Practical
iteration speed for real models.
This is the right answer for anyone serious — the in-browser path is for
learning the mechanics, not training the next ChatGPT. The Diagnostics
section of the playground has the three commands you need.
9
WebAssembly Memory64 — break the 4 GB tab ceiling
In progress
Impact: ~2× model size (in fp32)
Needs: tinygpt64.{js,wasm}
V8 caps each tab's WASM heap near 4 GB on 32-bit pointers — that's
~250M fp32 params with Adam state, full stop. Memory64
(-sMEMORY64=1 -sWASM_BIGINT) switches the module to
64-bit pointers and lifts the cap into the tens of GB on Chromium 133+.
Measured: built the Memory64 variant; allocated a
473M-param model (~5.6 GB heap with Adam state) cleanly in Node — a
config that hard-OOMed on the 32-bit module. Next: wire the loader
to prefer the 64-bit build when the runtime supports it, and ship a
Behemoth preset that uses the extra ceiling.
9b
Thread-blocked matmul (4×4 register block)
Kernel measured · biggest single kernel lever
Impact: ~5.2× over naive WebGPU matmul at 2048³ (measured)
Lives in: webgpu/matmul_blocked.wgsl
Stacks two well-known wins. (a) Same 16×16 workgroup-shared
tiling as lever 10, plus (b) each of the 256 threads computes a
4×4 block of output values held in registers. Workgroup outputs
a 64×64 tile; each shared-memory load gets reused 4× across the
thread's register accumulator via outer-product structure.
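Per k-step each thread issues 8 shared-memory loads (4 from A, 4 from B)
and performs 16 fused multiply-adds, versus 2 loads for 1 fused
multiply-add in the plain tiled kernel. That pushes the kernel well
past the point where matmul becomes compute-bound rather than memory-bound.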
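The register-block idea, sketched for a single thread. This is not the WGSL in webgpu/matmul_blocked.wgsl: the function name and the load/FMA counters are illustrative, and plain lists stand in for workgroup-shared memory.

```python
def register_block(a_tile, b_tile, row0, col0, block=4):
    """One thread's work on a shared tile: a block x block patch of outputs.

    a_tile is (rows x tk) and b_tile is (tk x cols). Returns the accumulator
    together with the number of tile loads and multiply-adds performed.
    """
    tk = len(b_tile)
    acc = [[0.0] * block for _ in range(block)]   # held in registers on a GPU
    loads = fmas = 0
    for k in range(tk):
        a = [a_tile[row0 + i][k] for i in range(block)]   # `block` loads from A
        b = [b_tile[k][col0 + j] for j in range(block)]   # `block` loads from B
        loads += 2 * block
        for i in range(block):
            for j in range(block):
                acc[i][j] += a[i] * b[j]                  # outer-product update
                fmas += 1
    return acc, loads, fmas
```

With `block=4` every k-step costs 8 loads and yields 16 multiply-adds, so each loaded value is reused four times. With `block=1` (one output per thread) the same step costs 2 loads for a single multiply-add.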
Measured on M-series WebGPU:
| matmul size | naive ms | tiled ms | blocked ms | vs naive |
| 256³ | 0.66 | 0.72 | 0.45 | 1.48× |
| 512³ | 1.96 | 0.86 | 0.64 | 3.04× |
| 1024³ | 6.43 | 2.85 | 1.80 | 3.58× |
| 2048³ | 47.24 | 17.23 | 9.12 | 5.18× |
Speedup grows with matrix size because bigger problems amortize
workgroup-shared-memory loading more effectively across the 4×4
register reuse. At 2048³ (the kind of shape that shows up in
Mega and Behemoth presets) the kernel runs 5.18× faster than
the naive version and 1.89× faster than the merely-tiled one.
Open
Drop-in replacement for the naive matmul
in train.wgsl: same bind-group layout, and the output matches the
naive kernel up to floating-point summation order. Pipeline integration is
the next item.
9c
8×8 register block — tried, lost to 4×4
Honest result · register spill / lower occupancy
Impact: ~0.6–0.9× of blocked4, slower at every size (measured)
Lives in: webgpu/matmul_blocked8.wgsl
Tried scaling the register block up from 4×4 to 8×8 (workgroup
output tile 128×128 instead of 64×64). Hypothesis was that 4× the
multiply-adds per k-step (64 instead of 16, for only 2× the
shared-memory loads) would translate to ~1.5× more speedup on top
of blocked4. Lost at every size:
| matmul size | blocked4 ms | blocked8 ms | ratio |
| 256³ | 0.33 | 0.55 | 0.60× |
| 512³ | 0.54 | 0.75 | 0.72× |
| 1024³ | 1.78 | 1.96 | 0.91× |
| 2048³ | 10.15 | 11.52 | 0.88× |
Most likely cause: 64 floats per thread for the accumulator
exceeds the per-thread register budget on Apple GPUs, forcing
register spill into local memory and tanking effective compute
throughput. Lower workgroup occupancy (16 KB shared per
workgroup vs 4 KB) compounds it — fewer concurrent workgroups
per SM. Kept in the codebase as a documented negative
result. Same lesson as f16-vs-tiled: more aggressive is
not always faster; benchmark every variant.
10
Tiled matmul (workgroup-shared memory)
Kernel measured · superseded by blocked
Impact: ~2.5× over naive WebGPU matmul (measured)
Lives in: webgpu/matmul_tiled.wgsl
Classic 16×16 tiled matmul using var<workgroup>
shared memory (textbook Goto / van de Geijn pattern). Each workgroup
of 16×16 threads cooperatively loads A's and B's 16×16 blocks
into shared memory, then each thread does 16 multiply-accumulates
from shared. Effectively turns 16 global reads into 1 global +
16 shared, which on big matmuls is where the GPU starts looking
like a GPU.
Measured on M-series WebGPU, dispatch-only timing:
| matmul size | naive ms | tiled ms | speedup |
| 256³ | 0.87 | 0.37 | 2.35× |
| 512³ | 1.74 | 0.64 | 2.72× |
| 1024³ | 6.00 | 2.48 | 2.42× |
| 2048³ | 43.16 | 16.90 | 2.55× |
Clean ~2.5× across every realistic size, peak ~2.7× at 512³.
Parity validated at sizes ≤ 512.
Open
Wire into train.wgsl — every
forward + backward matmul uses the tiled kernel. Drop-in: the
bind-group layout is identical to the naive kernel, so only the
pipeline creation needs to point at the new shader source.
10b
f16-packed storage — tried, doesn't compound
Honest result · standalone win swallowed by tiling
Impact: ~1.7× vs naive, but ≤ tiled
Lives in: webgpu/matmul_f16packed.wgsl · matmul_tiled_f16.wgsl
Weights live as packed half-precision (two f16 per u32 via
pack2x16float built-ins), accumulation in f32. The
standalone version beats naive by ~1.7× by halving global
bandwidth. But once tiling is in place the
kernel is no longer bandwidth-bound — it's compute-bound on
shared-memory ops — and halving global bandwidth no longer
helps. The combined tiled+f16 kernel is the same speed
as plain tiled at 1024³ and a touch slower at 2048³ (17.78 ms
vs 16.90 ms).
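For readers unfamiliar with the packed layout, here is a CPU-side emulation of the two-f16-per-u32 format. The Python functions are stand-ins named after the WGSL built-ins and are not the project's kernel code; as in WGSL, the first value lands in the low 16 bits.

```python
import struct

def pack2x16float(lo, hi):
    """Pack two floats as f16 into one 32-bit word (first value in the low bits)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def unpack2x16float(word):
    """Inverse of pack2x16float: returns the two values as Python floats."""
    return struct.unpack("<2e", struct.pack("<I", word))

word = pack2x16float(1.0, -2.0)
print(hex(word), unpack2x16float(word))   # 0xc0003c00 (1.0, -2.0)
```

Each 32-bit word now carries two weights, which is the halving of global traffic described above. The kernel still unpacks to f32 before accumulating, so the arithmetic itself is unchanged.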
Lesson: always bench an optimization against
the best baseline, not the naive one. The ~1.7× we
measured earlier was real but not additive — it was a different
way to get the same memory-traffic win that tiling already
captures more thoroughly.
The packed kernel stays in the codebase as a reference + for
cases where the model genuinely needs more total bytes than the
GPU can hold (Behemoth-scale weight buffers), where halving
storage isn't just about speed but about fitting at all.
11
WebGPU subgroups — fast reductions
Planned
Impact: ~1.3–1.8× on softmax / layernorm
Needs: "subgroups" extension (Chrome 125+)
Replace shared-memory reduction loops in softmax / layernorm /
attention with subgroup intrinsics
(subgroupAdd, subgroupMax). One pass
instead of log₂(blockDim) passes through workgroup memory.
12
Flash Attention 2 in WGSL
Planned
Impact: ~3–5× on long-context, O(N) memory
Reference: Dao 2023 (FA2)
Tile Q, K, V across workgroups; never materialize the full attention
matrix; recompute on backward. Today attention is the wall at
ctxLen ≥ 512. FA2 makes the 1024-context Behemoth
preset realistic in the browser.
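The core trick that lets attention run tile by tile is the running (online) softmax. Below is a forward-only sketch for a single query row, with no causal mask and no backward recomputation. It illustrates the idea rather than the planned WGSL kernel, and all names are hypothetical.

```python
import math

def attention_naive(q, K, V):
    """Reference: materialise every score for one query row, then softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, V)) / z for d in range(len(V[0]))]

def attention_tiled(q, K, V, tile=4):
    """Same result, visiting K/V in tiles and keeping only running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    m = -math.inf                 # running max of the scores seen so far
    z = 0.0                       # running softmax denominator
    out = [0.0] * len(V[0])       # running un-normalised output
    for t0 in range(0, len(K), tile):
        k_tile, v_tile = K[t0:t0 + tile], V[t0:t0 + tile]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)     # shrink earlier sums to the new max
        w = [math.exp(s - m_new) for s in scores]
        z = z * rescale + sum(w)
        out = [o * rescale + sum(wi * v[d] for wi, v in zip(w, v_tile))
               for d, o in enumerate(out)]
        m = m_new
    return [o / z for o in out]
```

Only `tile` scores exist at any moment, so memory per query is independent of context length. The two functions agree up to floating-point rounding.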
13
LoRA fine-tuning in the browser
Planned
Impact: train 10×-larger frozen base models
Reference: python_ref/lora.py
Train low-rank adapters (rank 4–16) on top of a frozen base
checkpoint. Trainable parameters drop ~99%, so a 100M base + LoRA
fits comfortably where full fine-tuning would OOM. The
python_ref/lora.py implementation already exists; the
browser path needs the WASM op + a UI panel ("Adapt" beside
"Train").
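A sketch of the standard LoRA formulation, y = W x + (alpha / rank) · B (A x), together with the parameter arithmetic behind the savings. It is written independently of python_ref/lora.py, whose contents are not shown here; the layer width used in the usage line is hypothetical and not one of the project's presets.

```python
def lora_param_counts(d_in, d_out, rank):
    """Parameters in a frozen d_out x d_in weight vs. its rank-r adapter pair."""
    frozen = d_in * d_out
    trainable = rank * (d_in + d_out)     # A is rank x d_in, B is d_out x rank
    return frozen, trainable

def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B train."""
    Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
    BAx = [sum(b * h for b, h in zip(row, Ax)) for row in B]
    Wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return [wx + (alpha / rank) * d for wx, d in zip(Wx, BAx)]

frozen, trainable = lora_param_counts(d_in=1024, d_out=1024, rank=4)
print(f"{trainable / frozen:.2%} of the layer is trainable")
```

For this 1024-wide layer at rank 4 the adapter holds 8,192 parameters against 1,048,576 frozen ones. Narrower layers or higher ranks save proportionally less, so the overall reduction depends on the base model's shapes.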
14
Quantized inference (4-bit / 8-bit)
Planned
Impact: ~4× larger model loadable for sampling
Methods: GPTQ / AWQ / HQQ for fp32 → int4
Once a model is trained (or loaded), quantize to int4 / int8 for
sampling-only mode. The browser ceiling is dominated by Adam state
during training, but at inference you can drop weights to int4 and
run models 4× larger than you trained. Pairs naturally with the
Behemoth preset — train at 100M, sample at "what if it were
400M."
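As a baseline for what "quantize to int4" means mechanically, here is plain round-to-nearest quantisation with one absmax scale per weight group. This is deliberately not GPTQ, AWQ or HQQ, which choose scales and rounding more carefully; the function names are made up for the sketch.

```python
def quantize_absmax(weights, bits=4):
    """Symmetric round-to-nearest quantisation with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for int4, 127 for int8
    scale = max(abs(w) for w in weights) / qmax or 1.0   # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [qi * scale for qi in q]

weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_absmax(weights, bits=4)
print(q, dequantize(q, scale))
```

Each weight is reconstructed to within half a quantisation step (`scale / 2`). The more elaborate methods named above aim to keep model quality at the same bit width, not to change this basic storage format.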
15
Muon optimizer
Planned — experimental
Impact: ~1.4–2× faster convergence vs. AdamW
Reference: Keller Jordan 2024
Orthogonalize the matrix-shaped gradients via a Newton-Schulz
iteration before stepping (skip the embedding + output head — those
stay on AdamW). Empirically matches or beats AdamW with fewer
steps. Drop-in port from the public reference implementation.
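To show what "orthogonalize via Newton-Schulz" does, here is the classic cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X on a small matrix. The public Muon reference uses a tuned higher-order polynomial and only a handful of steps, so treat this as an illustration of the principle and not a port of it. All helper names are hypothetical.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def newton_schulz(G, steps=30):
    """Drive the singular values of G (rows <= cols, full rank) towards 1."""
    norm = sum(g * g for row in G for g in row) ** 0.5
    X = [[g / norm for g in row] for row in G]     # singular values now <= 1
    for _ in range(steps):
        XXt = matmul(X, transpose(X))
        XXtX = matmul(XXt, X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

G = [[2.0, 0.0, 1.0], [1.0, 3.0, 0.0]]     # stand-in for a gradient matrix
X = newton_schulz(G)
print(matmul(X, transpose(X)))             # close to the 2x2 identity
```

After the iteration the rows of `X` are orthonormal, so every direction in the update has the same magnitude regardless of how skewed the original gradient's singular values were.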
16
WASM Relaxed SIMD
Planned
Impact: ~1.1–1.3× on the WASM fallback path
Flag: -mrelaxed-simd
Newer SIMD ops (FMA, dot products, relaxed rounding) that the older
-msimd128 set leaves on the table. Modest but free
uplift on the CPU code path for any machine that can't use WebGPU.