
The performance journey

roadmap

A live ledger of every lever that would make in-browser training of TinyGPT faster — what's shipped, what's blocked, what's open, and the honest reason each is in the state it's in. ~80% of training time is matmul; most of these levers attack it from a different angle.

Shipped: running today
Partial: exists but unverified
Blocked: external constraint
Open: not started

Benchmark log

Measured
Machine: Apple M-series
Build: emcc -O3 -msimd128
Driver: tests/bench_wasm.mjs

Every number in this log was measured on this machine, not extrapolated. The figures below are ms per training step, at batch 16, 12 or 8 depending on the preset, on the multi-threaded WASM-SIMD build, which is the current shipped baseline.

Current shipped build — multi-threaded WASM SIMD:

Preset   Params   d_model   ctx   ms/step   tok/s
Small    0.37M    96        64    101       10,116
Medium   0.84M    128       96    357       4,305
Large    2.74M    192       128   1,191     1,289
XL       6.42M    256       128   1,851     553

(Previously, single-threaded SIMD: ~2× slower across the board. See lever 3.)
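The tok/s column follows from ms/step once batch size and context length are fixed, so the table can be cross-checked by hand. The sketch below recovers the batch size each row implies; the helper name and the hard-coded rows (copied from the table above) are illustrative and not part of tests/bench_wasm.mjs.

```python
# Rows copied from the benchmark table: (preset, ctx, ms_per_step, tok_per_s).
ROWS = [
    ("Small", 64, 101, 10116),
    ("Medium", 96, 357, 4305),
    ("Large", 128, 1191, 1289),
    ("XL", 128, 1851, 553),
]

def implied_batch(ctx, ms_per_step, tok_per_s):
    """Tokens per step = tok/s * seconds per step; divide by ctx to get the batch."""
    tokens_per_step = tok_per_s * ms_per_step / 1000.0
    return round(tokens_per_step / ctx)

batches = {name: implied_batch(ctx, ms, tps) for name, ctx, ms, tps in ROWS}
print(batches)  # {'Small': 16, 'Medium': 16, 'Large': 12, 'XL': 8}
```

Read this way, Small and Medium run at batch 16, Large at 12 and XL at 8, which matches the 16/12/8 note.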

For WebGPU on the same hardware, the first user-measured datapoint is ~7× faster than WASM SIMD: a run estimated at 7 minutes on WASM finished in ~1 minute on WebGPU. WebGPU benchmarks across multiple machines are still TBD — see lever 2.

How to reproduce: run bash wasm/build_wasm.sh && node tests/bench_wasm.mjs from the repo root. It reports ms/step per preset, for both forward and backward.

Speed evolution — Small preset, normalized to scalar baseline

Measured + extrapolated
Baseline: single-threaded scalar WASM
Reading: each bar is the cumulative speedup over baseline

Solid teal bars are measured end-to-end on this codebase: tests/test_webgpu_train.mjs compares WASM vs WebGPU final loss after 50 steps on the same seed — the WebGPU + blocked-matmul run finishes in 0.9 s vs 6.8 s for WASM, with 0.5% loss drift (pure float-reorder noise, model trains identically). Bigger presets see more benefit because the blocked kernel's win grows with matmul size — at 2048³ the standalone kernel is 5.18× faster than naive, vs ~1.5–2× on Small-preset shapes. Striped bars are projected from per-lever impact estimates.

Orthogonal lever: Memory64 doesn't appear as a bar because it lifts the model-size ceiling, not training throughput. At fixed Small-preset size it's a no-op, but it's the only thing that lets the whole optimised pipeline run on a 473M-param model in the first place (a config that hard-OOMs the 32-bit WASM build).

1

WebAssembly SIMD in the matmul inner loop

Shipped
Impact: ~1.6× per project notes
Lives in: wasm/src/matmul.cpp

The C++ matmul is compiled twice: once scalar, once with -msimd128. With SIMD on, each inner-loop multiply operates on four f32 lanes at a time instead of one. docs/performance.md reports ~1.6×; the current build is SIMD by default (the numbers in the Benchmark log above are SIMD-on).

The page's "WASM SIMD" pill at top shows whether your browser actually loaded the SIMD build. All Chromium-family browsers and Safari 16.4+ do.

Why now: Smallest cost / biggest immediate win. Doesn't change any maths, just generates better machine code. See docs/performance.md.

2

WebGPU forward + backward + AdamW

Shipped · ~7× measured on M-series
Measured: ~7× on Apple M-series · others TBD
Lives in: webgpu/

The full training loop runs on the GPU: all 24 kernels are written in WGSL, and every one is checked against finite differences and for parity with the WASM reference. Correct end to end.

First real-hardware datapoint: a run that took ~1 min on WebGPU on Apple M-series was estimated at ~7 min on the WASM SIMD path for the same config — roughly 7× faster. Earlier numbers from this project were withdrawn because they came from swiftshader (software adapter, see docs/notes.md §10); this is the first honestly measured speedup on real silicon.

What's next: Benchmark on NVIDIA + Intel iGPU + Snapdragon to build a per-hardware table. Then make WebGPU the default backend when available.

3

Multi-threaded WebAssembly

Shipped · ~2× measured
Measured: ~2× across all preset sizes
Lives in: wasm/src/matmul.cpp · wasm/build_wasm.sh

matmul_forward and matmul_backward now split the M dimension across CPU cores via std::thread. Each thread takes a contiguous row slice; outputs don't overlap so no locks. The dB path is the exception — it accumulates over M, so we use per-thread scratch and a final reduction. Threading only kicks in when M ≥ 64.
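The partitioning described above can be sketched in a few lines. This is an illustration of the row-slicing and per-thread-scratch idea only, not the code in wasm/src/matmul.cpp: the function bodies, the default of 4 workers and the helper row_slices are assumptions, and Python threads won't show the speedup because of the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_ROWS_FOR_THREADS = 64  # threshold quoted in the text

def row_slices(m, n_threads):
    """Split range(m) into at most n_threads contiguous, non-overlapping slices."""
    base, extra = divmod(m, n_threads)
    slices, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        if size:
            slices.append((start, start + size))
        start += size
    return slices

def matmul_forward(a, b, n_threads=4):
    """C = A @ B; each worker writes only its own rows of C, so no locks are needed."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]

    def work(lo, hi):
        for i in range(lo, hi):
            for p in range(k):
                a_ip = a[i][p]
                for j in range(n):
                    c[i][j] += a_ip * b[p][j]

    if m < MIN_ROWS_FOR_THREADS or n_threads <= 1:
        work(0, m)
        return c
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [pool.submit(work, lo, hi) for lo, hi in row_slices(m, n_threads)]
        for f in futures:
            f.result()  # re-raise any worker exception
    return c

def matmul_backward_db(a, dc, n_threads=4):
    """dB = A^T @ dC sums over M, so each worker fills private scratch first."""
    m, k, n = len(a), len(a[0]), len(dc[0])

    def work(lo, hi):
        scratch = [[0.0] * n for _ in range(k)]
        for i in range(lo, hi):
            for p in range(k):
                a_ip = a[i][p]
                for j in range(n):
                    scratch[p][j] += a_ip * dc[i][j]
        return scratch

    slices = row_slices(m, n_threads) if m >= MIN_ROWS_FOR_THREADS else [(0, m)]
    with ThreadPoolExecutor(max_workers=max(1, len(slices))) as pool:
        partials = list(pool.map(lambda s: work(*s), slices))
    # Final reduction: add the per-thread scratch buffers together.
    db = [[0.0] * n for _ in range(k)]
    for part in partials:
        for p in range(k):
            for j in range(n):
                db[p][j] += part[p][j]
    return db
```

The forward path needs no synchronisation because the row slices are disjoint; only the dB path pays for an extra K×N buffer per thread plus the reduction at the end.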

The pthread WASM build requires SharedArrayBuffer, which requires cross-origin isolation. The _headers file sets COOP/COEP for Cloudflare Pages; vite.config.ts mirrors it for the dev server.
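For reference, cross-origin isolation is normally enabled by the header pair Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp. The contents of this project's _headers file aren't reproduced on this page, so the values below are the commonly documented ones rather than a copy of it, and the helper is a hypothetical check you could run against a response's headers.

```python
# Commonly documented header values for cross-origin isolation (assumed here,
# not copied from the project's _headers file).
REQUIRED = {
    "cross-origin-opener-policy": "same-origin",
    "cross-origin-embedder-policy": "require-corp",
}

def is_cross_origin_isolated(headers):
    """Return True if both isolation headers are present with the expected values."""
    normalized = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return all(normalized.get(name) == value for name, value in REQUIRED.items())

print(is_cross_origin_isolated({
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
}))  # True
```

If either header is missing, SharedArrayBuffer is unavailable and the pthread build cannot start.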

Config   d_model   1-thread   Threaded   Δ
Small    96        190 ms     101 ms     +88%
Medium   128       693 ms     357 ms     +94%
Large    192       2397 ms    1191 ms    +101%
XL       256       3797 ms    1851 ms    +105%

Why only 2×, not 4–8×: the workload is memory-bandwidth bound past ~2 threads. Each matmul reads the entire B matrix, and that shared read is the bottleneck, so adding cores past the bandwidth limit gives diminishing returns. The measurements above are consistent with this explanation.

4

Tiled blocked matmul (cache-aware)

Tried · reverted (no measured win)
Measured: net wash across tested sizes
Lives in: wasm/src/matmul.cpp

Tiled matmul (Tm=32, Tn=64, Tk=32 blocks) was implemented and benchmarked against the baseline on the same single-threaded WASM-SIMD build. The result:

Config   d_model   Baseline   Tiled     Δ
Small    96        190 ms     196 ms    -3%
Medium   128       693 ms     690 ms    ±0%
Large    192       2397 ms    2248 ms   +6.7%
XL       256       3797 ms    3990 ms   -5%

Why the theoretical prediction (1.5–2×) didn't materialise here: the baseline matmul's inner loop is a fixed-bound for n in 0..N that emcc -O3 -msimd128 aggressively autovectorises into chains of f32x4 multiplies and adds. The tiled variant introduces variable-bound inner loops (for n in n0..n1) that the autovectoriser handles less cleanly, so the SIMD win shrinks just as the cache win arrives. Net: a wash.
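The difference between the two loop shapes is easier to see side by side. The sketch below is a plain-Python illustration using the tile sizes quoted above; the function names are hypothetical and the real patch was C++, so this shows only the loop structure, not the performance.

```python
TM, TN, TK = 32, 64, 32  # tile sizes quoted in the text

def matmul_naive(a, b):
    """Baseline shape: the innermost loop always runs over the full 0..N range."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):              # fixed bound 0..N
                c[i][j] += a_ip * b[p][j]
    return c

def matmul_tiled(a, b, tm=TM, tn=TN, tk=TK):
    """Tiled shape: the same arithmetic, visited one (tm x tk) x (tk x tn) block at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tm):
        i1 = min(i0 + tm, m)
        for p0 in range(0, k, tk):
            p1 = min(p0 + tk, k)
            for n0 in range(0, n, tn):
                n1 = min(n0 + tn, n)
                for i in range(i0, i1):
                    for p in range(p0, p1):
                        a_ip = a[i][p]
                        for j in range(n0, n1):   # variable bound n0..n1
                            c[i][j] += a_ip * b[p][j]
    return c
```

Both versions add the terms of each output element in the same order, so they produce identical results; the only thing that changes is which memory is touched when, and how predictable the inner loop bound is for the compiler.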

What would change this: A hand-written SIMD inner kernel with statically known tile sizes (32×4 SIMD micro-kernel + scalar epilogue), which is the BLIS approach. That's ~2 days of careful work, vs the 50-line tiled patch tried here.

5

Mixed-precision weights (fp16 / bfloat16)

Open
Potential: roughly half the memory traffic, ~1.3–1.8× speed

Store weights and activations in fp16, keep gradient accumulators in fp32. This halves memory traffic, which is the actual bottleneck for most matmuls once they exceed L1.

Why not yet: the entire kernel set assumes fp32 today. Every op (forward, backward, AdamW) would need an fp16 variant, and loss scaling would have to be added so that small gradients don't underflow to denormals or zero. Multi-week refactor.
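To make the underflow concern concrete, here is a minimal sketch of what loss scaling buys, using Python's struct module to round values to half precision. The scale factor of 65536 and the function names are illustrative assumptions, not values or APIs from this project, and Python floats stand in for the fp32 accumulator.

```python
import struct

def to_fp16(x):
    """Round a Python float to IEEE-754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 65536.0  # illustrative power of two, not a tuned value

def backward_unscaled(grad):
    """Gradient stored directly in fp16: tiny values flush to zero."""
    return to_fp16(grad)

def backward_scaled(grad, scale=LOSS_SCALE):
    """Scale before the fp16 store, then unscale in higher precision."""
    stored = to_fp16(grad * scale)   # what the fp16 buffer would hold
    return stored / scale            # unscaled value for the wide accumulator

tiny = 1e-8
print(backward_unscaled(tiny), backward_scaled(tiny))
```

The unscaled path returns exactly 0.0 for a gradient of 1e-8, while the scaled path recovers it to within fp16's relative precision. A real implementation would also need to detect overflow and adjust the scale, which this sketch leaves out.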

When: After tiled matmul and after WebGPU is verified. The complexity tax is too high to pay before the simpler levers.

6

Flash Attention

Open · increasingly relevant
Potential: 1.15–2× total, scaling with ctx
Paper: Dao et al. 2022

Standard attention materialises an N×N score matrix in memory; Flash Attention computes it in tiles so the full matrix never exists — saving memory and beating naïve attention on speed by avoiding HBM round-trips.
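The trick that lets the full matrix never exist is the streaming ("online") softmax: keep a running maximum, a running normaliser and a running weighted sum, and rescale them whenever a new tile raises the maximum. The sketch below shows that for a single query row in plain Python. It omits the 1/sqrt(d) scaling, masking and the backward pass, and the function names and tile size are illustrative, so treat it as the core idea rather than a Flash Attention kernel.

```python
import math

def attention_naive(q, keys, values):
    """One query row: materialise all scores, softmax, then weight the values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    """Same result, visiting K/V in tiles and keeping only running statistics."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

At any moment the tiled version holds one tile of scores plus a handful of running values, instead of one score per key.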

What changed: with the new Huge/Massive/Mega presets, ctx now goes to 256–512 — the regime where attention's share of step time goes from ~12% (ctx 64) to ~40% (ctx 256) to ~55% (ctx 512). At Mega (ctx 512), the score matrix at B=2, H=12, fp32 is ~25 MB per attention call — starting to hit WebGPU buffer pressure.
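The ~25 MB figure is just the size of a (B, H, ctx, ctx) fp32 tensor. A small helper (the name is illustrative) makes the quadratic growth in ctx explicit:

```python
def score_matrix_bytes(batch, heads, ctx, bytes_per_elem=4):
    """Bytes needed to materialise the full (batch, heads, ctx, ctx) score tensor."""
    return batch * heads * ctx * ctx * bytes_per_elem

mega = score_matrix_bytes(batch=2, heads=12, ctx=512)
print(mega, round(mega / 1e6, 1))  # 25165824 25.2
```

Doubling ctx to 1024 at the same B and H quadruples this to roughly 100 MB per attention call.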

Estimated impact, today: ~1.18× on Massive, ~1.7× on Mega. The ctx=512 preset is the first where Flash Attention becomes the highest-ROI open lever.

Cost: A new kernel from scratch (tiled forward + tiled backward in WGSL), plus finite-difference parity tests against the existing naïve attention. ~1–2 weeks of focused work.

7

Kernel fusion (forward + loss + backward + AdamW)

Open — design tension
Potential: 1.3–1.6× for small models

Separate kernels for forward, backward, and AdamW each write their outputs to memory and read them back. Fusing them — one big kernel that does forward + loss + backward + weight update without ever round-tripping through memory — kills the traffic that dominates for small models.

The tension: the project's stated principle is "every layer can be understood." Fused kernels are notoriously unreadable. The right move is probably a second build target — a "fast" variant alongside the "readable" one — but that doubles maintenance.

Honest take: Worth doing only after the simpler structural wins (multi-threading, tiled matmul, real WebGPU numbers).

8

Local Python with CUDA / Apple MPS

Shipped — escape hatch
Impact: 50–100×
Lives in: python_ref/

The Python reference runs the same model on PyTorch with full GPU support. On an M5 Pro, a 10M-param model trains at ~24 s per 1,000 steps, which is a practical iteration speed for real models.

This is the right answer for anyone serious — the in-browser path is for learning the mechanics, not training the next ChatGPT. The Diagnostics section of the playground has the three commands you need.

9

WebAssembly Memory64 — break the 4 GB tab ceiling

In progress
Impact: ~2× model size (in fp32)
Needs: tinygpt64.{js,wasm}

V8 caps each tab's WASM heap near 4 GB on 32-bit pointers — that's ~250M fp32 params with Adam state, full stop. Memory64 (-sMEMORY64=1 -sWASM_BIGINT) switches the module to 64-bit pointers and lifts the cap into the tens of GB on Chromium 133+.

Measured: built the Memory64 variant; allocated a 473M-param model (~5.6 GB heap with Adam state) cleanly in Node — a config that hard-OOMed on the 32-bit module. Next: wire the loader to prefer the 64-bit build when the runtime supports it, and ship a Behemoth preset that uses the extra ceiling.
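The ceiling numbers can be sanity-checked with back-of-envelope arithmetic. Which parameter-sized buffers the build actually keeps resident is not stated on this page, so the buffer counts below (4 with a separate gradient buffer, 3 without) are assumptions, activations are ignored, and the helper names are hypothetical.

```python
FP32_BYTES = 4

def training_heap_bytes(n_params, buffers_per_param):
    """Heap for the parameter-sized fp32 buffers only (activations excluded)."""
    return n_params * buffers_per_param * FP32_BYTES

def max_params(heap_bytes, buffers_per_param):
    """Largest parameter count whose buffers fit in heap_bytes."""
    return heap_bytes // (buffers_per_param * FP32_BYTES)

FOUR_GIB = 4 * 2**30
# Assumption: weights + gradients + Adam m + Adam v = 4 buffers per parameter.
print(max_params(FOUR_GIB, 4))               # 268435456, i.e. ~268M params
# Assumption: weights + Adam m + Adam v only = 3 buffers per parameter.
print(training_heap_bytes(473_000_000, 3))   # 5676000000 bytes, ~5.7 GB
```

Under the first assumption a 4 GiB heap tops out in the same ballpark as the ~250M figure above; under the second, a 473M-param model lands near the ~5.6 GB that was measured. Both are order-of-magnitude checks, not a description of the actual memory layout.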

9b

Thread-blocked matmul (4×4 register block)

Kernel measured · biggest single kernel lever
Impact: ~5.2× over naive WebGPU matmul at 2048³ (measured)
Lives in: webgpu/matmul_blocked.wgsl

Stacks two well-known wins: (a) the same 16×16 workgroup-shared tiling as lever 10, plus (b) each of the 256 threads computes a 4×4 block of output values held in registers. The workgroup outputs a 64×64 tile; each shared-memory load gets reused 4× across the thread's register accumulator via the outer-product structure. Per inner-loop step a thread issues 16 fused multiply-adds for 8 shared-memory loads instead of 1 for 2, which pushes the kernel well past the point where matmul becomes compute-bound rather than memory-bound.
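The register-blocking idea can be sketched as a per-thread micro-kernel: at each k-step a thread reads r values from the A tile and r from the B tile, then performs an r×r outer-product update on accumulators it keeps in registers. The Python below illustrates that structure and the resulting load/FMA counts; it is not a translation of webgpu/matmul_blocked.wgsl, and the function names are made up for the example.

```python
def micro_kernel(a_col, b_row, acc):
    """One k-step: rank-1 (outer-product) update of a thread's register block."""
    for i, a in enumerate(a_col):
        for j, b in enumerate(b_row):
            acc[i][j] += a * b
    return acc

def per_step_cost(r):
    """Shared-memory loads and FMAs one thread issues per k-step for an r x r block."""
    loads = 2 * r          # r values from the A tile + r values from the B tile
    fmas = r * r           # every loaded A value meets every loaded B value
    return loads, fmas, fmas / loads

def blocked_tile(a, b, row0, col0, r=4):
    """Compute the r x r block of A @ B whose top-left corner is (row0, col0)."""
    acc = [[0.0] * r for _ in range(r)]
    for k in range(len(b)):
        a_col = [a[row0 + i][k] for i in range(r)]
        b_row = [b[k][col0 + j] for j in range(r)]
        micro_kernel(a_col, b_row, acc)
    return acc

for r in (1, 4, 8):
    print(r, per_step_cost(r))
```

For r = 1 this gives 2 loads per FMA-step (0.5 FMAs per load), for r = 4 it gives 16 FMAs over 8 loads (2 per load), and for r = 8 it gives 64 FMAs over 16 loads at the cost of 64 accumulators per thread.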

Measured on M-series WebGPU:

matmul size   naive ms   tiled ms   blocked ms   vs naive
256³          0.66       0.72       0.45         1.48×
512³          1.96       0.86       0.64         3.04×
1024³         6.43       2.85       1.80         3.58×
2048³         47.24      17.23      9.12         5.18×

Speedup grows with matrix size because bigger problems amortize workgroup-shared-memory loading more effectively across the 4×4 register reuse. At 2048³ (the kind of shape that shows up in Mega and Behemoth presets) the kernel runs 5.18× faster than the naive version and 1.89× faster than the merely-tiled one.

Open: Drop-in replacement for the naive matmul in train.wgsl, with the same bind-group layout and output that matches up to floating-point reordering. Pipeline integration is the next item.

9c

8×8 register block — tried, lost to 4×4

Honest result · register spill / lower occupancy
Impact: 0.60–0.91× of blocked4 depending on size (measured)
Lives in: webgpu/matmul_blocked8.wgsl

Tried scaling the register block up from 4×4 to 8×8 (workgroup output tile 128×128 instead of 64×64). The hypothesis was that 4× the multiply-adds per inner-loop step (64 instead of 16, for only twice the shared-memory loads) would translate to ~1.5× more speedup on top of blocked4. It lost at every size:

matmul size   blocked4 ms   blocked8 ms   ratio
256³          0.33          0.55          0.60×
512³          0.54          0.75          0.72×
1024³         1.78          1.96          0.91×
2048³         10.15         11.52         0.88×

Most likely cause: 64 floats per thread for the accumulator exceeds the per-thread register budget on Apple GPUs, forcing register spill into local memory and tanking effective compute throughput. Lower workgroup occupancy (16 KB shared per workgroup vs 4 KB) compounds it, leaving fewer concurrent workgroups per GPU core. Kept in the codebase as a documented negative result. Same lesson as f16-vs-tiled: more aggressive is not always faster; benchmark every variant.

10

Tiled matmul (workgroup-shared memory)

Kernel measured · superseded by blocked
Impact: ~2.5× over naive WebGPU matmul (measured)
Lives in: webgpu/matmul_tiled.wgsl

Classic 16×16 tiled matmul using var<workgroup> shared memory (the textbook Goto / van de Geijn pattern). Each workgroup of 16×16 threads cooperatively loads A's and B's 16×16 blocks into shared memory, then each thread does 16 multiply-accumulates from shared. Effectively turns 16 global reads into 1 global + 16 shared, which on big matmuls is where the GPU starts looking like a GPU.

Measured on M-series WebGPU, dispatch-only timing:

matmul size   naive ms   tiled ms   speedup
256³          0.87       0.37       2.35×
512³          1.74       0.64       2.72×
1024³         6.00       2.48       2.42×
2048³         43.16      16.90      2.55×

Clean ~2.5× across every realistic size, peak ~2.7× at 512³. Parity validated at sizes ≤ 512.

Open: Wire into train.wgsl so that every forward and backward matmul uses the tiled kernel. Drop-in: the bind-group layout is identical to the naive kernel, so only the pipeline creation needs to point at the new shader source.

10b

f16-packed storage — tried, doesn't compound

Honest result · standalone win swallowed by tiling
Impact: ~1.7× vs naive, but ≤ tiled
Lives in: webgpu/matmul_f16packed.wgsl · matmul_tiled_f16.wgsl

Weights live as packed half-precision (two f16 per u32 via pack2x16float built-ins), accumulation in f32. The standalone version beats naive by ~1.7× by halving global bandwidth. But once tiling is in place the kernel is no longer bandwidth-bound — it's compute-bound on shared-memory ops — and halving global bandwidth no longer helps. The combined tiled+f16 kernel is the same speed as plain tiled at 1024³ and a touch slower at 2048³ (17.78 ms vs 16.90 ms).
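The packing itself is simple to show on the CPU. The sketch below mimics the WGSL pack2x16float / unpack2x16float pair with Python's struct module (first component in the low 16 bits) and uses it in a dot product with a wide accumulator; dot_packed is a hypothetical helper, not a function from the kernels named above.

```python
import struct

def pack2x16float(x, y):
    """Two floats -> one u32; x lands in the low 16 bits, y in the high 16 bits."""
    return struct.unpack("<I", struct.pack("<2e", x, y))[0]

def unpack2x16float(word):
    """Inverse of pack2x16float: one u32 -> two half-precision values as floats."""
    return struct.unpack("<2e", struct.pack("<I", word))

def dot_packed(packed_weights, activations):
    """Dot product over packed f16 weights with a full-precision accumulator."""
    acc = 0.0
    for i, word in enumerate(packed_weights):
        w0, w1 = unpack2x16float(word)
        acc += w0 * activations[2 * i] + w1 * activations[2 * i + 1]
    return acc

print(hex(pack2x16float(1.0, -2.0)))  # 0xc0003c00
```

Each u32 read from the weight buffer yields two weights, which is where the halved global bandwidth comes from.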

Lesson: always bench an optimization against the best baseline, not the naive one. The ~1.7× we measured earlier was real but not additive — it was a different way to get the same memory-traffic win that tiling already captures more thoroughly.

The packed kernel stays in the codebase as a reference and for cases where the model genuinely needs more total bytes than the GPU can hold (Behemoth-scale weight buffers), where halving storage isn't just about speed but about fitting at all.

11

WebGPU subgroups — fast reductions

Planned
Impact: ~1.3–1.8× on softmax / layernorm
Needs: "subgroups" extension (Chrome 125+)

Replace shared-memory reduction loops in softmax / layernorm / attention with subgroup intrinsics (subgroupAdd, subgroupMax). One pass instead of log₂(blockDim) passes through workgroup memory.
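A CPU-side sketch of the two reduction shapes shows where the pass count comes from. It assumes a power-of-two workgroup size and a subgroup size of 32 (the real subgroup size is hardware-dependent), and Python's sum stands in for subgroupAdd, so it models the structure only.

```python
def tree_reduce(values):
    """Shared-memory style reduction: halve the active range each pass."""
    buf = list(values)
    stride, passes = len(buf) // 2, 0
    while stride >= 1:
        for i in range(stride):       # on a GPU these iterations run in parallel
            buf[i] += buf[i + stride]
        stride //= 2
        passes += 1                   # each pass ends with a workgroup barrier
    return buf[0], passes

def subgroup_reduce(values, subgroup_size=32):
    """Subgroup style: one hardware add per subgroup, then combine the partials."""
    partials = [sum(values[i:i + subgroup_size])        # stands in for subgroupAdd
                for i in range(0, len(values), subgroup_size)]
    return sum(partials), len(partials)

data = [float(i) for i in range(256)]
print(tree_reduce(data))      # (32640.0, 8)
print(subgroup_reduce(data))  # (32640.0, 8)
```

For a 256-wide workgroup the tree version needs 8 barrier-separated passes through workgroup memory, while the subgroup version does one intrinsic per subgroup and leaves only 8 partial sums to combine.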

12

Flash Attention 2 in WGSL

Planned
Impact: ~3–5× on long-context, O(N) memory
Reference: Dao 2023 (FA2)

Tile Q, K, V across workgroups; never materialize the full attention matrix; recompute on backward. Today attention is the wall at ctxLen ≥ 512. FA2 makes the 1024-context Behemoth preset realistic in the browser.

13

LoRA fine-tuning in the browser

Planned
Impact: train 10×-larger frozen base models
Reference: python_ref/lora.py

Train low-rank adapters (rank 4–16) on top of a frozen base checkpoint. Trainable parameters drop ~99%, so a 100M base + LoRA fits comfortably where full fine-tuning would OOM. The python_ref/lora.py implementation already exists; the browser path needs the WASM op + a UI panel ("Adapt" beside "Train").
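The ~99% figure falls out of the parameter counts. Below is a minimal sketch of the count and of the adapted forward pass for one weight matrix; it is not taken from python_ref/lora.py, and the 1024×1024 shape, the rank of 4 and the alpha/rank scaling convention are illustrative assumptions.

```python
def lora_param_counts(d_in, d_out, rank):
    """(full, adapter) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out
    adapter = rank * (d_in + d_out)     # A is d_in x rank, B is rank x d_out
    return full, adapter

def lora_forward(x, w, a, b, alpha, rank):
    """y = x @ W + (alpha / rank) * (x @ A) @ B, with W frozen."""
    def vecmat(v, m):
        return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]
    base = vecmat(x, w)
    delta = vecmat(vecmat(x, a), b)
    scale = alpha / rank
    return [y + scale * d for y, d in zip(base, delta)]

full, adapter = lora_param_counts(1024, 1024, 4)
print(adapter / full)   # 0.0078125, i.e. under 1% of the full matrix
```

Only A and B receive gradients and optimizer state, so the Adam memory that dominates full fine-tuning shrinks by the same ratio.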

14

Quantized inference (4-bit / 8-bit)

Planned
Impact: ~4× larger model loadable for sampling
Methods: GPTQ / AWQ / HQQ for fp32 → int4

Once a model is trained (or loaded), quantize to int4 / int8 for sampling-only mode. The browser ceiling is dominated by Adam state during training, but at inference you can drop weights to int4 and run models 4× larger than you trained. Pairs naturally with the Behemoth preset — train at 100M, sample at "what if it were 400M."
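As a baseline for what "quantize to int4" means, here is a sketch of plain symmetric round-to-nearest quantization with one scale per group. This is deliberately the simplest scheme, not GPTQ, AWQ or HQQ; the group size of 32 and the function names are assumptions for illustration.

```python
def quantize_int4(weights, group_size=32):
    """Symmetric round-to-nearest int4: one float scale per group of weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        peak = max(abs(w) for w in group)
        scale = peak / 7.0 if peak > 0 else 1.0     # map [-peak, peak] to [-7, 7]
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_int4(codes, scales, group_size=32):
    """Reconstruct approximate weights from int4 codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

weights = [0.7, -0.35, 0.1, 0.0, 0.02, -0.69]
codes, scales = quantize_int4(weights, group_size=6)
print(codes, dequantize_int4(codes, scales, group_size=6))
```

Each weight is stored as a 4-bit code plus a shared scale, and the reconstruction error per weight is at most half a quantization step. The methods named above improve on this by choosing scales and roundings that minimise the effect on layer outputs.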

15

Muon optimizer

Planned — experimental
Impact: ~1.4–2× faster convergence vs. AdamW
Reference: Keller Jordan 2024

Orthogonalize the matrix-shaped gradients via a Newton-Schulz iteration before stepping (skip the embedding + output head — those stay on AdamW). Empirically matches or beats AdamW with fewer steps. Drop-in port from the public reference implementation.
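To show what "orthogonalize via Newton-Schulz" does, the sketch below uses the classic cubic iteration X ← 1.5·X − 0.5·X·Xᵀ·X on a Frobenius-normalised gradient. The public Muon reference is understood to use a tuned higher-order polynomial with few steps, so this is a swapped-in textbook variant chosen because it converges cleanly, not a port; the helper names and the step count are assumptions.

```python
def matmul(a, b):
    """Plain dense matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g with the cubic iteration
    X <- 1.5 X - 0.5 X X^T X, after scaling so all singular values are <= 1.
    Assumes g is non-zero and full rank."""
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

grad = [[3.0, 1.0], [0.0, 2.0]]
update = newton_schulz_orthogonalize(grad)
print(matmul(update, transpose(update)))  # close to the 2x2 identity
```

The result keeps the gradient's singular directions but sets every singular value to roughly 1, which is the update shape the optimizer steps with.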

16

WASM Relaxed SIMD

Planned
Impact: ~1.1–1.3× on the WASM fallback path
Flag: -mrelaxed-simd

Newer SIMD ops (FMA, dot products, relaxed rounding) that the older -msimd128 set leaves on the table. Modest but free uplift on the CPU code path for any machine that can't use WebGPU.