
The speedup, measured — and why it's a curve

Same model, same seed, same data, same Apple M-series laptop. Two backends. The speedup grows with d_model because GPU work amortizes better as matmul shapes grow. Run it yourself: tests/test_webgpu_train.mjs.

Small · d=96 · WebGPU 2.6× faster than WASM SIMD (mt)
Medium · d=128 · WebGPU 6.8× faster than WASM SIMD (mt)
Large · d=192 · WebGPU 9.3× faster than WASM SIMD (mt)
XL · d=256 · WebGPU 12.1× faster than WASM SIMD (mt)
12.1× at XL, and growing with model size. Loss drifts by 1.1–2.5% across the curve, which is pure float-reorder noise from the GPU's different accumulation order.
Measured via tests/test_webgpu_train.mjs. WebGPU run uses the full stack: matmul_blocked + vec4 loads, subgroup-cooperative layernorm + cross-entropy, and Flash Attention 2 forward + backward. Mega and Behemoth presets are omitted — a Memory64 ABI bug at the JS↔WASM bridge (task #66) currently blocks an honest in-browser end-to-end measurement.
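The float-reorder noise mentioned above can be reproduced in miniature. The sketch below is not TinyGPT code: `sumF32` is a hypothetical helper that emulates float32 accumulation with `Math.fround`, and the input values are chosen only to make the effect obvious. It shows that summing the same four numbers in a different order gives a different float32 result, which is the kind of discrepancy a GPU reduction can introduce relative to a sequential CPU loop.

```javascript
// Hypothetical helper: accumulate in float32 precision, left to right.
function sumF32(values) {
  let acc = 0;
  for (const v of values) {
    // Math.fround rounds the double result to the nearest float32.
    acc = Math.fround(acc + v);
  }
  return acc;
}

// Same four values, two accumulation orders. The exact sum is 2.
const orderA = [1e8, 1, -1e8, 1]; // the first +1 is absorbed by 1e8 (float32 spacing there is 8)
const orderB = [1e8, -1e8, 1, 1]; // the large terms cancel first, so both +1s survive

console.log(sumF32(orderA)); // 1
console.log(sumF32(orderB)); // 2
```

In a real training step the individual discrepancies are far smaller, but they occur in every reduction and compound over many steps.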

Why the speedup is a curve, not a single number

Every training step costs fixed_overhead + math(model_size). The overhead is roughly the same on both backends; only the math piece scales. So small models are bottlenecked on per-step overhead and look "only" 2.6× faster on the GPU; big models are bottlenecked on the math itself, and the GPU's arithmetic throughput dominates.
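A toy version of this cost model makes the shape of the curve easier to see. Everything in the sketch is an assumption for illustration: `modelSpeedup` is a hypothetical function, and the overhead, math times, and 20× raw GPU math advantage are invented constants, not values fitted to the measurements on this page.

```javascript
// Toy model: stepTime = fixedOverhead + mathTime.
// The overhead is the same on both backends; only the math term is faster on the GPU.
function modelSpeedup(overheadMs, cpuMathMs, gpuMathSpeedup) {
  const cpuStepMs = overheadMs + cpuMathMs;
  const gpuStepMs = overheadMs + cpuMathMs / gpuMathSpeedup;
  return cpuStepMs / gpuStepMs;
}

// Hypothetical constants, chosen only to show the trend.
const overheadMs = 2;
const gpuMathSpeedup = 20;
const cpuMathMsBySize = [4, 16, 64, 256]; // math cost grows with model size

const curve = cpuMathMsBySize.map((cpuMathMs) =>
  modelSpeedup(overheadMs, cpuMathMs, gpuMathSpeedup)
);

// Each entry is larger than the last, and all stay below gpuMathSpeedup.
console.log(curve.map((s) => s.toFixed(1) + "×").join("  "));
```

With a fixed overhead in both numerator and denominator, the end-to-end speedup starts well below the raw math advantage and climbs toward it as the math term grows, which matches the qualitative shape of the measured curve.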

scalar WASM baseline: 1.0× (measured)
+ WASM SIMD: 1.6× (measured)
+ multi-thread (4 workers): 3.2× (measured)
+ WebGPU full stack, Small (d=96): ~8.3× (2.6× over WASM-SIMD-mt)
+ WebGPU full stack, Medium (d=128): ~22× (6.8× over WASM-SIMD-mt)
+ WebGPU full stack, Large (d=192): ~30× (9.3× over WASM-SIMD-mt)
+ WebGPU full stack, XL (d=256): ~39× (12.1× over WASM-SIMD-mt · top measured)
Mega / Behemoth (projected): ≥ 15× (projected · blocked by Memory64 ABI bug)

Each "WebGPU full stack" row is the same code: blocked4 matmul + vec4 loads + subgroup-cooperative reductions + Flash Attention 2 forward and backward. These aren't independent additive levers any more — they're the shipped default WebGPU path.
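The approximate figures in the WebGPU rows (~8.3×, ~22×, ~30×, ~39×) appear to be the measured 3.2× multi-threaded SIMD speedup multiplied by each preset's measured WebGPU-over-SIMD-mt ratio. The sketch below redoes that arithmetic with the numbers from this page; the variable names are illustrative and not taken from the benchmark script.

```javascript
// Measured speedup of multi-threaded WASM SIMD over the scalar WASM baseline.
const simdMtOverScalar = 3.2;

// Measured WebGPU speedup over WASM SIMD (mt), per preset.
const webgpuOverSimdMt = { Small: 2.6, Medium: 6.8, Large: 9.3, XL: 12.1 };

// Speedups measured against successive baselines compose by multiplication.
const webgpuOverScalar = Object.fromEntries(
  Object.entries(webgpuOverSimdMt).map(([preset, ratio]) => [
    preset,
    simdMtOverScalar * ratio,
  ])
);

for (const [preset, total] of Object.entries(webgpuOverScalar)) {
  console.log(`${preset}: ~${total.toFixed(1)}× over scalar WASM`);
}
```

The products come out at roughly 8.3, 21.8, 29.8, and 38.7, which round to the values quoted in the rows above.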

Full benchmark log + the three things that didn't work → /devlog · /roadmap