
The speedup, measured — and why it's a curve

Same model, same seed, same data, same Apple M-series laptop. Two backends. The speedup grows with d_model because GPU work amortizes better as matmul shapes grow. Run it yourself: tests/test_webgpu_train.mjs.

Small · d=96 · baseline WASM SIMD (mt) · WebGPU 2.6× faster

Medium · d=128 · baseline WASM SIMD (mt) · WebGPU 6.8× faster

Large · d=192 · baseline WASM SIMD (mt) · WebGPU 9.3× faster

XL · d=256 · baseline WASM SIMD (mt) · WebGPU 12.1× faster

12.1× at XL — and growing with model size. 1.1–2.5% loss drift across the curve (pure float-reorder noise from different GPU accumulation order).

Measured via tests/test_webgpu_train.mjs. WebGPU run uses the full stack: matmul_blocked + vec4 loads, subgroup-cooperative layernorm + cross-entropy, and Flash Attention 2 forward + backward. Mega and Behemoth presets are omitted — a Memory64 ABI bug at the JS↔WASM bridge (task #66) currently blocks an honest in-browser end-to-end measurement.
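The loss drift noted above is attributed to accumulation order, and the underlying effect is easy to show in isolation: float32 addition is not associative, so summing the same numbers in a different order can give a different result. The sketch below is only an illustration with hypothetical values picked to exaggerate the effect. It is not the project's reduction code, and it does not reproduce the 1.1–2.5% figure.

```javascript
// Illustration only: float32 addition is not associative, so the order of
// accumulation changes the result. The values are hypothetical and chosen
// to make the effect obvious; this is not the project's reduction code.

// Left-to-right sum with every partial result rounded to float32.
function sumFloat32(values) {
  let acc = 0;
  for (const v of values) {
    acc = Math.fround(acc + v);
  }
  return acc;
}

const sequentialOrder = [1e8, 1, -1e8, 1];
const reorderedOrder = [1e8, -1e8, 1, 1]; // same numbers, different order

const sumA = sumFloat32(sequentialOrder);
const sumB = sumFloat32(reorderedOrder);

console.log(sumA, sumB, sumA === sumB);
```

This prints `1 2 false`: in the first order the `1` added to `1e8` is lost to float32 rounding, in the second it survives. A parallel GPU reduction groups its additions differently from a sequential CPU loop, so small differences of this kind are expected, and over many training steps they can compound.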

Why the speedup is a curve, not a single number

Every training step costs fixed_overhead + math(model_size). The overhead is roughly the same on both backends; only the math piece scales. So small models are bottlenecked on per-step overhead and look "only" 2.6× faster on the GPU; big models are bottlenecked on the math itself, and the GPU's arithmetic throughput dominates.
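The cost model above can be sketched as a few lines of JavaScript. Every number here is an assumption made up for illustration (the overhead, the raw math advantage, and the math times are not fitted to the benchmark), and `modeledSpeedup` is a hypothetical helper, not part of the codebase.

```javascript
// Toy cost model; all numbers are hypothetical, not fitted to the benchmark.
// stepTime = fixedOverhead + mathTime, with the same overhead on both backends.

function modeledSpeedup(overheadMs, wasmMathMs, gpuMathAdvantage) {
  const wasmStep = overheadMs + wasmMathMs;
  // Only the math portion gets faster on the GPU; the overhead does not.
  const gpuStep = overheadMs + wasmMathMs / gpuMathAdvantage;
  return wasmStep / gpuStep;
}

const overheadMs = 1;                  // assumed fixed per-step cost
const gpuMathAdvantage = 20;           // assumed raw math throughput ratio
const wasmMathMsBySize = [2, 10, 100]; // assumed math time, growing with model size

const modeled = wasmMathMsBySize.map((mathMs) =>
  modeledSpeedup(overheadMs, mathMs, gpuMathAdvantage)
);

console.log(modeled.map((s) => s.toFixed(1) + "×").join("  "));
```

Under these assumptions the modeled speedup rises with the math time and approaches, but never reaches, the raw math advantage. That is the same shape as the measured curve: the fixed overhead caps the ratio for small models and matters less as the math grows.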

scalar WASM baseline · 1.0× · measured

+ WASM SIMD · 1.6× · measured

+ multi-thread (4 workers) · 3.2× · measured

+ WebGPU full stack, Small (d=96) · ~8.3× · 2.6× over WASM-SIMD-mt

+ WebGPU full stack, Medium (d=128) · ~22× · 6.8× over WASM-SIMD-mt

+ WebGPU full stack, Large (d=192) · ~30× · 9.3× over WASM-SIMD-mt

+ WebGPU full stack, XL (d=256) · ~39× · 12.1× over WASM-SIMD-mt · top measured

Mega / Behemoth (projected) · ≥ 15× · projected · blocked by Memory64 ABI bug

Each "WebGPU full stack" row is the same code: blocked4 matmul + vec4 loads + subgroup-cooperative reductions + Flash Attention 2 forward and backward. These aren't independent additive levers any more — they're the shipped default WebGPU path.
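The approximate figures in the WebGPU rows appear to be the 3.2× multi-threaded WASM SIMD speedup multiplied by each preset's WebGPU-over-WASM-SIMD-mt ratio. That composition rule is inferred from the numbers rather than stated by the benchmark, and the variable names below are hypothetical. A quick check:

```javascript
// Figures copied from the rows above. The composition rule
// (cumulative = 3.2 × relative) is inferred from those figures.
const wasmSimdMtOverScalar = 3.2;
const webgpuOverWasmSimdMt = { Small: 2.6, Medium: 6.8, Large: 9.3, XL: 12.1 };

function cumulativeOverScalar(relativeSpeedup) {
  return wasmSimdMtOverScalar * relativeSpeedup;
}

const cumulative = Object.fromEntries(
  Object.entries(webgpuOverWasmSimdMt).map(([preset, relative]) => [
    preset,
    cumulativeOverScalar(relative),
  ])
);

for (const [preset, value] of Object.entries(cumulative)) {
  console.log(`${preset}: ${value.toFixed(2)}× over scalar WASM`);
}
```

The products are 8.32, 21.76, 29.76, and 38.72, which round to the ~8.3×, ~22×, ~30×, and ~39× shown for Small through XL.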

Full benchmark log + the three things that didn't work → /devlog · /roadmap