Same model, same seed, same data, same Apple M-series laptop. Two backends.
The speedup grows with d_model because GPU work amortizes better
as matmul shapes grow. Run it yourself: tests/test_webgpu_train.mjs.
The WebGPU run uses
the full stack: matmul_blocked + vec4 loads,
subgroup-cooperative layernorm + cross-entropy, and Flash Attention 2
forward + backward. The Mega and Behemoth presets are omitted because
a Memory64 ABI bug at the JS↔WASM bridge (task #66) currently
blocks an honest in-browser end-to-end measurement.
Every training step costs fixed_overhead + math(model_size). The overhead is roughly the same on both backends; only the math piece scales. So small models are bottlenecked on per-step overhead and look "only" 2.6× faster on the GPU; big models are bottlenecked on the math itself, and the GPU's arithmetic throughput dominates.
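The overhead argument can be made concrete with a toy cost model. The sketch below is illustrative only: the function names (`stepTime`, `speedup`) and every number in it (1 ms of overhead, a 20× raw math advantage, the math times per step) are hypothetical values chosen to show the shape of the curve, not measurements from this benchmark.

```javascript
// Toy model of the claim above: step_time = fixed_overhead + math_time.
// All inputs are made-up illustrative values, not benchmark data.
function stepTime(overheadMs, mathMs) {
  return overheadMs + mathMs;
}

// Observed end-to-end speedup when both backends pay the same fixed
// overhead and the faster backend runs the math `mathSpeedup` times faster.
function speedup(overheadMs, baselineMathMs, mathSpeedup) {
  const baseline = stepTime(overheadMs, baselineMathMs);
  const accelerated = stepTime(overheadMs, baselineMathMs / mathSpeedup);
  return baseline / accelerated;
}

const overheadMs = 1;   // hypothetical per-step overhead, same on both backends
const mathSpeedup = 20; // hypothetical raw arithmetic advantage of the GPU

// Sweep the baseline math cost as a stand-in for growing model size.
for (const baselineMathMs of [2, 20, 200]) {
  const s = speedup(overheadMs, baselineMathMs, mathSpeedup);
  console.log(`math=${baselineMathMs}ms -> end-to-end speedup ${s.toFixed(2)}x`);
}
```

In this model the end-to-end speedup rises with the math cost and approaches, but never reaches, the raw math advantage. That is the same pattern described above: small models are capped by overhead, large models by arithmetic throughput.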
Each "WebGPU full stack" row is the same code: blocked4 matmul + vec4 loads + subgroup-cooperative reductions + Flash Attention 2 forward and backward. These are no longer independent additive levers; together they are the shipped default WebGPU path.