Same model, same seed, same data, same Apple M-series laptop. Just two backends.
Run it yourself: tests/test_webgpu_train.mjs at the repo root.
Both runs use the same seed, identical tokenization, and identical
AdamW state. The WebGPU run uses the matmul_blocked kernel (this work),
subgroup-cooperative layernorm and cross-entropy, and the fused
softmax+value attention forward.
Each row is the cumulative speedup over a single-threaded scalar WASM baseline. Solid = measured; striped = projected.
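As a minimal sketch of how a cumulative speedup row can be computed: divide the baseline step time by the step time with every optimization up to that row enabled. The stage names and timings here are hypothetical placeholders, not measurements from this work, and cumulativeSpeedups is an illustrative helper, not part of the repo.

```javascript
// Hypothetical step times in milliseconds. These are placeholders chosen
// for easy arithmetic, NOT measured or projected results from this work.
const baselineMs = 1000; // assumed single-threaded scalar WASM step time
const stages = [
  { name: "stage A", stepMs: 500, measured: true },
  { name: "stage B", stepMs: 250, measured: true },
  { name: "stage C", stepMs: 125, measured: false }, // projected, not run
];

// Cumulative speedup for a row: baseline time divided by the step time
// with all optimizations up to and including that row enabled.
function cumulativeSpeedups(baselineMs, stages) {
  return stages.map(({ name, stepMs, measured }) => ({
    name,
    speedup: baselineMs / stepMs,
    kind: measured ? "measured" : "projected",
  }));
}

const rows = cumulativeSpeedups(baselineMs, stages);
for (const row of rows) {
  console.log(`${row.name}: ${row.speedup.toFixed(1)}x (${row.kind})`);
}
```

Because each row is measured against the same baseline, the per-stage gain is the ratio of adjacent rows, not their difference.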