← TinyGPT · roadmap · devlog

The speedup, measured

Same model, same seed, same data, same Apple M-series laptop. Just two backends. Run it yourself: tests/test_webgpu_train.mjs at the repo root.

WASM SIMDmulti-threaded baseline
6.8 sper step
WebGPU + this workblocked4 + subgroups + fused attn + Memory64
0.7 sper step
9.7× faster end-to-end. 50 training steps · Small preset · 1.1% loss drift (pure float-reorder noise).
Measured via tests/test_webgpu_train.mjs. Same seed, identical tokenization, identical AdamW state. WebGPU run uses the matmul_blocked kernel (this work), subgroup-cooperative layernorm + cross-entropy, and the fused softmax+value attention forward.

Where the 9.7× actually came from

Each row is the cumulative speedup over a single-threaded scalar WASM baseline. Solid = measured; striped = projected.

scalar WASM baseline
1.0×
measured
+ WASM SIMD
1.6×
measured
+ multi-thread (4 workers)
3.2×
measured
+ WebGPU naive
~7×
measured
+ WebGPU blocked matmul
9.7×
measured
+ on big-matmul presets (Mega+)
~30×
measured (kernel)
+ WebGPU subgroups
~36×
projected
+ Flash Attention 2
~70×
projected
Full benchmark log + the three things that didn't work → /devlog · /roadmap