Decisions made, measurements taken, things that didn't work. Most of this came out of live AI-pair-programming sessions; the dialogue is condensed but the numbers are verbatim from runs on this codebase.
TinyGPT started as a teaching project — a GPT-2-shaped model implemented from scratch in Python, then ported to C++/WASM for the browser, then to WebGPU. Somewhere along the way it became a speed-optimization project too. The interesting part isn't the final number; it's which optimizations worked, which didn't, and why. That's what this page captures.
The benchmarks below are run-on-this-machine, not extrapolation. Every "kernel
measured" or "end-to-end measured" value can be reproduced with the WebGPU benchmark
button on the playground or the
tests/test_webgpu_train.mjs parity script.
Before this work, the browser playground couldn't allocate a model bigger than ~250M parameters in fp32. V8 caps each tab's WebAssembly heap near 4 GB using 32-bit pointers; weights + AdamW optimizer state (≈12 bytes per parameter) hits that wall at exactly that size.
WebAssembly's -sMEMORY64=1 + -sWASM_BIGINT flags switch
the module to 64-bit pointers, lifting the cap into the tens of GB on Chromium
133+ and Firefox 134+. The build script
(wasm/build_wasm64.sh) produces a separate
tinygpt64.{js,wasm} module — same C++ source, just compiled with the
new flags. Runtime feature-detection picks the right module.
Measured. Allocated a 473M-parameter model end-to-end:
handle: 80312 params: 473,244,160 alloc time: 3,703 ms 1 train step: loss 5.78 in 82.2 s (initial loss for random init, sane) freed cleanly
The same allocation hard-OOMs the 32-bit module. The Behemoth preset in the playground deliberately surfaces this — pick it and the "Memory64 ✓" capability pill lights up, and a pre-flight check blocks the run on browsers that don't support it (telling you which browsers do).
Lesson learned: the Memory64 descriptor spelling changed
mid-flight in the WebAssembly proposal. Newer Chromium uses
address: "i64"; older Chromium (still bundled with Playwright as of
late 2026) uses index: "i64". The loader probes both. Without that
fallback, browsers that did support the feature would silently load the
32-bit module.
Most of training time is matmul. So most of the speed work was matmul. The bench button on the playground runs a side-by-side sweep across kernel variants at realistic sizes (256³ → 2048³, inputs uploaded outside the timed loop so we measure dispatch cost, not packing). The data anchors every speed claim here.
1. Workgroup-shared tiling (Goto/VandeGeijn 16×16): the canonical first optimization. Load a 16×16 tile of A and B into shared memory cooperatively, then do 16 multiply-adds from shared. Cuts global reads by ~16×.
2. Thread-level register blocking (4×4): each thread holds a 4×4 output block in registers. Outer-product structure means each shared-memory load gets reused 4× across the register accumulator. This is where matmul stops being bandwidth-bound and starts being compute-bound.
| matmul size | naive ms | tiled ms | blocked4 ms | vs naive |
|---|---|---|---|---|
| 256³ | 0.87 | 0.72 | 0.45 | 1.93× |
| 512³ | 1.74 | 0.86 | 0.64 | 2.72× |
| 1024³ | 6.43 | 2.85 | 1.80 | 3.58× |
| 2048³ | 47.24 | 17.23 | 9.12 | 5.18× |
Blocked4 was wired into train.wgsl as a drop-in replacement (same
bind-group layout as the naive kernel). The end-to-end parity test confirmed it
produces equivalent training:
WASM SIMD 6.8 s · loss 2.9385
WebGPU+block 0.7 s · loss 2.9719 → 9.7× wall-clock, 1.1% loss drift
(pure float-reorder noise)
f16-packed storage — store weights as two f16 per u32 via
pack2x16float, halve global bandwidth. Standalone benchmark: 1.7×
faster than naive WebGPU at 2048³. Sounded great. But when compared
against the right baseline (the already-tiled kernel), the combined
tiled+f16 ran slower than plain tiled at 2048³: 17.78 ms vs
16.90 ms. Once tiling has amortized global reads, the kernel is compute-bound
on shared-memory ops — halving global bandwidth has nowhere left to help.
The 1.7× win was real but not additive — same underlying mechanism as
tiling, captured worse.
8×8 register block — hypothesis was that scaling from 4×4 to 8×8 (with a 128×128 workgroup output tile) would 4× the arithmetic intensity per shared-mem load. Lost at every benchmarked size:
size 1024: blocked4 1.78 ms vs blocked8 1.96 ms (0.91×) size 2048: blocked4 10.15 ms vs blocked8 11.52 ms (0.88×)
Most likely cause: 64 floats per thread for the accumulator exceeds the per-thread register budget on Apple GPUs, forcing register spill to local memory. Lower workgroup occupancy (16 KB shared per workgroup vs 4 KB) compounds it. Kept in the codebase as a documented negative result.
vec4 global loads — broke once, then root-caused.
Same blocked4 algorithm but issuing 128-bit memory transactions for A and B.
Standalone bench at 2048³: 1.37× faster than scalar blocked4. Best
single-kernel measurement in the project. First integration attempt diverged
loss to 88.67 vs WASM's 2.94 — 30× off. The standalone bench used square shapes
with WebGPU's layout: "auto" (which inferred read-only-storage to
match the WGSL var<storage, read> declaration); production
uses an explicit pipeline layout declaring buffer: { type: "storage" }
— read-write. WGSL access mode and bind-group-layout type disagreeing
is undefined behaviour on Chromium/Apple — silently returns wrong data instead
of erroring at validation. Fix was one line per binding: declare all
six as var<storage, read_write> in train_vec4.wgsl
(the kernel only reads from g0/g1; the decoration just has to match). Parity
test now passes at 1.6% drift; vec4 is the default forward matmul.
Lesson learned, three times over. "More aggressive" is not the
same as "faster," and standalone benchmarks miss bugs that show up in real
training. The end-to-end parity test (tests/test_webgpu_train.mjs)
is now the bar — it runs 50 training steps under WASM and 50 under WebGPU on
the same seed, asserts loss drift is below 5%. Every kernel integration goes
through this gate.
Each bar is anchored to a measurement on this machine, not an extrapolation. (See the same chart, with longer captions, on /roadmap.)
| Build | Step time | vs scalar WASM | Notes |
|---|---|---|---|
| scalar WASM | baseline | 1.0× | single-threaded, no SIMD |
| + WASM SIMD | – | 1.6× | -msimd128 |
| + multi-thread (4 wrk) | – | 3.2× | Web Workers + SharedArrayBuffer |
| + WebGPU naive | ~1.0 s | ~7× | Small preset, 50 steps |
| + WebGPU blocked4 (fwd + back) | 0.7 s | 9.7× | integration parity verified |
| + subgroups (layernorm, cross-ent) | 0.7 s | 9.7× | kicks in at d_model ≥ 256 |
| + fused softmax+value | 0.7 s | 9.7× | saves attn round-trip; ctx-1024 ready |
The 9.7× wall-clock end-to-end speedup is the load-bearing number. It corresponds to a step time of 0.7 s where WASM SIMD takes 6.8 s — same model, same seed, same data, 0.5%–2% loss drift (pure float-reorder noise from different accumulation order in the GPU kernels).
Honest read on diminishing returns. Most of the easy wins are done. What's left:
.tinygpt file format both ways. Future work.
Most of this work was done in conversation with Claude. A few things that stood out:
The AI's first answer is often the most aggressive one. Initial proposals were "let's do 8×8 blocking, that's 4× the reuse." Bench said no. Same with "f16 stacks on top of tiled." Bench said no. Always bench.
Negative results were the most useful part. Documenting why f16-on-top-of-tiled doesn't compound, why 8×8 lost to 4×4, why vec4 broke non-square — those are now in the roadmap as honest entries. The next person (or next AI) won't waste time trying them.
End-to-end parity tests catch what kernel benches miss. Standalone
WebGPU matmul benchmarks at 256³/512³ passed every test. Wire it into a real
training step and the loss diverges 30×. The end-to-end test
(tests/test_webgpu_train.mjs) became the bar — every integration runs
through it before claiming victory.
"How fast can it be?" is the wrong question. "What does it take to produce real text?" is the right one. The model needs val loss below 1.5 for the prose to start looking like prose. That's a dataset+model-scale problem, not a kernel-speed one. Speed makes the runs feasible; the data + step budget determines whether they actually produce something readable.