Decisions made, measurements taken, things that didn't work. Most of this came out of live AI-pair-programming sessions; the dialogue is condensed but the numbers are verbatim from runs on this codebase.
TinyGPT started as a teaching project — a GPT-2-shaped model implemented from scratch in Python, then ported to C++/WASM for the browser, then to WebGPU. Somewhere along the way it became a speed-optimization project too. The interesting part isn't the final number; it's which optimizations worked, which didn't, and why. That's what this page captures.
The benchmarks below are run-on-this-machine, not extrapolation. Every "kernel
measured" or "end-to-end measured" value can be reproduced with the WebGPU benchmark
button on the playground or the
tests/test_webgpu_train.mjs parity script.
Before this work, the browser playground couldn't allocate a model bigger than ~250M parameters in fp32. With 32-bit pointers, V8 caps each tab's WebAssembly heap near 4 GB; weights plus AdamW optimizer state (≈12 bytes per parameter) hit that wall at roughly that size.
WebAssembly's -sMEMORY64=1 + -sWASM_BIGINT flags switch
the module to 64-bit pointers, lifting the cap into the tens of GB on Chromium
133+ and Firefox 134+. The build script
(wasm/build_wasm64.sh) produces a separate
tinygpt64.{js,wasm} module — same C++ source, just compiled with the
new flags. Runtime feature-detection picks the right module.
Measured. Allocated a 473M-parameter model end-to-end:

    handle: 80312
    params: 473,244,160
    alloc time: 3,703 ms
    1 train step: loss 5.78 in 82.2 s (initial loss for random init, sane)
    freed cleanly

The same allocation hard-OOMs the 32-bit module. The Behemoth preset in the playground deliberately surfaces this — pick it and the "Memory64 ✓" capability pill lights up, and a pre-flight check blocks the run on browsers that don't support it (telling you which browsers do).
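A rough budget makes the wall concrete. This is a back-of-envelope sketch, not code from the repo: the 12 bytes per parameter comes from the text above, while the extra 4 bytes per parameter for fp32 gradients is my assumption, and activations are ignored entirely.

```typescript
// Back-of-envelope heap budget for fp32 training state.
// ASSUMPTION: gradients add 4 bytes/param on top of the ~12 bytes/param
// (weights + AdamW m + AdamW v) quoted in the text. Activations are ignored.
const GIB = 1024 ** 3;
const HEAP_LIMIT_32BIT = 4 * GIB;

const BYTES_WEIGHTS_PLUS_ADAMW = 12; // from the text
const BYTES_WITH_GRADIENTS = 16;     // assumption, see above

function maxParamsForHeap(heapBytes: number, bytesPerParam: number): number {
  return Math.floor(heapBytes / bytesPerParam);
}

function trainingStateBytes(params: number, bytesPerParam: number): number {
  return params * bytesPerParam;
}

// Upper bounds on parameter count under a 4 GiB heap.
const capWithoutGrads = maxParamsForHeap(HEAP_LIMIT_32BIT, BYTES_WEIGHTS_PLUS_ADAMW);
const capWithGrads = maxParamsForHeap(HEAP_LIMIT_32BIT, BYTES_WITH_GRADIENTS);

// The 473,244,160-parameter model from the measured run above.
const behemothBytes = trainingStateBytes(473244160, BYTES_WEIGHTS_PLUS_ADAMW);
const fitsIn32Bit = behemothBytes <= HEAP_LIMIT_32BIT;
```

Under the 16-byte assumption the ceiling lands near 268M parameters, in the same range as the ~250M limit observed. The 473M model needs over 5.6 GB even at 12 bytes per parameter, so it cannot fit in a 32-bit heap.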
Lesson learned: the Memory64 descriptor spelling changed
mid-flight in the WebAssembly proposal. Newer Chromium uses
address: "i64"; older Chromium (still bundled with Playwright as of
late 2026) uses index: "i64". The loader probes both. Without that
fallback, browsers that did support the feature would silently load the
32-bit module.
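The fallback order can be sketched as below. This shows the selection logic only: `tryDescriptor` is a hypothetical callback standing in for whatever the real loader does to test one descriptor spelling, and the text does not say how that probe is implemented.

```typescript
// Which descriptor key the running browser accepts for a 64-bit memory.
type Memory64Key = "address" | "index";

// `tryDescriptor` is HYPOTHETICAL: it should return true only if the browser
// produced a usable 64-bit memory for that key. It may also throw.
function detectMemory64Key(
  tryDescriptor: (key: Memory64Key) => boolean,
): Memory64Key | null {
  // Newer spelling first, older spelling as the fallback.
  const candidates: Memory64Key[] = ["address", "index"];
  for (const key of candidates) {
    try {
      if (tryDescriptor(key)) return key;
    } catch {
      // This spelling was rejected outright; try the next one.
    }
  }
  return null; // no Memory64 support detected: load the 32-bit module
}
```

Returning `null` only after both spellings fail is the point: a loader that tests just one key would treat a capable browser as unsupported.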
Most of training time is matmul. So most of the speed work was matmul. The bench button on the playground runs a side-by-side sweep across kernel variants at realistic sizes (256³ → 2048³, inputs uploaded outside the timed loop so we measure dispatch cost, not packing). The data anchors every speed claim here.
1. Workgroup-shared tiling (Goto/van de Geijn style, 16×16): the canonical first optimization. Load a 16×16 tile of A and a 16×16 tile of B into shared memory cooperatively, then do 16 multiply-adds from shared. Cuts global reads by ~16×.
2. Thread-level register blocking (4×4): each thread holds a 4×4 output block in registers. Outer-product structure means each shared-memory load gets reused 4× across the register accumulator. This is where matmul stops being bandwidth-bound and starts being compute-bound.
| matmul size | naive ms | tiled ms | blocked4 ms | vs naive |
|---|---|---|---|---|
| 256³ | 0.87 | 0.72 | 0.45 | 1.93× |
| 512³ | 1.74 | 0.86 | 0.64 | 2.72× |
| 1024³ | 6.43 | 2.85 | 1.80 | 3.58× |
| 2048³ | 47.24 | 17.23 | 9.12 | 5.18× |
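The register-blocking idea from item 2 is easier to see on the CPU. The sketch below is an illustration in TypeScript, not the WGSL kernel: the 16-element accumulator stands in for one GPU thread's registers, and it assumes square matrices with `n` a multiple of 4.

```typescript
// Plain triple loop, used as the reference.
function matmulNaive(a: Float32Array, b: Float32Array, n: number): Float32Array {
  const c = new Float32Array(n * n);
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      let acc = 0;
      for (let k = 0; k < n; k++) acc += a[i * n + k] * b[k * n + j];
      c[i * n + j] = acc;
    }
  }
  return c;
}

// Same product, but each inner "thread" owns a 4x4 block of the output.
function matmulBlocked4(a: Float32Array, b: Float32Array, n: number): Float32Array {
  if (n % 4 !== 0) throw new Error("n must be a multiple of 4 in this sketch");
  const c = new Float32Array(n * n);
  const acc = new Float64Array(16); // stands in for 16 per-thread registers
  const aCol = new Float64Array(4);
  const bRow = new Float64Array(4);
  for (let i0 = 0; i0 < n; i0 += 4) {
    for (let j0 = 0; j0 < n; j0 += 4) {
      acc.fill(0);
      for (let k = 0; k < n; k++) {
        // 8 loads per k step (4 from A, 4 from B)...
        for (let r = 0; r < 4; r++) aCol[r] = a[(i0 + r) * n + k];
        for (let s = 0; s < 4; s++) bRow[s] = b[k * n + j0 + s];
        // ...feed 16 multiply-adds: every loaded value is reused 4 times.
        for (let r = 0; r < 4; r++) {
          for (let s = 0; s < 4; s++) acc[r * 4 + s] += aCol[r] * bRow[s];
        }
      }
      // Write the finished 4x4 block back once.
      for (let r = 0; r < 4; r++) {
        for (let s = 0; s < 4; s++) c[(i0 + r) * n + j0 + s] = acc[r * 4 + s];
      }
    }
  }
  return c;
}
```

The outer-product inner loop is what changes the ratio: 8 loads buy 16 multiply-adds, where the naive loop pays 2 loads per multiply-add.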
Blocked4 was wired into train.wgsl as a drop-in replacement (same
bind-group layout as the naive kernel). The end-to-end parity test confirmed it
produces equivalent training:
| preset | d_model | speedup (WebGPU vs WASM SIMD mt) | loss drift |
|---|---|---|---|
| Small | 96 | 2.6× | 1.1% |
| Medium | 128 | 6.8× | 1.4% |
| Large | 192 | 9.3× | 1.9% |
| XL | 256 | 12.1× | 2.5% |

(drift = pure float-reorder noise)
f16-packed storage — store weights as two f16 per u32 via
pack2x16float, halve global bandwidth. Standalone benchmark: 1.7×
faster than naive WebGPU at 2048³. Sounded great. But when compared
against the right baseline (the already-tiled kernel), the combined
tiled+f16 ran slower than plain tiled at 2048³: 17.78 ms vs
16.90 ms. Once tiling has amortized global reads, the kernel is compute-bound
on shared-memory ops — halving global bandwidth has nowhere left to help.
The 1.7× win was real but not additive — same underlying mechanism as
tiling, captured worse.
8×8 register block — hypothesis was that scaling from 4×4 to 8×8 (with a 128×128 workgroup output tile) would 4× the arithmetic intensity per shared-mem load. Lost at every benchmarked size:
| size | blocked4 ms | blocked8 ms | blocked8 vs blocked4 |
|---|---|---|---|
| 1024 | 1.78 | 1.96 | 0.91× |
| 2048 | 10.15 | 11.52 | 0.88× |
Most likely cause: 64 floats per thread for the accumulator exceeds the per-thread register budget on Apple GPUs, forcing register spill to local memory. Lower workgroup occupancy (16 KB shared per workgroup vs 4 KB) compounds it. Kept in the codebase as a documented negative result.
vec4 global loads — broke once, then root-caused.
Same blocked4 algorithm but issuing 128-bit memory transactions for A and B.
Standalone bench at 2048³: 1.37× faster than scalar blocked4. Best
single-kernel measurement in the project. First integration attempt diverged
loss to 88.67 vs WASM's 2.94 — 30× off. The standalone bench used square shapes
with WebGPU's layout: "auto" (which inferred read-only-storage to
match the WGSL var<storage, read> declaration); production
uses an explicit pipeline layout declaring buffer: { type: "storage" }
— read-write. WGSL access mode and bind-group-layout type disagreeing
is undefined behaviour on Chromium/Apple — silently returns wrong data instead
of erroring at validation. Fix was one line per binding: declare all
six as var<storage, read_write> in train_vec4.wgsl
(the kernel only reads from g0/g1; the decoration just has to match). Parity
test now passes at 1.6% drift; vec4 is the default forward matmul.
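A mismatch like this can be caught before a pipeline is built. The helper below is a hypothetical lint, not something from the repo: it scans WGSL source for storage bindings and compares each declared access mode with the `type` in an explicit bind-group layout. It assumes the simple `@binding(n) var<storage, ...>` form and ignores `@group`.

```typescript
type StorageBindingType = "storage" | "read-only-storage";

interface AccessMismatch {
  binding: number;
  wgslAccess: "read" | "read_write";
  layoutType: StorageBindingType;
}

// WGSL access mode -> the buffer binding type it has to be paired with.
function expectedLayoutType(access: "read" | "read_write"): StorageBindingType {
  return access === "read_write" ? "storage" : "read-only-storage";
}

// layoutTypes[n] is the `buffer.type` declared for binding n in the explicit layout.
function findAccessMismatches(
  wgsl: string,
  layoutTypes: StorageBindingType[],
): AccessMismatch[] {
  const decl = /@binding\((\d+)\)\s*var<\s*storage\s*(?:,\s*(read_write|read)\s*)?>/g;
  const mismatches: AccessMismatch[] = [];
  let m: RegExpExecArray | null;
  while ((m = decl.exec(wgsl)) !== null) {
    const binding = Number(m[1]);
    // WGSL storage variables default to read access when no mode is given.
    const wgslAccess = m[2] === "read_write" ? "read_write" : "read";
    const layoutType = layoutTypes[binding];
    if (layoutType !== undefined && layoutType !== expectedLayoutType(wgslAccess)) {
      mismatches.push({ binding, wgslAccess, layoutType });
    }
  }
  return mismatches;
}
```

Run against a shader that declares g0 and g1 as `var<storage, read>` with a layout of all `"storage"` entries, this reports bindings 0 and 1. After the one-line-per-binding fix it reports nothing.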
Lesson learned, three times over. "More aggressive" is not the
same as "faster," and standalone benchmarks miss bugs that show up in real
training. The end-to-end parity test (tests/test_webgpu_train.mjs)
is now the bar — it runs 50 training steps under WASM and 50 under WebGPU on
the same seed, asserts loss drift is below 5%. Every kernel integration goes
through this gate.
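The gate itself reduces to one comparison. This is a minimal sketch assuming drift is measured as the relative difference in final loss against the WASM run; the function names are mine, and the real script may compute it differently.

```typescript
// Relative difference of the candidate loss against the reference loss.
function relativeDrift(referenceLoss: number, candidateLoss: number): number {
  return Math.abs(candidateLoss - referenceLoss) / Math.abs(referenceLoss);
}

// True when the candidate stays within the allowed drift (5% by default).
// NaN or infinite losses fail the gate instead of slipping through.
function passesParityGate(
  referenceLoss: number,
  candidateLoss: number,
  maxDrift = 0.05,
): boolean {
  const drift = relativeDrift(referenceLoss, candidateLoss);
  return Number.isFinite(drift) && drift < maxDrift;
}

// The first vec4 integration: WASM at 2.94, WebGPU at 88.67.
const vec4FirstAttempt = passesParityGate(2.94, 88.67);
```

The `Number.isFinite` check matters because any comparison with `NaN` is false, so a NaN loss would otherwise depend on which way the inequality happens to be written.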
Each bar is anchored to a measurement on this machine, not an extrapolation. (See the same chart, with longer captions, on /roadmap.)
| Build | Step time | vs scalar WASM | Notes |
|---|---|---|---|
| scalar WASM | baseline | 1.0× | single-threaded, no SIMD |
| + WASM SIMD | – | 1.6× | -msimd128 |
| + multi-thread (4 wrk) | – | 3.2× | Web Workers + SharedArrayBuffer |
| + WebGPU full stack, Small (d=96) | – | ~8.3× | 2.6× over WASM-SIMD-mt |
| + WebGPU full stack, Medium (d=128) | – | ~22× | 6.8× over WASM-SIMD-mt |
| + WebGPU full stack, Large (d=192) | – | ~30× | 9.3× over WASM-SIMD-mt |
| + WebGPU full stack, XL (d=256) | – | ~39× | 12.1× over WASM-SIMD-mt · top measured |
The speedup is a curve, not a single number. Small preset 2.6×, XL 12.1×, trending
upward because the blocked-4×4 matmul kernel's win scales with matmul size: GPU
work amortises better as d_model and ctx grow. Loss
drift stays at or below 2.5% across the curve (2.5% at XL), pure float-reorder noise from different
accumulation order in the GPU kernels.
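The "vs scalar WASM" figures for the WebGPU rows appear to be the product of two measured ratios: the 3.2× multi-threaded SIMD baseline and each preset's WebGPU-over-WASM-SIMD-mt speedup. A small sketch under that assumption:

```typescript
// Multi-threaded WASM SIMD relative to scalar WASM, from the table above.
const WASM_MT_VS_SCALAR = 3.2;

// Chain the two ratios to express a WebGPU run relative to scalar WASM.
function vsScalarWasm(gpuVsWasmMt: number): number {
  return WASM_MT_VS_SCALAR * gpuVsWasmMt;
}

const presets = [
  { name: "Small", gpuVsWasmMt: 2.6 },
  { name: "Medium", gpuVsWasmMt: 6.8 },
  { name: "Large", gpuVsWasmMt: 9.3 },
  { name: "XL", gpuVsWasmMt: 12.1 },
];

const chained = presets.map((p) => ({
  name: p.name,
  vsScalar: vsScalarWasm(p.gpuVsWasmMt),
}));
```

Because the products are chained ratios rather than direct scalar-vs-WebGPU timings, the table marks them with "~".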
FlashAttention-2 (FA2) was originally the headline entry in the "what's next" list. It moved up over the course of two sessions.
Forward — workgroup-cooperative tiling. One workgroup
per (batch, head, Q-tile of 16 rows); K and V walked in
blocks of 16; the online-softmax state (m, l,
O) stays in registers across K blocks. Per-thread private
memory drops from array<f32, 1024> (the FA1-style
fallback) to 1 + 1 + hd floats — ≤ 66 at hd=64,
well inside the register file. Default attention path for every preset
up to Behemoth.
Backward — recomputes attention on the fly from
q, k, and a saved log-sum-exp
L = m + log(l) per Q row. Two new kernels
(attn_dscores_fa2 + attn_dv_fa2) replace the
ones that previously read the cached attn matrix.
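The online-softmax bookkeeping is the part worth pinning down before any shader exists. Below is a minimal reference for a single query row, in the spirit of the JS-side parity tests. It is a sketch, not the project's test file: function names are mine, and `scores` is assumed to hold the already-scaled q·k values.

```typescript
// Reference: materialise all weights, then take the weighted sum of V rows.
function attendNaive(scores: number[], values: number[][]): number[] {
  const hd = values[0].length;
  const m = Math.max(...scores);
  const weights = scores.map((s) => Math.exp(s - m));
  const l = weights.reduce((sum, w) => sum + w, 0);
  const out = new Array<number>(hd).fill(0);
  for (let j = 0; j < scores.length; j++) {
    for (let d = 0; d < hd; d++) out[d] += (weights[j] / l) * values[j][d];
  }
  return out;
}

// Online version: walk K/V in blocks, keeping only (m, l, O) between blocks.
function attendOnline(
  scores: number[],
  values: number[][],
  block = 16,
): { out: number[]; lse: number } {
  const hd = values[0].length;
  let m = -Infinity; // running max of the scores seen so far
  let l = 0;         // running sum of exp(score - m)
  const o = new Array<number>(hd).fill(0); // unnormalised output accumulator
  for (let start = 0; start < scores.length; start += block) {
    const end = Math.min(start + block, scores.length);
    let mNew = m;
    for (let j = start; j < end; j++) mNew = Math.max(mNew, scores[j]);
    // Rescale old state to the new max (factor is 0 on the first block).
    const rescale = Math.exp(m - mNew);
    l *= rescale;
    for (let d = 0; d < hd; d++) o[d] *= rescale;
    for (let j = start; j < end; j++) {
      const p = Math.exp(scores[j] - mNew);
      l += p;
      for (let d = 0; d < hd; d++) o[d] += p * values[j][d];
    }
    m = mNew;
  }
  // lse = m + log(l) is the per-row value the backward pass recomputes from.
  return { out: o.map((x) => x / l), lse: m + Math.log(l) };
}
```

The state carried between blocks is exactly 1 + 1 + hd numbers, which is where the 66-float figure at hd=64 comes from.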
The real memory win: with backward no longer reading the cached attn matrix, the forward dropped its second-pass writeback entirely. At Mega-class shapes (B=4, H=8, T=512) that's about 67 MB of global memory traffic per layer per step that now stays on-chip.
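One way to arrive at roughly 67 MB, offered as my reading rather than the author's stated arithmetic: an fp32 attention matrix of shape B × H × T × T, written once by the forward pass and read once by the backward pass.

```typescript
// Bytes for one full attention matrix across batch and heads.
// ASSUMPTION: fp32 elements (4 bytes each).
function attnMatrixBytes(B: number, H: number, T: number, bytesPerElem = 4): number {
  return B * H * T * T * bytesPerElem;
}

const oneWayBytes = attnMatrixBytes(4, 8, 512); // one write or one read
// ASSUMPTION: the old path wrote the matrix in forward and read it in backward.
const writePlusReadBytes = 2 * oneWayBytes;
const writePlusReadMB = writePlusReadBytes / 1e6;
```

The traffic grows with T², so doubling the context to 1024 would quadruple it under the same assumptions.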
| Path | Time | Loss | Drift |
|---|---|---|---|
| WASM SIMD | 6.8 s | 2.9385 | – |
| WebGPU + FA2 fwd + FA2 back | 0.7 s | 2.8650 | 2.5% |
Lesson learned, definitively this time: the algorithm-in-JS-first habit. Both halves of FA2 (forward and backward) were pinned in a Node parity test against a naive reference before any WGSL got written. That made each shader "transcribe the proven algorithm" rather than "debug from a wall of NaN." Without the JS reference I'd have been chasing register-spill bugs that the math itself would never have caused. Worth keeping a separate Node-side parity test for every future kernel that does anything non-trivial.
Most of the speed work is done. Two real threads left:
.tinygpt file format both ways. The
biggest single new project that could grow out of this one.
The most expensive bug in this project never made it into a WGSL file. The
browser default learning rate was 3e-3 — ten times the Python
reference's 3e-4. Training plateaued at loss ~2.45 on real
corpora and looked exactly like a modelling ceiling: smooth curve, slowly
asymptoting, no obvious tell. I spent two days suspecting the GPU kernels
before noticing the LR.
Lesson: kernel parity tests caught every numerical
deviation in the maths. Nothing was catching deviation in the
config defaults. The Python reference is the oracle for the
hyperparameters too — and the defaults need to be parity-checked the
same way the gradients are. Fixed in
browser/src/types.ts and
browser/src/pages/index.astro. Full write-up in
docs/lessons.md.
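A defaults check can be as small as the sketch below. It is hypothetical: the key name `learningRate` and the idea of exporting the Python defaults as a plain record are my assumptions, not the layout of `browser/src/types.ts`.

```typescript
type Defaults = Record<string, number>;

// List every key where the candidate defaults differ from the reference.
function diffDefaults(reference: Defaults, candidate: Defaults, relTol = 1e-9): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(reference)) {
    const want = reference[key];
    if (!(key in candidate)) {
      problems.push(`${key}: missing (reference ${want})`);
      continue;
    }
    const got = candidate[key];
    // Relative comparison so 3e-4 vs 3e-3 is flagged but float noise is not.
    const scale = Math.max(Math.abs(want), Math.abs(got), Number.MIN_VALUE);
    if (Math.abs(got - want) / scale > relTol) {
      problems.push(`${key}: ${got} != reference ${want}`);
    }
  }
  return problems;
}

// The bug described above, reproduced with hypothetical key names.
const pythonReference: Defaults = { learningRate: 3e-4 };
const browserBefore: Defaults = { learningRate: 3e-3 };
const lrProblems = diffDefaults(pythonReference, browserBefore);
```

A test that fails whenever this returns a non-empty list would have flagged the 10× learning rate on the first run.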
Same investigation, two more honest items:
The default corpus was 863 bytes. The playground was
initializing on an inline meta-explainer paragraph instead of real text.
No amount of model scaling lowers loss below the dataset's intrinsic
entropy. Default is now the full TinyShakespeare (~1.1 MB,
/shakespeare.txt) fetched on init.
The Memory64 build's ABI layer was untested. I shipped
the 64-bit WASM module claiming "trains 473M params in Node." It did —
when called directly from a one-off script. But tests/bench_wasm.mjs
loads the 32-bit module, so the 64-bit pthread+Memory64 path had
never been exercised in Node. The browser was calling into a
broken JS↔WASM bridge: _malloc returns Number, cwrap
pointer args expect BigInt, the conversion throws. Reproducer:
tests/test_wasm64_xl_node.mjs. The OOB the browser was
hitting at d_model ≥ 256 was a downstream consequence, not a kernel
bug. Tracked as task #66; the loader falls back to the 32-bit module
for XL/Massive/Mega/Behemoth in-browser.
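The Number/BigInt mismatch is the kind of thing a single coercion helper at the bridge can absorb. This is an illustrative sketch, not the fix tracked in task #66: `toModulePtr` is a name I made up, and the real bridge may need to handle return values as well.

```typescript
type WasmPtr = number | bigint;

// Coerce a pointer to the representation the target module expects:
// BigInt for the Memory64 build, Number for the 32-bit build.
function toModulePtr(ptr: WasmPtr, memory64: boolean): WasmPtr {
  if (memory64) {
    // BigInt() throws a RangeError on non-integer numbers, which is the
    // behaviour we want for a corrupted pointer.
    return typeof ptr === "bigint" ? ptr : BigInt(ptr);
  }
  if (typeof ptr === "bigint") {
    if (ptr < BigInt(0) || ptr > BigInt(Number.MAX_SAFE_INTEGER)) {
      throw new RangeError("pointer does not fit in a JS number");
    }
    return Number(ptr);
  }
  return ptr;
}

// The handle from the measured run above, as a Number, going into a 64-bit call.
const handleFor64 = toModulePtr(80312, true);
```

Funnelling every pointer argument through one helper also gives a Node-side test a single place to exercise both builds.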
Most of this work was done in conversation with Claude. A few things that stood out:
The AI's first answer is often the most aggressive one. Initial proposals were "let's do 8×8 blocking, that's 4× the reuse." Bench said no. Same with "f16 stacks on top of tiled." Bench said no. Always bench.
Negative results were the most useful part. Documenting why f16-on-top-of-tiled doesn't compound, why 8×8 lost to 4×4, and why vec4 broke on first integration: those are now in the roadmap as honest entries. The next person (or next AI) won't waste time trying them.
End-to-end parity tests catch what kernel benches miss. Standalone
WebGPU matmul benchmarks at 256³/512³ passed every test. Wire it into a real
training step and the loss diverges 30×. The end-to-end test
(tests/test_webgpu_train.mjs) became the bar — every integration runs
through it before claiming victory.
"How fast can it be?" is the wrong question. "What does it take to produce real text?" is the right one. The model needs val loss below 1.5 for the prose to start looking like prose. That's a dataset+model-scale problem, not a kernel-speed one. Speed makes the runs feasible; the data + step budget determines whether they actually produce something readable.