← TinyGPT · roadmap · devlog

Devlog — building TinyGPT while pair-programming with AI

Decisions made, measurements taken, things that didn't work. Most of this came out of live AI-pair-programming sessions; the dialogue is condensed but the numbers are verbatim from runs on this codebase.

TinyGPT started as a teaching project — a GPT-2-shaped model implemented from scratch in Python, then ported to C++/WASM for the browser, then to WebGPU. Somewhere along the way it became a speed-optimization project too. The interesting part isn't the final number; it's which optimizations worked, which didn't, and why. That's what this page captures.

The benchmarks below are run-on-this-machine, not extrapolation. Every "kernel measured" or "end-to-end measured" value can be reproduced with the WebGPU benchmark button on the playground or the tests/test_webgpu_train.mjs parity script.

Memory64 — breaking the 4 GB tab ceiling measured win

Before this work, the browser playground couldn't allocate a model bigger than ~250M parameters in fp32. V8 caps each tab's WebAssembly heap near 4 GB using 32-bit pointers; weights + AdamW optimizer state (≈12 bytes per parameter) hits that wall at exactly that size.

WebAssembly's -sMEMORY64=1 + -sWASM_BIGINT flags switch the module to 64-bit pointers, lifting the cap into the tens of GB on Chromium 133+ and Firefox 134+. The build script (wasm/build_wasm64.sh) produces a separate tinygpt64.{js,wasm} module — same C++ source, just compiled with the new flags. Runtime feature-detection picks the right module.

Measured. Allocated a 473M-parameter model end-to-end:

handle: 80312
params: 473,244,160
alloc time: 3,703 ms
1 train step:  loss 5.78  in 82.2 s  (initial loss for random init, sane)
freed cleanly

The same allocation hard-OOMs the 32-bit module. The Behemoth preset in the playground deliberately surfaces this — pick it and the "Memory64 ✓" capability pill lights up, and a pre-flight check blocks the run on browsers that don't support it (telling you which browsers do).

Lesson learned: the Memory64 descriptor spelling changed mid-flight in the WebAssembly proposal. Newer Chromium uses address: "i64"; older Chromium (still bundled with Playwright as of late 2026) uses index: "i64". The loader probes both. Without that fallback, browsers that did support the feature would silently load the 32-bit module.

Matmul kernel sweep — what worked, what didn't measured win

Most of training time is matmul. So most of the speed work was matmul. The bench button on the playground runs a side-by-side sweep across kernel variants at realistic sizes (256³ → 2048³, inputs uploaded outside the timed loop so we measure dispatch cost, not packing). The data anchors every speed claim here.

What worked, in order

1. Workgroup-shared tiling (Goto/VandeGeijn 16×16): the canonical first optimization. Load a 16×16 tile of A and B into shared memory cooperatively, then do 16 multiply-adds from shared. Cuts global reads by ~16×.

2. Thread-level register blocking (4×4): each thread holds a 4×4 output block in registers. Outer-product structure means each shared-memory load gets reused 4× across the register accumulator. This is where matmul stops being bandwidth-bound and starts being compute-bound.

matmul sizenaive mstiled msblocked4 msvs naive
256³0.870.720.451.93×
512³1.740.860.642.72×
1024³6.432.851.803.58×
2048³47.2417.239.125.18×

Blocked4 was wired into train.wgsl as a drop-in replacement (same bind-group layout as the naive kernel). The end-to-end parity test confirmed it produces equivalent training:

WASM SIMD       6.8 s · loss 2.9385
WebGPU+block    0.7 s · loss 2.9719   →  9.7× wall-clock, 1.1% loss drift
                                          (pure float-reorder noise)

What didn't work

f16-packed storage — store weights as two f16 per u32 via pack2x16float, halve global bandwidth. Standalone benchmark: 1.7× faster than naive WebGPU at 2048³. Sounded great. But when compared against the right baseline (the already-tiled kernel), the combined tiled+f16 ran slower than plain tiled at 2048³: 17.78 ms vs 16.90 ms. Once tiling has amortized global reads, the kernel is compute-bound on shared-memory ops — halving global bandwidth has nowhere left to help. The 1.7× win was real but not additive — same underlying mechanism as tiling, captured worse.

8×8 register block — hypothesis was that scaling from 4×4 to 8×8 (with a 128×128 workgroup output tile) would 4× the arithmetic intensity per shared-mem load. Lost at every benchmarked size:

size 1024:  blocked4 1.78 ms  vs  blocked8 1.96 ms  (0.91×)
size 2048:  blocked4 10.15 ms vs blocked8 11.52 ms (0.88×)

Most likely cause: 64 floats per thread for the accumulator exceeds the per-thread register budget on Apple GPUs, forcing register spill to local memory. Lower workgroup occupancy (16 KB shared per workgroup vs 4 KB) compounds it. Kept in the codebase as a documented negative result.

vec4 global loads — broke once, then root-caused. Same blocked4 algorithm but issuing 128-bit memory transactions for A and B. Standalone bench at 2048³: 1.37× faster than scalar blocked4. Best single-kernel measurement in the project. First integration attempt diverged loss to 88.67 vs WASM's 2.94 — 30× off. The standalone bench used square shapes with WebGPU's layout: "auto" (which inferred read-only-storage to match the WGSL var<storage, read> declaration); production uses an explicit pipeline layout declaring buffer: { type: "storage" } — read-write. WGSL access mode and bind-group-layout type disagreeing is undefined behaviour on Chromium/Apple — silently returns wrong data instead of erroring at validation. Fix was one line per binding: declare all six as var<storage, read_write> in train_vec4.wgsl (the kernel only reads from g0/g1; the decoration just has to match). Parity test now passes at 1.6% drift; vec4 is the default forward matmul.

Lesson learned, three times over. "More aggressive" is not the same as "faster," and standalone benchmarks miss bugs that show up in real training. The end-to-end parity test (tests/test_webgpu_train.mjs) is now the bar — it runs 50 training steps under WASM and 50 under WebGPU on the same seed, asserts loss drift is below 5%. Every kernel integration goes through this gate.

Speed evolution — the cumulative picture measured

Each bar is anchored to a measurement on this machine, not an extrapolation. (See the same chart, with longer captions, on /roadmap.)

BuildStep timevs scalar WASMNotes
scalar WASMbaseline1.0×single-threaded, no SIMD
+ WASM SIMD1.6×-msimd128
+ multi-thread (4 wrk)3.2×Web Workers + SharedArrayBuffer
+ WebGPU naive~1.0 s~7×Small preset, 50 steps
+ WebGPU blocked4 (fwd + back)0.7 s9.7×integration parity verified
+ subgroups (layernorm, cross-ent)0.7 s9.7×kicks in at d_model ≥ 256
+ fused softmax+value0.7 s9.7×saves attn round-trip; ctx-1024 ready

The 9.7× wall-clock end-to-end speedup is the load-bearing number. It corresponds to a step time of 0.7 s where WASM SIMD takes 6.8 s — same model, same seed, same data, 0.5%–2% loss drift (pure float-reorder noise from different accumulation order in the GPU kernels).

What's next — pipeline of remaining wins projected

Honest read on diminishing returns. Most of the easy wins are done. What's left:

Notes on pair-programming with AI

Most of this work was done in conversation with Claude. A few things that stood out:

The AI's first answer is often the most aggressive one. Initial proposals were "let's do 8×8 blocking, that's 4× the reuse." Bench said no. Same with "f16 stacks on top of tiled." Bench said no. Always bench.

Negative results were the most useful part. Documenting why f16-on-top-of-tiled doesn't compound, why 8×8 lost to 4×4, why vec4 broke non-square — those are now in the roadmap as honest entries. The next person (or next AI) won't waste time trying them.

End-to-end parity tests catch what kernel benches miss. Standalone WebGPU matmul benchmarks at 256³/512³ passed every test. Wire it into a real training step and the loss diverges 30×. The end-to-end test (tests/test_webgpu_train.mjs) became the bar — every integration runs through it before claiming victory.

"How fast can it be?" is the wrong question. "What does it take to produce real text?" is the right one. The model needs val loss below 1.5 for the prose to start looking like prose. That's a dataset+model-scale problem, not a kernel-speed one. Speed makes the runs feasible; the data + step budget determines whether they actually produce something readable.