Devlog — building posttrainllm while pair-programming with AI

Decisions made, measurements taken, things that didn't work. Most of this came out of live AI-pair-programming sessions; the dialogue is condensed but the numbers are verbatim from runs on this codebase.

posttrainllm started as a teaching project — a GPT-2-shaped model implemented from scratch in Python, then ported to C++/WASM for the browser, then to WebGPU. Somewhere along the way it became a speed-optimization project too. The interesting part isn't the final number; it's which optimizations worked, which didn't, and why. That's what this page captures.

The benchmarks below are run-on-this-machine, not extrapolation. Every "kernel measured" or "end-to-end measured" value can be reproduced with the WebGPU benchmark button on the playground or the tests/test_webgpu_train.mjs parity script.

Memory64: breaking the 4 GB tab ceiling (measured win)

Before this work, the browser playground couldn't allocate a model bigger than ~250M parameters in fp32. V8 caps each tab's WebAssembly heap near 4 GB because a 32-bit module uses 32-bit pointers; weights plus AdamW optimizer state (≈12 bytes per parameter) hit that wall at roughly that size.
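A quick way to sanity-check the ceiling is to do the arithmetic directly. The sketch below assumes the 12 bytes per parameter are an fp32 weight plus the two fp32 AdamW moments; it deliberately leaves out gradients and activations, which also live in the same heap, so the real limit sits below what this reports. The function names are illustrative, not from the codebase.

```python
# Back-of-the-envelope check of the 32-bit heap ceiling.
# Assumption: 12 bytes/param = fp32 weight (4) + AdamW first moment (4)
# + AdamW second moment (4). Gradients and activations are NOT counted.

BYTES_PER_PARAM = 12          # figure quoted in the text
WASM32_LIMIT = 4 * 1024**3    # 4 GiB addressable with 32-bit pointers


def state_bytes(n_params: int) -> int:
    """Bytes needed for weights plus AdamW moments at fp32."""
    return n_params * BYTES_PER_PARAM


def fits_in_wasm32(n_params: int) -> bool:
    """True if weights + optimizer state alone fit under the 4 GiB ceiling."""
    return state_bytes(n_params) < WASM32_LIMIT


if __name__ == "__main__":
    for n in (250_000_000, 473_244_160):
        gib = state_bytes(n) / 1024**3
        print(f"{n:>12,} params: {gib:.2f} GiB, fits in wasm32: {fits_in_wasm32(n)}")
```

Under these assumptions the 473M-parameter model needs more than 4 GiB for weights and optimizer state alone, while a 250M-parameter model still fits before gradients and activations are added.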

Emscripten's -sMEMORY64=1 and -sWASM_BIGINT flags switch the module to 64-bit pointers, lifting the cap into the tens of GB on Chromium 133+ and Firefox 134+. The build script (wasm/build_wasm64.sh) produces a separate posttrainllm64.{js,wasm} module from the same C++ source, just compiled with the new flags. Runtime feature detection picks the right module.

Measured. Allocated a 473M-parameter model end-to-end:

handle: 80312
params: 473,244,160
alloc time: 3,703 ms
1 train step:  loss 5.78  in 82.2 s  (initial loss for random init, sane)
freed cleanly

The same allocation hard-OOMs the 32-bit module. The Behemoth preset in the playground deliberately surfaces this — pick it and the "Memory64 ✓" capability pill lights up, and a pre-flight check blocks the run on browsers that don't support it (telling you which browsers do).

Lesson learned: the Memory64 descriptor spelling changed mid-flight in the WebAssembly proposal. Newer Chromium uses address: "i64"; older Chromium (still bundled with Playwright as of late 2026) uses index: "i64". The loader probes both. Without that fallback, browsers that did support the feature would silently load the 32-bit module.

Matmul kernel sweep: what worked, what didn't (measured win)

Most of training time is matmul. So most of the speed work was matmul. The bench button on the playground runs a side-by-side sweep across kernel variants at realistic sizes (256³ → 2048³, inputs uploaded outside the timed loop so we measure dispatch cost, not packing). The data anchors every speed claim here.

What worked, in order

1. Workgroup-shared tiling (Goto/van de Geijn, 16×16): the canonical first optimization. Each workgroup cooperatively loads a 16×16 tile of A and of B into shared memory, then every thread does 16 multiply-adds from shared. Cuts global reads by ~16×.

2. Thread-level register blocking (4×4): each thread holds a 4×4 output block in registers. Outer-product structure means each shared-memory load gets reused 4× across the register accumulator. This is where matmul stops being bandwidth-bound and starts being compute-bound.

matmul size	naive ms	tiled ms	blocked4 ms	blocked4 vs naive
256³	0.87	0.72	0.45	1.93×
512³	1.74	0.86	0.64	2.72×
1024³	6.43	2.85	1.80	3.58×
2048³	47.24	17.23	9.12	5.18×
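The ~16× reduction in global reads from step 1 can be checked with a small CPU model. This is an illustrative Python sketch, not the project's WGSL kernel: it treats slicing a tile out of the input matrices as the "cooperative load into shared memory" and counts how many elements are read from the full matrices in each scheme. The function names and the counting convention are assumptions made for the example.

```python
# Illustrative CPU model of workgroup-shared tiling; not the WGSL kernel.
# It counts "global" element reads for a naive vs. a tiled N x N matmul.

def matmul_naive(a, b, n):
    """Naive matmul: every multiply-add reads one element of A and one of B."""
    reads = 0
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                reads += 2
            c[i][j] = acc
    return c, reads


def matmul_tiled(a, b, n, tile=16):
    """Tiled matmul: each tile of A and B is read once per output-tile step."""
    assert n % tile == 0, "sketch assumes n is a multiple of the tile size"
    reads = 0
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # "Cooperative load": tile*tile elements of A and of B.
                sa = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                sb = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                reads += 2 * tile * tile
                # Each output element does `tile` multiply-adds from "shared".
                for i in range(tile):
                    for j in range(tile):
                        acc = c[i0 + i][j0 + j]
                        for k in range(tile):
                            acc += sa[i][k] * sb[k][j]
                        c[i0 + i][j0 + j] = acc
    return c, reads


if __name__ == "__main__":
    n = 32
    a = [[float((i + 2 * k) % 7) for k in range(n)] for i in range(n)]
    b = [[float((3 * k + j) % 5) for j in range(n)] for k in range(n)]
    c_naive, r_naive = matmul_naive(a, b, n)
    c_tiled, r_tiled = matmul_tiled(a, b, n)
    print("same result:", c_naive == c_tiled)
    print("global reads naive/tiled:", r_naive / r_tiled)
```

The ratio equals the tile edge, which is where the 16× figure comes from. Register blocking applies the same idea one level down, reusing values already in shared memory across a small per-thread output block.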

Blocked4 was wired into train.wgsl as a drop-in replacement (same bind-group layout as the naive kernel). The end-to-end parity test confirmed it produces equivalent training:

preset   d_model   speedup (WebGPU vs WASM SIMD mt)   loss drift
Small      96            2.6×                              1.1%
Medium    128            6.8×                              1.4%
Large     192            9.3×                              1.9%
XL        256           12.1×                              2.5%
                                  (drift = pure float-reorder noise)

What didn't work

f16-packed storage — store weights as two f16 per u32 via pack2x16float, halve global bandwidth. Standalone benchmark: 1.7× faster than naive WebGPU at 2048³. Sounded great. But when compared against the right baseline (the already-tiled kernel), the combined tiled+f16 ran slower than plain tiled at 2048³: 17.78 ms vs 16.90 ms. Once tiling has amortized global reads, the kernel is compute-bound on shared-memory ops — halving global bandwidth has nowhere left to help. The 1.7× win was real but not additive — same underlying mechanism as tiling, captured worse.

8×8 register block — hypothesis was that scaling from 4×4 to 8×8 (with a 128×128 workgroup output tile) would 4× the arithmetic intensity per shared-mem load. Lost at every benchmarked size:

size 1024:  blocked4 1.78 ms  vs  blocked8 1.96 ms  (0.91×)
size 2048:  blocked4 10.15 ms vs blocked8 11.52 ms (0.88×)

Most likely cause: 64 floats per thread for the accumulator exceeds the per-thread register budget on Apple GPUs, forcing register spill to local memory. Lower workgroup occupancy (16 KB shared per workgroup vs 4 KB) compounds it. Kept in the codebase as a documented negative result.

vec4 global loads: broke once, then root-caused. Same blocked4 algorithm, but issuing 128-bit memory transactions for A and B. Standalone bench at 2048³: 1.37× faster than scalar blocked4, the best single-kernel measurement in the project.

The first integration attempt diverged: loss 88.67 vs WASM's 2.94, 30× off. The standalone bench used square shapes with WebGPU's layout: "auto", which inferred read-only-storage to match the WGSL var<storage, read> declaration. Production uses an explicit pipeline layout declaring buffer: { type: "storage" }, which is read-write. WGSL access mode and bind-group-layout type disagreeing is undefined behaviour on Chromium/Apple: it silently returns wrong data instead of erroring at validation.

The fix was one line per binding: declare all six as var<storage, read_write> in train_vec4.wgsl (the kernel only reads from g0/g1; the decoration just has to match). The parity test now passes at 1.6% drift, and vec4 is the default forward matmul.

Lesson learned, three times over. "More aggressive" is not the same as "faster," and standalone benchmarks miss bugs that show up in real training. The end-to-end parity test (tests/test_webgpu_train.mjs) is now the bar: it runs 50 training steps under WASM and 50 under WebGPU on the same seed, and asserts that loss drift is below 5%. Every kernel integration goes through this gate.
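The gate itself is simple enough to sketch. The version below is a minimal Python illustration, assuming drift is defined as the relative difference between the final WebGPU loss and the final WASM loss; the real check lives in tests/test_webgpu_train.mjs and may define it differently. The function names are hypothetical.

```python
# Minimal sketch of a loss-drift gate. Assumption: drift is the relative
# difference of the candidate's final loss against the reference's final loss.

def loss_drift(reference_loss: float, candidate_loss: float) -> float:
    """Relative difference between candidate and reference loss."""
    return abs(candidate_loss - reference_loss) / abs(reference_loss)


def passes_parity(reference_loss: float, candidate_loss: float,
                  tolerance: float = 0.05) -> bool:
    """True if the candidate stays within the drift tolerance (5% by default)."""
    return loss_drift(reference_loss, candidate_loss) < tolerance


if __name__ == "__main__":
    # Loss pairs quoted elsewhere on this page: (WASM reference, WebGPU candidate).
    for ref, cand in ((2.9385, 2.8650), (2.94, 88.67)):
        print(f"ref {ref} vs cand {cand}: drift {loss_drift(ref, cand):.1%}, "
              f"pass: {passes_parity(ref, cand)}")
```

With this definition the FA2 run further down the page passes comfortably, and the first vec4 integration fails by a wide margin, which is the behaviour the gate is meant to have.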

Speed evolution: the cumulative picture (measured)

Each bar is anchored to a measurement on this machine, not an extrapolation. (See the same chart, with longer captions, on /roadmap.)

Build	Step time	vs scalar WASM	Notes
scalar WASM	baseline	1.0×	single-threaded, no SIMD
+ WASM SIMD	–	1.6×	-msimd128
+ multi-thread (4 wrk)	–	3.2×	Web Workers + SharedArrayBuffer
+ WebGPU full stack — Small (d=96)	—	~8.3×	2.6× over WASM-SIMD-mt
+ WebGPU full stack — Medium (d=128)	—	~22×	6.8× over WASM-SIMD-mt
+ WebGPU full stack — Large (d=192)	—	~30×	9.3× over WASM-SIMD-mt
+ WebGPU full stack — XL (d=256)	—	~39×	12.1× over WASM-SIMD-mt · top measured

The speedup is a curve, not a single number: 2.6× on the Small preset, 12.1× on XL, trending upward because the blocked-4×4 matmul kernel's win scales with matmul size (GPU work amortises better as d_model and ctx grow). Loss drift stays at or below 2.5% across the curve, pure float-reorder noise from the different accumulation order in the GPU kernels.
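The "vs scalar WASM" column appears to be the product of two measured ratios: the multi-threaded SIMD build over scalar WASM, and WebGPU over the multi-threaded SIMD build. The sketch below assumes that composition and reproduces the rounded values in the table; the dictionary and function names are illustrative.

```python
# Assumption: "vs scalar WASM" = (WebGPU vs WASM-SIMD-mt) * (WASM-SIMD-mt vs scalar).

MT_SIMD_VS_SCALAR = 3.2  # "+ multi-thread (4 wrk)" row

# Per-preset WebGPU speedup over the multi-threaded SIMD build.
WEBGPU_VS_MT_SIMD = {"Small": 2.6, "Medium": 6.8, "Large": 9.3, "XL": 12.1}


def vs_scalar(preset: str) -> float:
    """Cumulative speedup over scalar WASM for a preset."""
    return WEBGPU_VS_MT_SIMD[preset] * MT_SIMD_VS_SCALAR


if __name__ == "__main__":
    for preset in WEBGPU_VS_MT_SIMD:
        print(f"{preset:<6} {vs_scalar(preset):5.1f}x over scalar WASM")
```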

The lever that actually shipped: Flash Attention 2 (measured win)

This was originally the headline entry in the "what's next" list. It moved up over the course of two sessions.

Forward — workgroup-cooperative tiling. One workgroup per (batch, head, Q-tile of 16 rows); K and V walked in blocks of 16; the online-softmax state (m, l, O) stays in registers across K blocks. Per-thread private memory drops from array<f32, 1024> (the FA1-style fallback) to 1 + 1 + hd floats — ≤ 66 at hd=64, well inside the register file. Default attention path for every preset up to Behemoth.

Backward — recomputes attention on the fly from q, k, and a saved log-sum-exp L = m + log(l) per Q row. Two new kernels (attn_dscores_fa2 + attn_dv_fa2) replace the ones that previously read the cached attn matrix.
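The forward and backward descriptions above rest on two identities: the online-softmax update of (m, l, O) across K blocks, and the fact that exp(score - L) with L = m + log(l) recovers the softmax probability without a cached attention matrix. Here is a minimal single-query-row sketch in Python, in the spirit of the Node parity reference but not taken from it. It omits the 1/sqrt(hd) scaling and the causal mask to keep the core recurrence visible, and all names are illustrative.

```python
import math


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def attention_row_naive(q, keys, values):
    """Reference: materialise all scores, softmax, then weight the values."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    l = sum(exps)
    probs = [e / l for e in exps]
    hd = len(values[0])
    out = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(hd)]
    return out, probs


def attention_row_online(q, keys, values, block=16):
    """Online softmax: walk K/V in blocks, keeping only (m, l, o) as state."""
    hd = len(values[0])
    m = -math.inf          # running max of scores seen so far
    l = 0.0                # running sum of exp(score - m)
    o = [0.0] * hd         # running unnormalised output
    for start in range(0, len(keys), block):
        scores = [dot(q, k) for k in keys[start:start + block]]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # exp(-inf) is 0.0 on the first block
        l *= scale                    # rescale old state to the new max
        o = [x * scale for x in o]
        for s, v in zip(scores, values[start:start + block]):
            p = math.exp(s - m_new)
            l += p
            for d in range(hd):
                o[d] += p * v[d]
        m = m_new
    lse = m + math.log(l)             # saved per row for the backward pass
    return [x / l for x in o], lse


if __name__ == "__main__":
    T, hd = 40, 4
    q = [0.1 * (d + 1) for d in range(hd)]
    keys = [[math.sin(j + d) for d in range(hd)] for j in range(T)]
    values = [[math.cos(j * d) for d in range(hd)] for j in range(T)]
    out_ref, probs_ref = attention_row_naive(q, keys, values)
    out, lse = attention_row_online(q, keys, values)
    # Backward-style recompute: probability from the score and the saved LSE.
    probs = [math.exp(dot(q, k) - lse) for k in keys]
    print(max(abs(a - b) for a, b in zip(out, out_ref)))
    print(max(abs(a - b) for a, b in zip(probs, probs_ref)))
```

The per-row state is one m, one l, and hd output values, which matches the 1 + 1 + hd count quoted for the forward kernel.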

The real memory win: with backward no longer reading the cached attn matrix, the forward dropped its second-pass writeback entirely. At Mega-class shapes (B=4, H=8, T=512) that's about 67 MB of global memory traffic per layer per step that now stays on-chip.
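One reading that reproduces the 67 MB figure: the cached attention matrix is B × H × T × T fp32 values, written once in the forward pass and read once in the backward pass. That accounting is an assumption on my part, not something the profiler reported; the sketch just does the multiplication.

```python
# Assumption: the ~67 MB is one fp32 write (forward) plus one fp32 read
# (backward) of the full B x H x T x T attention matrix.

def attn_matrix_bytes(batch: int, heads: int, seq_len: int,
                      bytes_per_elem: int = 4) -> int:
    """Size in bytes of the full attention matrix for one layer."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


if __name__ == "__main__":
    one_pass = attn_matrix_bytes(batch=4, heads=8, seq_len=512)
    round_trip = 2 * one_pass  # forward writeback + backward read
    print(f"one pass:   {one_pass / 1e6:.1f} MB")
    print(f"round trip: {round_trip / 1e6:.1f} MB")
```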

WASM SIMD          6.8 s   loss 2.9385
WebGPU + FA2 fwd
       + FA2 back   0.7 s   loss 2.8650   2.5% drift

Lesson learned, definitively this time: the algorithm-in-JS-first habit. Both halves of FA2 (forward and backward) were pinned in a Node parity test against a naive reference before any WGSL got written. That made each shader "transcribe the proven algorithm" rather than "debug from a wall of NaN." Without the JS reference I'd have been chasing register-spill bugs that the math itself would never have caused. Worth keeping a separate Node-side parity test for every future kernel that does anything non-trivial.

What's next: what genuinely remains (projected)

Most of the speed work is done. Two real threads left:

Pre-trained model gallery — Cloudflare R2-hosted; manifest-driven UI; lets visitors load and continue-train from real checkpoints instead of just the one shipped demo. Deferred until the speed work was fully shipped so the gallery's implicit "you can train these too" promise stays honest. That bar is met now.
Native macOS app — MLX-Swift + SwiftUI, mirrors the playground but lifts the model-size ceiling into the 7B–30B range on Apple Silicon. Same .tinygpt file format both ways. The biggest single new project that could grow out of this one.

The bug that wasn't in any kernel (honest miss)

The most expensive bug in this project never made it into a WGSL file. The browser default learning rate was 3e-3 — ten times the Python reference's 3e-4. Training plateaued at loss ~2.45 on real corpora and looked exactly like a modelling ceiling: smooth curve, slowly asymptoting, no obvious tell. I spent two days suspecting the GPU kernels before noticing the LR.

Lesson: kernel parity tests caught every numerical deviation in the maths. Nothing was catching deviation in the config defaults. The Python reference is the oracle for the hyperparameters too — and the defaults need to be parity-checked the same way the gradients are. Fixed in browser/src/types.ts and browser/src/pages/index.astro. Full write-up in docs/archive/lessons.md.
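A defaults parity check can be as small as a dictionary diff against the reference. The sketch below is a hypothetical Python version: the key name learning_rate and both dictionaries are made up for illustration, with only the two learning-rate values taken from the text. A real check would need to load the browser defaults from wherever they are exported.

```python
import math

# Hypothetical config snapshots; only the two LR values come from the text.
REFERENCE_DEFAULTS = {"learning_rate": 3e-4}            # Python reference
BROWSER_DEFAULTS_BEFORE_FIX = {"learning_rate": 3e-3}   # old browser default


def config_mismatches(reference: dict, candidate: dict,
                      rel_tol: float = 1e-9) -> dict:
    """Return {key: (reference_value, candidate_value)} for every key that
    is missing from the candidate or differs from the reference."""
    mismatches = {}
    for key, ref_value in reference.items():
        cand_value = candidate.get(key)
        if isinstance(ref_value, float) and isinstance(cand_value, float):
            same = math.isclose(ref_value, cand_value, rel_tol=rel_tol)
        else:
            same = ref_value == cand_value
        if not same:
            mismatches[key] = (ref_value, cand_value)
    return mismatches


if __name__ == "__main__":
    print(config_mismatches(REFERENCE_DEFAULTS, BROWSER_DEFAULTS_BEFORE_FIX))
```

Run as part of the same test suite as the gradient parity checks, a diff like this would have flagged the 10× learning rate on day one instead of after two days of kernel debugging.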

Same investigation, two more honest items:

The default corpus was 863 bytes. The playground was initializing on an inline meta-explainer paragraph instead of real text. No amount of model scaling lowers loss below the dataset's intrinsic entropy. Default is now the full TinyShakespeare (~1.1 MB, /shakespeare.txt) fetched on init.

The Memory64 build's ABI layer was untested. I shipped the 64-bit WASM module claiming "trains 473M params in Node." It did — when called directly from a one-off script. But tests/bench_wasm.mjs loads the 32-bit module, so the 64-bit pthread+Memory64 path had never been exercised in Node. The browser was calling into a broken JS↔WASM bridge: _malloc returns Number, cwrap pointer args expect BigInt, the conversion throws. Reproducer: tests/test_wasm64_xl_node.mjs. The OOB the browser was hitting at d_model ≥ 256 was a downstream consequence, not a kernel bug. Tracked as task #66; the loader falls back to the 32-bit module for XL/Massive/Mega/Behemoth in-browser.

Notes on pair-programming with AI

Most of this work was done in conversation with Claude. A few things that stood out:

The AI's first answer is often the most aggressive one. Initial proposals were "let's do 8×8 blocking, that's 4× the reuse." Bench said no. Same with "f16 stacks on top of tiled." Bench said no. Always bench.

Negative results were the most useful part. Documenting why f16-on-top-of-tiled doesn't compound, why 8×8 lost to 4×4, and why vec4 broke on first integration: those are now in the roadmap as honest entries. The next person (or next AI) won't waste time trying them.

End-to-end parity tests catch what kernel benches miss. Standalone WebGPU matmul benchmarks at 256³/512³ passed every test. Wire it into a real training step and the loss diverges 30×. The end-to-end test (tests/test_webgpu_train.mjs) became the bar — every integration runs through it before claiming victory.

"How fast can it be?" is the wrong question. "What does it take to produce real text?" is the right one. The model needs val loss below 1.5 for the prose to start looking like prose. That's a dataset+model-scale problem, not a kernel-speed one. Speed makes the runs feasible; the data + step budget determines whether they actually produce something readable.