posttrainllm — the performance journey

●

Benchmark log

Measured

Machine: Apple M-series
Build: emcc -O3 -msimd128
Driver: tests/bench_wasm.mjs

Every number in this log was measured on this machine, not extrapolated. The numbers below are ms per training step at batch 16/12/8 on the multi-threaded WASM-SIMD build, the current shipped baseline.

Current shipped build — multi-threaded WASM SIMD:

Preset	Params	d_model	ctx	ms/step	tok/s
Small	0.37M	96	64	101	10,116
Medium	0.84M	128	96	357	4,305
Large	2.74M	192	128	1,191	1,289
XL	6.42M	256	128	1,851	553

(Previously, single-threaded SIMD: ~2× slower across the board. See lever 3.)

For WebGPU on the same hardware, the speedup is a scaling curve, not a flat ratio. End-to-end, measured via tests/test_webgpu_train.mjs: Small 2.6×, Medium 6.8×, Large 9.3×, XL 12.1× vs the multi-threaded WASM SIMD baseline above. The curve trends upward because GPU work amortises better with model size — the speed-evolution chart below shows the cumulative picture.

How to reproduce: bash wasm/build_wasm.sh && node tests/bench_wasm.mjs from the repo root. Reports ms/step per preset, both forward and backward.

●

Speed evolution — across the preset curve, normalized to scalar baseline

Measured + extrapolated

Baseline: 1× = single-threaded scalar WASM
Reading: each bar is the cumulative speedup over baseline

scalar baseline 1.0× measured

+ WASM SIMD 1.6× measured

+ multi-thread (4 workers) 3.2× measured

+ WebGPU full stack — Small (d=96) ~8.3× 2.6× over WASM-SIMD-mt · 1.1% drift

+ WebGPU full stack — Medium (d=128) ~22× 6.8× over WASM-SIMD-mt · 1.4% drift

+ WebGPU full stack — Large (d=192) ~30× 9.3× over WASM-SIMD-mt · 1.9% drift

+ WebGPU full stack — XL (d=256) ~39× 12.1× over WASM-SIMD-mt · 2.5% drift · top measured

Mega / Behemoth (projected) ≥ 15× projected · blocked by Memory64 ABI bug (task #66)

Solid teal bars are measured end-to-end on this codebase (multi-thread WASM SIMD vs full WebGPU stack: blocked4 + vec4 + subgroup reductions + FA2 fwd+bwd). Speedup grows with d_model because the blocked matmul kernel's win scales with matmul size. Striped bar is projected from kernel-level measurements; the in-browser Memory64 ABI bug at d_model ≥ 256 (task #66) currently blocks an honest end-to-end number for Mega and Behemoth.
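The cumulative bars can be cross-checked by chaining two measured ratios: the multi-threaded WASM SIMD speedup over scalar, and the WebGPU speedup over that build. A minimal sketch, assuming the bars are simple products of the figures quoted in the chart (the names below are illustrative, not from the codebase):

```python
# Cumulative speedup over the scalar baseline, assuming it is the product of
# (multi-threaded WASM SIMD over scalar) and (WebGPU over multi-threaded WASM SIMD).
WASM_MT_OVER_SCALAR = 3.2  # the "+ multi-thread (4 workers)" bar

# WebGPU speedup over the multi-threaded WASM SIMD build, per preset (from the chart).
webgpu_over_wasm_mt = {"Small": 2.6, "Medium": 6.8, "Large": 9.3, "XL": 12.1}


def cumulative_speedup(over_wasm_mt: float) -> float:
    """Chain the two ratios into one speedup over the scalar baseline."""
    return WASM_MT_OVER_SCALAR * over_wasm_mt


cumulative = {name: cumulative_speedup(r) for name, r in webgpu_over_wasm_mt.items()}
for name, value in cumulative.items():
    print(f"{name}: ~{value:.1f}x over scalar")
```

Rounded, the products land on the ~8.3×, ~22×, ~30× and ~39× labels shown on the bars.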

Orthogonal lever: Memory64 doesn't appear as a bar because it lifts the model-size ceiling, not training throughput. At fixed Small-preset size it's a no-op, but it's the only thing that lets the whole optimised pipeline run on a 473M-param model in the first place (a config that hard-OOMs the 32-bit WASM build).

1

WebAssembly SIMD in the matmul inner loop

Shipped

Impact: ~1.6× per project notes
Lives in: wasm/src/matmul.cpp

The C++ matmul is compiled twice — once scalar, once with -msimd128. With SIMD on, four f32 lanes multiply per cycle in the inner loop instead of one. docs/performance.md reports ~1.6×; current build is SIMD by default (the numbers in the Benchmark log above are SIMD-on).

The page's "WASM SIMD" pill at top shows whether your browser actually loaded the SIMD build. All Chromium-family browsers and Safari 16.4+ do.

Why now: Smallest cost / biggest immediate win. Doesn't change any maths, just generates better machine code. See docs/performance.md.

2

WebGPU forward + backward + AdamW

Shipped · 2.6×–12.1× curve

Measured: 2.6× → 12.1× across Small → XL on M-series
Lives in: webgpu/

The full training loop runs on the GPU — all 24 kernels written in WGSL, every one finite-difference and parity-checked against the WASM reference. Correct end to end.

Measured curve: Small 2.6×, Medium 6.8×, Large 9.3×, XL 12.1× over the multi-threaded WASM SIMD baseline, via tests/test_webgpu_train.mjs. Loss drift 1.1%–2.5% across the curve — float-reorder noise. The earlier single-preset "~7× on Small" number predated the multi-threaded WASM baseline and is withdrawn.

What's next: Benchmark on NVIDIA + Intel iGPU + Snapdragon to build a per-hardware table. Then make WebGPU the default backend when available.

3

Multi-threaded WebAssembly

Shipped · ~2× measured

Measured: ~2× across all preset sizes
Lives in: wasm/src/matmul.cpp · wasm/build_wasm.sh

matmul_forward and matmul_backward now split the M dimension across CPU cores via std::thread. Each thread takes a contiguous row slice; outputs don't overlap so no locks. The dB path is the exception — it accumulates over M, so we use per-thread scratch and a final reduction. Threading only kicks in when M ≥ 64.
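The partitioning described above can be sketched in Python for readability (the shipped code is C++ with std::thread, per the text). The helper names row_slices and matmul_backward_db, and the min_rows parameter, are illustrative assumptions rather than the project's API. Python threads will not run this faster because of the GIL; the point is only which data each worker touches.

```python
from concurrent.futures import ThreadPoolExecutor


def row_slices(m, n_threads):
    """Split range(m) into contiguous, non-overlapping [start, end) slices."""
    base, extra = divmod(m, n_threads)
    slices, start = [], 0
    for t in range(n_threads):
        end = start + base + (1 if t < extra else 0)
        if end > start:
            slices.append((start, end))
        start = end
    return slices


def matmul_forward(a, b, n_threads=4, min_rows=64):
    """C = A @ B; each worker writes only its own rows of C, so no locks are needed."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]

    def work(span):
        for i in range(span[0], span[1]):
            for p in range(k):
                aip = a[i][p]
                for j in range(n):
                    c[i][j] += aip * b[p][j]

    # Threading only kicks in for large enough M (hypothetical threshold parameter).
    spans = row_slices(m, n_threads) if m >= min_rows else [(0, m)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, spans))
    return c


def matmul_backward_db(a, dc, n_threads=4, min_rows=64):
    """dB = A^T @ dC sums over M, so each worker fills a private scratch buffer first."""
    m, k, n = len(a), len(a[0]), len(dc[0])

    def work(span):
        scratch = [[0.0] * n for _ in range(k)]
        for i in range(span[0], span[1]):
            for p in range(k):
                aip = a[i][p]
                for j in range(n):
                    scratch[p][j] += aip * dc[i][j]
        return scratch

    spans = row_slices(m, n_threads) if m >= min_rows else [(0, m)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(work, spans))

    # Final reduction: add the per-thread scratch buffers together.
    db = [[0.0] * n for _ in range(k)]
    for part in partials:
        for p in range(k):
            for j in range(n):
                db[p][j] += part[p][j]
    return db
```

The forward pass needs no synchronisation because output rows are disjoint; the dB pass trades a little extra memory (one scratch buffer per worker) for the same property.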

The pthread WASM build requires SharedArrayBuffer, which requires cross-origin isolation. The _headers file sets COOP/COEP for Cloudflare Pages; vite.config.ts mirrors it for the dev server.

Config	d_model	1-thread	Threaded	Δ
Small	96	190 ms	101 ms	+88%
Medium	128	693 ms	357 ms	+94%
Large	192	2397 ms	1191 ms	+101%
XL	256	3797 ms	1851 ms	+105%

Why only 2×, not 4–8×: the workload is memory-bandwidth bound past ~2 threads. Each matmul reads the entire B matrix, and that shared read is the bottleneck. Adding cores past the bandwidth limit gives diminishing returns. The measurements in the table above are consistent with this explanation.

4

Tiled blocked matmul (cache-aware)

Tried · reverted (no measured win)

Measured: net wash across tested sizes
Lives in: wasm/src/matmul.cpp

Tiled matmul (Tm=32, Tn=64, Tk=32 blocks) was implemented and benchmarked against the baseline on the same single-threaded WASM-SIMD build. The result:

Config	d_model	Baseline	Tiled	Δ
Small	96	190 ms	196 ms	-3%
Medium	128	693 ms	690 ms	±0%
Large	192	2397 ms	2248 ms	+6.7%
XL	256	3797 ms	3990 ms	-5%

Why the theoretical prediction (1.5-2×) didn't materialise here: the baseline matmul's inner loop is a fixed-bound for n in 0..N that emcc -O3 -msimd128 aggressively autovectorises into f32x4 FMA chains. The tiled variant introduces variable-bound inner loops (for n in n0..n1) that the autovectoriser handles less cleanly, so the SIMD win shrinks just as the cache win arrives. Net: wash.
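The structural difference between the two loop nests can be sketched as follows. This is a Python illustration of the loop shapes only, assuming a row-major M×K by K×N product; the function names are hypothetical, the real code lives in wasm/src/matmul.cpp, and Python says nothing about how emcc vectorises either form.

```python
def matmul_baseline(a, b):
    """Baseline: the innermost loop always runs the full fixed range 0..N."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            aip = a[i][p]
            for j in range(n):  # fixed bound
                c[i][j] += aip * b[p][j]
    return c


def matmul_tiled(a, b, tm=32, tn=64, tk=32):
    """Tiled: same arithmetic, but the innermost bounds (n0..n1) vary per tile."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, tm):
        m1 = min(m0 + tm, m)
        for k0 in range(0, k, tk):
            k1 = min(k0 + tk, k)
            for n0 in range(0, n, tn):
                n1 = min(n0 + tn, n)
                for i in range(m0, m1):
                    for p in range(k0, k1):
                        aip = a[i][p]
                        for j in range(n0, n1):  # variable bound
                            c[i][j] += aip * b[p][j]
    return c
```

Both versions accumulate each output element over k in the same ascending order, so they produce identical results; only the memory access pattern and the shape of the inner loop differ.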

What would change this: A hand-written SIMD inner kernel with statically known tile sizes (32×4 SIMD micro-kernel + scalar epilogue), i.e. the BLIS approach. That's ~2 days of careful work, vs the 50-line tiled patch tried here.

5

Mixed-precision weights (fp16 / bfloat16)

Withdrawn — deferred to native

Why not pursued in-browser

The in-browser experiment with fp16-packed weights on top of tiled matmul (lever 10b) showed no compound win — once tiling has amortised global-memory traffic, halving bandwidth has nowhere left to help. A full mixed-precision refactor (loss scaling, fp16 gradient accumulators, fp16 variants of every kernel) would be multi-week work for a marginal return on the sizes we run in browser.

Where this belongs instead: the native macOS app (lever 20, via the lever 19 bundle). MLX-Swift supports bf16 natively, with hardware accelerators on Apple Silicon. The training loop is the same algorithm; the native host is the right place to spend the complexity.

6

Flash Attention

Shipped as FA2 (see lever 12)

Potential: 1.15–2× total, scaling with ctx
Paper: Dao et al. 2022

Standard attention materialises an N×N score matrix in memory; Flash Attention computes it in tiles so the full matrix never exists — saving memory and beating naïve attention on speed by avoiding HBM round-trips.

What changed: with the new Huge/Massive/Mega presets, ctx now goes to 256–512 — the regime where attention's share of step time goes from ~12% (ctx 64) to ~40% (ctx 256) to ~55% (ctx 512). At Mega (ctx 512), the score matrix at B=2, H=12, fp32 is ~25 MB per attention call — starting to hit WebGPU buffer pressure.
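The ~25 MB figure follows directly from the shape of the score matrix. A small sketch of the arithmetic, assuming one fp32 value per (batch, head, query, key) entry; the function name is illustrative:

```python
def attention_scores_bytes(batch, heads, ctx, bytes_per_elem=4):
    """Bytes needed to materialise the full ctx x ctx score matrix for every batch and head."""
    return batch * heads * ctx * ctx * bytes_per_elem


mega = attention_scores_bytes(batch=2, heads=12, ctx=512)
print(f"{mega / 1e6:.1f} MB")  # prints "25.2 MB"

# The cost is quadratic in ctx: halving ctx to 256 cuts the matrix to a quarter.
quarter = attention_scores_bytes(batch=2, heads=12, ctx=256)
```

The quadratic term is why attention's share of step time climbs so quickly between ctx 64 and ctx 512.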

Estimated impact, today: ~1.18× on Massive, ~1.7× on Mega. The ctx=512 preset is the first where Flash Attention becomes the highest-ROI open lever.

Cost: A new kernel from scratch (tiled forward + tiled backward in WGSL, plus finite-difference parity tests against the existing naïve attention). ~1–2 weeks of focused work.

7

Kernel fusion (forward + loss + backward + AdamW)

Withdrawn — readability tax outweighs the win

Real lift available: ~1.1–1.3× on small models only

Fusing forward + loss + backward + AdamW into one mega-kernel would kill some memory traffic — but the project's stated principle is "every layer can be understood." Fused kernels are notoriously unreadable. The right move would be a second build target ("fast" alongside "readable") which doubles maintenance for a modest gain.

On bigger models, matmul dominates anyway and dispatch overhead is in the noise. With blocked4 + FA2 shipped, the speedup curve already climbs from 2.6× (Small) to 12.1× (XL) — the structural wins are spent. Some fusion does exist where readability wasn't a sacrifice: AdamW + grad clip ride in one pass, residual + layernorm chain without intermediate writeback.

8

Local Python with CUDA / Apple MPS

Shipped — escape hatch

Impact: 50–100×
Lives in: python_ref/

The Python reference runs the same model on PyTorch with full GPU support. On an M5 Pro: 10M-param model trains at ~24 s per 1,000 steps. Practical iteration speed for real models.

This is the right answer for anyone serious — the in-browser path is for learning the mechanics, not training the next ChatGPT. The Diagnostics section of the playground has the three commands you need.

9

WebAssembly Memory64 — break the 4 GB tab ceiling

Shipped · partial (browser ABI bug above ~250MB heap)

Impact: ~2× model size (in fp32)
Needs: posttrainllm64.{js,wasm}

V8 caps each tab's WASM heap near 4 GB on 32-bit pointers — that's ~250M fp32 params with Adam state, full stop. Memory64 (-sMEMORY64=1 -sWASM_BIGINT) switches the module to 64-bit pointers and lifts the cap into the tens of GB on Chromium 133+.

Measured (when called directly): 473M-param model (~5.6 GB heap with Adam state) allocates cleanly in Node — a config that hard-OOMed on the 32-bit module.

Caveat (task #66): the 64-bit module's JS↔WASM ABI wasn't being exercised by the existing tests/bench_wasm.mjs (which loads the 32-bit module), so a cwrap pointer-conversion bug shipped — _malloc returns Number but pointer args expect BigInt. The browser was calling into this broken bridge for XL and bigger; the loader now falls back to the 32-bit module for XL/Massive/Mega/Behemoth in-browser. Reproducer: tests/test_wasm64_xl_node.mjs. Lesson captured in docs/archive/lessons.md.

9b

Thread-blocked matmul (4×4 register block)

Kernel measured · biggest single kernel lever

Impact: ~5.2× over naive WebGPU matmul at 2048³ (measured)
Lives in: webgpu/matmul_blocked.wgsl

Stacks two well-known wins. (a) Same 16×16 workgroup-shared tiling as lever 10, plus (b) each of the 256 threads computes a 4×4 block of output values held in registers. Workgroup outputs a 64×64 tile; each shared-memory load gets reused 4× across the thread's register accumulator via outer-product structure. Arithmetic intensity per shared-mem load climbs from ~1 fused multiply-add to ~16 — well past the point where matmul becomes compute-bound rather than memory-bound.
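The register-block idea in (b) can be sketched on the CPU. In this Python illustration each (bi, bj) iteration plays the role of one GPU thread that owns a 4×4 block of outputs; it omits the workgroup-shared tiling from (a) and is not a transcription of webgpu/matmul_blocked.wgsl. The function name is hypothetical.

```python
def matmul_blocked4(a, b, block=4):
    """Register-blocked matmul: each (bi, bj) pair owns a block x block tile of outputs."""
    m, k, n = len(a), len(b), len(b[0])
    assert m % block == 0 and n % block == 0, "sketch assumes dimensions divisible by the block"
    c = [[0.0] * n for _ in range(m)]
    for bi in range(0, m, block):
        for bj in range(0, n, block):
            # The accumulator that a GPU thread would hold in registers.
            acc = [[0.0] * block for _ in range(block)]
            for p in range(k):
                a_col = [a[bi + r][p] for r in range(block)]  # 4 values from A
                b_row = [b[p][bj + s] for s in range(block)]  # 4 values from B
                # Outer product: every loaded value is reused 4 times.
                for r in range(block):
                    for s in range(block):
                        acc[r][s] += a_col[r] * b_row[s]
            # Write the finished block back once, after the whole k loop.
            for r in range(block):
                for s in range(block):
                    c[bi + r][bj + s] = acc[r][s]
    return c
```

The design point is that loads scale with the block edge while multiply-adds scale with its area, so a larger block does more arithmetic per load until the accumulator no longer fits in registers.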

Measured on M-series WebGPU:

matmul size	naive ms	tiled ms	blocked ms	vs naive
256³	0.66	0.72	0.45	1.48×
512³	1.96	0.86	0.64	3.04×
1024³	6.43	2.85	1.80	3.58×
2048³	47.24	17.23	9.12	5.18×

Speedup grows with matrix size because bigger problems amortize workgroup-shared-memory loading more effectively across the 4×4 register reuse. At 2048³ (the kind of shape that shows up in Mega and Behemoth presets) the kernel runs 5.18× faster than the naive version and 1.89× faster than the merely-tiled one.

Open: Drop-in replacement for the naive matmul in train.wgsl: same bind-group layout, and output matches the naive kernel up to floating-point reordering. Pipeline integration is the next item.

9c

8×8 register block — tried, lost to 4×4

Honest result · register spill / lower occupancy

Impact: 0.60–0.91× of blocked4, slower at every size (measured)
Lives in: webgpu/matmul_blocked8.wgsl

Tried scaling the register block up from 4×4 to 8×8 (workgroup output tile 128×128 instead of 64×64). Hypothesis was that 4× the arithmetic intensity per shared-memory load would translate to ~1.5× more speedup on top of blocked4. Lost at every size:

matmul size	blocked4 ms	blocked8 ms	ratio
256³	0.33	0.55	0.60×
512³	0.54	0.75	0.72×
1024³	1.78	1.96	0.91×
2048³	10.15	11.52	0.88×

Most likely cause: 64 floats per thread for the accumulator exceeds the per-thread register budget on Apple GPUs, forcing register spill into local memory and tanking effective compute throughput. Lower workgroup occupancy (16 KB shared per workgroup vs 4 KB) compounds it — fewer concurrent workgroups per SM. Kept in the codebase as a documented negative result. Same lesson as f16-vs-tiled: more aggressive is not always faster; benchmark every variant.

10

Tiled matmul (workgroup-shared memory)

Kernel measured · superseded by blocked

Impact: ~2.5× over naive WebGPU matmul (measured)
Lives in: webgpu/matmul_tiled.wgsl

Classic 16×16 tiled matmul using var<workgroup> shared memory (textbook Goto/van de Geijn pattern). Each workgroup of 16×16 threads cooperatively loads A's and B's 16×16 blocks into shared memory, then each thread does 16 multiply-accumulates from shared. Effectively turns 16 global reads into 1 global + 16 shared, which on big matmuls is where the GPU starts looking like a GPU.

Measured on M-series WebGPU, dispatch-only timing:

matmul size	naive ms	tiled ms	speedup
256³	0.87	0.37	2.35×
512³	1.74	0.64	2.72×
1024³	6.00	2.48	2.42×
2048³	43.16	16.90	2.55×

Clean ~2.5× across every realistic size, peak ~2.7× at 512³. Parity validated at sizes ≤ 512.

Open: Wire into train.wgsl so that every forward + backward matmul uses the tiled kernel. Drop-in: the bind-group layout is identical to the naive kernel, so only the pipeline creation needs to point at the new shader source.

10b

f16-packed storage — tried, doesn't compound

Honest result · standalone win swallowed by tiling

Impact: ~1.7× vs naive, but ≤ tiled
Lives in: webgpu/matmul_f16packed.wgsl · matmul_tiled_f16.wgsl

Weights live as packed half-precision (two f16 per u32 via pack2x16float built-ins), accumulation in f32. The standalone version beats naive by ~1.7× by halving global bandwidth. But once tiling is in place the kernel is no longer bandwidth-bound — it's compute-bound on shared-memory ops — and halving global bandwidth no longer helps. The combined tiled+f16 kernel is the same speed as plain tiled at 1024³ and a touch slower at 2048³ (17.78 ms vs 16.90 ms).
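The storage layout can be illustrated on the CPU. The sketch below mimics the WGSL pack2x16float / unpack2x16float built-ins using Python's struct half-float format; the dot_packed helper is a hypothetical stand-in for the kernel's inner loop, and its accumulator is a Python float (f64) rather than the f32 the text describes.

```python
import struct


def pack2x16float(lo, hi):
    """Pack two values as IEEE half floats into one 32-bit word (lo in the low 16 bits)."""
    lo_bits, hi_bits = struct.unpack("<HH", struct.pack("<ee", lo, hi))
    return lo_bits | (hi_bits << 16)


def unpack2x16float(word):
    """Inverse of pack2x16float: recover the two half-precision values as Python floats."""
    return struct.unpack("<ee", struct.pack("<HH", word & 0xFFFF, word >> 16))


def dot_packed(packed_weights, activations):
    """Dot product with f16-stored weights; accumulation stays in higher precision."""
    acc = 0.0
    for i, word in enumerate(packed_weights):
        w0, w1 = unpack2x16float(word)
        acc += w0 * activations[2 * i] + w1 * activations[2 * i + 1]
    return acc


weights = [pack2x16float(1.0, -2.0), pack2x16float(0.5, 0.25)]
result = dot_packed(weights, [2.0, 1.0, 4.0, 8.0])
```

Each 32-bit word read from memory yields two weights, which is where the halved global bandwidth comes from; the unpack step is extra arithmetic per element.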

Lesson: always bench an optimization against the best baseline, not the naive one. The ~1.7× we measured earlier was real but not additive — it was a different way to get the same memory-traffic win that tiling already captures more thoroughly.

The packed kernel stays in the codebase as a reference + for cases where the model genuinely needs more total bytes than the GPU can hold (Behemoth-scale weight buffers), where halving storage isn't just about speed but about fitting at all.

11

WebGPU subgroups — fast reductions

Withdrawn — not the bottleneck

Real lift: ~1.1–1.2× on softmax/layernorm only

Subgroup intrinsics (subgroupAdd, subgroupMax) would shave a tree-reduction's log₂(blockDim) passes into one on softmax/layernorm/attention reductions. The shipped kernels already use workgroup-shared reductions, which on Apple GPUs are within 10–20% of what subgroups would deliver — and reductions are 5–8% of total step time on every preset we ship. The lift on the bottom line is ~1–2%; not worth the added kernel surface area, the "subgroups" extension gate, or the test plumbing.

12

Flash Attention 2 in WGSL

Shipped — fwd + bwd + writeback dropped

Memory saved: O(B·H·T²) per layer per step
End-to-end drift vs WASM: 2.5%
Reference: Dao 2023 (FA2)

Workgroup-cooperative forward — one workgroup per (batch, head, Q-tile of 16 rows), K and V walked in blocks of 16, online softmax in registers across K blocks. Default attention path when hd ≤ 64 (every preset up to Behemoth). Lives in webgpu/attention_fa2.wgsl.
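The online-softmax step is the part that lets K and V be walked in blocks without ever holding a full row of scores. A minimal single-query sketch in Python, assuming no causal mask and no 1/√d scaling (both omitted for brevity); the function names are illustrative and this is not a transcription of attention_fa2.wgsl:

```python
import math


def attention_row_naive(q, keys, values):
    """Reference: materialise every score, then softmax, then weight the values."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_row_online(q, keys, values, block=16):
    """Walk K/V in blocks, keeping only a running max, running sum and running output."""
    dim = len(values[0])
    run_max, run_sum, out = -math.inf, 0.0, [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(run_max, max(scores))
        rescale = math.exp(run_max - new_max)  # shrink the earlier partial results
        run_sum *= rescale
        out = [o * rescale for o in out]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            run_sum += p
            for d in range(dim):
                out[d] += p * v[d]
        run_max = new_max
    lse = run_max + math.log(run_sum)  # log-sum-exp, the value a backward pass can reuse
    return [o / run_sum for o in out], lse
```

Only one block of scores exists at a time, so memory per query row is O(block) instead of O(T).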

Backward kernels (attn_dscores_fa2 + attn_dv_fa2) reconstruct P = exp(S − L) from q/k and the saved log-sum-exp instead of reading the cached attn matrix. That removed the forward's second-pass writeback entirely; on Mega-class shapes (B=4, H=8, T=512) ~67 MB of global memory traffic per layer per step now stays on-chip.
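Both the reconstruction and the traffic figure can be checked with a few lines. The sketch below assumes fp32 storage and reads the ~67 MB figure as one write plus one read of the full attention matrix; that interpretation is an assumption, as are the function names.

```python
import math


def softmax_row(scores):
    """Forward-style softmax that also returns the log-sum-exp L for the row."""
    m = max(scores)
    total = sum(math.exp(s - m) for s in scores)
    lse = m + math.log(total)
    return [math.exp(s - m) / total for s in scores], lse


def reconstruct_probs(scores, lse):
    """Backward-style rebuild: P = exp(S - L), with no cached attention matrix."""
    return [math.exp(s - lse) for s in scores]


def attn_matrix_traffic_bytes(batch, heads, ctx, bytes_per_elem=4):
    """Bytes for one write plus one read of the full attention matrix (assumed fp32)."""
    return 2 * batch * heads * ctx * ctx * bytes_per_elem


probs, lse = softmax_row([0.5, -1.0, 2.0, 0.0])
rebuilt = reconstruct_probs([0.5, -1.0, 2.0, 0.0], lse)
traffic = attn_matrix_traffic_bytes(batch=4, heads=8, ctx=512)
```

With B=4, H=8, T=512 the traffic helper gives about 67.1 MB, in line with the figure quoted above under that assumption.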

Verification: Algorithm parity in Node (tests/test_fa2_parity.mjs + tests/test_fa2_backward_parity.mjs): 12 forward checks, 18 backward checks, all within 1 ULP. End-to-end via tests/test_webgpu_train.mjs: WASM 6.8 s/step vs WebGPU + FA2 0.7 s/step, 2.5% loss drift after 50 steps.

Design notes: fa2_forward_notes.md · fa2_backward_notes.md

13

LoRA fine-tuning in the browser

Withdrawn — moved to native

Folded into the quantization + LoRA bundle (lever 19), which lands with the native macOS app (lever 20). Native MLX-Swift supports adapter training as a first-class operation; bringing the same feature to the WebGPU path would require a parallel set of WGSL kernels (LoRA-aware matmul, restricted optimizer step) for a duplicate-of-Python win. The Python reference at python_ref/lora.py stays as the canonical implementation.

14

Quantized inference (4-bit / 8-bit)

Withdrawn — moved to native

Folded into the quantization + LoRA bundle (lever 19), which lands with the native macOS app (lever 20), where the quantization libraries are mature (MLX, llama.cpp-style GGUF). Int4/int8 in WGSL is doable but the test surface is large (every kernel needs a quantized variant plus parity tests) and the immediate need isn't sharp: Behemoth-class models already fit via Memory64 in browser. The win-to-effort ratio is poor relative to spending the same weeks on the native path.

15

Muon optimizer

Withdrawn — out of scope

A drop-in Newton-Schulz orthogonalisation of matrix-shaped gradients before stepping. Empirically matches AdamW in fewer steps. Skipped here because faster convergence isn't the project's constraint; readability and per-step compute are. Listed in docs/feature_ideas.md for whenever someone wants to port it as a contributor experiment.
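For readers unfamiliar with the orthogonalisation step, here is a minimal sketch of the classic cubic Newton-Schulz iteration in plain Python. This is the textbook variant, chosen for clarity; Muon implementations typically use a tuned polynomial and far fewer steps, and nothing below is taken from this project's code.

```python
def matmul(x, y):
    """Plain dense matrix product on lists of lists."""
    return [[sum(x[i][p] * y[p][j] for p in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def transpose(x):
    return [list(col) for col in zip(*x)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g with X <- 1.5 X - 0.5 X X^T X."""
    # Scale by the Frobenius norm so every singular value starts at or below 1,
    # which keeps the iteration inside its convergence region.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


grad = [[2.0, 0.0, 1.0], [1.0, 3.0, 0.0], [0.0, 1.0, 4.0]]
q = newton_schulz_orthogonalize(grad)
```

Each iteration pushes the singular values of the matrix toward 1 while leaving its singular vectors unchanged, so the update direction keeps its "shape" but loses its scale.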

16

WASM Relaxed SIMD

Shipped — free uplift on the CPU path

Flags: -msimd128 -mrelaxed-simd
In: wasm/build_wasm.sh + build_wasm64.sh

Relaxed SIMD adds newer ops (FMA, dot products, relaxed rounding) that the older -msimd128 set leaves on the table. Both the 32-bit and 64-bit WASM builds enable -mrelaxed-simd; the compiler emits the new opcodes where they help and falls back to baseline SIMD otherwise. Runtime requirements: Chrome 114+, Firefox 120+, Safari 18.4+, which is anywhere our COOP/COEP'd pthread build already runs.

Verified by re-running tests/test_wasm64_xl_node.mjs after the rebuild: same losses (5.57 → 3.04 over 5 XL steps), within run-to-run variance on per-step time. Free correctness; modest opportunistic speedup on the kernels the compiler decides to vectorise more aggressively.

17

Pre-trained model gallery

v1 shipping · 4 models

v1: Shakespeare · TinyStories · Python · Recipes
Hosting: bundled in browser/public/gallery/

A Load from gallery button on the Setup screen opens a dialog with 4 cards: same architecture (12L, d=256, ctx=256, ~9.6M params), four different corpora. Each shows a sample, params, training loss, and one click loads the .tinygpt file through the same path as model upload.

All four models were trained in this browser via browser/train_gallery_one.mjs (Playwright drives a real Chromium tab against the dev server, 5000 steps, WebGPU). v1 is bundled in browser/public/gallery/; manifest.json drives the dialog so future entries drop in without code changes. v2 will move to R2 (≈ $0.015/GB-month, zero egress to Pages) once we cross the bundled-asset budget.

Pairs naturally with lever 18 (diverse data structures) — each gallery entry demonstrates a different kind of pattern the same architecture can pick up.

18

Diverse data structures — tables, songs, books, code

Planned

Impact: shows the architecture learning structure, not just words
Pairs with the gallery

The playground accepts any UTF-8 text today, but the bundled demo and most curated corpora are English prose. What the architecture can actually learn isn't just language — it's any pattern with local + long-range structure. Worth demonstrating with a handful of materially different source types:

Tabular data (CSV / Excel-exported sheets). The model learns the row/column rhythm — commas and newlines in the right places, repeated header tokens, value-range patterns per column.
Songs (lyrics + chord sheets). Verse/chorus repetition, line length conventions, the way a chord line sits above a lyric line.
Full books. Chapter structure, dialogue attribution, prose vs. dialogue cadence. Already partway there with the Shakespeare demo, but a long-form prose book exposes the model's handling of paragraph- and chapter-scale structure.
Code. Indentation, balanced brackets, function signatures, the convention that def is followed by a colon and an indented block. A test of the model's hierarchical reasoning.

Each one is a 30-min-to-few-hours training run. The interesting move isn't the data plumbing (which already works — just paste UTF-8 or pick a Hugging Face dataset); it's the presentation: each gallery card shows side-by-side input format vs. generated output, so the visitor sees that the same architecture picked up the structure of whatever it was fed.

19

Quantization + LoRA fine-tuning

Planned · lands with the Mac app

Three capabilities, one bundle
Cheapest implementation: native (MLX); see lever 20

Three feature buckets that genuinely belong together — each one multiplies the value of the other two — and that all benefit from the same native ML primitives:

bf16 / fp16 weights and activations. Halves the memory footprint at training and inference time. First-class on Apple Silicon, with hardware accelerators. The in-browser fp16-packed experiment showed no compound gain (lever 10b); native is where it actually pays.
int8 / int4 quantized inference. Load a quantized checkpoint, sample without dequantising the full weight matrix. Lets the 9.6M-param Huge model ship at < 10 MB and lets much bigger models run sample-only without exceeding heap. MLX + GGUF-style formats handle this off-the-shelf.
LoRA fine-tuning. Load a base checkpoint, freeze the weights, train low-rank adapter matrices on your own corpus. The Python reference already supports this (docs/lora_guide.md); the hook is "load a gallery model → click Fine-tune → watch the voice shift in a few minutes." Pairs directly with lever 17.

Doing these in the WebGPU path was previously listed as three separate planned items. The honest read: each one would mean a parallel set of WGSL kernels (quantized matmul, LoRA-aware matmul, fp16 variants of every op), parity tests against the fp32 reference, and a doubled test matrix. Months of work for capability that the native ML framework gives in days. Better to ship them together on the native side first; back-port to in-browser only if real usage demands it.

20

Native macOS app (MLX-Swift)

Planned · biggest single new project

Impact: ~30× training throughput on the same machine
File format: .tinygpt portable both ways

A native macOS app — SwiftUI shell, MLX-Swift training loop. Same model architecture, same .tinygpt file format. Train on your Mac, drop the checkpoint into the browser playground anywhere; load a browser-trained model into the Mac app to continue training at much higher throughput.

How much faster — and why. Same Huge model (12L, d=256, ctx=256, 5000 steps) on the same M-series Mac:

WebGPU (today): ~60 min · baseline
MLX (Python): ~5–6 min · ~10×
MLX-Swift + hand-tuned Metal: ~2–3 min · ~20–30×
+ Apple Neural Engine for inference: sampling ~30–50× faster

The compounding wins: unified memory (no upload/download tax, which WebGPU pays on every buffer round-trip); Metal Performance Shaders' GEMM kernels tuned per Apple GPU generation; ~3× the effective memory bandwidth (M3 Max: 400 GB/s vs WebGPU's ~150 GB/s effective); ANE (16-core, ~18 TOPS on M3 and ~38 TOPS on M4) which WebGPU cannot reach; async compute queues overlapping compute + memcpy.

The architecture also lifts the parameter ceiling. Browser caps out around ~10M params before user experience falls apart (training-time budget, memory, GPU sharing with the compositor). MLX-Swift can comfortably train 100M–1B+ models on the same laptop — same code path, just a faster runtime.

The boundary between what belongs in-browser and what belongs native is in docs/shared_vs_native.md. Together with the gallery (lever 17), the diverse-data milestone (lever 18), and the quant + LoRA bundle (lever 19, which lands here), this is the remaining work.

21

Browser-side weight quantization

Shipped — 4-bit gallery variants, ~4× smaller download

Gallery file size: ~19 MB → ~5 MB per model
Total gallery download: ~75 MB → ~20 MB
Lives in: browser/finalize_gallery_int4.mjs · expandInt4WeightsOnly in main.ts

Storage-side 4-bit quantization for the gallery models. Block-wise symmetric scheme (block size 64, one fp16 scale per block, two int4 indices packed per byte — same shape as GGUF Q4_0). Conversion runs offline via finalize_gallery_int4.mjs against the existing fp16 .bin files; the browser dequantizes block-wise back to fp32 once at load time and hands the canonical layout to the existing WASM importer, which doesn't know anything changed. The fp16 files keep shipping alongside as the fallback.
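A minimal round-trip sketch of a block-wise symmetric 4-bit scheme with the parameters named above (block size 64, one fp16 scale per block, two codes per byte). The code range, nibble order and rounding here are assumptions made for illustration; they are not taken from finalize_gallery_int4.mjs and are not claimed to be byte-compatible with it or with GGUF.

```python
import struct

BLOCK = 64  # weights per block, one fp16 scale each


def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int4(weights):
    """Return (scales, packed): one fp16 scale per block, two 4-bit codes per byte."""
    assert len(weights) % BLOCK == 0, "sketch assumes whole blocks"
    scales, packed = [], bytearray()
    for start in range(0, len(weights), BLOCK):
        block = weights[start:start + BLOCK]
        # Fall back to 1.0 for an all-zero block to avoid dividing by zero.
        scale = to_fp16(max(abs(w) for w in block) / 7.0) or 1.0
        scales.append(scale)
        # Symmetric codes in [-7, 7], stored with a +8 offset so they fit in a nibble.
        codes = [max(-7, min(7, round(w / scale))) + 8 for w in block]
        for lo, hi in zip(codes[0::2], codes[1::2]):
            packed.append(lo | (hi << 4))
    return scales, bytes(packed)


def dequantize_int4(scales, packed):
    """Expand back to floats block by block, as a one-off load-time step would."""
    out = []
    for i, byte in enumerate(packed):
        scale = scales[(2 * i) // BLOCK]
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out


def bits_per_weight():
    """4 bits per code plus a 16-bit scale amortised over the block."""
    return 4 + 16 / BLOCK
```

At 4.25 bits per weight, a ~9.6M-param model comes to roughly 5.1 MB of weight data, which matches the ~5 MB per gallery file quoted above.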

Why not int4 GEMM on the GPU? Apple's WebGPU implementation lacks the integer matmul intrinsics that would make on-device int4 compute a win — the actual int4 multiply would still decode each weight before fma. Trying to do it in WGSL ends up slower than dequantizing once at load time and running the existing fp16-storage matmul. The honest goal here is download size + cold-load time, not inference speed: cold-start drops by 75% on the gallery path.

Quality. Gated by two checks: (1) a startup numerics gate (runInt4NumericsGate) that synthesizes a representative weight matrix and verifies the round-trip error fits the inherent 4-bit envelope, which catches catastrophic bugs (wrong block size, endianness) without enforcing an unrealistic per-element bound; (2) the browser/smoke_int4.mjs Node-side check that the published files round-trip with bounded per-tensor drift. End-to-end: the Shakespeare model still generates Shakespeare (see smoke_int4_browser.mjs).

Numbers: 4 gallery models × ~19 MB fp16 → ~5 MB int4 each. Gate runs in <1 ms on synthetic data; conversion runs in ~1 s per model offline. The cache key is the filename, so int4 and fp16 cache independently in OPFS.

Not done: True int4 GEMM (the speed win) still depends on hardware support that's not in WebGPU as of 2026. When the cooperative_matrix extension lands with int8/int4 types, the same files become a compute win too; see lever 22.

22

Browser frontier — tech we're tracking

Watching · revisit when stable

Theme: experimental web tech that could narrow the native gap
None ships today: each is parked behind a flag or an unfinished spec

A few in-browser speedups exist on the frontier but aren't ready for production users in 2026. Logged here so future-us knows exactly where to look when the spec or implementation lands.

WebGPU cooperative matrix (wmma / tensor-core mapping)

What: a WGSL extension exposing matrix-multiply-accumulate hardware (NVIDIA tensor cores, AMD MFMA, Apple AMX).
Expected gain: ~3–5× on NVIDIA / AMD; ~1.3× on Apple (Apple's AMX is less directly reachable via this path).
Status (May 2026): behind chrome://flags/#enable-unsafe-webgpu + --enable-features=WebGPUExperimentalFeatures. API still moving.
Skip rationale: Apple gain is the small case for our M-series target. NVIDIA gain doesn't help most browser users. Won't ship until API stabilises and Apple win improves.

WebNN — route to OS NN runtimes (CoreML / DirectML)

What: a Web API that hands neural-network graphs to the operating system, which then runs them on CoreML (Apple) / DirectML (Windows) / TFLite (Android). Can route to the Apple Neural Engine.
Expected gain: 3–5× on Apple / Windows for inference; ~2× for training where supported.
Status (May 2026): Chrome 126+ behind chrome://flags/#enable-webnn-api. Training support is minimal; mostly inference. API changing monthly.
Skip rationale: training-side coverage is still thin. Revisit when CoreML/DirectML backends support backward passes and the API freezes.

Async compute queues — overlap compute and memcpy

What: WebGPU lets you create multiple queues; submit copy commands on one while compute runs on another. Hides upload/download latency.
Expected gain: ~1.2–1.5×, mostly during sampling and the small handful of CPU-touching steps.
Status: supported today.
Skip rationale: our hot path is GPU-bound, not transfer-bound — once the first batch is uploaded, training never crosses the boundary. Worth picking up if we ever stream training data dynamically (e.g., gallery-scale corpora that don't fit in tab memory).

Hand-tuned WGSL kernels per matmul shape

What: the current matmul kernel is one general-purpose blocked variant. Writing per-shape kernels (e.g., one specifically for the 4 × dModel × ctx attention output, one for the dMlp × dModel MLP), each with workgroup sizes tuned to that shape, would close the gap to the theoretical ceiling.
Expected gain: ~1.5–2× on the matmul path.
Status: pure engineering — no spec dependency.
Skip rationale: weeks of work for a modest gain when the Mac app gets us 20–30× by switching runtimes entirely. We'll only come back here if Apple Neural Engine access to the web never materialises and we're still on WebGPU long-term.

If we ever stacked all of these

Stacked gains compound, but with diminishing returns on top of today's shipped stack: coop-matrix (1.3× on Apple) × per-shape kernels (1.7×) × async queues (1.2×) ≈ 2.65× over today. Still an order of magnitude short of the Mac app's ~30×. That's the structural fact behind lever 20.