The performance journey
roadmap
A live ledger of every lever that would make in-browser training of TinyGPT
faster — what's shipped, what's blocked, what's open, and the honest reason
each is in the state it's in. ~80% of training time is matmul;
most of these levers attack it from a different angle.
Shipped — running today
Partial — exists but unverified
Blocked — external constraint
Open — not started
●
Benchmark log
Measured
Machine: Apple M-series
Build: emcc -O3 -msimd128
Driver: tests/bench_wasm.mjs
Every number in this log was measured on this machine, not extrapolated.
The numbers below are ms per training step at batch 16/12/8 on the
multi-threaded WASM-SIMD build, which is the current shipped baseline:
| Preset | Params | d_model | ctx | ms/step | tok/s |
| Small | 0.37M | 96 | 64 | 101 | 10,116 |
| Medium | 0.84M | 128 | 96 | 357 | 4,305 |
| Large | 2.74M | 192 | 128 | 1,191 | 1,289 |
| XL | 6.42M | 256 | 128 | 1,851 | 553 |
(Previously, single-threaded SIMD: ~2× slower across the board. See lever 3.)
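The tok/s column can be cross-checked against ms/step. A minimal sketch, assuming tok/s = batch × ctx ÷ step time. The per-preset batch sizes (16, 16, 12, 8) are inferred from the table rather than read from the benchmark driver, and the helper name is illustrative:

```typescript
// Hypothetical helper: throughput implied by one training step.
// Assumes every step processes batch * ctx tokens.
function tokensPerSecond(batch: number, ctx: number, msPerStep: number): number {
  return (batch * ctx) / (msPerStep / 1000);
}

// Rows from the table above. Batch sizes are an inference, not a measurement.
const presetRows = [
  { preset: "Small", batch: 16, ctx: 64, msPerStep: 101, reported: 10116 },
  { preset: "Medium", batch: 16, ctx: 96, msPerStep: 357, reported: 4305 },
  { preset: "Large", batch: 12, ctx: 128, msPerStep: 1191, reported: 1289 },
  { preset: "XL", batch: 8, ctx: 128, msPerStep: 1851, reported: 553 },
];

for (const row of presetRows) {
  const derived = tokensPerSecond(row.batch, row.ctx, row.msPerStep);
  console.log(`${row.preset}: ${derived.toFixed(0)} tok/s derived vs ${row.reported} reported`);
}
```

Under those assumptions every derived value lands within 1% of the reported column; the residual is consistent with ms/step having been rounded for display.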
For WebGPU on the same hardware, the speedup is a
scaling curve, not a flat ratio. End-to-end,
measured via tests/test_webgpu_train.mjs:
Small 2.6×, Medium 6.8×, Large 9.3×, XL 12.1× vs
the multi-threaded WASM SIMD baseline above. The curve trends
upward because GPU work amortises better with model size — the
speed-evolution chart below shows the cumulative picture.
How to reproduce
bash wasm/build_wasm.sh && node tests/bench_wasm.mjs
from the repo root. Reports ms/step per preset, both forward and backward.
●
Speed evolution — across the preset curve, normalized to scalar baseline
Measured + extrapolated
Baseline: 1× = single-threaded scalar WASM
Reading: each bar is the cumulative speedup over baseline
| Stage | Cumulative speedup | Notes |
| scalar baseline | 1.0× | measured |
| + WASM SIMD | 1.6× | measured |
| + multi-thread (4 workers) | 3.2× | measured |
| + WebGPU full stack, Small (d=96) | ~8.3× | 2.6× over WASM-SIMD-mt · 1.1% drift |
| + WebGPU full stack, Medium (d=128) | ~22× | 6.8× over WASM-SIMD-mt · 1.4% drift |
| + WebGPU full stack, Large (d=192) | ~30× | 9.3× over WASM-SIMD-mt · 1.9% drift |
| + WebGPU full stack, XL (d=256) | ~39× | 12.1× over WASM-SIMD-mt · 2.5% drift · top measured |
| Mega / Behemoth (projected) | ≥ 15× | projected · blocked by Memory64 ABI bug (task #66) |
Entries marked measured are end-to-end numbers on this
codebase (multi-thread WASM SIMD vs full WebGPU stack: blocked4 +
vec4 + workgroup-shared reductions + FA2 fwd+bwd). Speedup grows with
d_model because the blocked matmul kernel's win
scales with matmul size. The Mega / Behemoth entry is projected
from kernel-level measurements; the in-browser Memory64 ABI bug
at d_model ≥ 256 (task #66) currently blocks an honest
end-to-end number for Mega and Behemoth.
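The cumulative figures follow from multiplying the stage-over-stage ratios. A small sketch of that arithmetic, using only ratios quoted on this page (the function and variable names are illustrative):

```typescript
// Cumulative speedup over scalar WASM = product of the stage-over-stage ratios.
function cumulativeSpeedup(stageRatios: number[]): number {
  return stageRatios.reduce((acc, ratio) => acc * ratio, 1);
}

const simdOverScalar = 1.6;  // lever 1
const threadsOverSimd = 2.0; // lever 3, ~2x measured
const cpuStack = cumulativeSpeedup([simdOverScalar, threadsOverSimd]);

// WebGPU ratios are quoted against the multi-threaded WASM SIMD build.
const webgpuOverThreaded = [
  { preset: "Small", ratio: 2.6 },
  { preset: "Medium", ratio: 6.8 },
  { preset: "Large", ratio: 9.3 },
  { preset: "XL", ratio: 12.1 },
];

for (const entry of webgpuOverThreaded) {
  const total = cumulativeSpeedup([cpuStack, entry.ratio]);
  console.log(`${entry.preset}: ~${total.toFixed(1)}x over scalar`);
}
```

Because the ~2× threading ratio is itself rounded, the cumulative values are best read as approximate.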
orthogonal lever
Memory64 doesn't appear as a bar because it lifts the
model-size ceiling, not training throughput. At fixed Small-preset
size it's a no-op — but it's the only thing that lets the whole optimised
pipeline run on a 473M-param model in the first place (a config that
hard-OOMs the 32-bit WASM build).
1
WebAssembly SIMD in the matmul inner loop
Shipped
Impact: ~1.6× per project notes
Lives in: wasm/src/matmul.cpp
The C++ matmul is compiled twice — once scalar, once with -msimd128.
With SIMD on, each multiply in the inner loop operates on four f32 lanes instead of one.
docs/performance.md reports ~1.6×; current build is SIMD by default
(the numbers in the Benchmark log above are SIMD-on).
The page's "WASM SIMD" pill at top shows whether your browser actually loaded
the SIMD build. All Chromium-family browsers and Safari 16.4+ do.
Why now
Smallest cost / biggest immediate win.
Doesn't change any maths, just generates better machine code.
See docs/performance.md.
2
WebGPU forward + backward + AdamW
Shipped · 2.6×–12.1× curve
Measured: 2.6× → 12.1× across Small → XL on M-series
Lives in: webgpu/
The full training loop runs on the GPU — all 24 kernels written in WGSL,
every one finite-difference and parity-checked against the WASM reference.
Correct end to end.
Measured curve: Small 2.6×, Medium 6.8×, Large 9.3×, XL
12.1× over the multi-threaded WASM SIMD baseline, via
tests/test_webgpu_train.mjs. Loss drift 1.1%–2.5% across
the curve — float-reorder noise. The earlier single-preset "~7× on
Small" number predated the multi-threaded WASM baseline and is
withdrawn.
What's next
Benchmark on NVIDIA + Intel iGPU + Snapdragon to build a per-hardware
table. Then make WebGPU the default backend when available.
3
Multi-threaded WebAssembly
Shipped · ~2× measured
Measured: ~2× across all preset sizes
Lives in: wasm/src/matmul.cpp · wasm/build_wasm.sh
matmul_forward and matmul_backward now split the
M dimension across CPU cores via std::thread. Each thread takes
a contiguous row slice; outputs don't overlap so no locks. The dB path is
the exception — it accumulates over M, so we use per-thread scratch and a
final reduction. Threading only kicks in when M ≥ 64.
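A sketch of the partitioning logic described above, written in TypeScript for illustration rather than taken from the C++ in wasm/src/matmul.cpp. The function names and the way leftover rows are distributed are assumptions; only the M ≥ 64 threshold and the scratch-then-reduce shape of the dB path come from the text. The "workers" here run one after another, so this shows the data layout, not real concurrency.

```typescript
interface RowSlice { start: number; end: number } // half-open [start, end)

// Split M rows into contiguous, non-overlapping slices, one per worker.
function splitRows(m: number, workers: number, minRowsForThreads = 64): RowSlice[] {
  if (m < minRowsForThreads || workers <= 1) return [{ start: 0, end: m }];
  const base = Math.floor(m / workers);
  const extra = m % workers;
  const slices: RowSlice[] = [];
  let start = 0;
  for (let w = 0; w < workers; w++) {
    const rows = base + (w < extra ? 1 : 0);
    if (rows === 0) continue;
    slices.push({ start, end: start + rows });
    start += rows;
  }
  return slices;
}

// dB[k][n] = sum over rows of A[row][k] * dC[row][n]. Every slice contributes to
// every element, so each worker fills its own scratch buffer and a final pass sums them.
function dBWithScratch(
  a: Float32Array, dC: Float32Array, m: number, k: number, n: number, workers: number,
): Float32Array {
  const scratch = splitRows(m, workers).map((slice) => {
    const partial = new Float32Array(k * n);
    for (let row = slice.start; row < slice.end; row++)
      for (let kk = 0; kk < k; kk++)
        for (let nn = 0; nn < n; nn++)
          partial[kk * n + nn] += a[row * k + kk] * dC[row * n + nn];
    return partial;
  });
  const dB = new Float32Array(k * n);
  for (const partial of scratch)
    for (let i = 0; i < dB.length; i++) dB[i] += partial[i];
  return dB;
}
```

The forward and dA paths need no scratch because each worker writes only its own rows of the output.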
The pthread WASM build requires SharedArrayBuffer, which requires
cross-origin isolation. The _headers file sets COOP/COEP for
Cloudflare Pages; vite.config.ts mirrors it for the dev server.
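For reference, cross-origin isolation comes down to two response headers. The sketch below shows a common pair of values as a plain object, plus a guard around the browser's `crossOriginIsolated` flag. How the project's vite.config.ts and _headers file actually spell this is not shown here, and `canUseThreadedBuild` is a hypothetical name.

```typescript
// A common pair of header values that opts a page into cross-origin isolation.
const isolationHeaders: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "require-corp",
};

// In a Vite config these would typically sit under `server.headers`.
const devServerConfig = { server: { headers: isolationHeaders } };

// Hypothetical guard: only choose the pthread build when shared memory is usable.
function canUseThreadedBuild(env: { crossOriginIsolated?: boolean; SharedArrayBuffer?: unknown }): boolean {
  return env.crossOriginIsolated === true && typeof env.SharedArrayBuffer === "function";
}
```

In a browser the guard would be called with `globalThis`; when it returns false, a loader could fall back to the single-threaded build.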
| Config | d_model | 1-thread | Threaded | Δ |
| Small | 96 | 190 ms | 101 ms | +88% |
| Medium | 128 | 693 ms | 357 ms | +94% |
| Large | 192 | 2397 ms | 1191 ms | +101% |
| XL | 256 | 3797 ms | 1851 ms | +105% |
Why only 2×, not 4-8×: the workload is memory-bandwidth
bound past ~2 threads. Each matmul reads the entire B matrix; that's the
shared bottleneck. Adding cores past the bandwidth limit gives diminishing
returns. The measurements above are consistent with this explanation.
4
Tiled blocked matmul (cache-aware)
Tried · reverted (no measured win)
Measured: net wash across tested sizes
Lives in: wasm/src/matmul.cpp
Tiled matmul (Tm=32, Tn=64, Tk=32 blocks) was implemented and benchmarked
against the baseline on the same single-threaded WASM-SIMD build. The result:
| Config | d_model | Baseline | Tiled | Δ |
| Small | 96 | 190 ms | 196 ms | -3% |
| Medium | 128 | 693 ms | 690 ms | ±0% |
| Large | 192 | 2397 ms | 2248 ms | +6.7% |
| XL | 256 | 3797 ms | 3990 ms | -5% |
Why the theoretical prediction (1.5-2×) didn't materialise here:
the baseline matmul's inner loop is a fixed-bound for n in 0..N
that emcc -O3 -msimd128 aggressively autovectorises into f32x4
FMA chains. The tiled variant introduces variable-bound inner loops
(for n in n0..n1) that the autovectoriser handles less cleanly,
so the SIMD win shrinks just as the cache win arrives. Net: wash.
What would change this
A hand-written SIMD inner kernel with statically known
tile sizes (32×4 SIMD micro-kernel + scalar epilogue) — the BLIS approach.
That's ~2 days of careful work, vs the 50-line tiled patch tried here.
5
Mixed-precision weights (fp16 / bfloat16)
Withdrawn — deferred to native
Why not pursued in-browser
The in-browser experiment with fp16-packed weights on top of tiled
matmul (lever 10b) showed
no compound win — once tiling has amortised global-memory
traffic, halving bandwidth has nowhere left to help. A full
mixed-precision refactor (loss scaling, fp16 gradient accumulators,
fp16 variants of every kernel) would be multi-week work for a marginal
return on the sizes we run in browser.
Where this belongs instead: the native macOS app
(lever 20, with the bf16 / fp16 work bundled in lever 19). MLX-Swift
supports bf16 natively, with hardware acceleration on Apple Silicon.
The training loop is the same algorithm; the native host is the right
place to spend the complexity.
6
Flash Attention
Shipped as FA2 (see lever 12)
Standard attention materialises an N×N score matrix in memory; Flash Attention
computes it in tiles so the full matrix never exists — saving memory and beating
naïve attention on speed by avoiding HBM round-trips.
What changed: with the new Huge/Massive/Mega presets, ctx now goes
to 256–512 — the regime where attention's share of step time goes from ~12% (ctx 64)
to ~40% (ctx 256) to ~55% (ctx 512). At Mega (ctx 512), the score matrix at
B=2, H=12, fp32 is ~25 MB per attention call — starting to hit WebGPU buffer pressure.
Estimated impact, today: ~1.18× on Massive, ~1.7× on Mega. The
ctx=512 preset is the first where Flash Attention becomes the highest-ROI open lever.
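The ~25 MB figure can be checked directly. A minimal sketch, assuming one fp32 score per (batch, head, query, key) entry; the function name is illustrative:

```typescript
// Bytes needed to materialise the full attention score matrix.
function scoreMatrixBytes(batch: number, heads: number, ctx: number, bytesPerElement = 4): number {
  return batch * heads * ctx * ctx * bytesPerElement;
}

const bytesAtCtx512 = scoreMatrixBytes(2, 12, 512);
const bytesAtCtx256 = scoreMatrixBytes(2, 12, 256);
console.log(`ctx 512: ${(bytesAtCtx512 / 1e6).toFixed(1)} MB`);
console.log(`ctx 256: ${(bytesAtCtx256 / 1e6).toFixed(1)} MB`);
```

The cost is quadratic in ctx, so each doubling of context quadruples the buffer.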
Cost
A new kernel from scratch — tiled forward + tiled backward in
WGSL, plus finite-difference parity tests against the existing naïve attention.
~1–2 weeks of focused work.
7
Kernel fusion (forward + loss + backward + AdamW)
Withdrawn — readability tax outweighs the win
Real lift available: ~1.1–1.3× on small models only
Fusing forward + loss + backward + AdamW into one mega-kernel would
kill some memory traffic — but the project's stated principle is
"every layer can be understood." Fused kernels are notoriously
unreadable. The right move would be a second build target ("fast"
alongside "readable") which doubles maintenance for a modest gain.
On bigger models, matmul dominates anyway and dispatch overhead is
in the noise. With blocked4 + FA2 shipped, the speedup curve already
climbs from 2.6× (Small) to 12.1× (XL) — the structural wins are
spent. Some fusion does exist where readability wasn't a sacrifice:
AdamW + grad clip ride in one pass, residual + layernorm chain
without intermediate writeback.
8
Local Python with CUDA / Apple MPS
Shipped — escape hatch
Impact: 50–100×
Lives in: python_ref/
The Python reference runs the same model on PyTorch with full GPU support.
On an M5 Pro: 10M-param model trains at ~24 s per 1,000 steps. Practical
iteration speed for real models.
This is the right answer for anyone serious — the in-browser path is for
learning the mechanics, not training the next ChatGPT. The Diagnostics
section of the playground has the three commands you need.
9
WebAssembly Memory64 — break the 4 GB tab ceiling
Shipped · partial (browser ABI bug above ~250MB heap)
Impact: ~2× model size (in fp32)
Needs: tinygpt64.{js,wasm}
V8 caps each tab's WASM heap near 4 GB on 32-bit pointers — that's
~250M fp32 params with Adam state, full stop. Memory64
(-sMEMORY64=1 -sWASM_BIGINT) switches the module to
64-bit pointers and lifts the cap into the tens of GB on Chromium 133+.
Measured (when called directly): 473M-param model
(~5.6 GB heap with Adam state) allocates cleanly in Node — a config
that hard-OOMed on the 32-bit module.
Caveat (task #66): the 64-bit module's JS↔WASM ABI
wasn't being exercised by the existing
tests/bench_wasm.mjs (which loads the 32-bit module),
so a cwrap pointer-conversion bug shipped — _malloc
returns Number but pointer args expect BigInt. The browser was
calling into this broken bridge for XL and bigger; the loader now
falls back to the 32-bit module for XL/Massive/Mega/Behemoth
in-browser. Reproducer: tests/test_wasm64_xl_node.mjs.
Lesson captured in
docs/lessons.md.
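This class of bug is easy to reproduce in isolation: a pointer is a Number on one side of the bridge and a BigInt on the other. A hedged sketch of a normalising layer follows. `toPtrArg` and `fromPtrResult` are hypothetical names, not functions from the Emscripten runtime or from this codebase.

```typescript
type WasmPtr = number | bigint;

// Convert a pointer to the representation the target module expects:
// BigInt for a 64-bit (Memory64) module, Number for a 32-bit one.
function toPtrArg(ptr: WasmPtr, is64Bit: boolean): WasmPtr {
  if (is64Bit) return typeof ptr === "bigint" ? ptr : BigInt(ptr);
  const asNumber = Number(ptr);
  if (asNumber > 0xffffffff) throw new RangeError("pointer does not fit a 32-bit module");
  return asNumber;
}

// Convert a returned pointer to a Number so it can index typed-array heap views.
function fromPtrResult(ptr: WasmPtr): number {
  const asNumber = Number(ptr);
  if (!Number.isSafeInteger(asNumber) || asNumber < 0) {
    throw new RangeError("pointer outside the safe integer range");
  }
  return asNumber;
}
```

Routing every pointer through one pair of helpers like this would let a single test exercise both the 32-bit and 64-bit modules.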
9b
Thread-blocked matmul (4×4 register block)
Kernel measured · biggest single kernel lever
Impact: ~5.2× over naive WebGPU matmul at 2048³ (measured)
Lives in: webgpu/matmul_blocked.wgsl
Stacks two well-known wins. (a) Same 16×16 workgroup-shared
tiling as lever 10, plus (b) each of the 256 threads computes a
4×4 block of output values held in registers. Workgroup outputs
a 64×64 tile; each shared-memory load gets reused 4× across the
thread's register accumulator via outer-product structure.
Per inner-loop step each thread issues 16 fused multiply-adds
from 8 shared-memory loads, versus 1 from 2 in the plain tiled
kernel, which pushes the matmul toward compute-bound rather than memory-bound.
Measured on M-series WebGPU:
| matmul size | naive ms | tiled ms | blocked ms | vs naive |
| 256³ | 0.66 | 0.72 | 0.45 | 1.48× |
| 512³ | 1.96 | 0.86 | 0.64 | 3.04× |
| 1024³ | 6.43 | 2.85 | 1.80 | 3.58× |
| 2048³ | 47.24 | 17.23 | 9.12 | 5.18× |
Speedup grows with matrix size because bigger problems amortize
workgroup-shared-memory loading more effectively across the 4×4
register reuse. At 2048³ (the kind of shape that shows up in
Mega and Behemoth presets) the kernel runs 5.18× faster than
the naive version and 1.89× faster than the merely-tiled one.
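The register-blocking idea can be shown on the CPU. This is a TypeScript sketch of the 4×4 outer-product structure, not the WGSL in webgpu/matmul_blocked.wgsl. It skips the workgroup-shared tile stage, assumes square row-major matrices with n divisible by 4, and counts operand loads so the reuse is visible.

```typescript
interface BlockedResult { c: Float32Array; loads: number; fmas: number }

// C = A * B for row-major n x n matrices, one 4x4 output block at a time.
function matmulBlocked4(a: Float32Array, b: Float32Array, n: number): BlockedResult {
  const c = new Float32Array(n * n);
  const acc = new Float64Array(16); // stands in for 16 accumulator registers
  const aReg = new Float64Array(4);
  const bReg = new Float64Array(4);
  let loads = 0;
  let fmas = 0;
  for (let i0 = 0; i0 < n; i0 += 4) {
    for (let j0 = 0; j0 < n; j0 += 4) {
      acc.fill(0);
      for (let k = 0; k < n; k++) {
        for (let r = 0; r < 4; r++) { aReg[r] = a[(i0 + r) * n + k]; loads++; }
        for (let col = 0; col < 4; col++) { bReg[col] = b[k * n + j0 + col]; loads++; }
        // Outer product: 8 loaded values feed 16 multiply-adds.
        for (let r = 0; r < 4; r++) {
          for (let col = 0; col < 4; col++) { acc[r * 4 + col] += aReg[r] * bReg[col]; fmas++; }
        }
      }
      for (let r = 0; r < 4; r++) {
        for (let col = 0; col < 4; col++) c[(i0 + r) * n + j0 + col] = acc[r * 4 + col];
      }
    }
  }
  return { c, loads, fmas };
}
```

A one-output-per-thread loop performs one multiply-add per two loads, so the 4×4 block does four times the arithmetic per load, which matches the 4× reuse described above.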
Open
Drop-in replacement for the naive matmul
in train.wgsl: same bind-group layout, and the output
matches the naive kernel up to floating-point reordering.
Pipeline integration is the next item.
9c
8×8 register block — tried, lost to 4×4
Honest result · register spill / lower occupancy
Impact: 0.60–0.91× of blocked4 across tested sizes (measured)
Lives in: webgpu/matmul_blocked8.wgsl
Tried scaling the register block up from 4×4 to 8×8 (workgroup
output tile 128×128 instead of 64×64). Hypothesis was that 4× the
arithmetic intensity per shared-memory load would translate to
~1.5× more speedup on top of blocked4. Lost at every
size:
| matmul size | blocked4 ms | blocked8 ms | ratio |
| 256³ | 0.33 | 0.55 | 0.60× |
| 512³ | 0.54 | 0.75 | 0.72× |
| 1024³ | 1.78 | 1.96 | 0.91× |
| 2048³ | 10.15 | 11.52 | 0.88× |
Most likely cause: 64 floats per thread for the accumulator
exceeds the per-thread register budget on Apple GPUs, forcing
register spill into local memory and tanking effective compute
throughput. Lower workgroup occupancy (16 KB shared per
workgroup vs 4 KB) compounds it, leaving fewer concurrent workgroups
per GPU core. Kept in the codebase as a documented negative
result. Same lesson as f16-vs-tiled: more aggressive is
not always faster; benchmark every variant.
10
Tiled matmul (workgroup-shared memory)
Kernel measured · superseded by blocked
Impact: ~2.5× over naive WebGPU matmul (measured)
Lives in: webgpu/matmul_tiled.wgsl
Classic 16×16 tiled matmul using var<workgroup>
shared memory (textbook Goto / van de Geijn pattern). Each workgroup
of 16×16 threads cooperatively loads A's and B's 16×16 blocks
into shared memory, then each thread does 16 multiply-accumulates
from shared. Effectively turns 16 global reads into 1 global +
16 shared, which on big matmuls is where the GPU starts looking
like a GPU.
Measured on M-series WebGPU, dispatch-only timing:
| matmul size | naive ms | tiled ms | speedup |
| 256³ | 0.87 | 0.37 | 2.35× |
| 512³ | 1.74 | 0.64 | 2.72× |
| 1024³ | 6.00 | 2.48 | 2.42× |
| 2048³ | 43.16 | 16.90 | 2.55× |
Clean ~2.5× across every realistic size, peak ~2.7× at 512³.
Parity validated at sizes ≤ 512.
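The "16 global reads into 1" claim can be illustrated with a CPU model that counts loads from the full matrices. This is a TypeScript sketch, not the shader in webgpu/matmul_tiled.wgsl. The scratch arrays stand in for `var<workgroup>` memory, and the code assumes square row-major matrices whose size is a multiple of the tile.

```typescript
interface TiledResult { c: Float32Array; globalLoads: number }

// C = A * B for row-major n x n matrices, assuming n is a multiple of `tile`.
function matmulTiled(a: Float32Array, b: Float32Array, n: number, tile = 16): TiledResult {
  const c = new Float32Array(n * n);
  const aTile = new Float32Array(tile * tile); // stands in for workgroup-shared memory
  const bTile = new Float32Array(tile * tile);
  let globalLoads = 0;
  for (let i0 = 0; i0 < n; i0 += tile) {
    for (let j0 = 0; j0 < n; j0 += tile) {
      for (let k0 = 0; k0 < n; k0 += tile) {
        // Cooperative load: each "thread" fetches one element of A and one of B.
        for (let r = 0; r < tile; r++) {
          for (let s = 0; s < tile; s++) {
            aTile[r * tile + s] = a[(i0 + r) * n + (k0 + s)];
            bTile[r * tile + s] = b[(k0 + r) * n + (j0 + s)];
            globalLoads += 2;
          }
        }
        // Each output element does `tile` multiply-accumulates from the shared tiles.
        for (let r = 0; r < tile; r++) {
          for (let s = 0; s < tile; s++) {
            let sum = c[(i0 + r) * n + (j0 + s)];
            for (let k = 0; k < tile; k++) sum += aTile[r * tile + k] * bTile[k * tile + s];
            c[(i0 + r) * n + (j0 + s)] = sum;
          }
        }
      }
    }
  }
  return { c, globalLoads };
}
```

A naive kernel reads 2·n³ values from the full matrices; this version reads 2·n³ / tile, so a 16-wide tile cuts global reads by 16×.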
Open
Wire into train.wgsl — every
forward + backward matmul uses the tiled kernel. Drop-in: the
bind-group layout is identical to the naive kernel, so only the
pipeline creation needs to point at the new shader source.
10b
f16-packed storage — tried, doesn't compound
Honest result · standalone win swallowed by tiling
Impact: ~1.7× vs naive, but ≤ tiled
Lives in: webgpu/matmul_f16packed.wgsl · matmul_tiled_f16.wgsl
Weights live as packed half-precision (two f16 per u32 via
pack2x16float built-ins), accumulation in f32. The
standalone version beats naive by ~1.7× by halving global
bandwidth. But once tiling is in place the
kernel is no longer bandwidth-bound — it's compute-bound on
shared-memory ops — and halving global bandwidth no longer
helps. The combined tiled+f16 kernel is the same speed
as plain tiled at 1024³ and a touch slower at 2048³ (17.78 ms
vs 16.90 ms).
Lesson: always bench an optimization against
the best baseline, not the naive one. The ~1.7× we
measured earlier was real but not additive — it was a different
way to get the same memory-traffic win that tiling already
captures more thoroughly.
The packed kernel stays in the codebase as a reference + for
cases where the model genuinely needs more total bytes than the
GPU can hold (Behemoth-scale weight buffers), where halving
storage isn't just about speed but about fitting at all.
11
WebGPU subgroups — fast reductions
Withdrawn — not the bottleneck
Real lift: ~1.1–1.2× on softmax/layernorm only
Subgroup intrinsics (subgroupAdd, subgroupMax)
would shave a tree-reduction's log₂(blockDim) passes into one
on softmax/layernorm/attention reductions. The shipped kernels already
use workgroup-shared reductions, which on Apple GPUs are within 10–20%
of what subgroups would deliver — and reductions are 5–8% of total
step time on every preset we ship. The lift on the bottom line is
~1–2%; not worth the added kernel surface area, the
"subgroups" extension gate, or the test plumbing.
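The bottom-line estimate is an Amdahl's-law calculation. A small sketch using the ranges quoted above (reductions at 5–8% of step time, subgroups 10–20% faster on that slice); the function name is illustrative:

```typescript
// Amdahl's law: overall speedup when `fraction` of the runtime gets `localSpeedup` times faster.
function overallSpeedup(fraction: number, localSpeedup: number): number {
  return 1 / ((1 - fraction) + fraction / localSpeedup);
}

const lowEstimate = overallSpeedup(0.05, 1.1);  // reductions 5% of the step, 10% faster
const highEstimate = overallSpeedup(0.08, 1.2); // reductions 8% of the step, 20% faster
const lowPercent = ((lowEstimate - 1) * 100).toFixed(1);
const highPercent = ((highEstimate - 1) * 100).toFixed(1);
console.log(`end-to-end gain: ${lowPercent}% to ${highPercent}%`);
```

With those inputs the end-to-end gain comes out at roughly one percent, the same order as the figure quoted above.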
12
Flash Attention 2 in WGSL
Shipped — fwd + bwd + writeback dropped
Memory saved: O(B·H·T²) per layer per step
End-to-end drift vs WASM: 2.5%
Reference: Dao 2023 (FA2)
Workgroup-cooperative forward — one workgroup per
(batch, head, Q-tile of 16 rows), K and V walked in
blocks of 16, online softmax in registers across K blocks. Default
attention path when hd ≤ 64 (every preset up to
Behemoth). Lives in webgpu/attention_fa2.wgsl.
Backward kernels (attn_dscores_fa2 +
attn_dv_fa2) reconstruct
P = exp(S − L) from q/k and
the saved log-sum-exp instead of reading the cached attn matrix.
That removed the forward's second-pass writeback entirely; on
Mega-class shapes (B=4, H=8, T=512) ~67 MB of global memory
traffic per layer per step now stays on-chip.
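The online-softmax bookkeeping and the log-sum-exp reconstruction can be sketched for a single query row. This is a CPU illustration in TypeScript under stated assumptions (one row, precomputed scores, a flat row-major value array); it is not the code in webgpu/attention_fa2.wgsl, and the names are hypothetical.

```typescript
interface AttnRow { out: number[]; lse: number }

// One query row: `scores` holds q·k for every key, `values` is row-major [keys x dim].
// Keys are visited in blocks; only a running max, a running sum and `dim` accumulators are kept.
function onlineSoftmaxRow(scores: number[], values: number[], dim: number, block = 16): AttnRow {
  let runningMax = -Infinity;
  let runningSum = 0;
  const acc = new Array<number>(dim).fill(0);
  for (let start = 0; start < scores.length; start += block) {
    const end = Math.min(start + block, scores.length);
    let newMax = runningMax;
    for (let j = start; j < end; j++) newMax = Math.max(newMax, scores[j]);
    // Rescale what was accumulated under the old max (the factor is 0 on the first block).
    const rescale = Math.exp(runningMax - newMax);
    runningSum *= rescale;
    for (let d = 0; d < dim; d++) acc[d] *= rescale;
    for (let j = start; j < end; j++) {
      const p = Math.exp(scores[j] - newMax);
      runningSum += p;
      for (let d = 0; d < dim; d++) acc[d] += p * values[j * dim + d];
    }
    runningMax = newMax;
  }
  return { out: acc.map((x) => x / runningSum), lse: runningMax + Math.log(runningSum) };
}

// Backward-style reconstruction: recover a probability from its score and the saved log-sum-exp.
function reconstructProbability(score: number, lse: number): number {
  return Math.exp(score - lse);
}
```

Saving one `lse` per row is enough to rebuild any entry of P on demand, which is why the full attention matrix never has to be written back.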
Verification
Algorithm parity in Node (tests/test_fa2_parity.mjs
+ tests/test_fa2_backward_parity.mjs) — 12 forward
checks, 18 backward checks, all within 1 ULP. End-to-end via
tests/test_webgpu_train.mjs: WASM 6.8 s/step
vs WebGPU + FA2 0.7 s/step, 2.5% loss drift after 50 steps.
13
LoRA fine-tuning in the browser
Withdrawn — moved to native
Folded into the quant + LoRA bundle (lever 19), which lands with the macOS app (lever 20). Native MLX-Swift supports
adapter training as a first-class operation; bringing the same
feature to the WebGPU path would require a parallel set of WGSL
kernels (LoRA-aware matmul, restricted optimizer step) for a
duplicate-of-Python win. The Python reference at
python_ref/lora.py stays as the canonical
implementation.
14
Quantized inference (4-bit / 8-bit)
Withdrawn — moved to native
Folded into the quant + LoRA bundle (lever 19) that lands with the macOS app (lever 20), where the quantization
libraries are mature (MLX, llama.cpp-style GGUF). Int4/int8 in
WGSL is doable but the test surface is large — every kernel needs
a quantized variant plus parity tests — and the immediate need
isn't sharp: Behemoth-class models already fit via Memory64 in
browser. The win-to-effort ratio is poor relative to spending the
same weeks on the native path.
15
Muon optimizer
Withdrawn — out of scope
A drop-in Newton-Schulz orthogonalisation of matrix-shaped
gradients before stepping. Empirically matches AdamW in fewer
steps. Skipped here because faster convergence isn't the project's
constraint — readability and per-step compute are. Listed in
docs/feature_ideas.md
for whenever someone wants to port it as a contributor experiment.
16
WASM Relaxed SIMD
Shipped — free uplift on the CPU path
Flags: -msimd128 -mrelaxed-simd
In: wasm/build_wasm.sh + build_wasm64.sh
Newer SIMD ops (FMA, dot products, relaxed rounding) the older
-msimd128 set leaves on the table. Both the 32-bit
and 64-bit WASM builds enable -mrelaxed-simd;
compilers emit the new opcodes where they help and fall back to
the baseline SIMD otherwise. Runtime requirements: Chrome 114+,
Firefox 120+, Safari 18.4+ — anywhere our COOP/COEP'd pthread
build already runs.
Verified by re-running tests/test_wasm64_xl_node.mjs
after the rebuild: same losses (5.57 → 3.04 over 5 XL steps),
within run-to-run variance on per-step time. Correctness is unchanged;
the speedup is modest and opportunistic, limited to the kernels the
compiler decides to vectorise more aggressively.
17
Pre-trained model gallery
v1 shipping · 4 models
v1: Shakespeare · TinyStories · Python · Recipes
Hosting: bundled in browser/public/gallery/
A Load from gallery button on the Setup screen opens
a dialog with 4 cards: same architecture (12L, d=256, ctx=256,
~9.6M params), four different corpora. Each shows a sample, params,
training loss, and one click loads the .tinygpt file
through the same path as model upload.
All four models were trained in the browser via
browser/train_gallery_one.mjs (Playwright drives a
real Chromium tab against the dev server, 5000 steps, WebGPU).
v1 is bundled in browser/public/gallery/;
manifest.json drives the dialog so future entries
drop in without code changes. v2 will move to R2 (≈ $0.015/GB-month,
zero egress to Pages) once we cross the bundled-asset budget.
Pairs naturally with lever 18 (diverse data structures) — each
gallery entry demonstrates a different kind of pattern the same
architecture can pick up.
18
Diverse data structures — tables, songs, books, code
Planned
Impact: shows the architecture learning structure, not just words
Pairs with the gallery
The playground accepts any UTF-8 text today, but the bundled demo
and most curated corpora are English prose. What the architecture
can actually learn isn't just language — it's any pattern
with local + long-range structure. Worth demonstrating with a
handful of materially different source types:
- Tabular data (CSV / Excel-exported sheets). The
model learns the row/column rhythm: commas and newlines in the
right places, repeated header tokens, value-range patterns per
column.
- Songs (lyrics + chord sheets). Verse/chorus
repetition, line length conventions, the way a chord line sits
above a lyric line.
- Full books. Chapter structure, dialogue
attribution, prose vs. dialogue cadence. Already partway there
with the Shakespeare demo, but a long-form prose book exposes
the model's handling of paragraph- and chapter-scale structure.
- Code. Indentation, balanced brackets, function
signatures, the convention that def is followed by
a colon and an indented block. A test of the model's
hierarchical reasoning.
Each one is a 30-min-to-few-hours training run. The interesting
move isn't the data plumbing (which already works — just paste
UTF-8 or pick a Hugging Face dataset); it's the presentation:
each gallery card shows side-by-side input format vs.
generated output, so the visitor sees that the same
architecture picked up the structure of whatever it was fed.
19
Quantization + LoRA fine-tuning
Planned · lands with the Mac app
Three capabilities, one bundle
Cheapest implementation: native (MLX) — see lever 20
Three feature buckets that genuinely belong together — each one
multiplies the value of the other two — and that all benefit from
the same native ML primitives:
- bf16 / fp16 weights and activations. Halves the
memory footprint at training and inference time. First-class on
Apple Silicon, with hardware accelerators. The in-browser
fp16-packed experiment showed no compound gain (lever 10b);
native is where it actually pays.
- int8 / int4 quantized inference. Load a quantized
checkpoint, sample without dequantising the full weight matrix.
Lets the 9.6M-param Huge model ship at < 10 MB and lets much
bigger models run sample-only without exceeding heap. MLX +
GGUF-style formats handle this off-the-shelf.
- LoRA fine-tuning. Load a base checkpoint, freeze
the weights, train low-rank adapter matrices on your own corpus.
The Python reference already supports this (docs/lora_guide.md);
the hook is "load a gallery model → click Fine-tune → watch the
voice shift in a few minutes." Pairs directly with lever 17.
Doing these in the WebGPU path was previously listed as three
separate planned items. The honest read: each one would mean a
parallel set of WGSL kernels (quantized matmul, LoRA-aware matmul,
fp16 variants of every op), parity tests against the fp32 reference,
and a doubled test matrix. Months of work for capability that the
native ML framework gives in days. Better to ship them together
on the native side first; back-port to in-browser only if real
usage demands it.
20
Native macOS app (MLX-Swift)
Planned · biggest single new project
Impact: ~30× training throughput on the same machine
File format: .tinygpt portable both ways
A native macOS app — SwiftUI shell, MLX-Swift training loop. Same
model architecture, same .tinygpt file format. Train
on your Mac, drop the checkpoint into the browser playground
anywhere; load a browser-trained model into the Mac app to
continue training at much higher throughput.
How much faster — and why. Same Huge model
(12L, d=256, ctx=256, 5000 steps) on the same M-series Mac:
- WebGPU (today): ~60 min · baseline
- MLX (Python): ~5–6 min · ~10×
- MLX-Swift + hand-tuned Metal: ~2–3 min · ~20–30×
- + Apple Neural Engine for inference: sampling ~30–50× faster
The compounding wins: unified memory (no upload/download
tax — WebGPU pays this on every buffer round-trip);
Metal Performance Shaders' GEMM kernels tuned per Apple GPU
generation; ~3× the effective memory bandwidth
(M3 Max: 400 GB/s vs WebGPU's ~150 GB/s effective);
ANE (16-core; ~18 TOPS on M3, ~38 TOPS on M4) which WebGPU cannot reach;
async compute queues overlapping compute + memcpy.
The architecture also lifts the parameter ceiling.
Browser caps out around ~10M params before user experience falls apart
(training-time budget, memory, GPU sharing with the compositor).
MLX-Swift can comfortably train 100M–1B+ models on the same laptop —
same code path, just a faster runtime.
The boundary between what belongs in-browser and
what belongs native is in
docs/shared_vs_native.md.
Together with the gallery (lever 17), the diverse-data milestone
(lever 18), and the quant + LoRA bundle (lever 19, which lands
here), this is the remaining work.
21
Browser-side weight quantization
Shipped — 4-bit gallery variants, ~4× smaller download
Gallery file size: ~19 MB → ~5 MB per model
Total gallery download: ~75 MB → ~20 MB
Lives in: browser/finalize_gallery_int4.mjs · expandInt4WeightsOnly in main.ts
Storage-side 4-bit quantization for the gallery models. Block-wise
symmetric scheme (block size 64, one fp16 scale per block, two
int4 indices packed per byte; the same layout idea as GGUF Q4_0, which uses 32-element blocks). Conversion
runs offline via finalize_gallery_int4.mjs against the
existing fp16 .bin files; the browser dequantizes
block-wise back to fp32 once at load time and hands the canonical
layout to the existing WASM importer, which doesn't know anything
changed. The fp16 files keep shipping alongside as the fallback.
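A minimal sketch of a block-wise symmetric 4-bit round trip, for illustration only. The block size of 64 and two-nibbles-per-byte packing come from the description above. The scale rule (maxAbs / 7), the nibble order, the fp32 scales, and all names are assumptions, and this is not the code in finalize_gallery_int4.mjs or expandInt4WeightsOnly.

```typescript
interface Int4Blocks { packed: Uint8Array; scales: Float32Array; length: number }

const INT4_BLOCK = 64; // block size quoted above

// Symmetric quantisation: q = round(x / scale) clamped to [-7, 7], stored as q + 8.
// Assumption: scale = maxAbs / 7 per block, kept as fp32 here rather than fp16.
function quantizeInt4(weights: Float32Array): Int4Blocks {
  const blocks = Math.ceil(weights.length / INT4_BLOCK);
  const packed = new Uint8Array((blocks * INT4_BLOCK) / 2);
  const scales = new Float32Array(blocks);
  for (let b = 0; b < blocks; b++) {
    const start = b * INT4_BLOCK;
    const end = Math.min(start + INT4_BLOCK, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    scales[b] = maxAbs > 0 ? maxAbs / 7 : 1;
    for (let i = start; i < end; i++) {
      const q = Math.max(-7, Math.min(7, Math.round(weights[i] / scales[b])));
      const nibble = q + 8;
      // Even indices go in the low nibble, odd indices in the high nibble.
      packed[i >> 1] |= (i & 1) === 0 ? nibble : nibble << 4;
    }
  }
  return { packed, scales, length: weights.length };
}

// Expand back to fp32 once, as the loader is described as doing at load time.
function dequantizeInt4(q: Int4Blocks): Float32Array {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    const byte = q.packed[i >> 1];
    const nibble = (i & 1) === 0 ? byte & 0x0f : byte >> 4;
    out[i] = (nibble - 8) * q.scales[Math.floor(i / INT4_BLOCK)];
  }
  return out;
}

// Storage estimate: half a byte per weight plus one 2-byte (fp16) scale per block.
function int4FileBytes(params: number, blockSize = INT4_BLOCK): number {
  return params / 2 + (params / blockSize) * 2;
}
```

Under these assumptions the round-trip error per element is bounded by half the block's scale, and `int4FileBytes(9.6e6)` gives about 5.1 MB, in line with the ~5 MB per model quoted for the gallery.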
Why not int4 GEMM on the GPU? Apple's WebGPU
implementation lacks the integer matmul intrinsics that would
make on-device int4 compute a win — the actual int4 multiply
would still decode each weight before fma. Trying to do it in
WGSL ends up slower than dequantizing once at load time and
running the existing fp16-storage matmul. The honest goal here
is download size + cold-load time, not
inference speed: cold-start drops by 75% on the gallery path.
Quality. Gated by two checks: (1) a startup
numerics gate (runInt4NumericsGate) that synthesizes
a representative weight matrix and verifies the round-trip
error fits the inherent 4-bit envelope — catches catastrophic
bugs (wrong block size, endianness) without enforcing an
unrealistic per-element bound; (2) the
browser/smoke_int4.mjs node-side check that the
published files round-trip with bounded per-tensor drift. End-
to-end: the Shakespeare model still generates Shakespeare
(see smoke_int4_browser.mjs).
Numbers
4 gallery models × ~19 MB fp16 → ~5 MB int4 each. Gate runs
in <1 ms on synthetic data; conversion runs in ~1 s per
model offline. The cache key is the filename, so int4 and fp16
files cache independently in OPFS.
Not done
True int4 GEMM (the speed win) still depends on hardware
support that's not in WebGPU as of 2026. When the
cooperative_matrix extension lands with int8/int4
types, the same files become a compute win too — see lever 22.
22
Browser frontier — tech we're tracking
Watching · revisit when stable
Theme: experimental web tech that could narrow the native gap
None ships today: each is parked behind a flag, an unfinished spec, or unscheduled engineering work
A few in-browser speedups exist on the frontier but aren't ready
for production users in 2026. Logged here so future-us knows
exactly where to look when the spec or implementation lands.
WebGPU cooperative matrix (wmma / tensor-core mapping)
- What: a WGSL extension exposing matrix-multiply-accumulate hardware (NVIDIA tensor cores, AMD MFMA, Apple AMX).
- Expected gain: ~3–5× on NVIDIA / AMD; ~1.3× on Apple (Apple's AMX is less directly reachable via this path).
- Status (May 2026): behind chrome://flags/#enable-unsafe-webgpu + --enable-features=WebGPUExperimentalFeatures. API still moving.
- Skip rationale: Apple gain is the small case for our M-series target. NVIDIA gain doesn't help most browser users. Won't ship until API stabilises and Apple win improves.
WebNN — route to OS NN runtimes (CoreML / DirectML)
- What: a Web API that hands neural-network graphs to the operating system, which then runs them on CoreML (Apple) / DirectML (Windows) / TFLite (Android). Can route to the Apple Neural Engine.
- Expected gain: 3–5× on Apple / Windows for inference; ~2× for training where supported.
- Status (May 2026): Chrome 126+ behind chrome://flags/#enable-webnn-api. Training support is minimal; mostly inference. API changing monthly.
- Skip rationale: training-side coverage is still thin. Revisit when CoreML/DirectML backends support backward passes and the API freezes.
Async compute queues — overlap compute and memcpy
- What: multiple GPU queues would let copy commands run on one queue while compute runs on another, hiding upload/download latency. WebGPU currently exposes a single queue per device (device.queue).
- Expected gain: ~1.2–1.5×, mostly during sampling and the small handful of CPU-touching steps.
- Status: multi-queue is not in the WebGPU specification yet.
- Skip rationale: our hot path is GPU-bound, not transfer-bound — once the first batch is uploaded, training never crosses the boundary. Worth picking up if we ever stream training data dynamically (e.g., gallery-scale corpora that don't fit in tab memory).
Hand-tuned WGSL kernels per matmul shape
- What: the current matmul kernel is one general-purpose blocked variant. Writing per-shape kernels (e.g., one specifically for the 4 × dModel × ctx attention output, one for the dMlp × dModel MLP), each with workgroup sizes tuned to that shape, would close the gap to the theoretical ceiling.
- Expected gain: ~1.5–2× on the matmul path.
- Status: pure engineering — no spec dependency.
- Skip rationale: weeks of work for a modest gain when the Mac app gets us 20–30× by switching runtimes entirely. We'll only come back here if Apple Neural Engine access to the web never materialises and we're still on WebGPU long-term.
If we ever stacked all of these
Stacked gains compound, but with diminishing returns on top of
today's shipped stack: coop-matrix (1.3× on Apple) × per-shape
kernels (1.7×) × async queues (1.2×) ≈ 2.65× over
today. Still an order of magnitude short of the Mac
app's ~30×. That's the structural fact behind lever 20.