SQL factory POC | Current-best candidate | 2026-07-02

Qwen3-0.6B Routed SQL Specialist

This is the cleanest current proof of the factory thesis: a tiny 0.6B model can beat a small public SQL baseline on a frozen exact-match slice, but only the routed artifact survives both public and execution-style gates.

Headline Numbers

Public exact: 0.531 (64-row b-mc2/sql-create-context slice)

T5-small baseline: 0.484 (same 64 public rows)

Synthetic execution: 0.860 (50 heldout SQLite rows)

Synthetic exact: 0.840 (same 50 heldout rows)

Competitive Context

| System | Metric | Score | Size / Class | Comparable? | Readout |
| --- | --- | --- | --- | --- | --- |
| TinyGPT routed SQL v1 | b-mc2 exact / synthetic exec | 0.531 / 0.860 | 0.6B base + 2 routed LoRAs | Direct | Current local candidate; public exact and synthetic execution gates are both frozen. |
| T5-small local baseline | b-mc2 exact | 0.484 | ~60M | Direct | Same 64-row public slice; TinyGPT is +4.7 points exact on this narrow gate. |
| Defog SQLCoder-7B-2 | Defog SQL-Eval category scores | 77.1-96% | 7B | Directional | Strong public SQL specialist, but reported as category-level Defog SQL-Eval scores rather than this b-mc2 slice. |
| Snowflake Arctic-Text2SQL-R1-7B | BIRD execution accuracy | 68.47% | 7B | Not comparable | Useful target class for public SQL execution; TinyGPT must add a BIRD/Spider execution gate before claiming this lane. |
| Snowflake Arctic-Text2SQL-R1-14B / 32B | BIRD execution accuracy | 70.04% / 71.83% | 14B / 32B | Not comparable | Shows the current public high bar: execution accuracy, not exact string match. |

Direct rows share this artifact's eval setup. Directional rows are useful market context but should not be read as leaderboard claims.
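To make the +4.7-point margin over T5-small concrete, the rounded scores can be converted back to row counts. This is a minimal sketch: the counts are inferred by rounding the reported scores and are not stated in this artifact, and `implied_rows` is a hypothetical helper.

```python
# Scores and the slice size come from the report. The row counts are
# inferred by rounding and are an assumption, not reported values.
PUBLIC_ROWS = 64


def implied_rows(score: float, n_rows: int) -> int:
    """Nearest whole number of correct rows consistent with a rounded score."""
    return round(score * n_rows)


routed_correct = implied_rows(0.531, PUBLIC_ROWS)    # routed candidate
baseline_correct = implied_rows(0.484, PUBLIC_ROWS)  # T5-small baseline
margin_rows = routed_correct - baseline_correct
margin_points = round((0.531 - 0.484) * 100, 1)

print(routed_correct, baseline_correct, margin_rows, margin_points)
```

Under that assumption the public-exact lead is three rows out of 64, which is consistent with the report calling this a narrow gate.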

Adapter comparison

| Setup | Public exact | Synthetic exec | Decision |
| --- | --- | --- | --- |
| Public v4 only | 0.531 | 0.240 | Route required |
| Blend v1 | 0.297 | 0.560 | Reject |
| Best static composition | 0.516 | 0.460 | Reject |
| BIRD + b-mc2 v5 | 0.438 | 0.280 | Reject |
| Classifier-routed v1 | 0.531 | 0.860 | Current best |
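One way to read the comparison is as a dominance check across the two gates. The report does not state numeric gate thresholds, so the sketch below uses Pareto dominance as an illustrative criterion. It is not the decision rule used to produce the Decision column, and `dominates` and `undominated` are hypothetical helpers.

```python
# Scores copied from the adapter comparison: (public exact, synthetic exec).
SETUPS = {
    "Public v4 only": (0.531, 0.240),
    "Blend v1": (0.297, 0.560),
    "Best static composition": (0.516, 0.460),
    "BIRD + b-mc2 v5": (0.438, 0.280),
    "Classifier-routed v1": (0.531, 0.860),
}


def dominates(a, b):
    """True if a is at least as good as b on both gates and better on one."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def undominated(setups):
    """Names of setups that no other setup dominates."""
    return [
        name
        for name, scores in setups.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in setups.items()
            if other != name
        )
    ]


survivors = undominated(SETUPS)
print(survivors)
```

With these numbers only the classifier-routed setup is left: it ties Public v4 on public exact and is ahead of every setup on synthetic execution.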

Router verification

| Check | Result | Evidence |
| --- | --- | --- |
| Unlabeled mixed rows | 114 | 64 public / 50 synthetic |
| Public route reason | 64 | known_public_source |
| Synthetic route reason | 50 | sqlite_db_field |
| Route confidence | >= 0.99 | all smoke rows |
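The checks in this table can be expressed as a small assertion over router logs. This is a hedged sketch: the `RouteRecord` schema and `verify_routes` function are hypothetical, and the confidence values in the example are placeholders rather than measured outputs. Only the reason names, the 64 / 50 split, and the 0.99 floor come from the table.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteRecord:
    """Hypothetical log entry for one routed row."""
    reason: str
    confidence: float


def verify_routes(records, expected_counts, min_confidence=0.99):
    """Check per-reason row counts and the confidence floor in one pass."""
    if not records:
        return False
    counts = Counter(r.reason for r in records)
    lowest = min(r.confidence for r in records)
    return dict(counts) == expected_counts and lowest >= min_confidence


# Placeholder records shaped like the smoke run (confidences are made up).
records = (
    [RouteRecord("known_public_source", 0.995)] * 64
    + [RouteRecord("sqlite_db_field", 0.999)] * 50
)
expected = {"known_public_source": 64, "sqlite_db_field": 50}
smoke_ok = verify_routes(records, expected)
print(len(records), smoke_ok)
```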

Release Blockers

Public execution benchmark missing

b-mc2 exact match is useful, but serious SQL claims need execution accuracy on public DBs.

Unblock: Add BIRD Mini-Dev SQLite or Spider SQLite execution fixtures once the DB bundle is local.
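For illustration, an execution-accuracy check differs from exact match in that two differently worded queries count as equal when they return the same rows. The sketch below uses Python's standard `sqlite3` module against a toy in-memory table. The table, the queries, and the `execution_match` helper are assumptions for the example, not the BIRD or Spider fixtures or their official scorers.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same rows, ignoring order."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred == gold


# Toy database standing in for a benchmark fixture.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (name TEXT, age INTEGER);
INSERT INTO singer VALUES ('Ann', 30), ('Bo', 41), ('Cy', 25);
""")

gold = "SELECT name FROM singer WHERE age > 28"
reworded = "SELECT name FROM singer WHERE NOT age <= 28"  # same rows, different text
wrong = "SELECT name FROM singer WHERE age > 40"
broken = "SELEC name FROM singer"

results = [execution_match(conn, q, gold) for q in (reworded, wrong, broken)]
print(results)
```

The comparison here ignores row order, so queries whose correctness depends on ORDER BY would need an order-sensitive variant.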

Output hygiene

The scorer extracts the first SELECT; completions can still include prose after the query.

Unblock: Add clean-SQL metric plus stopping/format preference data.
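A clean-SQL metric could be as simple as asking whether the whole completion compiles as a single SELECT statement. The sketch below is one possible definition, not the scorer used for this report: `is_clean_sql` and `clean_rate` are hypothetical names, and it assumes each eval row carries a schema that SQLite can load.

```python
import sqlite3


def is_clean_sql(completion: str, schema_sql: str) -> bool:
    """True when the entire completion is one SELECT that SQLite can compile."""
    text = completion.strip()
    if not text.upper().startswith("SELECT"):
        return False  # leading prose or a non-SELECT statement
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        # EXPLAIN compiles the statement without running it. Trailing prose
        # causes a syntax error or a multiple-statement error.
        conn.execute("EXPLAIN " + text)
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


def clean_rate(completions, schema_sql):
    """Fraction of completions that are clean under the check above."""
    if not completions:
        return 0.0
    return sum(is_clean_sql(c, schema_sql) for c in completions) / len(completions)


schema = "CREATE TABLE t (a INTEGER);"
samples = [
    "SELECT a FROM t",
    "SELECT a FROM t;",
    "SELECT a FROM t; This query returns a.",
    "SELECT a FROM t\n\nThis returns every value of a.",
    "Here is the query: SELECT a FROM t",
]
flags = [is_clean_sql(s, schema) for s in samples]
print(flags, clean_rate(samples, schema))
```

This check is deliberately strict and still imperfect: a trailing SQL comment passes, and a single stray word after a table name can parse as an alias.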

Not a specialist package yet

The adapters currently live under gitignored run folders, not package metadata.

Unblock: Package under specialists/ only after a ship decision on a public execution gate.

Evidence

Next Release Action

Publish this as a report artifact first. Do not present it as a shipped SQL model until public execution eval and clean-output gates pass.