source: docs/specialists/b1-sql-poc.md

B1 SQL POC

Status: expanded 0.6B factory POC complete; next step is preference tuning or a public benchmark slice.

Target

First factory POC: a narrow text-to-SQL specialist.

The goal is not to solve Spider yet. The goal is to prove the factory loop on a small deterministic task:

target -> data -> baseline eval -> candidate eval -> row failures -> report

Frozen Fixture

Fixture files live in evals/sql-poc/:

Smoke:

bash evals/sql-poc-smoke.sh

Expected dry-run scores:

| Model | Execution accuracy | Exact match | Rows |
| --- | --- | --- | --- |
| Baseline fixture | 0.667 | 0.167 | 6 |
| Candidate fixture | 0.833 | 0.833 | 6 |

The smoke test also checks row-level failure logging from tinygpt eval-sql --out.
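The two score columns measure different things: execution accuracy compares query results, exact match compares query text. The sketch below illustrates the distinction with Python's standard sqlite3 module. It is not the tinygpt eval-sql implementation; the `users` table, the sample pairs, and the order-insensitive row comparison are all assumptions made for illustration.

```python
import sqlite3

def run_query(conn, sql):
    """Return result rows sorted for order-insensitive comparison, or None if the SQL fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def score(conn, pairs):
    """pairs: list of (gold_sql, predicted_sql). Returns (execution_accuracy, exact_match)."""
    exec_hits = exact_hits = 0
    for gold, pred in pairs:
        gold_rows = run_query(conn, gold)
        pred_rows = run_query(conn, pred)
        # Execution hit: both queries run and return the same rows.
        if gold_rows is not None and pred_rows == gold_rows:
            exec_hits += 1
        # Exact hit: the SQL text itself is identical.
        if pred.strip() == gold.strip():
            exact_hits += 1
    return exec_hits / len(pairs), exact_hits / len(pairs)

# Hypothetical toy database and predictions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT, age INTEGER);
    INSERT INTO users VALUES (1, 'ada', 36), (2, 'bob', 17);
""")
pairs = [
    # Different text, same rows: counts for execution accuracy only.
    ("SELECT name FROM users WHERE age >= 18", "SELECT name FROM users WHERE age > 17"),
    # Identical text: counts for both metrics.
    ("SELECT COUNT(*) FROM users", "SELECT COUNT(*) FROM users"),
    # Wrong column: counts for neither.
    ("SELECT name FROM users", "SELECT id FROM users"),
]
exec_acc, exact = score(conn, pairs)
print(f"execution accuracy={exec_acc:.3f} exact match={exact:.3f}")
```

This is why execution accuracy can sit well above exact match in the tables here: a prediction can be textually different and still return the right rows. A real scorer would likely need order-sensitive comparison for ORDER BY queries, which this sketch skips.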

Live POC Steps

  1. Generate SQL from the base model on evals/sql-poc/dev.jsonl.
  2. Score it with tinygpt eval-sql.
  3. Train the cheapest LoRA SFT candidate on evals/sql-poc/train.jsonl.
  4. Generate SQL from base+adapter on the same frozen dev set.
  5. Score candidate output with tinygpt eval-sql.
  6. Render a runs/<date>-sql-poc/ folder with baseline, candidate, row traces, report, and decision.

First Live POC Result

Run folder: runs/2026-07-02-sql-poc-qwen06/ (local, gitignored).

Model:

Result:

| Model | Execution accuracy | Exact match | Rows |
| --- | --- | --- | --- |
| Qwen3-0.6B baseline | 0.167 | 0.000 | 6 |
| Qwen3-0.6B + SQL adapter | 0.833 | 0.833 | 6 |

Decision: retry-data.

Why: the run proves the factory mechanics and shows the 0.6B can be bent toward SQL output quickly, but the toy fixture has train/eval overlap and the model still sometimes emits prose around correct SQL. The next run needs non-overlapping data and preference examples for SQL-only completions.
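The prose-around-SQL failure can be detected mechanically. The sketch below is a minimal heuristic, not the project's scorer: it pulls the first SELECT statement out of a completion and flags whether anything surrounded it. The function names are hypothetical, and mapping the outcomes onto the sql_no_select and sql_prose_wrapped labels used in this document's failure taxonomies is an assumption.

```python
import re

# Heuristic: treat the first SELECT keyword as the start of the query.
# Limitation: a prose "select" (e.g. "please select ...") would be misread as SQL.
SQL_START = re.compile(r"\bSELECT\b", re.IGNORECASE)

def extract_sql(completion):
    """Return (sql, was_wrapped); sql is None when no SELECT is found."""
    text = completion.strip()
    match = SQL_START.search(text)
    if match is None:
        return None, False
    # Keep only the first statement; text after the first ';' is treated as trailing prose.
    # Limitation: a ';' inside a string literal would cut the query short.
    sql = text[match.start():].split(";", 1)[0].strip()
    was_wrapped = sql != text.rstrip(";").strip()
    return sql, was_wrapped

def failure_label(completion):
    """Map a completion to a hypothetical taxonomy label, or None if it is clean SQL."""
    sql, was_wrapped = extract_sql(completion)
    if sql is None:
        return "sql_no_select"
    if was_wrapped:
        return "sql_prose_wrapped"
    return None

print(failure_label("SELECT name FROM users;"))
print(failure_label("Here is the query: SELECT name FROM users; This returns all names."))
print(failure_label("I cannot answer that."))
```

A check like this can also mine preference pairs: the wrapped completion is the rejected sample and the extracted SQL is the chosen one.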

Operational note: the Metal build blocker was the active developer directory pointing at the Command Line Tools (CLT), not a missing Xcode install. Use:

DEVELOPER_DIR=/Applications/Xcode-27.0.0-Beta.app/Contents/Developer \
  swift build --build-system native --product tinygpt

Expanded POC Result

Run folder: runs/2026-07-02-sql-expanded-qwen06/ (local, gitignored).

Dataset:

Result:

| Model | Execution accuracy | Exact match | Rows |
| --- | --- | --- | --- |
| Qwen3-0.6B baseline | 0.160 | 0.140 | 50 |
| Qwen3-0.6B + expanded SQL adapter | 0.860 | 0.840 | 50 |

Failure taxonomy after SFT:

| Failure type | Count |
| --- | --- |
| sql_wrong_schema | 3 |
| sql_unneeded_join | 2 |
| sql_wrong_filter | 1 |
| sql_no_select | 1 |

Generated follow-up data:

Decision: retry-data. The loop works, but this is still synthetic fixture data and has no breadth regression suite.

Public Benchmark Probe

Run folder: runs/2026-07-02-sql-public-bmc2/ (local, gitignored).

Public source:

Metric: normalized exact SQL match. This is not execution accuracy because the HF dataset ships CREATE TABLE context and gold SQL, not populated SQLite DBs.

| Model | Exact match | Rows |
| --- | --- | --- |
| Qwen3-0.6B baseline | 0.042 | 24 |
| Qwen3-0.6B + expanded SQL adapter | 0.333 | 24 |
| cssupport/t5-small-awesome-text-to-sql | 0.458 | 24 |
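A minimal sketch of what a normalized exact SQL match can look like follows. The normalization rules here (case folding, whitespace collapsing, quote unification, trailing-semicolon removal) are assumptions for illustration; the rules actually used for the numbers above are not listed in this document.

```python
import re

def normalize_sql(sql):
    """Canonicalize surface form so trivially different queries compare equal."""
    s = sql.strip().rstrip(";")
    s = s.replace('"', "'")                     # unify quote style
    s = re.sub(r"\s+", " ", s)                  # collapse runs of whitespace
    s = re.sub(r"\s*([(),=<>])\s*", r"\1", s)   # drop spaces around punctuation and operators
    # Lowercasing also folds string literals, which is a deliberate simplification.
    return s.lower().strip()

def exact_match(golds, preds):
    """Fraction of predictions whose normalized text equals the normalized gold SQL."""
    hits = sum(normalize_sql(g) == normalize_sql(p) for g, p in zip(golds, preds))
    return hits / len(golds)

golds = ["SELECT name FROM users WHERE age >= 18", "SELECT COUNT(*) FROM users"]
preds = ["select  Name from users where age>=18 ;", "SELECT COUNT(id) FROM users"]
print(exact_match(golds, preds))
```

Because this metric never executes anything, a semantically correct query with different aliasing or clause order still scores as a miss, so it is stricter than execution accuracy on the same rows.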

Readout:

HF specialized model scan:

| Candidate | Size / shape | Practical note |
| --- | --- | --- |
| cssupport/t5-small-awesome-text-to-sql | ~242 MB, T5 seq2seq | Small, public, strong cheap baseline; not TinyGPT-runtime compatible. |
| prem-research/prem-1B-SQL | ~1.35B params, Llama-family | Best next sub-4B SQL specialist to inspect/load; full download is multi-GB. |
| Ellbendls/Qwen-2.5-3b-Text_to_SQL | ~3.1B params, Qwen2 | Relevant under the “smaller than 4B” constraint, but still a larger candidate. |
| defog/sqlcoder-7b-2 | ~6.7B params | Strong known specialist, but above the current smallest-model target. |

Beat-The-Small-Baseline Plan

Immediate target: beat cssupport/t5-small-awesome-text-to-sql on a non-overlapping public-style slice before moving to Spider execution.

Prepared data:

Curriculum mix:

| Bucket | Train rows | Dev rows |
| --- | --- | --- |
| aggregate | 77 | 10 |
| filter | 90 | 12 |
| group/having | 71 | 7 |
| join | 203 | 23 |
| order/limit | 46 | 10 |
| projection | 25 | 2 |

Next run:

native-mac/.build/debug/tinygpt sft \
  ~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/snapshots/c1899de289a04d12100db370d81485cdf75e47ca \
  --data evals/sql-public-bmc2-train/train.jsonl \
  --template plain \
  --out runs/2026-07-02-sql-public-bmc2/qwen06-public-bmc2.lora \
  --rank 8 \
  --alpha 16 \
  --steps 300 \
  --batch 1 \
  --max-seq 512 \
  --metal-cache-gb 8 \
  --throttle 0.5

Measured external baseline:

| Model | Exact match | Rows |
| --- | --- | --- |
| cssupport/t5-small-awesome-text-to-sql | 0.484 | 64 |

Gate:

Beat-The-Small-Baseline Result

Run folder: runs/2026-07-02-sql-public-bmc2/ (local, gitignored).

Fixed public dev: evals/sql-public-bmc2-train-v2/dev.jsonl, 64 rows from b-mc2/sql-create-context.

| Attempt | Data / recipe | Public exact | Rows |
| --- | --- | --- | --- |
| cssupport/t5-small-awesome-text-to-sql | external specialized T5-small | 0.484 | 64 |
| v1 | 512 rows, source index >= 2000, rank 8, 300 steps, lr 1e-3 | 0.344 | 64 |
| v2 | 2048 rows, source index >= 64, rank 16, 1200 steps, lr 1e-3 | 0.031 | 64 |
| v3 | 2048 rows, source index >= 64, rank 8, 600 steps, lr 5e-4 | 0.422 | 64 |
| v4 | join/group weighted rows, rank 8, 700 steps, lr 5e-4 | 0.531 | 64 |

v4 curriculum readout:

| Bucket | v4 exact | T5-small exact |
| --- | --- | --- |
| aggregate | 6/10 | 7/10 |
| filter | 7/12 | 6/12 |
| group/having | 6/7 | 5/7 |
| join | 3/23 | 5/23 |
| order/limit | 10/10 | 7/10 |
| projection | 2/2 | 1/2 |

Result: v4 beats the small public specialist on the fixed public exact gate (34/64 vs 31/64), but it is not shippable as the SQL specialist.
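The headline 34/64 vs 31/64 can be rebuilt from the per-bucket readout, which also shows where the margin comes from. The snippet below only re-adds the counts reported in the v4 curriculum readout; nothing in it is new data.

```python
# (v4 hits, T5-small hits, rows) per bucket, copied from the v4 curriculum readout.
buckets = {
    "aggregate":    (6, 7, 10),
    "filter":       (7, 6, 12),
    "group/having": (6, 5, 7),
    "join":         (3, 5, 23),
    "order/limit":  (10, 7, 10),
    "projection":   (2, 1, 2),
}

rows = sum(n for _, _, n in buckets.values())
v4_hits = sum(v4 for v4, _, _ in buckets.values())
t5_hits = sum(t5 for _, t5, _ in buckets.values())

print(f"v4: {v4_hits}/{rows} = {v4_hits / rows:.3f}")
print(f"T5-small: {t5_hits}/{rows} = {t5_hits / rows:.3f}")

# Per-bucket margin (v4 minus T5-small), largest gain first.
margins = sorted(((v4 - t5, name) for name, (v4, t5, _) in buckets.items()), reverse=True)
for delta, name in margins:
    print(f"{name:>12}: {delta:+d}")
```

The three-row lead comes mostly from order/limit (+3). The join bucket is the largest, at 23 of 64 rows, and v4 trails T5-small there by 2, with both models under 25%.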

Regression check:

| Model | Synthetic execution | Synthetic exact | Rows |
| --- | --- | --- | --- |
| original expanded SQL adapter | 0.860 | 0.840 | 50 |
| public v4 adapter | 0.240 | 0.220 | 50 |

Failure taxonomy on the synthetic regression:

| Failure type | Count |
| --- | --- |
| sql_wrong_schema | 19 |
| sql_wrong_filter | 10 |
| sql_unneeded_join | 9 |

Generated follow-up data:

Decision: continue-training, not ship. We proved the data loop can beat the small external baseline on the public exact slice, but the adapter learned a public Spider/WikiSQL style that over-joins and hallucinates schema links on the synthetic execution fixture. The next loop should blend public weighted rows with the synthetic execution rows, or train/merge separate public and synthetic adapters, then require both gates to pass.

Blend Experiment

Question: can a single 0.6B adapter hold both the public exact-match style and the local synthetic execution style?

Blend data:

Run:

Result:

| Model | Public exact | Synthetic execution | Synthetic exact |
| --- | --- | --- | --- |
| public v4 adapter | 0.531 | 0.240 | 0.220 |
| original synthetic adapter | not measured on 64-row public gate | 0.860 | 0.840 |
| blended v1 adapter | 0.297 | 0.560 | 0.520 |

Blend failure taxonomy on synthetic regression:

| Failure type | Count |
| --- | --- |
| sql_no_select | 17 |
| sql_prose_wrapped | 3 |
| sql_wrong_filter | 2 |

Decision: route-or-compose. Naive mixture SFT partially recovers synthetic execution but destroys the public exact gate, and its failure mode changes from schema over-joining to not reliably emitting clean SQL. This is interference, not just a weighting issue. The next push should keep the public and synthetic adapters separate and test adapter merge or routing:

Adapter Composition Sweep

Implementation: tinygpt serve and tinygpt generate now accept repeatable --lora plus --lora-weight, using the existing HF multi-LoRA stack injection.
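For intuition, weighted multi-LoRA composition is usually described as adding each adapter's low-rank update to the frozen weight, scaled by alpha / rank and by a per-adapter weight. The sketch below shows that arithmetic on tiny matrices in pure Python. It assumes the standard LoRA formulation; how tinygpt applies --lora-weight internally is not documented here, and every name in the code is hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def compose(base, adapters):
    """Return base + sum_i weight_i * (alpha_i / rank_i) * (B_i @ A_i).

    adapters: list of (weight, alpha, rank, B, A), with B of shape d_out x rank
    and A of shape rank x d_in. ASSUMPTION: standard LoRA scaling.
    """
    out = [row[:] for row in base]
    for weight, alpha, rank, B, A in adapters:
        delta = matmul(B, A)
        scale = weight * alpha / rank
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                out[i][j] += scale * value
    return out

base = [[1.0, 0.0], [0.0, 1.0]]
public = (1.0, 2.0, 1, [[1.0], [0.0]], [[0.0, 2.0]])     # weight 1.0, rank-1 toy adapter
synthetic = (0.5, 2.0, 1, [[0.0], [1.0]], [[1.0, 0.0]])  # weight 0.5, rank-1 toy adapter
composed = compose(base, [public, synthetic])
print(composed)
```

Because both updates land on the same weight matrix, lowering one adapter's weight only trades its influence against the other's.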

Adapters:

Proxy sweep: 20 public rows + 20 synthetic rows.

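| Public weight | Synthetic weight | Public exact | Synthetic exec | Synthetic exact |
| --- | --- | --- | --- | --- |
| 1.0 | 0.25 | 0.550 | 0.250 | 0.200 |
| 1.0 | 0.50 | 0.550 | 0.500 | 0.400 |
| 1.0 | 0.75 | 0.550 | 0.700 | 0.500 |
| 0.75 | 1.0 | 0.550 | 0.800 | 0.550 |
| 0.50 | 1.0 | 0.500 | 0.800 | 0.750 |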

Full-gate checks:

| Public weight | Synthetic weight | Public exact | Synthetic exec | Synthetic exact | Decision |
| --- | --- | --- | --- | --- | --- |
| 1.0 | 0.50 | 0.516 | 0.460 | 0.320 | public pass, synthetic fail |
| 1.0 | 0.75 | 0.469 | 0.640 | 0.360 | public fail, synthetic partial |
| 0.75 | 1.0 | 0.453 | 0.780 | 0.500 | public fail, synthetic partial |

Decision: route, not compose. Static adapter composition creates a smooth tradeoff but no tested weight passes both gates. The right next attempt is a router that chooses the public adapter for public Spider/WikiSQL-style schemas and the synthetic adapter for local execution schemas, then reports routed aggregate accuracy. This matches the empirical shape: each adapter is competent in its own distribution, and mixing them blurs both.

Routed Adapter Result

Implementation: scripts/run_sql_routed_generate.py routes each row to one specialist adapter and recombines predictions in original order.

Route classifier:

The old route field is now optional and only used with --trust-route-field. The default classifier is label-free and emits _route, _route_reason, and _route_confidence metadata.
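The route, generate, recombine flow can be sketched as below. This is not the code in scripts/run_sql_routed_generate.py: the CREATE TABLE heuristic, the confidence values, the `context` and `question` field names, and the generator callables are all hypothetical stand-ins. Only the `_route`, `_route_reason`, and `_route_confidence` keys and the optional trusted route field come from the description above.

```python
def classify(row):
    """Hypothetical label-free heuristic: rows carrying CREATE TABLE context go to the public adapter."""
    context = row.get("context", "")
    if "create table" in context.lower():
        return "public", "context contains CREATE TABLE DDL", 0.9  # placeholder confidence
    return "synthetic", "no DDL context found", 0.6                # placeholder confidence

def route_and_generate(rows, generators, trust_route_field=False):
    """Route each row to one adapter, generate per adapter, and restore original order."""
    annotated, assignments = [], {}
    for index, row in enumerate(rows):
        if trust_route_field and "route" in row:
            route, reason, confidence = row["route"], "trusted route field", 1.0
        else:
            route, reason, confidence = classify(row)
        annotated.append({**row, "_route": route, "_route_reason": reason,
                          "_route_confidence": confidence})
        assignments.setdefault(route, []).append(index)

    predictions = [None] * len(rows)
    for route, indices in assignments.items():
        # One batched call per adapter; outputs are written back at their original indices.
        outputs = generators[route]([rows[i] for i in indices])
        for i, output in zip(indices, outputs):
            predictions[i] = output
    return annotated, predictions

# Stub generators standing in for base+adapter inference.
generators = {
    "public": lambda batch: ["P:" + r["question"] for r in batch],
    "synthetic": lambda batch: ["S:" + r["question"] for r in batch],
}
rows = [
    {"question": "q0", "context": "CREATE TABLE t (a INT)"},
    {"question": "q1"},
    {"question": "q2", "context": "CREATE TABLE u (b INT)"},
]
annotated, predictions = route_and_generate(rows, generators)
print(predictions)
```

Grouping rows by adapter means each adapter is loaded once per run rather than once per row, and writing predictions back by index keeps the output aligned with the frozen eval file for scoring.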

Eval set:

Result:

| Strategy | Public exact | Synthetic execution | Synthetic exact |
| --- | --- | --- | --- |
| public v4 only | 0.531 | 0.240 | 0.220 |
| synthetic expanded only | not measured on 64-row public gate | 0.860 | 0.840 |
| blend v1 | 0.297 | 0.560 | 0.520 |
| best static composition tested | 0.516 | 0.460 | 0.320 |
| routed adapters | 0.531 | 0.860 | 0.840 |

Classifier rerun artifacts:

Decision: current-best-routed-artifact. This is the first setup that passes both current SQL gates, and it no longer depends on hand-authored route labels. The next real benchmark step is to run BIRD Mini-Dev SQLite or Spider execution fixtures once their DB bundles are local.
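The two-gate decision can be written as a small check over the strategy results. The public threshold is the T5-small score reported in this document. The synthetic threshold is an assumption: the gate definition is not spelled out here, so the sketch uses the original synthetic adapter's execution accuracy as a stand-in.

```python
PUBLIC_GATE = 0.484     # T5-small public exact match on the 64-row dev set; must be beaten
SYNTHETIC_GATE = 0.860  # ASSUMPTION: match the original synthetic adapter's execution accuracy

# (strategy, public exact, synthetic execution); None means not measured.
results = [
    ("public v4 only", 0.531, 0.240),
    ("synthetic expanded only", None, 0.860),
    ("blend v1", 0.297, 0.560),
    ("best static composition tested", 0.516, 0.460),
    ("routed adapters", 0.531, 0.860),
]

def passes_both(public_exact, synthetic_exec):
    """A strategy passes only if both gates are measured and both are met."""
    if public_exact is None or synthetic_exec is None:
        return False
    return public_exact > PUBLIC_GATE and synthetic_exec >= SYNTHETIC_GATE

passing = [name for name, public, synthetic in results if passes_both(public, synthetic)]
print(passing)
```

Treating an unmeasured score as a failure keeps a strategy from passing by omission, which is why the synthetic-only adapter does not qualify despite its 0.860.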

BIRD-Augmented Public v5 Attempt

Question: can broader schema-rich public SQL data improve the public adapter’s hard join/generalization misses without collapsing the current b-mc2 gate?

Data:

Run:

Result:

| Model | Public exact | Synthetic execution | Synthetic exact |
| --- | --- | --- | --- |
| public v4 adapter | 0.531 | 0.240 | 0.220 |
| BIRD+b-mc2 public v5 adapter | 0.438 | 0.280 | 0.320 |

Synthetic failure taxonomy for v5:

| Failure type | Count |
| --- | --- |
| sql_wrong_filter | 13 |
| sql_wrong_schema | 10 |
| sql_no_select | 7 |
| sql_unneeded_join | 4 |
| sql_missing_join | 2 |

Decision: reject-v5. Broad BIRD augmentation was useful as a benchmark/data probe, but it hurt the fixed b-mc2 public gate and did not recover synthetic execution. The next push should avoid another broad SFT mixture. Better options:

Gate

For the POC:

For a real B1 ship gate, move from this toy fixture to Spider or another public text-to-SQL benchmark and keep the same factory artifact shape.