Public Artifacts
Public artifacts are first-class factory outputs. They are different from local run files: a public artifact should be understandable outside this repo, have a small committed metadata surface, and list its blockers as clearly as its wins.
Website surface: /artifacts. The website should behave like a built-in blog
for artifacts: one index page for scanning, one detail page per artifact, and
numbers/blockers/evidence on every page.
Release Rule
Every public artifact entry must include:
- artifact id and type
- current status
- what can be published now
- measured evidence
- competitive context
- blockers
- next release action
Weights, adapters, and large run outputs do not need to live in git, but the public artifact must explain where they came from, how they were evaluated, and why it is or is not ready to package.
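The release rule above can be checked mechanically. The following is a minimal sketch, assuming each artifact entry is a plain mapping; the key names are hypothetical, chosen here to mirror the required items, and are not taken from the committed metadata files.

```python
from typing import Any, Mapping

# Hypothetical key names, one per required item in the release rule.
REQUIRED_FIELDS = (
    "id",
    "type",
    "status",
    "publishable_now",
    "measured_evidence",
    "competitive_context",
    "blockers",
    "next_release_action",
)


def missing_fields(entry: Mapping[str, Any]) -> list[str]:
    """Return the required fields that are absent, None, or an empty string."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = entry.get(name)
        # An empty list is allowed (e.g. no blockers), but the key must exist.
        if value is None or value == "":
            missing.append(name)
    return missing
```

An entry that returns an empty list from `missing_fields` satisfies the rule as sketched; anything else names exactly what still has to be written before the entry is public.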
Competition rows must be labeled as:
| Label | Meaning |
|---|---|
| Direct | Same fixture, prompt/eval setup, metric, and scorer. |
| Directional | Useful market or method context, but not the same eval. |
| Not comparable | Public high bar or adjacent system that explains the target lane but cannot be claimed as a win/loss. |
Public copy should use “we beat X on this exact local gate” only for
Direct rows. Everything else is context until the same benchmark is run.
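The labeling rule can be encoded so that public copy cannot claim a win from a non-Direct row. This is a sketch only: the `Comparability` enum and `may_claim_win` helper are hypothetical names, and it assumes higher scores are better.

```python
from enum import Enum


class Comparability(Enum):
    """The three labels a competition row may carry."""
    DIRECT = "Direct"
    DIRECTIONAL = "Directional"
    NOT_COMPARABLE = "Not comparable"


def may_claim_win(label: Comparability, ours: float, theirs: float) -> bool:
    """A win claim needs a Direct row and a strictly higher score.

    Assumption: the metric is higher-is-better.
    """
    return label is Comparability.DIRECT and ours > theirs
```

With the SQL numbers reported later in this document, `may_claim_win(Comparability.DIRECT, 0.531, 0.484)` is true, while the same scores under a Directional label are not a claimable win.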
Artifact States
| State | Meaning |
|---|---|
| release-ready-metadata | Small committed metadata exists; large weights may still be external. |
| candidate-current-best | Best measured candidate, but not yet a shipped specialist package. |
| report-only | Good public write-up/repro artifact, but no model should be used directly. |
| blocked | Needs a named unblocker before public release work continues. |
| parked | Real artifact, but not active in the factory sequence. |
Current Public Artifact List
| Artifact | Type | State | Public value | Next release action |
|---|---|---|---|---|
| qwen3-4b-file-ops-distilled | Specialist package metadata | release-ready-metadata | Shows a real TinyGPT-built routed specialist: 58% -> 100% on file-ops hard gate, with breadth regression disclosed. | Decide whether to release metadata-only first or publish/host the multi-GB fused weights. |
| qwen06-sql-routed-v1 | Routed SQL specialist POC | candidate-current-best | Shows the factory/router pattern on SQL: public exact 0.531 and synthetic execution 0.860 using separate routed adapters. | Convert to a public report artifact; package only after a public execution benchmark gate exists. |
| factory-run-schema-v1 | Process artifact | report-only | Explains the repeatable target -> data -> post-training -> eval -> package -> report contract. | Add one canonical rendered example run and link it from the README. |
| browser-playground | Demo artifact | parked | Public proof of the earlier browser/WASM/WebGPU learning track. | Keep parked unless it directly presents factory reports or artifacts. |
Artifact Details
qwen3-4b-file-ops-distilled
Status: release-ready-metadata
Committed surface:
- specialists/qwen3-4b-file-ops-distilled/model_card.md
- specialists/qwen3-4b-file-ops-distilled/eval_report.json
- specialists/qwen3-4b-file-ops-distilled/tinygpt.lock.json
- specialists/qwen3-4b-file-ops-distilled/prompt.md
- specialists/registry.json
Measured evidence:
| Gate | Stock | Specialist |
|---|---|---|
| File-ops hard gate | 0.58 | 1.00 |
| File-ops hardgen heldout | - | 0.95 |
| Out-of-domain breadth | 0.596 | 0.423 |
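The win and the regression in the table above are easiest to compare as deltas in percentage points. A small sketch of that arithmetic, using only the two rows that have both a stock and a specialist score (the heldout row has no stock baseline); the `delta_points` helper is an illustrative name:

```python
# Scores copied from the table above: (stock, specialist).
GATES = {
    "File-ops hard gate": (0.58, 1.00),
    "Out-of-domain breadth": (0.596, 0.423),
}


def delta_points(stock: float, specialist: float) -> float:
    """Specialist minus stock, in percentage points, rounded to one decimal."""
    return round((specialist - stock) * 100, 1)


for gate, (stock, specialist) in GATES.items():
    print(f"{gate}: {delta_points(stock, specialist):+.1f} points")
# prints:
# File-ops hard gate: +42.0 points
# Out-of-domain breadth: -17.3 points
```

The 17.3-point breadth drop is the number behind the routed-only positioning in the blockers below.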
Release blockers:
| Blocker | Why it matters | Unblock action |
|---|---|---|
| Weight distribution undecided | The lock points to ~/.cache/tinygpt/models/mt4b_fused, not a public download. | Choose metadata-only release or publish the artifact to a durable host. |
| Breadth regression is real | The model is unsafe as a general planner. | Keep routed-only positioning in all public copy. |
| Frontier/breadth caveat remains | The breadth suite is directly comparable but not fully frontier-validated. | Keep caveat in model card; do not oversell as general capability. |
qwen06-sql-routed-v1
Status: candidate-current-best
Current artifact shape:
- Public adapter: runs/2026-07-02-sql-public-bmc2/qwen06-public-bmc2-v4-joinweighted.lora
- Synthetic adapter: runs/2026-07-02-sql-expanded-qwen06/qwen06-sql-expanded.lora
- Router: scripts/run_sql_routed_generate.py
- Router smoke: evals/sql-routed-router-smoke.sh
- Eval fixture: evals/sql-routed-mixed-v1/
- Report: docs/specialists/b1-sql-poc.md
Measured evidence:
| Gate | Result |
|---|---|
| Public b-mc2 exact, 64 rows | 0.531 |
| T5-small public baseline, same 64 rows | 0.484 |
| Synthetic SQLite execution, 50 rows | 0.860 |
| Synthetic SQLite exact, 50 rows | 0.840 |
| Label-free router smoke | 64 public / 50 synthetic, all high-confidence |
Competitive context:
| System | Metric | Score | Comparable? | Readout |
|---|---|---|---|---|
| TinyGPT routed SQL v1 | b-mc2 exact / synthetic exec | 0.531 / 0.860 | Direct | Current local candidate. |
| T5-small local baseline | b-mc2 exact | 0.484 | Direct | Same 64-row public slice; TinyGPT is +4.7 points. |
| Defog SQLCoder-7B-2 | Defog SQL-Eval category scores | 77.1-96% | Directional | Strong public SQL specialist, but different benchmark and 7B size class. |
| Arctic-Text2SQL-R1-7B | BIRD execution accuracy | 68.47% | Not comparable | Public execution target class; TinyGPT needs BIRD/Spider execution before competing here. |
| Arctic-Text2SQL-R1-14B / 32B | BIRD execution accuracy | 70.04% / 71.83% | Not comparable | Current public high bar is execution accuracy, not exact string match. |
External source notes:
- cssupport/t5-small-awesome-text-to-sql is the direct local baseline model; the 0.484 score is our local rerun on the same 64-row b-mc2 slice.
- Defog SQLCoder-7B-2 reports category-level Defog SQL-Eval scores, not one aggregate score.
- Arctic-Text2SQL-R1 reports BIRD execution accuracy; use it as the public execution target lane, not as a direct comparison.
- BFCL-V4 market rows on the website use the LLM Stats July 2026 snapshot only as directional context because that page marks the rows as self-reported and unverified. The official BFCL page remains the methodology/source-of-truth benchmark reference.
Rejected alternatives:
| Attempt | Public exact | Synthetic execution | Decision |
|---|---|---|---|
| Single public v4 adapter | 0.531 | 0.240 | route required |
| Blended SFT v1 | 0.297 | 0.560 | reject |
| Best static LoRA composition tested | 0.516 | 0.460 | reject |
| BIRD+b-mc2 v5 | 0.438 | 0.280 | reject |
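One way to read the table above is as a dominance check: the routed setup (0.531 public exact, 0.860 synthetic execution) is at least as good as every rejected attempt on both gates and strictly better on at least one. The document does not state the decision rule that was actually applied, so the sketch below is an interpretation and not the factory's own logic.

```python
# (public exact, synthetic execution), copied from the tables in this section.
ROUTED = (0.531, 0.860)
REJECTED = {
    "Single public v4 adapter": (0.531, 0.240),
    "Blended SFT v1": (0.297, 0.560),
    "Best static LoRA composition tested": (0.516, 0.460),
    "BIRD+b-mc2 v5": (0.438, 0.280),
}


def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if a is at least as good as b on both gates and strictly better on one."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


for name, scores in REJECTED.items():
    print(f"{name}: dominated by routed = {dominates(ROUTED, scores)}")
```

Every row prints `True`. The single public v4 adapter ties on public exact and loses only on synthetic execution, which matches its "route required" decision rather than an outright reject.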
Release blockers:
| Blocker | Why it matters | Unblock action |
|---|---|---|
| Public execution benchmark missing | b-mc2 exact match is useful but not enough for a serious SQL model claim. | Add BIRD Mini-Dev SQLite or Spider SQLite execution gate once DBs are local. |
| Not packaged under specialists/ | Current adapter paths are local runs/ outputs, not package metadata. | Create a package only after decision.json is ship; until then publish as report-only/candidate. |
| Output hygiene is weak | Scorers extract the first SELECT; many completions still include prose after the query. | Add clean-SQL metric and stopping/format preference data before shipping. |
| Performance numbers missing | Public artifact should report latency, RAM, tok/s, and eval time. | Add one measured inference/eval run on the routed setup. |
| Data provenance needs public copy | b-mc2 and BIRD-derived rows have different licenses/provenance surfaces. | Add dataset license/provenance notes to the public report. |
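The output-hygiene blocker above asks for a clean-SQL metric alongside first-SELECT extraction. A minimal sketch of what such a metric could look like follows. It is not the repo's scorer: the function names are hypothetical, and it assumes a query runs from the first SELECT keyword to the first semicolon (or to the end of the text when there is none).

```python
import re
from typing import Optional

# Assumption: a query starts at the first SELECT keyword and ends at the first
# semicolon, or at the end of the text if there is no semicolon.
_SELECT_RE = re.compile(r"\bSELECT\b.*?(?:;|\Z)", re.IGNORECASE | re.DOTALL)


def extract_first_select(completion: str) -> Optional[str]:
    """Return the first SELECT statement in a completion, or None if absent."""
    match = _SELECT_RE.search(completion)
    return match.group(0).strip() if match else None


def is_clean_sql(completion: str) -> bool:
    """True only if the completion is the extracted statement and nothing else."""
    query = extract_first_select(completion)
    return query is not None and completion.strip() == query


def clean_sql_rate(completions: list[str]) -> float:
    """Fraction of completions that are clean SQL (0.0 for an empty list)."""
    if not completions:
        return 0.0
    return sum(is_clean_sql(c) for c in completions) / len(completions)
```

Known limits of this sketch: a semicolon inside a string literal ends the match early, queries that begin with WITH are counted as unclean, and a completion with no semicolon cannot be separated from trailing prose. A real gate would likely need a SQL parser for those cases.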
Next release action:
Publish qwen06-sql-routed-v1 as a public report artifact, not a shipped
specialist package. The report should say: targeted 0.6B SQL adapter beats a
small T5 baseline on the frozen public exact slice, but the robust artifact is a
router over two specialists, and public execution benchmarking is the next gate.
Website page: /artifacts/qwen06-sql-routed-v1
Release Priority
1. qwen06-sql-routed-v1 report artifact: best current story for the factory thesis because it includes failed attempts, routing, blockers, and measured improvement.
2. qwen3-4b-file-ops-distilled metadata artifact: strongest model win, but weight distribution and routed-only caveats must be handled carefully.
3. factory-run-schema-v1: publish as process proof once one rendered run folder is canonical.
4. browser-playground: leave parked unless it becomes a report browser.