source: docs/factory/public-artifacts.md

Public Artifacts

Public artifacts are first-class factory outputs. They are different from local run files: a public artifact should be understandable outside this repo, have a small committed metadata surface, and list its blockers as clearly as its wins.

Website surface: /artifacts. The website should behave like a built-in blog for artifacts: one index page for scanning, one detail page per artifact, and numbers/blockers/evidence on every page.

Release Rule

Every public artifact entry must include:

Weights, adapters, and large run outputs do not need to live in git, but the public artifact must explain where they came from, how they were evaluated, and why it is or is not ready to package.

Competition rows must be labeled as:

| Label | Meaning |
| --- | --- |
| Direct | Same fixture, prompt/eval setup, metric, and scorer. |
| Directional | Useful market or method context, but not the same eval. |
| Not comparable | Public high bar or adjacent system that explains the target lane but cannot be claimed as a win/loss. |

Public copy should use “we beat X on this exact local gate” only for Direct rows. Every other row is context until the same benchmark is run.
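The labeling rule can be sketched as a small guard on public copy. This is a minimal illustration, not the repo's code: the `Comparability` enum and the `public_readout` helper are hypothetical names, and the scores passed in are the ones listed later in this document.

```python
from enum import Enum


class Comparability(Enum):
    """The three competition-row labels from the table above."""
    DIRECT = "Direct"
    DIRECTIONAL = "Directional"
    NOT_COMPARABLE = "Not comparable"


def public_readout(system: str, label: Comparability, ours: float, theirs: float) -> str:
    """Hypothetical helper: phrase one competition row for public copy.

    Scores are only compared for Direct rows; for every other label the
    numbers are ignored because the eval setup differs.
    """
    if label is not Comparability.DIRECT:
        return f"{system}: context only until the same benchmark is run"
    if ours > theirs:
        return f"we beat {system} on this exact local gate"
    return f"we do not beat {system} on this exact local gate"


print(public_readout("T5-small local baseline", Comparability.DIRECT, 0.531, 0.484))
print(public_readout("Arctic-Text2SQL-R1-7B", Comparability.NOT_COMPARABLE, 0.531, 0.6847))
```

Keeping the comparison inside the Direct branch means a higher number on a different benchmark can never turn into a win claim by accident.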

Artifact States

| State | Meaning |
| --- | --- |
| release-ready-metadata | Small committed metadata exists; large weights may still be external. |
| candidate-current-best | Best measured candidate, but not yet a shipped specialist package. |
| report-only | Good public write-up/repro artifact, but no model should be used directly. |
| blocked | Needs a named unblocker before public release work continues. |
| parked | Real artifact, but not active in the factory sequence. |
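One way to keep these states machine-checkable is a closed enum plus a small validator. The sketch below is an assumption about how such a check could look: `ArtifactEntry`, its `unblocker` field, and `validate` are hypothetical and are not taken from the repo's metadata schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ArtifactState(Enum):
    """The five artifact states from the table above."""
    RELEASE_READY_METADATA = "release-ready-metadata"
    CANDIDATE_CURRENT_BEST = "candidate-current-best"
    REPORT_ONLY = "report-only"
    BLOCKED = "blocked"
    PARKED = "parked"


@dataclass
class ArtifactEntry:
    name: str
    state: ArtifactState
    unblocker: Optional[str] = None  # hypothetical field, only required when blocked


def validate(entry: ArtifactEntry) -> List[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = []
    if entry.state is ArtifactState.BLOCKED and not entry.unblocker:
        problems.append(f"{entry.name}: blocked state needs a named unblocker")
    return problems


# An unknown state string fails fast instead of slipping into the index page.
print(ArtifactState("report-only"))
print(validate(ArtifactEntry("example-artifact", ArtifactState.BLOCKED)))
```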

Current Public Artifact List

| Artifact | Type | State | Public value | Next release action |
| --- | --- | --- | --- | --- |
| qwen3-4b-file-ops-distilled | Specialist package metadata | release-ready-metadata | Shows a real TinyGPT-built routed specialist: 58% -> 100% on file-ops hard gate, with breadth regression disclosed. | Decide whether to release metadata-only first or publish/host the multi-GB fused weights. |
| qwen06-sql-routed-v1 | Routed SQL specialist POC | candidate-current-best | Shows the factory/router pattern on SQL: public exact 0.531 and synthetic execution 0.860 using separate routed adapters. | Convert to a public report artifact; package only after a public execution benchmark gate exists. |
| factory-run-schema-v1 | Process artifact | report-only | Explains the repeatable target -> data -> post-training -> eval -> package -> report contract. | Add one canonical rendered example run and link it from the README. |
| browser-playground | Demo artifact | parked | Public proof of the earlier browser/WASM/WebGPU learning track. | Keep parked unless it directly presents factory reports or artifacts. |

Artifact Details

qwen3-4b-file-ops-distilled

Status: release-ready-metadata

Committed surface:

Measured evidence:

| Gate | Stock | Specialist |
| --- | --- | --- |
| File-ops hard gate | 0.58 | 1.00 |
| File-ops hardgen heldout | - | 0.95 |
| Out-of-domain breadth | 0.596 | 0.423 |
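The win and the regression in the table above are easier to state side by side as percentage-point deltas. The snippet below is only a worked calculation on the table's numbers; `delta_points` is a hypothetical helper name, and the heldout row is left out because it has no stock score to compare against.

```python
# Values copied from the measured-evidence table: (stock, specialist).
GATES = {
    "File-ops hard gate": (0.58, 1.00),
    "Out-of-domain breadth": (0.596, 0.423),
}


def delta_points(stock: float, specialist: float) -> float:
    """Specialist minus stock, in percentage points, rounded to one decimal."""
    return round((specialist - stock) * 100, 1)


for gate, (stock, specialist) in GATES.items():
    print(f"{gate}: {delta_points(stock, specialist):+.1f} points")
```

The hard gate moves by +42.0 points while breadth moves by -17.3 points, which is the regression the public copy has to disclose.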

Release blockers:

| Blocker | Why it matters | Unblock action |
| --- | --- | --- |
| Weight distribution undecided | The lock points to ~/.cache/tinygpt/models/mt4b_fused, not a public download. | Choose metadata-only release or publish the artifact to a durable host. |
| Breadth regression is real | The model is unsafe as a general planner. | Keep routed-only positioning in all public copy. |
| Frontier/breadth caveat remains | The breadth suite is directly comparable but not fully frontier-validated. | Keep caveat in model card; do not oversell as general capability. |

qwen06-sql-routed-v1

Status: candidate-current-best

Current artifact shape:

Measured evidence:

| Gate | Result |
| --- | --- |
| Public b-mc2 exact, 64 rows | 0.531 |
| T5-small public baseline, same 64 rows | 0.484 |
| Synthetic SQLite execution, 50 rows | 0.860 |
| Synthetic SQLite exact, 50 rows | 0.840 |
| Label-free router smoke | 64 public / 50 synthetic, all high-confidence |

Competitive context:

| System | Metric | Score | Comparable? | Readout |
| --- | --- | --- | --- | --- |
| TinyGPT routed SQL v1 | b-mc2 exact / synthetic exec | 0.531 / 0.860 | Direct | Current local candidate. |
| T5-small local baseline | b-mc2 exact | 0.484 | Direct | Same 64-row public slice; TinyGPT is +4.7 points. |
| Defog SQLCoder-7B-2 | Defog SQL-Eval category scores | 77.1-96% | Directional | Strong public SQL specialist, but different benchmark and 7B size class. |
| Arctic-Text2SQL-R1-7B | BIRD execution accuracy | 68.47% | Not comparable | Public execution target class; TinyGPT needs BIRD/Spider execution before competing here. |
| Arctic-Text2SQL-R1-14B / 32B | BIRD execution accuracy | 70.04% / 71.83% | Not comparable | Current public high bar is execution accuracy, not exact string match. |

External source notes:

Rejected alternatives:

| Attempt | Public exact | Synthetic execution | Decision |
| --- | --- | --- | --- |
| Single public v4 adapter | 0.531 | 0.240 | route required |
| Blended SFT v1 | 0.297 | 0.560 | reject |
| Best static LoRA composition tested | 0.516 | 0.460 | reject |
| BIRD+b-mc2 v5 | 0.438 | 0.280 | reject |
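The reasoning behind the decisions above can be read as a two-gate comparison: an alternative replaces the routed setup only if it is at least as good on both public exact and synthetic execution. The sketch below applies that reading to the table's numbers. The rule itself is an assumption about how the decisions were made, and `matches_routed` is a hypothetical helper.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    public_exact: float
    synthetic_exec: float


# Routed result for qwen06-sql-routed-v1, from the measured evidence.
ROUTED = Scores(0.531, 0.860)

# Rows copied from the rejected-alternatives table.
ALTERNATIVES = {
    "Single public v4 adapter": Scores(0.531, 0.240),
    "Blended SFT v1": Scores(0.297, 0.560),
    "Best static LoRA composition tested": Scores(0.516, 0.460),
    "BIRD+b-mc2 v5": Scores(0.438, 0.280),
}


def matches_routed(candidate: Scores, routed: Scores = ROUTED) -> bool:
    """True only if the candidate is at least as good as routed on both gates."""
    return (candidate.public_exact >= routed.public_exact
            and candidate.synthetic_exec >= routed.synthetic_exec)


survivors = [name for name, scores in ALTERNATIVES.items() if matches_routed(scores)]
print(survivors)  # no single-model alternative matches the routed setup on both gates
```

The single public v4 adapter ties the routed result on public exact but falls to 0.240 on synthetic execution, which is why its decision reads "route required" rather than "reject".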

Release blockers:

| Blocker | Why it matters | Unblock action |
| --- | --- | --- |
| Public execution benchmark missing | b-mc2 exact match is useful but not enough for a serious SQL model claim. | Add BIRD Mini-Dev SQLite or Spider SQLite execution gate once DBs are local. |
| Not packaged under specialists/ | Current adapter paths are local runs/ outputs, not package metadata. | Create a package only after decision.json is ship; until then publish as report-only/candidate. |
| Output hygiene is weak | Scorers extract the first SELECT; many completions still include prose after the query. | Add clean-SQL metric and stopping/format preference data before shipping. |
| Performance numbers missing | Public artifact should report latency, RAM, tok/s, and eval time. | Add one measured inference/eval run on the routed setup. |
| Data provenance needs public copy | b-mc2 and BIRD-derived rows have different licenses/provenance surfaces. | Add dataset license/provenance notes to the public report. |
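The output-hygiene blocker can be made concrete with a small clean-SQL metric. The sketch below is illustrative only: the actual scorers are not shown in this document, so the regex, `extract_first_select`, `is_clean_sql`, and `clean_sql_rate` are all hypothetical. It treats a completion as clean when it contains nothing but the extracted query and an optional trailing semicolon.

```python
import re
from typing import List, Optional

# Assumption: "extract the first SELECT" means taking text from the first
# SELECT keyword up to the first ';', blank line, or end of the completion.
_FIRST_SELECT = re.compile(r"\bSELECT\b.*?(?:;|\n\s*\n|\Z)", re.IGNORECASE | re.DOTALL)


def extract_first_select(completion: str) -> Optional[str]:
    """Return the first SELECT statement without its trailing ';', or None."""
    match = _FIRST_SELECT.search(completion)
    if match is None:
        return None
    return match.group(0).strip().rstrip(";").strip()


def is_clean_sql(completion: str) -> bool:
    """True when the completion is only the query, with no prose around it."""
    query = extract_first_select(completion)
    if query is None:
        return False
    return completion.strip().rstrip(";").strip() == query


def clean_sql_rate(completions: List[str]) -> float:
    """Fraction of completions that are clean SQL; 0.0 for an empty list."""
    if not completions:
        return 0.0
    return sum(is_clean_sql(c) for c in completions) / len(completions)


samples = [
    "SELECT name FROM users;",
    "SELECT name FROM users;\n\nThis query returns every user name.",
]
print(clean_sql_rate(samples))
```

A completion can pass an exact or execution scorer that extracts the first SELECT and still fail this check, which is the gap the blocker describes. The regex is deliberately naive: a semicolon inside a string literal or the word "select" in leading prose would confuse it, so a real metric would likely need a SQL parser.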

Next release action:

Publish qwen06-sql-routed-v1 as a public report artifact, not a shipped specialist package. The report should say that a targeted 0.6B SQL adapter beats a small T5 baseline on the frozen public exact slice, that the robust artifact is a router over two specialists, and that public execution benchmarking is the next gate.

Website page: /artifacts/qwen06-sql-routed-v1

Release Priority

  1. qwen06-sql-routed-v1 report artifact: best current story for the factory thesis because it includes failed attempts, routing, blockers, and measured improvement.
  2. qwen3-4b-file-ops-distilled metadata artifact: strongest model win, but weight distribution and routed-only caveats must be handled carefully.
  3. factory-run-schema-v1: publish as process proof once one rendered run folder is canonical.
  4. browser-playground: leave parked unless it becomes a report browser.