First specialist package | Release-ready metadata | 2026-06-19

Qwen3-4B File-Ops Distilled

This is the strongest model win in the repo: a Mac-built specialist reaches 100% on the file-ops hard gate, up from 58% for the stock 4B model. It is also the clearest example of why routing is mandatory: out-of-domain breadth falls from 59.6% to 42.3% after tuning.

Headline Numbers

File-ops hard gate

100%, up from 58% for the stock 4B model

Heldout file-ops

95% on the hardgen heldout suite

Breadth after tuning

42.3%, down from 59.6% for the stock model

Artifact size

7.5GB local HF/MLX safetensors directory

Competitive Context

System | Metric | Score | Size / Class | Comparable? | Readout
TinyGPT Qwen3-4B file-ops specialist | local file-ops hard gate | 100% | 4B, 7.5GB package | Direct | Domain specialist result; not a general BFCL leaderboard submission.
Stock Qwen3-4B | same local file-ops hard gate | 58% | 4B | Direct | Before/after delta is +42 points on the frozen domain gate.
Frontier calibration | same local file-ops hard gate | ~99-100% | frontier API/teacher | Direct | Used as the ceiling check for whether the eval is a usable ruler.
BFCL V4 public leader | overall BFCL V4 accuracy | 75.0% | large public model | Directional | LLM Stats snapshot for Qwen3.7 Max; it marks BFCL-V4 rows as self-reported/unverified, so this is market context only.
BFCL V4 public average | overall BFCL V4 accuracy | 61.1% | 13 tracked models | Directional | LLM Stats reports 13 self-reported rows and 0 verified rows; TinyGPT still needs a full BFCL submission for direct comparison.
Qwen3.5-4B public BFCL-V4 row | overall BFCL V4 accuracy | 50.3% | 4B | Directional | Closest public 4B-class tool-calling row in the same LLM Stats snapshot, but still not the local file-ops gate.

Direct rows share this artifact's eval setup. Directional rows are useful market context but should not be read as leaderboard claims.

Measured result

Gate | Stock | Specialist | Readout
File-ops hard gate | 0.58 | 1.00 | Domain win
File-ops hardgen heldout | - | 0.95 | Generalizes within file ops
Out-of-domain breadth | 0.596 | 0.423 | Regression; route only
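
The before/after deltas in the measured results can be reproduced with a few lines of arithmetic. The sketch below copies the three gate scores from this page and reports the specialist-minus-stock change in percentage points. The names GATES and delta_points are illustrative and not part of the package.

```python
# Scores copied from the measured results on this page as (stock, specialist).
# None marks the heldout gate, where no stock score is reported.
GATES = {
    "File-ops hard gate": (0.58, 1.00),
    "File-ops hardgen heldout": (None, 0.95),
    "Out-of-domain breadth": (0.596, 0.423),
}


def delta_points(stock, specialist):
    """Return specialist minus stock in percentage points, or None without a baseline."""
    if stock is None:
        return None
    return round((specialist - stock) * 100, 1)


deltas = {gate: delta_points(*scores) for gate, scores in GATES.items()}

for gate, delta in deltas.items():
    label = "n/a (no stock baseline)" if delta is None else f"{delta:+.1f} points"
    print(f"{gate}: {label}")
```

The sign makes the trade explicit: the hard gate moves up and breadth moves down, which is why the breadth row is read as route only.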

Release Blockers

Weight distribution undecided

The package lock points to a local cache path, not a public artifact host.

Unblock: Decide metadata-only release vs durable hosted weight release.
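
One way to make this decision mechanical is to check where the lock's weight location points before choosing a release mode. The sketch below is a minimal illustration under stated assumptions: the lock is represented as a dict with a hypothetical weights_uri field, and the real lock format is not shown on this page. Anything that is not an http(s) URL with a host is treated as local.

```python
from urllib.parse import urlparse


def is_publicly_hosted(weights_uri: str) -> bool:
    """True only for http(s) URLs with a host; local paths and file:// URIs are False."""
    parsed = urlparse(weights_uri)
    return parsed.scheme in {"http", "https"} and bool(parsed.netloc)


def release_mode(lock: dict) -> str:
    """Pick a release mode from a hypothetical lock dict with a 'weights_uri' field."""
    uri = lock.get("weights_uri", "")
    return "hosted-weights" if is_publicly_hosted(uri) else "metadata-only"


# Example values are placeholders, not the package's real paths.
print(release_mode({"weights_uri": "/Users/example/.cache/model-dir"}))
print(release_mode({"weights_uri": "https://example.com/model-dir"}))
```

A check like this only tells you whether the pointer is a URL; whether the host is durable is still a human decision.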

Breadth regression

The tuned model should not be used as a general planner: its out-of-domain breadth score fell from 59.6% to 42.3%.

Unblock: Keep all public copy routed-only and include the negative-transfer table.
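
Routed-only use can be pictured as a dispatch step that sends file-ops tasks to the specialist and everything else to the stock model. The sketch below assumes an upstream step has already labeled each task with a domain string. The model identifiers and the "file_ops" label are hypothetical placeholders, not names from the package.

```python
# Hypothetical identifiers for illustration only.
SPECIALIST = "qwen3-4b-file-ops-distilled"
GENERALIST = "qwen3-4b-stock"
FILE_OPS_DOMAIN = "file_ops"


def route(task_domain: str) -> str:
    """Send file-ops tasks to the specialist; fall back to the stock model otherwise."""
    if task_domain == FILE_OPS_DOMAIN:
        return SPECIALIST
    # Any other or unknown domain goes to the generalist, since the specialist
    # regresses on out-of-domain breadth.
    return GENERALIST


for domain in ("file_ops", "planning", "unknown"):
    print(domain, "->", route(domain))
```

Defaulting unknown domains to the stock model keeps the specialist from being exposed to the traffic where it scored 42.3%.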

Evidence

Next Release Action

Release as metadata/model-card first, or publish the fused weights only with routed-only warnings attached.