Tool-calling: how close can a Mac-local small model get to frontier?
What: the arc from “our 1.7B scores ~55% on tool-calling” to a frontier-validated metric, an honest size curve, and a distillation result that closes most of the gap. Why it matters here: this is the cost-compression thesis made concrete — reach frontier capability at a fraction of the cost, measured on a ruler we trust.
Recorded 2026-06-14. Companion principle: the frontier-ceiling gate in
AGENTS.md (“Eval philosophy”). Distillation mechanics: distillation.md.
1. The eval was broken before the models were
The first “55%” came from scoring against hermes-fc gold with exact-string match. That metric is unwinnable:
- ~29% of held-out examples have ungroundable gold args — device IDs, txn
codes, whole JSON payloads, even a literal
unique_nft_identifierplaceholder — values that appear nowhere in the prompt. No model can reproduce them. - A frontier model (Claude via
claude -p) scored ~12% on the hard cases, and its answers were frequently more correct than the gold. Hard exact-match ceiling ≈ 71%.
Rule that came out of this: before any benchmark grades a Mac model, a frontier model must ace it (~100%). If frontier can’t, the eval is broken — fix or drop it. hermes-fc is now training-only, never a reported metric.
2. The ruler we trust: BFCL with AST matching
BFCL golds are verified
groundable and multi-valued (each param lists acceptable values). We built a
controlled harness (single-turn categories) and validated it: frontier = 124/125
(99.2%). The lone miss (parallel_9) is a doubly-underdetermined gold (batched
array call ≡ parallel calls; "5:00 PM" ≡ "5 pm") — accepted as passing.
Legitimate harness fixes made to reach frontier 100% (not to inflate small models):
Python-syntax instruction (BFCL’s own convention), implicit-multiplication
canonicalization (3*x ≡ 3x — BFCL’s own gold 3x**2 is non-executable),
recursive nested-dict matching, and a brace-matching parser.
Two parser bugs were hiding the local models’ real ability. The first regex discarded multi-call outputs that used one closing
</tool_call>for several blocks (a model convention); the second couldn’t parse bare-JSON calls with nestedarguments. Fixing them moved the distilled 1.7B’s parallel_multiple from a fake 8% to a real 60%, and the 4B’s simple_python from a fake 0% to 84%. Lesson: a lenient, well-tested parser is part of a fair eval.
3. The honest size curve (validated BFCL slice, n=25/category)
| Model | simple | multiple | parallel | par_mult | live_s | live_m | avg |
|---|---|---|---|---|---|---|---|
| Frontier (Claude) | 100 | 100 | 96 | 100 | 100 | — | ~99 |
| 30B-A3B (≈3B active) | — | — | 96 | 96 | — | — | ~frontier |
| base-4B (stock) | 84 | 96 | 92 | 80 | 88 | 60 | 83 |
| base-1.7B (stock) | 92 | 96 | 20 | 16 | 72 | 42 | 56 |
| distilled-1.7B (hermes) | 80 | 84 | 68 | 60 | 76 | 24 | 65 |
| FT-1.7B (ToolACE) | 80 | 96 | 68 | 64 | 92 | 56 | 76 |
Headlines: the 30B-A3B matches frontier on multi-call at ~3B active params (the cost-compression proof). The stock 1.7B is already frontier-level on single-call (92/96) — its only real weakness is multi-call decomposition.
4. Conclusion on distillation/SFT
Best result via distillation/SFT so far: avg 76 for the 1.7B (fine-tuned on 8,270 ToolACE examples, 42% multi-call, prompts identical to the eval). It closed most of the multi-call gap (parallel 20→68, parallel_multiple 16→64), and now beats the 4B on live_simple (92 vs 88). That’s ~2/3 of the way from base-1.7B (56) to base-4B (83).
But SFT-distillation plateaus short of frontier-parity, and it trades:
- Hard multi-call still lags (parallel/parallel_multiple 68/64 vs 4B 92/80, frontier 96/100).
- It regressed single-call (92→80) — the same trade hermes showed. Fine-tuning on a tool-call corpus dilutes the base’s already-strong single-call.
This matches the validated thesis (see distillation.md): distillation can match but not exceed its data/teacher ceiling. Remaining distillation levers we have NOT exhausted: (a) fix the data mix to recover single-call; (b) distill from our local 30B (already frontier-level on multi-call) instead of a generic dataset. Pure SFT’s cap for the 1.7B looks to be ~the 4B’s level, not frontier.
5. Where reinforcement learning comes in
To exceed the SFT ceiling you need RL — it optimizes a verifiable reward directly rather than imitating data. Status of the from-scratch MLX GRPO work (no MLX GRPO exists; built here):
- Loop validated on GSM8K (reward trended up vs the zero-shot floor) — sample K rollouts → verifiable reward → group-normalize advantage (no critic) → policy gradient on the LoRA.
- Now has a real reward: the validated BFCL AST matcher is a verifiable reward function — exactly what RLVR needs. Earlier GRPO-on-tool-calls used the broken exact-match; that’s fixed.
- Known stability fix pending: the tool-call GRPO run spiked (loss → -59) without KL regularization. Add a KL-to-reference penalty before the real runs.
Result (2026-06-14). GRPO on FT-1.7B (reward = graded AST match + over-emission penalty; KL to frozen FT-1.7B; held-out BFCL [25:] prompts; grad-accumulation for Qwen3’s 151k-vocab logits). Stable throughout (loss ~0, KL ≤ 0.025 — no blowup). It delivered a modest, targeted lift, exactly on the categories we aimed at:
| simple | multiple | parallel | par_mult | live_s | live_m | avg | |
|---|---|---|---|---|---|---|---|
| SFT (FT-1.7B) | 80 | 96 | 68 | 64 | 92 | 56 | 76 |
| +GRPO | 80 | 96 | 68 | 68 | 92 | 64 | 78 |
The arc: base 56 → SFT 76 → GRPO 78. Conclusion: SFT does the heavy lifting; RL is a small targeted top-up (+8 live_multiple, +4 parallel_multiple). The 1.7B plateaus ~78 — short of the 4B and frontier on hard multi-call. Strong result (4B-competitive on 4/6) but not parity. → escalate to the 4B.
6. The 4B sweep — and the punchline: on a strong base, selection beats training
We surveyed the best small bases (June 2026) and ran the recipe. Validated BFCL slice:
| Model | simple | mult | par | par_m | live_s | live_m | avg |
|---|---|---|---|---|---|---|---|
| Frontier (Claude) | 100 | 100 | 96 | 100 | 100 | — | ~99 |
| Qwen3-4B-2507 bf16 — STOCK | 92 | 96 | 96 | 96 | 88 | 56 | 87.3 |
| Hammer2.1-3b stock (FC-specialist*) | 96 | 100 | 92 | 88 | 84 | 60 | 86.7 |
| Qwen3-4B-2507 + ToolRL-GRPO | 92 | 96 | 92 | 92 | 92 | 56 | 86.7 |
| Qwen3.5-4B-8bit stock | 88 | 100 | 80 | 84 | 92 | 72 | 86.0 |
| Qwen3-4B-2507 4-bit stock | 84 | 96 | 92 | 80 | 88 | 60 | 83.3 |
| Qwen3-4B + function-masking SFT | 84 | 92 | 88 | 84 | 72 | 60 | 80.0 |
*Hammer trains on BFCL-like data → structured scores partly by-design.
Findings:
- Precision was the biggest lever of the whole project. bf16 (87.3) vs 4-bit (83.3) = +4 for free — the FC-quantization finding, confirmed. The best result came from a flag, not training.
- ToolRL-GRPO is neutral on a strong base (87.3→86.7; traded parallel for live_simple). RL’s headroom needs examples the model gets inconsistently — scarce when the base is already ~87. (It did help the weak 1.7B: +2.)
- Function-masking SFT regressed the strong base (−7). The masking trick nudged its target (live_multiple 56→60) but the SFT process taxed everything else. Hammer’s trick works from-scratch in the base’s native training, not LoRA-bolted onto a strong model.
live_multipleis the wall — no intervention moved real-user function-selection past ~60 (frontier ~90+). A data/base gap, not a training-knob gap.- Qwen3.5 (qwen3_5 arch) is inference-only here — mlx_lm has no backward for it; can’t fine-tune/GRPO on this Mac.
Verdict: best Mac-local 4B tool-caller = Qwen3-4B-Instruct-2507 @ bf16, STOCK — beats SOTA-4B (Hammer-4B 76), zero training, every training intervention made it worse. The meta-lesson: training is the lever for weak bases (1.7B 56→78); on a strong base the wins are base + precision selection. We proved this empirically rather than assuming it.
7. Closing the eval gaps (the honest headline)
Two fixes made the number trustworthy:
live_multiplewas never frontier-gated — running it, frontier scored only 84, with ~3/4 misses being under-determined golds (USA≡United States, a fuller address penalized) — the hermes disease again. Adding country-alias + multi-word-superset semantic matching lifted frontier 84→92 (sound, no over-accept) and our 4B 56→64, sound categories unchanged. Honest headline: Qwen3-4B-2507 bf16 = 88.7, frontier = 98.0.- Irrelevance probe: our 4B abstains 40/40 (100%) when no tool fits — no over-triggering. (Format-sensitivity to trivial rewording: 0pp — but that’s a weak templated proxy; real LLM-paraphrase, where the field sees 13–19pt drops, is future work.)
On the sound categories the 4B is ~93–94 — near-frontier. The residual gap is
genuine capability on the hardest real-user args (live_multiple 64 vs frontier 92), not
something more training of this base fixes.
8. Multi-turn / agentic — the cliff, measured (2026-06-14)
Single-turn (88.7) said nothing about holding a conversation. Built a stateful
multi-turn harness reusing BFCL’s machinery (execute_multi_turn_func_call +
multi_turn_checker + the involved_classes backends); scripts/bfcl_multiturn_eval.py.
The build lesson: the inference side is the eval. A hand-rolled text-transcript
prompt under-elicited badly — the 30B-A3B scored 0/8 despite acing single-turn (96/96).
The fix: drive the model with its native tool-calling chat template (tools= catalog +
proper assistant/tool message roles — what Qwen3 was trained on) + BFCL’s own
multi-turn behaviour prompt. That lifted the 30B from 0% → 50% — in BFCL’s expected
~50-70% range for multi_turn_base (the hardest category; even frontier doesn’t ace it).
The cliff (multi_turn_base, native-template harness, n=20 same examples):
| Model | single-turn | multi-turn | drop |
|---|---|---|---|
| Qwen3-30B-A3B (3B active) | ~96 | 45% | −51 |
| Qwen3-4B-2507 bf16 | 88.7 | 25% | −64 |
So even the strong 30B more than halves on the hardest multi-turn; the 4B drops to 25%.
But “95-96% multi-turn” is unachievable on multi_turn_base for anyone — the BFCL-V4
leader sits at 75% overall and multi-turn runs lower; frontier caps ~50-70%. By our own
frontier-ceiling rule, a 95% bar there is mis-calibrated. So we built a difficulty-graded,
right-sized gate — deterministic single-backend (GorillaFileSystem) agentic tasks tuned so a
strong model aces the easier tiers while the gap to a small model grows (scripts/make_multiturn_gates.py).
The capability gradient — frontier-validated (DeepSeek-V4-pro, true frontier):
| Tier | task shape | DeepSeek-V4-pro | 30B-A3B (proxy) | Qwen3-4B-2507 bf16 |
|---|---|---|---|---|
| single-turn | one call | ~99 | ~96 | 88.7 |
| easy multi-turn | 1-2 calls | 100% | 100% | 94% |
| moderate | 3-4 calls + cd-nav | 100% | 100% | 86% |
| hard (canonical gate) | 5-7 calls, deep nesting, 4 turns | 100% | 83% | 58% |
multi_turn_base (BFCL hardest) | multi-backend, long | — | 45% | 25% |
The hard tier is the sound, discriminating gate — a true frontier model (DeepSeek-V4-pro,
via OpenAI function-calling) aces it 100%, while the 4B clearly cliffs to 58% (a 42-pt
frontier-to-small gap). So make_multiturn_gates.py + the DeepSeek backend
(bfcl_multiturn_deepseek.py) give a calibrated multi-turn ruler: frontier ~100%, and the 4B’s
curve 94 → 86 → 58 maps exactly where it degrades as agentic complexity rises. (The 30B-A3B
sits between at 83% on hard — a strong but not-frontier 3B-active proxy.)
The 4B is a capable simple-agent, not a poor agent — near-frontier on easy/moderate flows, cliffing only as depth/length/turns grow. We expected the climb lever to be multi-turn RL; in fact rejection-sampling distillation alone cleared it (§8.1). RL (GRPO) stays available as a further top-up, but wasn’t needed to reach frontier-parity on this gate.
(API note: BFCL func docs use "type":"dict"; OpenAI/DeepSeek require "object" — the DeepSeek
backend normalizes BFCL’s type vocabulary to JSON-schema, and needs a curl User-Agent to clear
Cloudflare. Key read from /tmp/deepseek_key / $DS_KEY_FILE, never committed.)
(Build note: the inference side IS the eval — a hand-rolled text transcript scored the 30B 0%;
the native tool-calling chat template + proper roles fixed it. And echo-content tasks were
flaky because small models over-call and a stray touch blanks the file — idempotent
mkdir/mv/rm tasks make the gate clean. Both are real “agentic eval is subtle” lessons.)
8.1 Climbing the cliff — frontier-trajectory distillation (2026-06-16)
Goal: get the 4B from its 58% hard-tier cliff to the ~95% frontier level without stepping up to 8B. Two levers, stacked:
- Free first — a plan-then-execute system prompt (plan the full call sequence, act one
step at a time, never repeat a succeeded call, stop when done). Stock 4B 58 → 75 on the
12-task gate (the harness
SYSis nowMT_SYS-overridable). A real +17, but brittle. - The durable lever — RFT (rejection-sampling distillation). Recipe, all Mac-local:
- Scale the task family:
gen_multiturn_trajdata.pytemplates hundreds of deterministic, idempotent GorillaFileSystem agentic tasks (gold-validated viamulti_turn_checker). - Teacher trajectories: run DeepSeek-V4-pro over 100 held-out tasks (
bfcl_multiturn_deepseek.py --dump), keep only the 99 the checker passed (rejection sampling ⇒ clean labels). - Render in the student’s own format:
render_sft_from_traj.pyre-emits each trajectory through the 4B’s chat template (tools=+<tool_call>+ tool roles) as mlx_lm text. - LoRA SFT: 16 layers, lr 1e-5, 4 epochs.
--grad-checkpointis mandatory — every example is ~3.1-3.7k tokens (the 18-tool catalog floors them) and the 151k-vocab logits OOM the backward pass without it. One command:distill_multiturn.sh.
- Scale the task family:
The climb (held-out, zero train/eval content overlap — verified):
| Qwen3-4B-2507 | hard gate (12) | 40-task held-out set |
|---|---|---|
| stock | 58% | — |
| + plan prompt | 75% | 60% |
| + distilled (99 frontier trajectories) | 100% | 95% |
The 4B now matches DeepSeek-V4-pro (100%) on the frontier-validated hard gate — frontier-parity on multi-turn file-system agency, at 4B, locally, no 8B needed.
The tradeoff (honest): single-turn BFCL slipped ~87 → 83 avg (simple_python −8, parallel_multiple −16, multiple/parallel unchanged, live_multiple +4) — the classic specialization cost of a narrow SFT. Recoverable by mixing single-turn data into the SFT or fewer epochs if we want both skills; for a multi-turn agentic product (Pace) the trade is strongly positive.
Scope: proven on the GorillaFileSystem multi-turn domain (the gate’s backend). Generalization to other agentic backends (trading, ticketing, …) is untested — distilling a multi-backend mix is the obvious next step. The recipe — author verifiable tasks → frontier RFT → SFT in the student’s template — is domain-general.
Reproducible for free (gold behaviour-cloning ≡ frontier distillation). When the trajectories
were lost to a /tmp wipe, we rebuilt the identical model with no teacher API: for verifiable
tasks the gold ground-truth is the correct trajectory, so gold_to_sft_traj.py synthesizes
SFT data by executing the gold per turn (free, fast, deterministic). It reproduced 100% hard /
95% hardgen exactly — but only after one non-obvious fix: the SFT data must demonstrate the
turn-completion STOP signal (an assistant message emitting no tool calls after the work is
done). Without it the model never learns to stop, over-calls at eval, and lands at 75% — the gap
that a teacher’s trajectories close for free because they naturally end each turn with a no-call
message. (Lesson with teeth for self-improvement: knowing when to stop is a learned behaviour,
not a freebie — a ReST loop must reward it.)
8.2 Conclusive head-to-head — Pace incumbent (Gemma) vs the 4B (2026-06-16)
Pace ships Gemma; this is the deciding comparison. Same hard gate, same plan prompt,
n=12. Gemma scored zero-shot via LM Studio (OpenAI function-calling); the distilled 4B is
the §8.1 specialist; frontier + stock-4B are anchors. Reproducible via headtohead_multiturn.sh.
| Model | params | hard-gate task-completion |
|---|---|---|
| DeepSeek-V4-pro (frontier anchor) | — | 100% |
| Qwen3-4B-2507 — distilled | 4B | 100% |
| Gemma-4-12b-qat | 12B | 83% |
| Qwen3-4B-2507 — stock (+plan prompt) | 4B | 75% |
| Gemma-3-12b | 12B | 33% |
The distilled 4B matches frontier and beats both Gemma-12B variants at ⅓ the parameters — higher agentic accuracy and smaller/faster/less-RAM. For a multi-turn agentic app, the distilled 4B is the clear winner over the incumbent.
Honest framing: the 4B is specialized on this domain (GorillaFileSystem) via cheap frontier-distillation; Gemma is zero-shot. The claim is the project thesis — a cheaply specialized small model beats a larger general model on the target task — not “4B > 12B in general.” Distilling Gemma the same way would likely lift it too. Caveats: Gemma-3-12b’s 33% partly reflects weaker tool-call formatting (Gemma-4-qat’s 83% shows the protocol is fine, so most of the gap is real capability); Pace’s exact production Gemma is still TBD (if Gemma-3, the upgrade is dramatic; if Gemma-4-qat, still +17pp at ⅓ size); decode tok/s + RAM (the 4B wins both structurally) and single-turn (distilled 4B ~83) are the remaining leaderboard columns.
8.3 Domain saturated at 4B + a free frontier backend (2026-06-16)
Pushed a longer-horizon veryhard tier (6-8 turns, 9-16 calls, heavy cd-navigation; new
templates the 4B never trained on — gen_multiturn_trajdata.py … veryhard) to see if a harder
gate would finally separate the distilled 4B from frontier and justify a bigger model:
| Model | veryhard (12) |
|---|---|
| gpt-5.5 (true frontier) | 100% |
| Qwen3-4B-2507 — distilled | 100% |
| DeepSeek-V4-pro | 83% |
| Qwen3-4B-2507 — stock (+plan) | 25% |
The distilled 4B aces it too — matching the strongest frontier (gpt-5.5) on longer unseen tasks, while stock collapses to 25%. Conclusion: the file-ops agentic domain is saturated at 4B — a harder file-ops gate won’t discriminate it, so a distilled 12B has no payoff here. The real open question is breadth (other BFCL backends — trading/ticketing/travel — that the 4B never trained on), not depth. (DeepSeek’s 83% < gpt-5.5’s 100% just means DeepSeek is a slightly less reliable frontier on fiddly 16-call navigation; the gate is sound — true frontier aces it.)
Free frontier backend (cost fix): validation + teacher trajectories now run on the Codex
CLI (gpt-5.5), free under subscription — scripts/bfcl_multiturn_codex.py drives it
single-shot per step via codex exec --output-schema (forced JSON tool-calls), reusing the same
BFCL executor + checker. Gotcha: OpenAI strict structured-output requires additionalProperties: false on every object and forbids free-form objects, so arguments is passed as a JSON string
and parsed. This retires the paid DeepSeek API for routine frontier work.
8.4 Breadth — narrow distillation causes negative transfer (2026-06-16)
Saturation at file-ops raised the real question: does the specialist generalize? Tested on 52 held-out single-backend, non-filesystem BFCL multi_turn tasks (TradingBot 20, VehicleControlAPI 19, TravelAPI 13 — domains the 4B never trained on), same generic prompt, distilled vs stock:
| 4B | file-ops (hard gate) | out-of-domain breadth (52) |
|---|---|---|
| stock | 58% | 59.6% |
| distilled (file-ops only) | 100% | 42.3% |
The file-ops distillation made the model worse everywhere else — 60% → 42%, a −17pt regression. Apples-to-apples (same tasks, same prompt), so this is real catastrophic forgetting / negative transfer, not noise. We bought depth (file-ops 58→100) at the cost of breadth. The distilled 4B is a file-ops specialist, not a better agent.
Implications:
- For a multi-domain product (Pace): a narrowly-distilled model is the wrong artifact unless it’s routed (used only on its domain). The general fix is multi-backend distillation — train across all backends at once so gains don’t come with forgetting.
- This is the strongest motivation for the self-improving auto-curriculum (self-improving-agents.md): a loop that samples + filters across the whole task distribution trains breadth and depth together, structurally avoiding the single-domain over-fit we just measured.
- Caveat: the absolute bar here isn’t frontier-validated yet (these are real multi_turn_base tasks; frontier may not ace all 52). The relative negative-transfer result is airtight regardless; the gpt-5.5 ceiling check (free, via the Codex backend) would place the absolute gap.
8.5 The gold-cloning ceiling — why breadth needs interleaved trajectories (2026-06-16)
Tried the obvious fix for §8.4’s negative transfer: multi-backend gold-cloning — 248 tasks spanning file-ops + Trading/Vehicle/Travel/Ticket/Message/Twitter (clean split from the 52-task eval), gold-cloned and SFT’d. It made breadth worse:
| 4B | file-ops (depth) | breadth (52 out-of-domain) |
|---|---|---|
| stock | 58% | 59.6% ← still the best breadth |
| file-ops gold-distill | 100% | 42.3% |
| multi-backend gold-distill | 100% | 30.8% |
Root cause (measured, not guessed): 52% of multi-backend turns have a call argument that
comes from a tool result, not the user prompt. Example multi_turn_base_57: the gold lumps one
turn as get_zipcode(...), get_zipcode(...), estimate_distance(cityA='69238', cityB='51479') —
where 69238/51479 are the zipcodes the get_zipcode calls return. Behaviour-cloning that
gold teaches the model to (a) emit all calls blind before seeing any result and (b) hallucinate
the specific result values. More such data → more harm.
The law: gold-cloning ≡ frontier distillation only when call args are derivable from the user
prompt (file-ops: names, paths — §8.1 worked for exactly this reason). For data-dependent
agency, the thing to learn is the trajectory structure — call → read result → use result in the next call — and the gold does not contain that structure. Cloning concrete result-values is
anti-learning.
What actually fixes breadth: interleaved trajectories that demonstrate reading a result before
using it — either a frontier teacher (gpt-5.5 via the free Codex backend, which calls
get_zipcode, reads 69238, then calls estimate_distance) or the model’s own rollouts in a
ReST/RL loop (interleaving is intrinsic; the checker filters the correct ones). This is the
decisive, evidence-backed motivation for self-improving-agents.md:
the loop teaches the one thing gold-cloning structurally can’t.
Practical takeaway for Pace: today the stock 4B (60%) is the best multi-domain agent; the gold-distilled models win only on their narrow domain (file-ops 100%). Either route to the specialist on its domain, or train breadth with real interleaved trajectories — not gold-clones.
See also
- distillation.md — the distillation workflow + match-vs-from-scratch protocol.
- eval-methodology-2026-06-08.md — broader eval protocol.
- performance.md — the WASM register/cache-blocked matmul finding (microbench vs real-workload).
AGENTS.md→ “Eval philosophy” — the frontier-ceiling gate + reach-frontier-at-lower-cost goal.