Huge decode
696 tok/s, 96M/Huge preset, ctx 1024
Huge Preset Decode Throughput
This is a runtime artifact, not a model-quality claim. Its value is operational: local eval loops need throughput, stable serving, and cheap repeated generation.
Headline Numbers
Mega pilot
293 tok/s, 960M pilot
Warm TTFT p99
5.8ms, reported runtime metric
Competitive Context
| System | Metric | Score | Size / Class | Comparable? | Readout |
|---|---|---|---|---|---|
| TinyGPT Huge preset | decode throughput | 696 tok/s | 96M | Direct | Local runtime baseline for cheap repeated eval/smoke loops. |
| TinyGPT Mega pilot | decode throughput | 293 tok/s | 960M | Direct | Shows the throughput drop as local model size approaches specialist scale. |
| External serving stacks | same benchmark | not measured | MLX/llama.cpp/Ollama class | Not comparable | Needs a shared prompt/config/device table before public competitive serving claims. |
Direct rows share this artifact's eval setup. Directional rows, where present, are useful market context but should not be read as leaderboard claims. Rows marked Not comparable have not been measured under this setup.
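The size of the drop between the two Direct rows can be worked out from the table figures. The short sketch below does only that arithmetic; the variable names are illustrative. It is a two-point comparison, and the source does not state the Mega pilot's context length, so it should not be read as a scaling law.

```python
# Figures copied from the two Direct rows of the table above.
huge_tok_s, huge_params_m = 696, 96
mega_tok_s, mega_params_m = 293, 960

param_ratio = mega_params_m / huge_params_m   # how much larger the Mega pilot is
throughput_ratio = mega_tok_s / huge_tok_s    # fraction of Huge throughput retained
drop_pct = (1 - throughput_ratio) * 100       # relative throughput drop in percent

print(f"{param_ratio:.0f}x parameters -> {throughput_ratio:.2f}x throughput ({drop_pct:.1f}% drop)")
# prints "10x parameters -> 0.42x throughput (57.9% drop)"
```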
Runtime Numbers
| Metric | Value | Use |
|---|---|---|
| Decode throughput | 696 tok/s | Fast local eval/smoke loops |
| Mega pilot throughput | 293 tok/s | Boundary mapping for larger local models |
| Warm TTFT p99 | 5.8ms | Interactive serving viability |
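For readers reproducing these metrics, here is a minimal sketch of how decode throughput and a p99 latency could be computed from raw timings. It is not the harness behind this artifact: the function names are hypothetical, the nearest-rank percentile method is an assumption (the source does not say which method was used), and the sample inputs are made up.

```python
import math
from typing import Sequence


def decode_throughput(tokens_generated: int, decode_seconds: float) -> float:
    """Tokens per second over the decode phase (assumes prefill time is excluded)."""
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    return tokens_generated / decode_seconds


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up inputs for illustration only, not measurements from this artifact.
example_tok_s = decode_throughput(tokens_generated=512, decode_seconds=0.8)
example_ttft_ms = [float(ms) for ms in range(1, 101)]  # 1.0 ms .. 100.0 ms
example_p99 = percentile_nearest_rank(example_ttft_ms, 99)
```

A p99 needs enough warm requests to be meaningful: with fewer than 100 samples, nearest-rank p99 is simply the maximum.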
Release Blockers
Preset-specific
The 696 tok/s headline applies only to the 96M Huge preset at ctx 1024; it is not a blanket claim for all HF models or specialists.
Unblock: Attach latency, RAM, and tok/s numbers to each future specialist artifact.
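One way to make the unblock step checkable is a per-artifact record that flags which numbers are still missing. The sketch below is an assumption about how that could look, not an existing schema: `RuntimeRecord` and its field names are hypothetical. Values are filled only where this report states them. RAM is not reported here, and the report does not say which preset the 5.8ms TTFT was measured on, so both are left unset.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RuntimeRecord:
    """Hypothetical per-artifact runtime record; None means 'not reported'."""
    artifact: str
    params_millions: int
    context_length: Optional[int]
    decode_tok_per_s: Optional[float]
    warm_ttft_p99_ms: Optional[float]
    peak_ram_mb: Optional[float]

    def missing_fields(self) -> List[str]:
        """Names of fields that still need a measurement, in declaration order."""
        return [name for name, value in asdict(self).items() if value is None]


# Only values stated in this report are filled in.
huge = RuntimeRecord("TinyGPT Huge preset", 96, 1024, 696.0, None, None)
mega = RuntimeRecord("TinyGPT Mega pilot", 960, None, 293.0, None, None)

for record in (huge, mega):
    print(record.artifact, "missing:", record.missing_fields())
```

A release check could then refuse to publish any specialist artifact whose `missing_fields()` list is non-empty.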