Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment
arXiv cs.CL / 4/6/2026
Key Points
- The paper argues that small language models (1–4B parameters) may now be accurate and fast enough to handle “front-door routing” task classification with near-zero marginal cost and sub-second latency, making routing overhead negligible in inference budgets.
- Using a harmonized offline benchmark across Phi-3.5-mini, Qwen2.5-1.5B, and Qwen2.5-3B on identical Azure T4 hardware, Qwen2.5-3B achieves the best exact-match accuracy (0.783) and the strongest latency–accuracy tradeoff, including nonzero accuracy across all six task families.
- In a synthetic-traffic randomized experiment comparing Phi-4-mini, Qwen2.5-3B, and DeepSeek-V3 against a no-routing control, DeepSeek-V3 scores the highest accuracy (0.830), but its P95 latency of 2,295 ms misses the pre-registered requirement.
- Qwen2.5-3B is Pareto-dominant among the self-hosted options in the experiment (0.793 accuracy, 988 ms median latency, and $0 marginal cost), but no tested model satisfies the standalone production-viability criteria (≥0.85 accuracy and ≤2,000 ms P95 latency).
- The authors conclude that the cost and latency prerequisites appear met, yet a remaining 6–8 percentage-point accuracy gap, along with the open question of whether correct routing guarantees downstream output quality, limits production readiness.
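The pre-registered viability criteria above can be expressed as a simple joint check. A minimal sketch in Python, assuming the two thresholds reported in the summary (≥0.85 exact-match accuracy and ≤2,000 ms P95 latency); the function name `viable` and the structure of the check are illustrative, not from the paper:

```python
# Sketch of the paper's standalone production-viability criteria,
# as reported in this summary: >= 0.85 exact-match accuracy AND
# <= 2,000 ms P95 latency. Function name is illustrative.

ACC_MIN = 0.85      # minimum exact-match accuracy
P95_MAX_MS = 2000   # maximum P95 latency, milliseconds

def viable(accuracy: float, p95_ms: float) -> bool:
    """True only if a router meets BOTH pre-registered criteria."""
    return accuracy >= ACC_MIN and p95_ms <= P95_MAX_MS

# DeepSeek-V3's reported figures (0.830 accuracy, 2,295 ms P95)
# fail both thresholds, consistent with the paper's conclusion.
print(viable(0.830, 2295))  # False
```

Because both conditions must hold simultaneously, DeepSeek-V3's accuracy lead does not compensate for its latency miss, which is why no tested model clears the bar on its own.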