Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment

arXiv cs.CL / 4/6/2026

Key Points

  • The paper argues that small language models (1–4B parameters) may now be accurate and fast enough to handle “front-door routing” task classification with near-zero marginal cost and sub-second latency, making routing overhead negligible in inference budgets.
  • In a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, Qwen-2.5-3B achieves the best exact-match accuracy (0.783) and the strongest latency–accuracy tradeoff, and it is the only model with nonzero accuracy on all six task families.
  • In a pre-registered, synthetic-traffic randomized experiment comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control, DeepSeek-V3 scores the highest accuracy (0.830) but, with a measured P95 latency of 2,295 ms, fails the pre-registered P95 latency gate.
  • Qwen-2.5-3B is Pareto-dominant among self-hosted options in the experiment (0.793 accuracy, 988 ms median latency, and $0 marginal cost), but no tested model satisfies standalone production viability criteria (≥0.85 accuracy and ≤2,000 ms P95).
  • The authors conclude that the cost and latency prerequisites are met, but that a remaining accuracy gap of 6–8 percentage points, together with the untested question of whether correct classification translates to downstream output quality, still limits production readiness.

Abstract

Selecting the appropriate model at inference time -- the routing problem -- requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and high-latency, reducing a multi-objective optimization to single-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60-case corpus. Qwen-2.5-3B achieves the best exact-match accuracy (0.783), the strongest latency-accuracy tradeoff, and the only nonzero accuracy on all six task families. Study 2 is a pre-registered four-arm randomized experiment under synthetic traffic with an effective sample size of 60 unique cases per arm, comparing Phi-4-mini, Qwen-2.5-3B, and DeepSeek-V3 against a no-routing control. DeepSeek-V3 attains the highest accuracy (0.830) but fails the pre-registered P95 latency gate (2,295 ms); Qwen-2.5-3B is Pareto-dominant among self-hosted models (0.793 accuracy, 988 ms median, $0 marginal cost). No model meets the standalone viability criterion (>=0.85 accuracy, <=2,000 ms P95). The cost and latency prerequisites for SLM-based routing are met; the accuracy gap of 6-8 percentage points and the untested question of whether correct classification translates to downstream output quality bound the remaining distance to production viability.
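
The standalone viability criterion quoted above (>=0.85 accuracy, <=2,000 ms P95) can be written as a small check, which makes the shortfall arithmetic explicit. The following is a minimal sketch, not the authors' evaluation code: the names `ArmResult`, `accuracy_gap_pp`, and `meets_viability` are hypothetical, and only figures stated in this summary are used. A P95 value is left unset wherever the summary does not report one.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the viability criterion quoted in the abstract.
MIN_ACCURACY = 0.85
MAX_P95_MS = 2000.0


@dataclass
class ArmResult:
    """One model's reported figures (hypothetical container, for illustration)."""
    name: str
    accuracy: float
    p95_ms: Optional[float] = None  # None when no P95 is reported in the summary


def accuracy_gap_pp(result: ArmResult) -> float:
    """Percentage points by which accuracy falls short of the threshold (0 if met)."""
    return round(max(0.0, MIN_ACCURACY - result.accuracy) * 100, 1)


def meets_viability(result: ArmResult) -> Optional[bool]:
    """Return True/False when decidable, or None if accuracy passes but P95 is unknown."""
    if result.accuracy < MIN_ACCURACY:
        return False  # fails regardless of latency
    if result.p95_ms is None:
        return None  # cannot decide without a P95 measurement
    return result.p95_ms <= MAX_P95_MS


# Figures as reported in the abstract; P95 is given only for DeepSeek-V3.
qwen_study1 = ArmResult("Qwen-2.5-3B (Study 1)", accuracy=0.783)
qwen_study2 = ArmResult("Qwen-2.5-3B (Study 2)", accuracy=0.793)
deepseek = ArmResult("DeepSeek-V3 (Study 2)", accuracy=0.830, p95_ms=2295.0)

for arm in (qwen_study1, qwen_study2, deepseek):
    print(arm.name, accuracy_gap_pp(arm), meets_viability(arm))
```

Under these assumptions the Qwen-2.5-3B figures fall roughly 6 to 7 percentage points short of the accuracy threshold, in line with the gap the abstract describes, while DeepSeek-V3 misses on both accuracy and P95 latency.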