RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

arXiv cs.CL · April 28, 2026


Key Points

  • RouteNLP is a closed-loop LLM routing framework that directs NLP queries across a tiered model portfolio to cut inference costs while meeting per-task quality requirements.
  • It combines a difficulty-aware router trained with preference and quality signals, a confidence-calibrated cascading mechanism using conformal prediction for robust thresholds, and a co-optimization loop that distills knowledge into cheaper models after escalation failures.
  • In an 8-week enterprise pilot (~5K queries/day), RouteNLP reduced inference costs by 58% while keeping 91% response acceptance and sharply improving p99 latency from 1,847 ms to 387 ms.
  • Across a six-task benchmark in finance, customer service, and legal domains, it delivers 40–85% cost reductions while preserving high quality (96–100% on structured tasks and 96–98% on generation tasks), with 74.5% of routed generation outputs matching or exceeding frontier-model quality in human evaluation.
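The conformal-prediction step behind the cascading mechanism can be sketched concretely. Split conformal prediction picks a confidence threshold from held-out calibration scores so that, with a chosen coverage level, a new score from the same distribution clears the threshold. The function below is a minimal illustration of that quantile rule, not RouteNLP's actual implementation; the names and the use of a single scalar score per example are assumptions.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal quantile rule (illustrative sketch).

    Given confidence scores from a held-out calibration set, return a
    threshold such that, with probability >= 1 - alpha, a fresh score
    from the same distribution lands at or below it. This guarantee is
    distribution-free: it needs no assumption about the score's shape.
    """
    n = len(calibration_scores)
    # Finite-sample-corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return float("inf")
    return sorted(calibration_scores)[k - 1]
```

With 10 calibration scores and alpha = 0.2, the rule selects the 9th-smallest score, so the threshold already reflects the finite-sample correction rather than the naive 80th percentile.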

Abstract

Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded $200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted distillation. In an 8-week pilot deployment processing ~5K queries/day at an enterprise customer-service division, RouteNLP reduced inference costs by 58% while maintaining 91% response acceptance and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, the framework achieves 40–85% cost reduction while retaining 96–100% quality on structured tasks and 96–98% on generation tasks, with human evaluation confirming that 74.5% of routed generation outputs match or exceed frontier-model quality.
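The cascading behavior the abstract describes (try a cheap model first, escalate when confidence falls below a calibrated threshold) can be sketched in a few lines. Everything here is a hypothetical illustration under assumed interfaces: each tier is a function returning an answer plus a confidence score, and the thresholds would come from a calibration procedure such as the conformal one the paper describes.

```python
def cascade(query, tiers, thresholds):
    """Route a query through a tiered model portfolio (illustrative sketch).

    tiers:      list of (name, model_fn) pairs, cheapest first; each
                model_fn(query) returns (answer, confidence). This
                interface is assumed, not RouteNLP's actual API.
    thresholds: per-tier minimum confidence to accept without escalating.
    """
    answer, name = None, None
    for (name, model_fn), tau in zip(tiers, thresholds):
        answer, conf = model_fn(query)
        if conf >= tau:
            # Confidence clears the calibrated bar: stop here, saving
            # the cost of invoking larger models.
            return answer, name
    # No tier was confident enough; fall back to the last (largest) tier.
    return answer, name
```

Setting the final tier's threshold to 0 makes the frontier model an unconditional fallback, which is how a cascade bounds worst-case quality while keeping most traffic on cheap models.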