RouteNLP: Closed-Loop LLM Routing with Conformal Cascading and Distillation Co-Optimization

arXiv cs.CL · April 28, 2026


Key Points

  • RouteNLP is a closed-loop LLM routing framework that directs NLP queries across a tiered model portfolio to cut inference costs while meeting per-task quality requirements.
  • It combines a difficulty-aware router trained with preference and quality signals, a confidence-calibrated cascading mechanism using conformal prediction for robust thresholds, and a co-optimization loop that distills knowledge into cheaper models after escalation failures.
  • In an 8-week enterprise pilot (~5K queries/day), RouteNLP reduced inference costs by 58% while keeping 91% response acceptance and sharply improving p99 latency from 1,847 ms to 387 ms.
  • Across a six-task benchmark in finance, customer service, and legal domains, it delivers 40–85% cost reductions while preserving high quality (96–100% on structured tasks and 96–98% on generation tasks), with 74.5% of routed generation outputs matching or exceeding frontier-model quality in human evaluation.
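The conformal-prediction step behind the cascading mechanism can be sketched concretely. Split conformal prediction picks a confidence threshold from held-out calibration scores so that, with a chosen coverage level, a new score from the same distribution clears the threshold. The function below is a minimal illustration of that quantile rule, not RouteNLP's actual implementation; the names and the use of a single scalar score per example are assumptions.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal quantile rule (illustrative sketch).

    Given confidence scores from a held-out calibration set, return a
    threshold such that, with probability >= 1 - alpha, a fresh score
    from the same distribution lands at or below it. This guarantee is
    distribution-free: it needs no assumption about the score's shape.
    """
    n = len(calibration_scores)
    # Finite-sample-corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points to certify this coverage level.
        return float("inf")
    return sorted(calibration_scores)[k - 1]
```

With 10 calibration scores and alpha = 0.2, the rule selects the 9th-smallest score, so the threshold already reflects the finite-sample correction rather than the naive 80th percentile.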

Abstract

Serving diverse NLP workloads with large language models is costly: at one enterprise partner, inference costs exceeded $200K/month despite over 70% of queries being routine tasks well within the capability of smaller models. We present RouteNLP, a closed-loop framework that routes queries across a tiered model portfolio to minimize cost while satisfying per-task quality constraints. The framework integrates three components: a difficulty-aware router with shared task-conditioned representations trained on preference data and quality signals; confidence-calibrated cascading that uses conformal prediction for distribution-free threshold initialization; and a distillation-routing co-optimization loop that clusters escalation failures, applies targeted knowledge distillation to cheaper models, and automatically retrains the router, yielding over twice the cost improvement of untargeted distillation. In an 8-week pilot deployment processing ~5K queries/day at an enterprise customer-service division, RouteNLP reduced inference costs by 58% while maintaining 91% response acceptance and reducing p99 latency from 1,847 ms to 387 ms. On a six-task benchmark spanning finance, customer service, and legal domains, the framework achieves 40–85% cost reduction while retaining 96–100% quality on structured tasks and 96–98% on generation tasks, with human evaluation confirming that 74.5% of routed generation outputs match or exceed frontier-model quality.
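The cascading behavior the abstract describes (try a cheap model first, escalate when confidence falls below a calibrated threshold) can be sketched in a few lines. Everything here is a hypothetical illustration under assumed interfaces: each tier is a function returning an answer plus a confidence score, and the thresholds would come from a calibration procedure such as the conformal one the paper describes.

```python
def cascade(query, tiers, thresholds):
    """Route a query through a tiered model portfolio (illustrative sketch).

    tiers:      list of (name, model_fn) pairs, cheapest first; each
                model_fn(query) returns (answer, confidence). This
                interface is assumed, not RouteNLP's actual API.
    thresholds: per-tier minimum confidence to accept without escalating.
    """
    answer, name = None, None
    for (name, model_fn), tau in zip(tiers, thresholds):
        answer, conf = model_fn(query)
        if conf >= tau:
            # Confidence clears the calibrated bar: stop here, saving
            # the cost of invoking larger models.
            return answer, name
    # No tier was confident enough; fall back to the last (largest) tier.
    return answer, name
```

Setting the final tier's threshold to 0 makes the frontier model an unconditional fallback, which is how a cascade bounds worst-case quality while keeping most traffic on cheap models.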