ConsRoute:Consistency-Aware Adaptive Query Routing for Cloud-Edge-Device Large Language Models
arXiv cs.AI / 3/24/2026
💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research
Key Points
- The paper introduces ConsRoute, a consistency-aware adaptive query routing framework for cloud-edge-device LLM inference to reduce latency and inference cost without significantly degrading response quality.
- ConsRoute improves routing decisions by using a reranker to measure fine-grained semantic consistency between responses from different model tiers, providing soft supervision beyond coarse output-quality gap estimates.
- To keep edge-device overhead low, it reuses hidden states from the LLM’s prefilling stage as compact query representations, avoiding extra encoders or additional inference passes.
- The method clusters these representations and uses Bayesian optimization to learn cluster-specific routing thresholds that balance quality, latency, and cost across heterogeneous query distributions.
- Experiments report near-cloud quality performance (≥95%) while cutting end-to-end latency and inference cost by about 40%, outperforming prior routing baselines.
Related Articles

Composer 2: What is new and Compares with Claude Opus 4.6 & GPT-5.4
Dev.to
How UCP Breaks Your E-Commerce Tracking Stack: A Platform-by-Platform Analysis
Dev.to
AI Text Analyzer vs Asking Friends: Which Gives Better Perspective?
Dev.to
[D] Cathie wood claims ai productivity wave is starting, data shows 43% of ceos save 8+ hours weekly
Reddit r/MachineLearning

Microsoft hires top AI researchers from Allen Institute for AI for Suleyman's Superintelligence team
THE DECODER