Lightweight Query Routing for Adaptive RAG: A Baseline Study on RAGRouter-Bench

arXiv cs.CL / 4/7/2026


Key Points

  • The study addresses an efficiency problem in Retrieval-Augmented Generation (RAG): choosing the best retrieval strategy per query based on query type to reduce token cost without losing capability.
  • It provides the first systematic evaluation of lightweight classifier-based query routing on RAGRouter-Bench, using five classical classifiers with three feature regimes (TF-IDF, MiniLM sentence embeddings, and structural features), resulting in 15 feature/classifier combinations.
  • The best-performing setup, TF-IDF features with an SVM, reaches 0.928 macro-F1 and 93.2% accuracy, and yields a simulated 28.1% token saving relative to always using the most expensive retrieval paradigm.
  • Lexical TF-IDF features outperform semantic sentence embeddings by 3.1 macro-F1 points, indicating surface keyword patterns are strong predictors of query-type complexity.
  • Domain analysis shows medical queries are the hardest to route and legal queries are the most tractable, and the authors identify a remaining gap for corpus-aware routing approaches.
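
To make the "simulated token savings" figure concrete, here is a minimal sketch of how such a number could be computed: route each query to a paradigm according to its predicted type, sum the token cost, and compare against always paying for the most expensive paradigm. The per-type costs and the one-to-one mapping from query type to paradigm are assumptions made for illustration; the summary does not state the cost model used in the paper, and `simulated_token_savings` is a hypothetical helper name.

```python
# Hypothetical per-query token costs, one retrieval paradigm per query type.
# These numbers are illustrative assumptions, not values from the paper.
TOKEN_COST = {"factual": 500, "reasoning": 1500, "summarization": 3000}


def simulated_token_savings(predicted_types, cost=TOKEN_COST):
    """Fraction of tokens saved versus always using the costliest paradigm."""
    if not predicted_types:
        raise ValueError("need at least one routed query")
    # Cost when each query goes to the paradigm matching its predicted type.
    routed = sum(cost[t] for t in predicted_types)
    # Cost when every query goes to the most expensive paradigm.
    baseline = max(cost.values()) * len(predicted_types)
    return 1.0 - routed / baseline


# Predicted query types for four toy queries.
predictions = ["factual", "reasoning", "summarization", "factual"]
savings = simulated_token_savings(predictions)
print(f"simulated token savings: {savings:.1%}")
```

Under this setup the saving depends only on the mix of predicted types and the assumed costs, so a misrouted query changes the total only through the cost gap between the two paradigms involved.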

Abstract

Retrieval-Augmented Generation pipelines span a wide range of retrieval strategies that differ substantially in token cost and capability. Selecting the right strategy per query is a practical efficiency problem, yet no routing classifiers have been trained on RAGRouter-Bench \citep{wang2026ragrouterbench}, a recently released benchmark of 7,727 queries spanning four knowledge domains, each annotated with one of three canonical query types: factual, reasoning, and summarization. We present the first systematic evaluation of lightweight classifier-based routing on this benchmark. Five classical classifiers are evaluated under three feature regimes, namely TF-IDF, MiniLM sentence embeddings \citep{reimers2019sbert}, and hand-crafted structural features, yielding 15 classifier-feature combinations. Our best configuration, TF-IDF with an SVM, achieves a macro-averaged F1 of 0.928 and an accuracy of 93.2%, while simulating 28.1% token savings relative to always using the most expensive paradigm. Lexical TF-IDF features outperform semantic sentence embeddings by 3.1 macro-F1 points, suggesting that surface keyword patterns are strong predictors of query-type complexity. Domain-level analysis reveals that medical queries are hardest to route and legal queries most tractable. These results establish a reproducible query-side baseline and highlight the gap that corpus-aware routing must close.
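
The abstract reports both macro-averaged F1 and accuracy. The sketch below shows how the two metrics are defined and why they can diverge when one query type is rare. It is a plain-Python illustration on made-up labels, not the paper's evaluation code; the function names `macro_f1` and `accuracy` are chosen here for the example.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Share of queries whose predicted type matches the annotation."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels: the single "reasoning" query is misrouted as "factual".
y_true = ["factual"] * 4 + ["reasoning", "summarization"]
y_pred = ["factual"] * 4 + ["factual", "summarization"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because macro-F1 weights every query type equally, missing a rare type lowers it far more than it lowers accuracy, which is one reason to report both.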