How You Ask Matters! Adaptive RAG Robustness to Query Variations

arXiv cs.CL / 4/14/2026


Key Points

  • The paper introduces the first large-scale benchmark focused on semantically identical but surface-form-different query variations to test Adaptive RAG robustness.
  • It evaluates how query rewrites affect answer quality, computational cost, and the retrieval decision logic that determines when retrieval is triggered.
  • The authors find a major robustness gap: even small surface changes can drastically change retrieval behavior and degrade accuracy.
  • Larger models perform better overall, but robustness to query variations does not scale proportionally with model size.
  • The results highlight a key practical vulnerability for Adaptive RAG systems, exposing the need for stronger handling of query paraphrases and rewrite-induced shifts.
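The robustness check the benchmark performs can be sketched in a few lines: group semantically identical query variants, run each through the adaptive retrieval gate, and measure how often all variants in a group receive the same decision. The gate below is a deliberately toy length heuristic standing in for a real Adaptive RAG decision module (the paper's actual methods are model-based); `should_retrieve` and the sample groups are illustrative assumptions, not from the paper.

```python
def should_retrieve(query: str) -> bool:
    # Toy stand-in for an Adaptive RAG gate: trigger retrieval
    # only for "complex-looking" (longer) queries.
    return len(query.split()) > 6

def decision_consistency(variant_groups) -> float:
    """Fraction of groups whose semantically identical variants
    all receive the same retrieval decision."""
    consistent = 0
    for variants in variant_groups:
        decisions = {should_retrieve(q) for q in variants}
        consistent += (len(decisions) == 1)
    return consistent / len(variant_groups)

# Each inner list holds surface-form rewrites of one underlying question.
groups = [
    ["Who wrote Hamlet?",
     "Could you tell me the author of the play Hamlet?"],
    ["Capital of France?",
     "What city serves as the capital of France?"],
]
print(decision_consistency(groups))  # → 0.0: every group's gate decision flips
```

Even this trivial example shows the failure mode the paper measures at scale: a verbose paraphrase crosses the gate's threshold while its terse twin does not, so retrieval behavior (and downstream cost and accuracy) diverges despite identical intent.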

Abstract

Adaptive Retrieval-Augmented Generation (RAG) promises accuracy and efficiency by dynamically triggering retrieval only when needed and is widely used in practice. However, real-world queries vary in surface form even with the same intent, and their impact on Adaptive RAG remains under-explored. We introduce the first large-scale benchmark of diverse yet semantically identical query variations, combining human-written and model-generated rewrites. Our benchmark facilitates a systematic evaluation of Adaptive RAG robustness by examining its key components across three dimensions: answer quality, computational cost, and retrieval decisions. We discover a critical robustness gap, where small surface-level changes in queries dramatically alter retrieval behavior and accuracy. Although larger models show better performance, robustness does not improve accordingly. These findings reveal that Adaptive RAG methods are highly vulnerable to query variations that preserve identical semantics, exposing a critical robustness challenge.