Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines

arXiv cs.CL · April 27, 2026


Key Points

  • The paper studies whether Query Performance Prediction (QPP) can select the best query reformulation variant in RAG pipelines without running full retrieval and generation for every variant.
  • Unlike traditional QPP that estimates query difficulty across topics, it focuses on intra-topic discrimination by choosing among multiple semantically equivalent variants for the same information need.
  • Experiments on TREC-RAG show a “utility gap”: variants that score well on retrieval ranking metrics (e.g., nDCG) do not necessarily yield the best generated answers.
  • Despite this divergence, QPP can reliably pick variants that improve end-to-end answer quality, and lightweight pre-retrieval predictors often achieve similar or better results than costly post-retrieval methods.
  • Overall, the findings support latency-efficient variant selection to make RAG more computationally affordable while maintaining output quality.
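To make the selection idea concrete, here is a minimal sketch of pre-retrieval QPP for variant selection. It uses average inverse document frequency (avg-IDF), a classic pre-retrieval predictor, to score competing reformulations of one information need and pick the highest-scoring variant before any retrieval runs. The toy corpus, the variants, and the choice of avg-IDF are illustrative assumptions, not the paper's exact setup.

```python
import math
from collections import Counter

# Toy corpus standing in for the retrieval collection (illustration only).
corpus = [
    "neural retrieval models rank passages for a query",
    "dense retrievers encode queries and passages into vectors",
    "sparse retrieval uses term matching such as bm25",
    "language models generate answers from retrieved passages",
]

# Document frequency of each term across the corpus.
df = Counter(term for doc in corpus for term in set(doc.split()))
N = len(corpus)

def avg_idf(query: str) -> float:
    """Average IDF of query terms: a classic pre-retrieval QPP score."""
    terms = query.lower().split()
    return sum(math.log(N / (1 + df.get(t, 0))) for t in terms) / len(terms)

# Semantically equivalent variants of the same information need
# (hypothetical reformulations, e.g. produced by an LLM).
variants = [
    "how do dense retrievers work",
    "dense retrievers encode queries into vectors",
    "explain dense retrieval",
]

# Select the variant the predictor expects to perform best,
# without running retrieval or generation for each candidate.
best = max(variants, key=avg_idf)
print(best)
```

A real pipeline would swap in stronger predictors (or post-retrieval signals when the budget allows), but the control flow is the same: score all variants cheaply, then execute retrieval and generation only for the winner.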

Abstract

Large Language Models (LLMs) have made query reformulation ubiquitous in modern retrieval and Retrieval-Augmented Generation (RAG) pipelines, enabling the generation of multiple semantically equivalent query variants. However, executing the full pipeline for every reformulation is computationally expensive, motivating selective execution: can we identify the best query variant before incurring downstream retrieval and generation costs? We investigate Query Performance Prediction (QPP) as a mechanism for variant selection across ad-hoc retrieval and end-to-end RAG. Unlike traditional QPP, which estimates query difficulty across topics, we study intra-topic discrimination: selecting the optimal reformulation among competing variants of the same information need. Through large-scale experiments on TREC-RAG using both sparse and dense retrievers, we evaluate pre- and post-retrieval predictors under correlation- and decision-based metrics. Our results reveal a systematic divergence between retrieval and generation objectives: variants that maximize ranking metrics such as nDCG often fail to produce the best generated answers, exposing a "utility gap" between retrieval relevance and generation fidelity. Nevertheless, QPP can reliably identify variants that improve end-to-end quality over the original query. Notably, lightweight pre-retrieval predictors frequently match or outperform more expensive post-retrieval methods, offering a latency-efficient approach to robust RAG.