CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

arXiv cs.CL / 4/8/2026


Key Points

  • CUE-R is proposed as a lightweight, intervention-based evaluation framework to measure the operational utility of each retrieved evidence item in single-shot RAG, using shallow retrieval-use traces.
  • It perturbs individual evidence items with REMOVE, REPLACE, and DUPLICATE operators, then evaluates effects on correctness, proxy grounding faithfulness, confidence error, and an additional trace-divergence signal.
  • Experiments on HotpotQA and 2WikiMultihopQA using Qwen-3 8B and GPT-5.2 find that REMOVE and REPLACE largely degrade correctness and grounding and strongly shift traces, whereas DUPLICATE tends to be answer-redundant but not fully behaviorally neutral.
  • The study argues that answer-only RAG evaluation can miss important evidence-level effects, and shows non-additive interactions between multi-hop evidence items (e.g., removing both supports can hurt more than removing either alone).
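The intervention loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `run` callable, exact-match correctness, and the 0/1 trace-divergence proxy are all simplifying assumptions standing in for the paper's actual model interface and metrics (which also include proxy grounding faithfulness and confidence error).

```python
def remove(evidence, i):
    """REMOVE operator: drop evidence item i."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence, i, distractor):
    """REPLACE operator: swap evidence item i for a distractor passage."""
    out = list(evidence)
    out[i] = distractor
    return out

def duplicate(evidence, i):
    """DUPLICATE operator: append a second copy of evidence item i."""
    return list(evidence) + [evidence[i]]

def utility_deltas(run, evidence, gold, distractor):
    """Per-item operational utility via intervention.

    `run(evidence)` is assumed to return (answer, trace) from a single-shot
    RAG call; correctness here is exact match against `gold`, and trace
    divergence is a crude 0/1 "did the trace change at all" proxy.
    """
    base_ans, base_trace = run(evidence)
    base_correct = float(base_ans == gold)
    report = {}
    for i in range(len(evidence)):
        for name, perturbed in (("REMOVE", remove(evidence, i)),
                                ("REPLACE", replace(evidence, i, distractor)),
                                ("DUPLICATE", duplicate(evidence, i))):
            ans, trace = run(perturbed)
            report[(i, name)] = {
                "d_correct": float(ans == gold) - base_correct,
                "trace_divergence": float(trace != base_trace),
            }
    return report
```

In practice `run` would wrap the RAG pipeline, and trace divergence would use a graded distance over retrieval-use traces rather than a binary flag; the binary proxy is enough to show why DUPLICATE can be answer-redundant (zero correctness delta) yet still behaviorally non-neutral (nonzero trace shift).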

Abstract

As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly captures the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
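The two-support ablation amounts to comparing the joint removal against the sum of single removals. A minimal sketch of that non-additivity check, under the same assumptions as before (a hypothetical `run` callable returning an answer and a trace, and exact-match correctness as the metric):

```python
def correctness(run, evidence, gold):
    """Exact-match correctness of the answer produced from `evidence`."""
    ans, _trace = run(evidence)
    return float(ans == gold)

def interaction_gap(run, evidence, gold, i, j):
    """Non-additivity of two supporting items i and j (a sketch).

    Computes the correctness drop from removing both supports minus the
    sum of the two single-removal drops. A positive gap means the joint
    removal hurts more than an additive model of item utility predicts.
    """
    base = correctness(run, evidence, gold)

    def drop_without(ks):
        kept = [d for n, d in enumerate(evidence) if n not in ks]
        return base - correctness(run, kept, gold)

    joint_drop = drop_without({i, j})
    return joint_drop - (drop_without({i}) + drop_without({j}))
```

With a model that can answer from either support alone, both single-removal drops are zero while the joint drop is large, so the gap is strictly positive, which is exactly the interaction pattern the abstract reports for multi-hop evidence.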