CUE-R: Beyond the Final Answer in Retrieval-Augmented Generation

arXiv cs.CL / 4/8/2026


Key Points

  • CUE-R is proposed as a lightweight, intervention-based evaluation framework to measure the operational utility of each retrieved evidence item in single-shot RAG, using shallow retrieval-use traces.
  • It perturbs individual evidence items with REMOVE, REPLACE, and DUPLICATE operators, then evaluates effects on correctness, proxy grounding faithfulness, confidence error, and an additional trace-divergence signal.
  • Experiments on HotpotQA and 2WikiMultihopQA using Qwen-3 8B and GPT-5.2 find that REMOVE and REPLACE largely degrade correctness and grounding and strongly shift traces, whereas DUPLICATE tends to be answer-redundant but not fully behaviorally neutral.
  • The study argues that answer-only RAG evaluation can miss important evidence-level effects, and shows non-additive interactions between multi-hop evidence items (e.g., removing both supports can hurt more than removing either alone).
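The intervention loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `run` callable, exact-match correctness, and the 0/1 trace-divergence proxy are all simplifying assumptions standing in for the paper's actual model interface and metrics (which also include proxy grounding faithfulness and confidence error).

```python
def remove(evidence, i):
    """REMOVE operator: drop evidence item i."""
    return evidence[:i] + evidence[i + 1:]

def replace(evidence, i, distractor):
    """REPLACE operator: swap evidence item i for a distractor passage."""
    out = list(evidence)
    out[i] = distractor
    return out

def duplicate(evidence, i):
    """DUPLICATE operator: append a second copy of evidence item i."""
    return list(evidence) + [evidence[i]]

def utility_deltas(run, evidence, gold, distractor):
    """Per-item operational utility via intervention.

    `run(evidence)` is assumed to return (answer, trace) from a single-shot
    RAG call; correctness here is exact match against `gold`, and trace
    divergence is a crude 0/1 "did the trace change at all" proxy.
    """
    base_ans, base_trace = run(evidence)
    base_correct = float(base_ans == gold)
    report = {}
    for i in range(len(evidence)):
        for name, perturbed in (("REMOVE", remove(evidence, i)),
                                ("REPLACE", replace(evidence, i, distractor)),
                                ("DUPLICATE", duplicate(evidence, i))):
            ans, trace = run(perturbed)
            report[(i, name)] = {
                "d_correct": float(ans == gold) - base_correct,
                "trace_divergence": float(trace != base_trace),
            }
    return report
```

In practice `run` would wrap the RAG pipeline, and trace divergence would use a graded distance over retrieval-use traces rather than a binary flag; the binary proxy is enough to show why DUPLICATE can be answer-redundant (zero correctness delta) yet still behaviorally non-neutral (nonzero trace shift).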

Abstract

As language models shift from single-shot answer generation toward multi-step reasoning that retrieves and consumes evidence mid-inference, evaluating the role of individual retrieved items becomes more important. Existing RAG evaluation typically targets final-answer quality, citation faithfulness, or answer-level attribution, but none of these directly captures the intervention-based, per-evidence-item utility view we study here. We introduce CUE-R, a lightweight intervention-based framework for measuring per-evidence-item operational utility in single-shot RAG using shallow observable retrieval-use traces. CUE-R perturbs individual evidence items via REMOVE, REPLACE, and DUPLICATE operators, then measures changes along three utility axes (correctness, proxy-based grounding faithfulness, and confidence error) plus a trace-divergence signal. We also outline an operational evidence-role taxonomy for interpreting intervention outcomes. Experiments on HotpotQA and 2WikiMultihopQA with Qwen-3 8B and GPT-5.2 reveal a consistent pattern: REMOVE and REPLACE substantially harm correctness and grounding while producing large trace shifts, whereas DUPLICATE is often answer-redundant yet not fully behaviorally neutral. A zero-retrieval control confirms that these effects arise from degradation of meaningful retrieval. A two-support ablation further shows that multi-hop evidence items can interact non-additively: removing both supports harms performance far more than either single removal. Our results suggest that answer-only evaluation misses important evidence effects and that intervention-based utility analysis is a practical complement for RAG evaluation.
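The two-support ablation amounts to comparing the joint removal against the sum of single removals. A minimal sketch of that non-additivity check, under the same assumptions as before (a hypothetical `run` callable returning an answer and a trace, and exact-match correctness as the metric):

```python
def correctness(run, evidence, gold):
    """Exact-match correctness of the answer produced from `evidence`."""
    ans, _trace = run(evidence)
    return float(ans == gold)

def interaction_gap(run, evidence, gold, i, j):
    """Non-additivity of two supporting items i and j (a sketch).

    Computes the correctness drop from removing both supports minus the
    sum of the two single-removal drops. A positive gap means the joint
    removal hurts more than an additive model of item utility predicts.
    """
    base = correctness(run, evidence, gold)

    def drop_without(ks):
        kept = [d for n, d in enumerate(evidence) if n not in ks]
        return base - correctness(run, kept, gold)

    joint_drop = drop_without({i, j})
    return joint_drop - (drop_without({i}) + drop_without({j}))
```

With a model that can answer from either support alone, both single-removal drops are zero while the joint drop is large, so the gap is strictly positive, which is exactly the interaction pattern the abstract reports for multi-hop evidence.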