Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

arXiv cs.CL · March 26, 2026


Key Points

  • The study evaluates retrieval-augmented generation (RAG) for AI policy question answering using the AGORA corpus of 947 AI policy documents, a domain characterized by dense legal language and overlapping, evolving regulations.
  • The authors build a RAG pipeline with a ColBERT-based retriever (fine-tuned via contrastive learning) and a generator aligned to human preferences using Direct Preference Optimization (DPO), adapting the system with synthetic queries and pairwise preferences.
  • Domain-specific retrieval fine-tuning improves retrieval metrics, but it does not consistently improve end-to-end answer relevance and faithfulness for policy QA.
  • In some cases, stronger retrieval increases confident hallucinations when the necessary documents are missing from the corpus, underscoring limits of component-level optimization.
  • The findings warn builders of policy-focused RAG systems that improvements to individual modules may not yield reliable grounded answers over dynamic regulatory collections, motivating end-to-end evaluation and robustness work.
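The retriever adaptation summarized above, contrastive fine-tuning on synthetic queries, can be illustrated with an in-batch-negative InfoNCE-style loss. This is a minimal NumPy sketch of the general technique, not the paper's ColBERT training code; the function name and toy data are hypothetical:

```python
import numpy as np

def info_nce_loss(query_embs, doc_embs, temperature=0.05):
    """In-batch contrastive loss: each query's positive is the
    document at the same index; all other documents in the batch
    serve as negatives."""
    # Cosine similarities via L2-normalized embeddings
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                # (batch, batch)
    # Softmax cross-entropy against the diagonal (matching pairs)
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy batch: queries nearly aligned with their own documents
# should incur lower loss than randomly shuffled queries.
rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8))
aligned = info_nce_loss(docs + 0.01 * rng.normal(size=(4, 8)), docs)
random_q = info_nce_loss(rng.normal(size=(4, 8)), docs)
print(f"aligned={aligned:.3f}  random={random_q:.3f}")
```

ColBERT itself scores query-document pairs with late interaction over token embeddings rather than a single vector per text, but the contrastive objective takes the same positive-vs-negative shape.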

Abstract

Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
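The DPO alignment step mentioned in the abstract optimizes a simple pairwise objective over preferred ("chosen") and dispreferred ("rejected") answers, scored against a frozen reference model. A minimal sketch of that objective with illustrative log-probabilities (not the paper's implementation or values):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    pushes the policy to raise the chosen answer's likelihood relative
    to the rejected one, measured against a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid)

# Policy prefers the chosen answer more than the reference did -> low loss
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected answer -> high loss
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
print(f"low={low:.3f}  high={high:.3f}")
```

The scaling factor `beta` controls how strongly the policy is penalized for drifting from the reference model's preferences; in practice the loss is averaged over a dataset of pairwise preferences like the ones the authors collect.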