UR$^2$: Unify RAG and Reasoning through Reinforcement Learning

arXiv cs.CL / 4/27/2026


Key Points

  • The paper proposes UR$^2$, a reinforcement-learning framework designed to unify Retrieval-Augmented Generation (RAG) with complex reasoning by dynamically coordinating when to retrieve and how to reason.
  • UR$^2$ uses a difficulty-aware curriculum that triggers retrieval selectively for harder examples, aiming to balance retrieval vs. reasoning more effectively than fixed-retrieval setups.
  • It also introduces a hybrid knowledge access strategy that combines domain-specific offline corpora with on-the-fly LLM-generated summaries to improve robustness to noisy or imperfect information.
  • Experiments across open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show UR$^2$ outperforms prior RAG and RL baselines and reaches performance comparable to GPT-4o-mini and GPT-4.1-mini on multiple benchmarks.
  • The method is implemented using models such as Qwen-2.5-3/7B and LLaMA-3.1-8B, and the authors provide code via GitHub.
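To make the difficulty-aware curriculum concrete, here is a minimal sketch of how selective retrieval gating could work: sample the model several times without retrieval, and invoke retrieval only when the no-retrieval pass rate marks the instance as hard. The function names, the pass-rate criterion, and the threshold are illustrative assumptions, not the paper's implementation.

```python
def pass_rate(sampled_answers, gold):
    """Fraction of sampled answers matching the gold answer
    (a simple stand-in for a verifiable-reward check)."""
    if not sampled_answers:
        return 0.0
    return sum(a == gold for a in sampled_answers) / len(sampled_answers)

def should_retrieve(sampled_answers, gold, threshold=0.5):
    """Difficulty-aware gate (hypothetical): trigger retrieval only
    when the model's no-retrieval pass rate falls below threshold,
    i.e., the instance is 'hard'."""
    return pass_rate(sampled_answers, gold) < threshold

# Toy usage with made-up samples:
easy = ["Paris", "Paris", "Paris", "Lyon"]   # model is mostly right
hard = ["1912", "1915", "1910", "1912"]      # model never hits the gold "1914"
print(should_retrieve(easy, "Paris"))  # False: answer without retrieval
print(should_retrieve(hard, "1914"))   # True: hard instance, retrieve
```

In a curriculum setting, such a gate would be recomputed as training progresses, so items the model has mastered stop consuming retrieval budget.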

Abstract

Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex reasoning. However, existing attempts to unify these paradigms remain narrow in scope, typically limited to open-domain QA with fixed retrieval settings, which constrains generalization to broader domains. To address this limitation, we propose UR^2 (Unified RAG and Reasoning), a general reinforcement learning framework that dynamically coordinates retrieval and reasoning. UR^2 introduces two key designs: a difficulty-aware curriculum that selectively invokes retrieval only for challenging instances, and a hybrid knowledge access strategy that combines domain-specific offline corpora with on-the-fly LLM-generated summaries. Together, these components mitigate the imbalance between retrieval and reasoning and improve robustness to noisy information. Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR^2, built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to GPT-4o-mini and GPT-4.1-mini on several benchmarks. Our code is available at https://github.com/Tsinghua-dhy/UR2.