OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models

arXiv cs.CL / 4/23/2026


Key Points

  • The paper introduces OThink-SRR1, an approach that improves Retrieval-Augmented Generation (RAG) for LLMs on complex multi-hop questions by adding an iterative Search–Refine–Reason loop.
  • Its key innovation is a Refine stage that distills retrieved documents into concise, relevant facts to reduce irrelevant “noise” that can derail reasoning.
  • The work presents GRPO-IR, an end-to-end reinforcement learning algorithm that rewards correct evidence identification while penalizing excessive retrieval, targeting both accuracy and efficiency.
  • Experiments on four multi-hop QA benchmarks show higher accuracy than strong baselines while using fewer retrieval steps and fewer tokens.
  • Overall, OThink-SRR1 is positioned as a strong foundation for information-seeking agents that need reliable, cost-aware retrieval and reasoning.
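The iterative Search–Refine–Reason loop described above can be sketched as follows. This is a toy illustration under stated assumptions: the `search`, `refine`, and `reason` functions below are hypothetical stand-ins (the paper implements these stages with an LLM and a real retriever), and the toy corpus and keyword checks exist only to make the control flow runnable.

```python
def search(query):
    """Stand-in retriever: look up raw documents in a toy corpus."""
    corpus = {
        "capital of France": ["Paris is the capital of France. It rains often."],
        "population of Paris": ["Paris has about 2.1 million residents. Cafes abound."],
    }
    return corpus.get(query, [])

def refine(docs):
    """Refine stage: distill retrieved documents into concise, relevant facts.
    Here we crudely keep only the first sentence of each document."""
    return [d.split(". ")[0] + "." for d in docs]

def reason(question, facts):
    """Reason stage: either answer from the accumulated facts or emit the
    next search query. Returns (answer, next_query); answer stays None
    until enough evidence has been gathered."""
    if "capital" in question and not any("capital" in f for f in facts):
        return None, "capital of France"
    if "population" in question and not any("residents" in f for f in facts):
        return None, "population of Paris"
    return " ".join(facts), None

def search_refine_reason(question, max_steps=4):
    """Iterate Search -> Refine -> Reason until an answer or a step budget."""
    facts = []
    for _ in range(max_steps):
        answer, next_query = reason(question, facts)
        if answer is not None:
            return answer, facts
        facts.extend(refine(search(next_query)))
    return None, facts
```

The point of the Refine step is visible even in this sketch: the reasoner only ever sees the distilled facts, not the noisy full documents, and the loop stops issuing searches as soon as the evidence suffices.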

Abstract

Retrieval-Augmented Generation (RAG) expands the knowledge of Large Language Models (LLMs), yet current static retrieval methods struggle with complex, multi-hop problems. While recent dynamic retrieval strategies offer improvements, they face two key challenges: 1) irrelevant retrieved noise can misdirect the reasoning process, and 2) processing full documents incurs prohibitive computational and latency costs. To address these issues, we propose OThink-SRR1, a framework that enhances large models with an iterative Search-Refine-Reason process trained via reinforcement learning. Its core Refine stage distills retrieved documents into concise, relevant facts before reasoning. We introduce GRPO-IR, an end-to-end reinforcement learning algorithm that rewards accurate evidence identification while penalizing excessive retrievals, thus training the model to be both focused and efficient. Experiments on four multi-hop QA benchmarks show our approach achieves superior accuracy over strong baselines while using fewer retrieval steps and tokens. This positions OThink-SRR1 as a potent foundational model for information-seeking agents.
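The reward design the abstract describes (reward accurate evidence identification, penalize excessive retrievals) can be sketched with illustrative numbers. The function names and all coefficients below are assumptions for exposition, not the paper's actual objective; the group-relative normalization follows the general GRPO recipe of standardizing rewards within a group of rollouts.

```python
import statistics

def srr_reward(answer_correct, evidence_hits, num_retrievals,
               evidence_bonus=0.2, retrieval_penalty=0.1, budget=3):
    """Toy GRPO-IR-style reward: +1 for a correct final answer, a bonus per
    correctly identified evidence fact, and a penalty for each retrieval
    beyond a fixed budget. Coefficients are illustrative only."""
    reward = 1.0 if answer_correct else 0.0
    reward += evidence_bonus * evidence_hits
    reward -= retrieval_penalty * max(0, num_retrievals - budget)
    return reward

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each rollout's reward against the
    mean and standard deviation of its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]
```

For example, a rollout that answers correctly with two evidence hits but five retrievals scores 1.0 + 0.4 - 0.2 = 1.2, so the penalty trades a little reward for efficiency without erasing the accuracy signal.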