S^3-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

arXiv cs.LG / 5/5/2026


Key Points

  • The paper introduces S^3-R1, a framework that improves reinforcement-learning post-training for search-and-answer agents by combining synthetic data with denser learning signals.
  • It builds a programmatic synthetic generation and curation pipeline to create diverse multi-hop questions from existing documents, and uses retrieval-based verification to focus on intermediate-difficulty questions.
  • The training uses a reward design that scores both intermediate search quality and the correctness of the final answer, reducing credit-assignment issues caused by sparse outcome rewards.
  • Experiments indicate S^3-R1 outperforms prior baselines by learning more effective search and synthesis strategies, achieving up to a 10% gain in generalization on out-of-domain datasets.
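The retrieval-based verification step in the second bullet can be pictured as a filter that keeps only questions a baseline retriever partially solves: if all the gold evidence is trivially retrievable the question is too easy, and if none of it is, the question is too hard. The sketch below is illustrative only (the paper's actual pipeline, retriever, and thresholds are not specified here); the toy lexical retriever and all function names are assumptions.

```python
# Hypothetical sketch of retrieval-based difficulty filtering.
# A candidate question is kept when the retriever surfaces SOME but not ALL
# of its gold evidence documents, i.e. it sits at intermediate difficulty.

def token_overlap(a: str, b: str) -> int:
    # Crude lexical similarity: count of shared lowercase tokens.
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve_top_k(query: str, corpus: list[str], k: int) -> list[str]:
    # Toy retriever standing in for a real BM25 / dense retriever.
    return sorted(corpus, key=lambda d: token_overlap(query, d), reverse=True)[:k]

def is_intermediate(question: str, gold_docs: list[str],
                    corpus: list[str], k: int = 1) -> bool:
    hits = retrieve_top_k(question, corpus, k)
    recall = sum(d in hits for d in gold_docs) / len(gold_docs)
    # Partial evidence recall -> neither trivial nor hopeless.
    return 0.0 < recall < 1.0

corpus = [
    "Paris is the capital of France",
    "The Eiffel Tower is in Paris",
    "Mount Fuji is in Japan",
]
question = "Which tower stands in the capital of France"
gold = [corpus[0], corpus[1]]
print(is_intermediate(question, gold, corpus, k=1))  # True: only 1 of 2 gold docs found
```

With `k=1` only one of the two gold documents is retrieved, so the multi-hop question is kept; raising `k` until both are retrieved would mark it as too easy and drop it.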

Abstract

Reinforcement learning (RL) post-training has enabled new capabilities in models, such as agentic tool use for search. However, these models often struggle due to sparse outcome-based rewards and a lack of training data spanning questions of varying difficulty, which leads them to avoid the deeper tool-based searches needed to collect evidence for question answering. To address these limitations, we introduce S^3-R1 (Synthetic data and stabilized Search R1), a framework that couples a data-centric approach with denser learning signals. We first develop a synthetic generation and curation pipeline that programmatically derives diverse, multi-hop questions from existing documents. This pipeline incorporates a retrieval-based verification step to specifically isolate questions of intermediate difficulty. We then pair this expanded training set with a reward structure that evaluates both intermediate search quality and the correctness of the final answer. This setup directly mitigates the credit-assignment problems inherent to sparse rewards. Our evaluations show that S^3-R1 outperforms existing baselines by learning more effective search and synthesis strategies, yielding up to a 10% improvement in robust generalization on out-of-domain datasets.
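The reward structure the abstract describes, scoring intermediate search quality alongside final-answer correctness, can be sketched as a weighted sum. This is a minimal illustration, not the paper's exact formulation: the weighting scheme, the evidence-recall search term, and the exact-match outcome term are all assumptions chosen for clarity.

```python
# Illustrative dense reward: a search-quality term plus an outcome term.
# Compared with an outcome-only reward, the search term gives the policy
# credit for surfacing the right evidence even when the final answer fails,
# easing credit assignment over long search trajectories.

def search_reward(retrieved: list[str], gold_evidence: list[str]) -> float:
    # Fraction of required evidence the agent's searches actually surfaced.
    if not gold_evidence:
        return 0.0
    return sum(doc in retrieved for doc in gold_evidence) / len(gold_evidence)

def answer_reward(pred: str, gold: str) -> float:
    # Exact-match outcome reward; real systems often use F1 or a judge model.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

def total_reward(retrieved: list[str], gold_evidence: list[str],
                 pred: str, gold: str, alpha: float = 0.3) -> float:
    # alpha trades off the dense search term against the sparse outcome term.
    return (alpha * search_reward(retrieved, gold_evidence)
            + (1 - alpha) * answer_reward(pred, gold))

# Agent found half the evidence but still answered correctly:
r = total_reward(["doc_a"], ["doc_a", "doc_b"], "Eiffel Tower", "eiffel tower")
print(round(r, 3))  # 0.3 * 0.5 + 0.7 * 1.0 = 0.85
```

Note that a trajectory with perfect evidence but a wrong answer still earns 0.3 here rather than 0, which is exactly the denser gradient signal the paper attributes its gains to.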