PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training

arXiv cs.AI / 4/7/2026


Key Points

  • The paper introduces PRAISE, a training framework for agentic search that addresses reward sparsity and inefficient use of long-horizon RL rollouts in multi-turn retrieval-and-reasoning tasks like multi-hop QA.
  • PRAISE reuses partial search trajectories by extracting prefix states at different turns, generating intermediate answers from those prefixes, and using them to create additional training trajectories.
  • It derives step-level rewards by comparing performance across prefixes, improving credit assignment beyond supervision only at the final answer.
  • The approach jointly optimizes search policy learning and prefix answer evaluation using a single shared model, avoiding extra human annotations or a separate reward model.
  • Experiments on multi-hop QA benchmarks report consistent improvements over strong baselines, indicating improved data efficiency and denser learning signals.
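One plausible reading of "step-level rewards by comparing performance across prefixes" is that each turn's reward is the marginal change in intermediate-answer quality that the turn produced. The sketch below illustrates that interpretation; the function name and scoring scheme are assumptions for illustration, not the paper's actual implementation.

```python
def step_rewards(prefix_scores):
    """Turn per-prefix answer scores into step-level rewards.

    prefix_scores[t] is an evaluation score (e.g. answer F1) of the
    intermediate answer elicited after the first t search turns, with
    prefix_scores[0] being the score before any search.  The reward
    assigned to turn t is the marginal improvement that turn brought.
    """
    return [prefix_scores[t] - prefix_scores[t - 1]
            for t in range(1, len(prefix_scores))]

# Example: the answer improves after turns 1 and 3, stalls after turn 2.
print(step_rewards([0.0, 0.5, 0.5, 1.0]))  # [0.5, 0.0, 0.5]
```

Under this scheme a turn that retrieves nothing useful gets zero reward, so credit concentrates on the search steps that actually moved the answer forward.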

Abstract

In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
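The data-efficiency side of the abstract, reusing one expensive rollout as several training trajectories, can be sketched as follows. All names here (`prefix_trajectories`, the tuple layout of a turn, `elicit_answer`) are illustrative assumptions; in PRAISE the intermediate answers come from the same shared model being trained.

```python
def prefix_trajectories(turns, elicit_answer):
    """Expand one complete rollout into one training example per prefix.

    turns         : list of (query, retrieved_docs) pairs from a rollout
    elicit_answer : callable mapping a prefix of turns to an intermediate
                    answer (a stand-in for the shared policy model)
    """
    examples = []
    for t in range(1, len(turns) + 1):
        prefix = turns[:t]  # the trajectory truncated after turn t
        examples.append({"trajectory": prefix,
                         "answer": elicit_answer(prefix)})
    return examples

# A 3-turn rollout yields 3 training examples instead of 1.
demo = prefix_trajectories(
    [("q1", ["d1"]), ("q2", ["d2"]), ("q3", ["d3"])],
    elicit_answer=lambda p: f"answer@{len(p)}",  # stand-in for the LLM
)
print(len(demo))  # 3
```

Because every prefix also carries an answer score, the same expansion that multiplies the training data is what supplies the per-step reward signal, which is why the two limitations in the abstract are addressed jointly.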