SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval

arXiv cs.CL / 4/10/2026

Key Points

SubSearch is a framework for training LLMs to perform guided multi-step reasoning without external supervision in complex retrieval settings, where the correct reasoning path is not predetermined.
Instead of relying only on outcome-based reinforcement signals, it provides intermediate reward signals to incentivize higher-quality planning and reasoning at each step.
The method directly optimizes the generator using intrinsic (internally derived) process rewards, avoiding the need for external supervision or for a separate reward model trained on annotated trajectories.
Experiments on seven benchmarks (including QA and multi-hop QA) show that intermediate-step intrinsic rewards produce more robust reasoning traces than training with only final outcome rewards.
The authors suggest SubSearch can improve agentic integration of search engines for complex query answering and serves as a more data-efficient alternative to supervised process/reward modeling.

Abstract

Large language models (LLMs) are probabilistic in nature and perform more reliably when augmented with external information. As complex queries often require multi-step reasoning over the retrieved information, with no clear or predetermined reasoning path, they remain challenging. Recent approaches train models using reinforcement learning on the model's outcome, showing promise in improving how models handle complex information. We introduce SubSearch, a specialized framework that shifts from outcome-only supervision to intermediate reward signals that incentivize planning high-quality reasoning. Unlike previous work on process reward modeling, which focuses on training a separate reward model with annotated trajectories by either human annotators or large LLM judges, SubSearch directly optimizes the generator using intrinsic process rewards, which we define as internally-derived rewards, eliminating the need for external supervision, and moving towards autonomous information-intensive reasoning. Experiments on seven benchmarks show that rewarding intermediate reasoning steps with intrinsic rewards leads to more robust reasoning traces in both QA and multi-hop QA datasets over using only outcome rewards. SubSearch can help in building reasoning traces that allow agents to better integrate search engines for complex query answering, while offering a data-efficient alternative to supervised process modeling.
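
To illustrate the difference between outcome-only and intermediate reward signals described above, here is a minimal sketch in Python. It is not the authors' implementation: the `Step` structure, the `intrinsic_score` field, the additive combination, and the `step_weight` parameter are all hypothetical, and the abstract does not say how SubSearch actually computes or combines its intrinsic rewards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step in a search trajectory (hypothetical structure)."""
    sub_query: str
    # Assumed to lie in [0, 1]; how such a score is derived is not specified
    # in the source, so it is treated here as a given number.
    intrinsic_score: float


def outcome_only_reward(final_correct: bool) -> float:
    """Outcome-only signal: one scalar for the whole trajectory."""
    return 1.0 if final_correct else 0.0


def combined_reward(steps: List[Step], final_correct: bool,
                    step_weight: float = 0.5) -> float:
    """Outcome reward plus the weighted mean of per-step intrinsic scores.

    The additive form and step_weight are illustrative assumptions.
    """
    outcome = outcome_only_reward(final_correct)
    if not steps:
        return outcome
    mean_step = sum(s.intrinsic_score for s in steps) / len(steps)
    return outcome + step_weight * mean_step


# Two trajectories that both end in a wrong answer but differ in step quality.
careful = [Step("who directed the film?", 0.8), Step("where was the director born?", 0.6)]
sloppy = [Step("film", 0.1), Step("born", 0.1)]
```

Under outcome-only training both trajectories above receive the same reward of zero, so the learner cannot tell them apart. With a per-step term, `combined_reward(careful, False)` is larger than `combined_reward(sloppy, False)`, which is the kind of denser signal that intermediate rewards are meant to provide.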
