SubSearch: Intermediate Rewards for Unsupervised Guided Reasoning in Complex Retrieval
arXiv cs.CL / 4/10/2026
Key Points
- SubSearch is a framework for training LLMs to do unsupervised guided multi-step reasoning in complex retrieval settings, where the correct reasoning path is not predetermined.
- Instead of relying only on outcome-based reinforcement signals, it provides intermediate reward signals to incentivize higher-quality planning and reasoning at each step.
- The method derives process rewards intrinsically, from the model's own rollouts, and uses them directly to optimize the generator, avoiding the need for external supervision or separately trained reward models with annotated trajectories.
- Experiments on seven benchmarks (including QA and multi-hop QA) show that intermediate-step intrinsic rewards produce more robust reasoning traces than training with only final outcome rewards.
- The authors suggest SubSearch can improve agentic integration of search engines for complex query answering and serves as a more data-efficient alternative to supervised process/reward modeling.
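The reward-shaping idea above can be sketched as a weighted combination of per-step intrinsic rewards with the final outcome reward. This is a minimal illustration, not the paper's exact formulation: the function name, the weighting scheme (a discounted sum with coefficient `alpha`), and the example scores are all assumptions.

```python
# Hedged sketch: combine intermediate (intrinsic) step rewards with a
# final outcome reward into one scalar return. The discounted-weighted-sum
# scheme here is illustrative, not SubSearch's actual objective.

def shaped_return(step_rewards, outcome_reward, alpha=0.5, gamma=0.99):
    """Weighted, discounted sum of per-step intrinsic rewards plus the
    discounted final outcome reward."""
    total = 0.0
    for t, r in enumerate(step_rewards):
        total += alpha * (gamma ** t) * r
    total += (gamma ** len(step_rewards)) * outcome_reward
    return total

# Hypothetical example: three reasoning/search steps scored in [0, 1],
# followed by a correct final answer (outcome reward 1.0).
steps = [0.2, 0.6, 0.4]
print(round(shaped_return(steps, outcome_reward=1.0), 4))  # → 1.5633
```

With only outcome rewards (`alpha=0`), every step in a failed trajectory receives the same signal; the intermediate terms let higher-quality planning steps be reinforced even when the final answer is wrong, which is the robustness benefit the key points describe.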