TIPS: Turn-Level Information-Potential Reward Shaping for Search-Augmented LLMs

arXiv cs.AI / 3/25/2026


Key Points

  • The paper introduces Turn-Level Information Potential Reward Shaping (TIPS), a framework for training search-augmented LLMs with denser, turn-level rewards rather than relying on sparse outcome-only signals.
  • TIPS assigns rewards to each reasoning and tool-call segment based on how much it increases the likelihood of the correct answer under a teacher model, aiming to improve credit assignment across multi-step generations.
  • By using potential-based reward shaping, the approach provides fine-grained guidance that is intended to be more stable and policy-invariant than standard RL objectives.
  • Experiments on seven QA benchmarks show TIPS improves training stability and outperforms GRPO/PPO baselines, including an 11.8% Exact Match and 13.6% F1 gain over PPO with a Qwen-2.5 7B Instruct model.
  • The authors argue TIPS is a general solution to sparse-reward credit assignment for multi-turn LLM reasoning with tool use and search augmentation.

Abstract

Search-augmented large language models (LLMs) trained with reinforcement learning (RL) have achieved strong results on open-domain question answering (QA), but training remains a significant challenge. Optimization is often unstable due to sparse rewards and difficult credit assignment across reasoning and tool calls. To address this, we introduce Turn-Level Information Potential Reward Shaping (TIPS), a simple framework that assigns dense, turn-level rewards to each reasoning + tool-call segment based on the increased likelihood of the correct answer under a teacher model. By leveraging potential-based reward shaping, TIPS offers fine-grained and policy-invariant guidance that overcomes the limitations of outcome-only optimization. Evaluated on seven QA benchmarks, TIPS consistently outperforms GRPO/PPO baselines and substantially improves training stability. For instance, with a Qwen-2.5 7B Instruct model, TIPS improves the average Exact Match score by 11.8% and F1 by 13.6% relative to PPO. Our results demonstrate that turn-level information-potential reward shaping provides an effective and general solution to sparse-reward credit assignment for multi-turn LLM reasoning.