AI Navigate

Improving Search Agent with One Line of Code

arXiv cs.LG / 3/12/2026


Key Points

  • SAPO stands for Search Agent Policy Optimization and introduces a conditional token-level KL constraint to stabilize training of TARL-based search agents.
  • It addresses Importance Sampling Distribution Drift (ISDD) in GRPO, which previously caused sharp declines in importance sampling ratios and halted gradient updates.
  • SAPO requires only a one-line code modification to standard GRPO, enabling immediate deployment.
  • Experimental results across seven QA benchmarks show SAPO achieves a +10.6 percentage-point absolute improvement over Search-R1, with consistent gains across model scales (1.5B and 14B) and families (Qwen, LLaMA).
  • The approach preserves gradient flow while preventing distribution drift by penalizing divergence only for positive tokens with low probability.

Abstract

Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to autonomously interact with external tools in a multi-turn information-seeking process. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities, where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves a +10.6% absolute improvement (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
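To make the mechanism concrete, here is a speculative NumPy sketch of how a conditional token-level KL penalty could be bolted onto a GRPO-style per-token objective. The function name, the probability threshold `p_low`, the penalty weight `beta`, and the use of a simple per-token KL estimate are all illustrative assumptions, not the paper's exact formulation; the point is that the penalty term reduces to a single extra line relative to the standard clipped surrogate.

```python
import numpy as np

def grpo_token_loss(logp_new, logp_old, adv, eps=0.2, beta=0.1, p_low=0.5):
    """Per-token GRPO objective (to be maximized) with a SAPO-style
    conditional KL penalty. All hyperparameter values are illustrative."""
    ratio = np.exp(logp_new - logp_old)
    # standard GRPO/PPO clipped surrogate
    surrogate = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    # simple per-token estimate of the KL between the old and new policies
    kl = logp_old - logp_new
    # penalize only positive-advantage tokens whose probability under the
    # new policy has dropped too low -- i.e., where the policy drifted
    mask = (adv > 0) & (np.exp(logp_new) < p_low)
    return surrogate - beta * kl * mask  # the "one line" added to GRPO
```

Under these assumptions, a positive token whose new-policy probability has collapsed (say from 0.6 to 0.2) incurs the KL penalty and is pulled back toward the old policy, while a positive token that stays high-probability is optimized exactly as in vanilla GRPO; unlike hard clipping, the penalty keeps a nonzero gradient flowing through the drifted tokens.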