Improving Search Agent with One Line of Code
arXiv cs.LG / 3/12/2026
Key Points
- SAPO (Search Agent Policy Optimization) introduces a conditional token-level KL constraint that stabilizes training of tool-augmented reinforcement learning (TARL) search agents.
- It addresses Importance Sampling Distribution Drift (ISDD) in GRPO, which previously caused importance-sampling ratios to collapse and gradient updates to stall.
- SAPO requires only a one-line code modification to standard GRPO, enabling immediate deployment.
- Across seven QA benchmarks, SAPO improves on Search-R1 by an absolute 10.6 percentage points, with gains across model scales (1.5B and 14B) and families (Qwen, LLaMA).
- The approach prevents distribution drift while preserving gradient flow by penalizing divergence only on low-probability tokens in positively-rewarded responses.
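The key points above can be sketched in code. Below is a minimal, hypothetical illustration of a GRPO-style per-token loss with a SAPO-like conditional KL term added on top: the penalty fires only where the sequence advantage is positive and the token's probability under the current policy is low. The function name, the `prob_floor` threshold, the `kl_coef` weight, and the k3-style KL estimator are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def grpo_token_loss_with_sapo_kl(
    logp_new,        # (T,) log-probs of sampled tokens under the current policy
    logp_old,        # (T,) log-probs under the rollout (old) policy
    advantage,       # scalar group-relative advantage for this sequence
    clip_eps=0.2,    # standard PPO/GRPO clipping range
    kl_coef=0.1,     # hypothetical weight for the conditional KL penalty
    prob_floor=0.5,  # hypothetical cutoff defining "low-probability" tokens
):
    """Clipped GRPO surrogate plus a SAPO-style conditional KL sketch."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Standard clipped policy-gradient surrogate (negated for minimization).
    surrogate = -np.minimum(ratio * advantage, clipped * advantage)

    # Conditional mask: positive advantage AND low current-token probability.
    mask = (advantage > 0) & (np.exp(logp_new) < prob_floor)
    # Per-token k3 KL estimator: exp(d) - d - 1, with d = logp_old - logp_new.
    d = logp_old - logp_new
    kl = np.exp(d) - d - 1.0
    # The "one line" change: add the KL penalty only where the mask is active.
    penalty = kl_coef * np.where(mask, kl, 0.0)

    return float((surrogate + penalty).mean())
```

Because the penalty is masked rather than applied everywhere, tokens that keep reasonable probability (or sequences with negative advantage) receive the unmodified GRPO gradient, which matches the digest's claim that gradient flow is preserved while drift is constrained.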