Improving Search Agent with One Line of Code
arXiv cs.LG / 3/12/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- SAPO (Search Agent Policy Optimization) introduces a conditional token-level KL constraint that stabilizes the training of TARL-based search agents.
- It addresses Importance Sampling Distribution Drift (ISDD) in GRPO, a failure mode in which importance-sampling ratios collapse sharply and gradient updates stall.
- SAPO requires only a one-line code modification to standard GRPO, so it can be deployed immediately in existing training pipelines.
- Across seven QA benchmarks, SAPO delivers an absolute improvement of +10.6 percentage points over Search-R1, with consistent gains across model scales (1.5B and 14B) and families (Qwen, LLaMA).
- The method preserves gradient flow while preventing distribution drift by penalizing divergence only for positive tokens with low probability under the current policy.
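The conditional KL idea in the bullets above can be sketched as a single masked penalty term added to a GRPO-style token loss. The sketch below is an illustrative assumption, not the paper's actual code: the function name, the use of the simple `log p_new - log p_ref` KL estimate, and the `prob_threshold` parameter are all hypothetical, and "positive tokens" is read here as tokens with positive advantage.

```python
import torch


def grpo_loss_with_conditional_kl(
    logp_new, logp_old, logp_ref, advantages,
    clip_eps=0.2, kl_coef=0.1, prob_threshold=0.5,
):
    """GRPO-style token loss with a conditional token-level KL penalty.

    All arguments are per-token tensors of shape (T,):
    log-probs under the current, rollout, and reference policies,
    plus per-token advantages.
    """
    # Standard clipped policy-gradient term (PPO/GRPO style).
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    pg_loss = -torch.min(ratio * advantages, clipped * advantages)

    # Simple per-token KL estimate against the reference policy.
    kl = logp_new - logp_ref

    # Hypothetical "one-line" condition: penalize divergence only on
    # positive-advantage tokens whose current-policy probability is low.
    mask = (advantages > 0) & (torch.exp(logp_new) < prob_threshold)

    return (pg_loss + kl_coef * kl * mask.float()).mean()
```

Because the KL term is masked rather than applied uniformly, tokens outside the condition keep an unmodified policy-gradient signal, which is one plausible way to read the claim that the constraint prevents drift without halting gradient flow.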