Co-Evolution of Policy and Internal Reward for Language Agents

arXiv cs.LG / 4/6/2026


Key Points

  • This paper targets the key bottleneck in long-horizon learning for LLM agents, sparse and delayed rewards, and proposes solving it with self-generated internal rewards rather than relying on an external reward model.
  • The proposed method, Self-Guide, steers the next action at inference time with a short self-generated guidance signal, and at training time converts that same signal into step-level internal rewards for denser policy optimization.
  • This forms a "co-evolving loop" in which the policy and the internal reward improve each other: a better policy produces better guidance, and that guidance in turn pushes the policy further.
  • Across three agent benchmarks, inference-time self-guidance alone already yields gains, and jointly evolving the policy and internal reward with GRPO reportedly adds about 8% over a baseline trained with environment reward only.

Abstract

Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often separate reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer the next action during inference, and converts the same signal into step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: better policy produces better guidance, and better guidance further improves policy as internal reward. Across three agent benchmarks, inference-time self-guidance already yields clear gains, while jointly evolving policy and internal reward with GRPO brings further improvements (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward during acting and learning.
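To make the dual role of the guidance signal concrete, here is a minimal toy sketch of the loop the abstract describes: generate a short guidance signal, let it steer the action at inference time, then reuse the stored signal as a dense step-level reward on top of the sparse end-of-episode environment reward. All function names and the scoring rule are hypothetical placeholders, not the paper's actual implementation; a real agent would produce guidance and actions with an LLM and optimize the policy with GRPO.

```python
def generate_guidance(state):
    """Hypothetical stand-in for the LLM emitting a short self-guidance string."""
    return f"focus-on-{state % 3}"

def act(state, guidance):
    """Toy policy: the action is conditioned on both state and guidance."""
    return (state + len(guidance)) % 5

def guidance_to_reward(guidance, action):
    """Convert the stored guidance into a step-level internal reward.

    Toy scoring rule (an assumption, not from the paper): reward actions
    consistent with the guidance tag.
    """
    target = int(guidance.rsplit("-", 1)[-1])
    return 1.0 if action % 3 == target else 0.0

def rollout(horizon=4):
    state, trajectory = 0, []
    for _ in range(horizon):
        guidance = generate_guidance(state)   # inference-time: self-guidance
        action = act(state, guidance)         # guidance steers the next action
        trajectory.append((guidance, action))
        state = action
    # Sparse environment reward arrives only at the end of the episode.
    env_reward = 1.0 if state == 0 else 0.0
    # Training-time: densify by adding the internal reward at every step.
    return [guidance_to_reward(g, a) + env_reward / horizon
            for (g, a) in trajectory]

print(rollout())
```

The point of the sketch is that one signal serves two purposes: it is consumed at acting time and logged for reward shaping at training time, which is what allows the policy and the internal reward to improve together.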