Reason in Chains, Learn in Trees: Self-Rectification and Grafting for Multi-turn Agent Policy Optimization
arXiv cs.AI / 4/10/2026
Key Points
- The paper argues that multi-turn RL for LLM agents is limited by sparse rewards and poor credit assignment when training treats sampled trajectories as independent “chains.”
- It introduces T-STAR (Tree-structured Self-Taught Agent Rectification), which merges correlated steps across sampled trajectories into a unified “Cognitive Tree” to recover latent reward structure (a code sketch follows this list).
- An Introspective Valuation mechanism propagates trajectory-level rewards back through the tree, yielding a variance-reduced, step-level relative advantage for more effective optimization (second sketch below).
- Using the Cognitive Tree, it proposes In-Context Thought Grafting, which generates corrective reasoning by contrasting successful and failed branches at their divergence points (third sketch below).
- Experiments on embodied, interactive, reasoning, and planning benchmarks show consistent improvements over strong baselines, especially for tasks requiring long reasoning chains.
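
A minimal sketch of how the trajectory merge might look, assuming steps are compared by exact text match; the paper’s criterion for “correlated” steps is presumably richer (e.g., semantic similarity), and `TreeNode`, `build_cognitive_tree`, and every field name here are illustrative, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """One agent step; trajectories sharing a prefix share these nodes."""
    step: str
    children: dict = field(default_factory=dict)  # step text -> TreeNode
    visits: int = 0          # trajectories passing through this node
    reward_sum: float = 0.0  # sum of their trajectory-level rewards
    advantage: float = 0.0   # filled in by the valuation pass (next sketch)

def build_cognitive_tree(trajectories):
    """Merge (steps, reward) trajectories into a prefix tree.

    Identical steps collapse into a single node, exposing the shared
    structure that chain-based training treats as independent.
    """
    root = TreeNode(step="<root>")
    for steps, reward in trajectories:
        node = root
        node.visits += 1
        node.reward_sum += reward
        for step in steps:
            node = node.children.setdefault(step, TreeNode(step=step))
            node.visits += 1
            node.reward_sum += reward
    return root
```

For instance, the trajectories `["inspect", "open drawer", "take key"]` (reward 1.0) and `["inspect", "open drawer", "take coin"]` (reward 0.0) would share their first two nodes and diverge at the third.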

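One plausible reading of Introspective Valuation, reusing `TreeNode` from the sketch above: a node’s value is the mean reward of trajectories passing through it, and each child’s advantage is its value relative to its parent’s, so siblings act as a local baseline. The traversal and estimator below are assumptions, not the paper’s exact formulation.

```python
def introspective_valuation(root):
    """Back-propagate trajectory rewards into step-level advantages.

    value(node)     = mean reward of trajectories through node
    advantage(child) = value(child) - value(parent)

    Scoring each step against its siblings' pooled outcomes reduces
    the variance of the credit assigned to any single step.
    """
    stack = [root]
    while stack:
        node = stack.pop()
        parent_value = node.reward_sum / max(node.visits, 1)
        for child in node.children.values():
            child_value = child.reward_sum / max(child.visits, 1)
            child.advantage = child_value - parent_value
            stack.append(child)
    return root
```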

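In-Context Thought Grafting could then be driven off the same tree: find nodes where a failed and a successful branch split, and turn the contrast into a corrective example. Both function names, the reward `margin`, and the prompt wording below are hypothetical; the summary does not specify the paper’s template.

```python
def find_divergences(node, prefix=(), margin=0.5):
    """Yield (prefix, failed_step, better_step) triples at nodes whose
    sibling branches have clearly different mean rewards."""
    kids = list(node.children.values())
    for bad in kids:
        for good in kids:
            if bad is good:
                continue
            bad_value = bad.reward_sum / max(bad.visits, 1)
            good_value = good.reward_sum / max(good.visits, 1)
            if good_value - bad_value >= margin:
                yield prefix, bad.step, good.step
    for child in kids:
        yield from find_divergences(child, prefix + (child.step,), margin)

def graft_prompt(prefix, failed_step, better_step):
    """Render one divergence as a corrective in-context example."""
    context = "\n".join(prefix) if prefix else "(start of episode)"
    return (
        f"Trajectory so far:\n{context}\n\n"
        f"A failed attempt continued with:\n  {failed_step}\n"
        f"A successful attempt instead chose:\n  {better_step}\n"
        "Contrast the two and continue from the corrected step."
    )
```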

