Implicit Turn-Wise Policy Optimization for Proactive User-LLM Interaction

arXiv cs.LG · March 26, 2026


Key Points

  • The paper proposes Implicit Turn-wise Policy Optimization (ITPO) to improve reinforcement learning for multi-turn human-AI collaboration when intermediate rewards are sparse and user responses are highly stochastic.
  • ITPO uses an implicit process reward model to convert sparse outcome signals into turn-level (process) rewards, which are more robust than volatile token-level rewards and can be normalized for additional training stability.
  • Experiments on math tutoring, document writing, and medical recommendation show that ITPO combined with algorithms like PPO, GRPO, or RLOO improves convergence versus existing baselines.
  • Trajectory-level analysis indicates ITPO learns turn-wise preferences that align semantically with human judgment.
  • The authors report that the code is publicly available, supporting reproducibility and adoption for researchers working on proactive user-LLM interaction.
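The key points above describe deriving turn-level rewards from an implicit process reward model and then normalizing them. A minimal sketch of that idea, assuming the common implicit-PRM formulation in which each turn's reward is its increment to the log-likelihood ratio between the trained policy and a frozen reference model (the function names, `beta`, and per-turn log-probabilities are illustrative assumptions, not taken from the paper):

```python
def turn_rewards(policy_logps, ref_logps, beta=0.05):
    """Turn-level process rewards from an implicit PRM (sketch).

    policy_logps / ref_logps: summed token log-probabilities of each
    turn under the trained policy and a frozen reference model.
    Each turn's reward is its increment to the cumulative
    log-likelihood ratio, scaled by beta (illustrative value).
    """
    return [beta * (lp - ref) for lp, ref in zip(policy_logps, ref_logps)]


def normalize(rewards, eps=1e-8):
    """Standardize turn rewards to zero mean / unit variance."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + eps) for r in rewards]
```

Because each reward here is a per-turn quantity rather than a per-token one, it averages over many token-level fluctuations, which is the robustness property the summary highlights.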

Abstract

Multi-turn human-AI collaboration is fundamental to deploying interactive services such as adaptive tutoring, conversational recommendation, and professional consultation. However, optimizing these interactions via reinforcement learning is hindered by the sparsity of verifiable intermediate rewards and the high stochasticity of user responses. To address these challenges, we introduce Implicit Turn-wise Policy Optimization (ITPO). ITPO leverages an implicit process reward model to derive fine-grained, turn-wise process rewards from sparse outcome signals. Unlike volatile token-level rewards, these turn-level signals exhibit superior robustness and can be combined with a normalization mechanism that further enhances training stability. We evaluate ITPO across three representative multi-turn collaborative tasks: math tutoring, document writing, and medical recommendation. Empirical results demonstrate that ITPO, when combined with PPO, GRPO, or RLOO, consistently achieves better convergence than existing baselines. Detailed trajectory analysis confirms that ITPO infers turn-wise preferences that are semantically aligned with human judgment. Code is publicly available at https://github.com/Graph-COM/ITPO.
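The abstract notes that ITPO's turn-level rewards plug into standard policy-gradient algorithms such as PPO, GRPO, and RLOO. As one hedged illustration of that composition, an RLOO-style leave-one-out baseline over a group of sampled trajectories, each scored by the sum of its turn-level rewards, can be sketched as follows (the function name and scoring choice are illustrative, not from the paper):

```python
def rloo_advantages(returns):
    """Leave-one-out advantages for a group of sampled trajectories.

    returns: one scalar per trajectory, e.g. the sum of its
    turn-level process rewards. Each trajectory is baselined
    against the mean return of all the other trajectories.
    """
    n = len(returns)
    total = sum(returns)
    return [r - (total - r) / (n - 1) for r in returns]
```

For example, `rloo_advantages([1.0, 2.0, 3.0])` yields `[-1.5, 0.0, 1.5]`: each trajectory's advantage is its return minus the average of its peers, which removes the need for a learned value baseline.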