Mitigating Lost in Multi-turn Conversation via Curriculum RL with Verifiable Accuracy and Abstention Rewards

arXiv cs.CL / 5/1/2026

💬 Opinion · Models & Research

Key Points

  • The paper addresses “Lost-in-Conversation” (LiC), where LLM performance degrades in multi-turn settings because information is revealed progressively across turns rather than all at once.
  • It proposes RLAAR (Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards), a curriculum RL framework that trains models to produce correct answers and to assess whether a question is solvable.
  • RLAAR uses a competence-gated curriculum that gradually increases dialogue difficulty, helping stabilize training while improving reliability.
  • Using multi-turn on-policy rollouts and a mixed-reward setup, the method teaches models to balance answering with informed abstention to reduce premature responses that drive LiC.
  • On LiC benchmarks, RLAAR improves LiC performance from 62.6% to 75.1% and increases calibrated abstention rates from 33.5% to 73.4%, showing more trustworthy multi-turn behavior.
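The mixed-reward setup above can be pictured as follows. This is an illustrative sketch only, not the paper's implementation: the `solvable` flag, the bonus/penalty weights, and exact-match scoring are all assumptions made for the example.

```python
# Illustrative mixed reward (hypothetical weights, not the paper's exact rule):
# reward accuracy when the question is solvable from the shards revealed so
# far, and reward calibrated abstention when it is not.
def mixed_reward(answer: str, gold: str, solvable: bool, abstained: bool,
                 abstain_bonus: float = 0.5,
                 premature_penalty: float = -0.5) -> float:
    """Return a scalar reward for one rollout turn."""
    if solvable:
        if abstained:
            return 0.0          # abstained on a solvable question: no credit
        # verifiable accuracy reward: exact match against the reference answer
        return 1.0 if answer.strip() == gold.strip() else 0.0
    # the question is not yet answerable from the revealed shards
    if abstained:
        return abstain_bonus    # informed abstention is rewarded
    return premature_penalty    # premature answering is what drives LiC
```

The key design point is that the two failure modes are penalized asymmetrically: answering too early on an underspecified question is actively punished, while abstaining on a solvable one merely earns zero.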

Abstract

Large Language Models demonstrate strong capabilities in single-turn instruction following but suffer from Lost-in-Conversation (LiC), a degradation in performance as information is revealed progressively in multi-turn settings. Motivated by recent progress in Reinforcement Learning with Verifiable Rewards (RLVR), we propose Curriculum Reinforcement Learning with Verifiable Accuracy and Abstention Rewards (RLAAR), a framework that encourages models not only to generate correct answers, but also to judge the solvability of questions in the multi-turn conversation setting. Our approach employs a competence-gated curriculum that incrementally increases dialogue difficulty (in terms of instruction shards), stabilizing training while promoting reliability. Using multi-turn, on-policy rollouts and a mixed-reward system, RLAAR teaches models to balance problem-solving with informed abstention, reducing the premature answering behaviors that cause LiC. Evaluated on LiC benchmarks, RLAAR significantly mitigates LiC performance decay (62.6% to 75.1%) and improves calibrated abstention rates (33.5% to 73.4%). Together, these results provide a practical recipe for building reliable and trustworthy multi-turn LLMs.
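The competence-gated curriculum described in the abstract can be sketched as a simple controller: difficulty (here, the number of instruction shards per dialogue) only increases once the model's rolling success rate clears a threshold. The thresholds, window size, and shard counts below are hypothetical, chosen for illustration.

```python
from collections import deque

# Sketch of a competence gate (assumed mechanics, not the paper's exact rule):
# unlock harder dialogues only after sustained success at the current level.
class CompetenceGate:
    def __init__(self, start_shards: int = 2, max_shards: int = 8,
                 threshold: float = 0.7, window: int = 100):
        self.shards = start_shards          # current dialogue difficulty
        self.max_shards = max_shards
        self.threshold = threshold          # required rolling success rate
        self.recent = deque(maxlen=window)  # recent rollout outcomes (0/1)

    def record(self, success: bool) -> int:
        """Log one rollout outcome; return the (possibly advanced) level."""
        self.recent.append(1 if success else 0)
        window_full = len(self.recent) == self.recent.maxlen
        if (window_full
                and sum(self.recent) / len(self.recent) >= self.threshold
                and self.shards < self.max_shards):
            self.shards += 1        # competence reached: raise difficulty
            self.recent.clear()     # re-measure success at the new level
        return self.shards
```

Clearing the window after each promotion means competence must be re-demonstrated at every level, which is one plausible way such gating stabilizes training.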