Asymmetric Actor-Critic for Multi-turn LLM Agents

arXiv cs.AI / 4/2/2026


Key Points

  • The paper addresses the challenge of achieving reliable behavior in multi-turn LLM agent conversations where retries are not possible, aiming for one-shot success in open-ended interactions.
  • It proposes an asymmetric actor-critic setup where a proprietary large LLM serves as a fixed “actor,” while a smaller open-source critic provides runtime supervision and can intervene within the same interaction trajectory.
  • The approach avoids training the actor by using a generation–verification asymmetry: large models generate high-quality responses, while smaller critics can effectively monitor and supervise actions.
  • It includes a data generation pipeline to create supervision signals for fine-tuning the critic, further improving reliability and task success.
  • Experiments on tau-bench and UserBench report significant gains over strong single-agent baselines, with fine-tuned lightweight critics rivaling or surpassing larger proprietary models in the critic role.

Abstract

Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor's actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on τ-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.
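To make the runtime-supervision idea concrete, here is a minimal sketch of an in-trajectory intervention loop. All names (`supervised_turn`, `toy_actor`, `toy_critic`, the feedback-injection format) are hypothetical illustrations; the paper's exact intervention protocol and critic interface may differ.

```python
# Hedged sketch of the asymmetric actor-critic loop: a fixed "actor"
# (standing in for a large proprietary LLM) generates each reply, and a
# smaller "critic" verifies it, intervening within the same conversation
# rather than restarting the interaction. Names are illustrative only.
from typing import Callable, List, Tuple

def supervised_turn(
    history: List[str],
    user_msg: str,
    actor: Callable[[List[str], str], str],
    critic: Callable[[List[str], str, str], Tuple[bool, str]],
    max_interventions: int = 2,
) -> str:
    """Produce one agent reply; the critic may trigger in-trajectory retries."""
    prompt = user_msg
    reply = actor(history, prompt)
    for _ in range(max_interventions):
        ok, critique = critic(history, user_msg, reply)
        if ok:
            break
        # Critic intervenes: its critique is injected into the prompt and
        # the actor regenerates within the same interaction trajectory.
        prompt = user_msg + f"\n[critic feedback: {critique}]"
        reply = actor(history, prompt)
    return reply

# Toy stubs to illustrate the control flow (not real models).
def toy_actor(history, prompt):
    return "refund issued" if "critic feedback" in prompt else "order cancelled"

def toy_critic(history, user_msg, reply):
    return ("refund" in reply, "user asked for a refund, not a cancellation")

print(supervised_turn([], "I want a refund", toy_actor, toy_critic))
# → refund issued (critic rejects the first attempt, actor corrects course)
```

In this sketch the actor is never fine-tuned; only the critic's verdicts steer the trajectory, which mirrors the paper's point that verification can be done by a much smaller model than generation.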