Efficient Agent Evaluation via Diversity-Guided User Simulation

arXiv cs.AI / 4/25/2026

📰 News · Models & Research

Key Points

  • The paper argues that evaluating LLM-based customer agents is difficult because multi-turn, stochastic interactions require exploring many possible conversations.
  • It points out that existing linear Monte Carlo rollouts are computationally inefficient, since they repeatedly regenerate the same early prefixes and may miss rare but important user behaviors.
  • The authors introduce DIVERT, a snapshot-based, coverage-guided user simulation framework that saves full agent-environment state at key decision points and resumes from those snapshots.
  • DIVERT branches from “junctions” using diversity-inducing user responses to systematically explore alternative interaction trajectories, improving both efficiency and coverage.
  • Experiments indicate DIVERT finds more failures per token than standard rollout methods and identifies them across a broader set of tasks.
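The snapshot-and-branch idea in the points above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the class and function names, the toy agent, and the breadth-first exploration loop are all assumptions made for clarity.

```python
import copy

class Conversation:
    """Minimal stand-in for the full agent-environment state: the turn history."""
    def __init__(self):
        self.turns = []

    def snapshot(self):
        # Capture the complete state so exploration can resume here later.
        return copy.deepcopy(self)

    def extend(self, user_msg, agent_reply):
        self.turns.append((user_msg, agent_reply))

def toy_agent(user_msg):
    # Placeholder for the LLM agent under evaluation.
    return f"ack:{user_msg}"

def explore(seed_prefix, junction_responses, max_depth=2):
    """Branch from saved snapshots with diverse user responses,
    instead of regenerating the shared prefix for every rollout."""
    frontier = [seed_prefix.snapshot()]
    trajectories = []
    for _ in range(max_depth):
        next_frontier = []
        for snap in frontier:
            for user_msg in junction_responses:  # diversity-inducing variants
                branch = snap.snapshot()          # resume from the snapshot
                branch.extend(user_msg, toy_agent(user_msg))
                next_frontier.append(branch)
        frontier = next_frontier
    trajectories.extend(frontier)
    return trajectories

prefix = Conversation()
prefix.extend("I want a refund", toy_agent("I want a refund"))
results = explore(prefix, ["cancel instead", "escalate", "change address"])
print(len(results))  # 3 branches at each of 2 junctions -> 9 trajectories
```

The key property is that `prefix` is generated once and every branch resumes from a cheap deep copy of it, so the shared early turns are never regenerated.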

Abstract

Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.
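The efficiency claim in the abstract can be made concrete with a back-of-envelope token-cost comparison. All numbers below (prefix length, segment length, branching factor) are illustrative assumptions, not figures from the paper:

```python
# Token-cost comparison: independent linear rollouts vs a snapshot tree
# that reuses shared conversation prefixes. All quantities are illustrative.

PREFIX_TOKENS = 500   # shared opening exchange
SEG_TOKENS = 200      # tokens per post-junction conversation segment
B, D = 3, 3           # 3 diverse user responses at each of 3 junctions

n_traj = B ** D  # 27 distinct end-to-end trajectories

# Linear Monte Carlo: each trajectory regenerates everything from scratch.
linear_cost = n_traj * (PREFIX_TOKENS + D * SEG_TOKENS)

# Snapshot tree: each segment is generated once and shared via snapshots.
tree_segments = sum(B ** level for level in range(1, D + 1))  # 3 + 9 + 27
snapshot_cost = PREFIX_TOKENS + tree_segments * SEG_TOKENS

print(linear_cost, snapshot_cost)  # 29700 vs 8300
```

Under these toy numbers the snapshot tree generates roughly a third of the tokens for the same 27 trajectories, because interior segments are produced once and shared by all descendants, which is the intuition behind "more failures per token."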