Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation

arXiv cs.AI / 3/27/2026


Key Points

  • Existing multi-agent systems (MAS) failure attribution benchmarks and methods often assume a single deterministic root cause, even though real failures can have multiple plausible attributions due to complex inter-agent dependencies and ambiguous execution paths.
  • The paper proposes a multi-perspective failure attribution paradigm that explicitly models attribution ambiguity rather than forcing a single “best” explanation.
  • It introduces MP-Bench, a new benchmark and evaluation protocol specifically designed for multi-perspective failure attribution in MAS.
  • Experiments indicate that earlier findings that LLMs struggle with failure attribution stem largely from shortcomings in prior benchmark designs, and that the multi-perspective setup yields more realistic conclusions.
  • The authors argue that MAS debugging and reliability improvements require multi-perspective benchmarks and evaluation protocols to avoid misleading assessments.

Abstract

Failure attribution is essential for diagnosing and improving multi-agent systems (MAS), yet existing benchmarks and methods largely assume a single deterministic root cause for each failure. In practice, MAS failures often admit multiple plausible attributions due to complex inter-agent dependencies and ambiguous execution trajectories. We revisit MAS failure attribution from a multi-perspective standpoint and propose multi-perspective failure attribution, a practical paradigm that explicitly accounts for attribution ambiguity. To support this setting, we introduce MP-Bench, the first benchmark designed for multi-perspective failure attribution in MAS, along with a new evaluation protocol tailored to this paradigm. Through extensive experiments, we find that prior conclusions suggesting LLMs struggle with failure attribution are largely driven by limitations in existing benchmark designs. Our results highlight the necessity of multi-perspective benchmarks and evaluation protocols for realistic and reliable MAS debugging.
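
To make the contrast concrete, here is a minimal sketch of how scoring changes when one annotated root cause is replaced by a set of plausible attributions. It is an illustration only, not MP-Bench's evaluation protocol, which this summary does not specify. The (agent, step) representation, the function names, and the precision/recall metric are all assumptions introduced for this example.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: an attribution names the agent blamed and the
# step index where the failure is located. Neither field comes from the paper.
Attribution = Tuple[str, int]


def single_cause_score(predicted: Attribution, gold: Attribution) -> float:
    """Exact-match scoring against one annotated root cause."""
    return 1.0 if predicted == gold else 0.0


def multi_perspective_score(
    predicted: Set[Attribution], plausible: Set[Attribution]
) -> Dict[str, float]:
    """Set-based precision and recall against all plausible attributions.

    This metric is an assumed stand-in, chosen only to show the idea.
    """
    hits = predicted & plausible
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(plausible) if plausible else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical failure with two defensible explanations.
plausible = {("planner", 2), ("coder", 5)}
gold = ("planner", 2)        # the one label a single-cause benchmark would keep
prediction = ("coder", 5)    # a model's answer that is also defensible

print(single_cause_score(prediction, gold))
print(multi_perspective_score({prediction}, plausible))
```

Under the single-cause rule the defensible prediction scores zero, while the set-based rule credits it as one of two plausible attributions. That gap is the kind of benchmark-design effect the abstract describes.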
