Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation

arXiv cs.AI / 3/27/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

Existing multi-agent systems (MAS) failure attribution benchmarks and methods often assume a single deterministic root cause, even though real failures can have multiple plausible attributions due to complex inter-agent dependencies and ambiguous execution paths.
The paper proposes a multi-perspective failure attribution paradigm that explicitly models attribution ambiguity rather than forcing a single “best” explanation.
It introduces MP-Bench, a new benchmark and evaluation protocol specifically designed for multi-perspective failure attribution in MAS.
Experiments indicate that prior claims that LLMs struggle at failure attribution are largely caused by shortcomings in earlier benchmark designs, and the new multi-perspective setup yields more realistic conclusions.
The authors argue that MAS debugging and reliability improvements require multi-perspective benchmarks and evaluation protocols to avoid misleading assessments.

Abstract

Failure attribution is essential for diagnosing and improving multi-agent systems (MAS), yet existing benchmarks and methods largely assume a single deterministic root cause for each failure. In practice, MAS failures often admit multiple plausible attributions due to complex inter-agent dependencies and ambiguous execution trajectories. We revisit MAS failure attribution from a multi-perspective standpoint and propose multi-perspective failure attribution, a practical paradigm that explicitly accounts for attribution ambiguity. To support this setting, we introduce MP-Bench, the first benchmark designed for multi-perspective failure attribution in MAS, along with a new evaluation protocol tailored to this paradigm. Through extensive experiments, we find that prior conclusions suggesting LLMs struggle with failure attribution are largely driven by limitations in existing benchmark designs. Our results highlight the necessity of multi-perspective benchmarks and evaluation protocols for realistic and reliable MAS debugging.

Black Hat Asia

AI Business

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested

Dev.to

We built a governance layer for AI-assisted development (with runtime validation and real system)

Dev.to

No AI system using the forward inference pass can ever be conscious.

Reddit r/artificial

Rethinking Failure Attribution in Multi-Agent Systems: A Multi-Perspective Benchmark and Evaluation

Key Points

Abstract

Related Articles

Black Hat Asia

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

I shipped Google's TurboQuant as a vLLM plugin 72 hours after the paper — here's what nobody else tested

We built a governance layer for AI-assisted development (with runtime validation and real system)

No AI system using the forward inference pass can ever be conscious.

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer