Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

arXiv cs.AI / 4/27/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that Shapley-based XAI is undermined by fragmented variants and by evaluation methods that rely on quantitative proxies whose connection to human usefulness is not validated.
  • Using an amortized framework, the authors compare eight Shapley variants to measure semantic differences under low-latency constraints typical of operational risk workflows (a sketch of the amortized-explainer idea follows this list).
  • They run large-scale experiments across four risk datasets and a realistic fraud-detection setting with professional analysts reviewing 3,735 cases.
  • The results show that quantitative metrics such as sparsity and faithfulness do not reliably reflect human-perceived clarity or decision utility.
  • Although no Shapley formulation improved objective analyst performance, the explanations increased decision confidence, raising a serious automation-bias risk in high-stakes environments.
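
The summary does not spell out what "amortized" means in practice. As a rough, illustrative sketch of the general idea (in the spirit of FastSHAP-style amortization, not necessarily the authors' exact framework; all names and shapes below are assumptions), an explainer network is trained once and then produces per-feature attributions in a single forward pass, which is what makes explanation feasible under operational latency budgets:

```python
# Illustrative sketch only: an amortized explainer maps an input x directly to a
# per-feature attribution vector in one forward pass, instead of running a costly
# per-instance sampling procedure at serving time. Names and shapes are assumptions.
import torch
import torch.nn as nn

class AmortizedExplainer(nn.Module):
    """Predicts Shapley-style attributions phi(x) in a single forward pass."""
    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_features),  # one attribution per input feature
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# At serving time, a single forward pass yields attributions within a tight latency budget.
explainer = AmortizedExplainer(n_features=30)
case = torch.randn(1, 30)          # one case awaiting analyst review
attributions = explainer(case)     # shape: (1, 30)

# Training (omitted here) would fit the network against a Shapley-value target,
# e.g. a weighted least-squares objective over sampled feature coalitions.
```

Because the per-case cost collapses to one forward pass, many Shapley formulations can be served and compared under the same latency budget, which is the role the amortized framework plays in the study.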

Abstract

Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
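
Sparsity and faithfulness are the quantitative proxies the paper argues are decoupled from human utility. As a rough illustration of how such proxies are commonly computed (these are generic definitions, not necessarily the paper's; the score function, baseline, and toy model below are assumptions), sparsity can be measured as the fraction of near-zero attributions, and a deletion-style faithfulness score as the drop in model output when the most-attributed features are replaced by a baseline:

```python
# Generic proxy metrics for a feature-attribution vector; illustrative only.
import numpy as np

def sparsity(attributions: np.ndarray, eps: float = 1e-3) -> float:
    """Fraction of features with (near-)zero attribution: higher means sparser."""
    return float(np.mean(np.abs(attributions) < eps))

def deletion_faithfulness(score_fn, x: np.ndarray, attributions: np.ndarray,
                          baseline: np.ndarray, k: int = 5) -> float:
    """Drop in the model's score after replacing the k most-attributed features
    with baseline values; a larger drop is read as a more faithful explanation."""
    top_k = np.argsort(-np.abs(attributions))[:k]
    x_del = x.copy()
    x_del[top_k] = baseline[top_k]
    return float(score_fn(x) - score_fn(x_del))

# Toy demonstration with a linear "risk model" (purely for illustration).
rng = np.random.default_rng(0)
w = rng.normal(size=10)
score_fn = lambda v: float(v @ w)          # scalar risk score for one instance
x = rng.normal(size=10)
phi = w * x                                # exact Shapley values for a linear model
                                           # with an all-zeros baseline
print(sparsity(phi), deletion_faithfulness(score_fn, x, phi, baseline=np.zeros(10)))
```

The paper's point is that an explanation can score well on proxies like these while still failing to improve what analysts actually decide, which is why the authors argue for human-grounded evaluation alongside them.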