Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

arXiv cs.AI / 4/27/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that Shapley-based XAI is undermined by fragmented variants and by evaluation methods that rely on quantitative proxies whose connection to human usefulness is not validated.
  • Using an amortized framework, the authors compare eight Shapley variants to measure semantic differences under low-latency constraints typical of operational risk workflows (a sketch of the amortized-explainer idea follows this list).
  • They run large-scale experiments across four risk datasets and a realistic fraud-detection setting with professional analysts reviewing 3,735 cases.
  • The results show that quantitative metrics such as sparsity and faithfulness do not reliably reflect human-perceived clarity or decision utility.
  • Although no Shapley formulation improved objective analyst performance, the explanations increased decision confidence, raising a serious automation-bias risk in high-stakes environments.
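
The summary does not spell out what "amortized" means in practice. As a rough, illustrative sketch of the general idea (in the spirit of FastSHAP-style amortization, not necessarily the authors' exact framework; all names and shapes below are assumptions), an explainer network is trained once and then produces per-feature attributions in a single forward pass, which is what makes explanation feasible under operational latency budgets:

```python
# Illustrative sketch only: an amortized explainer maps an input x directly to a
# per-feature attribution vector in one forward pass, instead of running a costly
# per-instance sampling procedure at serving time. Names and shapes are assumptions.
import torch
import torch.nn as nn

class AmortizedExplainer(nn.Module):
    """Predicts Shapley-style attributions phi(x) in a single forward pass."""
    def __init__(self, n_features: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_features),  # one attribution per input feature
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# At serving time, a single forward pass yields attributions within a tight latency budget.
explainer = AmortizedExplainer(n_features=30)
case = torch.randn(1, 30)          # one case awaiting analyst review
attributions = explainer(case)     # shape: (1, 30)

# Training (omitted here) would fit the network against a Shapley-value target,
# e.g. a weighted least-squares objective over sampled feature coalitions.
```

Because the per-case cost collapses to one forward pass, many Shapley formulations can be served and compared under the same latency budget, which is the role the amortized framework plays in the study.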

Abstract

Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
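
Sparsity and faithfulness are the quantitative proxies the paper argues are decoupled from human utility. As a rough illustration of how such proxies are commonly computed (these are generic definitions, not necessarily the paper's; the score function, baseline, and toy model below are assumptions), sparsity can be measured as the fraction of near-zero attributions, and a deletion-style faithfulness score as the drop in model output when the most-attributed features are replaced by a baseline:

```python
# Generic proxy metrics for a feature-attribution vector; illustrative only.
import numpy as np

def sparsity(attributions: np.ndarray, eps: float = 1e-3) -> float:
    """Fraction of features with (near-)zero attribution: higher means sparser."""
    return float(np.mean(np.abs(attributions) < eps))

def deletion_faithfulness(score_fn, x: np.ndarray, attributions: np.ndarray,
                          baseline: np.ndarray, k: int = 5) -> float:
    """Drop in the model's score after replacing the k most-attributed features
    with baseline values; a larger drop is read as a more faithful explanation."""
    top_k = np.argsort(-np.abs(attributions))[:k]
    x_del = x.copy()
    x_del[top_k] = baseline[top_k]
    return float(score_fn(x) - score_fn(x_del))

# Toy demonstration with a linear "risk model" (purely for illustration).
rng = np.random.default_rng(0)
w = rng.normal(size=10)
score_fn = lambda v: float(v @ w)          # scalar risk score for one instance
x = rng.normal(size=10)
phi = w * x                                # exact Shapley values for a linear model
                                           # with an all-zeros baseline
print(sparsity(phi), deletion_faithfulness(score_fn, x, phi, baseline=np.zeros(10)))
```

The paper's point is that an explanation can score well on proxies like these while still failing to improve what analysts actually decide, which is why the authors argue for human-grounded evaluation alongside them.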