"Faithful to What?" On the Limits of Fidelity-Based Explanations

arXiv stat.ML / 4/21/2026

Key Points

  • The paper argues that fidelity-based evaluation in explainable AI can be misleading because it measures agreement with a neural network’s learned predictions rather than alignment with the underlying data-generating signal.
  • It introduces a diagnostic called the linearity score (λ(f)) that quantifies how linearly decodable a regression network’s input-output behavior is.
  • Experiments on both synthetic and real regression datasets show that surrogates may match a neural network closely (high fidelity) yet fail to reproduce the predictive improvements that set the neural network apart from simpler baselines.
  • In several cases, high-fidelity surrogates even underperform straightforward linear baselines trained directly on the data.
  • The authors conclude that high-fidelity surrogate explanations do not necessarily explain task-relevant data structure, limiting their usefulness for reasoning about predictive performance.

Abstract

In explainable AI, surrogate models are commonly evaluated by their fidelity to a neural network's predictions. Fidelity, however, measures alignment to a learned model rather than alignment to the data-generating signal underlying the task. This work introduces the linearity score λ(f), a diagnostic that quantifies the extent to which a regression network's input-output behavior is linearly decodable. λ(f) is defined as an R² measure of surrogate fit to the network. Across synthetic and real-world regression datasets, we find that surrogates can achieve high fidelity to a neural network while failing to recover the predictive gains that distinguish the network from simpler models. In several cases, high-fidelity surrogates underperform even linear baselines trained directly on the data. These results demonstrate that explaining a model's behavior is not equivalent to explaining the task-relevant structure of the data, highlighting a limitation of fidelity-based explanations when used to reason about predictive performance.
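The distinction between fidelity to the network and fit to the data can be illustrated with a small sketch. It is not the authors' code: it assumes the surrogate is an ordinary least-squares linear model, computes the R² values in-sample, and uses a hand-written function `f` as a hypothetical stand-in for a trained regression network. The helper names (`r2`, `fit_linear`, `linearity_score`) and the synthetic data are likewise illustrative assumptions.

```python
import numpy as np

def r2(target, pred):
    """Coefficient of determination of `pred` with respect to `target`."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - np.mean(target)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_linear(X, t):
    """Ordinary least squares with an intercept; returns a predict function."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ coef

def linearity_score(f, X):
    """Assumed form of lambda(f): R^2 of a linear surrogate fit to f's outputs on X."""
    fx = f(X)
    surrogate = fit_linear(X, fx)
    return r2(fx, surrogate(X)), surrogate

# Hypothetical stand-in for a trained network: mostly linear, plus a
# quadratic term that no linear surrogate can represent.
def f(X):
    return 2.0 * X[:, 0] + 0.6 * (X[:, 1] ** 2 - 1.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = f(X) + 0.1 * rng.normal(size=len(X))  # synthetic targets (assumption)

lam, surrogate = linearity_score(f, X)   # fidelity: surrogate vs. network
baseline = fit_linear(X, y)              # linear model fit directly to the data

r2_net = r2(y, f(X))                     # network vs. data
r2_sur = r2(y, surrogate(X))             # surrogate vs. data
r2_base = r2(y, baseline(X))             # baseline vs. data

print(f"lambda(f)            = {lam:.3f}")
print(f"R^2 network on y     = {r2_net:.3f}")
print(f"R^2 surrogate on y   = {r2_sur:.3f}")
print(f"R^2 linear baseline  = {r2_base:.3f}")
```

In this toy setup the surrogate tracks the network fairly closely, yet its fit to `y` cannot exceed that of the directly trained linear baseline, because in-sample least squares is already the best linear fit to `y`. The gap between `r2_net` and `r2_sur` is the part of the network's predictive gain that the surrogate does not capture, whatever its fidelity.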