Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?

arXiv cs.AI / 4/6/2026


Key Points

  • The paper introduces the Feature Attribution Stability Suite (FASS) to benchmark how stable post-hoc feature attribution methods are under realistic input perturbations while controlling for prediction changes.
  • FASS improves evaluation by adding prediction-invariance filtering and splitting stability into structural similarity, rank correlation, and top-k Jaccard overlap, rather than relying on a single scalar metric.
  • Experiments across Integrated Gradients, GradientSHAP, Grad-CAM, and LIME show that stability varies strongly by perturbation family, with geometric perturbations producing much larger attribution instability than photometric ones.
  • Without conditioning on prediction preservation, the study finds that up to 99% of evaluated attribution pairs involve changed predictions, indicating that many prior stability results may conflate explanation fragility with model sensitivity.
  • Under the controlled evaluation, Grad-CAM shows the most consistently stable attribution patterns across ImageNet-1K, MS COCO, and CIFAR-10 and across four architectures.
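To make the three stability metrics concrete, here is a minimal NumPy sketch of how each could be computed between an original attribution map and a perturbed one. This is an illustrative reconstruction, not the paper's code: FASS's exact definitions (e.g. windowed rather than global SSIM, tie handling in the rank correlation, the choice of k) may differ.

```python
import numpy as np

def ssim_global(a, b, c1=1e-4, c2=9e-4):
    """Global structural similarity between two attribution maps
    (single-window SSIM; the paper may use a sliding-window variant)."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    )

def rank_correlation(a, b):
    """Spearman rank correlation over flattened maps (no tie handling)."""
    ra = np.argsort(np.argsort(a.ravel()))
    rb = np.argsort(np.argsort(b.ravel()))
    return np.corrcoef(ra, rb)[0, 1]

def topk_jaccard(a, b, k):
    """Jaccard overlap of the k most highly attributed locations."""
    ta = set(np.argsort(a.ravel())[-k:])
    tb = set(np.argsort(b.ravel())[-k:])
    return len(ta & tb) / len(ta | tb)
```

Reporting all three separately, rather than a single scalar, is what lets the benchmark distinguish, say, a map whose hot spots stay put but whose fine structure shifts (high Jaccard, lower SSIM) from one that degrades uniformly.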

Abstract

Post-hoc feature attribution methods are widely deployed in safety-critical vision systems, yet their stability under realistic input perturbations remains poorly characterized. Existing metrics evaluate explanations primarily under additive noise, collapse stability to a single scalar, and fail to condition on prediction preservation, conflating explanation fragility with model sensitivity. We introduce the Feature Attribution Stability Suite (FASS), a benchmark that enforces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates across geometric, photometric, and compression perturbations. Evaluating four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME) across four architectures and three datasets (ImageNet-1K, MS COCO, and CIFAR-10), FASS shows that stability estimates depend critically on perturbation family and prediction-invariance filtering. Geometric perturbations expose substantially greater attribution instability than photometric changes, and without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under this controlled evaluation, we observe consistent method-level trends, with Grad-CAM achieving the highest stability across datasets.
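The prediction-invariance filtering step described above can be sketched in a few lines. The function name and interface here are hypothetical, not from the paper: the idea is simply to discard any (original, perturbed) pair whose top-1 prediction changed, so that remaining attribution differences reflect explanation fragility rather than a change in the model's decision.

```python
import numpy as np

def prediction_invariant_pairs(predict, originals, perturbed):
    """Keep only (original, perturbed) input pairs whose top-1
    prediction is preserved under the perturbation, so attribution
    drift is not confounded with a changed model decision."""
    kept = []
    for x, x_p in zip(originals, perturbed):
        if np.argmax(predict(x)) == np.argmax(predict(x_p)):
            kept.append((x, x_p))
    return kept
```

With up to 99% of pairs involving changed predictions in the unfiltered setting, this filter can shrink the evaluation set dramatically, which is exactly why unconditioned stability numbers can be misleading.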