Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?

arXiv cs.AI / 4/6/2026


Key Points

  • The paper introduces the Feature Attribution Stability Suite (FASS) to benchmark how stable post-hoc feature attribution methods are under realistic input perturbations while controlling for prediction changes.
  • FASS improves evaluation by adding prediction-invariance filtering and splitting stability into structural similarity, rank correlation, and top-k Jaccard overlap, rather than relying on a single scalar metric.
  • Experiments across Integrated Gradients, GradientSHAP, Grad-CAM, and LIME show that stability varies strongly by perturbation family, with geometric perturbations producing much larger attribution instability than photometric ones.
  • Without conditioning on prediction preservation, the study finds that up to 99% of evaluated attribution pairs involve changed predictions, indicating that many prior stability results may conflate explanation fragility with model sensitivity.
  • Under the controlled evaluation, Grad-CAM shows the most consistently stable attribution patterns across ImageNet-1K, MS COCO, and CIFAR-10 and across four architectures.
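To make the three stability metrics concrete, here is a minimal NumPy sketch of how each could be computed between an original attribution map and a perturbed one. This is an illustrative reconstruction, not the paper's code: FASS's exact definitions (e.g. windowed rather than global SSIM, tie handling in the rank correlation, the choice of k) may differ.

```python
import numpy as np

def ssim_global(a, b, c1=1e-4, c2=9e-4):
    """Global structural similarity between two attribution maps
    (single-window SSIM; the paper may use a sliding-window variant)."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    )

def rank_correlation(a, b):
    """Spearman rank correlation over flattened maps (no tie handling)."""
    ra = np.argsort(np.argsort(a.ravel()))
    rb = np.argsort(np.argsort(b.ravel()))
    return np.corrcoef(ra, rb)[0, 1]

def topk_jaccard(a, b, k):
    """Jaccard overlap of the k most highly attributed locations."""
    ta = set(np.argsort(a.ravel())[-k:])
    tb = set(np.argsort(b.ravel())[-k:])
    return len(ta & tb) / len(ta | tb)
```

Reporting all three separately, rather than a single scalar, is what lets the benchmark distinguish, say, a map whose hot spots stay put but whose fine structure shifts (high Jaccard, lower SSIM) from one that degrades uniformly.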

Abstract

Post-hoc feature attribution methods are widely deployed in safety-critical vision systems, yet their stability under realistic input perturbations remains poorly characterized. Existing metrics evaluate explanations primarily under additive noise, collapse stability to a single scalar, and fail to condition on prediction preservation, conflating explanation fragility with model sensitivity. We introduce the Feature Attribution Stability Suite (FASS), a benchmark that enforces prediction-invariance filtering, decomposes stability into three complementary metrics (structural similarity, rank correlation, and top-k Jaccard overlap), and evaluates across geometric, photometric, and compression perturbations. Evaluating four attribution methods (Integrated Gradients, GradientSHAP, Grad-CAM, LIME) across four architectures and three datasets (ImageNet-1K, MS COCO, and CIFAR-10), FASS shows that stability estimates depend critically on perturbation family and prediction-invariance filtering. Geometric perturbations expose substantially greater attribution instability than photometric changes, and without conditioning on prediction preservation, up to 99% of evaluated pairs involve changed predictions. Under this controlled evaluation, we observe consistent method-level trends, with Grad-CAM achieving the highest stability across datasets.
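The prediction-invariance filtering step described above can be sketched in a few lines. The function name and interface here are hypothetical, not from the paper: the idea is simply to discard any (original, perturbed) pair whose top-1 prediction changed, so that remaining attribution differences reflect explanation fragility rather than a change in the model's decision.

```python
import numpy as np

def prediction_invariant_pairs(predict, originals, perturbed):
    """Keep only (original, perturbed) input pairs whose top-1
    prediction is preserved under the perturbation, so attribution
    drift is not confounded with a changed model decision."""
    kept = []
    for x, x_p in zip(originals, perturbed):
        if np.argmax(predict(x)) == np.argmax(predict(x_p)):
            kept.append((x, x_p))
    return kept
```

With up to 99% of pairs involving changed predictions in the unfiltered setting, this filter can shrink the evaluation set dramatically, which is exactly why unconditioned stability numbers can be misleading.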