Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

arXiv cs.CV / 4/9/2026


Key Points

  • The paper examines whether deep neural networks that predict human image authenticity judgments also produce explanations that are robust and identifiable, rather than merely correlating with behavior.
  • Experiments across multiple frozen vision models show predictive accuracy can reach ~80% of the noise ceiling, but explanation quality varies: some models (e.g., VGG) appear to track general image quality rather than authenticity-specific factors.
  • The attribution methods tested (Grad-CAM, LIME, and multiscale pixel masking; a masking sketch appears after this list) yield attribution maps that are stable within an architecture (especially for EfficientNetB3 and Barlow Twins) and are more consistent for images judged more authentic.
  • However, attribution agreement across different architectures is weak even when predictive performance is similar, indicating the explanations are not reliably identifiable.
  • The authors use ensembling to improve authenticity prediction and enable image-level attribution via pixel masking, yet conclude that successful behavioral prediction does not imply that explanations reflect underlying cognitive mechanisms.
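
Of the three attribution methods, multiscale pixel masking is the simplest to illustrate. Below is a minimal occlusion-style sketch, assuming a `model` that maps a single preprocessed image tensor to a scalar authenticity score; the patch sizes, stride fraction, and gray fill value are illustrative assumptions, not the paper's settings.

```python
import torch

def multiscale_masking_attribution(model, image, patch_sizes=(16, 32, 64), stride_frac=0.5):
    """Occlusion-style attribution: slide a neutral patch over the image at
    several scales and record how much the predicted score drops.

    `model` is assumed to map a (1, C, H, W) tensor to a scalar authenticity
    score; all hyperparameters here are illustrative.
    """
    model.eval()
    _, _, h, w = image.shape
    with torch.no_grad():
        base = model(image).item()                     # unmasked prediction
    attribution = torch.zeros(h, w)
    counts = torch.zeros(h, w)
    for p in patch_sizes:
        stride = max(1, int(p * stride_frac))
        for y in range(0, h - p + 1, stride):
            for x in range(0, w - p + 1, stride):
                masked = image.clone()
                masked[:, :, y:y+p, x:x+p] = 0.5       # neutral gray fill
                with torch.no_grad():
                    drop = base - model(masked).item() # importance = score drop
                attribution[y:y+p, x:x+p] += drop
                counts[y:y+p, x:x+p] += 1
    return attribution / counts.clamp(min=1)           # average across scales
```

Averaging the score drops across patch scales smooths out single-scale artifacts and yields one pixel-level map per image, which is the granularity the ensembling result relies on.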

Abstract

Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value itself depends on their robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.
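
The pipeline the abstract describes, pooled features from a frozen pretrained backbone feeding a lightweight regression head, can be sketched as follows. The ResNet-50 backbone, ridge head, and random stand-in data are assumptions for illustration; the paper uses other backbones (e.g., EfficientNetB3, Barlow Twins, VGG) and does not specify this exact head.

```python
import torch
import torchvision.models as models
from sklearn.linear_model import RidgeCV

# Frozen pretrained backbone; fc replaced so it returns pooled
# penultimate features rather than class logits.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # backbone stays frozen; only the head is fit

@torch.no_grad()
def extract_features(images):
    # images: (N, 3, 224, 224), already preprocessed for the backbone
    return backbone(images).cpu().numpy()

# Stand-in data: random images and ratings in place of the real stimuli
# and per-image human authenticity ratings.
train_images, train_ratings = torch.rand(64, 3, 224, 224), torch.rand(64).numpy()
test_images, test_ratings = torch.rand(16, 3, 224, 224), torch.rand(16).numpy()

X_train = extract_features(train_images)
X_test = extract_features(test_images)

# Lightweight regression head: cross-validated ridge on frozen features.
head = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0])
head.fit(X_train, train_ratings)
print("R^2 on held-out images:", head.score(X_test, test_ratings))
```

Because only the head is trained, the same frozen features can be reused across seeds and attribution methods, which is what makes the within-architecture stability comparisons cheap to run.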