Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces Energy-Regularized Spatial Masking (ERSM), which improves robustness and interpretability in vision CNNs by replacing brute-force dense feature processing with learned, input-adaptive feature selection.
  • ERSM embeds a lightweight Energy-Mask Layer that assigns each visual token a scalar “energy” combining unary intrinsic importance and a pairwise spatial coherence penalty, optimized via differentiable energy minimization.
  • The method avoids rigid sparsity budgets and heuristic pruning scores, instead letting the network discover an information-density equilibrium tailored to each input image.
  • Experiments on convolutional architectures show emergent sparsity, better robustness to structured occlusion, and more interpretable spatial masks while maintaining classification accuracy.
  • In deletion-based robustness tests, the learned energy ranking outperforms magnitude-based pruning and is argued to function as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
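
The energy described in the key points can be illustrated with a small NumPy sketch. This is not the paper's Energy-Mask Layer. The energy form (a linear unary score plus a squared-difference penalty between 4-neighbours), the names `energy_mask`, `neighbor_disagreement`, `w_unary`, and `lam`, and the plain gradient-descent solver are all assumptions made for illustration.

```python
import numpy as np

def neighbor_disagreement(m):
    """For each location i, sum (m_i - m_j) over its 4-neighbourhood."""
    # Edge padding makes out-of-grid neighbours equal to the border value,
    # so they contribute zero disagreement.
    p = np.pad(m, 1, mode="edge")
    neigh = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
    return 4.0 * m - neigh

def energy_mask(features, w_unary, lam=0.1, steps=50, lr=0.5):
    """Toy soft mask minimizing an ASSUMED energy (not the paper's):

        E(m) = sum_i unary_i * m_i + lam * sum_{i~j} (m_i - m_j)^2

    features: (H, W, C) feature map; w_unary: (C,) hypothetical scoring vector.
    Returns a mask in (0, 1) with shape (H, W).
    """
    # Unary cost is negative where the linear score marks a location as useful,
    # so keeping that location lowers the energy.
    unary = -(features @ w_unary)
    logits = np.zeros_like(unary)
    for _ in range(steps):
        m = 1.0 / (1.0 + np.exp(-logits))
        # dE/dm_i = unary_i + 2 * lam * sum_j (m_i - m_j)
        grad_m = unary + 2.0 * lam * neighbor_disagreement(m)
        # Chain rule through the sigmoid: dm/dlogit = m * (1 - m)
        logits -= lr * grad_m * m * (1.0 - m)
    return 1.0 / (1.0 + np.exp(-logits))
```

Nothing in this sketch fixes how many locations are kept. The number of locations with mask above 0.5 depends only on the input's unary scores and the coherence weight `lam`, which is the sense in which sparsity is not set by a budget here.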

Abstract

Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic unary importance cost and a pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
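
The abstract does not spell out the deletion protocol, so the following is only one plausible reading, sketched in NumPy. It assumes that locations are zeroed out from the lowest-ranked upward, as a pruning method would do, and that a better ranking keeps the prediction intact for longer. The function `deletion_curve` and the `predict` callable are hypothetical stand-ins, not part of ERSM.

```python
import numpy as np

def deletion_curve(features, importance, predict, fractions=(0.0, 0.5, 1.0)):
    """Zero out spatial locations from least to most important (ASSUMED
    protocol) and record predict() after each deletion level.

    features:   (H, W, C) feature map
    importance: (H, W) ranking score, e.g. a learned mask or a feature magnitude
    predict:    hypothetical callable mapping a feature map to a scalar score
    """
    h, w, _ = features.shape
    # Ascending sort: the lowest-importance locations are deleted first.
    order = np.argsort(importance.ravel(), kind="stable")
    curve = []
    for frac in fractions:
        k = int(round(frac * h * w))
        keep = np.ones(h * w, dtype=bool)
        keep[order[:k]] = False
        masked = features * keep.reshape(h, w, 1)
        curve.append(float(predict(masked)))
    return curve
```

Two rankings, for example an energy-derived one and a magnitude-derived one, can be compared by running this function with the same `features` and `predict`. Under the assumed protocol, the ranking whose curve stays high for larger deletion fractions is the more reliable one.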