Suppressing Non-Semantic Noise in Masked Image Modeling Representations

arXiv cs.CV / 4/2/2026

Key Points

  • Masked Image Modeling (MIM) representations can unintentionally preserve non-semantic “noise,” which degrades inference performance.
  • The paper proposes a model-agnostic semantic-invariance scoring approach using PCA on real and synthetic non-semantic images.
  • It introduces Semantically Orthogonal Artifact Projection (SOAP), a post-hoc method that suppresses non-semantic information in patch representations without additional training.
  • SOAP is designed to be plug-and-play: it can be attached to different MIM-based models as a single linear head and yields consistent gains in zero-shot performance.
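
The "single linear head" idea in the points above can be illustrated with a generic orthogonal projection. This is a minimal sketch, not the paper's implementation: it assumes the non-semantic directions are the top-k principal components of patch features extracted from non-semantic images, and the function names, the choice of k, and the random stand-in features are all hypothetical.

```python
import numpy as np

def fit_artifact_basis(nonsemantic_feats: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) matrix whose orthonormal columns are the top-k
    principal directions of patch features from non-semantic images.
    (Assumption: these directions stand in for the "artifact" subspace.)"""
    centered = nonsemantic_feats - nonsemantic_feats.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def make_projection_head(basis: np.ndarray) -> np.ndarray:
    """Build the (d, d) projector P = I - U U^T onto the orthogonal
    complement of span(U). It is a fixed matrix, so no training is needed."""
    d = basis.shape[0]
    return np.eye(d) - basis @ basis.T

def apply_head(patch_feats: np.ndarray, head: np.ndarray) -> np.ndarray:
    """Apply the head to (n_patches, d) features as one matrix multiply."""
    return patch_feats @ head

# Usage with random stand-in features (hypothetical shapes).
rng = np.random.default_rng(0)
nonsemantic = rng.normal(size=(200, 16))   # features of non-semantic images
basis = fit_artifact_basis(nonsemantic, k=3)
head = make_projection_head(basis)
cleaned = apply_head(rng.normal(size=(5, 16)), head)
```

Because the head is a fixed projection matrix, it can be precomputed once per backbone and applied to frozen patch features, which matches the post-hoc, training-free framing of the summary.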

Abstract

Masked Image Modeling (MIM) has become a ubiquitous self-supervised vision paradigm. In this work, we show that MIM objectives cause the learned representations to retain non-semantic information, which ultimately hurts performance during inference. We introduce a model-agnostic score for semantic invariance using Principal Component Analysis (PCA) on real and synthetic non-semantic images. Based on this score, we propose a simple method, Semantically Orthogonal Artifact Projection (SOAP), to directly suppress non-semantic information in patch representations, leading to consistent improvements in zero-shot performance across various MIM-based models. SOAP is a post-hoc suppression method, requires zero training, and can be attached to any model as a single linear head.
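
The abstract does not spell out how the semantic-invariance score is computed. As a purely hypothetical proxy, the sketch below measures what fraction of a model's centered feature energy falls inside a PCA subspace fitted on non-semantic images. The names `top_directions` and `artifact_energy_ratio`, and the reading "lower ratio means less non-semantic leakage", are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def top_directions(feats: np.ndarray, k: int) -> np.ndarray:
    """Top-k principal directions of feats as orthonormal columns, shape (d, k)."""
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def artifact_energy_ratio(feats: np.ndarray, basis: np.ndarray) -> float:
    """Fraction of centered feature energy lying in span(basis).
    Assumes basis has orthonormal columns; the result lies in [0, 1]."""
    centered = feats - feats.mean(axis=0)
    total = float(np.sum(centered ** 2))
    if total == 0.0:
        return 0.0  # constant features carry no energy in any direction
    inside = float(np.sum((centered @ basis) ** 2))
    return inside / total

# Usage with random stand-in features (hypothetical shapes).
rng = np.random.default_rng(0)
nonsemantic = rng.normal(size=(300, 8))   # features of non-semantic images
real = rng.normal(size=(100, 8))          # features of real images
ratio = artifact_energy_ratio(real, top_directions(nonsemantic, k=2))
```

Under this proxy, comparing the ratio before and after a suppression step gives a rough, model-agnostic check of whether the non-semantic subspace still carries feature energy.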