Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

arXiv cs.CV / 4/17/2026


Key Points

  • The study challenges the common “zero-ablation” practice in vision transformers, showing that replacing token activations with zero vectors can exaggerate how much the model depends on exact register content in DINOv2+registers and DINOv3.
  • When the authors instead use three replacement controls (mean-substitution, noise-substitution, and cross-image register-shuffling), performance on classification, correspondence, and segmentation stays within about 1 percentage point of the unmodified baseline, whereas zeroing causes drops of up to 36.6 pp in classification and 30.9 pp in segmentation.
  • Per-patch cosine similarity analysis indicates that while all replacements perturb internal representations, zeroing introduces disproportionately large changes that align with the observed task degradation.
  • The paper concludes that, in the frozen-feature evaluations tested, downstream performance relies on “plausible, register-like” activations rather than on exact image-specific register values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry.
  • The key findings are reported to replicate at ViT-B scale, strengthening the generality of the replacement-control conclusion.
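
The four interventions compared above can be sketched on toy data. This is a minimal illustration, not the authors' implementation: the function name `replace_registers`, the nested-list layout (images, then register slots, then feature dimensions), and the choice of Gaussian noise matched to per-slot statistics are all assumptions made here for clarity, and plain Python lists stand in for model activations.

```python
import math
import random


def replace_registers(registers, mode, rng=None, noise_scale=1.0):
    """Return new register activations under one intervention (hypothetical helper).

    registers: list over images; each image is a list of register
               vectors; each vector is a list of floats.
    mode:      "zero", "mean", "noise", or "shuffle".
    """
    rng = rng or random.Random(0)
    n_images = len(registers)
    n_regs = len(registers[0])
    dim = len(registers[0][0])

    if mode == "zero":
        # Zero-ablation: every register vector becomes the zero vector.
        return [[[0.0] * dim for _ in range(n_regs)] for _ in range(n_images)]

    # Mean over images, kept separately per register slot and dimension.
    mean = [
        [sum(registers[i][r][d] for i in range(n_images)) / n_images
         for d in range(dim)]
        for r in range(n_regs)
    ]

    if mode == "mean":
        # Mean-substitution: every image receives the same average registers.
        return [[list(mean[r]) for r in range(n_regs)] for _ in range(n_images)]

    if mode == "noise":
        # Noise-substitution. ASSUMPTION: Gaussian noise matched to the
        # per-slot, per-dimension mean and standard deviation.
        std = [
            [math.sqrt(sum((registers[i][r][d] - mean[r][d]) ** 2
                           for i in range(n_images)) / n_images)
             for d in range(dim)]
            for r in range(n_regs)
        ]
        return [
            [[rng.gauss(mean[r][d], noise_scale * std[r][d]) for d in range(dim)]
             for r in range(n_regs)]
            for _ in range(n_images)
        ]

    if mode == "shuffle":
        # Cross-image shuffling: each image receives another image's registers.
        # A cyclic shift by a nonzero offset guarantees no image keeps its own.
        if n_images < 2:
            raise ValueError("shuffle needs at least two images")
        offset = rng.randrange(1, n_images)
        return [
            [list(vec) for vec in registers[(i + offset) % n_images]]
            for i in range(n_images)
        ]

    raise ValueError(f"unknown mode: {mode!r}")
```

Only the "zero" branch produces vectors that no real image would generate; the other three keep the registers within the range of values the model normally sees, which is the distinction the replacement controls are designed to isolate.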

Abstract

Zero-ablation (replacing token activations with zero vectors) is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to -36.6 pp classification, -30.9 pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls (mean-substitution, noise-substitution, and cross-image register-shuffling) preserve performance across classification, correspondence, and segmentation, remaining within ~1 pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
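
The per-patch cosine similarity mentioned in the abstract can be illustrated with a short sketch. It assumes that the comparison is between each patch's feature vector from the unmodified model and the same patch's vector after an intervention; the function names and the list-of-lists layout are hypothetical, and the convention of returning 0.0 for a zero-norm vector is a choice made here, not something the abstract specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # ASSUMPTION: treat a zero-norm vector as having similarity 0.
        return 0.0
    return dot / (norm_u * norm_v)


def per_patch_cosine(baseline, perturbed):
    """Similarity of each patch to its own baseline counterpart.

    baseline, perturbed: lists of patch feature vectors in the same order.
    """
    if len(baseline) != len(perturbed):
        raise ValueError("patch counts differ")
    return [cosine_similarity(b, p) for b, p in zip(baseline, perturbed)]


def mean_per_patch_cosine(baseline, perturbed):
    """Average the per-patch similarities into a single score."""
    sims = per_patch_cosine(baseline, perturbed)
    return sum(sims) / len(sims)
```

A score near 1 means the intervention barely moved the patch features, while lower scores indicate larger perturbations. Under this reading, the abstract's claim is that all replacements lower the score, with zeroing lowering it disproportionately.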