Zero-Ablation Overstates Register Content Dependence in DINO Vision Transformers

arXiv cs.CV / 4/17/2026



Key Points

The study challenges the common “zero-ablation” practice in vision transformers, showing that replacing register token activations with zero vectors exaggerates how much DINOv2+registers and DINOv3 depend on exact register content.
When the authors instead apply three replacement controls (mean-substitution, noise-substitution, and cross-image register-shuffling), performance on classification, correspondence, and segmentation stays within about 1 percentage point of the unmodified baseline, whereas zeroing causes drops of up to 36.6 pp on classification and 30.9 pp on segmentation.
Per-patch cosine similarity analysis indicates that while all replacements perturb internal representations, zeroing introduces disproportionately large changes that align with the observed task degradation.
The paper concludes that downstream performance in frozen-feature settings relies more on “plausible, register-like” activations than on exact image-specific register values, while registers still play a role in buffering features from [CLS] dependence and are linked to compressed patch geometry.
The key findings are reported to replicate at ViT-B scale, strengthening the generality of the replacement-control conclusion.
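The four interventions named above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the authors' implementation: the function names, the nested-list layout `registers[i][r]`, the choice of per-slot batch statistics for the mean and noise controls, and the cyclic shift used for shuffling are all assumptions made for the example.

```python
import random


def slot_stats(registers):
    """Per-slot, per-dimension mean and std across the batch of images."""
    n_img, n_reg, dim = len(registers), len(registers[0]), len(registers[0][0])
    mean = [[sum(img[r][d] for img in registers) / n_img for d in range(dim)]
            for r in range(n_reg)]
    std = [[(sum((img[r][d] - mean[r][d]) ** 2 for img in registers) / n_img) ** 0.5
            for d in range(dim)] for r in range(n_reg)]
    return mean, std


def replace_registers(registers, mode, rng):
    """Return a new batch with every register token replaced according to `mode`.

    registers[i][r] is the activation vector (list of floats) of register r
    for image i. This layout is hypothetical; the paper's exact procedure
    may differ.
    """
    n_img = len(registers)
    mean, std = slot_stats(registers)
    if mode == "zero":
        # Zero-ablation: every register becomes the zero vector.
        return [[[0.0] * len(vec) for vec in img] for img in registers]
    if mode == "mean":
        # Mean-substitution: every image receives the batch-mean register.
        return [[list(vec) for vec in mean] for _ in range(n_img)]
    if mode == "noise":
        # Noise-substitution (assumed): Gaussian noise matched to batch statistics.
        return [[[rng.gauss(m, s) for m, s in zip(mean[r], std[r])]
                 for r in range(len(mean))] for _ in range(n_img)]
    if mode == "shuffle":
        # Cross-image shuffling: each image receives another image's registers.
        # A non-zero cyclic shift guarantees no image keeps its own (needs >= 2 images).
        offset = rng.randrange(1, n_img)
        return [[list(vec) for vec in registers[(i + offset) % n_img]]
                for i in range(n_img)]
    raise ValueError(f"unknown mode: {mode}")
```

In a real experiment the returned registers would be written back into the token sequence at the chosen layer before the remaining blocks run. Only the zero mode produces activations that fall outside the distribution of typical register values, which is the distinction the key points draw.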

Abstract

Zero-ablation -- replacing token activations with zero vectors -- is widely used to probe token function in vision transformers. Register zeroing in DINOv2+registers and DINOv3 produces large drops (up to -36.6 pp classification, -30.9 pp segmentation), suggesting registers are functionally indispensable. However, three replacement controls -- mean-substitution, noise-substitution, and cross-image register-shuffling -- preserve performance across classification, correspondence, and segmentation, remaining within ~1 pp of the unmodified baseline. Per-patch cosine similarity shows these replacements genuinely perturb internal representations, while zeroing causes disproportionately large perturbations, consistent with why it alone degrades tasks. We conclude that zero-ablation overstates dependence on exact register content. In the frozen-feature evaluations we test, performance depends on plausible register-like activations rather than on exact image-specific values. Registers nevertheless buffer dense features from [CLS] dependence and are associated with compressed patch geometry. These findings, including the replacement-control results, replicate at ViT-B scale.
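The per-patch cosine similarity mentioned in the abstract can be illustrated as follows. This is a sketch under assumptions: patch features are taken to be plain lists of floats, the helper names are hypothetical, and the convention of returning 0.0 for a zero-norm vector is a choice made here rather than something the paper specifies.

```python
import math


def per_patch_cosine(baseline, modified):
    """Cosine similarity between corresponding patch feature vectors.

    baseline[p] and modified[p] are the features of patch p without and
    with a register intervention, respectively.
    """
    sims = []
    for a, b in zip(baseline, modified):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        # Assumed convention: similarity is 0.0 when either vector has zero norm.
        sims.append(dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0)
    return sims


def mean_similarity(baseline, modified):
    """Average per-patch cosine similarity over all patches of one image."""
    sims = per_patch_cosine(baseline, modified)
    return sum(sims) / len(sims)
```

A mean similarity near 1 indicates that an intervention left the patch representations almost unchanged. Comparing this value across the zero, mean, noise, and shuffle conditions is one way to quantify how much more strongly zeroing perturbs the features than the replacement controls do.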