Finding Distributed Object-Centric Properties in Self-Supervised Transformers

arXiv cs.AI / 3/30/2026



Key Points

The paper studies why self-supervised Vision Transformers (e.g., DINO) localize objects poorly when relying on [CLS] token attention. It argues that the [CLS] token, trained on an image-level objective, summarizes the whole image and so dilutes the object-centric signals present in patch-level interactions.
By analyzing patch-to-patch similarity computed from patch-level attention components (query, key, and value) across all layers, the authors find that object-centric properties are encoded in the similarity maps derived from all three components and are distributed across the network rather than confined to the final layer.
The work introduces Object-DINO, a training-free method that clusters attention heads across layers based on patch similarity to automatically identify an object-centric cluster representing objects in images.
Experiments show Object-DINO improves unsupervised object discovery performance (CorLoc gains of +3.6 to +12.4) and reduces object hallucination in Multimodal Large Language Models via visual grounding.
Overall, the results suggest that extracting distributed object-centric information from self-supervised transformers can boost downstream tasks without any additional model training.
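
For readers unfamiliar with the metric, CorLoc is commonly defined as the fraction of images in which the predicted box overlaps at least one ground-truth box with an IoU of 0.5 or more. The sketch below illustrates that common definition only; the summary does not state the paper's exact evaluation protocol, and the box format `(x1, y1, x2, y2)` and the function names `iou` and `corloc` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Fraction of images whose single predicted box matches any ground-truth box.

    predictions:   one (x1, y1, x2, y2) box per image (hypothetical format).
    ground_truths: one list of ground-truth boxes per image.
    """
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou(pred, gt) >= threshold for gt in gts)
    )
    return hits / len(predictions)


# Toy usage: the first image is localized correctly, the second is not.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [[(0, 0, 10, 10)], [(20, 20, 30, 30)]]
score = corloc(preds, gts)
```

Under this definition, the reported gains of +3.6 to +12.4 would most naturally be read as percentage points of CorLoc, though the summary does not say so explicitly.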

Abstract

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components (q, k, v), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.
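
The two steps described in the abstract, computing inter-patch similarity per attention head and then clustering heads by those similarities, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the similarity measure or the clustering algorithm, so cosine similarity and a small k-means with farthest-point initialization are stand-ins, and the function names are hypothetical. How Object-DINO selects the object-centric cluster is not described in the abstract and is left out.

```python
import numpy as np


def patch_similarity(features):
    """Cosine similarity between all pairs of patches.

    features: (num_patches, dim) array, e.g. one head's query, key, or
    value vectors for a single image (assumed input layout).
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)
    return unit @ unit.T


def head_descriptors(heads):
    """Flatten each head's patch-similarity map into one descriptor vector.

    heads: (num_heads_total, num_patches, dim), heads gathered from all layers.
    """
    return np.stack([patch_similarity(h).ravel() for h in heads])


def kmeans(points, k, iters=20):
    """Tiny k-means (assumed clustering choice) with deterministic init."""
    # Farthest-point initialization: start at the first point, then
    # repeatedly add the point farthest from the centers chosen so far.
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.stack(centers).astype(float)

    for _ in range(iters):
        # Assign each head to its nearest center, then recompute centers.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return labels


# Toy usage: four "heads" over four patches; two heads group patches
# {0,1} vs {2,3}, the other two group patches {0,2} vs {1,3}.
a = np.array([[1.0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0]])
b = np.array([[1.0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]])
heads = np.stack([a, 2 * a, b, 3 * b])
labels = kmeans(head_descriptors(heads), k=2)
```

In the toy example, heads that induce the same patch grouping receive the same cluster label regardless of feature scale, since cosine similarity ignores vector magnitude.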
