From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers

arXiv cs.CV / 4/28/2026



Key Points

The paper studies how Vision Transformers trained only for image classification learn spatial structure without spatial supervision during pretraining.
Layerwise probing on BSDS500 (local boundary structure) and NYU Depth V2 (per-patch depth) finds a hierarchy: boundaries become linearly decodable around layers 5–6, while depth emerges later and peaks at layer 8.
The spatial signals disappear at the final classification layer, and random-weight controls indicate the encodings are learned rather than artifacts of the model architecture.
Causal interventions (ablating specific probe directions and activation patching) show that depth decoding depends on a single learned direction, and that the depth signal is partially re-derived at each layer rather than passively carried in the residual stream, with mid-layer interventions persisting most strongly downstream.
The authors conclude that classification-trained ViTs actively maintain a spatial hierarchy resembling the progression from early to late areas in primate visual cortex.
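The layerwise probing idea in the points above can be illustrated with a minimal sketch. It assumes a closed-form ridge regression probe and uses synthetic features in place of frozen ViT activations; the helper names (`fit_linear_probe`, `probe_mae`, `make_synthetic_layers`), the regularization strength, and the way signal strength grows with layer index are all hypothetical choices for illustration, not the paper's setup, and the numbers it prints are unrelated to the reported results.

```python
import numpy as np

def fit_linear_probe(feats, targets, l2=1e-3):
    """Closed-form ridge regression from frozen features to a scalar target."""
    feat_mean = feats.mean(axis=0)
    target_mean = targets.mean()
    centered = feats - feat_mean
    gram = centered.T @ centered + l2 * np.eye(feats.shape[1])
    w = np.linalg.solve(gram, centered.T @ (targets - target_mean))
    b = target_mean - feat_mean @ w
    return w, b

def probe_mae(feats, targets, w, b):
    """Mean absolute error of the linear probe on held-out patches."""
    return float(np.mean(np.abs(feats @ w + b - targets)))

def make_synthetic_layers(n=2000, d=32, n_layers=4, seed=0):
    """Stand-in for per-layer patch features (assumption: synthetic data).

    Depth is written along one fixed direction with a strength that grows
    linearly from 0 (first layer) to 1 (last layer)."""
    rng = np.random.default_rng(seed)
    depth = rng.uniform(0.0, 1.0, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    layers = []
    for k in range(n_layers):
        strength = k / (n_layers - 1)
        noise = rng.normal(size=(n, d))
        layers.append(noise + 5.0 * strength * np.outer(depth, direction))
    return layers, depth

layers, depth = make_synthetic_layers()
split = 1500  # first 1500 patches train the probe, the rest evaluate it
maes = []
for k, feats in enumerate(layers):
    w, b = fit_linear_probe(feats[:split], depth[:split])
    maes.append(probe_mae(feats[split:], depth[split:], w, b))
    print(f"layer {k}: held-out MAE = {maes[-1]:.4f}")
```

In a real experiment the synthetic layers would be replaced by activations extracted from each block of the frozen model, with one probe fitted per layer and the held-out error compared across layers.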

Abstract

Vision Transformers trained only on image classification routinely transfer to tasks that demand spatial understanding, yet they receive no spatial supervision during pretraining. We ask where and how robustly such structure is encoded. Probing a frozen ViT-B/16 layerwise for two complementary properties, local patch boundaries (BSDS500) and per-patch depth (NYU Depth V2), reveals a clear hierarchy: boundary structure becomes linearly decodable at layers 5-6 (AP = 0.833), while depth, which requires integrating global cues, peaks two to three layers later at layer 8 (MAE = 0.0875). Both signals collapse at the final classification layer, and random-weight controls confirm the encodings are learned rather than architectural. Causal interventions add specificity: ablating the single direction a linear depth probe reads degrades depth decoding by up to 165%, while ablating any other direction changes it by less than 1%. Targeted activation patching along that direction shows the depth signal is partially re-derived at each layer rather than passively carried in the residual stream, with mid-layer interventions persisting most strongly downstream. The result is that a classification-trained ViT develops an actively maintained spatial hierarchy that mirrors the early-to-late progression observed in the primate visual cortex.
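The direction-ablation intervention described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: features are synthetic, the probe is an ordinary least-squares fit, ablation is implemented as removing each patch's centered component along a direction (mean ablation), and the frozen probe is re-applied without refitting. The paper may define the ablation and the metric differently; `ablate_direction` and `mae` are hypothetical helper names.

```python
import numpy as np

def ablate_direction(feats, direction):
    """Remove each row's centered component along `direction` (mean ablation)."""
    u = direction / np.linalg.norm(direction)
    coeff = (feats - feats.mean(axis=0)) @ u
    return feats - np.outer(coeff, u)

def mae(feats, targets, w, b):
    """Mean absolute error of a fixed linear probe."""
    return float(np.mean(np.abs(feats @ w + b - targets)))

# Synthetic stand-in for one layer's patch features (assumption, not real data).
rng = np.random.default_rng(0)
n, d, n_train = 2000, 32, 1500
depth = rng.uniform(0.0, 1.0, size=n)
signal_dir = rng.normal(size=d)
signal_dir /= np.linalg.norm(signal_dir)
feats = rng.normal(size=(n, d)) + 5.0 * np.outer(depth, signal_dir)

# Fit a linear probe with a bias term by least squares on the training split.
design = np.hstack([feats[:n_train], np.ones((n_train, 1))])
coef, *_ = np.linalg.lstsq(design, depth[:n_train], rcond=None)
w, b = coef[:-1], coef[-1]

test_feats, test_depth = feats[n_train:], depth[n_train:]
base = mae(test_feats, test_depth, w, b)

# Ablate the direction the probe reads, then a random control direction.
probe_ablated = mae(ablate_direction(test_feats, w), test_depth, w, b)
random_dir = rng.normal(size=d)
random_ablated = mae(ablate_direction(test_feats, random_dir), test_depth, w, b)

rel_probe = (probe_ablated - base) / base
rel_random = (random_ablated - base) / base
print(f"relative MAE change, probe direction:  {rel_probe:+.1%}")
print(f"relative MAE change, random direction: {rel_random:+.1%}")
```

Removing the probe direction makes the frozen probe's output constant, so its error rises sharply, while a random direction overlaps only weakly with the probe weights in a 32-dimensional space and changes the error far less. The same comparison on real activations is what distinguishes a specific learned direction from a diffusely encoded signal.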
