From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers
arXiv cs.CV / 4/28/2026
📰 News · Models & Research
Key Points
- The paper studies how Vision Transformers learn spatial structure when pretrained solely for image classification, with no spatial supervision.
- Layerwise linear probing on BSDS500 (local boundary structure) and NYU Depth V2 (per-patch depth) reveals a hierarchy: boundaries become linearly decodable around layers 5–6, while depth emerges later and peaks at layer 8 (a minimal probing sketch follows this list).
- The spatial signals disappear at the final classification layer, and random-weight controls indicate the encodings are learned rather than byproducts of the architecture alone.
- Causal interventions (ablating individual probe directions and patching activations between runs) show that depth decoding depends on a specific learned direction and is re-derived layer by layer, with mid-layer edits having the largest downstream effect (see the intervention sketch below).
- The authors conclude that classification-trained ViTs actively maintain a spatial hierarchy resembling the progression from early to late areas in primate visual cortex.
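A minimal sketch of the layerwise-probing setup described above, assuming a timm ViT-B/16 and random placeholder tensors in place of NYU Depth V2 images and depth maps; the variable names and the ridge penalty are illustrative, not the paper's code.

```python
import torch
import timm

# Classification-trained ViT (downloads weights); swap pretrained=False
# for the random-weight control mentioned in the third bullet.
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

# Capture each transformer block's token output with forward hooks.
acts = {}
def make_hook(i):
    def hook(module, inputs, output):
        acts[i] = output.detach()          # (B, 1 + N_patches, D)
    return hook

for i, blk in enumerate(model.blocks):
    blk.register_forward_hook(make_hook(i))

images = torch.randn(8, 3, 224, 224)       # placeholder for NYU Depth V2 crops
depth = torch.randn(8, 14 * 14)            # placeholder per-patch depth targets

with torch.no_grad():
    model(images)

# One closed-form ridge probe per layer on the patch tokens (CLS dropped);
# the reported hierarchy would show up as layer-dependent R^2.
lam = 1e-2
for i in sorted(acts):
    X = acts[i][:, 1:, :].reshape(-1, acts[i].shape[-1])   # (B*N, D)
    y = depth.reshape(-1, 1)
    w = torch.linalg.solve(X.T @ X + lam * torch.eye(X.shape[-1]), X.T @ y)
    pred = X @ w
    r2 = 1.0 - ((pred - y) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    print(f"layer {i:2d}: probe R^2 = {r2.item():.3f}")
```

The boundary probe on BSDS500 would be the analogous per-patch classification probe; rerunning the loop with `pretrained=False` gives the random-weight baseline.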
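And a hedged sketch of the two interventions in the fourth bullet, reusing `model` and `acts` from the probing sketch. A PyTorch forward hook that returns a tensor replaces the block's output, so both edits propagate to every later layer; `w_depth` is a hypothetical learned depth direction (e.g., the probe weights), not a quantity taken from the paper.

```python
import torch

def ablate_direction(feats, w):
    """Remove the component of every token along unit direction w."""
    w = w / w.norm()
    return feats - (feats @ w).unsqueeze(-1) * w

def make_ablation_hook(w):
    # Returning a tensor from a forward hook replaces the block's output,
    # so all downstream layers see the ablated residual stream.
    def hook(module, inputs, output):
        return ablate_direction(output, w)
    return hook

def make_patch_hook(cached):
    # Activation patching: overwrite this block's output with activations
    # cached from a different forward pass.
    def hook(module, inputs, output):
        return cached
    return hook

# Hypothetical usage with `model`/`images` from the probing sketch:
# w_depth = w.squeeze(-1)   # probe weights as the candidate depth direction
# h = model.blocks[7].register_forward_hook(make_ablation_hook(w_depth))
# model(images)             # re-fit probes on later layers, then h.remove()
```

Comparing downstream probe R^2 with and without the edit, layer by layer, is the kind of measurement that would surface the reported finding that mid-layer changes affect later layers most.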