Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning
arXiv cs.AI / 2026/3/24
Key points
- The paper examines how attention heads in vision-language models (VLMs) contribute to spatial reasoning using mechanistic interpretability and functional analysis of attention behavior.
- It introduces CogVSR, a dataset that breaks down complex spatial reasoning questions into step-by-step subquestions mapped to specific cognitive functions (e.g., spatial perception, relational reasoning) to support chain-of-thought-style evaluation.
- The authors develop a probing framework to identify attention heads specialized for different spatial/cognitive functions across multiple VLM families.
- Results show that across all model families tested, functionally specialized heads are sparse, and heads specialized for spatial reasoning are rarer than those for other cognitive functions.
- Intervention experiments indicate that removing spatially functional heads degrades performance, while emphasizing latent spatial heads improves spatial understanding, suggesting pathways to enhance multimodal spatial reasoning.
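The head-probing idea in the bullets above can be illustrated with a toy sketch: given per-head activations for examples labeled with a cognitive function, score each head by how well a simple probe separates the labels, and rank heads by that score. Everything here is illustrative (synthetic data, a nearest-centroid probe, made-up dimensions); the paper's actual probing framework is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

n_heads, n_examples, d = 8, 200, 16
# Synthetic per-head activations: shape (heads, examples, dim).
acts = rng.normal(size=(n_heads, n_examples, d))
labels = rng.integers(0, 2, size=n_examples)  # binary cognitive-function label

# Make head 3 "specialized": shift its activations according to the label.
acts[3] += labels[:, None] * 2.0

def probe_accuracy(feats, labels):
    """Nearest-centroid probe: classify each example by its closer class mean."""
    mu0 = feats[labels == 0].mean(axis=0)
    mu1 = feats[labels == 1].mean(axis=0)
    pred = (np.linalg.norm(feats - mu1, axis=1)
            < np.linalg.norm(feats - mu0, axis=1)).astype(int)
    return (pred == labels).mean()

# Score every head; only the planted head should probe well.
scores = np.array([probe_accuracy(acts[h], labels) for h in range(n_heads)])
print("per-head probe accuracy:", np.round(scores, 2))
print("most specialized head:", int(scores.argmax()))  # head 3 by construction
```

In this toy setup the specialized head probes near 100% accuracy while the others stay near chance, mirroring the paper's finding that specialization is concentrated in a sparse subset of heads.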

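The intervention experiments in the summary (removing or emphasizing particular heads) can be sketched as scaling per-head outputs before they are combined. This is a minimal toy: real transformers concatenate heads and apply an output projection, but scaling a head's output before that projection scales its contribution the same way. Head indices and scale factors are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

n_heads, seq, d = 4, 5, 8
# Toy per-head outputs from one multi-head attention layer.
head_out = rng.normal(size=(n_heads, seq, d))

def combine(head_out, head_scales=None):
    """Sum per-head outputs, optionally scaling each head.

    head_scales: per-head multipliers; 0.0 ablates a head, values > 1.0
    emphasize it (the two interventions described above).
    """
    if head_scales is None:
        head_scales = np.ones(head_out.shape[0])
    return (head_scales[:, None, None] * head_out).sum(axis=0)

base = combine(head_out)
ablated = combine(head_out, np.array([1.0, 1.0, 0.0, 1.0]))  # remove head 2
boosted = combine(head_out, np.array([1.0, 1.0, 2.0, 1.0]))  # emphasize head 2

# Ablation subtracts exactly head 2's contribution; boosting adds one extra copy.
print(np.allclose(base - ablated, head_out[2]))  # True
print(np.allclose(boosted - base, head_out[2]))  # True
```

In a real VLM the same effect is typically achieved with forward hooks that zero or rescale a chosen head's output slice at inference time, then measure the change in spatial-reasoning accuracy.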