Attention in Space: Functional Roles of VLM Heads for Spatial Reasoning

arXiv cs.AI / 2026-03-24


Key Points

  • The paper examines how attention heads in vision-language models (VLMs) contribute to spatial reasoning, combining mechanistic interpretability with a functional analysis of attention behavior.
  • It introduces CogVSR, a dataset that breaks down complex spatial reasoning questions into step-by-step subquestions mapped to specific cognitive functions (e.g., spatial perception, relational reasoning) to support chain-of-thought-style evaluation; a hypothetical record layout is sketched after this list.
  • The authors develop a probing framework to identify attention heads specialized for different spatial/cognitive functions across multiple VLM families (a minimal probing sketch also follows this list).
  • Results show that functionally specialized heads are universally sparse, and heads specialized for spatial reasoning are fewer than those for other cognitive functions.
  • Intervention experiments indicate that removing spatially functional heads degrades performance, while emphasizing latent spatial heads improves spatial understanding, suggesting pathways to enhance multimodal spatial reasoning.
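
To make the decomposition concrete, here is a hypothetical sketch of what a CogVSR-style record might look like: one complex spatial question broken into subquestions, each tagged with the cognitive function it exercises. The field names and the example question are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical CogVSR-style record; field names and content are
# illustrative, not taken from the released dataset.
example = {
    "question": "Is the mug to the left of the laptop and closer to the camera?",
    "subquestions": [
        {"text": "Where is the mug located in the image?",
         "function": "spatial_perception"},
        {"text": "Where is the laptop located in the image?",
         "function": "spatial_perception"},
        {"text": "Is the mug to the left of the laptop?",
         "function": "relational_reasoning"},
        {"text": "Is the mug closer to the camera than the laptop?",
         "function": "relational_reasoning"},
    ],
}
```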

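The paper's probing framework itself is not shown here, but the core idea of scoring each attention head for a given cognitive function can be sketched as follows, assuming per-head output activations have been cached offline. The cache shape, labels, and the 95th-percentile cutoff are illustrative stand-ins, not the authors' pipeline.

```python
# Minimal per-head probing sketch: fit a linear probe on each head's cached
# activations and treat held-out probe accuracy as a specialization score.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
N, L, H, D = 512, 8, 12, 64           # examples, layers, heads, head dim (toy sizes)
acts = rng.normal(size=(N, L, H, D))  # stand-in for cached per-head outputs
labels = rng.integers(0, 2, size=N)   # e.g. 1 = subquestion answered correctly

scores = np.zeros((L, H))
for l in range(L):
    for h in range(H):
        probe = LogisticRegression(max_iter=1000)
        # Held-out accuracy of a probe on head (l, h)'s activations.
        scores[l, h] = cross_val_score(probe, acts[:, l, h, :], labels, cv=3).mean()

# The paper reports that functional heads are sparse, so a high-percentile
# cutoff is a natural way to select candidates.
candidates = np.argwhere(scores >= np.percentile(scores, 95))
print(f"{len(candidates)} candidate functional heads (layer, head):", candidates.tolist())
```
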
Abstract

Despite remarkable advances in large Vision-Language Models (VLMs), spatial reasoning remains a persistent challenge. In this work, we investigate how attention heads within VLMs contribute to spatial reasoning by analyzing their functional roles through a mechanistic interpretability lens. We introduce CogVSR, a dataset that decomposes complex spatial reasoning questions into step-by-step subquestions designed to simulate human-like reasoning via a chain-of-thought paradigm, with each subquestion linked to specific cognitive functions such as spatial perception or relational reasoning. Building on CogVSR, we develop a probing framework to identify and characterize attention heads specialized for these functions. Our analysis across diverse VLM families reveals that these functional heads are universally sparse but vary in number and distribution across functions. Notably, spatially specialized heads are fewer than those for other cognitive functions, highlighting their scarcity. We propose methods to activate latent spatial heads, improving spatial understanding. Intervention experiments further demonstrate their critical role in spatial reasoning: removing functional heads leads to performance degradation, while emphasizing them enhances accuracy. This study provides new interpretability-driven insights into how VLMs attend to space and paves the way for enhancing complex spatial reasoning in multimodal models.
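
The ablation and emphasis interventions described above can be approximated with a forward pre-hook on the output projection of an attention block, which receives the concatenated per-head outputs in most transformer implementations (e.g., HF-style `o_proj`). The sketch below uses a toy linear layer as a stand-in; the module, head indices, and scaling factors are assumptions for illustration, not the paper's exact procedure.

```python
# Head-level intervention sketch: scale per-head slices of the input to the
# attention output projection. A factor of 0.0 ablates a head; >1.0 emphasizes it.
import torch
import torch.nn as nn

def scale_heads_hook(head_dim, head_scales):
    def hook(module, inputs):
        (x,) = inputs  # shape (..., num_heads * head_dim), heads concatenated
        x = x.clone()
        for h, s in head_scales.items():
            x[..., h * head_dim:(h + 1) * head_dim] *= s
        return (x,)    # returning a tuple replaces the module's input
    return hook

# Toy stand-in for an attention block's output projection (4 heads, dim 16).
num_heads, head_dim = 4, 16
out_proj = nn.Linear(num_heads * head_dim, num_heads * head_dim)

# Ablate head 1, emphasize head 3; indices and factors are illustrative.
handle = out_proj.register_forward_pre_hook(
    scale_heads_hook(head_dim, {1: 0.0, 3: 1.5}))
y = out_proj(torch.randn(2, 10, num_heads * head_dim))  # intervened forward pass
handle.remove()  # detach the hook to restore the original behavior
```

In a real VLM, the same hook would be registered on the output projection of the targeted layer, with the head indices taken from the probing step.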