Visuospatial Perspective Taking in Multimodal Language Models

arXiv cs.CL / March 26, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that the perspective-taking abilities of multimodal language models (MLMs) in visuospatial contexts are insufficiently evaluated, since existing benchmarks focus on text-only vignettes or static scene understanding.
  • It introduces two evaluation tasks adapted from human studies, the Director Task (referential communication) and the Rotating Figure Task (varying angular disparities), to measure visuospatial perspective taking (VPT); a toy sketch of the Director Task setup follows this list.
  • Across both tasks, MLMs exhibit notable weaknesses at Level 2 VPT, which specifically involves suppressing the model’s own perspective to adopt another’s.
  • The findings suggest current MLMs struggle to accurately represent and reason about alternative viewpoints, raising concerns for deployment in social and collaborative scenarios.
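
To make the first paradigm concrete, here is a minimal, hypothetical sketch of a Director-Task-style referential check (not the paper's actual stimuli or code; the slot, object, and function names are illustrative). The director stands behind a shelf grid in which some slots are occluded from their side, so an ambiguous instruction such as "move the small ball" must be resolved against only the objects the director can see.

```python
# Hypothetical Director-Task-style setup (illustrative names, not the paper's code).
# The director stands behind the shelf grid, so occluded slots are invisible to them.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    obj: Optional[str]   # e.g. "ball_small", "ball_medium", or None if empty
    occluded: bool       # True if a back panel hides this slot from the director


def director_view(grid: list[Slot]) -> set[str]:
    """What the director can see at all (roughly the Level 1 VPT question)."""
    return {s.obj for s in grid if s.obj and not s.occluded}


def resolve_referent(grid: list[Slot], candidates: list[str]) -> Optional[str]:
    """Resolve an ambiguous referent from the director's perspective:
    the intended object must lie in the director's view."""
    visible = director_view(grid)
    matches = [c for c in candidates if c in visible]
    return matches[0] if len(matches) == 1 else None


# The smallest ball is hidden from the director, so "the small ball" should map to
# ball_medium; an egocentric (own-perspective) responder would pick ball_small.
grid = [
    Slot("ball_small", occluded=True),
    Slot("ball_medium", occluded=False),
    Slot("ball_large", occluded=False),
]
assert resolve_referent(grid, ["ball_small", "ball_medium"]) == "ball_medium"
```

Picking ball_small here is exactly the egocentric error this paradigm is designed to catch: the responder answers from its own view instead of the director's.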

Abstract

As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.
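
As a rough illustration of how a Rotating-Figure-style probe could be scored (a sketch under assumptions; the paper's actual stimuli, prompts, and metrics may differ), one can derive the ground-truth answer to a question like "is the object to the figure's left or right?" by rotating the object's bearing into the figure's reference frame, then aggregate model accuracy per angular disparity:

```python
# Hypothetical scoring sketch for a Rotating-Figure-style probe (illustrative only).
from collections import defaultdict


def ground_truth_side(object_bearing_deg: float, figure_heading_deg: float) -> str:
    """Side of the object relative to the figure: rotate the object's bearing
    (measured in the viewer's frame, 0 deg = straight ahead) into the figure's frame.
    Exact ahead/behind cases are broken toward "right" for this sketch."""
    relative = (object_bearing_deg - figure_heading_deg) % 360
    return "left" if 180 < relative < 360 else "right"


def accuracy_by_disparity(trials):
    """trials: iterable of (figure_heading_deg, object_bearing_deg, model_answer)."""
    correct, total = defaultdict(int), defaultdict(int)
    for heading, bearing, answer in trials:
        disparity = min(heading % 360, 360 - heading % 360)  # 0..180 deg from the viewer
        total[disparity] += 1
        correct[disparity] += int(answer == ground_truth_side(bearing, heading))
    return {d: correct[d] / total[d] for d in sorted(total)}


# At 0 deg the figure shares the viewer's frame; at 180 deg the egocentric answer is
# exactly reversed, which is where Level 2 VPT failures are expected to surface.
trials = [(0, 90, "right"), (180, 90, "right"), (180, 90, "left")]
print(accuracy_by_disparity(trials))  # {0: 1.0, 180: 0.5}
```

Plotting such per-disparity accuracies would make an egocentric bias visible as a drop toward larger angular disparities, which is the kind of Level 2 VPT deficit the abstract describes.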