Visuospatial Perspective Taking in Multimodal Language Models

arXiv cs.CL / March 26, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that the perspective-taking abilities of multimodal language models (MLMs) in visuospatial contexts are insufficiently evaluated, since existing benchmarks focus on text-only vignettes or static scene understanding.
  • It introduces two evaluation tasks adapted from human studies, the Director Task (referential communication) and the Rotating Figure Task (varying angular disparities), to measure visuospatial perspective taking (VPT); a toy sketch of the Director Task setup follows this list.
  • Across both tasks, MLMs exhibit notable weaknesses at Level 2 VPT, which specifically involves suppressing the model’s own perspective to adopt another’s.
  • The findings suggest current MLMs struggle to accurately represent and reason about alternative viewpoints, raising concerns for deployment in social and collaborative scenarios.
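
To make the first paradigm concrete, here is a minimal, hypothetical sketch of a Director-Task-style referential check (not the paper's actual stimuli or code; the slot, object, and function names are illustrative). The director stands behind a shelf grid in which some slots are occluded from their side, so an ambiguous instruction such as "move the small ball" must be resolved against only the objects the director can see.

```python
# Hypothetical Director-Task-style setup (illustrative names, not the paper's code).
# The director stands behind the shelf grid, so occluded slots are invisible to them.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    obj: Optional[str]   # e.g. "ball_small", "ball_medium", or None if empty
    occluded: bool       # True if a back panel hides this slot from the director


def director_view(grid: list[Slot]) -> set[str]:
    """What the director can see at all (roughly the Level 1 VPT question)."""
    return {s.obj for s in grid if s.obj and not s.occluded}


def resolve_referent(grid: list[Slot], candidates: list[str]) -> Optional[str]:
    """Resolve an ambiguous referent from the director's perspective:
    the intended object must lie in the director's view."""
    visible = director_view(grid)
    matches = [c for c in candidates if c in visible]
    return matches[0] if len(matches) == 1 else None


# The smallest ball is hidden from the director, so "the small ball" should map to
# ball_medium; an egocentric (own-perspective) responder would pick ball_small.
grid = [
    Slot("ball_small", occluded=True),
    Slot("ball_medium", occluded=False),
    Slot("ball_large", occluded=False),
]
assert resolve_referent(grid, ["ball_small", "ball_medium"]) == "ball_medium"
```

Picking ball_small here is exactly the egocentric error this paradigm is designed to catch: the responder answers from its own view instead of the director's.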

Abstract

As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.
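
As a rough illustration of how a Rotating-Figure-style probe could be scored (a sketch under assumptions; the paper's actual stimuli, prompts, and metrics may differ), one can derive the ground-truth answer to a question like "is the object to the figure's left or right?" by rotating the object's bearing into the figure's reference frame, then aggregate model accuracy per angular disparity:

```python
# Hypothetical scoring sketch for a Rotating-Figure-style probe (illustrative only).
from collections import defaultdict


def ground_truth_side(object_bearing_deg: float, figure_heading_deg: float) -> str:
    """Side of the object relative to the figure: rotate the object's bearing
    (measured in the viewer's frame, 0 deg = straight ahead) into the figure's frame.
    Exact ahead/behind cases are broken toward "right" for this sketch."""
    relative = (object_bearing_deg - figure_heading_deg) % 360
    return "left" if 180 < relative < 360 else "right"


def accuracy_by_disparity(trials):
    """trials: iterable of (figure_heading_deg, object_bearing_deg, model_answer)."""
    correct, total = defaultdict(int), defaultdict(int)
    for heading, bearing, answer in trials:
        disparity = min(heading % 360, 360 - heading % 360)  # 0..180 deg from the viewer
        total[disparity] += 1
        correct[disparity] += int(answer == ground_truth_side(bearing, heading))
    return {d: correct[d] / total[d] for d in sorted(total)}


# At 0 deg the figure shares the viewer's frame; at 180 deg the egocentric answer is
# exactly reversed, which is where Level 2 VPT failures are expected to surface.
trials = [(0, 90, "right"), (180, 90, "right"), (180, 90, "left")]
print(accuracy_by_disparity(trials))  # {0: 1.0, 180: 0.5}
```

Plotting such per-disparity accuracies would make an egocentric bias visible as a drop toward larger angular disparities, which is the kind of Level 2 VPT deficit the abstract describes.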