CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs
arXiv cs.CV / 3/24/2026
Key Points
- The paper introduces CVT-Bench, a synthetic benchmark that tests whether multimodal LLMs maintain stable relational/spatial representations when the hypothetical camera viewpoint changes via counterfactual orbit transformations, without re-rendering any images (a minimal sketch of the setup follows this list).
- Across 100 scenes and 6,000 relational queries, even state-of-the-art MLLMs degrade noticeably under viewpoint changes, showing frequent cycle-consistency violations and rapid decay in relational stability.
- The study finds that representation choice matters: more structured inputs (e.g., textual bounding boxes and, especially, scene graphs) yield better viewpoint stability than less structured visual inputs (the second sketch below illustrates the contrast).
- The results indicate that strong single-view spatial accuracy can overstate robustness, because the induced spatial representations may be unstable under counterfactual viewpoint reasoning.
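
To make the setup concrete, here is a minimal sketch of a counterfactual orbit transformation and the cycle-consistency check it enables. Everything below (objects as world-frame points, a camera orbiting the scene origin, a simple left/right predicate) is an illustrative assumption, not the paper's actual protocol.

```python
import numpy as np

def camera_right_axis(azimuth_deg, radius=5.0, height=1.5):
    """Horizontal 'right' axis of a camera orbiting the origin at the
    given azimuth, looking at the origin with +z as up."""
    theta = np.deg2rad(azimuth_deg)
    eye = np.array([radius * np.cos(theta), radius * np.sin(theta), height])
    forward = -eye / np.linalg.norm(eye)   # camera looks at the origin
    up = np.array([0.0, 0.0, 1.0])
    right = np.cross(forward, up)          # horizontal by construction
    return right / np.linalg.norm(right)

def left_or_right(a_world, b_world, azimuth_deg):
    """Ground-truth egocentric relation of object a w.r.t. object b
    as seen from the orbited viewpoint (no re-rendering needed)."""
    right = camera_right_axis(azimuth_deg)
    return "left" if np.dot(a_world - b_world, right) < 0 else "right"

mug = np.array([1.0, 0.0, 0.0])    # world-frame object positions
book = np.array([-1.0, 0.0, 0.0])

# The correct answer flips as the hypothetical camera orbits the scene:
for az in (45, 135, 225, 315):
    print(f"azimuth {az:3d}: mug is {left_or_right(mug, book, az)} of book")

# Cycle consistency: orbiting by +90 degrees and then by -90 degrees must
# return the answer for the base view. The ground truth satisfies this by
# construction; the paper reports that MLLM answers frequently do not.
base = 45
assert left_or_right(mug, book, base + 90 - 90) == left_or_right(mug, book, base)
```

The point of the counterfactual setup is that ground truth for every orbited view can be computed geometrically from a single annotated scene, which is why no re-rendering is required.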
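
And here is a hedged illustration of what the structured-input conditions might look like as prompt text. The schema and field names below are invented for illustration and are not taken from the paper.

```python
import json

# Bounding-box condition: 2D pixel coordinates are tied to the rendered
# view, so relations read off them hold only for that one viewpoint.
bbox_prompt = json.dumps({
    "mug":  {"bbox_xyxy": [412, 310, 488, 402]},
    "book": {"bbox_xyxy": [120, 330, 260, 410]},
}, indent=2)

# Scene-graph condition: allocentric (world-frame) positions plus explicit
# relations. The same text can, in principle, support queries from any
# hypothetical camera orbit, which is one plausible reason the paper finds
# scene graphs improve viewpoint stability.
scene_graph_prompt = json.dumps({
    "objects": {
        "mug":  {"position_xyz": [1.0, 0.0, 0.0]},
        "book": {"position_xyz": [-1.0, 0.0, 0.0]},
    },
    "relations": [
        {"subject": "mug", "predicate": "right_of", "object": "book",
         "frame": "canonical_view"},
    ],
}, indent=2)

print(bbox_prompt)
print(scene_graph_prompt)
```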