Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding

arXiv cs.RO / 4/24/2026

Key Points

  • The paper evaluates how well vision-language models (VLMs) perform single-view object captioning for tabletop scenes, using a physical “domain shift” between real tools and visually similar 3D-printed replicas.
  • Across multiple metrics, the authors find that VLMs can accurately describe common real-world objects, but performance drops substantially for 3D-printed items with altered texture, color, and materials.
  • The study identifies weaknesses in standard evaluation metrics, including cases where metrics fail to detect domain shifts or favor fluent captions that are nonetheless factually wrong.
  • The findings suggest important limitations when deploying foundation models in embodied robotic agents and motivate more robust model designs and evaluation protocols for physical environments.

Abstract

Robotic scene understanding increasingly relies on Vision-Language Models (VLMs) to generate natural language descriptions of the environment. In this work, we systematically evaluate single-view object captioning for tabletop scenes captured by a robotic manipulator, introducing a controlled physical domain shift that contrasts real-world tools with geometrically similar 3D-printed counterparts that differ in texture, colour, and material. We benchmark a suite of state-of-the-art, locally deployable VLMs across multiple metrics to assess semantic alignment and factual grounding. Our results demonstrate that while VLMs describe common real-world objects effectively, performance degrades markedly on 3D-printed items despite their structurally familiar forms. We further expose critical vulnerabilities in standard evaluation metrics, showing that some fail to detect domain shifts entirely or reward fluent but factually incorrect captions. These findings highlight the limitations of deploying foundation models for embodied agents and the need for more robust architectures and evaluation protocols in physical robotic applications.
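The metric vulnerability the authors describe — fluent but factually wrong captions being rewarded — is easy to reproduce with surface-overlap metrics. The sketch below is a toy illustration, not the paper's actual evaluation: the captions, object names, and the simplified clipped-unigram-precision function (BLEU-1 without a brevity penalty) are all assumptions chosen for demonstration.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision (BLEU-1 without the brevity penalty):
    the fraction of candidate tokens that also appear in the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / sum(cand_counts.values())

# Hypothetical ground-truth caption for a tabletop scene.
reference = "a red metal wrench on a wooden table"

# Fluent but factually wrong: the object class and material are misidentified.
wrong = "a red plastic toy on a wooden table"
# Semantically faithful but phrased with synonyms.
right = "a crimson spanner on a table"

print(unigram_precision(wrong, reference))   # 0.75
print(unigram_precision(right, reference))   # ~0.67
```

Here the caption that misidentifies the object outscores the one that describes it correctly with different words, which is exactly the failure mode that motivates metrics with stronger factual grounding.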