Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
arXiv cs.RO / 4/24/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper evaluates how well vision-language models (VLMs) perform single-view object captioning for tabletop scenes, using a physical “domain shift” between real tools and visually similar 3D-printed replicas.
- Across multiple metrics, the authors find that VLMs accurately describe common real-world objects, but performance drops substantially on 3D-printed replicas with altered texture, color, and material.
- The study identifies weaknesses in standard evaluation metrics, including cases where metrics fail to detect domain shifts or favor fluent captions that are nonetheless factually wrong.
- The findings suggest important limitations when deploying foundation models in embodied robotic agents and motivate more robust model designs and evaluation protocols for physical environments.
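The metric weakness noted above can be illustrated with a toy n-gram overlap score. The snippet below is a simplified ROUGE-1-style recall, not the paper's actual evaluation pipeline, and the captions are hypothetical: a fluent caption that mislabels the material outscores a terse but factually correct one.

```python
from collections import Counter

def unigram_recall(candidate: str, reference: str) -> float:
    """ROUGE-1-style recall: fraction of reference tokens covered
    by the candidate, with per-token clipping."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / sum(ref.values())

reference = "a metal hammer on a wooden table"

# Fluent but factually wrong: gets the material wrong,
# the kind of error the paper says metrics can reward.
fluent_wrong = "a red plastic hammer on a wooden table"
# Terse but correct.
terse_correct = "a metal hammer"

print(round(unigram_recall(fluent_wrong, reference), 2))   # ~0.86
print(round(unigram_recall(terse_correct, reference), 2))  # ~0.43
```

The wrong caption wins on pure overlap because it shares more surface tokens with the reference, which is exactly why fluency-biased metrics can mask caption errors under domain shift.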
Related Articles

Your MCP server probably has too many tools
Dev.to

MCP Auth That Actually Works: OAuth for Remote Servers
Dev.to

GoDavaii's Day 5: When 22 Indian Languages Redefine 'Hard' in Health AI
Dev.to

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
Reddit r/LocalLLaMA
South Korea arrests man over fake AI-generated image of the wolf Neukgu: up to 5 years
Dev.to