Assessing VLM-Driven Semantic-Affordance Inference for Non-Humanoid Robot Morphologies

arXiv cs.RO · April 22, 2026


Key Points

  • The study evaluates whether vision-language models (VLMs) can infer affordances for robots with non-humanoid morphologies, an area that has been largely underexplored.
  • The authors build a hybrid dataset combining real-world annotated affordance–object relations with VLM-generated synthetic scenarios to support cross-category, cross-morphology experiments.
  • Results show that VLMs generalize to non-humanoid robot forms, but their affordance-inference performance varies widely across object domains.
  • Across all robot morphologies and object categories, the models exhibit a consistent error profile: low false positive rates but high false negative rates, implying conservative affordance predictions.
  • The conservative bias is especially strong for novel tool-use scenarios and unusual object manipulations, suggesting that robotic deployments will likely need complementary methods to reduce overly cautious behavior while maintaining safety.

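The error profile described above can be made concrete with the standard false-positive-rate and false-negative-rate definitions. The counts below are purely illustrative (the paper does not publish these numbers); they only show the asymmetry the authors report, where a low FPR coexists with a high FNR:

```python
# Hypothetical confusion-matrix counts for a binary "is this affordance
# available to this robot?" classifier. Illustrative values, NOT from the paper.
tp, fp, fn, tn = 40, 3, 25, 80

false_positive_rate = fp / (fp + tn)  # infeasible actions wrongly accepted
false_negative_rate = fn / (fn + tp)  # feasible actions wrongly rejected

print(f"FPR = {false_positive_rate:.3f}")  # low: rarely claims an impossible action works
print(f"FNR = {false_negative_rate:.3f}")  # high: often rejects actions that would work
```

A model with this profile is safe (it seldom commands infeasible actions) but over-cautious (it leaves many feasible actions unused), which is the trade-off the study highlights.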
Abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding human-object interactions, but their application to robotic systems with non-humanoid morphologies remains largely unexplored. This work investigates whether VLMs can effectively infer affordances for robots with fundamentally different embodiments than humans, addressing a critical gap in the deployment of these models for diverse robotic applications. We introduce a novel hybrid dataset that combines annotated real-world robotic affordance-object relations with VLM-generated synthetic scenarios, and perform an empirical analysis of VLM performance across multiple object categories and robot morphologies, revealing significant variations in affordance inference capabilities. Our experiments demonstrate that while VLMs show promising generalisation to non-humanoid robot forms, their performance is notably inconsistent across different object domains. Critically, we identify a consistent pattern of low false positive rates but high false negative rates across all morphologies and object categories, indicating that VLMs tend toward conservative affordance predictions. Our analysis reveals that this pattern is particularly pronounced for novel tool use scenarios and unconventional object manipulations, suggesting that effective integration of VLMs in robotic systems requires complementary approaches to mitigate over-conservative behaviour while preserving the inherent safety benefits of low false positive rates.
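The abstract argues that deployments need complementary approaches to mitigate over-conservative behaviour. One generic option (a sketch of a common calibration idea, not a method proposed by the paper) is to relax the confidence threshold at which a VLM-predicted affordance is accepted, trading some of the excess false negatives for additional accepted actions:

```python
def accept_affordance(confidence: float, threshold: float = 0.5) -> bool:
    """Accept a VLM-predicted affordance if its confidence clears the threshold."""
    return confidence >= threshold

# Hypothetical per-affordance confidence scores from a VLM (illustrative only).
scores = [0.35, 0.55, 0.72, 0.48]

conservative = [accept_affordance(s, threshold=0.7) for s in scores]
relaxed = [accept_affordance(s, threshold=0.4) for s in scores]

# Lowering the threshold converts some rejections (potential false negatives)
# into accepted actions, at the cost of possibly more false positives.
print(sum(conservative), sum(relaxed))  # → 1 3
```

In practice the threshold would be tuned on validation data per object domain, since the paper finds that inference quality varies substantially across domains.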