Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

arXiv cs.RO / April 21, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that Vision-Language-Action (VLA) models’ high benchmark scores can be misleading, because success may not reflect true embodied reasoning.
  • It introduces BeTTER, a diagnostic benchmark that applies causal interventions (spatial layout shifts and temporal changes) together with “kinematic isolation” to separate high-level reasoning failures from low-level control limits; a minimal sketch of this protocol follows the list.
  • Systematic tests show that state-of-the-art VLAs fail catastrophically in dynamic settings, relying on lexical-kinematic shortcuts, exhibiting behavioral inertia, and suffering semantic feature collapse.
  • Mechanistic analysis links these problems to architectural bottlenecks—especially capacity compression and myopic downsampling—which degrade the models’ core semantic representations.
  • Real-world robotic validation suggests that the representational breakdown is not a simulation artifact and that static evaluation protocols can hide these issues by encouraging overfitting to sensorimotor priors.
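To make the intervention protocol concrete, here is a minimal sketch of how such an evaluation loop could be scripted. It is an illustration of the idea under assumed interfaces, not the paper's actual harness: `policy_rollout`, `oracle_rollout`, and `make_task` are hypothetical placeholders. The core of kinematic isolation is that a failure only counts as a reasoning failure when a scripted oracle controller succeeds on the exact same task instance, proving the required motion was executable.

```python
# Minimal sketch of a causal-intervention evaluation with kinematic
# isolation. All names here are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trial:
    intervention: str     # "none", "spatial_shift", "temporal_extrapolation", ...
    policy_success: bool  # did the learned VLA policy complete the task?
    oracle_success: bool  # did a scripted controller complete the same instance?

def evaluate(policy_rollout: Callable[[object], bool],
             oracle_rollout: Callable[[object], bool],
             make_task: Callable[[int, str], object],
             interventions: list[str],
             n_episodes: int) -> list[Trial]:
    """Run matched episodes with and without causal interventions."""
    trials = []
    for seed in range(n_episodes):
        for iv in ["none"] + interventions:
            task = make_task(seed, iv)  # same seed, so instances are matched
            trials.append(Trial(iv, policy_rollout(task), oracle_rollout(task)))
    return trials

def reasoning_failure_rate(trials: list[Trial], intervention: str) -> float:
    """Fraction of kinematically feasible episodes the policy still fails.

    Kinematic isolation: only episodes the scripted oracle solved count,
    so remaining failures cannot be blamed on low-level execution limits.
    """
    feasible = [t for t in trials
                if t.intervention == intervention and t.oracle_success]
    if not feasible:
        return float("nan")
    return sum(not t.policy_success for t in feasible) / len(feasible)
```

Comparing `reasoning_failure_rate` between the "none" condition and each intervention then attributes any gap to high-level reasoning rather than to control.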

Abstract

Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks, such as capacity compression and myopic downsampling, which systematically degrade the models' foundational semantic representations. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.
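Two of the reported failure modes also suggest simple, model-agnostic probes. The sketch below assumes hypothetical interfaces `policy(obs, instr)` returning an action array and `encode(obs, instr)` returning a feature vector; it is illustrative only and does not reflect the paper's actual methodology or any specific VLA's API.

```python
# Hypothetical probes for two symptoms described in the abstract.
# `policy` and `encode` are assumed interfaces, not a real library API.
import numpy as np

def shortcut_score(policy, obs_seq, instr_a: str, instr_b: str) -> float:
    """Lexical-kinematic shortcut / behavioral inertia probe.

    Mean L2 distance between actions predicted for the same observations
    under two semantically different instructions. Near-zero values mean
    the instruction barely changes the action output.
    """
    return float(np.mean([
        np.linalg.norm(policy(obs, instr_a) - policy(obs, instr_b))
        for obs in obs_seq
    ]))

def collapse_score(encode, obs, instructions: list[str]) -> float:
    """Semantic feature collapse probe.

    Mean pairwise cosine similarity of internal features for one scene
    under many distinct instructions; values near 1.0 suggest the
    instruction dimension has collapsed out of the representation.
    """
    feats = [encode(obs, i) for i in instructions]
    feats = [f / (np.linalg.norm(f) + 1e-8) for f in feats]
    if len(feats) < 2:
        return float("nan")
    sims = [float(feats[i] @ feats[j])
            for i in range(len(feats)) for j in range(i + 1, len(feats))]
    return float(np.mean(sims))
```

Under this reading, a near-zero shortcut score across semantically different instructions, or a collapse score near 1.0, would mirror the lexical-kinematic shortcut and semantic feature collapse symptoms the paper reports.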