SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

arXiv cs.RO / 4/1/2026


Key Points

  • SIMPACT addresses the problem that VLMs lack an understanding of causal physical dynamics by proposing a "simulation-in-the-loop action planning" framework that supplements the VLM's physical understanding at test time via simulation.
  • Without any additional training, it efficiently constructs a physics simulation (world model) from a single RGB-D observation, enabling the VLM to propose actions, observe simulated rollouts, and iteratively refine its reasoning.
  • By integrating language reasoning with physics prediction, it aims to understand and plan over contact dynamics and action outcomes in a physically grounded way.
  • It reports state-of-the-art performance on five challenging real-world rigid-body and deformable-object manipulation tasks, outperforming general-purpose robotic manipulation models.
  • The authors conclude that embedding physics understanding into VLM reasoning through efficient test-time simulation is a promising path toward more generalizable embodied intelligence.

Abstract

Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. The project webpage can be found at https://simpact-bot.github.io
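The propose → simulate → refine cycle described in the abstract can be sketched as a generic test-time planning loop. This is a minimal illustration only: the function names (`propose_actions`, `simulate_rollout`, `plan`) are hypothetical stand-ins, not the authors' API, and the VLM proposer and physics engine are replaced by toy stubs so the control flow is runnable.

```python
# Hedged sketch of simulation-in-the-loop action planning.
# Assumptions (not from the paper): the VLM is modeled as a proposal
# heuristic, and the physics simulator as a scalar outcome function.
from dataclasses import dataclass


@dataclass
class Rollout:
    """Result of simulating one candidate action."""
    action: float
    outcome: float  # e.g. remaining distance of the object to the goal


def propose_actions(feedback):
    """Stub for VLM action proposal: perturb around the best action so far,
    conditioned on feedback from earlier simulated rollouts."""
    center = feedback[-1].action if feedback else 0.0
    return [center - 0.5, center, center + 0.5]


def simulate_rollout(action, goal=2.0):
    """Stub for a physics rollout: outcome is distance to a hidden goal."""
    return Rollout(action, abs(goal - action))


def plan(n_iters=5):
    """Iteratively propose candidates, simulate them, and keep the best,
    feeding the result back into the next round of proposals."""
    feedback = []
    for _ in range(n_iters):
        candidates = [simulate_rollout(a) for a in propose_actions(feedback)]
        feedback.append(min(candidates, key=lambda r: r.outcome))
    return feedback[-1]


best = plan()
print(best.action, best.outcome)  # converges toward the goal at 2.0
```

In the actual framework the proposal step would query a VLM with the observation and prior rollouts, and the rollout step would run a physics simulation reconstructed from a single RGB-D frame; the loop structure above is only the test-time refinement pattern the abstract describes.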