Beyond Static Vision: Scene Dynamic Field Unlocks Intuitive Physics Understanding in Multi-modal Large Language Models

arXiv cs.CV / 4/7/2026


Key Points

  • While multimodal large language models (MLLMs) are strong at image and video understanding, they have been shown to struggle markedly with high-level physical reasoning, especially understanding the dynamics of continuum objects such as fluids.
  • To isolate and evaluate this weakness, the authors introduce two benchmark tasks, Next Frame Selection (NFS) and Temporal Coherence Verification (TCV), and report that even state-of-the-art MLLMs perform poorly on these foundational tasks.
  • As a remedy, they propose Scene Dynamic Field (SDF), which leverages physics simulators within a multi-task fine-tuning framework and substantially lifts performance.
  • SDF yields gains of up to 20.7% on fluid tasks and also generalizes strongly to unseen physical domains.
  • Code and data are publicly available, pointing to a promising, cost-efficient direction for building physically grounded MLLMs.

Abstract

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in image and video understanding, their ability to comprehend the physical world has become an increasingly important research focus. Despite their improvements, current MLLMs struggle significantly with high-level physics reasoning. In this work, we investigate the first step of physical reasoning, i.e., intuitive physics understanding, revealing substantial limitations in understanding the dynamics of continuum objects. To isolate and evaluate this specific capability, we introduce two fundamental benchmark tasks: Next Frame Selection (NFS) and Temporal Coherence Verification (TCV). Our experiments demonstrate that even state-of-the-art MLLMs perform poorly on these foundational tasks. To address this limitation, we propose Scene Dynamic Field (SDF), a concise approach that leverages physics simulators within a multi-task fine-tuning framework. SDF substantially improves performance, achieving up to 20.7% gains on fluid tasks while showing strong generalization to unseen physical domains. This work not only highlights a critical gap in current MLLMs but also presents a promising cost-efficient approach for developing more physically grounded MLLMs. Our code and data are available at https://github.com/andylinx/Scene-Dynamic-Field.
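The NFS benchmark described above is framed as selecting the physically correct next frame from candidates. As a minimal sketch of how accuracy on such a multiple-choice task could be scored (the `evaluate_nfs` helper, the example schema, and the random baseline are illustrative assumptions, not the authors' released evaluation code):

```python
import random

def evaluate_nfs(model, examples):
    """Score a model on Next-Frame-Selection style multiple-choice items.

    Each example is a dict with:
      "context": the observed frame sequence (any representation),
      "choices": candidate next frames,
      "answer":  index of the physically correct continuation.
    `model(context, choices)` must return the index it selects.
    """
    correct = 0
    for ex in examples:
        pred = model(ex["context"], ex["choices"])
        if pred == ex["answer"]:
            correct += 1
    return correct / len(examples)

# A trivial baseline that guesses uniformly at random among the choices.
def random_baseline(context, choices):
    return random.randrange(len(choices))
```

With two candidate frames per item, `random_baseline` scores near 50% in expectation, which is the chance level against which MLLM performance on such a task would be compared.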