Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World
arXiv cs.CV / 3/16/2026
Key Points
- The paper proposes Dyn-Bench, a large-scale, multi-stage-filtered benchmark for evaluating multimodal LLMs (MLLMs) on spatio-temporal dynamics. Built from diverse real-world and synthetic video data, it comprises 1k videos, 7k VQA pairs, and 3k dynamic object grounding pairs.
- The benchmark enables robust evaluation of general, spatial, and region-level understanding, probing how MLLMs perceive, track, and reason about evolving scenes in a 4D world.
- The study finds that existing models struggle to perform well on both spatio-temporal reasoning and dynamic object grounding at once, often producing inconsistent interpretations of motion and object interactions.
- Conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, while structured integration approaches like Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM) significantly boost dynamics perception and reasoning.
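To make the contrast between plain prompting and structured integration concrete, here is a minimal sketch of what a "spatio-temporal textual cognitive map" might look like: per-frame object states serialized into a structured text block that is prepended to an MLLM prompt. The function name, track format, and map layout are all illustrative assumptions, not the paper's actual ST-TCM implementation.

```python
# Hypothetical sketch: serialize object tracks into a textual spatio-temporal
# map for prompting. Format and names are assumptions, not the paper's ST-TCM.

def build_st_tcm(tracks):
    """tracks: {object_name: [(frame, x, y), ...]} with normalized coords.

    Returns a structured text block listing each object's position per frame,
    which can be prepended to a question prompt for an MLLM.
    """
    lines = ["[Spatio-Temporal Map]"]
    for obj, states in sorted(tracks.items()):
        for t, x, y in states:
            lines.append(f"t={t}: {obj} at ({x:.2f}, {y:.2f})")
    return "\n".join(lines)

# Toy example: two objects tracked over two frames.
tracks = {
    "red car": [(0, 0.10, 0.50), (1, 0.25, 0.52)],
    "pedestrian": [(0, 0.80, 0.40), (1, 0.78, 0.41)],
}
print(build_st_tcm(tracks))
```

The idea, under these assumptions, is that making motion explicit in text (rather than leaving it implicit across video frames) gives the model a persistent, queryable record of scene dynamics.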