MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

arXiv cs.CV / 4/10/2026


Key Points

  • MotionScape is a new large-scale real-world UAV-view video dataset created to improve world models’ ability to predict complex 3D dynamics under fast, unconstrained 6-DoF camera motion.
  • The dataset includes 30+ hours of 4K videos (4.5M+ frames) with semantically and geometrically aligned samples, pairing each video with accurate 6-DoF camera trajectories and fine-grained natural-language descriptions.
  • Its construction uses an automated multi-stage pipeline combining CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and LLM-driven semantic annotation.
  • Experiments reported in the paper indicate that the aligned semantic/geometric annotations improve existing world models’ simulation quality for complex 3D dynamics and large viewpoint shifts, supporting better UAV planning and decision-making.
  • MotionScape is publicly available on GitHub, enabling researchers to train and evaluate UAV-oriented world models with realistic motion priors.
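The CLIP-based relevance filtering stage mentioned above can be sketched as a cosine-similarity threshold between frame embeddings and a text-prompt embedding. This is a minimal illustration with synthetic embeddings standing in for real CLIP features; the function name and threshold are assumptions, not the paper's actual pipeline code:

```python
import numpy as np

def clip_relevance_filter(frame_embeds, text_embed, threshold=0.25):
    """Keep frames whose image embedding has cosine similarity to the
    text-prompt embedding above `threshold` (hypothetical helper).

    frame_embeds: (N, D) array of per-frame embeddings
    text_embed:   (D,) prompt embedding
    Returns kept frame indices and all similarity scores."""
    f = frame_embeds / np.linalg.norm(frame_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    sims = f @ t                       # cosine similarities, shape (N,)
    return np.flatnonzero(sims >= threshold), sims

# Toy demo: three "relevant" frames near the prompt, two unrelated ones.
rng = np.random.default_rng(0)
text = rng.normal(size=64)
relevant = text + 0.1 * rng.normal(size=(3, 64))   # close to the prompt
irrelevant = rng.normal(size=(2, 64))              # unrelated content
frames = np.vstack([relevant, irrelevant])
idx, sims = clip_relevance_filter(frames, text, threshold=0.5)
print(idx)  # indices of the three relevant frames
```

In a real pipeline the embeddings would come from a pretrained CLIP image/text encoder rather than random vectors.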

Abstract

Recent advances in world models have demonstrated strong capabilities in simulating physical reality, making them an increasingly important foundation for embodied intelligence. For UAV agents in particular, accurate prediction of complex 3D dynamics is essential for autonomous navigation and robust decision-making in unconstrained environments. However, under the highly dynamic camera trajectories typical of UAV views, existing world models often struggle to maintain spatiotemporal physical consistency. A key reason lies in the distribution bias of current training data: most existing datasets exhibit restricted 2.5D motion patterns, such as ground-constrained autonomous driving scenes or relatively smooth human-centric egocentric videos, and therefore lack realistic, highly dynamic 6-DoF UAV motion priors. To address this gap, we present MotionScape, a large-scale real-world UAV-view video dataset with highly dynamic motion for world modeling. MotionScape contains over 30 hours of 4K UAV-view videos, totaling more than 4.5M frames. The dataset features semantically and geometrically aligned training samples, in which diverse real-world UAV videos are tightly coupled with accurate 6-DoF camera trajectories and fine-grained natural-language descriptions. To build the dataset, we develop an automated multi-stage processing pipeline that integrates CLIP-based relevance filtering, temporal segmentation, robust visual SLAM for trajectory recovery, and large-language-model-driven semantic annotation. Extensive experiments show that incorporating such semantically and geometrically aligned annotations effectively improves the ability of existing world models to simulate complex 3D dynamics and handle large viewpoint shifts, thereby benefiting decision-making and planning for UAV agents in complex environments. The dataset is publicly available at https://github.com/Thelegendzz/MotionScape.
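The "semantically and geometrically aligned" samples described in the abstract pair each video clip with its 6-DoF camera trajectory and a caption. A rough sketch of such a pairing, using 4×4 SE(3) camera-to-world matrices, is shown below; the field names and pose convention are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MotionScapeSample:        # hypothetical schema, not the official loader
    frames: np.ndarray          # (T, H, W, 3) RGB video clip
    poses: np.ndarray           # (T, 4, 4) camera-to-world 6-DoF poses (SE(3))
    caption: str                # fine-grained natural-language description

def relative_pose(poses, i, j):
    """Pose of frame j expressed in frame i's camera coordinates."""
    return np.linalg.inv(poses[i]) @ poses[j]

# Toy example: a camera translating 1 m along x per frame.
T = 4
poses = np.tile(np.eye(4), (T, 1, 1))
poses[:, 0, 3] = np.arange(T, dtype=float)
sample = MotionScapeSample(
    frames=np.zeros((T, 8, 8, 3), dtype=np.uint8),
    poses=poses,
    caption="drone flies forward over a field",
)
rel = relative_pose(sample.poses, 0, 3)
print(rel[0, 3])  # 3.0 — frame 3 is 3 m ahead of frame 0 along x
```

Relative poses like this are what a world model conditions on when predicting frames under large viewpoint shifts.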