Towards Generalizable Robotic Manipulation in Dynamic Environments

arXiv cs.RO / 4/16/2026


Key Points

  • VLA models are found to perform poorly on robotic manipulation tasks in dynamic environments, largely due to the scarcity of dynamic manipulation datasets and their reliance on single-frame observations, which limits spatiotemporal reasoning.
  • The paper introduces DOMINO, a large-scale dataset and benchmark with 35 tasks, hierarchical difficulty, 110K+ expert trajectories, and a multi-dimensional evaluation suite to study generalizable dynamic manipulation.
  • It evaluates existing VLA systems on dynamic tasks, tests training strategies for improving dynamic awareness, and shows that training on dynamic data can also improve transfer to static manipulation.
  • The authors propose PUMA, a dynamics-aware VLA architecture that uses scene-centric historical optical flow plus world queries for implicit short-horizon prediction of object-centric future states.
  • PUMA achieves state-of-the-art results, improving success rate by 6.3% absolute over baselines, and the authors release code and data via GitHub.

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
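To make the PUMA design concrete, here is a minimal, purely illustrative sketch of the two ideas the abstract names: scene-centric historical optical flow as a motion signal, and learnable "world queries" that attend over that history to implicitly predict object-centric future states. All names, dimensions, and the frame-difference stand-in for optical flow are hypothetical simplifications, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def motion_features(frames):
    # Crude stand-in for scene-centric historical optical flow:
    # temporal differences between consecutive frames, flattened
    # into one feature vector per history step.
    return np.stack([(frames[t + 1] - frames[t]).ravel()
                     for t in range(len(frames) - 1)])

def attend(queries, keys, values):
    # Single-head scaled dot-product attention (softmax over keys).
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values

# Toy dimensions (hypothetical): T history frames of H x W, token dim D.
T, H, W, D = 4, 8, 8, 64
frames = rng.standard_normal((T, H, W))            # observation history
flow = motion_features(frames)                     # (T-1, H*W) motion signal
W_proj = rng.standard_normal((H * W, D)) * 0.05    # project flow to tokens
flow_tokens = flow @ W_proj                        # (T-1, D)

# Learnable world queries attend over motion history, implicitly
# forecasting object-centric future states (short-horizon prediction).
world_queries = rng.standard_normal((2, D))
future_state = attend(world_queries, flow_tokens, flow_tokens)  # (2, D)

# Placeholder action head: in a real VLA this would condition the
# policy on both current perception and the predicted future states.
action = future_state.mean(axis=0)
print(action.shape)  # (64,)
```

The point of the sketch is the coupling the abstract describes: perception is history-aware (the policy sees motion, not a single frame), and prediction is implicit (the world queries are trained end-to-end rather than decoding explicit future frames).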