Towards Generalizable Robotic Manipulation in Dynamic Environments

arXiv cs.RO / 4/16/2026


Key Points

  • VLA models are found to perform poorly on robotic manipulation tasks in dynamic environments, largely due to the scarcity of dynamic manipulation datasets and their reliance on single-frame observations, which limits spatiotemporal reasoning.
  • The paper introduces DOMINO, a large-scale dataset and benchmark with 35 tasks, hierarchical difficulty, 110K+ expert trajectories, and a multi-dimensional evaluation suite to study generalizable dynamic manipulation.
  • It evaluates existing VLA systems on dynamic tasks, tests training strategies for improving dynamic awareness, and shows that training on dynamic data can also improve transfer to static manipulation.
  • The authors propose PUMA, a dynamics-aware VLA architecture that uses scene-centric historical optical flow plus world queries for implicit short-horizon prediction of object-centric future states.
  • PUMA achieves state-of-the-art results, improving success rate by 6.3% absolute over baselines, and the authors release code and data via GitHub.

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
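To make the PUMA design concrete, here is a minimal, purely illustrative sketch of the two ideas the abstract names: scene-centric historical optical flow as a motion signal, and learnable "world queries" that attend over that history to implicitly predict object-centric future states. All names, dimensions, and the frame-difference stand-in for optical flow are hypothetical simplifications, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def motion_features(frames):
    # Crude stand-in for scene-centric historical optical flow:
    # temporal differences between consecutive frames, flattened
    # into one feature vector per history step.
    return np.stack([(frames[t + 1] - frames[t]).ravel()
                     for t in range(len(frames) - 1)])

def attend(queries, keys, values):
    # Single-head scaled dot-product attention (softmax over keys).
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values

# Toy dimensions (hypothetical): T history frames of H x W, token dim D.
T, H, W, D = 4, 8, 8, 64
frames = rng.standard_normal((T, H, W))            # observation history
flow = motion_features(frames)                     # (T-1, H*W) motion signal
W_proj = rng.standard_normal((H * W, D)) * 0.05    # project flow to tokens
flow_tokens = flow @ W_proj                        # (T-1, D)

# Learnable world queries attend over motion history, implicitly
# forecasting object-centric future states (short-horizon prediction).
world_queries = rng.standard_normal((2, D))
future_state = attend(world_queries, flow_tokens, flow_tokens)  # (2, D)

# Placeholder action head: in a real VLA this would condition the
# policy on both current perception and the predicted future states.
action = future_state.mean(axis=0)
print(action.shape)  # (64,)
```

The point of the sketch is the coupling the abstract describes: perception is history-aware (the policy sees motion, not a single frame), and prediction is implicit (the world queries are trained end-to-end rather than decoding explicit future frames).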