RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

arXiv cs.RO · April 22, 2026


Key Points

  • The paper proposes RoboWM-Bench, a manipulation-focused benchmark that evaluates video world models by turning predicted behaviors into robot-executable action sequences.
  • Unlike prior benchmarks that emphasize perception or diagnostic checks, RoboWM-Bench explicitly tests whether generated behaviors are physically plausible and can complete tasks when executed by embodied robotic agents.
  • The benchmark is built from behaviors generated from both human-hand and robotic manipulation videos, and it uses a unified protocol to enable consistent, reproducible evaluation.
  • Experiments show that even state-of-the-art video world models struggle to reliably produce physically executable behaviors, with common failures including spatial reasoning errors, unstable contact prediction, and non-physical deformations.
  • Although fine-tuning on manipulation data improves performance, physical inconsistencies remain, indicating a need for more physically grounded video generation approaches for robotics.

Abstract

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of leveraging imagined videos for robot learning. However, visual realism does not imply physical plausibility, and behaviors inferred from generated videos may violate dynamics and fail when executed by embodied agents. Existing benchmarks have begun to incorporate notions of physical plausibility, but they largely remain perception- or diagnostic-oriented and do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete the intended task. To address this gap, we introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated behaviors from both human-hand and robotic manipulation videos into embodied action sequences and validates them through robotic execution. The benchmark spans diverse manipulation scenarios and establishes a unified protocol for consistent and reproducible evaluation. Using RoboWM-Bench, we evaluate state-of-the-art video world models and find that reliably generating physically executable behaviors remains an open challenge. Common failure modes include errors in spatial reasoning, unstable contact prediction, and non-physical deformations. While fine-tuning on manipulation data yields improvements, physical inconsistencies persist, suggesting opportunities for more physically grounded video generation for robots.
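To make the predict-convert-execute protocol concrete, here is a minimal toy sketch of such an evaluation loop. It is an illustrative assumption, not the benchmark's actual API: every class and function name below is hypothetical, and a 1-D position stands in for video frames and robot state.

```python
# Hypothetical sketch of an embodiment-grounded evaluation loop in the
# spirit described by the paper: a world model predicts a rollout, the
# rollout is converted into an action sequence, and success is scored by
# executing those actions. All names here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    goal_position: float  # toy 1-D goal standing in for a manipulation goal


class ToyWorldModel:
    """Stand-in for a video world model: 'predicts' a rollout as a
    sequence of object positions instead of video frames."""

    def rollout(self, task: Task, steps: int = 5) -> list[float]:
        # Linearly interpolate toward the goal (an idealized prediction).
        return [task.goal_position * (t + 1) / steps for t in range(steps)]


def video_to_actions(predicted_positions: list[float]) -> list[float]:
    """Convert a predicted trajectory into executable action deltas,
    analogous to deriving actions from generated video frames."""
    actions, prev = [], 0.0
    for p in predicted_positions:
        actions.append(p - prev)
        prev = p
    return actions


def execute(actions: list[float]) -> float:
    """Toy 'robot': integrate the action deltas to a final position."""
    pos = 0.0
    for a in actions:
        pos += a
    return pos


def evaluate(model: ToyWorldModel, tasks: list[Task], tol: float = 1e-6) -> float:
    """Fraction of tasks where executing the derived actions reaches the goal."""
    successes = 0
    for task in tasks:
        trajectory = model.rollout(task)
        final = execute(video_to_actions(trajectory))
        successes += abs(final - task.goal_position) < tol
    return successes / len(tasks)


tasks = [Task("push-block", 1.0), Task("slide-cup", 0.5)]
print(evaluate(ToyWorldModel(), tasks))  # → 1.0 for this idealized model
```

The idealized model succeeds by construction; the paper's point is that real video world models fail at exactly the `video_to_actions` and `execute` stages, through spatial-reasoning errors, unstable contact prediction, and non-physical deformations that only surface when the predicted behavior is actually executed.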