ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation

arXiv cs.RO / 3/31/2026


Key Points

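  • ManipArena proposes a standardized evaluation framework for assessing vision-language-action (VLA) models and world models under conditions close to real-world deployment.
  • It targets a weakness of existing benchmarks, which are largely simulation-centric and struggle to reflect the reality gap (perception noise, contact dynamics, hardware constraints, latency, and so on), and aims to replace them with evaluation grounded in real-world execution.
  • It includes 20 diverse tasks built on 10,812 expert trajectories, covering reasoning-oriented generalist robot manipulation that requires semantic and spatial reasoning, as well as long-horizon mobile manipulation beyond the tabletop.
  • Multi-level generalization tests under controlled out-of-distribution (OOD) settings, rich diagnostic information such as low-level motor signals, and synchronized real-to-sim environments built from high-quality 3D scans enable fair, reproducible comparison of both VLA and world-model approaches.
  • The overall aim is to provide a scalable foundation for diagnosing and advancing embodied intelligence.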

Abstract

Vision-Language-Action (VLA) models and world models have recently emerged as promising paradigms for general-purpose robotic intelligence, yet their progress is hindered by the lack of reliable evaluation protocols that reflect real-world deployment. Existing benchmarks are largely simulator-centric: they provide controllability but fail to capture the reality gap caused by perception noise, complex contact dynamics, hardware constraints, and system latency. Moreover, fragmented real-world evaluations across different robot platforms prevent fair and reproducible comparison. To address these challenges, we introduce ManipArena, a standardized evaluation framework designed to bridge simulation and real-world execution. ManipArena comprises 20 diverse tasks with 10,812 expert trajectories, emphasizing reasoning-oriented manipulation that requires semantic and spatial reasoning; it supports multi-level generalization through controlled out-of-distribution settings and incorporates long-horizon mobile manipulation beyond tabletop scenarios. The framework further provides rich sensory diagnostics, including low-level motor signals, and synchronized real-to-sim environments constructed via high-quality 3D scanning. Together, these features enable fair, realistic, and reproducible evaluation for both VLA and world model approaches, providing a scalable foundation for diagnosing and advancing embodied intelligence systems.
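
To make the idea of multi-level generalization under controlled out-of-distribution settings more concrete, the following minimal sketch shows one way per-level success rates could be aggregated from trial outcomes. It is not ManipArena's code or API: the task names, the level names (`in_distribution`, `ood_visual`, `ood_semantic`), the trial data, and the function `success_rate_by_level` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical trial records: (task name, generalization level, success flag).
# Task names, level names, and outcomes are illustrative assumptions only;
# they are not taken from ManipArena.
trials = [
    ("sort_objects", "in_distribution", True),
    ("sort_objects", "in_distribution", True),
    ("open_drawer", "in_distribution", True),
    ("open_drawer", "in_distribution", False),
    ("sort_objects", "ood_visual", True),
    ("sort_objects", "ood_visual", False),
    ("open_drawer", "ood_visual", True),
    ("open_drawer", "ood_visual", False),
    ("sort_objects", "ood_semantic", True),
    ("sort_objects", "ood_semantic", False),
    ("open_drawer", "ood_semantic", False),
    ("open_drawer", "ood_semantic", False),
]


def success_rate_by_level(records):
    """Return {level: fraction of successful trials} for each level."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for _task, level, success in records:
        totals[level] += 1
        wins[level] += int(success)
    return {level: wins[level] / totals[level] for level in totals}


rates = success_rate_by_level(trials)
for level, rate in rates.items():
    print(f"{level}: {rate:.2f}")
```

Reporting success separately per level, rather than as a single pooled number, is what lets a drop from in-distribution to OOD conditions show up as a diagnostic signal.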