RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

arXiv cs.RO / 4/14/2026


Key Points

  • RoboLab is proposed as a framework for evaluating the true generalization of task-generalist robot policies, addressing problems that plague simulation benchmarks such as performance saturation and domain overlap between training and evaluation.
  • RoboLab enables human-authored or LLM-driven generation of scenes and tasks in a robot- and policy-agnostic manner, within a physically realistic and photorealistic simulation.
  • The proposed RoboLab-120 benchmark comprises 120 tasks spanning three competency axes (visual, procedural, and relational) across three difficulty levels.
  • It also quantifies the sensitivity of policy behavior to controlled perturbations, indicating that high-fidelity simulation can serve as a proxy for real-world performance and its dependence on external factors.
  • Evaluation with RoboLab reveals performance gaps in current state-of-the-art models, and the authors argue that its granular metrics and scalable tooling give a clearer picture of actual generalization capability.
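The benchmark's structure as described above (tasks tagged by competency axis and difficulty level) can be sketched as a small data model. The axis and level names, the `Task` type, and the example task names are all hypothetical illustrations, not the paper's actual API:

```python
from collections import Counter
from dataclasses import dataclass

AXES = ("visual", "procedural", "relational")   # competency axes named in the paper
LEVELS = ("easy", "medium", "hard")             # three difficulty levels (labels assumed)

@dataclass(frozen=True)
class Task:
    """One benchmark task, tagged by competency axis and difficulty."""
    name: str
    axis: str
    level: str

    def __post_init__(self):
        if self.axis not in AXES:
            raise ValueError(f"unknown competency axis: {self.axis}")
        if self.level not in LEVELS:
            raise ValueError(f"unknown difficulty level: {self.level}")

def summarize(tasks):
    """Count tasks per (axis, level) cell of the benchmark grid."""
    return Counter((t.axis, t.level) for t in tasks)

# Hypothetical example tasks, one per axis:
tasks = [
    Task("pick-red-mug", "visual", "easy"),
    Task("stack-then-open-drawer", "procedural", "medium"),
    Task("place-left-of-bowl", "relational", "hard"),
]
print(summarize(tasks))
```

Grouping tasks by (axis, level) cell like this is what makes the reported metrics "granular": success rates can be broken down per competency and per difficulty rather than averaged into a single score.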

Abstract

The pursuit of general-purpose robotics has yielded impressive foundation models, yet simulation-based benchmarking remains a bottleneck due to rapid performance saturation and a lack of true generalization testing. Existing benchmarks often exhibit significant domain overlap between training and evaluation, trivializing success rates and obscuring insights into robustness. We introduce RoboLab, a simulation benchmarking framework designed to address these challenges. Concretely, our framework is designed to answer two questions: (1) to what extent can we understand the performance of a real-world policy by analyzing its behavior in simulation, and (2) which external factors most strongly affect that behavior under controlled perturbations. First, RoboLab enables human-authored and LLM-enabled generation of scenes and tasks in a robot- and policy-agnostic manner within a physically realistic and photorealistic simulation. With this, we propose the RoboLab-120 benchmark, consisting of 120 tasks categorized into three competency axes (visual, procedural, and relational) across three difficulty levels. Second, we introduce a systematic analysis of real-world policies that quantifies both their performance and the sensitivity of their behavior to controlled perturbations, indicating that high-fidelity simulation can serve as a proxy for analyzing performance and its dependence on external factors. Evaluation with RoboLab exposes significant performance gaps in current state-of-the-art models. By providing granular metrics and a scalable toolset, RoboLab offers a principled framework for evaluating the true generalization capabilities of task-generalist robotic policies.
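The abstract's second contribution, quantifying how sensitive a policy's behavior is to controlled perturbations, can be illustrated with a minimal sketch: run the same task set with and without each perturbation and measure the drop in success rate. The function names, perturbation labels, and run data below are hypothetical, not the paper's actual analysis code:

```python
def success_rate(outcomes):
    """Fraction of successful rollouts (1 = success, 0 = failure)."""
    return sum(outcomes) / len(outcomes)

def sensitivity(baseline, perturbed):
    """Per-perturbation drop in success rate relative to the unperturbed baseline.

    baseline:  list of 0/1 rollout outcomes under nominal conditions
    perturbed: dict mapping perturbation name -> list of 0/1 outcomes
    """
    base = success_rate(baseline)
    return {name: base - success_rate(runs) for name, runs in perturbed.items()}

# Hypothetical rollout outcomes for one policy on one task set:
baseline_runs = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]           # 80% success unperturbed
perturbed_runs = {
    "lighting_shift": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],    # 60% under new lighting
    "camera_jitter":  [0, 0, 1, 0, 1, 0, 0, 1, 0, 1],    # 40% with camera noise
}
print(sensitivity(baseline_runs, perturbed_runs))
```

A large drop for one perturbation and a small drop for another is exactly the kind of external-factor attribution the paper argues high-fidelity simulation makes cheap to measure, since each perturbation can be toggled in isolation while everything else is held fixed.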