Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?

arXiv cs.RO / 4/16/2026


Key Points

  • The paper introduces Target-Bench, a benchmark designed to evaluate whether video world models can perform semantic reasoning, spatial estimation, and planning using semantic target goals.
  • Target-Bench includes 450 robot-collected scenarios across 47 semantic categories and uses SLAM-based trajectories as reference motion tendencies, along with metric scale recovery to reconstruct motion from generated videos.
  • The benchmark provides five complementary metrics that measure target-approaching ability and directional consistency, enabling more comprehensive planning evaluation than prior qualitative assessments.
  • Experimental results show a substantial performance gap: the best off-the-shelf video world model reaches only a 0.341 overall score, suggesting realistic video generation does not yet imply robust semantic planning.
  • The authors report that fine-tuning on a relatively small real-world robot dataset can substantially improve planning performance at the task level, indicating a practical path toward better planning-capable models.
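The paper's exact metric definitions are not reproduced in this summary, but the two metric families named above (target-approaching ability and directional consistency) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function names, the cosine-similarity formulation, and the distance-ratio formulation are not taken from the paper.

```python
import numpy as np

def directional_consistency(gen_traj, ref_traj):
    """Mean cosine similarity between per-step headings of a trajectory
    reconstructed from generated video and the SLAM reference (illustrative)."""
    gen_steps = np.diff(gen_traj, axis=0)
    ref_steps = np.diff(ref_traj, axis=0)
    n = min(len(gen_steps), len(ref_steps))
    gen_steps, ref_steps = gen_steps[:n], ref_steps[:n]
    norms = np.linalg.norm(gen_steps, axis=1) * np.linalg.norm(ref_steps, axis=1)
    cos = np.einsum("ij,ij->i", gen_steps, ref_steps) / np.maximum(norms, 1e-8)
    return float(np.mean(cos))

def target_approach_ratio(gen_traj, target):
    """Fraction of the initial distance to the semantic target that the
    trajectory closes by its final frame, clipped to [0, 1] (illustrative)."""
    d_start = np.linalg.norm(gen_traj[0] - target)
    d_end = np.linalg.norm(gen_traj[-1] - target)
    return float(np.clip((d_start - d_end) / max(d_start, 1e-8), 0.0, 1.0))
```

Both functions assume trajectories are already expressed in a common metric frame, which is exactly what the benchmark's metric scale recovery step would provide.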

Abstract

While recent video world models can generate highly realistic videos, their ability to perform semantic reasoning and planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark that enables comprehensive evaluation of video world models' semantic reasoning, spatial estimation, and planning capabilities. Target-Bench provides 450 robot-collected scenarios spanning 47 semantic categories, with SLAM-based trajectories serving as motion tendency references. Our benchmark reconstructs motion from generated videos with a metric scale recovery mechanism, enabling the evaluation of planning performance with five complementary metrics that focus on target-approaching capability and directional consistency. Our evaluation results show that the best off-the-shelf model achieves only a 0.341 overall score, revealing a significant gap between realistic visual generation and semantic reasoning in current video world models. Furthermore, we demonstrate that fine-tuning on a relatively small real-world robot dataset can significantly improve task-level planning performance.