Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?

arXiv cs.RO / 4/16/2026


Key Points

  • The paper introduces Target-Bench, a benchmark designed to evaluate whether video world models can perform semantic reasoning, spatial estimation, and planning using semantic target goals.
  • Target-Bench includes 450 robot-collected scenarios across 47 semantic categories and uses SLAM-based trajectories as reference motion tendencies, along with metric scale recovery to reconstruct motion from generated videos.
  • The benchmark provides five complementary metrics that measure target-approaching ability and directional consistency, enabling more comprehensive planning evaluation than prior qualitative assessments.
  • Experimental results show a substantial performance gap: the best off-the-shelf video world model reaches only a 0.341 overall score, suggesting realistic video generation does not yet imply robust semantic planning.
  • The authors report that fine-tuning on a relatively small real-world robot dataset can substantially improve planning performance at the task level, indicating a practical path toward better planning-capable models.
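The paper's exact metric definitions are not reproduced in this summary, but the two metric families named above (target-approaching ability and directional consistency) can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function names, the cosine-similarity formulation, and the distance-ratio formulation are not taken from the paper.

```python
import numpy as np

def directional_consistency(gen_traj, ref_traj):
    """Mean cosine similarity between per-step headings of a trajectory
    reconstructed from generated video and the SLAM reference (illustrative)."""
    gen_steps = np.diff(gen_traj, axis=0)
    ref_steps = np.diff(ref_traj, axis=0)
    n = min(len(gen_steps), len(ref_steps))
    gen_steps, ref_steps = gen_steps[:n], ref_steps[:n]
    norms = np.linalg.norm(gen_steps, axis=1) * np.linalg.norm(ref_steps, axis=1)
    cos = np.einsum("ij,ij->i", gen_steps, ref_steps) / np.maximum(norms, 1e-8)
    return float(np.mean(cos))

def target_approach_ratio(gen_traj, target):
    """Fraction of the initial distance to the semantic target that the
    trajectory closes by its final frame, clipped to [0, 1] (illustrative)."""
    d_start = np.linalg.norm(gen_traj[0] - target)
    d_end = np.linalg.norm(gen_traj[-1] - target)
    return float(np.clip((d_start - d_end) / max(d_start, 1e-8), 0.0, 1.0))
```

Both functions assume trajectories are already expressed in a common metric frame, which is exactly what the benchmark's metric scale recovery step would provide.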

Abstract

While recent video world models can generate highly realistic videos, their ability to perform semantic reasoning and planning remains unclear and unquantified. We introduce Target-Bench, the first benchmark that enables comprehensive evaluation of video world models' semantic reasoning, spatial estimation, and planning capabilities. Target-Bench provides 450 robot-collected scenarios spanning 47 semantic categories, with SLAM-based trajectories serving as motion tendency references. Our benchmark reconstructs motion from generated videos with a metric scale recovery mechanism, enabling the evaluation of planning performance with five complementary metrics that focus on target-approaching capability and directional consistency. Our evaluation results show that the best off-the-shelf model achieves only a 0.341 overall score, revealing a significant gap between realistic visual generation and semantic reasoning in current video world models. Furthermore, we demonstrate that fine-tuning on a relatively small real-world robot dataset can significantly improve task-level planning performance.