World Reasoning Arena

arXiv cs.CV · March 30, 2026


Key Points

  • The paper introduces WR-Arena, a new benchmark designed to evaluate world models on next world simulation beyond conventional next-state prediction and visual fidelity.
  • WR-Arena assesses three capabilities: action simulation fidelity (multi-step instruction following and counterfactual rollouts), long-horizon forecasting (extended, physically plausible simulation), and simulative reasoning and planning (goal-directed comparison of alternative futures).
  • It provides a task taxonomy and curated datasets that move evaluation beyond single-turn and purely perceptual tests toward more interactive, open-ended scenarios.
  • Experiments with state-of-the-art world models reveal a substantial performance gap relative to human-level hypothetical reasoning, positioning WR-Arena as both a diagnostic tool and a guideline for development.
  • The project releases code publicly via GitHub to support reproducible evaluation and future research progress.

Abstract

World models (WMs) are intended to serve as internal simulators of the real world that enable agents to understand, anticipate, and act upon complex environments. Existing WM benchmarks remain narrowly focused on next-state prediction and visual fidelity, overlooking the richer simulation capabilities required for intelligent behavior. To address this gap, we introduce WR-Arena, a comprehensive benchmark for evaluating WMs along three fundamental dimensions of next world simulation: (i) Action Simulation Fidelity, the ability to interpret and follow semantically meaningful, multi-step instructions and generate diverse counterfactual rollouts; (ii) Long-horizon Forecast, the ability to sustain accurate, coherent, and physically plausible simulations across extended interactions; and (iii) Simulative Reasoning and Planning, the ability to support goal-directed reasoning by simulating, comparing, and selecting among alternative futures in both structured and open-ended environments. We build a task taxonomy and curate diverse datasets designed to probe these capabilities, moving beyond single-turn and perceptual evaluations. Through extensive experiments with state-of-the-art WMs, our results expose a substantial gap between current models and human-level hypothetical reasoning, and establish WR-Arena as both a diagnostic tool and a guideline for advancing next-generation world models capable of robust understanding, forecasting, and purposeful action. The code is available at https://github.com/MBZUAI-IFM/WR-Arena.