World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

arXiv cs.CV / 30 Apr 2026


Key Points

  • Vision-language models excel at static visual understanding but still struggle with dynamic spatial reasoning, which requires imagining how a scene evolves under egocentric motion.
  • The paper introduces World2VLM, a training framework that distills “spatial imagination” from a generative world model into a VLM, using future observations synthesized by a view-consistent world model conditioned on a parameterized camera trajectory.
  • It derives structured supervision for both forward spatial reasoning (action-to-outcome) and inverse spatial reasoning (outcome-to-action) by geometrically aligning the synthesized views (see the sketch after this list).
  • After two-stage post-training on a compact dataset generated by this pipeline, World2VLM improves consistently over its base model on benchmarks including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube.
  • The approach reportedly beats world-model-coupled inference-time methods while avoiding their heavy computation, positioning world models as training-time teachers rather than only inference-time tools.
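
The paper's data pipeline isn't spelled out in this summary, but a minimal sketch of how such paired forward/inverse supervision could be assembled might look like the following. All names here are illustrative, not interfaces from the paper: `world_model.rollout`, `describe`, and `caption` are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    images: list   # frames shown to the VLM
    question: str
    answer: str

def describe(step: dict) -> str:
    """Hypothetical helper: verbalize one camera-trajectory step."""
    return f"moves forward {step['dist_m']} m and rotates {step['yaw_deg']} deg left"

def caption(view) -> str:
    """Hypothetical helper: turn a synthesized frame into an answer
    string (e.g. via an off-the-shelf captioner)."""
    raise NotImplementedError

def build_supervision(world_model, obs0, trajectory: list[dict]) -> list[QAPair]:
    """Derive forward and inverse QA pairs from one world-model rollout.

    `world_model.rollout` is an assumed interface: initial frame plus a
    parameterized camera trajectory in, one view-consistent future frame
    per trajectory step out.
    """
    future_views = world_model.rollout(obs0, trajectory)
    pairs = []
    for step, view in zip(trajectory, future_views):
        # Forward (action -> outcome): start view plus a described
        # camera motion; ask what the scene looks like afterwards.
        pairs.append(QAPair(
            [obs0],
            f"If the camera {describe(step)}, what will be visible?",
            caption(view),
        ))
        # Inverse (outcome -> action): start and future views;
        # ask which camera motion connects them.
        pairs.append(QAPair(
            [obs0, view],
            "Which camera motion maps the first view to the second?",
            describe(step),
        ))
    return pairs
```

The design point worth noting is that each rollout yields supervision in both directions for free: the same (step, view) pair answers one action-to-outcome question and one outcome-to-action question.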

Abstract

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.
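
The abstract's "geometrically aligned" views also hint at where inverse-reasoning labels can come from: if each synthesized view carries a camera pose, the ground-truth action between two views is simply their relative SE(3) transform. A minimal NumPy sketch, assuming 4×4 camera-to-world pose matrices and a y-up convention (both assumptions, not details from the paper):

```python
import numpy as np

def relative_motion(pose_a: np.ndarray, pose_b: np.ndarray):
    """Relative camera motion from view A to view B.

    Assumes 4x4 camera-to-world matrices (a convention this sketch
    adopts; the paper's parameterization may differ). The relative
    transform T_rel = inv(T_a) @ T_b expresses B's pose in A's camera
    frame, from which translation and yaw labels can be read off for
    outcome-to-action supervision.
    """
    t_rel = np.linalg.inv(pose_a) @ pose_b
    translation = t_rel[:3, 3]  # metres, expressed in A's camera frame
    # Yaw about the y (up) axis, under the y-up assumption above.
    yaw_deg = np.degrees(np.arctan2(t_rel[0, 2], t_rel[2, 2]))
    return translation, yaw_deg
```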