Token Warping Helps MLLMs Look from Nearby Viewpoints

arXiv cs.CV / 4/6/2026


Key Points

  • The paper investigates whether warping image tokens (not pixels) can help multimodal large language models (MLLMs) understand scenes from nearby viewpoints, addressing the fragility of pixel-wise warping to depth errors.
  • It compares forward vs. backward token warping and finds backward token warping is more stable and better preserves semantic coherence during viewpoint shifts.
  • The authors introduce the ViewBench benchmark and use it to show that token-level warping enables more reliable reasoning from nearby viewpoints.
  • Results indicate token-level warping outperforms multiple baselines, including pixel-wise warping, spatially fine-tuned MLLMs, and a generative warping method.

Abstract

Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from a nearby viewpoint? While MLLMs perform well on visual reasoning, they remain fragile to viewpoint changes, as pixel-wise warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint changes. We compare forward and backward warping, finding that backward token warping, which defines a dense grid on the target view and retrieves a corresponding source-view token for each grid point, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines including pixel-wise warping approaches, spatially fine-tuned MLLMs, and a generative warping method.
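The backward token warping described above can be sketched in a few lines: define a dense grid over the target view, unproject each grid point with an assumed target-view depth, transform it into the source camera frame, project it back, and gather the nearest source-view token. This is a minimal illustration, not the paper's implementation; the function name, the shared intrinsics, the per-cell target depth map, and the nearest-neighbour gather are all assumptions for the sketch.

```python
import numpy as np

def backward_warp_tokens(src_tokens, tgt_depth, K, T_tgt_to_src):
    """Backward token warping (illustrative sketch): for each cell of a
    dense grid on the target view, fetch a corresponding source-view token.

    src_tokens:   (H, W, C) grid of source-view ViT tokens
    tgt_depth:    (H, W) depth assumed for each target grid cell
    K:            (3, 3) intrinsics shared by both views (token-grid scale)
    T_tgt_to_src: (4, 4) rigid transform from target to source camera
    """
    H, W, C = src_tokens.shape
    # Pixel-centre coordinates of the dense target grid.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    # Unproject each grid point into target-camera 3D space.
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)
    pts = rays * tgt_depth.reshape(1, -1)
    pts_h = np.concatenate([pts, np.ones((1, pts.shape[1]))])
    # Move the points into the source camera frame and project them.
    src_pts = (T_tgt_to_src @ pts_h)[:3]
    proj = K @ src_pts
    su = proj[0] / proj[2]
    sv = proj[1] / proj[2]
    # Nearest-neighbour gather of the corresponding source token,
    # clamped to the grid so every target cell receives a token.
    si = np.clip(np.floor(sv).astype(int), 0, H - 1)
    sj = np.clip(np.floor(su).astype(int), 0, W - 1)
    return src_tokens[si, sj].reshape(H, W, C)
```

Because every target cell pulls a token from the source grid, the warped token map is dense by construction, which is the stability advantage backward warping has over forward warping, where scattered source tokens can leave holes in the target view.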