V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
arXiv cs.CV / 4/7/2026
Key Points
- The paper argues that current Multimodal Large Language Models (MLLMs) often hallucinate in fine-grained perception tasks because they treat images as static context rather than actively revisiting visual evidence during reasoning.
- It introduces V-Reflection, a “think-then-look” framework that turns latent reasoning states into dynamic probes that interrogate the visual feature space for grounding at each reasoning step (a minimal probe sketch follows this list).
- V-Reflection uses a two-stage distillation approach: Box-Guided Compression (BCM) to learn stable, spatially grounded pixel-to-latent targets, and Dynamic Autoregressive Compression (DAC) to convert hidden states into dynamic probes over the global visual feature map (see the box-pooled target sketch after this list).
- The method reportedly improves results on six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap, with visualizations showing that the latent reasoning localizes task-critical evidence.
- The approach keeps both distillation modules inactive during inference, aiming to preserve efficient, end-to-end autoregressive latent decoding.
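As a rough illustration of the “think-then-look” probe idea, the sketch below treats a decoder hidden state as a query that cross-attends over the vision encoder's patch features and folds the retrieved evidence back into the latent state. The module name `LatentVisualProbe`, the projection layers, and the residual update are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentVisualProbe(nn.Module):
    """Hypothetical sketch: use a latent reasoning state as a dynamic probe
    that attends over the global visual feature map for grounding."""

    def __init__(self, d_model: int, d_visual: int):
        super().__init__()
        self.to_query = nn.Linear(d_model, d_visual)   # hidden state -> visual-space probe
        self.to_value = nn.Linear(d_visual, d_model)   # attended readout -> LM width

    def forward(self, hidden: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # hidden:    (B, d_model)      latent reasoning state at the current step
        # vis_feats: (B, N, d_visual)  patch-level features from the vision encoder
        q = self.to_query(hidden).unsqueeze(1)                     # (B, 1, d_visual)
        scores = q @ vis_feats.transpose(1, 2)                     # (B, 1, N)
        attn = F.softmax(scores / vis_feats.shape[-1] ** 0.5, dim=-1)
        readout = attn @ vis_feats                                 # (B, 1, d_visual)
        return hidden + self.to_value(readout.squeeze(1))          # grounded residual update

# usage with made-up dimensions (e.g. a 24x24 = 576 patch grid)
probe = LatentVisualProbe(d_model=4096, d_visual=1024)
h = torch.randn(2, 4096)
V = torch.randn(2, 576, 1024)
h_grounded = probe(h, V)   # (2, 4096)
```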
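For the box-guided targets, one plausible reading (purely an assumption; the paper's actual BCM objective may differ) is to mean-pool the patch features whose centers fall inside a ground-truth bounding box and distill the probe's visual readout toward that pooled vector. The helper `box_pooled_target` and the loss form below are hypothetical.

```python
import torch
import torch.nn.functional as F

def box_pooled_target(vis_feats: torch.Tensor, boxes: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
    """Hypothetical box-guided target: mean-pool patch features whose centers
    fall inside a normalized ground-truth box (x1, y1, x2, y2 in [0, 1])."""
    B, N, D = vis_feats.shape
    H, W = grid_hw
    ys = (torch.arange(H, dtype=torch.float32) + 0.5) / H
    xs = (torch.arange(W, dtype=torch.float32) + 0.5) / W
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")       # patch centers
    cx, cy = cx.reshape(-1), cy.reshape(-1)              # (N,)
    targets = []
    for b in range(B):
        x1, y1, x2, y2 = boxes[b].tolist()
        inside = (cx >= x1) & (cx <= x2) & (cy >= y1) & (cy <= y2)
        if not inside.any():                             # degenerate box: fall back to global mean
            inside = torch.ones(N, dtype=torch.bool)
        targets.append(vis_feats[b, inside].mean(dim=0))
    return torch.stack(targets)                          # (B, D)

# e.g. distill the probe's pre-projection readout (B, d_visual) toward the target:
# loss = F.mse_loss(readout, box_pooled_target(V, boxes, (24, 24)))
```

Note that in this reading the distillation machinery exists only at training time, consistent with the last bullet: at inference the probes run as part of ordinary autoregressive decoding with no extra modules.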