V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
arXiv cs.CV / 4/7/2026
Key Points
- The paper argues that current Multimodal Large Language Models (MLLMs) often hallucinate in fine-grained perception tasks because they treat images as static context rather than actively revisiting visual evidence during reasoning.
- It introduces V-Reflection, a “think-then-look” framework that turns latent reasoning states into dynamic probes that interrogate the visual feature space for grounding at each reasoning step (a minimal probe sketch follows this list).
- V-Reflection uses a two-stage distillation approach: Box-Guided Compression (BCM) to learn stable, spatially grounded pixel-to-latent targets, and Dynamic Autoregressive Compression (DAC) to convert hidden states into dynamic probes over the global visual feature map (see the box-pooled target sketch after this list).
- The method reportedly improves results on six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap, with visualizations showing that the latent reasoning localizes task-critical evidence.
- The approach keeps both distillation modules inactive during inference, aiming to preserve efficient, end-to-end autoregressive latent decoding.
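As a rough illustration of the “think-then-look” probe idea, the sketch below treats a decoder hidden state as a query that cross-attends over the vision encoder's patch features and folds the retrieved evidence back into the latent state. The module name `LatentVisualProbe`, the projection layers, and the residual update are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentVisualProbe(nn.Module):
    """Hypothetical sketch: use a latent reasoning state as a dynamic probe
    that attends over the global visual feature map for grounding."""

    def __init__(self, d_model: int, d_visual: int):
        super().__init__()
        self.to_query = nn.Linear(d_model, d_visual)   # hidden state -> visual-space probe
        self.to_value = nn.Linear(d_visual, d_model)   # attended readout -> LM width

    def forward(self, hidden: torch.Tensor, vis_feats: torch.Tensor) -> torch.Tensor:
        # hidden:    (B, d_model)      latent reasoning state at the current step
        # vis_feats: (B, N, d_visual)  patch-level features from the vision encoder
        q = self.to_query(hidden).unsqueeze(1)                     # (B, 1, d_visual)
        scores = q @ vis_feats.transpose(1, 2)                     # (B, 1, N)
        attn = F.softmax(scores / vis_feats.shape[-1] ** 0.5, dim=-1)
        readout = attn @ vis_feats                                 # (B, 1, d_visual)
        return hidden + self.to_value(readout.squeeze(1))          # grounded residual update

# usage with made-up dimensions (e.g. a 24x24 = 576 patch grid)
probe = LatentVisualProbe(d_model=4096, d_visual=1024)
h = torch.randn(2, 4096)
V = torch.randn(2, 576, 1024)
h_grounded = probe(h, V)   # (2, 4096)
```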
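For the box-guided targets, one plausible reading (purely an assumption; the paper's actual BCM objective may differ) is to mean-pool the patch features whose centers fall inside a ground-truth bounding box and distill the probe's visual readout toward that pooled vector. The helper `box_pooled_target` and the loss form below are hypothetical.

```python
import torch
import torch.nn.functional as F

def box_pooled_target(vis_feats: torch.Tensor, boxes: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
    """Hypothetical box-guided target: mean-pool patch features whose centers
    fall inside a normalized ground-truth box (x1, y1, x2, y2 in [0, 1])."""
    B, N, D = vis_feats.shape
    H, W = grid_hw
    ys = (torch.arange(H, dtype=torch.float32) + 0.5) / H
    xs = (torch.arange(W, dtype=torch.float32) + 0.5) / W
    cy, cx = torch.meshgrid(ys, xs, indexing="ij")       # patch centers
    cx, cy = cx.reshape(-1), cy.reshape(-1)              # (N,)
    targets = []
    for b in range(B):
        x1, y1, x2, y2 = boxes[b].tolist()
        inside = (cx >= x1) & (cx <= x2) & (cy >= y1) & (cy <= y2)
        if not inside.any():                             # degenerate box: fall back to global mean
            inside = torch.ones(N, dtype=torch.bool)
        targets.append(vis_feats[b, inside].mean(dim=0))
    return torch.stack(targets)                          # (B, D)

# e.g. distill the probe's pre-projection readout (B, d_visual) toward the target:
# loss = F.mse_loss(readout, box_pooled_target(V, boxes, (24, 24)))
```

Note that in this reading the distillation machinery exists only at training time, consistent with the last bullet: at inference the probes run as part of ordinary autoregressive decoding with no extra modules.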