Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs
arXiv cs.CL / 4/10/2026
Key Points
- The paper addresses a key limitation of vision-language models (VLMs): they often fail at complex multi-step visual reasoning because visual information is lost when intermediate reasoning steps are forced through text-based chain-of-thought (CoT).
- It proposes “Decompose, Look, and Reason (DLR),” a reinforced latent reasoning framework that decomposes a query into textual premises, extracts premise-conditioned continuous visual latents, and generates answers using grounded rationales.
- DLR includes a three-stage training pipeline and introduces a “Spherical Gaussian Latent Policy” designed to improve exploration quality in the latent space during reinforcement-style training.
- Experiments on vision-focused benchmarks reportedly show consistent gains over multiple strong baselines, including text-only methods, interleaved multimodal CoT, and prior latent reasoning approaches, along with improved step-by-step interpretability.
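The paper does not spell out the mechanics of the "Spherical Gaussian Latent Policy" in this summary, but the name suggests sampling continuous latents from an isotropic (spherical-covariance) Gaussian so that their log-density is tractable for policy-gradient updates. The following is a minimal sketch of that general idea, not the authors' implementation; the function name, shapes, and the choice of a single shared log-scale per latent are all assumptions made for illustration.

```python
import numpy as np

def spherical_gaussian_policy(mu, log_sigma, rng=None):
    """Hypothetical sketch: sample a latent "action" from an isotropic
    ("spherical") Gaussian N(mu, sigma^2 * I) and return it together with
    its log-density, as a REINFORCE-style update would require.

    mu:        (dim,) mean latent predicted by the model
    log_sigma: scalar log standard deviation shared across dimensions,
               giving the spherical covariance sigma^2 * I
    """
    rng = rng or np.random.default_rng()
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)   # reparameterization noise
    z = mu + sigma * eps                  # sampled continuous latent

    # Log-density of an isotropic Gaussian, summed over dimensions.
    dim = mu.shape[-1]
    log_prob = (-0.5 * np.sum(((z - mu) / sigma) ** 2)
                - dim * log_sigma
                - 0.5 * dim * np.log(2 * np.pi))
    return z, log_prob
```

In a policy-gradient loop, `log_prob` would be multiplied by a reward (e.g. answer correctness) to weight the gradient; the spherical covariance keeps exploration uniform in every latent direction while adding only one extra scale parameter.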



