Hybrid Latent Reasoning with Decoupled Policy Optimization

arXiv cs.CV / 4/23/2026

📰 News · Models & Research

Key Points

  • The paper argues that applying chain-of-thought (CoT) reasoning to vision can cause “early semantic collapse” due to discretizing visual signals into LLM token inputs.
  • It introduces HyLaR (Hybrid Latent Reasoning), which alternates discrete text generation with continuous visual latent representations to retain fine-grained visual details.
  • After an initial supervised fine-tuning (SFT) cold start, the work proposes DePO (Decoupled Policy Optimization) to perform reinforcement learning in the hybrid discrete-continuous action space.
  • DePO improves RL stability by decomposing the policy-gradient objective and applying separate trust-region constraints to text and latent components, plus an exact closed-form von Mises-Fisher (vMF) KL regularizer.
  • Experiments reportedly show HyLaR outperforms standard MLLMs and existing latent-reasoning methods on fine-grained perception and general multimodal understanding benchmarks, with code released on GitHub.
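The "exact closed-form von Mises-Fisher KL regularizer" mentioned above has a standard analytic form: for vMF distributions with mean directions μ_p, μ_q and concentrations κ_p, κ_q in dimension d, KL(p‖q) = log C_d(κ_p) − log C_d(κ_q) + (κ_p − κ_q μ_qᵀμ_p)·A_d(κ_p), where A_d(κ) = I_{d/2}(κ)/I_{d/2−1}(κ) is the mean resultant length and C_d is the vMF normalizer. The paper's exact parameterization isn't given in this summary, so the sketch below is the textbook formula only (Bessel functions computed from the ascending series, adequate for moderate κ):

```python
import math

def log_bessel_i(nu, x, terms=60):
    # log I_nu(x) via the ascending series: sum_k (x/2)^(2k+nu) / (k! * Gamma(k+nu+1)).
    # Accurate for moderate x; large x would need an asymptotic or scaled form.
    log_half_x = math.log(x / 2.0)
    logs = [(2 * k + nu) * log_half_x - math.lgamma(k + 1) - math.lgamma(k + nu + 1)
            for k in range(terms)]
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logs))

def vmf_log_norm(kappa, d):
    # log C_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2 pi) - log I_{d/2-1}(kappa)
    nu = d / 2.0 - 1.0
    return nu * math.log(kappa) - (d / 2.0) * math.log(2.0 * math.pi) \
        - log_bessel_i(nu, kappa)

def vmf_kl(mu_p, kappa_p, mu_q, kappa_q):
    # Closed-form KL( vMF(mu_p, kappa_p) || vMF(mu_q, kappa_q) ) on the unit sphere.
    d = len(mu_p)
    dot = sum(a * b for a, b in zip(mu_p, mu_q))
    # A_d(kappa_p) = I_{d/2}(kappa_p) / I_{d/2-1}(kappa_p): E_p[mu_p . x]
    A = math.exp(log_bessel_i(d / 2.0, kappa_p) - log_bessel_i(d / 2.0 - 1.0, kappa_p))
    return (vmf_log_norm(kappa_p, d) - vmf_log_norm(kappa_q, d)
            + (kappa_p - kappa_q * dot) * A)
```

As a sanity check, the KL is zero when the two distributions coincide and positive otherwise, which is what makes it usable as a regularizer penalizing drift of the latent policy from a reference.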

Abstract

Chain-of-Thought (CoT) reasoning significantly elevates the complex problem-solving capabilities of multimodal large language models (MLLMs). However, adapting CoT to vision typically discretizes signals to fit LLM inputs, causing early semantic collapse and discarding fine-grained details. While external tools can mitigate this, they introduce a rigid bottleneck, confining reasoning to predefined operations. Although recent latent reasoning paradigms internalize visual states to overcome these limitations, optimizing the resulting hybrid discrete-continuous action space remains challenging. In this work, we propose HyLaR (Hybrid Latent Reasoning), a framework that seamlessly interleaves discrete text generation with continuous visual latent representations. Specifically, following an initial cold-start supervised fine-tuning (SFT), we introduce DePO (Decoupled Policy Optimization) to enable effective reinforcement learning within this hybrid space. DePO decomposes the policy gradient objective, applying independent trust-region constraints to the textual and latent components, alongside an exact closed-form von Mises-Fisher (vMF) KL regularizer. Extensive experiments demonstrate that HyLaR outperforms standard MLLMs and state-of-the-art latent reasoning approaches across fine-grained perception and general multimodal understanding benchmarks. Code is available at https://github.com/EthenCheng/HyLaR.
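The abstract describes decomposing the policy-gradient objective and applying independent trust-region constraints to the textual and latent components. The paper's exact objective is not reproduced here; a minimal sketch of that decoupled idea, assuming a PPO-style clipped surrogate with separate (hypothetical) clip ranges per component and a per-step KL penalty on the latent part, might look like:

```python
def clipped_surrogate(ratio, advantage, clip_eps):
    # Standard PPO-style clipped surrogate for a single action:
    # min( r*A, clip(r, 1-eps, 1+eps)*A )
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps) * advantage
    return min(unclipped, clipped)

def depo_loss(text_steps, latent_steps, eps_text=0.2, eps_latent=0.05, beta=0.01):
    # Decoupled objective sketch (names and hyperparameters are assumptions):
    #   text_steps:   list of (importance_ratio, advantage) for discrete tokens
    #   latent_steps: list of (importance_ratio, advantage, kl_to_ref) for
    #                 continuous latent steps, where kl_to_ref is e.g. the
    #                 closed-form vMF KL to a reference policy
    # Each component gets its own trust region via its own clip range.
    text_obj = sum(clipped_surrogate(r, a, eps_text) for r, a in text_steps)
    latent_obj = sum(clipped_surrogate(r, a, eps_latent) - beta * kl
                     for r, a, kl in latent_steps)
    return -(text_obj + latent_obj)  # negated: minimize loss = maximize objective
```

The design intent, as the abstract frames it, is that text tokens and continuous latents have very different likelihood geometries, so a single shared trust region would either over-constrain one component or under-constrain the other; separate clip ranges (plus the exact vMF KL term on the latent side) let each be tuned independently.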