All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
arXiv cs.CV / 4/2/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper investigates why reinforcement learning methods like GRPO can improve Vision-Language Models’ reasoning, focusing on how their reasoning behavior differs from base (non-RL) models.
- It finds a distinction in both behavior and training dynamics: RL-trained models reason more deeply but over a narrower set of strategies, while base models explore broader, more diverse reasoning patterns.
- The authors identify a key limitation of GRPO—diversity collapse—where the model converges too early on a small set of reasoning strategies, getting stuck in local optima and reducing scalability.
- To mitigate this, the paper proposes Multi-Group Policy Optimization (MUPO), which incentivizes divergent thinking across multiple solution paths.
- MUPO is evaluated on established benchmarks and shown to improve performance by preventing premature convergence and preserving reasoning diversity.
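The summary above doesn't spell out MUPO's exact objective, but the diversity-collapse problem can be illustrated with the group-relative advantage computation that standard GRPO uses. The sketch below is a minimal, hypothetical illustration (function names and the multi-group partitioning are assumptions, not the paper's implementation): GRPO normalizes each rollout's reward against its sampled group, so once one strategy dominates a group, alternatives stop receiving positive signal; normalizing within separate strategy groups keeps each solution path's relative signal alive.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in standard GRPO:
    normalize each rollout's reward against the group's
    mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mu) / sigma for r in rewards]

def multi_group_advantages(groups):
    """Hypothetical multi-group variant (illustration only, not the
    paper's MUPO): rollouts are partitioned by solution strategy and
    normalized within each partition, so a dominant strategy in one
    group does not zero out the learning signal for the others."""
    return [grpo_advantages(g) for g in groups]
```

For example, with a single pooled group a weaker-but-distinct strategy is pushed below zero advantage; split into its own group, its best rollouts still earn positive advantage relative to their peers.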