MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models

arXiv cs.CV / 5/5/2026


Key Points

  • Vision-language models often make visual perception mistakes and hallucinate, which reduces answer accuracy in complex reasoning tasks.
  • Existing RLVR approaches are limited because they waste sampling on trajectories likely to fail early and because sparse rewards cannot tell whether errors come from visual perception or reasoning.
  • The proposed MIRL framework uses mutual information between generated descriptions and visual inputs as a low-cost pre-screening signal to allocate the sampling budget more effectively (see the sketch after this list).
  • MIRL also uses decoupled training to provide separate MI-based rewards for visual perception optimization, mitigating “reward blindness” from sparse correctness signals.
  • On six vision-language reasoning benchmarks, MIRL reaches 70.22% average accuracy and outperforms a baseline that samples 16 full trajectories by using only 10 pre-samples with top-6 selection (25% fewer complete trajectories).
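
To make the pre-screening idea concrete, here is a minimal Python sketch. The paper's exact MI estimator is not specified in this summary, so the pointwise log-probability proxy, the function names, and the placeholder scores below are illustrative assumptions, not MIRL's actual implementation:

```python
import torch

def mi_proxy(logp_desc_given_image: torch.Tensor,
             logp_desc_marginal: torch.Tensor) -> torch.Tensor:
    # Pointwise mutual-information proxy:
    #   log p(description | image) - log p(description).
    # Each tensor holds one summed token log-probability per pre-sample.
    return logp_desc_given_image - logp_desc_marginal

def prescreen(descriptions: list[str],
              logp_cond: torch.Tensor,
              logp_marg: torch.Tensor,
              top_k: int = 6) -> list[str]:
    # Rank pre-sampled descriptions by the MI proxy and keep the top_k;
    # only these survivors would be forked into full reasoning
    # trajectories, concentrating the expensive sampling budget on
    # descriptions that appear well-grounded in the image.
    scores = mi_proxy(logp_cond, logp_marg)
    keep = torch.topk(scores, k=min(top_k, len(descriptions))).indices
    return [descriptions[i] for i in keep.tolist()]

# Example matching the reported setting: 10 cheap pre-samples,
# expand only the top 6 into complete trajectories.
if __name__ == "__main__":
    descs = [f"description {i}" for i in range(10)]
    cond = torch.randn(10) + 2.0  # placeholder conditional log-probs
    marg = torch.randn(10)        # placeholder marginal log-probs
    print(prescreen(descs, cond, marg, top_k=6))
```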

Abstract

Vision-Language Models (VLMs) frequently suffer from visual perception errors and hallucinations that compromise answer accuracy in complex reasoning tasks. Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising solution by optimizing policies using answer correctness signals. Despite their effectiveness, prevailing RLVR methods face two critical limitations. First, much of the sampling budget is wasted on trajectories doomed to fail due to early visual description errors. Second, sparse rewards cannot distinguish whether failures stem from visual perception or reasoning stages. We introduce MIRL, a decoupled framework that addresses both limitations by leveraging mutual information (MI) between generated descriptions and visual inputs as a cheap pre-screening signal. This enables intelligent budget allocation toward high-potential trajectories via forking, while decoupled training provides independent MI-based rewards for visual perception optimization, resolving reward blindness. Experiments on six vision-language reasoning benchmarks demonstrate that MIRL achieves 70.22% average accuracy and surpasses the performance of sampling 16 complete trajectories using only 10 pre-samples with top-6 selection (25% fewer complete trajectories). Our code is available at: https://anonymous.4open.science/r/mirl-main/.
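
The decoupled-reward idea can also be sketched in a few lines. The sketch below assumes the MI estimate arrives as a scalar per description; the function name, the weighting factor alpha, and the 0/1 correctness reward are illustrative assumptions rather than the paper's exact formulation:

```python
def decoupled_rewards(mi_score: float,
                      answer_correct: bool,
                      alpha: float = 0.5) -> tuple[float, float]:
    # Perception stage: a dense MI-based reward for the generated
    # description, assigned independently of the final answer, so an
    # accurate description is still credited when reasoning fails.
    perception_reward = alpha * mi_score
    # Reasoning stage: the usual sparse verifiable-correctness signal.
    reasoning_reward = 1.0 if answer_correct else 0.0
    return perception_reward, reasoning_reward
```

Splitting the signal this way is what counters "reward blindness": under a single end-to-end correctness reward, a well-grounded description followed by a reasoning slip and a hallucinated description followed by a lucky guess would receive the same feedback.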