Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning

arXiv cs.AI / March 31, 2026


Key Points

  • The paper argues that existing RLVR methods for multimodal LLMs often use a single final-answer reward, which blurs credit assignment: reasoning patterns improve, but the accuracy of upstream visual evidence extraction does not reliably improve.
  • It introduces PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy where an Observer produces question-specific evidence captions and a Solver uses them to predict the final answer.
  • PRCO uses role-specific rewards: the Solver gets verifiable outcome rewards from the final answer, while the Observer gets utility rewards based on how well the Solver succeeds downstream.
  • Experiments on eight multimodal reasoning benchmarks show PRCO improves average accuracy by more than 7 points across model scales versus the base model.
  • The approach outperforms prior open-source RL-tuned baselines, suggesting a more reliable way to co-train perception and reasoning for multimodal tasks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially enhanced the reasoning capabilities of multimodal large language models (MLLMs). However, existing RLVR approaches typically rely on outcome-driven optimization that updates both perception and reasoning using a shared reward based solely on the final answer. This shared reward blurs credit assignment, frequently improving reasoning patterns while failing to reliably enhance the accuracy of upstream visual evidence extraction. To address this perception bottleneck, we introduce PRCO (Perception-Reasoning Coevolution), a dual-role RLVR framework with a shared policy. PRCO consists of two cooperative roles: an Observer that generates an evidence caption tailored to the question and a Solver that predicts the final answer based on this caption. Crucially, PRCO employs role-specific reward signals: the Solver is optimized using verifiable outcome rewards on the final answer, while the Observer receives a utility reward derived from the Solver's downstream success. Extensive experiments across eight challenging multimodal reasoning benchmarks demonstrate that PRCO yields consistent improvements of over 7 points in average accuracy across model scales compared to the base model, outperforming prior open-source RL-tuned baselines.
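The role-specific reward scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`outcome_reward`, `observer_utility_reward`), the exact-match verifier, and the sampling count are all placeholders chosen for clarity.

```python
# Hypothetical sketch of PRCO-style role-specific rewards.
# One shared policy plays two roles:
#   Observer: image + question -> evidence caption
#   Solver:   question + caption -> final answer

def outcome_reward(answer: str, gold: str) -> float:
    """Verifiable outcome reward for the Solver: 1.0 if the final answer
    matches the reference, else 0.0 (exact match as a stand-in verifier)."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def observer_utility_reward(caption: str, question: str, gold: str,
                            solve, n_samples: int = 4) -> float:
    """Utility reward for the Observer: average downstream Solver success
    when conditioned on this caption. `solve` stands in for sampling an
    answer from the shared policy acting in the Solver role."""
    successes = [outcome_reward(solve(question, caption), gold)
                 for _ in range(n_samples)]
    return sum(successes) / n_samples

# Toy Solver for illustration: answers correctly only when the caption
# surfaces the relevant evidence.
toy_solve = lambda q, cap: "42" if "evidence" in cap else "?"
r_obs = observer_utility_reward("evidence: the dial reads 42",
                                "What does the dial read?", "42", toy_solve)
```

The key design choice this illustrates is decoupled credit assignment: the Observer is rewarded not for caption plausibility in isolation, but for how useful its caption proves to the Solver downstream.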