MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

arXiv cs.CL / 4/16/2026


Key Points

  • The paper introduces MM-Doc-R1, an agentic, vision-aware framework aimed at improving long-document visual question answering where conventional single-pass RAG struggles with complex multi-hop queries.
  • It proposes Similarity-based Policy Optimization (SPO) to improve multi-turn reinforcement learning stability by using similarity-weighted trajectory rewards for better baseline estimation, addressing bias issues in prior methods like GRPO.
  • The authors’ key technical claim is that more semantically similar trajectories yield more accurate shared baseline estimates, which SPO exploits to provide a more reliable learning signal across intermediate states.
  • Experiments on the MMLongbench-Doc benchmark show MM-Doc-R1 achieves a 10.4% improvement over previous baselines, and SPO yields additional gains over GRPO (5.0% with Qwen3-8B and 6.1% with Qwen3-4B).
  • Overall, the work argues that combining an iterative information-discovery-and-synthesis agent workflow with a revised multi-turn RL training objective advances the state of the art in long-document VQA.

Abstract

Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information-seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state's baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.
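The similarity-weighted baseline described in the abstract can be sketched as follows. The paper does not specify the similarity measure or embedding here, so this is a minimal illustration assuming cosine similarity over trajectory embeddings, with negative similarities clipped to zero; the contrast with a GRPO-style shared group-mean baseline is the point of the sketch.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def grpo_advantages(rewards):
    # GRPO-style: one shared baseline (the group mean) for all trajectories.
    return rewards - np.mean(rewards)

def spo_advantages(rewards, embeddings):
    # SPO-style (sketch): each trajectory's baseline is a similarity-weighted
    # average of the group's rewards, so semantically closer trajectories
    # contribute more to its baseline estimate.
    n = len(rewards)
    adv = np.zeros(n)
    for i in range(n):
        weights = np.array([cosine_sim(embeddings[i], embeddings[j]) for j in range(n)])
        weights = np.maximum(weights, 0.0)   # assumption: ignore dissimilar (negative-sim) trajectories
        if weights.sum() == 0.0:
            baseline = rewards.mean()        # fall back to the group mean
        else:
            baseline = (weights * rewards).sum() / weights.sum()
        adv[i] = rewards[i] - baseline
    return adv
```

When all trajectories are identical (all pairwise similarities equal), the similarity-weighted baseline collapses to the group mean and SPO matches GRPO; the two diverge exactly when trajectories are semantically heterogeneous, which is where the paper claims GRPO's shared baseline becomes biased.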