MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

arXiv cs.CL / 4/16/2026


Key Points

  • The paper introduces MM-Doc-R1, an agentic, vision-aware framework aimed at improving long-document visual question answering where conventional single-pass RAG struggles with complex multi-hop queries.
  • It proposes Similarity-based Policy Optimization (SPO) to improve multi-turn reinforcement learning stability by using similarity-weighted trajectory rewards for better baseline estimation, addressing bias issues in prior methods like GRPO.
  • The authors’ key technical claim is that more semantically similar trajectories yield more accurate shared baseline estimates, which SPO exploits to provide a more reliable learning signal across intermediate states.
  • Experiments on the MMLongbench-Doc benchmark show MM-Doc-R1 achieves a 10.4% improvement over previous baselines, and SPO yields additional gains over GRPO (5.0% with Qwen3-8B and 6.1% with Qwen3-4B).
  • Overall, the work argues that combining an iterative information-discovery-and-synthesis agent workflow with a revised multi-turn RL training objective advances the state of the art in long-document VQA.

Abstract

Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information-seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state's baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.
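The similarity-weighted baseline described in the abstract can be sketched as follows. The paper does not specify the similarity measure or embedding here, so this is a minimal illustration assuming cosine similarity over trajectory embeddings, with negative similarities clipped to zero; the contrast with a GRPO-style shared group-mean baseline is the point of the sketch.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def grpo_advantages(rewards):
    # GRPO-style: one shared baseline (the group mean) for all trajectories.
    return rewards - np.mean(rewards)

def spo_advantages(rewards, embeddings):
    # SPO-style (sketch): each trajectory's baseline is a similarity-weighted
    # average of the group's rewards, so semantically closer trajectories
    # contribute more to its baseline estimate.
    n = len(rewards)
    adv = np.zeros(n)
    for i in range(n):
        weights = np.array([cosine_sim(embeddings[i], embeddings[j]) for j in range(n)])
        weights = np.maximum(weights, 0.0)   # assumption: ignore dissimilar (negative-sim) trajectories
        if weights.sum() == 0.0:
            baseline = rewards.mean()        # fall back to the group mean
        else:
            baseline = (weights * rewards).sum() / weights.sum()
        adv[i] = rewards[i] - baseline
    return adv
```

When all trajectories are identical (all pairwise similarities equal), the similarity-weighted baseline collapses to the group mean and SPO matches GRPO; the two diverge exactly when trajectories are semantically heterogeneous, which is where the paper claims GRPO's shared baseline becomes biased.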