MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
arXiv cs.CL / April 16, 2026
Key Points
- The paper introduces MM-Doc-R1, an agentic, vision-aware framework aimed at improving long-document visual question answering where conventional single-pass RAG struggles with complex multi-hop queries.
- It proposes Similarity-based Policy Optimization (SPO), which stabilizes multi-turn reinforcement learning by weighting trajectory rewards by semantic similarity when estimating the baseline, addressing baseline-bias issues in prior methods such as GRPO.
- The authors’ key technical claim is that more semantically similar trajectories yield more accurate shared baseline estimates, which SPO exploits to provide a more reliable learning signal across intermediate states.
- Experiments on the MMLongbench-Doc benchmark show MM-Doc-R1 achieves a 10.4% improvement over previous baselines, and SPO yields additional gains over GRPO (5.0% with Qwen3-8B and 6.1% with Qwen3-4B).
- Overall, the work argues that integrating an iterative information-discovery/synthesis agent workflow with a revised multi-turn RL training objective advances state-of-the-art performance for long-document VQA.
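The summary does not spell out SPO's exact formulation, but the core idea stated above (semantically similar trajectories give more accurate shared baselines) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: a leave-one-out baseline per trajectory, weighted by cosine similarity of hypothetical trajectory embeddings, with per-trajectory scalar rewards.

```python
import numpy as np

def spo_advantages(rewards, embeddings):
    """Illustrative similarity-weighted baseline (not the paper's exact method).

    rewards:    (N,) scalar reward per sampled trajectory in a group.
    embeddings: (N, d) semantic embedding per trajectory (assumed given).
    Returns advantages r_i - b_i, where b_i is a similarity-weighted
    leave-one-out average of the other trajectories' rewards.
    """
    rewards = np.asarray(rewards, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    # Cosine similarity matrix between trajectory embeddings.
    unit = emb / np.clip(np.linalg.norm(emb, axis=1, keepdims=True), 1e-8, None)
    sim = unit @ unit.T                 # (N, N), values in [-1, 1]
    np.fill_diagonal(sim, 0.0)          # leave-one-out: exclude self
    w = np.clip(sim, 0.0, None)         # keep only non-negative similarities
    row_sums = w.sum(axis=1, keepdims=True)
    # Fall back to uniform leave-one-out weights when a trajectory
    # has no similar peers (all similarities <= 0).
    n = len(rewards)
    uniform = (np.ones((n, n)) - np.eye(n)) / max(n - 1, 1)
    w = np.where(row_sums > 0, w / np.clip(row_sums, 1e-8, None), uniform)
    baselines = w @ rewards             # similarity-weighted baselines
    return rewards - baselines
```

Under this reading, a trajectory surrounded by semantically similar peers gets a baseline dominated by those peers' rewards, so its advantage reflects how it compares to genuinely comparable rollouts rather than to the whole, possibly heterogeneous, group — the bias GRPO's uniform group mean can suffer from.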