Internalized Reasoning for Long-Context Visual Document Understanding

arXiv cs.AI / 4/6/2026


Key Points

  • The paper proposes an end-to-end synthetic-data pipeline that adds “reasoning” to visual long-document understanding by scoring each page for question relevance, extracting textual evidence, and ordering that evidence from most to least relevant.
  • It trains models with supervised fine-tuning on generated reasoning traces inside <think> tags, controlled by a <cot> token, and then internalizes the reasoning behavior via low-strength model merging.
  • Experiments with Qwen3-VL 32B show improved performance on MMLongBenchDoc, reaching 58.3 and slightly outperforming a much larger Qwen3-VL 235B baseline (57.0).
  • Experiments with Mistral Small 3.1 24B indicate that training on synthetic reasoning traces beats distillation from explicit “thinking” traces by 3.8 points, and that internalized reasoning produces 12.4× fewer mean output tokens than explicit reasoning.
  • The authors release the pipeline for reproducibility, enabling further research and extension of internalized-reasoning methods for long visual documents.
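The trace-construction step described above (score pages, extract evidence, order by relevance, wrap in the reasoning tags) can be sketched as follows. This is an illustrative reconstruction, not the authors' released pipeline: the function name, the per-page score/evidence inputs, and the exact tag layout are assumptions; in the paper, relevance scoring and evidence extraction are themselves model-generated.

```python
def build_reasoning_trace(pages, scores, evidence):
    """Assemble a synthetic reasoning trace for one (document, question) pair.

    Hypothetical inputs (assumed, not from the paper's code):
      pages    -- page identifiers
      scores   -- per-page question-relevance scores from an upstream scorer
      evidence -- per-page extracted textual evidence

    Pages are ordered from most to least relevant; irrelevant pages
    (score <= 0) are dropped. The trace is wrapped in <think> tags and
    prefixed with the <cot> control token that gates reasoning at SFT time.
    """
    order = sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)
    lines = [
        f"[page {pages[i]} | relevance {scores[i]:.2f}] {evidence[i]}"
        for i in order
        if scores[i] > 0
    ]
    return "<cot><think>\n" + "\n".join(lines) + "\n</think>"


# Example: page 2 is most relevant, page 3 contributes nothing.
trace = build_reasoning_trace(
    pages=[1, 2, 3],
    scores=[0.2, 0.9, 0.0],
    evidence=["minor detail", "key answer span", "unrelated text"],
)
```

The ordering matters for training: placing the strongest evidence first gives the SFT target a consistent, learnable structure rather than document order.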

Abstract

Visual long-document understanding is critical for enterprise, legal, and scientific applications, yet the best-performing open recipes have not explored reasoning, a capability which has driven leaps in math and code performance. We introduce a synthetic data pipeline for reasoning in long-document understanding that generates thinking traces by scoring each page for question relevance, extracting textual evidence and ordering it from most to least relevant. We apply SFT to the resulting traces within `<think>` tags, gated by a `<cot>` control token, and the resulting reasoning capability is internalized via low-strength model merging. We study Qwen3 VL 32B and Mistral Small 3.1 24B. With Qwen3 VL, we achieve 58.3 on MMLongBenchDoc, surpassing the 7× larger Qwen3 VL 235B A22B (57.0). With Mistral, we show that synthetic reasoning outperforms distillation from the Thinking version's traces by 3.8 points on MMLBD-C, and internalized reasoning exhibits 12.4× fewer mean output tokens compared to explicit reasoning. We release our pipeline for reproducibility and further exploration.
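The "low-strength model merging" step can be illustrated as a simple weight-space interpolation between the base model and its reasoning-tuned checkpoint. This is a minimal sketch under assumptions: the paper does not specify its merge formula or strength here, so the linear scheme and the `alpha` value below are purely illustrative.

```python
def merge_low_strength(base_weights, tuned_weights, alpha=0.15):
    """Interpolate per-parameter: theta = (1 - alpha) * base + alpha * tuned.

    alpha is a hypothetical small merge strength; low alpha keeps the model
    close to the base while folding in a fraction of the reasoning-tuned
    behavior, so the capability is "internalized" without requiring
    explicit reasoning tokens at inference time.
    """
    return {
        name: (1 - alpha) * base_weights[name] + alpha * tuned_weights[name]
        for name in base_weights
    }


# Toy example with scalar "weights" standing in for parameter tensors.
merged = merge_low_strength({"w": 0.0}, {"w": 1.0}, alpha=0.2)
```

In practice the same arithmetic would be applied tensor-by-tensor over the full state dict of both checkpoints.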