DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

arXiv cs.CV / 4/27/2026


Key Points

  • Proposes DocPrune, a training-free, progressive document-token pruning framework aimed at making document QA over long documents more efficient.
  • Exploits the structural sparsity unique to document images (sparse supporting evidence scattered across large backgrounds) to remove unnecessary tokens, such as background or question-irrelevant ones.
  • Automatically selects the layer at which to begin pruning based on the model's level of comprehension, achieving effective token reduction while limiting performance degradation.
  • Experiments on M3DocRAG report throughput gains of 3.0x in the encoder and 3.3x in the decoder, plus a +1.0 F1 improvement, achieving both accuracy and efficiency without any additional training.

Abstract

Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model's level of comprehension. Our experiments on the M3DocRAG benchmark show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.
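To make the core idea concrete, here is a minimal, hypothetical sketch of question-aware token pruning: document tokens are scored by similarity to the question embedding and only the top fraction is kept. This is an illustration under assumed inputs (`token_feats`, `question_feats` as embedding matrices; `prune_tokens` and `keep_ratio` are invented names), not DocPrune's actual scoring, background handling, or comprehension-based layer selection.

```python
import numpy as np

def prune_tokens(token_feats, question_feats, keep_ratio=0.3):
    """Keep the document tokens most similar to the question.

    token_feats:    (N, d) array of document-token embeddings
    question_feats: (Q, d) array of question-token embeddings
    keep_ratio:     fraction of document tokens to retain

    Hypothetical sketch -- DocPrune's real method differs in detail.
    """
    # Mean question embedding, normalized for cosine similarity
    q = question_feats.mean(axis=0)
    q = q / (np.linalg.norm(q) + 1e-8)
    t = token_feats / (np.linalg.norm(token_feats, axis=1, keepdims=True) + 1e-8)

    # Score each document token by cosine similarity to the question
    scores = t @ q

    # Retain the top-k tokens, preserving their original order
    k = max(1, int(len(scores) * keep_ratio))
    keep_idx = np.sort(np.argsort(scores)[-k:])
    return keep_idx, token_feats[keep_idx]
```

A progressive variant, as the abstract suggests, would apply such pruning at several layers of the encoder, starting at a layer chosen from how confident the model already is about the answer.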