Parser-Oriented Structural Refinement for a Stable Layout Interface in Document Parsing

arXiv cs.CV / 4/6/2026


Key Points

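  • The paper points out that, in explicit Document Layout Analysis (DLA) pipelines, the set of layout instances retained and serialized from the detector output can become inconsistent with the input order expected by the downstream parser.
  • On dense pages with overlapping regions and ambiguous boundaries, the retained layout hypotheses become unstable and can cause severe downstream parsing errors, particularly order and index mismatches.
  • It proposes inserting a lightweight structural refinement stage between a DETR-style detector and the parser. Using set-level reasoning over query features, semantic cues, box geometry, and visual evidence, this stage jointly handles instance retention, box localization refinement, and parser input order prediction.
  • Retention-oriented supervision and a difficulty-aware ordering objective improve the alignment between the retained set and the final parser input, especially on structurally complex pages.
  • The authors report consistent gains in page-level layout quality on public benchmarks, a substantial reduction in sequence mismatch when the module is integrated into a standard end-to-end pipeline, and a Reading Order Edit of 0.024 on OmniDocBench.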
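
To make the handoff in the key points above concrete, here is a minimal sketch of how a retained, ordered parser input could be built from a pool of detector hypotheses. It is not the paper's implementation: the `Hypothesis` fields, the `keep_score` and `order_score` values, and the 0.5 threshold are all hypothetical, and the learned refinement module that would produce those scores is not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Hypothesis:
    """One layout hypothesis after refinement (all fields are illustrative)."""
    label: str
    box: Box
    keep_score: float   # hypothetical retention probability in [0, 1]
    order_score: float  # hypothetical ordering score; lower means earlier


def build_parser_input(pool: List[Hypothesis],
                       keep_threshold: float = 0.5) -> List[Hypothesis]:
    """Retain hypotheses above the threshold, then serialize them by order score.

    Retention and ordering are applied to the same pool, so the sequence
    handed to the parser contains exactly the retained instances.
    """
    retained = [h for h in pool if h.keep_score >= keep_threshold]
    return sorted(retained, key=lambda h: h.order_score)


pool = [
    Hypothesis("text", (50, 200, 500, 400), keep_score=0.92, order_score=1.7),
    Hypothesis("title", (50, 40, 500, 90), keep_score=0.97, order_score=0.1),
    # Near-duplicate of the first text block; a low keep score drops it.
    Hypothesis("text", (52, 205, 498, 398), keep_score=0.12, order_score=1.8),
    Hypothesis("table", (50, 420, 500, 700), keep_score=0.81, order_score=2.6),
]

parser_input = build_parser_input(pool)
print([h.label for h in parser_input])  # ['title', 'text', 'table']
```

The sketch filters and sorts one list in a single function, so the parser never receives an order that refers to a dropped instance. That is the kind of order and index mismatch the paper aims to avoid.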

Abstract

Accurate document parsing requires both robust content recognition and a stable parser interface. In explicit Document Layout Analysis (DLA) pipelines, downstream parsers do not consume the full detector output. Instead, they operate on a retained and serialized set of layout instances. However, on dense pages with overlapping regions and ambiguous boundaries, unstable layout hypotheses can make the retained instance set inconsistent with its parser input order, leading to severe downstream parsing errors. To address this issue, we introduce a lightweight structural refinement stage between a DETR-style detector and the parser to stabilize the parser interface. Treating raw detector outputs as a compact hypothesis pool, the proposed module performs set-level reasoning over query features, semantic cues, box geometry, and visual evidence. From a shared refined structural state, it jointly determines instance retention, refines box localization, and predicts parser input order before handoff. We further introduce retention-oriented supervision and a difficulty-aware ordering objective to better align the retained instance set and its order with the final parser input, especially on structurally complex pages. Extensive experiments on public benchmarks show that our method consistently improves page-level layout quality. When integrated into a standard end-to-end parsing pipeline, the stabilized parser interface also substantially reduces sequence mismatch, achieving a Reading Order Edit of 0.024 on OmniDocBench.
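
For intuition about the reported Reading Order Edit score, the sketch below computes a normalized edit distance between a predicted and a reference reading order. It assumes the metric is a Levenshtein distance over sequences of block indices, divided by the longer sequence length. The function names are hypothetical, and OmniDocBench's exact normalization and matching procedure may differ.

```python
from typing import Sequence


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def reading_order_edit(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Edit distance normalized to [0, 1]; 0.0 means identical order."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 0.0
    return levenshtein(pred, ref) / longest


reference = [0, 1, 2, 3]
print(reading_order_edit([0, 1, 2, 3], reference))  # 0.0
print(reading_order_edit([0, 2, 1, 3], reference))  # 0.5  (two blocks swapped)
print(reading_order_edit([0, 1, 3], reference))     # 0.25 (one block missing)
```

Under this assumed definition, lower is better: a single swapped pair or dropped block on a short page already produces a large score.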