Benchmarking Complex Multimodal Document Processing Pipelines: A Unified Evaluation Framework for Enterprise AI
arXiv cs.AI / 4/30/2026
Key Points
- The paper argues that enterprise document AI is typically a multi-stage pipeline (parse → index → retrieve → generate) and that evaluating these systems end-to-end remains difficult compared with evaluating each stage in isolation; illustrative sketches of this pipeline and of the metrics mentioned below follow the list.
- It introduces EnterpriseDocBench, a unified evaluation framework and corpus spanning six enterprise domains, assessing parsing fidelity, indexing efficiency, retrieval relevance, and generation groundedness using the same GPT-5 generator across three retrieval pipelines (BM25, dense, hybrid).
- The results show hybrid retrieval slightly outperforming BM25 on nDCG@5 (0.92 vs. 0.91), with both well ahead of dense embedding retrieval (0.83); separately, hallucination rates do not increase monotonically with document length.
- Cross-stage correlations are very weak (e.g., parsing→retrieval r=0.14, parsing→generation r=0.17, retrieval→generation r=0.02), challenging assumptions that quality cascades strongly across pipeline stages.
- The authors find factual accuracy on stated claims is relatively high (85.5%) but answer completeness is low on average (0.40), suggesting that omissions may be a more deployment-critical weakness than headline accuracy.
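For the pipeline structure in the first key point, here is a minimal sketch in Python, assuming naive paragraph splitting and token-overlap retrieval; every function and component name is hypothetical and stands in for the paper's unspecified components:

```python
# Minimal sketch of a four-stage document pipeline: parse -> index -> retrieve -> generate.
# All names and logic are illustrative; the paper's actual components are not specified here.

def parse(raw_document: str) -> list[str]:
    """Split a raw document into passages (here: naive paragraph split)."""
    return [p.strip() for p in raw_document.split("\n\n") if p.strip()]

def index(passages: list[str]) -> dict[int, set[str]]:
    """Build a toy index: passage id -> set of lowercased tokens."""
    return {i: set(p.lower().split()) for i, p in enumerate(passages)}

def retrieve(query: str, idx: dict[int, set[str]], passages: list[str], k: int = 2) -> list[str]:
    """Rank passages by token overlap with the query and return the top-k."""
    q = set(query.lower().split())
    ranked = sorted(idx, key=lambda i: len(q & idx[i]), reverse=True)
    return [passages[i] for i in ranked[:k]]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for an LLM call: just echo the query with its retrieved context."""
    return f"Q: {query}\nContext: {' | '.join(context)}"

doc = "Invoices are due in 30 days.\n\nLate payments incur a 2% fee.\n\nContact billing for disputes."
passages = parse(doc)
print(generate("When are invoices due?", retrieve("When are invoices due?", index(passages), passages)))
```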
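The hybrid pipeline combines sparse (BM25) and dense retrieval, but the summary does not say how the two are fused; reciprocal rank fusion is one common choice, sketched here with made-up document ids:

```python
# Hedged sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# The paper's actual fusion method is not stated; RRF is only one common option.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids; k dampens the influence of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7", "d2"]   # ids as ranked by a BM25 retriever
dense_ranking = ["d1", "d7", "d3", "d9"]  # ids as ranked by a dense embedding retriever
print(rrf([bm25_ranking, dense_ranking]))  # -> ['d1', 'd3', 'd7', 'd2', 'd9']
```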
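nDCG@5, the metric behind the 0.92 / 0.91 / 0.83 comparison, is the discounted cumulative gain of the top five results normalised by the gain of an ideal ranking; a standard implementation, with illustrative binary relevance labels rather than the benchmark's own judgments:

```python
import math

# Standard nDCG@k: discounted gain of the predicted ranking, normalised by the
# ideal (best possible) ranking's gain for the same relevance labels.

def dcg(relevances: list[float], k: int) -> float:
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances: list[float], k: int = 5) -> float:
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the top-5 passages returned by a retriever, in rank order.
print(round(ndcg([1, 1, 0, 1, 0], k=5), 3))  # ~0.967
```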
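The weak cross-stage correlations in the fourth point read like per-document Pearson correlations between stage-level scores; a quick way to run the same kind of check on your own pipeline outputs, using toy numbers rather than the paper's data:

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

# Toy per-document scores for two adjacent stages; real values would come from
# the benchmark's per-stage metrics. These numbers are illustrative only.
parsing_scores = [0.91, 0.74, 0.88, 0.65, 0.97, 0.80]
retrieval_scores = [0.55, 0.62, 0.48, 0.70, 0.52, 0.66]

# A small |r| means per-document parsing quality explains little of retrieval quality.
print(round(correlation(parsing_scores, retrieval_scores), 2))
```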
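Finally, the accuracy-versus-completeness split in the last point maps naturally onto precision and recall over atomic claims; the paper's actual scoring protocol is not described in this summary, so the sketch below is only an assumed formalisation:

```python
# Hedged sketch: accuracy of stated claims vs. completeness of the answer,
# treated as precision and recall over atomic claims. The paper's real scoring
# protocol (likely LLM- or annotator-based) is not reproduced here.

def claim_accuracy(answer_claims: set[str], supported_claims: set[str]) -> float:
    """Fraction of claims in the answer that are supported by the source."""
    return len(answer_claims & supported_claims) / len(answer_claims) if answer_claims else 0.0

def completeness(answer_claims: set[str], gold_claims: set[str]) -> float:
    """Fraction of gold claims that the answer actually covers."""
    return len(answer_claims & gold_claims) / len(gold_claims) if gold_claims else 0.0

gold = {"due in 30 days", "2% late fee", "billing handles disputes"}
answer = {"due in 30 days", "2% late fee"}  # every stated claim is correct, one gold claim is omitted
print(claim_accuracy(answer, gold), round(completeness(answer, gold), 2))  # 1.0 0.67
```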