When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation

arXiv cs.CV / 5/5/2026


Key Points

  • The paper argues that character-level OCR benchmarks, which score models by character error rate (CER) and word error rate (WER), are insufficient for predicting real-world retrieval-augmented generation (RAG) performance in industrial settings (a minimal sketch of these metrics follows this list).
  • It introduces InduOCRBench, an OCR benchmark tailored for industrial RAG, covering 11 difficult document types such as extreme layouts, historical reading orders, watermarked/complex backgrounds, decorated text, and pages with tables and math.
  • Experiments with recent state-of-the-art OCR models in a controlled OCR-first RAG pipeline show substantial downstream performance drops on realistic documents even when conventional OCR scores are strong.
  • The authors find that substantial retrieval failures can persist despite high OCR accuracy, because structural and semantic OCR errors can break retrieval and downstream generation; this mismatch varies by document category.
  • The benchmark is released publicly on GitHub to support more RAG-relevant evaluation of OCR robustness.
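
For concreteness, CER and WER are the standard edit-distance metrics the paper argues against relying on in isolation. The sketch below uses the common Levenshtein-based definitions; it is illustrative, not the paper's evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (rolling-row DP)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances against an empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution / match
        prev = curr
    return prev[n]

def cer(ref_text, hyp_text):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(list(ref_text), list(hyp_text)) / max(len(ref_text), 1)

def wer(ref_text, hyp_text):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words, hyp_words = ref_text.split(), hyp_text.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

The limitation the paper highlights falls out directly from these definitions: an OCR output that swaps two table columns or merges a caption into body text changes few characters, so CER and WER stay low, yet the resulting text can be semantically wrong for retrieval.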

Abstract

Industrial Retrieval-Augmented Generation (RAG) systems depend on optical character recognition (OCR) to transform visual documents into text. Existing OCR benchmarks rely on character-level metrics, which inadequately measure downstream RAG effectiveness under real-world conditions. We introduce an OCR benchmark for industrial RAG systems covering 11 challenging document types, including extreme layouts, high-resolution pages, complex or watermarked backgrounds, historical documents with non-standard reading orders, visually decorated text, and documents containing tables and mathematical formulas. Evaluating recent SOTA OCR models under a controlled OCR-first RAG pipeline shows clear performance degradation on realistic industrial documents despite strong conventional benchmark scores. We find that high OCR accuracy does not necessarily translate into strong downstream RAG performance: structural and semantic errors can cause substantial retrieval failures even when WER/CER remains low. Further analysis shows that this mismatch is category-dependent, arises through both retrieval-side and downstream generation-side failures, and remains stable across representative OCR-first pipeline choices. The benchmark is publicly available at https://github.com/Qihoo360/InduOCRBench.
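
As a reference point for the "controlled OCR-first RAG pipeline" the abstract describes, the skeleton below shows the stage ordering such a pipeline implies. It is a minimal sketch: the OCR call is stubbed, retrieval is a toy bag-of-words cosine similarity, and all function names are illustrative rather than taken from the paper.

```python
from collections import Counter
import math

def ocr(page_image):
    # Stub: a real pipeline would invoke an OCR model here (hypothetical).
    return "text extracted from " + page_image

def chunk(text, size=200):
    # Fixed-size character chunks; real systems often chunk by layout or semantics.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text):
    # Toy stand-in for a dense embedder: a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def answer(query, pages):
    # OCR-first: every downstream stage consumes the OCR text verbatim.
    chunks = [c for p in pages for c in chunk(ocr(p))]
    context = "\n".join(retrieve(query, chunks))
    # A real pipeline would prompt an LLM with the retrieved context here.
    return context
```

Because every later stage consumes the OCR output verbatim, a reading-order or table-structure error is baked into the chunks and their embeddings, which is how retrieval and generation can fail even when character-level scores look strong.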