A Survey of OCR Evaluation Methods and Metrics and the Invisibility of Historical Documents

arXiv cs.CV · March 30, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper surveys how OCR and document-understanding systems are evaluated (2006–2025) and finds evaluations skew toward modern, Western, institutional documents rather than historical or marginalized archives.
  • It reports that Black historical newspapers and similar community-produced documents are rarely included in reported training data or benchmark datasets, leading to a blind spot in what systems are tested on.
  • The review shows many evaluations focus on character accuracy and surface task success, while often missing structural failure modes common in historical material (e.g., column collapse, typographic errors, and hallucinated text).
  • Drawing on archival and empirical evidence, the study argues that these evaluation gaps contribute to “structural invisibility” and representational harm, driven by organizational and institutional behaviors, benchmark incentives, and data governance choices.
  • The authors propose that benchmark and governance design should better reflect historical document complexity to prevent systematic misrepresentation by vision transformer and multimodal OCR systems.
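The character-accuracy scores critiqued above are typically variants of character error rate (CER): edit distance between the OCR output and a ground-truth transcription, divided by the transcription length. A minimal sketch (not the paper's own code; function names are illustrative) shows why such a score, however it is computed, says nothing about whether column order or article structure survived:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via standard dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

# Two substitutions ('o' -> '0', 'i' -> '1') over 27 characters:
reference  = "the negro press stands firm"
hypothesis = "the negr0 press stands f1rm"
# cer(reference, hypothesis) is about 0.074, i.e. ~93% character accuracy --
# a single scalar that carries no signal about collapsed columns,
# scrambled reading order, or hallucinated passages elsewhere on the page.
```

A page whose columns were merged in the wrong order can still score a deceptively low CER on each fragment, which is exactly the kind of structural failure mode the survey finds under-measured.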

Abstract

Optical character recognition (OCR) and document understanding systems increasingly rely on large vision and vision-language models, yet evaluation remains centered on modern, Western, and institutional documents. This emphasis masks system behavior in historical and marginalized archives, where layout, typography, and material degradation shape interpretation. This study examines how OCR and document understanding systems are evaluated, with particular attention to Black historical newspapers. Using the PRISMA framework, we review OCR and document understanding papers and benchmark datasets published between 2006 and 2025, examining how these studies report training data, benchmark design, and evaluation metrics for vision transformer and multimodal OCR systems. We find that Black newspapers and other community-produced historical documents rarely appear in reported training data or evaluation benchmarks. Most evaluations emphasize character accuracy and task success on modern layouts, and rarely capture structural failures common in historical newspapers, including column collapse, typographic errors, and hallucinated text. To put these findings in perspective, we draw on prior empirical studies and archival statistics from major Black press collections to show how evaluation gaps produce structural invisibility and representational harm. We propose that these gaps arise from organizational (meso) and institutional (macro) behaviors and structures, shaped by benchmark incentives and data governance decisions.