A Benchmark of State-Space Models vs. Transformers and BiLSTM-based Models for Historical Newspaper OCR

arXiv cs.CV / 4/2/2026


Key Points

  • The paper benchmarks linear-time State-Space Models (SSMs) using Mamba-based OCR architectures against Transformer- and BiLSTM-based recognizers for historical newspaper transcription, addressing long-sequence and degraded-layout challenges.
  • It introduces (to the authors’ knowledge) the first OCR architecture based on SSMs, pairing a CNN visual encoder with bidirectional and autoregressive Mamba sequence modeling and evaluating multiple decoding strategies (CTC, autoregressive, non-autoregressive).
  • Experiments on newly released >99% verified gold-standard Luxembourg newspaper data and cross-dataset tests on Fraktur/Antiqua show all neural systems reach ~2% CER, so computational efficiency becomes the key differentiator.
  • Mamba-based models remain competitive in accuracy while cutting inference time roughly in half and improving memory scaling; at the severely degraded paragraph level they stay close to DAN (6.07% vs. 5.24% CER) while running about twice as fast.
  • The authors release code, trained models, and standardized evaluation protocols to support reproducible, large-scale cultural heritage OCR development.
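The linear-time sequence modeling the paper attributes to SSMs can be illustrated with a toy diagonal state-space scan. This is a hypothetical sketch, not the paper's architecture: real Mamba blocks make the transition parameters input-dependent (the "selective" mechanism) and use hardware-aware kernels, and the paper's model pairs such blocks with a CNN visual encoder. The names `ssm_scan` and `bidirectional_ssm` are ours.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """One pass of a diagonal linear SSM:
    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    Runs in O(L) time with O(d) state, unlike O(L^2) attention."""
    L, d = x.shape
    h = np.zeros(d)
    ys = np.empty((L, d))
    for t in range(L):
        h = a * h + b * x[t]   # constant-size recurrent state
        ys[t] = c * h
    return ys

def bidirectional_ssm(x, a, b, c):
    """Bidirectional variant: forward scan plus a scan over the
    reversed sequence, concatenated along the feature axis."""
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

# Toy usage: a 4-step sequence of 3-dim "visual features".
x = np.arange(12, dtype=float).reshape(4, 3)
out = bidirectional_ssm(x, a=0.5, b=1.0, c=1.0)
print(out.shape)  # (4, 6): per-step features from both directions
```

The constant-size recurrent state is what gives the memory-scaling advantage the paper reports (1.26x vs. 2.30x growth at 1000 characters): sequence length only affects the loop count, not the state.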

Abstract

End-to-end OCR for historical newspapers remains challenging, as models must handle long text sequences, degraded print quality, and complex layouts. While Transformer-based recognizers dominate current research, their quadratic complexity limits efficient paragraph-level transcription and large-scale deployment. We investigate linear-time State-Space Models (SSMs), specifically Mamba, as a scalable alternative to Transformer-based sequence modeling for OCR. We present, to our knowledge, the first OCR architecture based on SSMs, combining a CNN visual encoder with bidirectional and autoregressive Mamba sequence modeling, and conduct a large-scale benchmark comparing SSMs with Transformer- and BiLSTM-based recognizers. Multiple decoding strategies (CTC, autoregressive, and non-autoregressive) are evaluated under identical training conditions alongside strong neural baselines (VAN, DAN, DANIEL) and widely used off-the-shelf OCR engines (PERO-OCR, Tesseract OCR, TrOCR, Gemini). Experiments on historical newspapers from the Bibliothèque nationale du Luxembourg, with newly released >99% verified gold-standard annotations, and cross-dataset tests on Fraktur and Antiqua lines, show that all neural models achieve low error rates (~2% CER), making computational efficiency the main differentiator. Mamba-based models maintain competitive accuracy while halving inference time and exhibiting superior memory scaling (1.26x vs 2.30x growth at 1000 chars), reaching 6.07% CER at the severely degraded paragraph level compared to 5.24% for DAN, while remaining 2.05x faster. We release code, trained models, and standardized evaluation protocols to enable reproducible research and guide practitioners in large-scale cultural heritage OCR.
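Character Error Rate (CER), the headline metric throughout, is Levenshtein edit distance between hypothesis and reference, normalized by reference length. A minimal reference implementation (ours, not the authors' evaluation code) makes the figure concrete:

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions, and
    substitutions turning ref into hyp (dynamic programming,
    keeping only the previous row)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n]

def cer(ref, hyp):
    """Edit distance normalized by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

# One substitution in a 9-character reference: CER = 1/9 ≈ 11.1%
print(cer("newspaper", "newspapor"))
```

At the ~2% CER the benchmarked systems reach, roughly one character in fifty is wrong, which is why the paper treats accuracy as saturated and turns to inference time and memory as the deciding factors.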