PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

arXiv cs.CV / 3/26/2026


Key Points

  • The paper presents PP-OCRv5, a specialized OCR model with only about 5M parameters that competes with many billion-parameter vision-language models on common OCR benchmarks.
  • It argues that accuracy is not driven only by architectural scaling, showing improved localization precision and fewer text hallucinations relative to large, unified VLM-style approaches.
  • The authors attribute performance gains largely to data-centric optimization, systematically analyzing the impact of training data difficulty, accuracy, and diversity.
  • Experiments suggest that sufficiently large volumes of high-quality, well-labeled, and diverse data can raise the achievable ceiling of efficient two-stage OCR pipelines beyond typical assumptions.
  • Code and models are released publicly via PaddlePaddle’s PaddleOCR repository, aiming to enable practical adoption and data-curation guidance for OCR systems.

Abstract

The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.
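The "efficient two-stage pipeline" the abstract refers to can be sketched as detection followed by recognition on each detected region. The snippet below is a minimal toy illustration of that structure only; the `detect` and `recognize` callables are hypothetical stand-ins, not PP-OCRv5's actual models or PaddleOCR's API.

```python
# Toy sketch of a two-stage OCR pipeline: detect text regions first,
# then recognize the text inside each cropped region.
# NOTE: `detect` and `recognize` are hypothetical stand-in callables,
# not the real PP-OCRv5 components.
from dataclasses import dataclass, field

@dataclass
class TextRegion:
    box: tuple                 # (x, y, w, h) from the detection stage
    text: str = ""             # filled in by the recognition stage
    confidence: float = 0.0

def two_stage_ocr(image, detect, recognize):
    """Stage 1: localize text boxes. Stage 2: recognize each crop."""
    regions = [TextRegion(box=b) for b in detect(image)]
    for r in regions:
        x, y, w, h = r.box
        crop = [row[x:x + w] for row in image[y:y + h]]  # crop the box
        r.text, r.confidence = recognize(crop)
    return regions

# Stub stages for illustration: the "image" is a 2D grid of characters,
# the detector returns fixed line boxes, the recognizer joins characters.
fake_detect = lambda img: [(0, 0, 3, 1), (0, 1, 3, 1)]
fake_recognize = lambda crop: ("".join(crop[0]), 0.99)

image = [list("abc"), list("def")]
results = two_stage_ocr(image, fake_detect, fake_recognize)
# results[0].text == "abc", results[1].text == "def"
```

Keeping the two stages separate is what preserves explicit box coordinates for every recognized string, which is the localization advantage the paper contrasts with unified VLM-style decoding.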