I recently had to process ~940,000 PDFs. I started with the standard OCR tools, but the bottleneck was frustrating: even on an RTX 5090, throughput was far too low.
The Problem:
- PaddleOCR (the most popular open-source OCR): maxed out at ~15 img/s, with GPU utilization hovering around 15%. Its high-performance inference mode doesn't support Blackwell GPUs yet (it requires CUDA < 12.8), and it doesn't work with the Latin recognition model either.
- Any VLM OCR (via vLLM): great accuracy, but crawled at 2 img/s at best. At nearly a million pages, the time and cost were prohibitive.
The Solution: A C++/CUDA Inference Server
PaddleOCR bottlenecks on Python overhead and single-stream execution, so the GPU was barely being used. The fix was a C++ server wrapping the PP-OCRv5-mobile models with TensorRT FP16 and multi-stream concurrency, served via gRPC/HTTP. GPU utilisation went from 15% to 99%, and throughput multiplied compared to PaddleOCR's own library. Claude Code and Gemini CLI did most of the coding.
Benchmarks (Linux / RTX 5090 / CUDA 13.1)
- Text-heavy pages: 100+ img/s
- Sparse/low-text pages: 1,000+ img/s
Trade-offs
- Accuracy vs. speed: this trades layout accuracy for raw speed. There is no multi-column reading-order detection or complex table extraction. If you need those, GLM-OCR, Paddle-VL, or other VLM-based OCR systems are better options.
Source for those interested: github.com/aiptimizer/turbo-ocr