TurboOCR: 270–1200 img/s OCR with Paddle + TensorRT (C++/CUDA, FP16) [P]

Reddit r/MachineLearning / 4/13/2026


Key Points

  • The article describes TurboOCR, a C++/CUDA OCR system that replaces single-threaded Python PaddleOCR inference with FP16 TensorRT, fused kernels, batching, and a multi-stream pipeline to greatly increase throughput.
  • In tests on Linux with RTX 50-series and CUDA 13.2, TurboOCR reportedly achieves about 270 img/s on text-heavy pages and 1,200+ img/s on sparse pages, enabling near-instant indexing for large-scale RAG workflows.
  • TurboOCR accepts images and PDFs via HTTP/gRPC and returns bounding boxes, recognized text, and layout region labels using PP-DocLayoutV3 (25 classes), with layout processing adding roughly 20% latency when enabled.
  • The author notes trade-offs: complex table extraction and structured outputs (e.g., invoice-to-JSON) still require VLM-based OCR approaches like PaddleOCR-VL.
  • The author states plans to add structured extraction, markdown output, table parsing, and more languages while minimizing speed regressions, and provides a GitHub link for adoption.

I had about 940,000 PDFs to process. Running VLMs over a million pages is slow and expensive, and that gap is only getting worse as OCR moves toward transformer and VLM-based approaches. They’re great for complex understanding, but throughput and cost can become a bottleneck at scale.

PaddleOCR (the non-VL version), in my opinion the best non-VLM open-source OCR, only managed ~15 img/s on my RTX 5090, which was still too slow. PaddleOCR-VL crawled along at 2 img/s under vLLM.
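To make those rates concrete, here's a back-of-envelope estimate (my numbers, not the author's: I'm assuming ~1,000,000 pages total and perfect single-GPU utilization) of how long the baseline throughputs would take:

```python
# Back-of-envelope wall-clock time to OCR ~1,000,000 pages at the
# measured single-GPU rates (assumes perfect, sustained utilization).
pages = 1_000_000

def hours(rate_img_per_s: float) -> float:
    """Processing time in hours at a given images-per-second rate."""
    return pages / rate_img_per_s / 3600

print(f"PaddleOCR (Python, ~15 img/s): {hours(15):.1f} h")   # ~18.5 h
print(f"PaddleOCR-VL (vLLM, ~2 img/s): {hours(2):.1f} h")    # ~138.9 h
```

At 270 img/s the same workload drops to roughly an hour, which is the gap the rest of the post is about.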

Stock PaddleOCR runs single-threaded Python with FP32 inference and no kernel fusion. TurboOCR replaces that with C++/CUDA, FP16 TensorRT, fused kernels, batched recognition, and multi-stream pipeline pooling. It takes images and PDFs via HTTP/gRPC and returns bounding boxes, text, and layout regions (PP-DocLayoutV3, 25 classes).
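The batched-recognition idea is worth sketching. The real pipeline is C++/CUDA with TensorRT streams; this is only a toy Python stand-in (queue, batch size, and sentinel are my assumptions) showing how incoming text crops get grouped so the recognizer runs one inference per batch instead of per image:

```python
from queue import Queue, Empty
from threading import Thread

MAX_BATCH = 8      # assumed max recognition batch size
TIMEOUT_S = 0.01   # flush a partial batch if no new work arrives

def batcher(in_q: Queue, out_batches: list) -> None:
    """Drain requests into fixed-size batches; flush partial
    batches on timeout, and stop on a None sentinel."""
    batch = []
    while True:
        try:
            item = in_q.get(timeout=TIMEOUT_S)
        except Empty:
            if batch:                     # idle: flush what we have
                out_batches.append(batch)
                batch = []
            continue
        if item is None:                  # sentinel: final flush, stop
            if batch:
                out_batches.append(batch)
            return
        batch.append(item)
        if len(batch) == MAX_BATCH:       # full batch: hand to recognizer
            out_batches.append(batch)
            batch = []

q: Queue = Queue()
for crop_id in range(20):   # 20 fake cropped text regions
    q.put(crop_id)
q.put(None)                 # sentinel

batches: list = []
t = Thread(target=batcher, args=(q, batches))
t.start()
t.join()
print([len(b) for b in batches])  # [8, 8, 4]
```

In the real system each full batch would go to a TensorRT execution context on its own CUDA stream, so detection, cropping, and recognition overlap instead of running serially.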

Layout is toggleable per request and only adds ~20% to inference time.
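Assuming the ~20% overhead applies uniformly per image (my simplification), the throughput cost of enabling layout is easy to estimate:

```python
# Rough effect of the ~20% layout overhead on throughput,
# assuming the overhead applies uniformly per image.
base = 270.0                 # img/s, text-heavy pages, layout off
with_layout = base / 1.20    # 20% more time per image
print(f"{with_layout:.0f} img/s with layout enabled")  # 225 img/s
```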

Results: 270 img/s on text-heavy pages without layout, 1,200+ on sparse ones. Works well for real-time RAG where you need a document indexed instantly, or for bulk processing large collections cheaply.

Trade-offs: complex table extraction and structured output (invoice → JSON) still need VLM-based OCR like PaddleOCR-VL. I'm working on bringing structured extraction, markdown output, table parsing, and more languages to TurboOCR while sacrificing as little speed as possible.

Tested on Linux, RTX 50-series, CUDA 13.2.

https://github.com/aiptimizer/TurboOCR

submitted by /u/Civil-Image5411