[D] Large scale OCR

Reddit r/MachineLearning / 4/10/2026

💬 Opinion | Developer Stack & Infrastructure | Tools & Practical Usage

Key Points

  • The post asks for the most cost-effective approach to OCR about 50 million pages of legal documents under a strict one-week processing timeline.
  • The requester prioritizes extracting text only, stating that preserving page layout is not important, which can simplify the OCR pipeline.
  • The question is framed around large-scale throughput planning, implicitly raising concerns about batching, parallelization, and automation across massive document volumes.
  • The scenario calls for weighing OCR accuracy against speed, with model choice and infrastructure design as the main levers for meeting the deadline economically.
  • It is a practical, execution-focused inquiry rather than a report of any new system or release.
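The batching and parallelization concern raised in the points above can be sketched roughly as follows. This is a minimal illustration, not a recommended pipeline: `ocr_page` is a hypothetical placeholder for whatever OCR engine is chosen, and the batch size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def ocr_page(path: str) -> str:
    """Hypothetical placeholder: swap in a call to a real OCR engine."""
    return f"text of {path}"


def ocr_batch(paths: List[str]) -> List[str]:
    # Text only: layout is discarded, so each page reduces to one string.
    return [ocr_page(p) for p in paths]


def run(paths: Iterable[str], batch_size: int = 64, workers: int = 8) -> List[str]:
    # Threads keep the sketch simple; a CPU-bound engine would more likely
    # need processes or separate machines pulling from a shared queue.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(ocr_batch, batched(paths, batch_size))
        # map() preserves input order, so the output lines up with `paths`.
        return [text for batch in results for text in batch]
```

At tens of millions of pages the results would be streamed to storage per batch rather than collected in one list; the in-memory return value here is only for illustration.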

I need to OCR 50 million pages of legal documents. I'm only interested in the text; layout is not very important.

What is the most cost-effective way to tackle this without it taking longer than 1 week?
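For a sense of scale, the sustained throughput these numbers imply can be worked out with a few lines of arithmetic. The page count and the one-week window come from the question; the per-worker rate passed to `workers_needed` is purely an assumed figure that would have to be measured for a real engine and real hardware.

```python
import math

PAGES = 50_000_000
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds


def required_pages_per_second(pages: int, seconds: int) -> float:
    """Sustained rate needed to finish `pages` within `seconds`."""
    return pages / seconds


def workers_needed(pages: int, seconds: int, pages_per_sec_per_worker: float) -> int:
    """Workers required, assuming perfect scaling and no downtime."""
    return math.ceil(pages / (seconds * pages_per_sec_per_worker))


rate = required_pages_per_second(PAGES, SECONDS_PER_WEEK)
print(f"{rate:.1f} pages/s sustained")  # about 82.7 pages/s

# Assumed per-worker throughput of 2 pages/s (hypothetical, not a benchmark).
print(workers_needed(PAGES, SECONDS_PER_WEEK, 2.0), "workers")
```

Any retries, queueing overhead, or startup time would push the real requirement above this lower bound.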

submitted by /u/vroemboem