[D] Large scale OCR

Reddit r/MachineLearning / 4/10/2026

Key Points

The post asks for the most cost-effective approach to OCR about 50 million pages of legal documents under a strict one-week processing timeline.
The requester prioritizes extracting text only, stating that preserving page layout is not important, which can simplify the OCR pipeline.
The question is at heart one of throughput planning: meeting the deadline requires batching, parallelization, and automation across the whole corpus.
Answering it means weighing OCR accuracy against speed and cost, including the choice of OCR model and the infrastructure it runs on.
It is a practical, execution-focused inquiry rather than a report of any new system or release.
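
To make the throughput constraint concrete, here is a rough back-of-the-envelope sketch in Python. It only does the arithmetic implied by the post (50 million pages, 7 days). The per-worker rate of 2 pages per second is a placeholder assumption, not a measured figure for any OCR engine, and the function names are illustrative.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def required_pages_per_second(total_pages: int, deadline_days: float) -> float:
    """Sustained rate needed to finish total_pages within deadline_days."""
    return total_pages / (deadline_days * SECONDS_PER_DAY)


def workers_needed(total_pages: int, deadline_days: float,
                   pages_per_second_per_worker: float) -> int:
    """Parallel workers needed, assuming perfect scaling and no downtime."""
    rate = required_pages_per_second(total_pages, deadline_days)
    return math.ceil(rate / pages_per_second_per_worker)


if __name__ == "__main__":
    total_pages = 50_000_000   # from the post
    deadline_days = 7          # from the post
    per_worker_rate = 2.0      # ASSUMPTION: placeholder, benchmark your own setup

    rate = required_pages_per_second(total_pages, deadline_days)
    print(f"Required sustained rate: {rate:.1f} pages/s")
    print(f"Workers at {per_worker_rate} pages/s each: "
          f"{workers_needed(total_pages, deadline_days, per_worker_rate)}")
```

The required rate comes out to roughly 83 pages per second sustained for the full week. The worker count scales inversely with whatever per-worker rate a benchmark on real sample pages yields, and in practice some headroom for retries and failures would be needed on top.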

I need to OCR 50 million pages of legal documents. I'm only interested in the text; layout is not very important.

What is the most cost-effective way to tackle this without it taking longer than 1 week?