We benchmarked 18 LLMs on OCR (7k+ calls) — cheaper/old models oftentimes win. Full dataset + framework open-sourced. [R]

Reddit r/MachineLearning / 4/23/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The authors benchmarked 18 LLMs on OCR/document extraction using a curated set of 42 standard documents, running 10 trials per model (7,560 total calls) under identical conditions.
  • The main finding is that smaller and/or older models often achieve premium-level OCR accuracy while costing a fraction of the price, suggesting many teams have been overpaying by defaulting to the newest or biggest models.
  • They evaluate models using pass^n reliability at scale, cost-per-success, latency, and accuracy for critical fields, focusing on practical production concerns.
  • They open-sourced both the dataset/framework via the “ocr-mini-bench” GitHub repo and provided a public leaderboard, plus a free tool to test the user’s own documents.
  • The post invites others to confirm whether the same cost/accuracy pattern is observed in their own OCR workflows.

TL;DR: We were overpaying for OCR, so we compared flagship models against cheaper and older ones. New mini-bench + leaderboard, plus a free tool to test your own documents. Open source.

We’ve been looking at OCR / document extraction workflows and kept seeing the same pattern:

Too many teams are either stuck on legacy OCR pipelines or overpaying badly for LLM calls by defaulting to the newest/biggest model.

We put together a curated set of 42 standard documents and ran all 18 models on each one 10 times under identical conditions (7,560 total calls). Main takeaway: for standard OCR, smaller and older models match premium accuracy at a fraction of the cost.

We track pass^n (reliability at scale), cost-per-success, latency, and critical-field accuracy.
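For readers unfamiliar with these metrics, here is a minimal sketch of how pass^n and cost-per-success are commonly computed. The function names and numbers are illustrative assumptions, not taken from the ocr-mini-bench repo, which may define them differently.

```python
def pass_power_n(per_trial_pass_rate: float, n: int) -> float:
    """pass^n: probability that all n repeated trials succeed,
    assuming independent trials with the same per-trial pass rate."""
    return per_trial_pass_rate ** n

def cost_per_success(total_cost: float, successes: int) -> float:
    """Total spend divided by the number of successful extractions."""
    if successes == 0:
        return float("inf")
    return total_cost / successes

# Illustration: a model that passes 95% of single trials only
# clears all 10 repeated trials about 60% of the time.
p10 = pass_power_n(0.95, 10)      # ≈ 0.599
cps = cost_per_success(10.0, 40)  # $0.25 per successful extraction
```

The point of pass^n is that per-call accuracy overstates reliability at scale: small per-trial failure rates compound quickly when the same extraction must succeed repeatedly in production.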

Everything is open source: https://github.com/ArbitrHq/ocr-mini-bench

Leaderboard: https://arbitrhq.ai/leaderboards/

Curious whether this matches what others here are seeing.

submitted by /u/TimoKerre