
[R] IDP Leaderboard: Open benchmark for document AI across 16 VLMs, 9,000+ documents, 3 benchmark suites

Reddit r/MachineLearning / 3/12/2026

Key Points

  • The IDP Leaderboard is an open evaluation framework that tests 16 models across three benchmark suites: OlmOCR, OmniDoc, and IDP Core, which covers KIE, table extraction, VQA, OCR, classification, and long document processing.
  • Gemini 3.1 Pro leads the overall leaderboard with a score of 83.2, while cheaper model variants such as Flash and Sonnet nearly match flagship models on extraction tasks and diverge only on reasoning-heavy tasks like VQA.
  • GPT-5.4 improves substantially over GPT-4.1 (70 to 81 overall), with the largest gain on DocVQA, where its score rises from 42% to 91%.
  • Sparse unstructured table extraction remains the hardest task, with most models scoring below 55%; handwriting OCR tops out at 76%.
  • A Results Explorer tool has been introduced to display ground truth alongside model predictions for every document, improving transparency and aiding users in selecting the best model for their needs.

We're releasing the IDP Leaderboard, an open evaluation framework for document understanding tasks. We tested 16 models across OlmOCR, OmniDoc, and our own IDP Core benchmark (covering KIE, table extraction, VQA, OCR, classification, and long document processing).

Key results:

- Gemini 3.1 Pro leads overall (83.2) but the margin is tight: the top 5 are within 2.4 points.

- Cheaper model variants (Flash, Sonnet) produce nearly identical extraction quality to flagship models. The differentiation only appears on reasoning-heavy tasks like VQA.

- GPT-5.4 shows a significant jump over GPT-4.1 (70 to 81 overall, 42% to 91% on DocVQA).

- Sparse unstructured tables remain the hardest task. Most models are below 55%.

- Handwriting OCR tops out at 76%.

We also built a Results Explorer that shows ground truth alongside every model's raw prediction for every document. Not just scores.

This helps you decide which model works for you by actually seeing the predictions and the ground truths.
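To make the idea of reading predictions next to ground truth concrete, here is a minimal Python sketch of a field-by-field comparison for a single KIE document. It is only an illustration, not how the Results Explorer or the leaderboard scoring actually works: the function names (`compare_fields`, `field_accuracy`), the whitespace and case normalization, and the invoice values are all hypothetical and made up for this example.

```python
def _norm(value) -> str:
    """Collapse whitespace and lowercase so trivial formatting differences are ignored."""
    return " ".join(str(value).split()).lower()


def compare_fields(ground_truth: dict, prediction: dict) -> list:
    """Return one (field, expected, predicted, match) row per ground-truth field."""
    rows = []
    for field, expected in ground_truth.items():
        predicted = prediction.get(field)  # None if the model omitted the field
        match = predicted is not None and _norm(expected) == _norm(predicted)
        rows.append((field, expected, predicted, match))
    return rows


def field_accuracy(rows: list) -> float:
    """Fraction of ground-truth fields the prediction matched (0.0 for no fields)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row[3]) / len(rows)


# Made-up example document; these values are not from any benchmark.
ground_truth = {"invoice_number": "INV-001", "total": "120.50", "vendor": "Acme Corp"}
prediction = {"invoice_number": "inv-001", "total": "125.50"}

rows = compare_fields(ground_truth, prediction)
for field, expected, predicted, match in rows:
    status = "OK" if match else "MISMATCH"
    print(f"{field:15} expected={expected!r:12} predicted={predicted!r:12} {status}")
print(f"field accuracy: {field_accuracy(rows):.2f}")
```

A side-by-side view like this shows which fields a model gets wrong (a misread total versus a missing vendor), which a single aggregate score hides.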

Findings: https://nanonets.com/blog/idp-leaderboard-1-5/

Datasets: huggingface.co/collections/nanonets/idp-leaderboard

Leaderboard + Results Explorer: idp-leaderboard.org

submitted by /u/shhdwi