[R] IDP Leaderboard: Open benchmark for document AI across 16 VLMs, 9,000+ documents, 3 benchmark suites

Reddit r/MachineLearning / 3/12/2026



Key Points

The IDP Leaderboard is an open evaluation framework testing 16 document AI models across three benchmark suites focusing on various document understanding tasks including KIE, table extraction, VQA, OCR, classification, and long document processing.
Gemini 3.1 Pro leads the overall leaderboard with a score of 83.2. Cheaper model variants such as Flash and Sonnet perform nearly as well as flagship models on extraction tasks; the gap shows up mainly on reasoning-heavy tasks like VQA.
GPT-5.4 improves substantially on GPT-4.1, most of all on DocVQA, where its score rises from 42% to 91%.
Sparse unstructured table extraction remains the hardest task, with most models scoring below 55%. Handwriting OCR tops out at 76%.
A Results Explorer tool has been introduced to display ground truth alongside model predictions for every document, improving transparency and aiding users in selecting the best model for their needs.

We're releasing the IDP Leaderboard, an open evaluation framework for document understanding tasks. We tested 16 models across OlmOCR, OmniDoc, and our own IDP Core benchmark, which covers key information extraction (KIE), table extraction, VQA, OCR, classification, and long document processing.

Key results:

- Gemini 3.1 Pro leads overall (83.2) but the margin is tight. Top 5 within 2.4 points.

- Cheaper model variants (Flash, Sonnet) produce nearly identical extraction quality to flagship models. The differentiation only appears on reasoning-heavy tasks like VQA.

- GPT-5.4 shows a significant jump over GPT-4.1 (70 to 81 overall, 42% to 91% on DocVQA).

- Sparse unstructured tables remain the hardest task. Most models are below 55%.

- Handwriting OCR tops out at 76%.

We also built a Results Explorer that shows the ground truth alongside every model's raw prediction for every document, not just scores. Seeing the actual predictions next to the ground truth helps you decide which model works for your documents.
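To make the idea of viewing predictions next to ground truth concrete, here is a minimal sketch of a field-by-field comparison for a single KIE document. It is not the leaderboard's scoring method or the Results Explorer's code: the field names, the sample values, the `normalize` rule, and the helper names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldComparison:
    name: str
    expected: str
    predicted: Optional[str]
    match: bool


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple rule)."""
    return " ".join(value.lower().split())


def compare_fields(ground_truth: Dict[str, str],
                   prediction: Dict[str, str]) -> List[FieldComparison]:
    """Pair each ground-truth field with the model's prediction for it."""
    rows = []
    for name, expected in ground_truth.items():
        predicted = prediction.get(name)  # None if the model omitted the field
        match = predicted is not None and normalize(predicted) == normalize(expected)
        rows.append(FieldComparison(name, expected, predicted, match))
    return rows


def exact_match_rate(rows: List[FieldComparison]) -> float:
    """Fraction of ground-truth fields the prediction matched after normalization."""
    return sum(row.match for row in rows) / len(rows) if rows else 0.0


# Hypothetical document: values are made up for illustration.
ground_truth = {"invoice_number": "INV-1042", "total": "1,250.00", "vendor": "Acme Corp"}
prediction = {"invoice_number": "INV-1042", "total": "1250.00", "vendor": "ACME  Corp"}

rows = compare_fields(ground_truth, prediction)
for row in rows:
    status = "OK" if row.match else "DIFF"
    print(f"{status:4} {row.name:15} expected={row.expected!r} predicted={row.predicted!r}")
print(f"exact match rate: {exact_match_rate(rows):.2f}")
```

In this sample the `total` field counts as a miss only because of the thousands separator. A single aggregate score hides that kind of detail, while a side-by-side view makes it obvious.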

Findings: https://nanonets.com/blog/idp-leaderboard-1-5/

Datasets: huggingface.co/collections/nanonets/idp-leaderboard

Leaderboard + Results Explorer: idp-leaderboard.org

submitted by /u/shhdwi
