AI Navigate

Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

arXiv cs.CV / 3/20/2026


Key Points

  • They present a benchmarking framework for PDF table extraction that uses synthetically generated PDFs with precise LaTeX ground truth and realistic tables sourced from arXiv to capture diversity and complexity.
  • A central contribution is the use of LLMs as judges for semantic evaluation of tables, integrated into a matching pipeline that tolerates inconsistencies in parser outputs.
  • In a human validation study with over 1,500 quality judgments, the LLM-based evaluation shows substantially higher correlation with human judgment (Pearson r=0.93) than TEDS (r=0.68) and GriTS (r=0.70).
  • Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals notable performance disparities and yields practical guidance for selecting parsers for tabular data extraction.
  • The work provides a reproducible, scalable evaluation methodology and makes code and data available on GitHub for broader adoption.
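To see why rule-based comparison is brittle, consider a minimal exact-match baseline (an illustrative sketch of my own, not the paper's pipeline): after normalizing cosmetic differences, any structural misalignment between parser output and ground truth scores zero, which is exactly the failure mode that motivates semantic, LLM-based judging.

```python
# Illustrative rule-based baseline for table comparison (hypothetical,
# not the paper's code): normalize cells, then count exact matches.

def normalize_cell(cell: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences do not count as extraction errors."""
    return " ".join(str(cell).lower().split())

def cell_match_rate(pred: list[list[str]], gold: list[list[str]]) -> float:
    """Fraction of ground-truth cells reproduced exactly (after
    normalization). Rows or columns shifted by one position match
    nothing -- the brittleness that semantic evaluation avoids."""
    total = sum(len(row) for row in gold)
    if total == 0:
        return 0.0
    matched = 0
    for p_row, g_row in zip(pred, gold):
        for p, g in zip(p_row, g_row):
            if normalize_cell(p) == normalize_cell(g):
                matched += 1
    return matched / total

gold = [["Model", "F1"], ["Ours", "0.93"]]
pred = [["Model", "F1"], ["ours", " 0.93 "]]  # cosmetic differences only
print(cell_match_rate(pred, gold))  # -> 1.0
```

An LLM judge, by contrast, can credit semantically equivalent content (merged headers, reordered columns, unit-formatting changes) that this baseline would penalize as complete mismatches.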

Abstract

Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task.

Code and data: https://github.com/phorn1/pdf-parse-bench

Metric study and human evaluation: https://github.com/phorn1/table-metric-study