
Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities

arXiv cs.CL / March 17, 2026


Key Points

  • The paper presents a unified benchmark for bibliographic reference extraction and parsing in the Social Sciences and Humanities (SSH), covering multilingual references and heterogeneous formatting across three datasets (CEX, EXCITE, LinkedBooks).
  • It compares a strong supervised pipeline (GROBID) with contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, Qwen3-VL variants) under a schema-constrained setting for fair evaluation.
  • Across datasets, reference extraction largely saturates once models pass a moderate capability threshold, while reference parsing and end-to-end document parsing remain the main bottlenecks due to brittle structured output under noisy layouts.
  • Lightweight LoRA adaptation yields consistent gains, especially on SSH-heavy benchmarks, and segmentation/pipelining substantially improves robustness.
  • The authors propose a hybrid deployment strategy that routes well-structured PDFs to GROBID and more complex, multilingual/footnote-heavy documents to task-adapted LLMs.
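The routing idea in the last point can be sketched as a simple dispatcher. This is a minimal illustration, not the paper's implementation: the metadata flags (`language`, `footnote_citations`, `has_end_bibliography`) and the function name are hypothetical stand-ins for whatever document features a real router would inspect.

```python
# Illustrative sketch of the hybrid routing strategy: send well-structured,
# in-distribution PDFs to GROBID and escalate multilingual or footnote-heavy
# documents to a task-adapted LLM. All field names here are hypothetical.

def route_document(doc: dict) -> str:
    """Return 'grobid' for clean, English, end-of-document bibliographies,
    otherwise 'llm' for documents likely to break a supervised pipeline."""
    well_structured = (
        doc.get("language") == "en"
        and not doc.get("footnote_citations", False)
        and doc.get("has_end_bibliography", True)
    )
    return "grobid" if well_structured else "llm"

# A clean English journal article stays on the supervised pipeline:
print(route_document({"language": "en", "has_end_bibliography": True}))  # grobid
# A German document with footnote citations escalates to the LLM:
print(route_document({"language": "de", "footnote_citations": True}))  # llm
```

In practice the routing signal would come from layout analysis or a lightweight classifier rather than hand-set flags, but the control flow stays this simple.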

Abstract

Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and end-to-end document parsing -- under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL (4B-32B variants)). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains -- especially on SSH-heavy benchmarks -- and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.