Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities
arXiv cs.CL / 3/17/2026
Key Points
- The paper presents a unified benchmark for bibliographic reference extraction and parsing focused on the social sciences and humanities (SSH), covering multilingual references and varied formatting across three datasets (CEX, EXCITE, LinkedBooks).
- It compares a strong supervised pipeline (GROBID) with contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, Qwen3-VL variants) under a schema-constrained setting for fair evaluation.
- Across datasets, reference extraction reaches moderate capability, while reference parsing and end-to-end document parsing remain bottlenecks because structured output is brittle on noisy layouts.
- Lightweight LoRA adaptation yields consistent gains, especially on SSH-heavy benchmarks, and segmentation/pipelining significantly improves robustness.
- The authors propose a hybrid deployment strategy that routes well-structured PDFs to GROBID and more complex, multilingual/footnote-heavy documents to task-adapted LLMs.
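The schema-constrained setting described above can be illustrated with a minimal sketch: the model's raw output is accepted only if it parses as JSON and matches a fixed field schema, and anything else is rejected rather than partially credited. The field names and validation logic here are hypothetical, chosen for illustration; the paper's actual schema is not reproduced.

```python
import json

# Hypothetical field schema for a parsed reference (not the paper's actual schema).
REFERENCE_SCHEMA = {"authors": list, "title": str, "year": str, "container": str}

def validate_reference(raw: str):
    """Parse a model's JSON output and enforce the field schema.

    Returns the parsed record, or None when the output is malformed --
    the brittleness mode the benchmark highlights on noisy layouts.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if set(record) != set(REFERENCE_SCHEMA):
        return None
    if not all(isinstance(record[k], t) for k, t in REFERENCE_SCHEMA.items()):
        return None
    return record

good = '{"authors": ["A. Smith"], "title": "On Footnotes", "year": "1998", "container": "SSH Journal"}'
bad = '{"title": "On Footnotes"}'  # missing fields -> rejected
```

Under this kind of gate, a model that emits slightly malformed JSON scores zero on that reference, which is one plausible reason parsing lags extraction in the reported results.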
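The hybrid deployment strategy in the last point amounts to a routing rule over document features. A toy sketch, with invented feature names and thresholds (the paper does not specify concrete routing criteria):

```python
def route_document(layout_score: float, language: str, footnote_ratio: float) -> str:
    """Toy router: send clean, monolingual, low-footnote PDFs to GROBID,
    everything else to a task-adapted LLM.

    All three features and both thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    if layout_score >= 0.8 and language == "en" and footnote_ratio < 0.1:
        return "grobid"
    return "llm"
```

The design intuition is cost-driven: GROBID is cheap and strong on well-structured input, so the LLM is reserved for the multilingual, footnote-heavy documents where the supervised pipeline degrades.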




