BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
arXiv cs.CL / 4/6/2026
Key Points
- Web-enabled scientific publishing agents built on large language models generate BibTeX citations with widespread field-level errors, a risk that prior evaluations, which largely ignored the role of web search, failed to capture.
- The authors introduce a benchmark of 931 papers across four domains and multiple citation tiers, along with version-aware ground truth, and evaluate three frontier search-enabled models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) using a nine-field scoring scheme and a six-way error taxonomy.
- While overall accuracy reaches 83.6%, only 50.9% of generated BibTeX entries are completely correct, and accuracy drops substantially for more recent papers, indicating strong reliance on parametric memory.
- Two main failure modes are identified—wholesale entry substitution (identity fields failing together) and isolated field errors—supported by field-error co-occurrence analysis.
- As mitigation, the paper evaluates clibib, an open-source deterministic BibTeX retrieval tool (Zotero Translation Server with CrossRef fallback), and shows that a two-stage integration, search followed by revision against authoritative records, raises accuracy to 91.5% with a low regression rate (0.8%), outperforming a single-stage approach.
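The "revision against authoritative records" stage can be sketched as a field-by-field merge and score over the nine evaluated BibTeX fields. This is a minimal illustration, not clibib's actual API: the field list, the `score` helper, and the `revise` helper are assumptions for the sketch, and the exact nine fields used by the paper may differ.

```python
# Hedged sketch: score a model-generated BibTeX entry against an
# authoritative record, then revise it by copying authoritative values.
# Field names and helpers are illustrative, not clibib's real interface.

FIELDS = ["title", "author", "year", "journal", "volume",
          "number", "pages", "doi", "publisher"]  # assumed nine fields

def score(generated: dict, truth: dict) -> float:
    """Fraction of fields matching the ground truth.
    Fields absent from both entries count as a match."""
    hits = sum(
        generated.get(f, "").strip().lower() == truth.get(f, "").strip().lower()
        for f in FIELDS
    )
    return hits / len(FIELDS)

def revise(generated: dict, authoritative: dict) -> dict:
    """Second stage: overwrite each field with the authoritative
    value whenever the authoritative record provides one."""
    revised = dict(generated)
    for f in FIELDS:
        if authoritative.get(f):
            revised[f] = authoritative[f]
    return revised
```

In this scheme a "regression" would be any field that was correct before `revise` but wrong after, which stays rare as long as the authoritative record is itself accurate.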