BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

arXiv cs.CL · April 20, 2026


Key Points

  • The paper introduces BAGEL, a new closed-book benchmark designed to measure how well language models handle specialized animal-related knowledge.
  • BAGEL is built from multiple scientific and reference sources (including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia) using both curated examples and automatically generated question–answer pairs.
  • The benchmark evaluates several dimensions of animal knowledge, such as taxonomy, morphology, habitat, behavior, vocalizations, geographic distribution, and species interactions.
  • By using closed-book evaluation with no external retrieval at inference time, BAGEL aims to provide a more reliable assessment of model knowledge and to analyze strengths and systematic failure modes across domains and categories.
  • The benchmark is positioned as a testbed for studying domain-specific knowledge generalization and improving reliability for biodiversity-related applications.

Abstract

Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well they handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question–answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures models' animal-related knowledge without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
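As a rough illustration of the closed-book protocol and per-category analysis the abstract describes, the sketch below scores a model that receives only the question text (no retrieved context) and breaks accuracy down by knowledge category. The `QAItem` record, the `toy_model` answerer, and the example questions are illustrative assumptions, not part of the BAGEL release.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QAItem:
    question: str
    answer: str
    category: str  # e.g. "taxonomy", "distribution" (illustrative labels)


def evaluate_closed_book(model_fn, items):
    """Closed-book QA scoring: the model sees only the question string,
    with no retrieved documents, and is graded by case-insensitive
    exact match. Returns (overall accuracy, per-category accuracy)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        prediction = model_fn(item.question)  # no retrieval context passed
        total[item.category] += 1
        if prediction.strip().lower() == item.answer.strip().lower():
            correct[item.category] += 1
    per_category = {c: correct[c] / total[c] for c in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_category


# Toy "model": a fixed lookup table standing in for an LLM's answers.
_toy_knowledge = {"To which order do owls belong?": "Strigiformes"}

def toy_model(question):
    return _toy_knowledge.get(question, "unknown")


items = [
    QAItem("To which order do owls belong?", "Strigiformes", "taxonomy"),
    QAItem("To which island are lemurs endemic?", "Madagascar", "distribution"),
]
overall, per_cat = evaluate_closed_book(toy_model, items)
# overall is 0.5: the toy model answers the taxonomy item but not the
# distribution one, so the per-category breakdown localizes the failure.
```

The per-category dictionary is what enables the kind of fine-grained failure analysis the paper emphasizes: a model can look strong in aggregate while systematically missing one category, such as vocalizations or species interactions.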