KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
arXiv cs.AI / 4/21/2026
Key Points
- The article proposes a framework for evaluating LLMs on two easily overlooked real-world abilities: systematic coverage of a bounded knowledge universe and compositional set-based reasoning over it.
- It introduces KnowledgeBerg, a benchmark with 4,800 multiple-choice questions built from 1,183 enumeration seeds across 10 domains and 17 languages, using authoritative sources to keep the universes reproducible.
- Experiments with representative open-source LLMs reveal major weaknesses: low performance on universe enumeration (F1 of 5.26–36.88) and on knowledge-grounded reasoning (accuracy of 16.00–44.19).
- The authors diagnose failures into three stages—completeness (missing knowledge), awareness (failing to identify what the question requires), and application (incorrect reasoning execution)—and find the same pattern across languages and model sizes.
- While test-time compute and retrieval augmentation provide some improvements, notable gaps remain, suggesting current LLMs struggle to organize structured knowledge and execute compositional reasoning even within bounded domains.
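To make the enumeration scores above concrete, here is a minimal sketch of how a set-overlap F1 between a model's enumerated items and a gold universe is typically computed. This is an assumption for illustration; the paper's exact scoring protocol (normalization, matching rules) may differ, and the function name `enumeration_f1` is hypothetical.

```python
def enumeration_f1(predicted, gold):
    """Set-overlap F1: harmonic mean of precision (share of predicted
    items that are in the gold universe) and recall (share of the gold
    universe the model actually enumerated)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Example: a model lists 3 items, 2 of which belong to a 4-item universe.
score = enumeration_f1(["a", "b", "c"], ["a", "b", "d", "e"])
```

Scored this way, a model that misses most of a bounded universe is penalized on recall even if everything it does list is correct, which is why enumeration F1 can sit far below question-answering accuracy.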
