TailNLG: A Multilingual Benchmark Addressing Verbalization of Long-Tail Entities

arXiv cs.CL / 3/31/2026

Key Points

  • The paper argues that multilingual data-to-text verbalization of knowledge graphs can be biased against rare (long-tail) entities, limiting usability for non-expert users and retrieval-augmented generation systems.
  • It introduces TailNLG, a new multilingual benchmark (English, Italian, Spanish) built from Wikidata that systematically varies entity popularity and is designed to study long-tail effects (a popularity-bucketing sketch follows this list).
  • The study evaluates three families of large language models in zero-shot settings and finds a consistent bias against long-tail entities, with lower embedding-based scores and higher model uncertainty for rare items.
  • It shows that the magnitude of long-tail bias differs by model and language, and that existing evaluation metrics may not reliably reflect these differences, motivating improved evaluation approaches.
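
A minimal sketch of how such popularity strata could be built from Wikidata metadata, using sitelink counts as a popularity proxy. The entities, counts, and thresholds below are illustrative assumptions, not TailNLG's actual selection criteria:

```python
# Hypothetical sketch: split Wikidata entities into popularity buckets using
# sitelink counts (number of Wikipedia language editions linking to the item)
# as a popularity proxy. Thresholds and counts are illustrative, not TailNLG's.
from dataclasses import dataclass

@dataclass
class Entity:
    qid: str        # Wikidata identifier
    label: str      # human-readable label
    sitelinks: int  # popularity proxy

def popularity_bucket(entity: Entity, tail_max: int = 5, head_min: int = 50) -> str:
    """Assign an entity to a head / torso / tail bucket by sitelink count."""
    if entity.sitelinks <= tail_max:
        return "tail"
    if entity.sitelinks >= head_min:
        return "head"
    return "torso"

# Illustrative entities with made-up sitelink counts.
entities = [
    Entity("Q937", "Albert Einstein", 200),
    Entity("Q42", "Douglas Adams", 120),
    Entity("Q11899630", "a little-known local painter", 3),
]

for e in entities:
    print(e.qid, e.label, "->", popularity_bucket(e))
```

Sitelink counts are only one easily retrievable popularity signal; the benchmark's actual selection procedure may rely on different signals or thresholds.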

Abstract

The automatic verbalization of structured knowledge is a key task for making knowledge graphs accessible to non-expert users and supporting retrieval-augmented generation systems. Although recent advances in Data-to-Text generation have improved multilingual coverage, little attention has been paid to potential biases in the verbalization of rare entities, commonly referred to as long-tail entities. In this work, we present the first systematic study of long-tail entities in Data-to-Text generation. We introduce TailNLG, a new multilingual benchmark in English, Italian, and Spanish, built from Wikidata and covering entities with varying levels of popularity. We evaluate three families of large language models in zero-shot settings and compare their performance on rare versus common entities, as well as against the established WebNLG benchmark. Our results reveal a consistent bias against long-tail entities: embedding-based scores are lower, and model uncertainty is higher for rare entities. We further show that the impact of long-tail entities varies across models and languages, and that existing evaluation metrics do not consistently capture these differences, highlighting the need for more reliable evaluation frameworks.
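
As a rough illustration of the two measurements highlighted above, the sketch below scores a generated verbalization against a reference with an embedding-based cosine similarity and uses mean token negative log-likelihood as an uncertainty proxy. The model names, example sentences, and exact metric formulation are assumptions for illustration, not the paper's evaluation protocol:

```python
# Hypothetical sketch: one embedding-based quality score and one uncertainty
# proxy for a generated verbalization. Models and examples are placeholders.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
lm_name = "gpt2"  # placeholder causal LM; the paper evaluates larger LLM families
tokenizer = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name)
lm.eval()

def embedding_score(hypothesis: str, reference: str) -> float:
    """Cosine similarity between sentence embeddings of hypothesis and reference."""
    emb = embedder.encode([hypothesis, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def mean_token_nll(text: str) -> float:
    """Mean negative log-likelihood of the text under the LM (uncertainty proxy)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = lm(**inputs, labels=inputs["input_ids"])
    return out.loss.item()

hyp = "Alan Turing was a British mathematician born in London."
ref = "Alan Turing, born in London, was a British mathematician and computer scientist."
print(f"embedding score: {embedding_score(hyp, ref):.3f}")
print(f"mean token NLL:  {mean_token_nll(hyp):.3f}")
```

Under this kind of setup, the paper's reported pattern would show up as lower embedding scores and higher mean token NLL for verbalizations of tail entities than for head entities.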