A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry

arXiv cs.CL / 4/22/2026

Key Points

  • The paper introduces A Bolu, the first structured computational corpus of extemporaneous (improvised) poetry focused on cantada logudorese, a Sardinian language variant.
  • The dataset includes 2,835 stanzas totaling 141,321 tokens, addressing a methodological gap in preserving and analyzing oral linguistic heritage with NLP.
  • The study outlines the corpus architecture and applies multidimensional computational linguistic methods plus descriptive statistics to characterize the poetic text.
  • Findings show recurring production patterns among Sardinian improvisational poets that align with Parry and Lord’s theory of formulaicity.
  • The authors argue the resource helps both scholarly understanding of oral creativity and the development of more inclusive NLP tools for less widely spoken languages.

Abstract

The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry, a performative genre based on real-time improvisation and metrical-rhetorical competence, remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. A Bolu was created in this context: it is the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord's theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also offers a significant contribution to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.
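To make the kind of analysis described in the abstract more concrete, the following is a minimal Python sketch of two generic measurements: basic descriptive indices over a set of stanzas, and a count of word n-grams that recur across stanzas as a rough proxy for formulaic phrasing. It is not the authors' pipeline. The function names (`tokenize`, `descriptive_stats`, `recurring_ngrams`), the whitespace tokenizer, the n-gram length, and the toy stanzas are all assumptions made for illustration; only the two corpus totals (2,835 stanzas, 141,321 tokens) come from the abstract.

```python
from collections import Counter


def tokenize(stanza: str) -> list[str]:
    # Assumption: naive lowercase whitespace tokenization.
    # The paper's actual tokenization scheme is not described in the abstract.
    return stanza.lower().split()


def descriptive_stats(stanzas: list[str]) -> dict[str, float]:
    # Basic descriptive indices: size, mean stanza length, type-token ratio.
    tokens = [tok for s in stanzas for tok in tokenize(s)]
    return {
        "stanzas": len(stanzas),
        "tokens": len(tokens),
        "mean_tokens_per_stanza": len(tokens) / len(stanzas),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


def recurring_ngrams(stanzas: list[str], n: int = 3, min_stanzas: int = 2) -> dict:
    # Count each n-gram at most once per stanza, so the value is the number
    # of distinct stanzas in which the n-gram appears.
    stanza_freq = Counter()
    for s in stanzas:
        toks = tokenize(s)
        grams = {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
        stanza_freq.update(grams)
    # Keep only n-grams shared by at least `min_stanzas` stanzas.
    return {g: c for g, c in stanza_freq.items() if c >= min_stanzas}


# Invented placeholder stanzas (NOT taken from the A Bolu corpus).
toy = [
    "alpha beta gamma delta epsilon",
    "zeta alpha beta gamma eta",
    "theta iota kappa lambda mu",
]

stats = descriptive_stats(toy)
formulas = recurring_ngrams(toy, n=3)

# Average stanza length implied by the totals reported in the abstract.
reported_mean = 141_321 / 2_835

print(stats)
print(formulas)
print(round(reported_mean, 2))
```

On the totals reported in the abstract, the last line works out to roughly 49.85 tokens per stanza. Counting an n-gram once per stanza, rather than by raw frequency, is one possible design choice: it keeps repetition inside a single stanza from inflating the score and focuses on phrases reused across separate stanzas.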