FRENCH-YMCA: A FRENCH Corpus meeting the language needs of Youth, froM Children to Adolescents

arXiv cs.CL / 4/8/2026


Key Points

  • The paper introduces the French-YMCA corpus, a new open linguistic resource tailored to the evolving language needs of children and adolescents rather than to adult language patterns.
  • The corpus contains 39,200 text files totaling 22,471,898 words, drawn from diverse sources and held to consistent grammar and spelling.
  • The authors emphasize open online accessibility so the dataset can be broadly reused for research and downstream development.
  • The corpus is positioned as a foundation for training language models that better understand youth language and generate responses and suggestions that are age-appropriate and matched to users' comprehension level.

Abstract

In this paper, we introduce the French-YMCA corpus, a new linguistic resource specifically tailored for children and adolescents. The motivation for building this corpus is clear: children have unique language requirements, as their language skills are in constant evolution and differ from those of adults. With an extensive collection of 39,200 text files, the French-YMCA corpus encompasses a total of 22,471,898 words. It distinguishes itself through its diverse sources, consistent grammar and spelling, and its commitment to open online accessibility for all. Such a corpus can serve as the foundation for training language models that understand and anticipate the language of young people, thereby enhancing the quality of digital interactions and ensuring that responses and suggestions are age-appropriate and adapted to the comprehension level of users in this age group.
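
The headline figures above (file count and word count) are the kind of statistic a reader could recompute on a local copy of such a corpus. The following is a minimal sketch, not the authors' procedure: it assumes the corpus is a directory tree of UTF-8 `.txt` files and that a "word" is a whitespace-delimited token, neither of which is stated in the abstract. The function name `corpus_stats` and the sample files are hypothetical.

```python
from pathlib import Path
import tempfile


def corpus_stats(root: str, pattern: str = "*.txt") -> dict:
    """Count text files and whitespace-delimited words under `root`.

    Assumption: whitespace tokenization; the paper's own word-counting
    rule is not given in the abstract and may differ.
    """
    n_files = 0
    n_words = 0
    for path in sorted(Path(root).rglob(pattern)):
        # Replace undecodable bytes rather than failing on a stray file.
        text = path.read_text(encoding="utf-8", errors="replace")
        n_files += 1
        n_words += len(text.split())
    mean = n_words / n_files if n_files else 0.0
    return {"files": n_files, "words": n_words, "mean_words_per_file": mean}


if __name__ == "__main__":
    # Tiny stand-in corpus (hypothetical sample sentences).
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "a.txt").write_text("Le chat dort.", encoding="utf-8")
        Path(tmp, "b.txt").write_text(
            "Les enfants jouent dans le jardin.", encoding="utf-8"
        )
        print(corpus_stats(tmp))

    # Mean file length implied by the reported totals.
    print(round(22_471_898 / 39_200, 1))
```

Dividing the reported totals gives roughly 573 words per file on average, which suggests the corpus is made up of fairly short texts; the abstract does not report how file lengths are distributed.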