GLeMM: A large-scale multilingual dataset for morphological research

arXiv cs.CL / 4/15/2026

Key Points

The paper introduces GLeMM, a new large-scale multilingual dataset designed to support data-driven research on derivational morphology and on form–meaning relations in word formation.
GLeMM is built by a fully automated procedure that is identical across seven European languages (German, English, Spanish, French, Italian, Polish, and Russian), with the aim of making studies easier to replicate and generalize than those based on earlier, more limited datasets.
Each dataset entry includes automatic annotation of morphological features, and a substantial subset also contains encoded semantic descriptions to enable richer computational experiments.
The authors describe the dataset construction pipeline using Wiktionary sources and provide case studies showing how the resource can be used to test computational methods for identifying derivational morphological structures.

Abstract

In derivational morphology, what mechanisms govern the variation in form-meaning relations between words? Answers to questions of this type are typically based on intuition and on observations drawn from limited data, even when a wide range of languages is considered. Many of these studies are difficult to replicate and generalize. To address this issue, we present GLeMM, a new derivational resource designed for experimentation and data-driven description in morphology. GLeMM is characterized by (i) its large size, (ii) its extensive coverage (currently amounting to seven European languages, i.e., German, English, Spanish, French, Italian, Polish, and Russian), (iii) its fully automated design, identical across all languages, (iv) the automatic annotation of morphological features on each entry, and (v) the encoding of semantic descriptions for a significant subset of these entries. It enables researchers to address difficult questions, such as the role of form and meaning in word formation, and to develop and experimentally test computational methods that identify the structures of derivational morphology. The article describes how GLeMM is created using Wiktionary articles and presents various case studies illustrating possible applications of the resource.