GLeMM: A large-scale multilingual dataset for morphological research
arXiv cs.CL / 4/15/2026
Key Points
- The paper introduces GLeMM, a new large-scale multilingual dataset designed to support data-driven research on derivational morphology and on form–meaning relations in word formation.
- GLeMM is built fully automatically and follows the same design across seven European languages (German, English, Spanish, French, Italian, Polish, and Russian), with the aim of making results easier to replicate and generalize than with earlier, more limited datasets.
- Each entry carries automatically annotated morphological features, and a substantial subset also includes encoded semantic descriptions, which enables richer computational experiments.
- The authors describe the dataset construction pipeline using Wiktionary sources and provide case studies showing how the resource can be used to test computational methods for identifying derivational morphological structures.
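
To make the last point concrete, here is a minimal sketch of one simple way to surface candidate derivational patterns from base–derivative word pairs. It is not the authors' method and does not use GLeMM's actual format: the function names, the pair layout, and the toy English pairs are all assumptions made for illustration.

```python
from collections import Counter
from os.path import commonprefix


def suffix_alternation(base: str, derived: str) -> tuple[str, str]:
    """Return the (base ending, derived ending) left after the shared prefix."""
    stem = commonprefix([base, derived])  # longest common leading substring
    return base[len(stem):], derived[len(stem):]


def count_patterns(pairs):
    """Count how often each suffix alternation occurs in a list of pairs."""
    return Counter(suffix_alternation(base, derived) for base, derived in pairs)


# Toy, hand-written pairs (NOT taken from GLeMM), assumed layout: (base, derivative).
TOY_PAIRS = [
    ("happy", "happiness"),
    ("sad", "sadness"),
    ("kind", "kindness"),
    ("teach", "teacher"),
    ("bake", "baker"),
]

if __name__ == "__main__":
    for (base_end, derived_end), n in count_patterns(TOY_PAIRS).most_common():
        print(f"-{base_end or '∅'} -> -{derived_end or '∅'}: {n}")
```

A purely string-based heuristic like this ignores meaning, which is one reason a resource that pairs morphological annotation with semantic descriptions is useful for evaluating more capable methods.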