RUMLEM: A Dictionary-Based Lemmatizer for Romansh

arXiv cs.CL / 4/14/2026


Key Points

  • The paper introduces RUMLEM, a dictionary-based lemmatizer designed specifically for the Romansh language and its five main regional varieties plus Rumantsch Grischun.
  • By relying on comprehensive, community-driven morphological databases, RUMLEM achieves coverage of roughly 77–84% of words in typical Romansh text.
  • Because each variety has its own dedicated database, the lemmatizer can also be used to classify which Romansh variety a text is written in.
  • Experiments on 30,000 Romansh texts show RUMLEM identifies the correct variety in 95% of cases.
  • A proof of concept further demonstrates that lemmatization outputs can support Romansh-vs-non-Romansh language classification.
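
To make the dictionary-based idea and the coverage figure concrete, here is a minimal sketch in Python. It is not RUMLEM's code or API: the lexicon is a toy table with illustrative entries, and the names `TOY_LEXICON`, `tokenize`, `lemmatize`, and `coverage` are hypothetical. Coverage is assumed here to mean the share of tokens that have an entry in the lexicon.

```python
import re

# Toy form-to-lemma table. Illustrative entries only, NOT taken from the
# actual RUMLEM databases. A form may map to several candidate lemmas.
TOY_LEXICON = {
    "chasa": {"chasa"},
    "chasas": {"chasa"},
}


def tokenize(text):
    """Lowercase the text and return runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def lemmatize(token, lexicon):
    """Return the candidate lemmas for a token, or an empty set if it is unknown."""
    return lexicon.get(token, set())


def coverage(text, lexicon):
    """Fraction of tokens in the text that have an entry in the lexicon."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)
```

A pure lookup like this cannot handle forms missing from the table, which is why the coverage of the underlying databases matters so much for this kind of lemmatizer.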

Abstract

Lemmatization, the task of mapping an inflected word form to its dictionary form, is a crucial component of many NLP applications. In this paper, we present RUMLEM, a lemmatizer that covers the five main varieties of Romansh as well as the supra-regional standard variety Rumantsch Grischun. It is based on comprehensive, community-driven morphological databases for Romansh, enabling RUMLEM to cover 77–84% of the words in a typical Romansh text. Since there is a dedicated database for each Romansh variety, an additional application of RUMLEM is variety-aware language classification. Evaluation on 30,000 Romansh texts of varying lengths shows that RUMLEM correctly identifies the variety in 95% of cases. In addition, a proof of concept demonstrates the feasibility of Romansh vs. non-Romansh language classification based on the lemmatizer.
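
One way per-variety databases could drive classification is sketched below in Python. The abstract does not describe the actual scoring rule, so this is an assumption: each variety is scored by the share of tokens found in its database, the highest score wins, and a hypothetical `threshold` separates Romansh from non-Romansh text. The database contents and all names (`TOY_DATABASES`, `variety_scores`, `classify_variety`, `is_romansh`) are illustrative and not part of RUMLEM.

```python
import re

# Toy per-variety sets of known word forms. Illustrative entries only,
# NOT taken from the actual RUMLEM databases.
TOY_DATABASES = {
    "sursilvan": {"casa", "casas"},
    "vallader": {"chasa", "chasas"},
    "puter": {"chesa", "chesas"},
}


def tokenize(text):
    """Lowercase the text and return runs of letters (a simplifying assumption)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def variety_scores(text, databases):
    """Per variety, the fraction of tokens found in that variety's database."""
    tokens = tokenize(text)
    if not tokens:
        return {name: 0.0 for name in databases}
    return {
        name: sum(token in forms for token in tokens) / len(tokens)
        for name, forms in databases.items()
    }


def classify_variety(text, databases):
    """Return the best-scoring variety, or None if no token is recognised."""
    scores = variety_scores(text, databases)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def is_romansh(text, databases, threshold=0.5):
    """Assumed rule: Romansh if some variety covers at least `threshold` of tokens."""
    return max(variety_scores(text, databases).values()) >= threshold
```

With real databases, many forms are shared across varieties, so ties and near-ties would need more careful handling than the simple `max` used here.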