LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

arXiv cs.CL / 3/12/2026

Key Points

  • LuxBorrow introduces a borrowing-first analysis of Luxembourgish (LU) news from 1999 to 2025, using a pipeline that combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver, lemmatization, a loanword registry, and morphological/orthographic rules.
  • The study finds that Luxembourgish remains the matrix language across all documents, yet multilingual practice is pervasive: 77.1% of articles contain at least one donor language, and 65.4% use three or four languages.
  • Token-level adaptations total 25,444 instances: 63.8% morphological, 35.9% orthographic, and 0.3% lexical. The most frequent individual rules are orthographic (for example on->oun and eur->er), although morphological adaptations dominate collectively.
  • The authors advocate borrowing-centric evaluation metrics, such as borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying solely on document-level mixing indices.
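
To make the borrowing-centric metrics named in the last key point concrete, here is a minimal Python sketch. The summary does not give the paper's exact definitions, so everything here is an assumption: the `Token` record, the function `borrowing_metrics`, and the reading of each metric (borrowed tokens over all tokens, borrowed lemma types over all lemma types, Shannon entropy of the donor distribution over borrowed tokens, and adapted borrowings over all borrowings). The toy annotations are illustrative only and are not taken from the paper's data.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical per-token annotation for a Luxembourgish sentence."""
    lemma: str
    donor: Optional[str] = None  # "FR", "DE", "EN", or None for native LU
    adapted: bool = False        # True if an adaptation rule applied (assumed flag)


def borrowing_metrics(tokens):
    """Compute assumed versions of the borrowing-centric metrics."""
    borrowed = [t for t in tokens if t.donor is not None]
    all_types = {t.lemma for t in tokens}
    borrowed_types = {t.lemma for t in borrowed}

    # Shannon entropy (in bits) of donor languages over borrowed tokens.
    donor_counts = Counter(t.donor for t in borrowed)
    total = sum(donor_counts.values())
    entropy = sum(
        (c / total) * math.log2(total / c) for c in donor_counts.values()
    ) if total else 0.0

    return {
        "borrowed_token_rate": len(borrowed) / len(tokens) if tokens else 0.0,
        "borrowed_type_rate": len(borrowed_types) / len(all_types) if all_types else 0.0,
        "donor_entropy_bits": entropy,
        "assimilation_ratio": (
            sum(t.adapted for t in borrowed) / len(borrowed) if borrowed else 0.0
        ),
    }


# Toy annotations (illustrative only).
toy = [
    Token("den"), Token("an"), Token("der"), Token("mat"),
    Token("Chauffer", "FR", True),
    Token("Statioun", "FR", True),
    Token("Computer", "EN", False),
    Token("Chauffer", "FR", True),
]

for name, value in borrowing_metrics(toy).items():
    print(f"{name}: {value:.3f}")
```

In this toy sample half of the tokens are borrowed and three of the four borrowed tokens are marked as adapted. The entropy is below one bit because French supplies most of the borrowed items, which is the kind of donor skew a document-level index would not expose.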

Abstract

We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on->oun and eur->er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.
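
For readers unfamiliar with the code-mixing index quoted in the abstract, the following Python sketch shows the commonly used formulation attributed to Das and Gambäck: CMI = 100 * (1 - max_i(w_i) / (n - u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the token count for language i. Whether LuxBorrow computes CMI exactly this way (for example, per token or from its sentence-level language labels) is not stated in the abstract, so treat this as an assumption. The function name `cmi` and the `"OTHER"` tag for language-independent tokens are hypothetical.

```python
from collections import Counter


def cmi(lang_tags, neutral="OTHER"):
    """Code-mixing index over a sequence of per-token language tags.

    Tokens tagged `neutral` (an assumed label for numbers, punctuation,
    and similar items) count as language-independent.
    """
    n = len(lang_tags)
    counts = Counter(tag for tag in lang_tags if tag != neutral)
    u = n - sum(counts.values())
    if n == u:
        # No language-tagged tokens at all: defined as 0.
        return 0.0
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


# Toy documents (illustrative only, not from the paper's corpus).
monolingual = ["LU"] * 100
light_mix = ["LU"] * 96 + ["FR"] * 4
three_donors = ["LU"] * 93 + ["FR"] * 4 + ["DE"] * 2 + ["EN"] * 1

for name, doc in [("monolingual", monolingual),
                  ("light_mix", light_mix),
                  ("three_donors", three_donors)]:
    print(f"{name}: CMI = {cmi(doc):.2f}")
```

Because the index depends only on the share of the dominant language, adding more donor languages raises it only as far as they displace Luxembourgish tokens. That is consistent with the abstract's observation that breadth of donors does not imply intensity of mixing.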