Resource-Lean Lexicon Induction for German Dialects

arXiv cs.CL / 4/28/2026


Key Points

  • The paper addresses the challenge of automatically inducing high-quality dictionaries for German dialects despite limited annotation and a high degree of spelling variation.
  • It shows that random-forest statistical models using string-similarity features can effectively induce German dialect lexicons and can outperform LLM baselines like Mistral-123b.
  • The induced dictionaries support cross-dialect transfer and are evaluated under different training-data sizes to study robustness in low-resource settings.
  • In bilingual lexicon induction (BLI), random forests achieve stronger intrinsic performance while remaining more resource-lean than large language models.
  • In dialect information retrieval (IR) with BM25, the paper reports that query expansion with the induced dialect dictionaries yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100.
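The core modeling idea in the key points, a random forest over string-similarity features that ranks standard-German candidates for a dialect word, can be sketched roughly as follows. The specific features (Levenshtein similarity, character-bigram Jaccard, length ratio), the toy word pairs, and the `best_candidate` helper are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: rank standard-German candidates for a dialect word with a random
# forest trained on simple string-similarity features. Feature choices and
# training pairs are illustrative, not the paper's actual configuration.
from sklearn.ensemble import RandomForestClassifier

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def features(src, tgt):
    bigrams = lambda w: {w[i:i + 2] for i in range(len(w) - 1)} or {w}
    bs, bt = bigrams(src), bigrams(tgt)
    return [
        1 - levenshtein(src, tgt) / max(len(src), len(tgt)),  # Levenshtein similarity
        len(bs & bt) / len(bs | bt),                          # char-bigram Jaccard
        min(len(src), len(tgt)) / max(len(src), len(tgt)),    # length ratio
    ]

# toy training pairs: (dialect form, standard form, is_translation)
pairs = [("huus", "haus", 1), ("chind", "kind", 1), ("zyt", "zeit", 1),
         ("huus", "kind", 0), ("chind", "haus", 0), ("zyt", "haus", 0)]
X = [features(s, t) for s, t, _ in pairs]
y = [label for _, _, label in pairs]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def best_candidate(src, candidates):
    # score each candidate by the forest's translation probability
    probs = clf.predict_proba([features(src, c) for c in candidates])[:, 1]
    return max(zip(candidates, probs), key=lambda p: p[1])[0]
```

Because the model sees only language-agnostic surface features, the same recipe can in principle be trained on one dialect and applied to another, which is the cross-dialect transfer the paper investigates.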

Abstract

Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.
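The extrinsic IR evaluation described above can be illustrated with a minimal sketch: a dialect query is expanded with standard-German equivalents from an induced dictionary, then documents are scored with BM25. The tiny lexicon, the toy documents, and the `k1`/`b` values (common BM25 defaults) are assumptions for illustration, not the paper's data or settings.

```python
# Sketch: dictionary-based query expansion for dialect IR with a minimal
# BM25 scorer. Lexicon, documents, and parameters are illustrative only.
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # docs: list of tokenized documents; returns one BM25 score per document
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# hypothetical induced dialect -> standard-German dictionary
lexicon = {"huus": "haus", "chind": "kind"}

def expand(query_terms):
    # keep the original dialect terms and append their standard forms
    return query_terms + [lexicon[t] for t in query_terms if t in lexicon]

docs = [["das", "haus", "ist", "alt"], ["ein", "kind", "spielt"]]
query = ["huus"]
plain_scores = bm25_scores(query, docs)
expanded_scores = bm25_scores(expand(query), docs)
```

Without expansion, the dialect spelling "huus" matches no document and every BM25 score is zero; after expansion, the document containing "haus" is retrieved, which is the mechanism behind the reported nDCG@10 and Recall@100 gains.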