LGSE: Lexically Grounded Subword Embedding Initialization for Low-Resource Language Adaptation
arXiv cs.CL / 3/25/2026
Key Points
- The paper introduces LGSE, a framework for lexically (morphologically) grounded initialization of embeddings to better adapt pretrained language models to low-resource, morphologically rich languages like Amharic and Tigrinya.
- Unlike typical vocabulary-expansion approaches, which rely on arbitrary subword segmentation and can fragment lexical representations, LGSE builds each new embedding by decomposing the word into morphemes and averaging their pretrained subword- and FastText-based representations.
- If meaningful morpheme segmentation is unavailable, LGSE falls back to character n-gram representations to capture structural information for unseen or difficult tokens.
- During language-adaptive pretraining, LGSE adds a regularization term that discourages large deviations from the initialized embeddings, helping preserve alignment with the original pretrained embedding space while still allowing adaptation.
- Experiments on QA, NER, and text classification show LGSE outperforms baseline vocabulary expansion and adaptation methods across tasks, and the authors provide project resources via GitHub.
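The initialization step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's released code: the function name, dictionary-based morpheme/n-gram lookup tables, and the n-gram padding scheme are all assumptions made for the example.

```python
import numpy as np

def init_embedding(word, morphemes, morpheme_vecs, ngram_vecs, dim, n=3):
    """Sketch of an LGSE-style embedding initialization (hypothetical API).

    Average the pretrained vectors of the word's morphemes when a
    segmentation is available; otherwise fall back to averaging
    character n-gram vectors to capture structural information.
    """
    # Primary path: average whatever morpheme vectors we can look up.
    vecs = [morpheme_vecs[m] for m in morphemes if m in morpheme_vecs]
    if not vecs:
        # Fallback: character n-grams over the boundary-padded word.
        padded = f"<{word}>"
        grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
        vecs = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    if not vecs:
        # Nothing known about this token: small random init as a last resort.
        return np.random.normal(0.0, 0.02, dim)
    return np.mean(vecs, axis=0)
```

In practice the morpheme table would come from a segmenter for the target language (e.g. for Amharic or Tigrinya) and the n-gram table from a FastText-style model; here both are plain dictionaries for illustration.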
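The regularization term used during language-adaptive pretraining can likewise be sketched as an L2 anchor penalty on the embedding matrix. The exact form and weighting in the paper may differ; `lam` and the mean-over-rows reduction here are assumptions for the example.

```python
import numpy as np

def anchored_loss(task_loss, emb, emb_init, lam=0.01):
    """Add an anchor penalty discouraging large deviations of the current
    embeddings `emb` from their LGSE initialization `emb_init`.

    Both arrays have shape (vocab_size, dim); `lam` trades off adaptation
    against staying aligned with the pretrained embedding space.
    """
    # Squared L2 distance per token, averaged over the vocabulary.
    penalty = np.mean(np.sum((emb - emb_init) ** 2, axis=-1))
    return task_loss + lam * penalty
```

With `lam = 0`, training reduces to ordinary adaptive pretraining; larger values keep the adapted embeddings closer to their lexically grounded starting points.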