CWoMP: Morpheme Representation Learning for Interlinear Glossing
arXiv cs.CL, March 20, 2026
Key Points
- CWoMP introduces a morpheme-centric pretraining framework that treats morphemes as atomic units and learns their representations, aligning words-in-context with their morphemes in a shared embedding space.
- The approach uses a contrastively trained encoder and an autoregressive decoder that retrieves morpheme sequences from a mutable lexicon, producing predictions grounded in lexicon entries for interpretability.
- A key novelty is that users can expand the lexicon at inference time to improve results without retraining, enabling interactive, incremental improvements.
- Evaluations across a diverse set of extremely low-resource languages show that CWoMP outperforms existing methods while being more efficient, with the largest gains when training data is scarcest.
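The mutable-lexicon idea above can be illustrated with a minimal sketch. Assuming (hypothetically) that the decoder scores lexicon entries by cosine similarity between an encoder's word-in-context embedding and stored morpheme embeddings, adding a new entry at inference time immediately changes what can be retrieved, with no retraining. The class name, morpheme labels, and vectors below are all made up for illustration; they are not from the paper.

```python
import numpy as np

class MutableMorphemeLexicon:
    """Toy nearest-neighbor lexicon mapping morpheme labels to embeddings.

    Sketches how decoding by retrieval over a mutable lexicon lets users
    add entries at inference time without retraining the model.
    """

    def __init__(self):
        self.labels = []
        self.vecs = []

    def add(self, label, vec):
        # Store a unit-normalized embedding so dot product = cosine similarity.
        v = np.asarray(vec, dtype=float)
        self.labels.append(label)
        self.vecs.append(v / np.linalg.norm(v))

    def retrieve(self, query):
        # Return the label of the lexicon entry most similar to the query.
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vecs) @ q
        return self.labels[int(np.argmax(sims))]

lex = MutableMorphemeLexicon()
lex.add("PL", [1.0, 0.0])      # plural marker (toy embedding)
lex.add("PST", [0.0, 1.0])     # past-tense marker (toy embedding)

query = [0.9, 0.1]             # pretend encoder output for a word-in-context
print(lex.retrieve(query))     # -> PL

# Expand the lexicon at inference time; the new entry is closer to the query,
# so retrieval improves with no retraining:
lex.add("DIM", [0.95, 0.05])   # diminutive marker (toy embedding)
print(lex.retrieve(query))     # -> DIM
```

Because every prediction resolves to a concrete lexicon entry, this style of decoding also yields the interpretability the authors describe: one can always point to the entry that produced a gloss.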