MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
arXiv cs.AI / 4/22/2026
Key Points
- The paper argues that multilingual LLMs’ handling of grammatical gender and morphological agreement has been insufficiently studied compared with tasks like translation and question answering.
- It introduces MORPHOGEN, a multilingual benchmark dataset targeting gender-aware morphological generation in French, Arabic, and Hindi.
- The benchmark’s main task (GENFORM) asks models to rewrite a first-person sentence into the opposite gender while preserving meaning and structure.
- Using a high-quality synthetic dataset, the authors evaluate 15 popular multilingual LLMs (2B–70B parameters) and report notable performance gaps in how the models handle morphological gender.
- The authors position MORPHOGEN as a diagnostic tool to advance more inclusive, morphology-sensitive NLP research.
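
To make the GENFORM task described above concrete, here is a minimal sketch of how one item and a simple scoring pass might look. Everything in it is an assumption made for illustration: the `GenformItem` fields, the two example sentences, and the normalized exact-match metric are hypothetical and are not taken from the paper, which may use a different data schema and different evaluation measures.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class GenformItem:
    """Hypothetical record for one gender-rewrite example (not the paper's schema)."""
    language: str        # ISO 639-1 code, e.g. "fr"
    source: str          # first-person sentence in the source gender
    source_gender: str   # "masculine" or "feminine"
    target: str          # same sentence rewritten in the opposite gender


def normalize(text: str) -> str:
    """Apply Unicode NFC, collapse whitespace, and case-fold before comparing."""
    return " ".join(unicodedata.normalize("NFC", text).split()).casefold()


def exact_match(prediction: str, item: GenformItem) -> bool:
    """True if the model output equals the reference rewrite after normalization."""
    return normalize(prediction) == normalize(item.target)


def accuracy(predictions: list[str], items: list[GenformItem]) -> float:
    """Fraction of predictions that exactly match their reference rewrite."""
    if len(predictions) != len(items):
        raise ValueError("predictions and items must have the same length")
    if not items:
        return 0.0
    hits = sum(exact_match(p, i) for p, i in zip(predictions, items))
    return hits / len(items)


# Illustrative items written for this sketch, not drawn from MORPHOGEN.
ITEMS = [
    GenformItem("fr", "Je suis content.", "masculine", "Je suis contente."),
    GenformItem("hi", "मैं जाता हूँ।", "masculine", "मैं जाती हूँ।"),
]

if __name__ == "__main__":
    # The second prediction leaves the Hindi verb in the masculine form,
    # so it fails to match the feminine reference.
    model_outputs = ["Je suis contente.", "मैं जाता हूँ।"]
    print(accuracy(model_outputs, ITEMS))
```

Exact match is a deliberately strict choice here: it penalizes any change beyond the gender-marked forms, which mirrors the "preserving meaning and structure" requirement, but it also rejects valid alternative rewrites that a softer metric would accept.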



