MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

arXiv cs.AI / 4/22/2026


Key Points

  • The paper argues that multilingual LLMs’ handling of grammatical gender and morphological agreement has been insufficiently studied compared with tasks like translation and question answering.
  • It introduces MORPHOGEN, a multilingual benchmark dataset targeting gender-aware morphological generation in French, Arabic, and Hindi.
  • The benchmark’s main task (GENFORM) asks models to rewrite a first-person sentence into the opposite gender while preserving meaning and structure.
  • Using a high-quality synthetic dataset covering all three languages, the authors benchmark 15 popular multilingual LLMs (2B–70B parameters) on this rewriting task and report significant performance gaps in how current models handle morphological gender.
  • The authors position MORPHOGEN as a diagnostic tool to advance more inclusive, morphology-sensitive NLP research.

Abstract

While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B-70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.
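
To make the GENFORM task concrete, here is a minimal Python sketch of what one item and a simple scoring pass might look like. Everything in it is an assumption made for illustration: the `GenformItem` fields, the two French sentences, and the normalized exact-match metric are hypothetical and are not taken from the MORPHOGEN dataset or from the paper's evaluation protocol, which the abstract does not describe.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class GenformItem:
    """Hypothetical record for one GENFORM-style example (not the paper's schema)."""
    language: str       # e.g. "fr", "ar", "hi"
    source: str         # first-person sentence in the source gender
    source_gender: str  # "masculine" or "feminine"
    reference: str      # the same sentence rewritten in the opposite gender


def normalize(text: str) -> str:
    """Unify Unicode form, collapse whitespace, and ignore case before comparing."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()


def exact_match_accuracy(items, predictions):
    """Fraction of predictions equal to the reference after normalization.

    Exact match is an assumed stand-in metric; it penalizes any valid
    paraphrase and is only meant to illustrate the shape of the task.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(item.reference)
        for item, pred in zip(items, predictions)
    )
    return hits / len(items)


# Illustrative French items written for this sketch, not drawn from the benchmark.
items = [
    GenformItem("fr", "Je suis content d'être arrivé.", "masculine",
                "Je suis contente d'être arrivée."),
    GenformItem("fr", "Je suis allé au marché.", "masculine",
                "Je suis allée au marché."),
]

# A model that rewrites the first sentence correctly but copies the second unchanged.
predictions = [
    "Je suis contente d'être arrivée.",
    "Je suis allé au marché.",
]

print(exact_match_accuracy(items, predictions))  # prints 0.5
```

The second prediction fails because the past participle keeps its masculine form ("allé" rather than "allée"), which is the kind of agreement error a gender-aware rewriting task is meant to surface.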