ProText: A Benchmark Dataset for Measuring (Mis)gendering in Long-Form Texts

Apple Machine Learning Journal / 3/31/2026

📰 News · Models & Research

Key Points

  • ProText is introduced as a benchmark dataset specifically designed to measure (mis)gendering behavior in long-form text settings.
  • The paper (published March 2026) is positioned within fairness and NLP research, focusing on evaluating gender-related errors in generated or processed text.
  • By targeting long-form documents, ProText aims to capture performance issues that may not appear in shorter text benchmarks.
  • The publication provides an entry point (via its arXiv link) for researchers and practitioners to evaluate and compare systems on gendering robustness.
  • The dataset is intended to support more rigorous fairness assessment for NLP models that generate or analyze extended natural-language content.

We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendered), and Pronoun category (masculine, feminine, gender-neutral, none). The dataset is designed to probe (mis)gendering in text transformations such as summarization and rewrites using state-of-the-art Large Language Models, extending beyond traditional pronoun resolution benchmarks and beyond the…
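To make the three-dimensional structure concrete, here is a minimal sketch of what a ProText-style record and a surface-level misgendering check might look like. This is an illustration only: the field names, category labels, and the `misgendering_rate` helper are assumptions for exposition, not the actual ProText schema or evaluation metric.

```python
from dataclasses import dataclass

# Hypothetical record schema; the real ProText fields may differ.
@dataclass
class ProTextExample:
    text: str               # long-form source passage
    theme_noun: str         # e.g. a name, occupation, title, or kinship term
    theme_category: str     # "stereotypically_male" | "stereotypically_female" | "neutral"
    pronoun_category: str   # "masculine" | "feminine" | "gender_neutral" | "none"

PRONOUNS = {
    "masculine": {"he", "him", "his", "himself"},
    "feminine": {"she", "her", "hers", "herself"},
    "gender_neutral": {"they", "them", "their", "theirs", "themself", "themselves"},
}

def misgendering_rate(example: ProTextExample, output: str) -> float:
    """Fraction of gendered pronouns in a model's output that conflict with
    the example's expected pronoun category (a crude lexical check, not the
    paper's metric)."""
    expected = PRONOUNS.get(example.pronoun_category, set())
    tokens = [t.strip('.,;:!?"\'').lower() for t in output.split()]
    gendered = [t for t in tokens if any(t in s for s in PRONOUNS.values())]
    if not gendered:
        return 0.0
    wrong = [t for t in gendered if t not in expected]
    return len(wrong) / len(gendered)
```

For example, given a record whose expected pronoun category is feminine, a rewrite containing "She finished her shift and he left." would score 1/3, since one of the three pronouns conflicts with the expected category.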

Continue reading this article on the original site.