A survey of diversity quantification in natural language processing: The why, what, where and how

arXiv cs.CL / 3/16/2026

Key Points

  • The paper notes fragmentation and inconsistencies in how NLP papers quantify diversity and calls for a unified approach.
  • It adopts Stirling's three diversity dimensions—variety, balance, and disparity—and maps them into an NLP-specific framework.
  • It surveys over 300 diversity-related NLP papers from the ACL Anthology and organizes the analysis around four perspectives: why diversity matters, what it is measured on, where it is measured, and how it is measured.
  • The authors aim to improve comparability across methods, reveal emerging trends, and provide recommendations to guide future research in the field.
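To make the three dimensions concrete, here is a minimal Python sketch. The specific measures are illustrative assumptions, not the ones the paper recommends: variety is counted as the number of distinct types, balance as Shannon evenness (entropy divided by its maximum), and disparity as the mean pairwise distance between types. The helper `char_jaccard_distance` is a hypothetical toy distance; in practice an embedding-based or taxonomy-based distance would more likely be used.

```python
import math
from collections import Counter
from itertools import combinations


def variety(items):
    """Variety: how many distinct types occur."""
    return len(set(items))


def balance(items):
    """Balance: Shannon evenness in [0, 1]; 1 means all types are equally frequent.

    Returns 0.0 by convention when there are fewer than two types.
    """
    counts = Counter(items)
    k = len(counts)
    if k < 2:
        return 0.0
    n = len(items)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


def disparity(items, distance):
    """Disparity: mean pairwise distance between distinct types."""
    types = sorted(set(items))
    pairs = list(combinations(types, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def char_jaccard_distance(a, b):
    """Toy distance (assumption): 1 - Jaccard similarity of the character sets."""
    sa, sb = set(a), set(b)
    return 1.0 - len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    tokens = ["cat", "cat", "dog", "bird"]
    print("variety:", variety(tokens))
    print("balance:", round(balance(tokens), 3))
    print("disparity:", round(disparity(tokens, char_jaccard_distance), 3))
```

The three functions are kept separate on purpose: two samples can have the same variety but differ in balance or disparity, which is one reason a single diversity score can be hard to compare across papers.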

Abstract

The concept of diversity has received increasing attention in natural language processing (NLP) in recent years. It has become an advocated property of datasets and systems, and many measures are used to quantify it. However, it is often addressed in an ad hoc manner, with few explicit justifications of its endorsement and many cross-paper inconsistencies. There have been very few attempts to take a step back and understand the conceptualization of diversity in NLP. To address this fragmentation, we take inspiration from other scientific fields where the concept of diversity has been more thoroughly conceptualized. We build upon Stirling (2007), a unified framework adapted from ecology and economics, which distinguishes three dimensions of diversity: variety, balance, and disparity. We survey over 300 recent diversity-related papers from the ACL Anthology and build an NLP-specific framework with four perspectives: why diversity is important, what diversity is measured on, where it is measured, and how it is measured. Our analysis increases the comparability of approaches to diversity in NLP, reveals emerging trends, and allows us to formulate recommendations for the field.