IYKYK (But AI Doesn't): Automated Content Moderation Does Not Capture Communities' Heterogeneous Attitudes Towards Reclaimed Language

arXiv cs.CL / 4/21/2026


Key Points

  • The paper finds that existing AI-based content moderation often cannot reliably distinguish reclaimed slur usage from hateful usage, leading to the suppression of marginalized communities' voices.
  • Using quantitative and qualitative analyses, the researchers build and analyze an annotated corpus of reclaimed slur usage (e.g., the f-word, n-word, and b-word) drawn from LGBTQIA+, Black, and women communities.
  • Annotation shows low inter-annotator agreement even among in-group annotators, suggesting that how reclaimed slurs are interpreted is highly subjective and depends on nuanced context (a minimal agreement computation is sketched after this list).
  • The study reports poor alignment between human judgments and automated hate-speech assessments from Perspective API; annotator decisions are more strongly associated with whether the slur usage is derogatory and whether the slur is directed at oneself.
  • Semi-structured interviews indicate that differences in lived experience and personal history drive variation in interpretations, underscoring the limits of current automated moderation approaches.
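To make the low-agreement finding concrete, here is a minimal, self-contained Python sketch (not the paper's code; the annotation matrix below is hypothetical) that computes Fleiss' kappa over binary "flag as hate speech" labels from several in-group annotators. Values near zero indicate agreement barely above chance.

import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an items x categories matrix of rating counts."""
    n_items, n_cats = counts.shape
    n_raters = counts.sum(axis=1)[0]                      # assumes equal raters per item
    p_cat = counts.sum(axis=0) / (n_items * n_raters)     # overall category proportions
    # Per-item agreement: fraction of rater pairs that agree on that item.
    p_item = (counts * (counts - 1)).sum(axis=1) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()                                 # observed agreement
    p_e = (p_cat ** 2).sum()                              # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 6 posts, 5 in-group annotators, columns = [not hate, hate].
ratings = np.array([
    [5, 0],
    [3, 2],
    [2, 3],
    [4, 1],
    [1, 4],
    [3, 2],
])
print(f"Fleiss' kappa: {fleiss_kappa(ratings):.2f}")  # ~0.10 here, i.e., weak agreement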

Abstract

Reclaimed slur usage is a common and meaningful practice online for many marginalized communities. It serves as a source of solidarity, identity, and shared experience. However, contemporary automated and AI-based moderation tools for online content largely fail to distinguish between reclaimed and hateful uses of slurs, resulting in the suppression of marginalized voices. In this work, we use quantitative and qualitative methods to examine the attitudes of social media users in LGBTQIA+, Black, and women communities around reclaimed slurs targeting our focus groups, including the f-word, n-word, and b-word. With social media users from these communities, we collect and analyze an annotated online slur usage corpus. The corpus includes annotators' perceptions of whether an online text containing a slur should be flagged as hate speech, as well as contextual features of the slur usage. Across all communities and annotation questions, we observe low inter-annotator agreement, indicating substantial disagreement among in-group annotators. This is compounded by the fact that, absent clear contextual signals of identity and intent, even in-group members may disagree on how to interpret reclaimed slur usage online. Semi-structured interviews with annotators suggest that differences in lived experience and personal history also contribute to this variation. We find poor alignment between annotator judgments and automated hate speech assessments produced by Perspective API. We further observe that certain features of a text, such as whether the slur usage was derogatory and whether the slur was targeted at oneself, are more associated with whether annotators report the text as hate speech. Together, these findings highlight the inherent subjectivity and contextual nature of how marginalized communities interpret slurs online.
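For readers who want to reproduce the kind of human-versus-automated comparison described above, the sketch below queries Perspective API's TOXICITY attribute and checks whether the thresholded score matches an annotator's hate-speech flag. It assumes a valid API key and the standard commentanalyzer v1alpha1 endpoint; the threshold value and helper names are illustrative, not the paper's pipeline.

import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str, api_key: str) -> float:
    """Return Perspective API's TOXICITY summary score (0-1) for `text`."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=payload)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def agrees_with_annotator(text: str, annotator_flag: bool,
                          api_key: str, threshold: float = 0.7) -> bool:
    """True if the thresholded API toxicity matches the human hate-speech flag."""
    return (toxicity_score(text, api_key) >= threshold) == annotator_flag

# Hypothetical usage: a reclaimed in-group usage that an annotator does *not*
# flag as hate speech may still score high on TOXICITY, i.e., a misalignment.
# agrees_with_annotator("example post text", annotator_flag=False, api_key="...")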