Explain the Flag: Contextualizing Hate Speech Beyond Censorship

arXiv cs.CL / April 17, 2026

Key Points

  • Hate speech detection on online platforms often prioritizes censorship or removal, which can reduce transparency and make it harder to understand why content is harmful.
  • The paper proposes a hybrid system that combines LLMs with three newly curated vocabularies to detect and explain hate speech in English, French, and Greek.
  • It uses two complementary pipelines: one that identifies and disambiguates problematic identity-related terms against the curated vocabularies, and one in which LLMs act as context-aware evaluators of direct group-targeting (a sketch follows this list).
  • The two pipelines’ outputs are fused into grounded rationales, and human evaluation indicates higher accuracy and explanation quality than LLM-only baselines.
  • The approach aims to improve accountability and freedom of expression by providing clearer, context-aware explanations rather than only flagging or removing content.
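
For a concrete picture, here is a minimal Python sketch of the two pipelines described in these key points. It is an illustration under stated assumptions, not the authors' code: the `Verdict` type, the toy vocabulary entry, and the `ask_llm` stub are all hypothetical stand-ins for the paper's curated vocabularies (which cover English, French, and Greek) and its actual LLM prompts.

```python
# Illustrative sketch only: the vocabulary entry, Verdict type, and
# ask_llm stub are assumptions, not the paper's actual artifacts.
from dataclasses import dataclass, field

# Pipeline 1 resource: a curated vocabulary of inherently derogatory
# terms tied to identity characteristics (one toy entry shown; the
# paper curates three such vocabularies, one per language).
VOCABULARY = {
    "<slur>": {"identity": "ethnicity", "sense": "derogatory in most contexts"},
}

@dataclass
class Verdict:
    flagged: bool
    rationale: list[str] = field(default_factory=list)

def vocabulary_pipeline(text: str) -> Verdict:
    """Detect curated terms and disambiguate them. Real disambiguation
    (e.g. quoted, reclaimed, or homonymous uses) is omitted here."""
    tokens = text.lower().split()
    hits = [t for t in tokens if t in VOCABULARY]
    rationale = [
        f"'{t}' is listed as {VOCABULARY[t]['sense']} toward a group "
        f"defined by {VOCABULARY[t]['identity']}"
        for t in hits
    ]
    return Verdict(flagged=bool(hits), rationale=rationale)

def ask_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call; returns a canned answer
    so the sketch runs end to end."""
    return "no - the text does not attack a group based on a protected characteristic."

def llm_pipeline(text: str) -> Verdict:
    """Pipeline 2: use an LLM as a context-aware evaluator of whether
    the text directly targets a protected group."""
    answer = ask_llm(
        "Does the following text attack or demean a group based on a "
        "protected characteristic? Answer yes/no, then explain briefly.\n"
        f"Text: {text}"
    )
    return Verdict(flagged=answer.lower().startswith("yes"), rationale=[answer])
```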

Abstract

Hate, derogatory, and offensive speech remains a persistent challenge on online platforms and in public discourse. While automated detection systems are widely used, most focus on censorship or removal, raising concerns about transparency and freedom of expression and limiting opportunities to explain why content is harmful. To address these issues, explanatory approaches have emerged as a promising solution, aiming to make hate speech detection more transparent, accountable, and informative. In this paper, we present a hybrid approach that combines Large Language Models (LLMs) with three newly created and curated vocabularies to detect and explain hate speech in English, French, and Greek. Our system captures both inherently derogatory expressions tied to identity characteristics and direct group-targeted content through two complementary pipelines: one that detects and disambiguates problematic terms using the curated vocabularies, and one that leverages LLMs as context-aware evaluators of group-targeting content. The outputs are fused into grounded explanations that clarify why content is flagged. Human evaluation shows that our hybrid approach is accurate, with high-quality explanations, outperforming LLM-only baselines.
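
The fusion step is what distinguishes this design from a bare flag-or-remove system: evidence from both pipelines is merged into one user-facing rationale. The snippet below continues the hypothetical sketch after the key points, with the same caveat that `fuse` and `explain` are assumed names and the paper's actual fusion logic is not reproduced here.

```python
# Continuation of the sketch above; fuse and explain are hypothetical names.
def fuse(lexical: Verdict, contextual: Verdict) -> Verdict:
    """Combine both pipelines: flag if either flags, keeping the
    rationales of whichever pipelines found evidence, so the final
    output is grounded in concrete reasons."""
    flagged = lexical.flagged or contextual.flagged
    rationale = [r for v in (lexical, contextual) if v.flagged for r in v.rationale]
    return Verdict(flagged=flagged, rationale=rationale)

def explain(text: str) -> str:
    """Produce a user-facing rationale instead of a bare removal flag."""
    verdict = fuse(vocabulary_pipeline(text), llm_pipeline(text))
    if not verdict.flagged:
        return "Not flagged."
    reasons = "\n".join(f"- {r}" for r in verdict.rationale)
    return f"Flagged. Why:\n{reasons}"

# Example: the toy vocabulary term triggers pipeline 1, and the output
# states the reason rather than silently removing the content.
print(explain("an example post containing <slur> here"))
```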