Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation
arXiv cs.CL / 3/25/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that multimodal toxicity benchmarks are overly coarse because they rely on a single hatefulness label that conflates tone (incivility) with content (intolerance).
- It introduces a fine-grained annotation scheme that separates incivility (a rude or dismissive tone) from intolerance (content that attacks pluralism by targeting groups or identities) and applies it to 2,030 memes from the Hateful Memes dataset.
- The authors evaluate multiple vision-language models using (1) coarse-label training, (2) transfer learning across label schemes, and (3) a joint learning approach combining coarse hatefulness with the new fine-grained annotations.
- Results show that adding the fine-grained labels improves overall moderation performance and yields more balanced error profiles, including reduced under-detection of harmful content.
- The work positions improved data quality—by using both coarse and fine-grained labels—as a practical path toward more reliable multimodal content moderation systems.
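The joint learning setup described above can be pictured as a multi-task objective: one loss term for the coarse hatefulness label plus weighted terms for the fine-grained incivility and intolerance labels. The sketch below is illustrative only; the function names, the binary-cross-entropy formulation, and the balancing weight are assumptions, not the paper's actual implementation.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y in {0, 1}."""
    eps = 1e-7  # clamp to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def joint_loss(p_hate, y_hate,
               p_incivility, y_incivility,
               p_intolerance, y_intolerance,
               weight=0.5):
    """Hypothetical multi-task objective: coarse hatefulness term plus
    fine-grained incivility/intolerance terms. `weight` balances the
    fine-grained supervision against the coarse label; 0.5 is illustrative."""
    coarse = bce(p_hate, y_hate)
    fine = bce(p_incivility, y_incivility) + bce(p_intolerance, y_intolerance)
    return coarse + weight * fine

# Example: a meme the coarse head scores low (0.2) but whose fine-grained
# heads flag incivility (0.9) and intolerance (0.8) -- the extra terms
# keep the gradient signal alive even when the coarse label alone would not.
loss = joint_loss(0.2, 0, 0.9, 1, 0.8, 1)
```

Under this kind of objective, examples that the coarse label under-detects still contribute supervision through the fine-grained heads, which is one plausible mechanism for the more balanced error profiles the paper reports.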