Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation

arXiv cs.CL / 3/25/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that multimodal toxicity benchmarks are overly coarse because they rely on a single hatefulness label that conflates tone (incivility) with content (intolerance).
  • It introduces a fine-grained annotation scheme that separates incivility (rude or dismissive tone) from intolerance (content that attacks pluralism and targets groups or identities) and applies it to 2,030 memes from the Hateful Memes dataset.
  • The authors evaluate multiple vision-language models using (1) coarse-label training, (2) transfer learning across label schemes, and (3) a joint learning approach combining coarse hatefulness with the new fine-grained annotations.
  • Results show that adding the fine-grained labels improves overall moderation performance and yields more balanced error profiles, including reduced under-detection of harmful content.
  • The work positions improved data quality—by using both coarse and fine-grained labels—as a practical path toward more reliable multimodal content moderation systems.

Abstract

Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that distinguishes two separable dimensions, incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities), and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes, and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone (FNR-FPR, the difference between false-negative and false-positive rates: 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B). This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining both coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.
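The FNR-FPR gap reported above measures how lopsided a moderator's errors are: a large positive value means the model misses harmful content (false negatives) far more often than it over-flags benign content (false positives). The paper does not publish its evaluation code, so the sketch below is an illustrative reconstruction of how such a gap can be computed from binary predictions; the function name and label convention (1 = harmful, 0 = benign) are assumptions.

```python
from typing import Sequence


def fnr_fpr_gap(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Illustrative FNR - FPR gap for binary moderation labels.

    Assumes 1 = harmful, 0 = benign. A value near 0 indicates a
    balanced error profile; a large positive value indicates
    under-detection of harmful content.
    """
    # Tally the four confusion-matrix cells.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    # Guard against empty classes to avoid division by zero.
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fnr - fpr
```

For example, a model that misses 3 of 4 harmful memes (FNR = 0.75) while over-flagging 1 of 4 benign memes (FPR = 0.25) has a gap of 0.50, the kind of skew the fine-grained labels are reported to reduce.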