AI Navigate

Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

arXiv cs.AI / 3/20/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The article presents an explainability-driven analysis of a RoBERTa-based harmful content detector trained on the Civil Comments dataset to understand how predictions are made, not just how accurate they are.
  • It applies two post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, to compare their attributions for correct predictions and systematic failures.
  • Despite strong performance (AUC 0.93, accuracy 0.94), the analysis surfaces limitations invisible in aggregate metrics: Integrated Gradients produces diffuse contextual attributions while Shapley Additive Explanations concentrates on explicit lexical cues, and this divergence contributes to both false negatives and false positives.
  • The work argues that explainable AI can support human-in-the-loop moderation and serve as a transparency and diagnostic resource, rather than primarily boosting performance.
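The two attribution methods named above can be illustrated on a toy, fully differentiable scorer. This is a hedged sketch, not the paper's setup: the actual study applies these methods to a RoBERTa-based classifier, whereas here the "model" is a three-feature sigmoid and the feature names are invented. The sketch shows the mechanics of each method: Integrated Gradients accumulates gradients along a path from a baseline to the input, while Shapley values average each feature's marginal contribution over all coalitions. Both satisfy completeness, so their attributions sum to the change in model output relative to the baseline.

```python
import math
from itertools import combinations

# Toy "classifier": a sigmoid over 3 feature values (stand-ins for
# token-level inputs; the paper's real model is a RoBERTa classifier).
W = [2.0, -1.0, 0.5]

def score(x):
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid output in (0, 1)

def integrated_gradients(x, baseline, steps=200):
    """IG_i = (x_i - b_i) * average of df/dx_i along the straight-line
    path from baseline to x, approximated with a Riemann sum and
    central finite-difference gradients."""
    n, eps = len(x), 1e-5
    attrib = [0.0] * n
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            plus, minus = point[:], point[:]
            plus[i] += eps
            minus[i] -= eps
            attrib[i] += (score(plus) - score(minus)) / (2 * eps)
    return [(xi - b) * a / steps for xi, b, a in zip(x, baseline, attrib)]

def shapley_values(x, baseline):
    """Exact Shapley values by enumerating every coalition; absent
    features are replaced by their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for S in combinations(others, size):
                with_i = [x[j] if (j in S or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in S else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (score(with_i) - score(without_i))
    return phi

x = [1.0, 0.5, -0.5]          # hypothetical feature values
baseline = [0.0, 0.0, 0.0]    # all-zero reference input
ig = integrated_gradients(x, baseline)
sv = shapley_values(x, baseline)
delta = score(x) - score(baseline)  # both attributions should sum to this
```

On this smooth toy model the two methods agree closely; the paper's finding is that on a real transformer they diverge, which is exactly what makes comparing them diagnostically useful. Exact Shapley enumeration is exponential in the number of features, which is why practical tools use sampling-based approximations.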

Abstract

Although automated harmful content detection systems are widely used to monitor online platforms, moderators and end users frequently cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little attention has been paid to understanding why neural models identify content as harmful, especially in borderline, contextual, and politically sensitive situations. In this work, an explainability-driven analysis is conducted of a neural harmful content detection model trained on the Civil Comments dataset. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients tends to produce diffuse contextual attributions, while Shapley Additive Explanations concentrates on explicit lexical cues. The resulting divergence in their outputs manifests in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, and political discourse. The results suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and making the rationale behind automated decisions more interpretable. Most importantly, this work highlights the role of explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.