Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection
arXiv cs.AI / 3/20/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The article presents an explainability-driven analysis of a RoBERTa-based harmful content detector trained on the Civil Comments dataset to understand how predictions are made, not just how accurate they are.
- It applies two post-hoc explanation methods, Shapley Additive Explanations (SHAP) and Integrated Gradients (IG), and compares their attributions on correct predictions and on systematic failures (a minimal sketch of both methods follows this list).
- Despite strong headline performance (AUC 0.93, accuracy 0.94), the study surfaces failure modes where the two explanations diverge: Integrated Gradients produces diffuse, context-wide attributions while SHAP concentrates on explicit lexical cues, patterns that contribute to both false negatives and false positives.
- The work argues that explainable AI is most useful here as a transparency and diagnostic resource supporting human-in-the-loop moderation, rather than as a means of boosting raw performance.
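The paper's exact pipeline is not reproduced here; the sketch below only illustrates how SHAP and Integrated Gradients are typically applied to a RoBERTa sequence classifier using the `shap` and `captum` libraries. The checkpoint name (`roberta-base`), the "harmful" label index, the all-padding baseline, and the example sentence are placeholder assumptions, not the paper's configuration.

```python
# Hypothetical sketch: SHAP and Integrated Gradients attributions for a
# RoBERTa-style harmful-content classifier. Placeholders throughout; the
# paper fine-tunes RoBERTa on Civil Comments, which is not shown here.
import torch
import shap
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

MODEL = "roberta-base"  # placeholder checkpoint, not the paper's fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

text = "You are an idiot."  # toy input, not drawn from the dataset

# --- SHAP: perturbation-based, tends to highlight explicit lexical cues ---
clf = pipeline("text-classification", model=model, tokenizer=tokenizer, top_k=None)
explainer = shap.Explainer(clf)   # builds a text masker from the pipeline's tokenizer
shap_values = explainer([text])   # per-token Shapley values for each class

# --- Integrated Gradients: path integral of gradients from a baseline input ---
enc = tokenizer(text, return_tensors="pt")
# Simplified baseline: every position replaced by the pad token.
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)

def forward(input_ids, attention_mask):
    # Logit of the "harmful" class; index 1 is an assumption.
    return model(input_ids=input_ids, attention_mask=attention_mask).logits[:, 1]

# Attribute over the embedding layer because token ids are discrete.
lig = LayerIntegratedGradients(forward, model.roberta.embeddings)
attributions = lig.attribute(
    enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    n_steps=50,
)
token_scores = attributions.sum(dim=-1).squeeze(0)  # collapse hidden dim: one score per token
for tok, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), token_scores):
    print(f"{tok:>12s} {score.item():+.4f}")
```

`LayerIntegratedGradients` is used instead of plain IG because token ids cannot be interpolated directly; the path integral is taken over the continuous embedding outputs, which is also why IG attributions can spread diffusely across context tokens while SHAP's occlusion-style perturbations stay anchored to individual words.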