SenBen: Sensitive Scene Graphs for Explainable Content Moderation

arXiv cs.AI / 4/13/2026

Key Points

  • The paper introduces SenBen, a large-scale benchmark for explainable content moderation built from 13,999 annotated movie frames with Visual Genome-style scene graphs and 16 sensitivity tags across five categories.
  • It targets limitations of current image moderation approaches by adding spatial grounding and interpretability, enabling detection explanations that specify what sensitive behavior occurred, who/what is involved, and where it occurs in the scene.
  • The authors distill a frontier vision-language model (VLM) into a compact 241M-parameter student model using a multi-task training recipe designed to address vocabulary imbalance in autoregressive scene graph generation.
  • The proposed training recipe improves SenBen Recall by 6.4 percentage points over standard cross-entropy training. On grounded scene graph metrics, the student model outperforms every evaluated VLM except the Gemini models, as well as all commercial safety APIs.
  • The student model is reported to run inference 7.6× faster and use 16× less GPU memory, while achieving the highest object detection and captioning scores of all evaluated models.
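
To make the "what, who, where" idea in the points above concrete, here is a minimal Python sketch of a Visual Genome-style scene graph record with sensitivity tags attached. It is an illustration only: the class and field names (SceneObject, Relation, SceneGraph, explain) and the labels "person", "weapon", "holding", and "violence" are assumptions made for this example, not the SenBen schema or vocabulary. Only the attribute "aggression" is named in the abstract.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str                                # object class (hypothetical label)
    box: tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
    attributes: list[str] = field(default_factory=list)


@dataclass
class Relation:
    subject: int     # index into SceneGraph.objects
    predicate: str   # relation label (hypothetical)
    object: int      # index into SceneGraph.objects


@dataclass
class SceneGraph:
    objects: list[SceneObject]
    relations: list[Relation]
    tags: list[str]  # frame-level sensitivity tags (hypothetical label below)


def explain(graph: SceneGraph) -> list[str]:
    """Render each relation as a short 'who does what to whom, and where' line."""
    lines = []
    for rel in graph.relations:
        subj = graph.objects[rel.subject]
        obj = graph.objects[rel.object]
        attrs = f" ({', '.join(subj.attributes)})" if subj.attributes else ""
        lines.append(
            f"{subj.name}{attrs} {rel.predicate} {obj.name}; subject box={subj.box}"
        )
    return lines


# Illustrative frame annotation; all values are made up for this sketch.
example = SceneGraph(
    objects=[
        SceneObject("person", (40, 30, 210, 400), ["aggression"]),
        SceneObject("weapon", (180, 220, 260, 300)),
    ],
    relations=[Relation(subject=0, predicate="holding", object=1)],
    tags=["violence"],
)

for line in explain(example):
    print(line)
```

A binary safe/unsafe classifier would return only a single label for this frame. The graph additionally records which objects are involved, how they relate, and where each one sits in the image.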

Abstract

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes; 28 attributes, including affective states such as pain, fear, aggression, and distress; 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at 7.6× faster inference and 16× less GPU memory.
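
The abstract names an asymmetric loss on the tag head but does not give its form. As a rough illustration, the sketch below implements one widely used asymmetric loss for multi-label classification: a focal-style loss with separate focusing exponents for positive and negative labels and a probability margin that discards easy negatives. Whether SenBen uses exactly this formulation is an assumption, and the function name and hyperparameter values (gamma_pos, gamma_neg, margin) are illustrative, not taken from the paper.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0,
                    margin=0.05, eps=1e-8):
    """Sum of per-tag asymmetric losses for one image.

    probs:   predicted probability for each tag, in [0, 1]
    targets: 1 if the tag applies to the image, else 0
    All hyperparameter defaults are illustrative assumptions.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive tag: focal-style weighting with a small exponent,
            # so rare positives keep most of their gradient signal.
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            # Negative tag: shift the probability down by the margin so that
            # easy negatives (p <= margin) contribute exactly zero, then
            # down-weight the remaining negatives with a larger exponent.
            p_m = max(p - margin, 0.0)
            total += -(p_m ** gamma_neg) * math.log(max(1.0 - p_m, eps))
    return total


# Three hypothetical tags: one true positive, one easy negative, one hard negative.
print(asymmetric_loss([0.5, 0.03, 0.9], [1, 0, 0]))
```

The asymmetry matters when most tags are absent from most frames: with a symmetric loss, the many easy negatives can dominate the gradient and suppress recall on rare tags.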