KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety

arXiv cs.CV / March 18, 2026

Key Points

  • KidsNanny is a two-stage multimodal content moderation system designed for child safety, combining a vision transformer with an object detector in Stage 1 and OCR plus a 7B language model for contextual reasoning in Stage 2.
  • Stage 1 outputs are routed as text to Stage 2, with an 11.7 ms latency for Stage 1 and a total end-to-end latency of 120 ms.
  • On the UnsafeBench Sexual category (1,054 images), Stage 1 achieves 80.27% accuracy and 85.39% F1, while the full pipeline reaches 81.40% accuracy and 86.16% F1, outperforming ShieldGemma-2 and LlavaGuard on some metrics.
  • Text-aware evaluation on a text-only subset shows 100% recall and 75.76% precision for KidsNanny, suggesting OCR-based reasoning can improve the recall-precision trade-off for text-embedded threats, though the small sample limits generalizability.
  • The work aims to advance efficient multimodal content moderation for child safety by documenting architecture and evaluation methodology.
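The routing described above can be sketched as a minimal pipeline. This is an illustrative stub, not the paper's implementation: the function names, the toy keyword "language model", and the early-exit on a confident Stage 1 score are all assumptions for illustration; the one detail taken from the paper is that Stage 1 results reach Stage 2 serialized as text, not as raw pixels.

```python
from dataclasses import dataclass


@dataclass
class Stage1Result:
    detected_objects: list[str]  # object-detector labels (stand-in)
    unsafe_score: float          # ViT classifier probability (stand-in)


def stage1_visual_screen(image: dict) -> Stage1Result:
    # Stand-in for the ViT classifier + object detector: reads
    # pre-computed fields from a toy "image" record.
    return Stage1Result(image.get("objects", []), image.get("score", 0.0))


def stage2_contextual_reason(stage1_text: str, ocr_text: str) -> bool:
    # Stand-in for the 7B language model: a trivial keyword rule that
    # flags content when either text channel mentions an unsafe cue.
    cues = {"unsafe_cue"}
    tokens = set((stage1_text + " " + ocr_text).lower().split())
    return bool(cues & tokens)


def moderate(image: dict, threshold: float = 0.5) -> bool:
    s1 = stage1_visual_screen(image)
    # Key design point: Stage 1 outputs are passed on as *text*.
    stage1_text = f"score={s1.unsafe_score:.2f} objects={' '.join(s1.detected_objects)}"
    if s1.unsafe_score >= threshold:
        return True  # assumed early exit on a confident visual detection
    ocr_text = image.get("ocr", "")  # stand-in for a real OCR engine
    return stage2_contextual_reason(stage1_text, ocr_text)
```

In this shape, the cheap visual stage handles clear-cut cases, and only ambiguous images pay the cost of OCR plus language-model reasoning, which is how the pipeline keeps its reported 120 ms end-to-end budget.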

Abstract

We present KidsNanny, a two-stage multimodal content moderation architecture for child safety. Stage 1 combines a vision transformer (ViT) with an object detector for visual screening (11.7 ms); outputs are routed as text, not raw pixels, to Stage 2, which applies OCR and a text-based 7B language model for contextual reasoning (120 ms total pipeline). We evaluate on the UnsafeBench Sexual category (1,054 images) under two regimes: vision-only, isolating Stage 1, and multimodal, evaluating the full Stage 1+2 pipeline. Stage 1 achieves 80.27% accuracy and 85.39% F1 at 11.7 ms; vision-only baselines range from 59.01% to 77.04% accuracy. The full pipeline achieves 81.40% accuracy and 86.16% F1 at 120 ms, compared to ShieldGemma-2 (64.80% accuracy, 1,136 ms) and LlavaGuard (80.36% accuracy, 4,138 ms). To evaluate text-awareness, we filter two subsets: a text+visual subset (257 images) and a text-only subset (44 images where safety depends primarily on embedded text). On text-only images, KidsNanny achieves 100% recall (25/25 positives; small sample) and 75.76% precision; ShieldGemma-2 achieves 84% recall and 60% precision at 1,136 ms. Results suggest that dedicated OCR-based reasoning may offer recall-precision advantages on text-embedded threats at lower latency, though the small text-only subset limits generalizability. By documenting this architecture and evaluation methodology, we aim to contribute to the broader research effort on efficient multimodal content moderation for child safety.
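The text-only figures can be sanity-checked from the confusion-matrix counts. The reported 100% recall on 25/25 positives with 75.76% precision is consistent with 25 true positives, 0 false negatives, and 8 false positives (25/33 ≈ 75.76%); the false-positive counts below are inferred from the reported precision, not stated in the abstract.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard precision/recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# KidsNanny, text-only subset: 25/25 positives caught, 8 FP inferred.
p, r = precision_recall(tp=25, fp=8, fn=0)
print(f"KidsNanny:     precision={p:.2%} recall={r:.2%}")
# → precision=75.76% recall=100.00%

# ShieldGemma-2: 84% recall = 21/25 positives; 60% precision = 21/35, i.e. 14 FP.
p, r = precision_recall(tp=21, fp=14, fn=4)
print(f"ShieldGemma-2: precision={p:.2%} recall={r:.2%}")
# → precision=60.00% recall=84.00%
```

With only 25 positives in the subset, a single missed positive would drop recall to 96%, which illustrates why the abstract hedges on the small sample.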