KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety
arXiv cs.CV / 3/18/2026
Key Points
- KidsNanny is a two-stage multimodal content moderation system designed for child safety, combining a vision transformer with an object detector in Stage 1 and OCR plus a 7B language model for contextual reasoning in Stage 2.
- Stage 1 outputs are serialized as text and routed to Stage 2, with an 11.7 ms latency for Stage 1 and a total end-to-end latency of 120 ms.
- On the UnsafeBench Sexual category (1,054 images), Stage 1 alone achieves 80.27% accuracy and 85.39% F1, while the full pipeline reaches 81.40% accuracy and 86.16% F1, outperforming ShieldGemma-2 and LlavaGuard on some metrics.
- Text-aware evaluation on a text-only subset shows 100% recall and 75.76% precision for KidsNanny, suggesting OCR-based reasoning can improve detection of text-embedded threats, though the small sample limits generalizability.
- The work aims to advance efficient multimodal content moderation for child safety by documenting architecture and evaluation methodology.
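The two-stage routing described above can be sketched as follows. This is a minimal structural illustration, not the paper's implementation: the model calls are stubs, and all function names, field names, and the keyword heuristic in `stage2` are hypothetical placeholders standing in for the vision transformer, object detector, OCR engine, and 7B language model.

```python
from dataclasses import dataclass

@dataclass
class Stage1Result:
    label: str           # placeholder for the vision-transformer classification
    objects: list[str]   # placeholder for object-detector outputs

def stage1(image: dict) -> Stage1Result:
    # Stub: the real Stage 1 runs a ViT classifier plus an object
    # detector in about 11.7 ms; here we just echo simulated fields.
    return Stage1Result(label=image.get("label", "unknown"),
                        objects=image.get("objects", []))

def run_ocr(image: dict) -> str:
    # Stub for the OCR component in Stage 2.
    return image.get("text", "")

def stage2(prompt: str) -> str:
    # Stub for the 7B language model's contextual reasoning;
    # a trivial keyword check stands in for the actual model.
    flagged = ("explicit", "weapon")
    return "unsafe" if any(w in prompt for w in flagged) else "safe"

def moderate(image: dict) -> str:
    s1 = stage1(image)
    # Key design point: Stage 1 outputs are serialized as text and
    # routed, together with OCR text, into the Stage 2 reasoner.
    prompt = (f"classification: {s1.label}; "
              f"objects: {', '.join(s1.objects) or 'none'}; "
              f"ocr: {run_ocr(image) or 'none'}")
    return stage2(prompt)
```

Usage with simulated inputs: `moderate({"label": "explicit", "objects": []})` returns `"unsafe"`, while a benign image with embedded text, e.g. `moderate({"label": "landscape", "objects": ["tree"], "text": "hello"})`, returns `"safe"`.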