KidsNanny: A Two-Stage Multimodal Content Moderation Pipeline Integrating Visual Classification, Object Detection, OCR, and Contextual Reasoning for Child Safety
arXiv cs.CV / March 18, 2026
Key Points
- KidsNanny is a two-stage multimodal content moderation system designed for child safety, combining a vision transformer with an object detector in Stage 1 and OCR plus a 7B language model for contextual reasoning in Stage 2.
- Stage 1 outputs are routed as text to Stage 2; Stage 1 adds 11.7 ms of latency, and total end-to-end latency is 120 ms.
- On the Sexual category of UnsafeBench (1,054 images), Stage 1 achieves 80.27% accuracy and 85.39% F1, while the full pipeline reaches 81.40% accuracy and 86.16% F1, outperforming ShieldGemma-2 and LlavaGuard on some metrics.
- Text-aware evaluation on a text-only subset shows 100% recall and 75.76% precision for KidsNanny, suggesting OCR-based reasoning can improve the recall-precision tradeoff for text-embedded threats, though the small sample size limits generalizability.
- The work aims to advance efficient multimodal content moderation for child safety by documenting architecture and evaluation methodology.
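The routing described above, where Stage 1's visual signals (classifier label, detected objects, OCR text) are serialized into a text prompt for the Stage 2 language model, can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names (`Stage1Output`, `to_prompt`, `stage2_decide`) are hypothetical, and a trivial rule stands in for the 7B model so the pipeline runs end to end.

```python
from dataclasses import dataclass, field


@dataclass
class Stage1Output:
    """Hypothetical container for Stage 1 results (classifier + detector + OCR)."""
    label: str                      # vision-transformer classification
    score: float                    # classifier confidence in [0, 1]
    objects: list = field(default_factory=list)  # object-detector hits
    ocr_text: str = ""              # text extracted from the image


def to_prompt(out: Stage1Output) -> str:
    """Serialize Stage 1 visual signals into text for the Stage 2 reasoner."""
    objs = ", ".join(out.objects) if out.objects else "none"
    return (
        f"Image classifier: {out.label} (confidence {out.score:.2f}). "
        f"Detected objects: {objs}. "
        f"OCR text: {out.ocr_text or 'none'}."
    )


def stage2_decide(prompt: str) -> str:
    """Stand-in for the 7B LM's contextual reasoning.

    A real system would send the prompt to the language model; here a
    keyword rule keeps the sketch self-contained.
    """
    return "unsafe" if "unsafe" in prompt else "safe"


def moderate(out: Stage1Output) -> str:
    """Full two-stage pipeline: Stage 1 output -> text prompt -> decision."""
    return stage2_decide(to_prompt(out))
```

The key design point from the paper is that Stage 2 never sees pixels: it reasons only over the textual summary produced by Stage 1, which is what keeps the contextual-reasoning stage cheap enough for the reported 120 ms end-to-end latency.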