Integration of Object Detection and Small VLMs for Construction Safety Hazard Identification
arXiv cs.CV / 4/8/2026
Key Points
- The study targets real-time construction safety hazard identification by combining efficient small vision-language models (sVLMs) with object detection to improve accuracy and reduce hallucinations in complex scenes.
- The proposed detection-guided framework uses a YOLOv11n detector to localize workers and construction machinery, then injects those entities into structured prompts for spatially grounded multimodal reasoning.
- Six sVLMs (e.g., Gemma-3 4B, Qwen-3-VL variants, InternVL-3, SmolVLM-2B) were evaluated in a zero-shot setting on a curated construction hazard dataset, and all showed improved hazard detection performance under the detection-guided framework.
- For the best model (Gemma-3 4B), the F1-score rose to 50.6% from a 34.5% baseline, while explanation quality improved substantially (BERTScore F1 from 0.61 to 0.82).
- The approach keeps computational overhead low, adding about 2.5 ms per image during inference, making it more practical than larger VLM-only approaches for near real-time use.
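The detection-guided prompting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Detection` structure, entity labels, confidence threshold, and prompt wording are all assumptions, and the actual YOLOv11n and sVLM calls are omitted.

```python
# Hypothetical sketch: inject detector outputs into a structured prompt
# so a small VLM reasons over spatially grounded entities.
# All names and the prompt template here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str                       # e.g. "worker", "excavator"
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    confidence: float

def build_hazard_prompt(detections, conf_threshold=0.5):
    """Format detected entities into a structured prompt for the sVLM."""
    kept = [d for d in detections if d.confidence >= conf_threshold]
    if kept:
        entity_lines = [
            f"- {d.label} at bbox {d.bbox} (conf {d.confidence:.2f})"
            for d in kept
        ]
    else:
        entity_lines = ["- (no workers or machinery detected)"]
    return (
        "You are a construction safety inspector.\n"
        "Detected entities in the image:\n"
        + "\n".join(entity_lines)
        + "\nUsing these grounded entities and the image, identify any "
        "safety hazards and explain the spatial relation causing each."
    )

# Example: detections as a YOLO-style detector might emit them.
detections = [
    Detection("worker", (120, 80, 180, 260), 0.91),
    Detection("excavator", (300, 40, 620, 400), 0.88),
]
prompt = build_hazard_prompt(detections)
```

The prompt string would then be passed, together with the image, to the chosen sVLM; grounding the entities this way is what the study credits for the reduced hallucination rate.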