Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video
arXiv cs.CV / 5/5/2026
Key Points
- The paper addresses rare-event grounding of traffic accidents in real CCTV footage without labeled training data, requiring accurate joint localization across time, space, and collision type.
- It proposes a two-pass zero-shot pipeline. A coarse full-video scan at 1 fps produces an initial (t, x, y, c) estimate; a finer 5 fps pass then refines it within a ±3 s window, with deterministic confidence gates that fall back to the coarse output near boundaries.
- It assigns specialist roles to two frozen vision-language models: Qwen3-VL-Plus performs the grounding, and Gemini 3.1 Flash-Lite performs collision typing on a centered video clip.
- On the ACCIDENT@CVPR 2026 benchmark with 2,027 real CCTV videos, the method achieves ACC^S = 0.539, outperforming the benchmark’s best oracle baseline (0.412), the strongest single-VLM baseline (Molmo-7B, 0.396), and a naive baseline (0.289).
- The authors report practical system details: at most three API calls per video, a 17% rate of fallback to a physics-based method when API calls fail, and an estimated full-run cost of about $20.
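The coarse-to-fine step described in the key points can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Estimate` type, the `conf` field, the `min_conf` and `margin` thresholds, and the reading of "near boundaries" as "near the edges of the refinement window" are all assumptions made for the example. Only the 1 fps / 5 fps rates, the ±3 s window, and the fall-back-to-coarse behavior come from the summary above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Estimate:
    """Hypothetical container for a (t, x, y, c) estimate plus a confidence."""
    t: float     # event time in seconds
    x: float     # horizontal position (assumed normalized to [0, 1])
    y: float     # vertical position (assumed normalized to [0, 1])
    c: str       # collision type label
    conf: float  # assumed model confidence in [0, 1]


def refine_window(t_coarse: float, duration: float,
                  half_width: float = 3.0) -> Tuple[float, float]:
    """Clamp the +/- half_width window around the coarse time to the video."""
    start = max(0.0, t_coarse - half_width)
    end = min(duration, t_coarse + half_width)
    return start, end


def frame_times(start: float, end: float, fps: float = 5.0) -> List[float]:
    """Timestamps to sample in the fine pass, inclusive of both ends."""
    n = int(round((end - start) * fps))
    return [start + i / fps for i in range(n + 1)]


def gate(coarse: Estimate, fine: Optional[Estimate], duration: float,
         half_width: float = 3.0, min_conf: float = 0.5,
         margin: float = 0.5) -> Estimate:
    """Deterministic gate: keep the fine estimate only if it looks trustworthy.

    Falls back to the coarse estimate when the fine pass is missing, has low
    confidence, or lands within `margin` seconds of the window edge
    (thresholds are illustrative assumptions).
    """
    if fine is None or fine.conf < min_conf:
        return coarse
    start, end = refine_window(coarse.t, duration, half_width)
    near_boundary = (fine.t - start) < margin or (end - fine.t) < margin
    return coarse if near_boundary else fine
```

A rule-based gate like this keeps the second pass from making results worse: a refined estimate is accepted only when it is both confident and well inside the window, so the worst case is the coarse answer.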