Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video

arXiv cs.CV / 5/5/2026


Key Points

  • The paper addresses rare-event grounding of traffic accidents in real CCTV footage without labeled training data, requiring accurate joint localization across time, space, and collision type.
  • It proposes a two-pass zero-shot pipeline (sketched in code after this list): a coarse full-video scan at 1 fps produces an initial (t, x, y, c) estimate, followed by a finer 5 fps refinement within a ±3 s window, using deterministic confidence gates that fall back to the coarse output near boundaries.
  • It assigns specialist roles to two frozen vision-language models: Qwen3-VL-Plus performs the grounding, while Gemini 3.1 Flash-Lite performs collision typing on a video clip centered on the predicted accident time.
  • On the ACCIDENT@CVPR 2026 benchmark with 2,027 real CCTV videos, the method achieves ACC^S = 0.539, outperforming the benchmark’s best oracle baseline (0.412), the strongest single-VLM baseline (Molmo-7B, 0.396), and a naive baseline (0.289).
  • The authors report practical system details: up to three API calls per video, with 17% of videos falling back to a physics-based path when API calls fail, and an estimated full-run cost of about $20.
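
Concretely, the control flow is a coarse scan, a windowed re-query, and two deterministic gates. The Python sketch below illustrates that structure under stated assumptions: `query_vlm`, the `Estimate` fields, and the epsilon thresholds are hypothetical stand-ins, not code from the paper.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    t: float      # accident time in seconds
    x: float      # normalized horizontal coordinate in [0, 1]
    y: float      # normalized vertical coordinate in [0, 1]
    c: str        # collision-type label
    hedged: bool  # True if the model hedged about the window boundary

def two_pass_ground(video, query_vlm, window_s=3.0, eps=1e-3):
    """Coarse 1 fps scan, then a 5 fps re-query in a +/- window_s window,
    gated so that suspect refinements revert to the coarse estimate."""
    # Pass 1: full-video scan at 1 fps -> coarse (t, x, y, c).
    coarse = query_vlm(video, fps=1, span=(0.0, video.duration))

    # Pass 2: 5 fps re-query restricted to [t - window_s, t + window_s].
    lo = max(0.0, coarse.t - window_s)
    hi = min(video.duration, coarse.t + window_s)
    fine = query_vlm(video, fps=5, span=(lo, hi))

    # Gate 1: boundary hedge -- the refined time sits on the window edge
    # (or the model explicitly hedged), suggesting the true event may lie
    # outside the refinement window.
    boundary_hedge = fine.hedged or abs(fine.t - lo) < eps or abs(fine.t - hi) < eps

    # Gate 2: edge-clamped coordinates -- x or y pinned to the frame border,
    # a typical signature of a degenerate spatial answer.
    edge_clamped = min(fine.x, fine.y) < eps or max(fine.x, fine.y) > 1.0 - eps

    return coarse if (boundary_hedge or edge_clamped) else fine
```

Keeping the gates deterministic is the key design choice: a refinement that hedges at the window boundary or clamps to a frame edge is discarded wholesale rather than trusted partially, so the coarse pass always supplies a valid answer.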

Abstract

Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a ±3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]): +0.127 over the benchmark paper's best-of-baselines oracle (0.412), +0.143 over the strongest single-VLM baseline (Molmo-7B, 0.396), and +0.250 over the naive baseline (0.289). The VLM path uses up to three API calls per video (17% fall back to physics on API failures); the full run costs ~$20.
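
The reported per-video budget implies a simple orchestration: at most three model calls, with a non-VLM escape hatch on failure. Below is a minimal sketch of that shape; the callables, their signatures, and the exception handling are illustrative assumptions, since the paper does not publish this interface.

```python
# Minimal orchestration sketch for the reported call budget: up to three
# VLM API calls per video, falling back to a physics-based path on failure.
# All function names and signatures here are hypothetical.
def process_video(video, ground_coarse, ground_fine, type_collision, physics_fallback):
    try:
        coarse = ground_coarse(video)                # call 1: full scan at 1 fps
        final = ground_fine(video, around=coarse.t)  # call 2: 5 fps pass in a ±3 s window
        c = type_collision(video, center=final.t)    # call 3: typing on a centered clip
        return (final.t, final.x, final.y, c)
    except Exception:
        # The abstract reports that 17% of videos take this path on API failures.
        return physics_fallback(video)
```

A production version would catch a narrower exception type than `Exception`; the sketch only shows the shape of the fallback, where every video still yields a complete (t, x, y, c) prediction.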