Two-Pass Zero-Shot Temporal-Spatial Grounding of Rare Traffic Events in Surveillance Video

arXiv cs.CV / 5/5/2026


Key Points

  • The paper addresses rare-event grounding of traffic accidents in real CCTV footage without labeled training data, requiring accurate joint localization across time, space, and collision type.
  • It proposes a two-pass zero-shot pipeline (sketched in code after this list): a coarse full-video scan at 1 fps produces an initial (t, x, y, c) estimate, followed by a finer 5 fps refinement within a ±3 s window, using deterministic confidence gates that fall back to the coarse output near boundaries.
  • It assigns specialist roles to two frozen vision-language models: Qwen3-VL-Plus performs the grounding, while Gemini 3.1 Flash-Lite performs collision typing on a video clip centered on the predicted accident time.
  • On the ACCIDENT@CVPR 2026 benchmark with 2,027 real CCTV videos, the method achieves ACC^S = 0.539, outperforming the benchmark’s best oracle baseline (0.412), the strongest single-VLM baseline (Molmo-7B, 0.396), and a naive baseline (0.289).
  • The authors report practical system details: up to three API calls per video, with 17% of videos falling back to a physics-based path when API calls fail, and an estimated full-run cost of about $20.
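
Concretely, the control flow is a coarse scan, a windowed re-query, and two deterministic gates. The Python sketch below illustrates that structure under stated assumptions: `query_vlm`, the `Estimate` fields, and the epsilon thresholds are hypothetical stand-ins, not code from the paper.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    t: float      # accident time in seconds
    x: float      # normalized horizontal coordinate in [0, 1]
    y: float      # normalized vertical coordinate in [0, 1]
    c: str        # collision-type label
    hedged: bool  # True if the model hedged about the window boundary

def two_pass_ground(video, query_vlm, window_s=3.0, eps=1e-3):
    """Coarse 1 fps scan, then a 5 fps re-query in a +/- window_s window,
    gated so that suspect refinements revert to the coarse estimate."""
    # Pass 1: full-video scan at 1 fps -> coarse (t, x, y, c).
    coarse = query_vlm(video, fps=1, span=(0.0, video.duration))

    # Pass 2: 5 fps re-query restricted to [t - window_s, t + window_s].
    lo = max(0.0, coarse.t - window_s)
    hi = min(video.duration, coarse.t + window_s)
    fine = query_vlm(video, fps=5, span=(lo, hi))

    # Gate 1: boundary hedge -- the refined time sits on the window edge
    # (or the model explicitly hedged), suggesting the true event may lie
    # outside the refinement window.
    boundary_hedge = fine.hedged or abs(fine.t - lo) < eps or abs(fine.t - hi) < eps

    # Gate 2: edge-clamped coordinates -- x or y pinned to the frame border,
    # a typical signature of a degenerate spatial answer.
    edge_clamped = min(fine.x, fine.y) < eps or max(fine.x, fine.y) > 1.0 - eps

    return coarse if (boundary_hedge or edge_clamped) else fine
```

Keeping the gates deterministic is the key design choice: a refinement that hedges at the window boundary or clamps to a frame edge is discarded wholesale rather than trusted partially, so the coarse pass always supplies a valid answer.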

Abstract

Grounding traffic accidents in real CCTV footage is a rare-event problem where training on labeled accident video is often prohibited, yet accurate joint localization in time, space, and collision type is required. We present a no-fine-tuning pipeline that elicits this joint output from frozen vision-language models through two ideas. First, a coarse-to-fine two-pass decomposition: a full-video pass at 1 fps produces a coarse (t, x, y, c) tuple, then a second pass at 5 fps within a ±3 s window refines time and location, with two deterministic confidence gates that revert to the coarse estimate on boundary hedges or edge-clamped coordinates. Second, a specialist role assignment: Qwen3-VL-Plus handles grounding, Gemini 3.1 Flash-Lite handles typing on a centered video clip. On the ACCIDENT@CVPR 2026 benchmark (2,027 real CCTV videos) we reach ACC^S = 0.539 (95% CI [0.525, 0.553]): +0.127 over the benchmark paper's best-of-baselines oracle (0.412), +0.143 over the strongest single-VLM baseline (Molmo-7B, 0.396), and +0.250 over the naive baseline (0.289). The VLM path uses up to three API calls per video (17% fall back to physics on API failures); the full run costs ~$20.
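
The reported per-video budget implies a simple orchestration: at most three model calls, with a non-VLM escape hatch on failure. Below is a minimal sketch of that shape; the callables, their signatures, and the exception handling are illustrative assumptions, since the paper does not publish this interface.

```python
# Minimal orchestration sketch for the reported call budget: up to three
# VLM API calls per video, falling back to a physics-based path on failure.
# All function names and signatures here are hypothetical.
def process_video(video, ground_coarse, ground_fine, type_collision, physics_fallback):
    try:
        coarse = ground_coarse(video)                # call 1: full scan at 1 fps
        final = ground_fine(video, around=coarse.t)  # call 2: 5 fps pass in a ±3 s window
        c = type_collision(video, center=final.t)    # call 3: typing on a centered clip
        return (final.t, final.x, final.y, c)
    except Exception:
        # The abstract reports that 17% of videos take this path on API failures.
        return physics_fallback(video)
```

A production version would catch a narrower exception type than `Exception`; the sketch only shows the shape of the fallback, where every video still yields a complete (t, x, y, c) prediction.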