Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models

arXiv cs.CV / 5/6/2026


Key Points

  • The paper introduces VANGUARD, a multimodal LLM/VLM framework that unifies video anomaly classification, spatial grounding, and chain-of-thought reasoning to improve interpretability and localization over prior VAD approaches.
  • It trains with a three-stage curriculum: classifier warmup on a frozen backbone, LoRA-based spatial grounding, then chain-of-thought generation; ablations show this staged optimization beats single-stage (monolithic) training (see the training sketch after this list).
  • To address the sparse labels typical of VAD benchmarks, the authors build a teacher-student annotation pipeline in which Qwen3-VL-4B generates structured per-subclip reasoning trajectories from the manual annotations in the UCA Dataset (an annotation-loop sketch follows below).
  • GroundingDINO supplies bounding-box supervision; on UCF-Crime, VANGUARD reports 94% ROC-AUC and 84% F1, along with spatially grounded anomaly localization and interpretable reasoning.
  • Ablations and zero-shot experiments (XD-Violence, ShanghaiTech) suggest the structured reasoning functions as an implicit regularizer and supports cross-domain generalization without target-domain adaptation.
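
The staged recipe is straightforward to express in code. Below is a minimal PyTorch sketch of the curriculum, assuming a Hugging Face-style VLM and the peft library; the module and loader names (backbone, AnomalyHead, and so on) are hypothetical stand-ins, since the paper's implementation is not reproduced here.

```python
# Minimal sketch of the three-stage curriculum (hypothetical names; not the
# authors' code). Stage 1 trains a classifier head on frozen backbone
# features; Stages 2-3 instead tune lightweight LoRA adapters.
import torch
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class AnomalyHead(nn.Module):
    """Lightweight binary classifier over pooled backbone features."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, 2)

    def forward(self, feats):                  # feats: (B, T, D) frame tokens
        return self.proj(feats.mean(dim=1))    # mean-pool over time

def stage1_warmup(backbone, head, loader, lr=1e-3):
    """Stage 1: the backbone stays frozen; only the head is optimized."""
    for p in backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for clips, labels in loader:
        with torch.no_grad():                  # no gradients through the VLM
            feats = backbone(clips)
        loss = loss_fn(head(feats), labels)
        opt.zero_grad(); loss.backward(); opt.step()

def add_lora(model):
    """Stages 2-3: attach low-rank adapters instead of full fine-tuning."""
    cfg = LoraConfig(r=16, lora_alpha=32,
                     target_modules=["q_proj", "v_proj"],  # a typical choice
                     lora_dropout=0.05)
    return get_peft_model(model, cfg)          # only adapter weights train
```

Under this setup, Stage 2 would supervise the LoRA-adapted model with the GroundingDINO boxes and Stage 3 with the teacher-generated reasoning traces, reusing the same adapter configuration.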

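The annotation pipeline itself amounts to prompting the teacher VLM once per subclip and keeping only well-formed outputs. The sketch below illustrates the idea with a generic generate(prompt, frames) callable standing in for Qwen3-VL-4B inference; the prompt wording and JSON schema are illustrative assumptions, not the paper's exact protocol.

```python
# Illustrative teacher-student annotation loop (prompt and schema are
# assumptions; the paper's exact format is not specified here).
import json

PROMPT = (
    "You are annotating a surveillance subclip. Using the human-written "
    "caption below, return JSON with keys 'observation', 'reasoning', "
    "and 'verdict' (one of 'normal', 'anomalous').\n"
    "Caption: {caption}"
)

def annotate_subclips(subclips, generate):
    """subclips: iterable of (frames, caption) pairs from UCA-style data.
    generate: callable wrapping the teacher VLM (e.g. Qwen3-VL-4B)."""
    trajectories = []
    for frames, caption in subclips:
        raw = generate(PROMPT.format(caption=caption), frames)
        try:
            record = json.loads(raw)           # keep only parseable outputs
        except json.JSONDecodeError:
            continue                           # discard malformed teacher text
        if record.get("verdict") in {"normal", "anomalous"}:
            trajectories.append(record)        # structured student target
    return trajectories
```
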
Abstract

Video Anomaly Detection (VAD) has traditionally been framed as binary classification or outlier detection, providing neither interpretable reasoning nor precise spatial localization of anomalous events. While Vision-Language Models (VLMs) offer rich scene understanding, they struggle with reliable spatial grounding, often producing hallucinated or geometrically invalid bounding boxes when asked to localize objects. We propose VANGUARD (Video Anomaly Understanding through Reasoning and Grounding), a framework that unifies anomaly classification, spatial grounding, and chain-of-thought reasoning within a single VLM. VANGUARD introduces a three-stage curriculum that progressively layers training objectives: (1) classifier warmup on frozen backbone features, (2) LoRA-adapted spatial grounding, and (3) chain-of-thought generation. To overcome the sparse annotation typical of VAD benchmarks, we employ a teacher-student annotation pipeline in which a VLM (Qwen3-VL-4B) generates structured per-subclip reasoning trajectories based on manual annotations available from the UCA Dataset. Further, GroundingDINO provides bounding-box supervision. On UCF-Crime, VANGUARD achieves 94% ROC-AUC with 84% F1 while simultaneously producing interpretable chain-of-thought explanations and spatial grounding of anomalous objects, capabilities absent from prior VAD methods. Ablations confirm that staged training outperforms monolithic optimization, and that structured reasoning acts as an implicit regularizer yielding more balanced predictions than classification-only fine-tuning. Zero-shot transfer to XD-Violence and ShanghaiTech demonstrates cross-domain generalization without target-domain adaptation.
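
For reference, the headline numbers correspond to one threshold-free and one thresholded metric. A minimal computation with scikit-learn, assuming per-clip anomaly scores and binary labels (the abstract does not restate the exact evaluation granularity, e.g. frame- vs. clip-level):

```python
# Standard computation of the reported metrics (toy data for illustration).
from sklearn.metrics import roc_auc_score, f1_score

y_true  = [0, 0, 1, 1, 0, 1]               # ground-truth anomaly labels
y_score = [0.1, 0.4, 0.8, 0.9, 0.2, 0.6]   # model anomaly probabilities

auc = roc_auc_score(y_true, y_score)        # threshold-free ranking quality
y_pred = [int(s >= 0.5) for s in y_score]   # threshold for the F1 metric
f1 = f1_score(y_true, y_pred)               # balance of precision and recall
print(f"ROC-AUC={auc:.2f}  F1={f1:.2f}")
```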