From Frames to Events: Rethinking Evaluation in Human-Centric Video Anomaly Detection

arXiv cs.CV / 4/13/2026

Key Points

The paper argues that traditional frame-level evaluation in pose-based video anomaly detection misrepresents real-world usage, where systems must detect and report coherent anomalous events over time rather than isolated frames.
It audits several popular VAD benchmarks to characterize how anomalies are structured temporally, motivating an event-centric evaluation perspective.
The authors propose two approaches for temporal event localization: a score-refinement pipeline (hierarchical Gaussian smoothing plus adaptive binarization) and an end-to-end dual-branch model that outputs event-level detections.
They introduce an event-based evaluation standard by adapting temporal action localization metrics (tIoU-based matching and multi-threshold F1), and show a large discrepancy between frame-level and event-level performance.
Although state-of-the-art models reach frame-level AUC-ROC above 52% on NWPUC, their event-level localization precision is reported to fall below 10% even at a lenient tIoU threshold of 0.2, and their average event-level F1 across all thresholds is only 0.11. The authors have released their code.
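
To make the event-level metrics in the points above concrete, here is a minimal Python sketch of tIoU-based event matching and multi-threshold F1. It is an illustration under stated assumptions, not the authors' released implementation: the function names, the greedy one-to-one matching rule, the half-open (start, end) frame convention, and the threshold set (0.2, 0.3, 0.4, 0.5) are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

# An event is a half-open frame interval: (start_frame, end_frame_exclusive).
Event = Tuple[int, int]


def frames_to_events(labels: Sequence[int]) -> List[Event]:
    """Group contiguous runs of nonzero frame labels into events."""
    events: List[Event] = []
    start = None
    for i, value in enumerate(labels):
        if value and start is None:
            start = i
        elif not value and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(labels)))
    return events


def tiou(a: Event, b: Event) -> float:
    """Temporal IoU: overlap length divided by the length of the union."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def event_f1(pred: List[Event], gt: List[Event], thr: float) -> float:
    """Event-level F1 at one tIoU threshold.

    ASSUMPTION: greedy matching in the order predictions are given, with
    each ground-truth event claimable by at most one prediction.
    """
    claimed = set()
    tp = 0
    for p in pred:
        best_j, best = None, 0.0
        for j, g in enumerate(gt):
            if j in claimed:
                continue
            score = tiou(p, g)
            if score > best:
                best_j, best = j, score
        if best_j is not None and best >= thr:
            claimed.add(best_j)
            tp += 1
    fp = len(pred) - tp  # predictions that matched no ground-truth event
    fn = len(gt) - tp    # ground-truth events that were never matched
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_event_f1(pred: List[Event], gt: List[Event],
                  thresholds: Sequence[float] = (0.2, 0.3, 0.4, 0.5)) -> float:
    """Average event-level F1 over several tIoU thresholds (assumed set)."""
    return sum(event_f1(pred, gt, t) for t in thresholds) / len(thresholds)
```

As a usage example, take one ground-truth event `(0, 10)` and a fragmented prediction `[(0, 2), (4, 6), (8, 10)]`. The prediction covers 6 of the 10 anomalous frames, yet each fragment has a tIoU of only 0.2 with the true event, so `event_f1(pred, gt, 0.5)` is 0.0. This is the kind of gap between frame-level and event-level scores that the paper describes.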

Abstract

Pose-based Video Anomaly Detection (VAD) has gained significant attention for its privacy-preserving nature and robustness to environmental variations. However, traditional frame-level evaluations treat video as a collection of isolated frames, which is fundamentally misaligned with how anomalies manifest and are acted upon in the real world. In operational surveillance systems, what matters is not the flagging of individual frames but the reliable detection, localization, and reporting of a coherent anomalous event: a contiguous temporal episode with an identifiable onset and duration. Frame-level metrics are blind to this distinction and, as a result, systematically overestimate model performance for any deployment that requires actionable, event-level alerts. In this work, we propose a shift toward an event-centric perspective in VAD. We first audit widely used VAD benchmarks, including SHT [19], CHAD [6], NWPUC [4], and HuVAD [25], to characterize their event structure. We then introduce two strategies for temporal event localization: a score-refinement pipeline with hierarchical Gaussian smoothing and adaptive binarization, and an end-to-end Dual-Branch Model that directly generates event-level detections. Finally, we establish the first event-based evaluation standard for VAD by adapting Temporal Action Localization metrics, including tIoU-based event matching and multi-threshold F1 evaluation. Our results quantify a substantial performance gap: while all SoTA models achieve frame-level AUC-ROC exceeding 52% on NWPUC [4], their event-level localization precision falls below 10% even at a minimal tIoU of 0.2, with an average event-level F1 of only 0.11 across all thresholds. The codebase for this work is available at https://github.com/TeCSAR-UNCC/EventCentric-VAD.
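
The score-refinement idea in the abstract (smooth the per-frame anomaly scores, then binarize them into events) can be sketched roughly as follows. The abstract does not specify the details, so everything concrete here is an assumption made for illustration: "hierarchical" is read as successive Gaussian passes from fine to coarse, "adaptive" is read as a per-video threshold of mean plus `k` standard deviations, and all function names and default parameters are hypothetical. The released code should be consulted for the actual method.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def gaussian_smooth(scores: Sequence[float], sigma: float) -> List[float]:
    """Smooth a 1D score sequence with a Gaussian kernel truncated at 3 sigma.

    Near the sequence boundaries the kernel is renormalized over the
    in-range frames, so no padding values are invented.
    """
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    n = len(scores)
    out = []
    for i in range(n):
        acc = norm = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < n:
                acc += w * scores[j]
                norm += w
        out.append(acc / norm)
    return out


def refine_scores(scores: Sequence[float],
                  sigmas: Sequence[float] = (1.0, 3.0)) -> List[float]:
    """ASSUMPTION: 'hierarchical' means successive fine-to-coarse passes."""
    out = list(scores)
    for sigma in sigmas:
        out = gaussian_smooth(out, sigma)
    return out


def adaptive_binarize(scores: Sequence[float], k: float = 0.5) -> List[int]:
    """ASSUMPTION: per-video threshold = mean + k * std of the scores."""
    threshold = mean(scores) + k * pstdev(scores)
    return [1 if s > threshold else 0 for s in scores]


def mask_to_events(mask: Sequence[int]) -> List[Tuple[int, int]]:
    """Convert a binary mask into half-open (start, end) frame intervals."""
    events, start = [], None
    for i, value in enumerate(mask):
        if value and start is None:
            start = i
        elif not value and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(mask)))
    return events
```

For a flickering score sequence such as `[0.1] * 20 + [0.9, 0.1] * 5 + [0.1] * 20`, binarizing the raw scores yields five isolated single-frame detections, while binarizing the output of `refine_scores` yields one contiguous event around the same region. Merging flicker into a single reportable episode is the motivation for refining scores before extracting events.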