ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

arXiv cs.CV / 4/10/2026


Key Points

  • The paper proposes ESOM, an efficient, training-free model for open-world video anomaly detection that supports dynamic anomaly definitions and streaming video settings for real-time use cases like surveillance and live moderation.
  • ESOM reduces hallucinations via a Definition Normalization module, compresses redundant visual tokens using an Inter-frame-matched Intra-frame Token Merging approach, and performs efficient causal inference with a Hybrid Streaming Memory module.
  • It converts interval-level textual outputs into frame-level anomaly scores using a Probabilistic Scoring module to improve temporal localization and evaluation alignment.
  • The work introduces OpenDef-Bench, a new benchmark featuring clean surveillance videos and diverse natural anomaly definitions to test robustness under varying conditions.
  • Experiments report single-GPU real-time efficiency and state-of-the-art results across anomaly temporal localization, classification, and description generation, with code and the benchmark planned for release.
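The Inter-frame-matched Intra-frame Token Merging idea can be illustrated with a minimal sketch. This is not the paper's actual algorithm (which is not detailed in this summary); it only shows the general principle of dropping visual tokens in the current frame that closely match tokens from the previous frame, so that mostly-static background content is not re-processed. The function name, cosine-similarity matching, and threshold are all assumptions for illustration.

```python
import numpy as np

def merge_redundant_tokens(prev_tokens, cur_tokens, threshold=0.9):
    """Illustrative stand-in for inter-frame token matching: drop tokens
    in the current frame whose best cosine match in the previous frame
    exceeds `threshold` (hypothetical rule, not the paper's exact one)."""
    # Normalize rows so that a dot product equals cosine similarity.
    p = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    c = cur_tokens / np.linalg.norm(cur_tokens, axis=1, keepdims=True)
    sim = c @ p.T            # (num_cur, num_prev) cosine similarities
    best = sim.max(axis=1)   # best match for each current-frame token
    keep = best < threshold  # keep only sufficiently novel tokens
    return cur_tokens[keep], keep
```

Under this toy scheme, a token identical to one in the previous frame is discarded, while genuinely new tokens survive, shrinking the sequence fed to the language model.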

Abstract

Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent methods based on multimodal large language models (MLLMs) have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
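The interval-to-frame conversion performed by the Probabilistic Scoring module can be sketched as follows. The paper's actual scoring rule is not given in this summary; this hypothetical version simply assigns each frame the maximum confidence of any predicted interval covering it, which is one plausible way to turn textual interval outputs (e.g. "anomaly from 1.0 s to 3.0 s, confidence 0.8") into per-frame scores for temporal-localization evaluation.

```python
def interval_to_frame_scores(intervals, num_frames, fps=1.0):
    """Convert interval-level predictions [(start_s, end_s, confidence), ...]
    into per-frame anomaly scores. Hypothetical scheme: each frame takes
    the maximum confidence among intervals that cover its timestamp."""
    scores = [0.0] * num_frames
    for start, end, conf in intervals:
        for f in range(num_frames):
            t = f / fps  # timestamp of frame f in seconds
            if start <= t <= end:
                scores[f] = max(scores[f], conf)
    return scores
```

For example, a single predicted interval (1.0, 3.0, 0.8) over a 6-frame, 1 fps clip yields nonzero scores only for the frames at 1, 2, and 3 seconds, which is the frame-level alignment that standard anomaly-detection metrics expect.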