ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions
arXiv cs.CV / 4/10/2026
Key Points
- The paper proposes ESOM, an efficient, training-free model for open-world video anomaly detection that supports dynamic anomaly definitions and streaming video settings for real-time use cases like surveillance and live moderation.
- ESOM reduces hallucinations via a Definition Normalization module, compresses redundant visual tokens using an Inter-frame-matched Intra-frame Token Merging approach, and performs efficient causal inference with a Hybrid Streaming Memory module.
- It converts interval-level textual outputs into frame-level anomaly scores using a Probabilistic Scoring module to improve temporal localization and evaluation alignment.
- The work introduces OpenDef-Bench, a new benchmark featuring clean surveillance videos and diverse natural anomaly definitions to test robustness under varying conditions.
- The authors report real-time efficiency on a single GPU and state-of-the-art results on anomaly temporal localization, classification, and description generation; they plan to release the code and the benchmark.
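
To make the interval-to-frame conversion in the third point concrete, here is a minimal sketch of one way to turn interval-level outputs into frame-level scores. It is not the paper's Probabilistic Scoring module, whose details this summary does not give. The function name `intervals_to_frame_scores`, the `(start_sec, end_sec, confidence)` input format, and the rule that overlapping intervals keep the highest confidence are all assumptions made for illustration.

```python
from typing import List, Tuple


def intervals_to_frame_scores(
    intervals: List[Tuple[float, float, float]],
    num_frames: int,
    fps: float,
) -> List[float]:
    """Map (start_sec, end_sec, confidence) intervals to per-frame scores.

    Hypothetical helper: the input format and the max-overlap rule are
    illustrative assumptions, not the method described in the paper.
    """
    scores = [0.0] * num_frames
    for start_sec, end_sec, confidence in intervals:
        # Convert timestamps to frame indices and clamp to the video length.
        first = max(0, int(start_sec * fps))
        last = min(num_frames - 1, int(end_sec * fps))
        for frame in range(first, last + 1):
            # Where intervals overlap, keep the highest confidence.
            scores[frame] = max(scores[frame], confidence)
    return scores


# Two overlapping intervals on a 10-frame clip sampled at 2 frames per second.
scores = intervals_to_frame_scores(
    [(1.0, 2.0, 0.8), (1.5, 3.0, 0.6)], num_frames=10, fps=2.0
)
print(scores)
```

Frame-level scores of this kind are what temporal localization metrics typically compare against per-frame ground-truth labels, which is why a conversion step is needed when a model emits text describing time intervals.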



