ESOM: Efficiently Understanding Streaming Video Anomalies with Open-world Dynamic Definitions

arXiv cs.CV / 4/10/2026


Key Points

  • The paper proposes ESOM, an efficient, training-free model for open-world video anomaly detection that supports dynamic anomaly definitions and streaming video settings for real-time use cases like surveillance and live moderation.
  • ESOM reduces hallucinations via a Definition Normalization module, compresses redundant visual tokens using an Inter-frame-matched Intra-frame Token Merging approach, and performs efficient causal inference with a Hybrid Streaming Memory module.
  • It converts interval-level textual outputs into frame-level anomaly scores using a Probabilistic Scoring module to improve temporal localization and evaluation alignment.
  • The work introduces OpenDef-Bench, a new benchmark featuring clean surveillance videos and diverse natural anomaly definitions to test robustness under varying conditions.
  • Experiments report single-GPU real-time efficiency and state-of-the-art results across anomaly temporal localization, classification, and description generation, with code and the benchmark planned for release.
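The Inter-frame-matched Intra-frame Token Merging idea can be illustrated with a minimal sketch. This is not the paper's actual algorithm (which is not detailed in this summary); it only shows the general principle of dropping visual tokens in the current frame that closely match tokens from the previous frame, so that mostly-static background content is not re-processed. The function name, cosine-similarity matching, and threshold are all assumptions for illustration.

```python
import numpy as np

def merge_redundant_tokens(prev_tokens, cur_tokens, threshold=0.9):
    """Illustrative stand-in for inter-frame token matching: drop tokens
    in the current frame whose best cosine match in the previous frame
    exceeds `threshold` (hypothetical rule, not the paper's exact one)."""
    # Normalize rows so that a dot product equals cosine similarity.
    p = prev_tokens / np.linalg.norm(prev_tokens, axis=1, keepdims=True)
    c = cur_tokens / np.linalg.norm(cur_tokens, axis=1, keepdims=True)
    sim = c @ p.T            # (num_cur, num_prev) cosine similarities
    best = sim.max(axis=1)   # best match for each current-frame token
    keep = best < threshold  # keep only sufficiently novel tokens
    return cur_tokens[keep], keep
```

Under this toy scheme, a token identical to one in the previous frame is discarded, while genuinely new tokens survive, shrinking the sequence fed to the language model.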

Abstract

Open-world video anomaly detection (OWVAD) aims to detect and explain abnormal events under different anomaly definitions, which is important for applications such as intelligent surveillance and live-streaming content moderation. Recent methods based on multimodal large language models (MLLMs) have shown promising open-world generalization, but still suffer from three major limitations: inefficiency for practical deployment, lack of streaming processing adaptation, and limited support for dynamic anomaly definitions in both modeling and evaluation. To address these issues, this paper proposes ESOM, an efficient streaming OWVAD model that operates in a training-free manner. ESOM includes a Definition Normalization module to structure user prompts for reducing hallucination, an Inter-frame-matched Intra-frame Token Merging module to compress redundant visual tokens, a Hybrid Streaming Memory module for efficient causal inference, and a Probabilistic Scoring module that converts interval-level textual outputs into frame-level anomaly scores. In addition, this paper introduces OpenDef-Bench, a new benchmark with clean surveillance videos and diverse natural anomaly definitions for evaluating performance under varying conditions. Extensive experiments show that ESOM achieves real-time efficiency on a single GPU and state-of-the-art performance in anomaly temporal localization, classification, and description generation. The code and benchmark will be released at https://github.com/Kamino666/ESOM_OpenDef-Bench.
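The interval-to-frame conversion performed by the Probabilistic Scoring module can be sketched as follows. The paper's actual scoring rule is not given in this summary; this hypothetical version simply assigns each frame the maximum confidence of any predicted interval covering it, which is one plausible way to turn textual interval outputs (e.g. "anomaly from 1.0 s to 3.0 s, confidence 0.8") into per-frame scores for temporal-localization evaluation.

```python
def interval_to_frame_scores(intervals, num_frames, fps=1.0):
    """Convert interval-level predictions [(start_s, end_s, confidence), ...]
    into per-frame anomaly scores. Hypothetical scheme: each frame takes
    the maximum confidence among intervals that cover its timestamp."""
    scores = [0.0] * num_frames
    for start, end, conf in intervals:
        for f in range(num_frames):
            t = f / fps  # timestamp of frame f in seconds
            if start <= t <= end:
                scores[f] = max(scores[f], conf)
    return scores
```

For example, a single predicted interval (1.0, 3.0, 0.8) over a 6-frame, 1 fps clip yields nonzero scores only for the frames at 1, 2, and 3 seconds, which is the frame-level alignment that standard anomaly-detection metrics expect.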