Motion Semantics Guided Normalizing Flow for Privacy-Preserving Video Anomaly Detection

arXiv cs.CV / 3/31/2026

Key Points

The paper addresses privacy-preserving video anomaly detection in embodied perception settings by using skeleton/pose representations that omit sensitive identity and facial information.
It argues that prior skeleton-based methods model motion trajectories monolithically, missing the hierarchical structure of human activities that combine discrete semantic primitives and fine-grained kinematics.
It proposes Motion Semantics Guided Normalizing Flow (MSG-Flow), which hierarchically models motion by discretizing pose motion into interpretable primitives with a vector-quantized VAE, then modeling semantic-level temporal dependencies with an autoregressive Transformer.
To retain and model detailed pose variations, MSG-Flow further uses a conditional normalizing flow for fine-grained kinematic modeling.
Experiments on HR-ShanghaiTech and HR-UBnormal report state-of-the-art results with AUC scores of 88.1% and 75.8%, respectively, supporting the effectiveness of hierarchical motion semantics for anomaly detection.
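
The discretization step described in the key points can be illustrated with the nearest-neighbour codebook lookup that VQ-VAE-style models generally use. The following is a minimal sketch, not the paper's implementation: the codebook, the feature vectors, and the function names `squared_distance` and `quantize` are all hypothetical, and a real model would learn the codebook jointly with an encoder and decoder.

```python
# Illustrative only: nearest-neighbour codebook lookup, the quantization step
# typical of VQ-VAE-style models. All values below are made up.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest code."""
    indices = []
    for f in features:
        distances = [squared_distance(f, c) for c in codebook]
        indices.append(distances.index(min(distances)))
    return indices

# Hypothetical 2-D codebook holding three "motion primitives".
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical encoder outputs for four short pose-motion windows.
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.6, 0.1]]

tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2, 1]
```

The resulting index sequence is the kind of discrete token stream that an autoregressive model could then consume to capture semantic-level temporal dependencies.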

Abstract

As embodied perception systems increasingly bridge digital and physical realms in interactive multimedia applications, the need for privacy-preserving approaches to understanding human activities in physical environments has become paramount. Video anomaly detection (VAD) is a critical task in such embodied multimedia systems for intelligent surveillance and forensic analysis. Skeleton-based approaches have emerged as a privacy-preserving alternative that processes physical-world information through abstract human pose representations while discarding sensitive visual attributes such as identity and facial features. However, existing skeleton-based methods predominantly model continuous motion trajectories in a monolithic manner. They fail to capture the hierarchical nature of human activities, which are composed of discrete semantic primitives and fine-grained kinematic details, and this reduces discriminability when anomalies manifest at different abstraction levels. To address this, we propose Motion Semantics Guided Normalizing Flow (MSG-Flow), which decomposes skeleton-based VAD into hierarchical motion semantics modeling. It employs a vector-quantized variational autoencoder (VQ-VAE) to discretize continuous motion into interpretable primitives, an autoregressive Transformer to model semantic-level temporal dependencies, and a conditional normalizing flow to capture detail-level pose variations. Extensive experiments on the HR-ShanghaiTech and HR-UBnormal benchmarks demonstrate that MSG-Flow achieves state-of-the-art performance, with AUC scores of 88.1% and 75.8%, respectively.
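
To make the two abstraction levels concrete, the sketch below computes a semantic-level negative log-likelihood (NLL) from token probabilities and a detail-level NLL from a single conditional affine flow layer using the change-of-variables formula. It is an illustration under stated assumptions and not the paper's method: the abstract does not specify the flow architecture or how the two levels are combined into an anomaly score, so the affine layer, the weighted sum in `anomaly_score`, and every name and number here are hypothetical.

```python
import math

def token_nll(probabilities):
    """Semantic-level NLL: probabilities are those an autoregressive model
    assigned to each observed primitive token (hypothetical inputs)."""
    return -sum(math.log(p) for p in probabilities)

def conditional_affine_flow_nll(x, shift, log_scale):
    """Detail-level NLL of pose vector x under one conditional affine layer.

    The layer maps x to z = (x - shift) * exp(-log_scale) with z ~ N(0, I).
    In a conditional flow, shift and log_scale would be predicted from the
    conditioning context; here they are passed in directly.
    """
    nll = 0.0
    for xi, mu, s in zip(x, shift, log_scale):
        z = (xi - mu) * math.exp(-s)
        # Change of variables: log p(x) = log N(z; 0, 1) + log|dz/dx|,
        # where log|dz/dx| = -s for this affine map.
        log_px = -0.5 * (z * z + math.log(2.0 * math.pi)) - s
        nll -= log_px
    return nll

def anomaly_score(semantic_nll, detail_nll, weight=0.5):
    """Assumed combination rule: a simple weighted sum of the two NLLs."""
    return weight * semantic_nll + (1.0 - weight) * detail_nll

# Hypothetical context-predicted parameters for a 2-D pose feature.
shift, log_scale = [0.0, 0.0], [0.0, 0.0]

# A likely token sequence with a pose close to what the context predicts.
normal = anomaly_score(
    token_nll([0.9, 0.8, 0.9]),
    conditional_affine_flow_nll([0.1, -0.1], shift, log_scale),
)
# An unlikely token sequence with a pose far from what the context predicts.
abnormal = anomaly_score(
    token_nll([0.9, 0.05, 0.1]),
    conditional_affine_flow_nll([3.0, -2.5], shift, log_scale),
)
print(normal < abnormal)  # True
```

Under this assumed scoring rule, an anomaly can raise the score through either term, which mirrors the abstract's point that anomalies may manifest at different abstraction levels.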