Interpretable Human Activity Recognition for Subtle Robbery Detection in Surveillance Videos

arXiv cs.CV / 4/17/2026

Key Points

  • The paper addresses the challenge of automatically detecting brief, subtle non-violent snatch-and-run robberies that are often visually similar to normal interactions in uncontrolled surveillance videos.
  • It proposes a hybrid, pose-driven pipeline that uses a YOLO-based pose estimator to extract body keypoints and then computes interpretable kinematic and interaction features (e.g., hand speed, arm extension, proximity, relative motion) for an aggressor–victim pair.
  • A Random Forest classifier is trained on these pose-derived descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms.
  • Experiments on both a staged dataset and a disjoint internet-video test set show promising generalization across scenes and camera viewpoints.
  • The authors deploy the full system on an NVIDIA Jetson Nano and report real-time performance, indicating on-device feasibility for proactive robbery detection.
  • Because decisions rest on named, pose-derived features rather than raw pixels, the system's outputs are intended to be easier to explain than those of purely black-box video classifiers, which supports practical surveillance use.
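
To make the pose-derived descriptors in the key points concrete, here is a minimal Python sketch of how hand speed, arm extension, proximity, and relative motion might be computed from 2D keypoints. The summary does not give the paper's exact feature definitions, so everything here is an assumption for illustration: the COCO-17 keypoint indexing (common for YOLO pose models), the use of the right arm only, the hip midpoint as body center, and all function names.

```python
import math

# Assumption: COCO-17 keypoint order, as commonly produced by YOLO pose models.
R_SHOULDER, R_ELBOW, R_WRIST = 6, 8, 10
L_HIP, R_HIP = 11, 12


def dist(a, b):
    """Euclidean distance between two (x, y) points, in pixels."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def center(kps):
    """Midpoint of the two hips, used here as a rough body center."""
    (lx, ly), (rx, ry) = kps[L_HIP], kps[R_HIP]
    return ((lx + rx) / 2.0, (ly + ry) / 2.0)


def hand_speed(prev_kps, curr_kps, fps):
    """Right-wrist displacement between consecutive frames, in pixels/second."""
    return dist(prev_kps[R_WRIST], curr_kps[R_WRIST]) * fps


def arm_extension(kps):
    """Shoulder-to-wrist reach over total arm length; 1.0 means fully straight."""
    upper = dist(kps[R_SHOULDER], kps[R_ELBOW])
    fore = dist(kps[R_ELBOW], kps[R_WRIST])
    total = upper + fore
    return dist(kps[R_SHOULDER], kps[R_WRIST]) / total if total > 0 else 0.0


def proximity(aggressor_kps, victim_kps):
    """Distance between the two body centers, in pixels."""
    return dist(center(aggressor_kps), center(victim_kps))


def relative_motion(a_prev, v_prev, a_curr, v_curr, fps):
    """Rate of change of the pair distance; negative means the pair is closing."""
    return (proximity(a_curr, v_curr) - proximity(a_prev, v_prev)) * fps
```

Each function returns a single named number per frame or frame pair, which is what lets a tree-based classifier's decisions be traced back to quantities such as "hand speed" or "proximity". In practice the pixel distances would likely need normalizing (for example by torso length) to cope with different camera distances; that step is omitted here.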

Abstract

Non-violent street robberies (snatch-and-run) are difficult to detect automatically because they are brief, subtle, and often indistinguishable from benign human interactions in unconstrained surveillance footage. This paper presents a hybrid, pose-driven approach for detecting snatch-and-run events that combines real-time perception with an interpretable classification stage suitable for edge deployment. The system uses a YOLO-based pose estimator to extract body keypoints for each tracked person and computes kinematic and interaction features describing hand speed, arm extension, proximity, and relative motion between an aggressor-victim pair. A Random Forest classifier is trained on these descriptors, and a temporal hysteresis filter is applied to stabilize frame-level predictions and reduce spurious alarms. We evaluate the method on a staged dataset and on a disjoint test set collected from internet videos, demonstrating promising generalization across different scenes and camera viewpoints. Finally, we implement the complete pipeline on an NVIDIA Jetson Nano and report real-time performance, supporting the feasibility of proactive, on-device robbery detection.
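
The temporal hysteresis step can be illustrated with a short Python sketch. The abstract does not specify the filter's exact rule, so this assumes a common dual-threshold form: the alarm switches on only when the per-frame robbery probability rises to a high threshold and switches off only when it falls to a lower one. The function name, the threshold values, and the input probabilities are hypothetical.

```python
def hysteresis_filter(probs, on_threshold=0.7, off_threshold=0.4):
    """Turn per-frame probabilities into a stable on/off alarm signal.

    Assumed rule (not taken from the paper): the alarm turns on when a
    probability reaches on_threshold and stays on until a probability
    drops to off_threshold or below.
    """
    if off_threshold > on_threshold:
        raise ValueError("off_threshold must not exceed on_threshold")

    active = False
    alarms = []
    for p in probs:
        if not active and p >= on_threshold:
            active = True       # strong evidence: raise the alarm
        elif active and p <= off_threshold:
            active = False      # evidence has clearly faded: clear the alarm
        alarms.append(active)
    return alarms


# Hypothetical per-frame classifier outputs.
frame_probs = [0.2, 0.75, 0.5, 0.6, 0.3, 0.5, 0.8]
print(hysteresis_filter(frame_probs))
# [False, True, True, True, False, False, True]
```

With a single threshold of 0.55, the same sequence would flip state at nearly every frame; the gap between the two thresholds is what keeps the alarm from chattering when the classifier's output hovers near the decision boundary.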