From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety

arXiv cs.CV / 4/1/2026


Key Points

  • The paper addresses the challenge of deploying real-time, privacy-aware action detection for public safety in latency- and resource-constrained edge settings.
  • It proposes a hybrid architecture that combines skeleton-based motion analysis (low overhead, continuous monitoring) with vision-language models for semantic understanding and zero-shot reasoning.
  • Rather than introducing a new recognition model, the work focuses on system-level comparison of motion-centric versus semantic paradigms under realistic edge constraints.
  • A demonstrator implementation on a GPU-enabled edge device evaluates latency, resource usage, and operational trade-offs to quantify the practical feasibility of the approach.
  • The results highlight the complementary strengths of the two paradigms and motivate a hybrid design that selectively augments fast skeleton-based detection with higher-level semantic reasoning for complex or previously unseen situations.
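
The selective augmentation described in the key points can be illustrated with a minimal sketch. It is not the paper's implementation: the motion score, the threshold value, and the names `motion_energy`, `hybrid_step`, and `describe_scene` are all hypothetical stand-ins chosen for illustration. The idea shown is only the control flow, in which a cheap skeleton-based score runs on every window and the expensive semantic model is called only when that score crosses a threshold.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# A pose is a flat list of joint coordinates. This layout is an assumption
# made for the sketch, not a format taken from the paper.
Pose = Sequence[float]


@dataclass
class Decision:
    motion_score: float          # score from the fast skeleton stage
    escalated: bool              # True if the semantic stage was invoked
    label: Optional[str] = None  # semantic description, set only on escalation


def motion_energy(window: Sequence[Pose]) -> float:
    """Hypothetical motion score: mean absolute joint displacement
    between consecutive poses in the window."""
    total = 0.0
    count = 0
    for prev, cur in zip(window, window[1:]):
        for a, b in zip(prev, cur):
            total += abs(b - a)
            count += 1
    return total / count if count else 0.0


def hybrid_step(
    window: Sequence[Pose],
    describe_scene: Callable[[], str],
    threshold: float = 0.5,  # illustrative value, would need tuning
) -> Decision:
    """Run the cheap motion stage always; call the semantic stage lazily."""
    score = motion_energy(window)
    if score < threshold:
        # Fast path: no semantic model call, so no extra latency or GPU load.
        return Decision(motion_score=score, escalated=False)
    # Slow path: describe_scene stands in for a vision-language model query.
    return Decision(motion_score=score, escalated=True, label=describe_scene())
```

Passing `describe_scene` as a zero-argument callable keeps the expensive stage lazy: it captures whatever frames it needs and is never evaluated on the fast path. In a real deployment the score would come from a trained skeleton-based classifier rather than raw displacement.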

Abstract

Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.
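
The abstract states that both paradigms are evaluated with respect to latency, but this summary does not give the measurement procedure. As a rough sketch of how per-stage latency could be compared, assuming each stage is wrapped as a plain Python callable, the helper below times repeated calls and reports summary statistics. The function name `measure_latency`, the run counts, and the choice of statistics are assumptions for illustration only.

```python
import statistics
import time
from typing import Any, Callable, Dict


def measure_latency(
    stage: Callable[[Any], Any],
    payload: Any,
    runs: int = 50,    # illustrative default
    warmup: int = 5,   # untimed calls to let caches and lazy init settle
) -> Dict[str, float]:
    """Time repeated calls to one pipeline stage, in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        stage(payload)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        stage(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index for the 95th percentile.
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "runs": float(len(samples)),
    }
```

Calling `measure_latency(skeleton_stage, frame)` and `measure_latency(semantic_stage, frame)` with the same input would give comparable figures, where both stage names are placeholders. For GPU-backed stages, the callable should block until the result is actually available, since asynchronous execution would otherwise make the timings too optimistic.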