Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

arXiv cs.CV / 4/13/2026

Key Points

  • The paper addresses temporal action detection in untrimmed videos, highlighting how existing CNN/Transformer approaches struggle with feature redundancy and weakened global dependency modeling over long sequences.
  • It proposes a new framework that applies State Space Models (SSMs), which offer linear-complexity long-term temporal modeling and stronger global temporal reasoning, to video action detection.
  • The core contribution is the Efficient Spatial-Temporal Focal (ESTF) Adapter inserted into pre-trained layers, combining an improved Temporal Boundary-aware SSM (TB-SSM) for temporal modeling with efficient spatial feature processing.
  • Experiments across multiple benchmarks show significant gains in both action localization performance and robustness compared with prior SSM-based and other structural methods.
  • The work includes comprehensive quantitative and comparative analyses to validate that the new integration strategy improves scalability for real-world long-video understanding.
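
The linear-time property attributed to SSMs in the key points can be illustrated with a generic discrete state space recurrence. The sketch below is a minimal example under stated assumptions, not the paper's TB-SSM: the function name `diagonal_ssm_scan` and the parameters `a`, `b`, `c` are hypothetical, the state matrix is assumed diagonal, and the input is a single scalar channel. It only shows that one pass over a sequence of length T with state size N costs O(T * N).

```python
from typing import List, Sequence


def diagonal_ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a discrete diagonal state space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)

    All names here are illustrative assumptions, not the paper's notation.
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must have the same length")
    h = [0.0] * len(a)  # hidden state carried across time steps
    y: List[float] = []
    for x_t in x:  # single pass over the sequence: O(T * N) work
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


if __name__ == "__main__":
    # An impulse at t = 0 decays geometrically through a one-dimensional state.
    print(diagonal_ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
    # → [1.0, 0.5, 0.25, 0.125]
```

Because each output depends on the past only through the fixed-size state `h`, the cost per frame stays constant as the video grows, in contrast to self-attention, whose cost per frame grows with sequence length.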

Abstract

Temporal human action detection aims to identify and localize action segments within untrimmed videos and is a pivotal task in video understanding. Despite the progress achieved by prior architectures such as CNNs and Transformers, these models continue to struggle with feature redundancy and degraded global dependency modeling when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative, with linear-complexity long-term modeling and robust global temporal reasoning. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module combines our proposed Temporal Boundary-aware SSM (TB-SSM) for temporal feature modeling with efficient processing of spatial features. We perform comprehensive quantitative analyses across multiple benchmarks, comparing our proposed method against previous SSM-based and other structural methods. Extensive experiments demonstrate that our improved strategy significantly enhances both localization performance and robustness, validating the effectiveness of our proposed method.
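
Localization quality in temporal action detection is commonly scored by the temporal Intersection over Union (tIoU) between a predicted segment and a ground-truth segment. The abstract does not state which metric the paper reports, so the following is only a minimal sketch of that standard measure; the function name `temporal_iou` and the `(start, end)` tuple layout in seconds are assumptions made for illustration.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; layout is an assumption


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Return the temporal IoU of two 1-D segments, in the range [0, 1]."""
    start_p, end_p = pred
    start_g, end_g = gt
    # Overlap length is zero when the segments are disjoint.
    inter = max(0.0, min(end_p, end_g) - max(start_p, start_g))
    union = (end_p - start_p) + (end_g - start_g) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Prediction 2-6 s against ground truth 4-8 s: 2 s overlap over a 6 s union.
    print(round(temporal_iou((2.0, 6.0), (4.0, 8.0)), 3))  # → 0.333
```

A detection is typically counted as correct when its tIoU with a ground-truth segment of the same class exceeds a chosen threshold, which is why precise boundary placement matters for this task.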