Motion-Adaptive Multi-Scale Temporal Modelling with Skeleton-Constrained Spatial Graphs for Efficient 3D Human Pose Estimation

arXiv cs.CV / 4/7/2026

Key Points

  • The paper introduces MASC-Pose, an efficient framework for 3D human pose estimation from monocular videos that targets the efficiency and adaptability limits of existing spatial and temporal dependency modelling.
  • It uses an Adaptive Multi-scale Temporal Modelling (AMTM) module to capture different motion dynamics across temporal scales in a motion-adaptive way.
  • For spatial reasoning, it proposes a Skeleton-constrained Adaptive GCN (SAGCN) that models joint-specific interactions while leveraging skeletal structure constraints.
  • Experiments on Human3.6M and MPI-INF-3DHP are reported to demonstrate the approach's effectiveness: strong accuracy at high computational efficiency, in contrast to methods built on dense attention or fixed modelling schemes.
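
To make the idea of motion-adaptive multi-scale temporal modelling concrete, here is a minimal sketch in plain Python. It is not the paper's AMTM module, whose internals are not described in this summary. It assumes a single 1D joint coordinate over time, uses moving averages as stand-ins for learned temporal branches, and picks per-frame branch weights from a hand-written rule (fast motion favours short windows). The function names, the window sizes and the `sharpness` parameter are all hypothetical.

```python
import math

def moving_average(seq, window):
    """Centred moving average; the window is clamped at the sequence edges."""
    half = window // 2
    n = len(seq)
    out = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def motion_adaptive_fuse(seq, windows=(1, 3, 9), sharpness=1.0):
    """Blend multi-scale smoothed copies of seq with per-frame weights.

    Assumption (not from the paper): the local speed penalises long
    windows, so fast motion shifts weight toward short temporal scales.
    Returns the fused sequence and the per-frame scale weights.
    """
    branches = [moving_average(seq, w) for w in windows]
    fused, all_weights = [], []
    for t in range(len(seq)):
        # Local motion cue: absolute difference to the previous frame.
        speed = abs(seq[t] - seq[t - 1]) if t > 0 else 0.0
        scores = [-sharpness * speed * w for w in windows]
        weights = softmax(scores)
        fused.append(sum(w * b[t] for w, b in zip(weights, branches)))
        all_weights.append(weights)
    return fused, all_weights

# Toy trajectory: static, then an abrupt jump, then static again.
trajectory = [0.0, 0.0, 10.0, 10.0, 10.0]
fused, scale_weights = motion_adaptive_fuse(trajectory)
```

In this toy rule, static frames receive equal weight across all scales, while the frame at the jump puts almost all of its weight on the shortest window. A learned module would replace both the averaging branches and the weighting rule with trainable components.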

Abstract

Accurate 3D human pose estimation from monocular videos requires effective modelling of complex spatial and temporal dependencies. However, existing methods often face challenges in efficiency and adaptability when modelling spatial and temporal dependencies, particularly under dense attention or fixed modelling schemes. In this work, we propose MASC-Pose, a Motion-Adaptive multi-scale temporal modelling framework with Skeleton-Constrained spatial graphs for efficient 3D human pose estimation. Specifically, it introduces an Adaptive Multi-scale Temporal Modelling (AMTM) module to adaptively capture heterogeneous motion dynamics at different temporal scales, together with a Skeleton-constrained Adaptive GCN (SAGCN) for joint-specific spatial interaction modelling. By jointly enabling adaptive temporal reasoning and efficient spatial aggregation, our method achieves strong accuracy with high computational efficiency. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets demonstrate the effectiveness of our approach.
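
As a rough illustration of skeleton-constrained spatial aggregation, the sketch below restricts each joint to its skeletal neighbours and then normalises per-joint interaction scores over only those neighbours. This is an assumption-laden toy, not the paper's SAGCN: the 4-joint chain skeleton, the function names and the zero-valued scores (which a real model would learn per joint) are all hypothetical.

```python
import math

def skeleton_mask(num_joints, bones):
    """Boolean adjacency with self-loops, built from a list of bone pairs."""
    mask = [[i == j for j in range(num_joints)] for i in range(num_joints)]
    for a, b in bones:
        mask[a][b] = mask[b][a] = True
    return mask

def masked_row_softmax(scores, mask):
    """Softmax over each row, restricted to entries allowed by the mask.

    Joints that are not connected by a bone get exactly zero weight.
    Self-loops guarantee every row has at least one allowed entry.
    """
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        peak = max(s for s, m in zip(row_scores, row_mask) if m)
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def aggregate(features, weights):
    """Weighted sum of neighbour features for every joint."""
    dim = len(features[0])
    return [[sum(w * f[d] for w, f in zip(row, features)) for d in range(dim)]
            for row in weights]

# Hypothetical 4-joint chain: 0 - 1 - 2 - 3.
bones = [(0, 1), (1, 2), (2, 3)]
mask = skeleton_mask(4, bones)
# Stand-in for learned joint-specific scores; zeros give uniform weights.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_row_softmax(scores, mask)
features = [[0.0], [1.0], [2.0], [3.0]]
updated = aggregate(features, weights)
```

Because the mask has at most a few neighbours per joint, the cost of aggregation grows with the number of bones rather than with the square of the number of joints, which is the general efficiency argument for graph-based aggregation over dense attention.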