
Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

arXiv cs.CV / 3/12/2026


Key Points

  • SLiM is proposed as a decoder-free masked modeling framework for skeleton-based action representation learning that unifies masked modeling and contrastive learning via a shared encoder.
  • By removing the reconstruction decoder, SLiM reduces computational redundancy and forces the encoder to learn discriminative features directly.
  • Semantic tube masking and skeletal-aware augmentations are introduced to prevent trivial reconstructions due to high skeletal-temporal correlation and to maintain anatomical consistency across temporal scales.
  • Experiments show state-of-the-art performance across all downstream protocols, with substantially higher efficiency: inference cost is reduced by 7.89x relative to existing MAE methods.
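To make the "semantic tube masking" idea concrete, here is a minimal NumPy sketch of how such a mask could be built: whole body-part groups are masked over contiguous temporal tubes, so spatially and temporally adjacent joints cannot trivially leak the masked content. The joint grouping, function name, and all parameters are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

# Hypothetical joint groups for a 25-joint skeleton (e.g. an NTU-style layout);
# this partition is illustrative, not the paper's exact grouping.
BODY_PARTS = {
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
    "torso":     [0, 1, 2, 3, 20],
}

def semantic_tube_mask(num_frames, num_joints, mask_ratio=0.4, tube_len=8, rng=None):
    """Return a boolean (frames, joints) mask where True = masked.

    Entire body-part groups are masked across contiguous temporal tubes,
    so neighboring joints/frames cannot be used for trivial reconstruction.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    target = int(mask_ratio * num_frames * num_joints)
    parts = list(BODY_PARTS.values())
    while mask.sum() < target:
        joints = parts[rng.integers(len(parts))]          # pick a semantic group
        t0 = rng.integers(0, max(1, num_frames - tube_len))
        mask[t0:t0 + tube_len, joints] = True             # mask the whole tube
    return mask
```

Compared with random per-joint masking, this structured masking leaves far fewer local cues for the encoder to interpolate from, which is the stated motivation for the technique.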

Abstract

The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework to apply decoder-free masked modeling to skeleton representation learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.
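The decoder-free design replaces coordinate reconstruction with a feature-level objective: the shared encoder embeds a masked view and a full view, and the two are aligned contrastively rather than decoded back to joints. A minimal NumPy sketch of one plausible such objective (a one-directional InfoNCE loss; the function name, temperature, and loss form are assumptions, not the paper's exact formulation):

```python
import numpy as np

def info_nce(z_masked, z_full, temperature=0.1):
    """One-directional InfoNCE: masked-view embeddings act as queries,
    full-view embeddings as keys; matching batch indices are positives.
    z_masked, z_full: arrays of shape (batch, dim) from a shared encoder."""
    # L2-normalize so dot products are cosine similarities
    q = z_masked / np.linalg.norm(z_masked, axis=1, keepdims=True)
    k = z_full / np.linalg.norm(z_full, axis=1, keepdims=True)
    logits = q @ k.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # positives on the diagonal
```

Because the target lives in feature space, no decoder is ever run, and at inference time only the encoder forward pass remains -- which is where the claimed efficiency gain over reconstruction-based MAE pipelines comes from.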