
Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

arXiv cs.CV / 3/12/2026


Key Points

  • SLiM is proposed as a decoder-free masked modeling framework for skeleton-based action representation learning that unifies masked modeling and contrastive learning via a shared encoder.
  • By removing the reconstruction decoder, SLiM reduces computational redundancy and forces the encoder to learn discriminative features directly.
  • Semantic tube masking and skeletal-aware augmentations are introduced to prevent trivial reconstructions due to high skeletal-temporal correlation and to maintain anatomical consistency across temporal scales.
  • Experiments show state-of-the-art performance across all downstream protocols, with substantially higher efficiency: inference cost is reduced by 7.89x relative to existing MAE methods.
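To make the "semantic tube masking" idea concrete, here is a minimal NumPy sketch of how such a mask could be built: whole body-part groups are masked over contiguous temporal tubes, so spatially and temporally adjacent joints cannot trivially leak the masked content. The joint grouping, function name, and all parameters are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

# Hypothetical joint groups for a 25-joint skeleton (e.g. an NTU-style layout);
# this partition is illustrative, not the paper's exact grouping.
BODY_PARTS = {
    "left_arm":  [4, 5, 6, 7],
    "right_arm": [8, 9, 10, 11],
    "left_leg":  [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
    "torso":     [0, 1, 2, 3, 20],
}

def semantic_tube_mask(num_frames, num_joints, mask_ratio=0.4, tube_len=8, rng=None):
    """Return a boolean (frames, joints) mask where True = masked.

    Entire body-part groups are masked across contiguous temporal tubes,
    so neighboring joints/frames cannot be used for trivial reconstruction.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros((num_frames, num_joints), dtype=bool)
    target = int(mask_ratio * num_frames * num_joints)
    parts = list(BODY_PARTS.values())
    while mask.sum() < target:
        joints = parts[rng.integers(len(parts))]          # pick a semantic group
        t0 = rng.integers(0, max(1, num_frames - tube_len))
        mask[t0:t0 + tube_len, joints] = True             # mask the whole tube
    return mask
```

Compared with random per-joint masking, this structured masking leaves far fewer local cues for the encoder to interpolate from, which is the stated motivation for the technique.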

Abstract

The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry -- benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework to apply decoder-free masked modeling to skeleton representation learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.
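The decoder-free design replaces coordinate reconstruction with a feature-level objective: the shared encoder embeds a masked view and a full view, and the two are aligned contrastively rather than decoded back to joints. A minimal NumPy sketch of one plausible such objective (a one-directional InfoNCE loss; the function name, temperature, and loss form are assumptions, not the paper's exact formulation):

```python
import numpy as np

def info_nce(z_masked, z_full, temperature=0.1):
    """One-directional InfoNCE: masked-view embeddings act as queries,
    full-view embeddings as keys; matching batch indices are positives.
    z_masked, z_full: arrays of shape (batch, dim) from a shared encoder."""
    # L2-normalize so dot products are cosine similarities
    q = z_masked / np.linalg.norm(z_masked, axis=1, keepdims=True)
    k = z_full / np.linalg.norm(z_full, axis=1, keepdims=True)
    logits = q @ k.T / temperature                   # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # positives on the diagonal
```

Because the target lives in feature space, no decoder is ever run, and at inference time only the encoder forward pass remains -- which is where the claimed efficiency gain over reconstruction-based MAE pipelines comes from.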