Generative Data Augmentation for Skeleton Action Recognition

arXiv cs.CV / 4/17/2026


Key Points

  • The paper addresses the high cost of collecting large, diverse, well-annotated 3D skeleton datasets for skeleton-based action recognition by introducing a conditional generative data augmentation pipeline.
  • It learns the distribution of real skeleton sequences conditioned on action labels, allowing it to synthesize diverse, high-fidelity training samples even when labeled data is limited.
  • The proposed method uses a Transformer-based encoder–decoder architecture with a generative refinement module and a dropout mechanism to balance fidelity versus diversity during sampling.
  • Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show consistent accuracy improvements across multiple skeleton action recognition models in both few-shot and full-data settings.
  • The authors provide source code for reproducibility and further research.

Abstract

Skeleton-based human action recognition is a powerful approach for understanding human behaviour from pose data, but collecting large-scale, diverse, and well-annotated 3D skeleton datasets is both expensive and labour-intensive. To address this challenge, we propose a conditional generative pipeline for data augmentation in skeleton action recognition. Our method learns the distribution of real skeleton sequences conditioned on action labels, enabling the synthesis of diverse and high-fidelity data. Even with limited training samples, it can effectively generate skeleton sequences and achieve competitive recognition performance in low-data scenarios, demonstrating strong generalisation in downstream tasks. Specifically, we introduce a Transformer-based encoder-decoder architecture, combined with a generative refinement module and a dropout mechanism, to balance fidelity and diversity during sampling. Experiments on HumanAct12 and the refined NTU-RGBD (NTU-VIBE) dataset show that our approach consistently improves the accuracy of multiple skeleton-based action recognition models, validating its effectiveness in both few-shot and full-data settings. The source code can be found here.
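To make the overall pipeline concrete, here is a minimal sketch of label-conditioned augmentation in plain Python. This is *not* the paper's actual model: a toy per-label Gaussian stands in for the Transformer encoder-decoder generator, and a single `diversity` parameter stands in for the paper's dropout mechanism that trades fidelity against variety when sampling. All function names and the flat-list representation of a skeleton sequence are illustrative assumptions.

```python
import random

def fit_toy_generator(real_data):
    """Estimate a per-label mean sequence from real (sequence, label) pairs.

    Stand-in for training the conditional generator: each "skeleton
    sequence" is a flat list of joint coordinates.
    """
    by_label = {}
    for seq, label in real_data:
        by_label.setdefault(label, []).append(seq)
    means = {}
    for label, seqs in by_label.items():
        n = len(seqs)
        means[label] = [sum(vals) / n for vals in zip(*seqs)]
    return means

def sample(means, label, diversity=0.1, rng=None):
    """Draw one synthetic sequence conditioned on `label`.

    `diversity` plays the role of the paper's dropout knob: larger
    values trade fidelity (closeness to the real data) for variety.
    """
    rng = rng or random.Random()
    return [v + rng.gauss(0.0, diversity) for v in means[label]]

def augment(real_data, means, per_label=4, diversity=0.1, seed=0):
    """Return the real pairs plus `per_label` synthetic pairs per action."""
    rng = random.Random(seed)
    synthetic = [(sample(means, lab, diversity, rng), lab)
                 for lab in means for _ in range(per_label)]
    return real_data + synthetic
```

The synthetic pairs carry the conditioning label, so they can be mixed directly into the training set of any downstream skeleton action recognition model, which is the few-shot setting the paper evaluates.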