Unsupervised Skeleton-Based Action Segmentation via Hierarchical Spatiotemporal Vector Quantization

arXiv cs.CV / 4/17/2026


Key Points

  • The paper introduces a hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation.
  • It uses two consecutive quantization levels: a lower level captures fine-grained subactions and a higher level aggregates them into action-level representations.
  • A purely spatial hierarchical variant, trained by reconstructing the input skeletons, already achieves strong results; incorporating temporal information alongside the spatial cues improves performance further.
  • The extended hierarchical spatiotemporal version performs multi-level clustering while also reconstructing the skeleton inputs and their corresponding timestamps.
  • Experiments on HuGaDB, LARa, and BABEL report new state-of-the-art performance and reduced segment-length bias in unsupervised action segmentation.

Abstract

We propose a novel hierarchical spatiotemporal vector quantization framework for unsupervised skeleton-based temporal action segmentation. We first introduce a hierarchical approach, which includes two consecutive levels of vector quantization. Specifically, the lower level associates skeletons with fine-grained subactions, while the higher level further aggregates subactions into action-level representations. Our hierarchical approach outperforms the non-hierarchical baseline, while primarily exploiting spatial cues by reconstructing input skeletons. Next, we extend our approach by leveraging both spatial and temporal information, yielding a hierarchical spatiotemporal vector quantization scheme. In particular, our hierarchical spatiotemporal approach performs multi-level clustering, while simultaneously recovering input skeletons and their corresponding timestamps. Lastly, extensive experiments on multiple benchmarks, including HuGaDB, LARa, and BABEL, demonstrate that our approach establishes a new state-of-the-art performance and reduces segment length bias in unsupervised skeleton-based temporal action segmentation.
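The core idea of the two-level scheme can be sketched in a few lines: skeleton frames are first assigned to a fine-grained (subaction) codebook, and the resulting codes are then quantized again by a coarser (action-level) codebook. The minimal NumPy sketch below uses random features and fixed codebooks purely for illustration; the feature dimensions, codebook sizes, and reconstruction objective are placeholders, not the authors' actual architecture or training procedure.

```python
import numpy as np

def quantize(x, codebook):
    """Assign each row of x to its nearest codebook entry (squared L2 distance)."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) distances
    idx = d.argmin(1)                                          # nearest code index per row
    return idx, codebook[idx]                                  # indices and quantized vectors

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 16))      # toy stand-in: 100 skeleton frames, 16-D features
sub_codebook = rng.normal(size=(8, 16))  # lower level: 8 fine-grained subaction codes
act_codebook = rng.normal(size=(3, 16))  # higher level: 3 action-level codes

# Level 1: frames -> subactions; Level 2: quantized subactions -> actions.
sub_idx, sub_q = quantize(frames, sub_codebook)
act_idx, act_q = quantize(sub_q, act_codebook)

# Toy analogue of the spatial reconstruction objective (reconstructing input skeletons).
recon_error = np.mean((frames - sub_q) ** 2)
```

Because the action-level assignment depends only on the subaction code, frames mapped to the same subaction always land in the same action cluster, which is the aggregation behavior the hierarchy is meant to induce.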