STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization

arXiv cs.RO / 4/8/2026


Key Points

  • The paper introduces STAR (Skill Training with Augmented Rotation), a framework for learning discrete robot skill abstractions and composing them into complex behaviors.
  • It addresses codebook collapse in VQ-VAE-style methods with rotation-augmented residual skill quantization (RaRSQ), a rotation-based gradient mechanism that encodes the relative angles between encoder outputs into the gradient flow, structuring the embedding space within each skill code.
  • To model how learned skills causally relate, it presents the Causal Skill Transformer (CST), an autoregressive model that explicitly captures dependencies among skill representations for coherent action generation.
  • Experiments on the LIBERO benchmark and real-world tasks show STAR improves performance by about 12% over baseline approaches.
  • Overall, the work advances both representation learning (robust discrete skill codes) and skill composition (dependency-aware generation) for robotic manipulation.
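The rotation-based gradient idea behind RaRSQ can be made concrete with a small sketch. The snippet below is a hypothetical PyTorch illustration in the spirit of rotation-trick quantization, not the paper's actual code (function names and details are assumptions): the forward pass outputs the nearest codebook vector, while gradients flow back through a rotation, held constant, that maps each encoder output onto its assigned code, so the relative angle between them shapes the gradient instead of every point in a code receiving an identical straight-through copy.

```python
import torch

def rotation_quantize(e, codebook):
    """Hypothetical sketch of rotation-based vector quantization.

    Forward: returns the nearest codebook entry for each encoder output.
    Backward: gradients reach `e` through a rotation (treated as a
    constant) that maps each encoder output onto its code, so points
    assigned to the same code get direction-dependent gradients.
    """
    eps = 1e-8
    idx = torch.cdist(e, codebook).argmin(dim=1)   # nearest-code assignment
    q = codebook[idx]
    e_hat = e / (e.norm(dim=1, keepdim=True) + eps)
    q_hat = q / (q.norm(dim=1, keepdim=True) + eps)
    r = e_hat + q_hat
    r = r / (r.norm(dim=1, keepdim=True) + eps)    # bisector of e_hat, q_hat
    # Rotation R = (2 r r^T - I)(2 e_hat e_hat^T - I) maps e_hat to q_hat.
    # Apply it matrix-free, detaching R and the norm ratio so they act as
    # constants in the backward pass.
    e_hat, r = e_hat.detach(), r.detach()
    scale = (q.norm(dim=1, keepdim=True)
             / (e.norm(dim=1, keepdim=True) + eps)).detach()
    y = 2 * (e * e_hat).sum(1, keepdim=True) * e_hat - e
    out = scale * (2 * (y * r).sum(1, keepdim=True) * r - y)
    return out, idx
```

In the forward pass `out` equals the selected code exactly; the only difference from the usual straight-through estimator is how gradients reach `e`.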

Abstract

Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation. Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), but they suffer from codebook collapse and struggle to model the causal relationships between learned skills. To address these limitations, we present **S**kill **T**raining with **A**ugmented **R**otation (**STAR**), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ). It encodes the relative angles between encoder outputs into the gradient flow via a rotation-based gradient mechanism: points within the same skill code are either pushed apart or pulled closer together depending on the gradient directions. Further, to capture the causal relationships between skills, we present the Causal Skill Transformer (CST), which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation. Extensive experiments demonstrate the superiority of STAR on both the LIBERO benchmark and real-world tasks, with around 12% improvement over the baselines.
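The autoregressive skill-composition idea can likewise be sketched. Below is a minimal, hypothetical PyTorch decoder over discrete skill codes (class and parameter names are illustrative, not the paper's CST): each next skill code is predicted conditioned on all previously generated codes via a causal attention mask.

```python
import torch
import torch.nn as nn

class CausalSkillDecoder(nn.Module):
    """Hypothetical sketch: autoregressively generates a sequence of
    discrete skill codes, each conditioned on all previous codes via
    causal masking. A real system would also condition on observations
    and decode codes into low-level actions."""

    def __init__(self, num_codes=64, dim=128, num_layers=2, max_len=8):
        super().__init__()
        self.embed = nn.Embedding(num_codes + 1, dim)  # +1 for a BOS token
        self.pos = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_codes)
        self.bos = num_codes

    @torch.no_grad()
    def generate(self, batch, steps):
        codes = torch.full((batch, 1), self.bos, dtype=torch.long)
        for _ in range(steps):
            t = codes.size(1)
            x = self.embed(codes) + self.pos(torch.arange(t))
            mask = nn.Transformer.generate_square_subsequent_mask(t)
            h = self.blocks(x, mask=mask)          # causal self-attention
            nxt = self.head(h[:, -1]).argmax(-1, keepdim=True)
            codes = torch.cat([codes, nxt], dim=1)  # feed back the new code
        return codes[:, 1:]  # drop BOS
```

The loop only demonstrates the dependency structure: code t+1 is a function of codes 1..t, which is what distinguishes this from decoding each skill independently.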