TokenDance: Token-to-Token Music-to-Dance Generation with Bidirectional Mamba

arXiv cs.AI · March 31, 2026


Key Points

  • TokenDance is proposed as a two-stage music-to-dance generation framework aimed at improving generalization to real-world music by expanding training coverage beyond limited 3D dance datasets.
  • The method uses Finite Scalar Quantization to discretize both music and dance into token representations, including upper/lower-body factorization for motions and separate semantic/acoustic codebooks for music.
  • A Local-Global-Local token-to-token generator with a Bidirectional Mamba backbone is introduced to produce coherent dance while maintaining strong music-dance alignment.
  • The approach enables efficient non-autoregressive inference and is reported to achieve state-of-the-art results in both generation quality and inference speed.
  • The paper positions TokenDance as practically valuable for virtual reality, dance education, and digital character animation where expressive and realistic dance output matters.
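The Bidirectional Mamba backbone mentioned above builds on state-space scans that process the token sequence in both directions, so every output position conditions on both past and future music tokens, which is what makes single-pass, non-autoregressive decoding possible. The following is a minimal NumPy sketch of that bidirectional-scan idea only; the real Mamba block additionally uses input-dependent (selective) parameters, gating, and hardware-aware kernels, and the function names here are illustrative, not from the paper.

```python
import numpy as np

def linear_scan(x, a, b):
    """One directional linear recurrence h_t = a_t * h_{t-1} + b_t * x_t,
    the core state-space scan that Mamba-style models build on."""
    h = np.zeros_like(x[0])
    outputs = []
    for x_t, a_t, b_t in zip(x, a, b):
        h = a_t * h + b_t * x_t
        outputs.append(h)
    return np.stack(outputs)

def bidirectional_scan(x, a, b):
    """Run the scan forward and backward over the sequence and sum the
    two passes, so each position sees both earlier and later tokens --
    the property that supports non-autoregressive generation."""
    fwd = linear_scan(x, a, b)
    bwd = linear_scan(x[::-1], a[::-1], b[::-1])[::-1]
    return fwd + bwd
```

With decay `a = 1` and input gain `b = 1`, the forward pass is a running sum and the backward pass a reversed running sum, so their combination at each step covers the whole sequence plus the current token once more.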

Abstract

Music-to-dance generation has broad applications in virtual reality, dance education, and digital character animation. However, the limited coverage of existing 3D dance datasets confines current models to a narrow subset of music styles and choreographic patterns, resulting in poor generalization to real-world music. Consequently, generated dances often become overly simplistic and repetitive, substantially degrading expressiveness and realism. To tackle this problem, we present TokenDance, a two-stage music-to-dance generation framework that explicitly addresses this limitation through dual-modality tokenization and efficient token-level generation. In the first stage, we discretize both dance and music using Finite Scalar Quantization, where dance motions are factorized into upper and lower-body components with kinematic-dynamic constraints, and music is decomposed into semantic and acoustic features with dedicated codebooks to capture choreography-specific structures. In the second stage, we introduce a Local-Global-Local token-to-token generator built on a Bidirectional Mamba backbone, enabling coherent motion synthesis, strong music-dance alignment, and efficient non-autoregressive inference. Extensive experiments demonstrate that TokenDance achieves overall state-of-the-art (SOTA) performance in both generation quality and inference speed, highlighting its effectiveness and practical value for real-world music-to-dance applications.
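Finite Scalar Quantization, the tokenizer used in the first stage, discretizes each latent channel independently by bounding it and rounding it to a small fixed set of levels, so the "codebook" is implicit (the Cartesian product of the per-channel levels) rather than learned. A minimal sketch of the forward quantization and token-id mapping follows; it omits the straight-through gradient estimator used in training, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization: bound each latent channel with tanh,
    scale it to +/-(L-1)/2, then round to the nearest integer level."""
    levels = np.asarray(levels, dtype=float)
    half = (levels - 1.0) / 2.0
    bounded = np.tanh(z) * half   # each channel now lies in [-half, half]
    return np.round(bounded)

def fsq_to_index(q, levels):
    """Map a quantized vector to a single token id in the implicit
    codebook of size prod(levels), via mixed-radix encoding."""
    levels = np.asarray(levels)
    digits = (q + (levels - 1) / 2.0).astype(int)  # shift to 0 .. L-1
    index = 0
    for digit, base in zip(digits, levels):
        index = index * base + digit
    return int(index)
```

For example, with three channels of three levels each, the implicit codebook has 3 x 3 x 3 = 27 entries, and every bounded-and-rounded vector maps to exactly one token id: a large positive channel saturates to +1, a large negative one to -1, and zero stays at 0.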
