TokenDance: Token-to-Token Music-to-Dance Generation with Bidirectional Mamba

arXiv cs.AI · March 31, 2026


Key Points

  • TokenDance is proposed as a two-stage music-to-dance generation framework aimed at improving generalization to real-world music by expanding training coverage beyond limited 3D dance datasets.
  • The method uses Finite Scalar Quantization to discretize both music and dance into token representations, including upper/lower-body factorization for motions and separate semantic/acoustic codebooks for music.
  • A Local-Global-Local token-to-token generator with a Bidirectional Mamba backbone is introduced to produce coherent dance while maintaining strong music-dance alignment.
  • The approach enables efficient non-autoregressive inference and is reported to achieve state-of-the-art results in both generation quality and inference speed.
  • The paper positions TokenDance as practically valuable for virtual reality, dance education, and digital character animation where expressive and realistic dance output matters.
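The Bidirectional Mamba backbone mentioned above builds on state-space scans that process the token sequence in both directions, so every output position conditions on both past and future music tokens, which is what makes single-pass, non-autoregressive decoding possible. The following is a minimal NumPy sketch of that bidirectional-scan idea only; the real Mamba block additionally uses input-dependent (selective) parameters, gating, and hardware-aware kernels, and the function names here are illustrative, not from the paper.

```python
import numpy as np

def linear_scan(x, a, b):
    """One directional linear recurrence h_t = a_t * h_{t-1} + b_t * x_t,
    the core state-space scan that Mamba-style models build on."""
    h = np.zeros_like(x[0])
    outputs = []
    for x_t, a_t, b_t in zip(x, a, b):
        h = a_t * h + b_t * x_t
        outputs.append(h)
    return np.stack(outputs)

def bidirectional_scan(x, a, b):
    """Run the scan forward and backward over the sequence and sum the
    two passes, so each position sees both earlier and later tokens --
    the property that supports non-autoregressive generation."""
    fwd = linear_scan(x, a, b)
    bwd = linear_scan(x[::-1], a[::-1], b[::-1])[::-1]
    return fwd + bwd
```

With decay `a = 1` and input gain `b = 1`, the forward pass is a running sum and the backward pass a reversed running sum, so their combination at each step covers the whole sequence plus the current token once more.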

Abstract

Music-to-dance generation has broad applications in virtual reality, dance education, and digital character animation. However, the limited coverage of existing 3D dance datasets confines current models to a narrow subset of music styles and choreographic patterns, resulting in poor generalization to real-world music. Consequently, generated dances often become overly simplistic and repetitive, substantially degrading expressiveness and realism. To tackle this problem, we present TokenDance, a two-stage music-to-dance generation framework that explicitly addresses this limitation through dual-modality tokenization and efficient token-level generation. In the first stage, we discretize both dance and music using Finite Scalar Quantization, where dance motions are factorized into upper and lower-body components with kinematic-dynamic constraints, and music is decomposed into semantic and acoustic features with dedicated codebooks to capture choreography-specific structures. In the second stage, we introduce a Local-Global-Local token-to-token generator built on a Bidirectional Mamba backbone, enabling coherent motion synthesis, strong music-dance alignment, and efficient non-autoregressive inference. Extensive experiments demonstrate that TokenDance achieves overall state-of-the-art (SOTA) performance in both generation quality and inference speed, highlighting its effectiveness and practical value for real-world music-to-dance applications.
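Finite Scalar Quantization, the tokenizer used in the first stage, discretizes each latent channel independently by bounding it and rounding it to a small fixed set of levels, so the "codebook" is implicit (the Cartesian product of the per-channel levels) rather than learned. A minimal sketch of the forward quantization and token-id mapping follows; it omits the straight-through gradient estimator used in training, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization: bound each latent channel with tanh,
    scale it to +/-(L-1)/2, then round to the nearest integer level."""
    levels = np.asarray(levels, dtype=float)
    half = (levels - 1.0) / 2.0
    bounded = np.tanh(z) * half   # each channel now lies in [-half, half]
    return np.round(bounded)

def fsq_to_index(q, levels):
    """Map a quantized vector to a single token id in the implicit
    codebook of size prod(levels), via mixed-radix encoding."""
    levels = np.asarray(levels)
    digits = (q + (levels - 1) / 2.0).astype(int)  # shift to 0 .. L-1
    index = 0
    for digit, base in zip(digits, levels):
        index = index * base + digit
    return int(index)
```

For example, with three channels of three levels each, the implicit codebook has 3 x 3 x 3 = 27 entries, and every bounded-and-rounded vector maps to exactly one token id: a large positive channel saturates to +1, a large negative one to -1, and zero stays at 0.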
