TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation

arXiv cs.CV / 4/21/2026

Key Points

  • TeMuDance addresses a key gap in music-driven dance generation: it adds semantic, text-based control over specific movements, where prior work optimizes only for realism and audio–motion alignment.
  • The framework uses motion as a shared semantic anchor to align disjoint music–dance and text–motion datasets in one embedding space, retrieving the missing modality for each sample so that end-to-end training requires no manually annotated music–text–motion triplets (see the retrieval sketch after this list).
  • TeMuDance trains a lightweight text-control branch on top of a frozen music-to-dance diffusion model to maintain rhythmic fidelity while adding fine-grained language guidance.
  • To improve training-signal quality, it applies dual-stream fine-tuning with confidence-based filtering that suppresses noise in the retrieved supervision (also illustrated in the sketch below), and introduces a task-aligned metric that evaluates whether prompts produce the intended kinematic attributes under music conditioning.
  • Experiments indicate TeMuDance delivers dance quality comparable to prior approaches while substantially improving how closely the generated dance follows natural-language movement instructions.
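
Neither the paper's code nor its exact retrieval procedure appears in this digest, so the following is a minimal NumPy sketch of the motion-anchored bridging idea under stated assumptions: motion clips from both datasets are already embedded in one shared space, and "confidence-based filtering" is approximated here by a cosine-similarity threshold. All names (`retrieve_captions_for_dances`, `sim_threshold`) are hypothetical.

```python
import numpy as np

def retrieve_captions_for_dances(dance_motion_emb, text_motion_emb, captions,
                                 sim_threshold=0.7):
    """Pair each music-dance clip with a caption by matching motion
    embeddings (motion acts as the shared semantic anchor).

    dance_motion_emb: (N, d) motion embeddings from the music-dance dataset
    text_motion_emb:  (M, d) motion embeddings from the text-motion dataset
    captions:         list of M captions aligned with text_motion_emb
    Returns (dance_index, caption, similarity) triples that clear the
    confidence threshold.
    """
    # L2-normalize so the dot product equals cosine similarity.
    a = dance_motion_emb / np.linalg.norm(dance_motion_emb, axis=1, keepdims=True)
    b = text_motion_emb / np.linalg.norm(text_motion_emb, axis=1, keepdims=True)
    sim = a @ b.T                                  # (N, M) similarity matrix

    best = sim.argmax(axis=1)                      # nearest caption per dance
    best_sim = sim[np.arange(len(a)), best]

    # Confidence-based filtering: discard low-similarity retrievals, the
    # main source of noisy pseudo-supervision.
    return [(i, captions[j], float(s))
            for i, (j, s) in enumerate(zip(best, best_sim))
            if s >= sim_threshold]
```

The retrieved (music, caption, dance) triples would then stand in as pseudo-annotated training data in place of a manually labeled triplet dataset.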

Abstract

Existing music-driven dance generation approaches have achieved strong realism and effective audio-motion alignment. However, they generally lack semantic controllability, making it difficult to guide specific movements through natural language descriptions. This limitation primarily stems from the absence of large-scale datasets that jointly align music, text, and motion for supervised learning of text-conditioned control. To address this challenge, we propose TeMuDance, a framework that enables text-based control for music-conditioned dance generation without requiring any manually annotated music-text-motion triplet dataset. TeMuDance introduces a motion-centred bridging paradigm that leverages motion as a shared semantic anchor to align disjoint music-dance and text-motion datasets within a unified embedding space, enabling cross-modal retrieval of missing modalities for end-to-end training. A lightweight text control branch is then trained on top of a frozen music-to-dance diffusion backbone, preserving rhythmic fidelity while enabling fine-grained semantic guidance. To further suppress noise inherent in the retrieved supervision, we design a dual-stream fine-tuning strategy with confidence-based filtering. We also propose a novel task-aligned metric that quantifies whether textual prompts induce the intended kinematic attributes under music conditioning. Extensive experiments demonstrate that TeMuDance achieves competitive dance quality while substantially improving text-conditioned control over existing methods.
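
The digest describes the architecture only at a high level. Below is a rough PyTorch sketch of the "frozen backbone plus lightweight text branch" pattern; `MusicToDanceDiffusion`, `TextControlBranch`, and every dimension are invented stand-ins, not the paper's actual model.

```python
import torch
import torch.nn as nn

class MusicToDanceDiffusion(nn.Module):
    """Stand-in for the pretrained music-to-dance denoiser."""
    def __init__(self, motion_dim=72, music_dim=128, cond_dim=256):
        super().__init__()
        self.denoiser = nn.Linear(motion_dim + music_dim + cond_dim, motion_dim)

    def forward(self, noisy_motion, music_emb, text_cond):
        x = torch.cat([noisy_motion, music_emb, text_cond], dim=-1)
        return self.denoiser(x)                 # predicted noise

class TextControlBranch(nn.Module):
    """Lightweight adapter: text embedding -> conditioning vector."""
    def __init__(self, text_dim=512, cond_dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(text_dim, cond_dim), nn.SiLU(),
                                  nn.Linear(cond_dim, cond_dim))

    def forward(self, text_emb):
        return self.proj(text_emb)

backbone, branch = MusicToDanceDiffusion(), TextControlBranch()
for p in backbone.parameters():                 # freeze the backbone so its
    p.requires_grad_(False)                     # rhythm following is preserved
opt = torch.optim.AdamW(branch.parameters(), lr=1e-4)

# One simplified denoising step on dummy tensors; only the branch updates.
motion = torch.randn(8, 72)
music, text = torch.randn(8, 128), torch.randn(8, 512)
noise = torch.randn_like(motion)
pred = backbone(motion + noise, music, branch(text))
loss = nn.functional.mse_loss(pred, noise)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the backbone's weights never change, the music-to-motion mapping it learned, and hence beat alignment, is preserved; the branch can only steer generation through its conditioning vector.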
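
The task-aligned metric itself is not defined in this digest. One plausible instantiation, offered strictly as an assumption, scores control as the relative change in a prompt-relevant kinematic statistic between a text-conditioned generation and a music-only generation of the same clip; the keyword-to-attribute table below is invented for illustration.

```python
import numpy as np

# Hypothetical prompt-keyword -> kinematic-attribute extractors over a
# joint-position sequence of shape (T, J, 3), with joint 0 as the pelvis.
ATTRIBUTES = {
    "jump": lambda m: m[:, 0, 1].max() - m[:, 0, 1].min(),  # pelvis height range
    "fast": lambda m: np.linalg.norm(np.diff(m, axis=0), axis=-1).mean(),  # mean joint speed
}

def control_gain(keyword, motion_with_text, motion_music_only):
    """Relative increase of the keyword's attribute when the text prompt is
    added on top of identical music conditioning."""
    f = ATTRIBUTES[keyword]
    base = f(motion_music_only)
    return (f(motion_with_text) - base) / (abs(base) + 1e-8)
```

A positive gain for "jump", say, would indicate that the prompt actually raised vertical displacement rather than being ignored by the model.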