UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions

arXiv cs.AI / April 27, 2026


Key Points

  • UniSonate is a new unified generative-audio model that can synthesize speech, music, and sound effects from a standardized, reference-free natural-language instruction interface.
  • The paper proposes a dynamic token injection mechanism that maps unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT); a toy sketch of the idea appears after this list.
  • To address optimization conflicts across modalities, UniSonate uses a multi-stage curriculum learning strategy that helps stabilize cross-modal training; a minimal schedule sketch also follows below.
  • Experiments report state-of-the-art results for instruction-based TTS (WER 1.47%) and text-to-music generation (SongEval Coherence 3.18), competitive fidelity for sound-effect generation, and positive transfer from joint training on diverse audio data.
  • Audio samples are provided online, and the work is released as an arXiv preprint (arXiv:2604.22209v1).
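
The bullets above compress a lot, so here is a minimal sketch of what "dynamic token injection" could look like in code. The paper's exact design is not reproduced here: the function name, tensor shapes, frame rate, and positional scheme are all illustrative assumptions, meant only to show how a variable-length, unstructured sound latent can be resampled onto the fixed frame grid that a phoneme-driven MM-DiT uses for duration control.

```python
# Hypothetical sketch of "dynamic token injection"; names and shapes below
# are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def inject_sound_tokens(phoneme_tokens, phoneme_durations, sfx_latent, frame_rate=50):
    """Project an unstructured sound-effect latent onto the structured frame
    grid implied by phoneme durations, then concatenate along the sequence axis.

    phoneme_tokens:    (P, d)  per-phoneme embeddings (structured track)
    phoneme_durations: (P,)    duration of each phoneme in seconds
    sfx_latent:        (T, d)  variable-length unstructured audio latent
    Returns a (P + N, d) token sequence plus frame-aligned positions, where
    N = total_duration * frame_rate is fixed by the target duration.
    """
    total_seconds = float(phoneme_durations.sum())
    n_frames = max(1, int(round(total_seconds * frame_rate)))

    # Resample the unstructured latent to exactly n_frames steps: this is the
    # "structured temporal latent space" -- its length now encodes duration.
    grid = F.interpolate(
        sfx_latent.t().unsqueeze(0),           # (1, d, T)
        size=n_frames, mode="linear", align_corners=False,
    ).squeeze(0).t()                            # (n_frames, d)

    tokens = torch.cat([phoneme_tokens, grid], dim=0)

    # Frame-aligned positions: phoneme tokens sit at their cumulative onset,
    # injected sound tokens on a uniform frame grid, so the transformer can
    # attend across both tracks on one shared timeline.
    onsets = torch.cumsum(phoneme_durations, dim=0) - phoneme_durations
    sfx_pos = torch.arange(n_frames, dtype=torch.float32) / frame_rate
    positions = torch.cat([onsets, sfx_pos], dim=0)
    return tokens, positions
```

Under this framing, duration control comes for free: lengthening the entries of phoneme_durations stretches both tracks against the same clock, which is the property the bullet point attributes to the injection mechanism.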

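Similarly, the multi-stage curriculum can be pictured as a schedule of per-modality sampling weights that starts from the structured modality and gradually opens up the full mixture. The stage lengths and mixing ratios below are assumptions for illustration, not values from the paper.

```python
# Illustrative multi-stage curriculum; stage boundaries and mixing ratios
# are assumptions, not taken from the paper.
import random

# Per-stage sampling weights over (speech, music, sfx). Early stages lean on
# the structured modalities; the final stage trains the full mixture.
CURRICULUM = [
    {"steps": 100_000, "weights": {"speech": 1.0, "music": 0.0, "sfx": 0.0}},
    {"steps": 100_000, "weights": {"speech": 0.5, "music": 0.5, "sfx": 0.0}},
    {"steps": 200_000, "weights": {"speech": 0.4, "music": 0.3, "sfx": 0.3}},
]


def modality_for_step(step: int) -> str:
    """Pick which modality's batch to draw at a given global training step."""
    for stage in CURRICULUM:
        if step < stage["steps"]:
            names, weights = zip(*stage["weights"].items())
            return random.choices(names, weights=weights, k=1)[0]
        step -= stage["steps"]
    # Past the schedule: keep sampling from the final-stage mixture.
    names, weights = zip(*CURRICULUM[-1]["weights"].items())
    return random.choices(names, weights=weights, k=1)[0]
```
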
Abstract

Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines. Audio samples are available at https://qiangchunyu.github.io/UniSonate/.
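
For readers who want to ground the phrase "unified flow-matching framework," here is a minimal training-step sketch under the common rectified-flow formulation (straight interpolation path, velocity regression). The model signature and conditioning interface are hypothetical; the paper may use a different probability path or loss weighting.

```python
# Minimal conditional flow-matching training step (rectified-flow style
# linear path); the model signature and conditioning are illustrative.
import torch


def flow_matching_loss(model, x1, cond):
    """x1: (B, T, d) clean audio latents; cond: text-instruction embeddings.

    Sample t ~ U(0, 1) and noise x0 ~ N(0, I), form x_t on the straight path
    between them, and regress the model's velocity onto (x1 - x0).
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)
    t = torch.rand(b, device=x1.device).view(b, 1, 1)
    xt = (1.0 - t) * x0 + t * x1            # linear interpolation path
    target_v = x1 - x0                       # constant velocity along the path
    pred_v = model(xt, t.view(b), cond)      # hypothetical MM-DiT forward pass
    return torch.mean((pred_v - target_v) ** 2)
```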
