Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

arXiv cs.CL / 4/1/2026


Key Points

Habibi is an open-source, unified-dialect Arabic text-to-speech (TTS) framework designed to cover 12+ regional dialects despite major cross-dialect lexical/phonological gaps.
The system repurposes open-source ASR corpora into TTS training data through a multi-step curation pipeline. A linguistically informed curriculum learning strategy, progressing from Modern Standard Arabic to dialectal data, enables robust zero-shot dialectal synthesis without text diacritization.
The release includes the first standardized multi-dialect Arabic TTS benchmark (11,000+ utterances across 7 dialect subsets) with manually verified transcripts.
On the benchmark, Habibi’s unified model matches or surpasses per-dialect specialized models, and evaluations (automatic and human) show competitiveness with ElevenLabs’ Eleven v3 (alpha) on intelligibility, speaker similarity, and naturalness.
The authors open-source all checkpoints, training and inference code, and benchmark data. Extensive ablations (roughly 8,000 H100 GPU hours across 30+ configurations) validate each design choice.
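
To make the idea of repurposing ASR data for TTS concrete, here is a minimal sketch of one filtering stage. The summary does not describe the actual steps of Habibi's curation pipeline, so everything here is an assumption made for illustration: the `Utterance` fields, the `quality_score` field, and all thresholds are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    audio_path: str
    transcript: str
    dialect: str
    duration_s: float
    # Hypothetical field: an estimated speech-quality score in [0, 1].
    quality_score: float


def keep_for_tts(u: Utterance, min_s: float = 1.0, max_s: float = 20.0,
                 min_quality: float = 0.7) -> bool:
    """Return True if an ASR utterance passes these illustrative TTS filters.

    The thresholds are placeholders, not values reported by the authors.
    """
    if not (min_s <= u.duration_s <= max_s):
        return False  # too short or too long for synthesis training
    if not u.transcript.strip():
        return False  # an empty transcript cannot supervise TTS
    return u.quality_score >= min_quality


def curate(utterances: list[Utterance]) -> list[Utterance]:
    """Keep only the utterances that pass every filter."""
    return [u for u in utterances if keep_for_tts(u)]
```

ASR corpora tolerate noisy or very short clips that a synthesis model would learn to reproduce, which is why a filter of this general kind is a plausible early stage. A real pipeline would likely add further steps that this sketch leaves out.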

Abstract

Arabic spans over 30 spoken varieties, yet no open-source text-to-speech system unifies them. Key barriers include substantial cross-dialect lexical and phonological divergence, scarce synthesis-grade data, and the absence of a standardized multi-dialect evaluation benchmark. We present Habibi, a unified-dialectal Arabic TTS framework that addresses all three. Through a multi-step curation pipeline, we repurpose open-source ASR corpora into TTS training data covering 12+ regional dialects. A linguistically informed curriculum learning strategy, progressing from Modern Standard Arabic to dialectal data, enables robust zero-shot synthesis without text diacritization. We further release the first standardized multi-dialect Arabic TTS benchmark, comprising over 11,000 utterances across 7 dialect subsets with manually verified transcripts. On this benchmark, our unified model matches or surpasses per-dialect specialized models. Both automatic metrics and human evaluations confirm that Habibi is highly competitive with ElevenLabs' Eleven v3 (alpha) in intelligibility, speaker similarity, and naturalness. Extensive ablations (~8,000 H100 GPU hours, 30+ configurations) validate each design choice. We open-source all checkpoints, training and inference code, and benchmark data, the first such release for multi-dialect Arabic TTS, at https://SWivid.github.io/Habibi/.
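
The curriculum described in the abstract, moving from Modern Standard Arabic (MSA) to dialectal data, can be illustrated with a simple sampling schedule. This is only a sketch under stated assumptions: the abstract does not specify the schedule, so the linear decay, the function names `msa_fraction` and `sample_batch`, and the default `start`/`end` values are hypothetical rather than the authors' method.

```python
import random


def msa_fraction(step: int, total_steps: int,
                 start: float = 1.0, end: float = 0.2) -> float:
    """Share of MSA samples at a given step, decayed linearly (an assumption)."""
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay well-defined.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return (1.0 - progress) * start + progress * end


def sample_batch(msa: list, dialectal: list, step: int, total_steps: int,
                 batch_size: int, rng: random.Random,
                 start: float = 1.0, end: float = 0.2) -> list:
    """Draw a batch that mixes MSA and dialectal items per the schedule."""
    p = msa_fraction(step, total_steps, start, end)
    return [
        rng.choice(msa) if rng.random() < p else rng.choice(dialectal)
        for _ in range(batch_size)
    ]
```

With these defaults, training starts on MSA only and ends with mostly dialectal data while retaining a small share of MSA. A staged schedule (discrete phases instead of a linear ramp) would fit the same description equally well.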