One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech

arXiv cs.CL / 4/30/2026


Key Points

  • The paper tackles the challenge of preserving a speaker’s voice identity while generating speech in a different language, with a focus on scientific communication.
  • It evaluates leading voice cloning models for cross-lingual generation of scientific texts in Arabic, Chinese, and French.
  • The authors build cross-lingual voice cloning systems using the OmniVoice foundation model and apply data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus.
  • Experiments show that fine-tuning with the synthetic augmented data improves intelligibility across the evaluated languages (measured by WER and CER) while maintaining speaker similarity.

Abstract

Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the Cross-Lingual Voice Cloning shared task at the International Conference on Spoken Language Translation (IWSLT 2026). First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating consistent improvements in intelligibility (WER and CER) across languages while preserving speaker similarity.
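The intelligibility metrics mentioned above, word error rate (WER) and character error rate (CER), are standard ASR-based measures: the generated speech is transcribed and compared against the reference text via edit distance. The paper excerpt does not give its exact scoring setup, so the sketch below is only a minimal illustration of how the two metrics are conventionally computed (CER is typically used for languages without whitespace word boundaries, such as Chinese):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (classic DP)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # deleting all of ref's first i tokens
    for j in range(n + 1):
        dp[0][j] = j  # inserting all of hyp's first j tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[m][n]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: char-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the cat sit")` gives one substitution over three reference words, i.e. 1/3. Lower values mean the cloned speech was transcribed more faithfully.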