LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

arXiv cs.CL / 5/4/2026


Key Points

  • Off-the-shelf multilingual speaker encoders can produce different embeddings for the same speaker depending on the audio script, undermining cross-script identity preservation in voice cloning.
  • The paper shows that this “accent-conditional leakage” is especially problematic for cross-script TTS where a non-Indic-trained voice is projected into Indic scripts.
  • It proposes LASE (Language-Adversarial Speaker Encoder), which adds a small projection head on top of a frozen WavLM-base-plus and trains it with a supervised contrastive loss plus a gradient-reversal objective to remove language information while keeping speaker identity (see the sketch after this list).
  • Experiments on Western- and Indian-accented corpora indicate LASE largely closes the cross-script cosine-similarity gap (with residual deltas near zero) and improves the cross-script margin by about 2.4–2.7× over baselines.
  • In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN cross-script speaker recall while using roughly 100× less training data, and the authors release checkpoints, datasets, and a bootstrap recipe.
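The training recipe in the third bullet is compact enough to sketch. Below is a minimal PyTorch illustration of the structure described: a small projection head over pooled frozen WavLM-base-plus features, a supervised contrastive loss over speaker identity, and a gradient-reversal cross-entropy against a 4-language classifier. The layer sizes, pooling, temperature, and equal loss weighting are assumptions for illustration; only the frozen-backbone + SupCon + GRL shape comes from the paper's summary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies gradients by -lambda on the
    way back, so the language classifier learns to predict language while
    the projection head learns to erase it."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lambd * grad_out, None

class LASEHead(nn.Module):
    """Projection head over pooled frozen-backbone features, plus a 4-way
    language classifier reached through gradient reversal.
    hidden=768 matches WavLM-base-plus; emb=256 is an assumed size."""
    def __init__(self, hidden=768, emb=256, n_langs=4):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(hidden, emb), nn.ReLU(),
                                  nn.Linear(emb, emb))
        self.lang_clf = nn.Linear(emb, n_langs)

    def forward(self, pooled_feats, lambd=1.0):
        # pooled_feats: (batch, hidden) frame-averaged frozen WavLM features
        z = F.normalize(self.proj(pooled_feats), dim=-1)  # speaker embedding
        lang_logits = self.lang_clf(GradReverse.apply(z, lambd))
        return z, lang_logits

def supcon_loss(z, speaker_ids, temp=0.07):
    """Supervised contrastive loss: same-speaker utterances (across scripts)
    are positives, everything else in the batch is a negative. Batches are
    assumed to contain >= 2 utterances per speaker."""
    sim = z @ z.t() / temp
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (speaker_ids[:, None] == speaker_ids[None, :]) & ~self_mask
    # mean log-probability over each anchor's positives
    per_anchor = -log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return per_anchor[pos.any(1)].mean()

# One training step (the backbone stays frozen; equal weighting is assumed):
# z, lang_logits = head(pooled_wavlm_feats)
# loss = supcon_loss(z, speaker_ids) + F.cross_entropy(lang_logits, lang_ids)
```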

Abstract

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Δ = 0.013 Western, Δ = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4–2.7× over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone, but the choice of WavLM backbone also contributes. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100× less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
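The "residual gap consistent with zero" claims rest on bootstrap 95% confidence intervals over pair-level cosine similarities. Below is a minimal NumPy sketch of that kind of interval, assuming a simple percentile bootstrap over per-pair similarity scores; the function name and resampling details are illustrative, not the paper's released recipe.

```python
import numpy as np

def gap_ci(same_script_sims, cross_script_sims, n_boot=10_000, seed=0):
    """Percentile-bootstrap 95% CI on the mean cosine-similarity gap
    Delta = mean(same-script) - mean(cross-script), resampling pairs
    with replacement within each condition."""
    rng = np.random.default_rng(seed)
    s = np.asarray(same_script_sims)
    c = np.asarray(cross_script_sims)
    deltas = np.array([
        rng.choice(s, size=s.size, replace=True).mean()
        - rng.choice(c, size=c.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.percentile(deltas, [2.5, 97.5])

# If the returned interval straddles zero, the residual cross-script gap
# is statistically consistent with zero, as reported for LASE above.
```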