A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

arXiv cs.AI / 4/10/2026


Key Points

  • Diffusion-based TTS can develop “speaker drift,” where synthesized speech slowly shifts perceived speaker identity within an utterance, harming coherence in long-form or interactive use.
  • The authors propose an automatic speaker-drift detection framework that recasts drift as a binary speaker-consistency classification problem, combining cosine similarity over overlapping synthesized-speech segments with structured, LLM-based assessment.
  • They provide theoretical guarantees for the cosine-similarity-based detection approach and show that speaker embeddings form meaningful geometric clusters on the unit sphere.
  • A new synthetic benchmark is introduced, featuring human-validated drift annotations to enable reliable evaluation.
  • Experiments using multiple state-of-the-art LLMs demonstrate that an embedding-to-reasoning pipeline can effectively detect speaker drift, positioning it as a standalone research direction that links geometry-based signal analysis with LLM perceptual reasoning.
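The core signal described above — cosine similarity between speaker embeddings of overlapping segments — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the embedding model, window scheme, and the similarity threshold (here `0.75`) are all hypothetical placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def detect_drift(segment_embeddings, threshold=0.75):
    """Binary speaker-consistency decision over consecutive segment embeddings.

    Flags drift if any adjacent pair of segments falls below the
    similarity threshold. Returns (drifted, similarity_trace).
    """
    sims = [
        cosine(segment_embeddings[i], segment_embeddings[i + 1])
        for i in range(len(segment_embeddings) - 1)
    ]
    return min(sims) < threshold, sims
```

For example, a sequence of near-parallel embeddings would be classified as consistent, while a sequence whose final segment points in a clearly different direction on the unit sphere would be flagged as drifted.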

Abstract

Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
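The abstract also mentions prompting LLMs with "structured representations" of the similarity signal. The paper does not specify the prompt format, so the sketch below is purely illustrative: the JSON field names and the question wording are assumptions, showing only the general idea of packaging a similarity trace into a structured judgment prompt.

```python
import json

def build_drift_prompt(similarities, speaker_label="spk0"):
    """Package a segment-wise similarity trace into a structured prompt
    for an LLM judge. Field names and wording are hypothetical."""
    payload = {
        "task": "speaker_consistency",
        "speaker": speaker_label,
        "segment_similarities": [round(s, 3) for s in similarities],
        "question": (
            "Does this similarity trace indicate a gradual shift in "
            "perceived speaker identity within the utterance? Answer yes or no."
        ),
    }
    return "Assess speaker consistency:\n" + json.dumps(payload, indent=2)
```

An LLM given such a prompt would reason over the numeric trace (e.g., a monotone decline from 0.97 to 0.61 suggests gradual drift) rather than raw audio, which is what links the geometric embedding analysis to LLM perceptual reasoning.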