PHONOS: PHOnetic Neutralization for Online Streaming Applications

arXiv cs.CL / 3/31/2026

Key Points

The paper introduces PHONOS, a real-time speaker anonymization module for streaming that reduces the identifiability risk caused by non-native accents narrowing the anonymity set.
PHONOS uses pre-generated “golden” speaker utterances that preserve original timbre and rhythm while replacing foreign segmental sounds with native ones via silence-aware DTW alignment and zero-shot voice conversion.
It trains a causal accent translator that converts non-native content tokens into native-like equivalents with at most 40 ms of look-ahead, optimized with joint cross-entropy and CTC losses.
Experiments report an 81% reduction in non-native accent confidence and improved human listening-test ratings, alongside lower speaker linkability in embedding space and streaming latency under 241 ms on a single GPU.
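
To make the alignment step concrete, here is a minimal sketch of dynamic time warping (DTW) with a simple silence filter. It is not the paper's implementation: the summary does not say how silence is handled, so dropping low-energy frames before alignment is an assumption, and the names `dtw_path`, `silence_aware_dtw`, and `threshold` are hypothetical.

```python
import numpy as np


def dtw_path(src, tgt):
    """Classic DTW between two feature sequences of shape (T, D)."""
    n, m = len(src), len(tgt)
    # Pairwise Euclidean distances between every source and target frame.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the last cell to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    return path[::-1]


def silence_aware_dtw(src, tgt, src_energy, tgt_energy, threshold=0.01):
    """Align only non-silent frames (assumed rule: energy above a threshold)."""
    src_idx = np.flatnonzero(src_energy > threshold)
    tgt_idx = np.flatnonzero(tgt_energy > threshold)
    if len(src_idx) == 0 or len(tgt_idx) == 0:
        raise ValueError("each utterance needs at least one non-silent frame")
    path = dtw_path(src[src_idx], tgt[tgt_idx])
    # Map positions in the filtered sequences back to original frame indices.
    return [(int(src_idx[i]), int(tgt_idx[j])) for i, j in path]
```

The returned list pairs each retained source frame with a target frame, which is the kind of correspondence needed to keep the source rhythm while borrowing segmental content from another utterance. A production system would more likely use a windowed or library-provided DTW than this quadratic loop.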

Abstract

Speaker anonymization (SA) systems modify timbre while leaving regional or non-native accents intact, which is problematic because accents can narrow the anonymity set. To address this issue, we present PHONOS, a streaming module for real-time SA that neutralizes non-native accents so that speech sounds native-like. Our approach pre-generates golden speaker utterances that preserve source timbre and rhythm but replace foreign segmentals with native ones using silence-aware DTW alignment and zero-shot voice conversion. These utterances supervise a causal accent translator that maps non-native content tokens to native equivalents with at most 40 ms of look-ahead, trained using joint cross-entropy and CTC losses. Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift, and reduced speaker linkability as accent-neutralized utterances move away from the original speaker in embedding space, while keeping latency under 241 ms on a single GPU.
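
One plausible way to write the joint objective is sketched below. The abstract names the two losses but not how they are combined, so the weighted sum and the weight $\lambda$ are assumptions, as are the symbols: $x$ is the non-native token stream, $y_t$ the frame-level native target taken from the golden utterance, $\tilde{y}$ the same targets with repeats collapsed, $k$ the number of look-ahead frames (covering at most 40 ms), and $\mathcal{B}$ the standard CTC collapsing map that removes repeats and blanks.

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{CE}}(\theta) + \lambda\,\mathcal{L}_{\mathrm{CTC}}(\theta)

\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid x_{1:t+k}\right)

\mathcal{L}_{\mathrm{CTC}}(\theta) = -\log \sum_{\pi \in \mathcal{B}^{-1}(\tilde{y})} \prod_{t=1}^{T} p_\theta\!\left(\pi_t \mid x_{1:t+k}\right)
```

Under this reading, the cross-entropy term rewards frame-accurate predictions against the aligned targets, while the CTC term tolerates small timing mismatches by summing over all frame-level paths $\pi$ that collapse to $\tilde{y}$. Both terms condition only on $x_{1:t+k}$, which is what makes the translator causal up to the look-ahead.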
