Mask2Flow-TSE: Two-Stage Target Speaker Extraction with Masking and Flow Matching

arXiv cs.AI / 3/16/2026

Key Points

Mask2Flow-TSE is a two-stage target speaker extraction (TSE) framework that combines discriminative masking with flow matching.
The first stage applies a discriminative mask for coarse separation; the second stage uses flow matching to refine the masked output toward the target speech.
Generative TSE methods synthesize speech from Gaussian noise and often need many iterative steps. Mask2Flow-TSE instead starts from the masked spectrogram, which enables high-quality reconstruction in a single inference step.
Experiments show that Mask2Flow-TSE, with approximately 85 million parameters, performs comparably to existing generative TSE methods.

Abstract

Target speaker extraction (TSE) extracts the target speaker's voice from overlapping speech mixtures given a reference utterance. Existing approaches typically fall into two categories: discriminative and generative. Discriminative methods apply time-frequency masking for fast inference but often over-suppress the target signal, while generative methods synthesize high-quality speech at the cost of numerous iterative steps. We propose Mask2Flow-TSE, a two-stage framework combining the strengths of both paradigms. The first stage applies discriminative masking for coarse separation, and the second stage employs flow matching to refine the output toward target speech. Unlike generative approaches that synthesize speech from Gaussian noise, our method starts from the masked spectrogram, enabling high-quality reconstruction in a single inference step. Experiments show that Mask2Flow-TSE achieves comparable performance to existing generative TSE methods with approximately 85M parameters.
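The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the summary does not specify the network architecture, the speaker conditioning, or the spectrogram representation. Everything below is an assumption made for illustration, including the function names `apply_mask` and `euler_refine`, the additive magnitude mixture, the hand-made over-suppressing mask, and the oracle velocity field `toy_velocity`, which stands in for a trained, reference-conditioned model.

```python
import numpy as np


def apply_mask(mixture_mag, mask):
    """Stage 1 (discriminative): element-wise time-frequency masking."""
    return mask * mixture_mag


def euler_refine(x_start, velocity_fn, num_steps=1):
    """Stage 2 (flow matching): Euler integration of dx/dt = v(x, t) from t=0 to t=1."""
    x = x_start.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x


# Toy magnitude spectrograms with shape (frequency bins, frames).
# Assumption: magnitudes add in the mixture, which real speech only approximates.
rng = np.random.default_rng(0)
target_mag = rng.uniform(0.5, 1.0, size=(4, 6))
interferer_mag = rng.uniform(0.0, 0.5, size=(4, 6))
mixture_mag = target_mag + interferer_mag

# Hypothetical stage-1 output: a ratio mask scaled by 0.8 to mimic the
# over-suppression that the abstract attributes to discriminative methods.
mask = 0.8 * target_mag / mixture_mag
coarse = apply_mask(mixture_mag, mask)


def toy_velocity(x, t):
    """Oracle stand-in for a learned velocity field (assumption, uses the true target).

    It returns the constant straight-line direction from the masked
    spectrogram to the target, so a single Euler step lands on the target.
    """
    return target_mag - coarse


# Start from the masked spectrogram instead of Gaussian noise; one step suffices here.
refined = euler_refine(coarse, toy_velocity, num_steps=1)

coarse_error = np.abs(coarse - target_mag).mean()
refined_error = np.abs(refined - target_mag).mean()
print(f"mean abs error after masking:    {coarse_error:.4f}")
print(f"mean abs error after refinement: {refined_error:.4f}")
```

The sketch only shows why the starting point matters: when the initial state is already close to the target and the velocity field is nearly straight, one Euler step can cover the remaining distance, whereas a path that starts from Gaussian noise typically needs many steps. A learned velocity field would be approximate, so the refinement error would not vanish as it does with this oracle.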
