What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification

arXiv cs.CL / 3/26/2026


Key Points

  • The paper argues that fixed-margin speaker-verification losses can be harmed by mislabeled or degraded samples because they inject noisy gradients and disrupt compact speaker manifolds.
  • It introduces Curry (CURriculum Ranking), an adaptive curriculum ranking loss that estimates per-sample difficulty online, using confidence scores derived from the dominant sub-center cosine similarity of Sub-center ArcFace and grouping samples into easy, medium, and hard tiers via running batch statistics.
  • The method uses learnable weights to guide training from stable identity learning toward later-stage manifold refinement and boundary sharpening, without requiring auxiliary annotations.
  • Experiments on VoxCeleb1-O and SITW report large EER reductions versus the Sub-center ArcFace baseline, with claimed improvements of 86.8% and 60.0%, respectively.
  • The authors also claim Curry is part of the largest-scale speaker verification training system reported to date, aiming at robust performance on imperfect large-scale datasets.
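The paper itself does not publish reference code, but the tiering mechanism described above can be sketched from the summary. The snippet below is a minimal NumPy illustration, not the authors' implementation: the function names (`dominant_subcenter_confidence`, `assign_tiers`), the mean-plus/minus-one-std tier thresholds, and the EMA update constant are all assumptions made for illustration.

```python
import numpy as np

def dominant_subcenter_confidence(embeddings, subcenters, labels):
    """Confidence = max cosine similarity to the true class's sub-centers.

    embeddings: (B, D) L2-normalized sample embeddings
    subcenters: (C, K, D) L2-normalized sub-center vectors (K per class)
    labels:     (B,) integer class labels
    """
    sims = np.einsum('bd,ckd->bck', embeddings, subcenters)   # (B, C, K)
    true_sims = sims[np.arange(len(labels)), labels]          # (B, K)
    return true_sims.max(axis=1)                              # dominant sub-center

def assign_tiers(conf, run_mean, run_std):
    """Split a batch into tiers using running statistics (assumed thresholds:
    easy above mean+std, hard below mean-std, medium in between)."""
    tiers = np.full(conf.shape, 1, dtype=int)   # 1 = medium
    tiers[conf > run_mean + run_std] = 0        # 0 = easy
    tiers[conf < run_mean - run_std] = 2        # 2 = hard
    return tiers

def update_running_stats(run_mean, run_std, conf, momentum=0.9):
    """EMA update of the running batch statistics (momentum is an assumption)."""
    new_mean = momentum * run_mean + (1 - momentum) * conf.mean()
    new_std = momentum * run_std + (1 - momentum) * conf.std()
    return new_mean, new_std
```

In a full system, each tier index would look up a learnable weight that scales that sample's loss term, so the schedule of those weights (rather than fixed thresholds) controls whether training emphasizes easy, medium, or hard samples at a given stage.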

Abstract

Speaker verification at large scale remains an open challenge, as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from the dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.