What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification

arXiv cs.CL / 3/26/2026


Key Points

  • The paper argues that fixed-margin speaker-verification losses can be harmed by mislabeled or degraded samples because they inject noisy gradients and disrupt compact speaker manifolds.
  • It introduces Curry (CURriculum Ranking), an adaptive curriculum ranking loss that estimates per-sample difficulty online, using confidence scores derived from the dominant sub-center cosine similarity of Sub-center ArcFace and grouping samples into easy, medium, and hard tiers via running batch statistics.
  • The method uses learnable weights to guide training from stable identity learning toward later-stage manifold refinement and boundary sharpening, without requiring auxiliary annotations.
  • Experiments on VoxCeleb1-O and SITW report large EER reductions versus the Sub-center ArcFace baseline, with claimed improvements of 86.8% and 60.0%, respectively.
  • The authors also claim Curry is part of the largest-scale speaker verification training system reported to date, aiming at robust performance on imperfect large-scale datasets.
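The paper itself does not publish reference code, but the tiering mechanism described above can be sketched from the summary. The snippet below is a minimal NumPy illustration, not the authors' implementation: the function names (`dominant_subcenter_confidence`, `assign_tiers`), the mean-plus/minus-one-std tier thresholds, and the EMA update constant are all assumptions made for illustration.

```python
import numpy as np

def dominant_subcenter_confidence(embeddings, subcenters, labels):
    """Confidence = max cosine similarity to the true class's sub-centers.

    embeddings: (B, D) L2-normalized sample embeddings
    subcenters: (C, K, D) L2-normalized sub-center vectors (K per class)
    labels:     (B,) integer class labels
    """
    sims = np.einsum('bd,ckd->bck', embeddings, subcenters)   # (B, C, K)
    true_sims = sims[np.arange(len(labels)), labels]          # (B, K)
    return true_sims.max(axis=1)                              # dominant sub-center

def assign_tiers(conf, run_mean, run_std):
    """Split a batch into tiers using running statistics (assumed thresholds:
    easy above mean+std, hard below mean-std, medium in between)."""
    tiers = np.full(conf.shape, 1, dtype=int)   # 1 = medium
    tiers[conf > run_mean + run_std] = 0        # 0 = easy
    tiers[conf < run_mean - run_std] = 2        # 2 = hard
    return tiers

def update_running_stats(run_mean, run_std, conf, momentum=0.9):
    """EMA update of the running batch statistics (momentum is an assumption)."""
    new_mean = momentum * run_mean + (1 - momentum) * conf.mean()
    new_std = momentum * run_std + (1 - momentum) * conf.std()
    return new_mean, new_std
```

In a full system, each tier index would look up a learnable weight that scales that sample's loss term, so the schedule of those weights (rather than fixed thresholds) controls whether training emphasizes easy, medium, or hard samples at a given stage.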

Abstract

Speaker verification at large scale remains an open challenge, as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from the dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for robust speaker verification on imperfect large-scale data.