S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

arXiv cs.AI · April 29, 2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces S-SONDO, the first framework for knowledge distillation of general audio foundation models that uses only the teachers’ output embeddings, requiring neither logits nor intermediate-layer alignment.
  • By eliminating assumptions about the teacher’s output format (e.g., supporting self-supervised/metric-learning models that emit embeddings only), S-SONDO is architecture-agnostic and broadly applicable.
  • Experiments show that two audio foundation models can be distilled into three smaller student models that are up to 61× smaller while preserving up to 96% of the teachers’ performance.
  • The authors also provide practical guidance on selecting loss functions and using clustering-based balanced sampling to improve distillation quality.
  • Reproducibility is supported by code released on GitHub (MedAliAdlouni/ssondo).
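The embedding-only setup described above amounts to matching the student's output embedding to the teacher's, with no access to logits or internal layers. The sketch below is a minimal NumPy illustration, not the paper's implementation: the linear projection head, the cosine-distance loss, and all dimensions are assumptions made for the example (the paper itself compares loss choices).

```python
import numpy as np

def cosine_distill_loss(student_emb, teacher_emb, proj):
    """Embedding-only distillation loss: project the student's embedding
    into the teacher's embedding space, then penalise cosine distance.
    Only final output embeddings are used -- no logits, no layer alignment."""
    z = student_emb @ proj                                   # (batch, d_teacher)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)         # unit-normalise student
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(z * t, axis=1)))       # mean cosine distance

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 768))   # frozen teacher embeddings (dim assumed)
student = rng.normal(size=(8, 128))   # smaller student embeddings (dim assumed)
proj = rng.normal(size=(128, 768))    # learnable projection head
loss = cosine_distill_loss(student, teacher, proj)
```

Because the loss depends only on the two embedding vectors, the same training loop works for any teacher architecture, including self-supervised and metric-learning models that never expose class logits.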

Abstract

General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
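The clustering-based balanced sampling mentioned above can be pictured as drawing an equal number of examples from each cluster of the (unlabelled) training embeddings, so the distillation data covers the teacher's embedding space evenly. This is a hypothetical sketch: the clustering step is assumed to have already produced `labels`, and the per-cluster quota and oversampling rule are illustrative, not the paper's exact procedure.

```python
import numpy as np

def balanced_sample(labels, per_cluster, rng):
    """Draw the same number of examples from every embedding cluster.
    `labels` holds a cluster id per training example (e.g. from k-means
    over teacher embeddings); small clusters are oversampled with replacement."""
    picks = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        replace = len(idx) < per_cluster   # oversample clusters that are too small
        picks.append(rng.choice(idx, size=per_cluster, replace=replace))
    return np.concatenate(picks)

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 5 + [2] * 45)   # imbalanced cluster sizes
sample = balanced_sample(labels, per_cluster=10, rng=rng)
```

After sampling, every cluster contributes exactly `per_cluster` examples per epoch, preventing dominant regions of the embedding space from swamping the distillation loss.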