S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models

arXiv cs.AI · April 29, 2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces S-SONDO, the first framework for knowledge distillation of general audio foundation models that uses only the teachers’ output embeddings, requiring neither logits nor intermediate-layer alignment.
  • By eliminating assumptions about the teacher’s output format (e.g., supporting self-supervised/metric-learning models that emit embeddings only), S-SONDO is architecture-agnostic and broadly applicable.
  • Experiments show that two audio foundation models can be distilled into three smaller student models that are up to 61× smaller while preserving up to 96% of the teachers’ performance.
  • The authors also provide practical guidance on selecting loss functions and using clustering-based balanced sampling to improve distillation quality.
  • Reproducibility is supported by code released on GitHub (MedAliAdlouni/ssondo).
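The embedding-only setup described above amounts to matching the student's output embedding to the teacher's, with no access to logits or internal layers. The sketch below is a minimal NumPy illustration, not the paper's implementation: the linear projection head, the cosine-distance loss, and all dimensions are assumptions made for the example (the paper itself compares loss choices).

```python
import numpy as np

def cosine_distill_loss(student_emb, teacher_emb, proj):
    """Embedding-only distillation loss: project the student's embedding
    into the teacher's embedding space, then penalise cosine distance.
    Only final output embeddings are used -- no logits, no layer alignment."""
    z = student_emb @ proj                                   # (batch, d_teacher)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)         # unit-normalise student
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(z * t, axis=1)))       # mean cosine distance

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 768))   # frozen teacher embeddings (dim assumed)
student = rng.normal(size=(8, 128))   # smaller student embeddings (dim assumed)
proj = rng.normal(size=(128, 768))    # learnable projection head
loss = cosine_distill_loss(student, teacher, proj)
```

Because the loss depends only on the two embedding vectors, the same training loop works for any teacher architecture, including self-supervised and metric-learning models that never expose class logits.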

Abstract

General audio foundation models have recently achieved remarkable progress, enabling strong performance across diverse tasks. However, state-of-the-art models remain extremely large, often with hundreds of millions of parameters, leading to high inference costs and limited deployability on edge devices. Knowledge distillation is a proven strategy for model compression, but prior work in audio has mostly focused on supervised settings, relying on class logits, intermediate features, or architecture-specific techniques. Such assumptions exclude models that output only embeddings, such as self-supervised or metric-learning models. We introduce S-SONDO (Self-Supervised KnOwledge DistillatioN for General AuDio FOundation Models), the first framework to distill general audio models using only their output embeddings. By avoiding the need for logits or layer-level alignment, S-SONDO is architecture-agnostic and broadly applicable to embedding-based teachers. We demonstrate its effectiveness by distilling two audio foundation models into three efficient students that are up to 61 times smaller while retaining up to 96% of teacher performance. We also provide practical insights on loss choice and clustering-based balanced data sampling. Code is available here: https://github.com/MedAliAdlouni/ssondo.
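The clustering-based balanced sampling mentioned above can be pictured as drawing an equal number of examples from each cluster of the (unlabelled) training embeddings, so the distillation data covers the teacher's embedding space evenly. This is a hypothetical sketch: the clustering step is assumed to have already produced `labels`, and the per-cluster quota and oversampling rule are illustrative, not the paper's exact procedure.

```python
import numpy as np

def balanced_sample(labels, per_cluster, rng):
    """Draw the same number of examples from every embedding cluster.
    `labels` holds a cluster id per training example (e.g. from k-means
    over teacher embeddings); small clusters are oversampled with replacement."""
    picks = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        replace = len(idx) < per_cluster   # oversample clusters that are too small
        picks.append(rng.choice(idx, size=per_cluster, replace=replace))
    return np.concatenate(picks)

rng = np.random.default_rng(0)
labels = np.array([0] * 50 + [1] * 5 + [2] * 45)   # imbalanced cluster sizes
sample = balanced_sample(labels, per_cluster=10, rng=rng)
```

After sampling, every cluster contributes exactly `per_cluster` examples per epoch, preventing dominant regions of the embedding space from swamping the distillation loss.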