HArnESS: Lightweight Distilled Arabic Speech Foundation Models

arXiv cs.CL / 4/17/2026


Key Points

HArnESS is a new Arabic-centric self-supervised speech model family designed to overcome the deployment limitations of large SSL models in resource-constrained settings.
The approach uses iterative self-distillation starting from a large bilingual Arabic-English teacher to train lightweight student models for ASR, dialect identification (DID), and speech emotion recognition (SER).
The paper also explores PCA-based compression of the teacher’s supervision signals to better fit the reduced capacity of shallower and thinner student architectures.
Experiments reportedly show consistent improvements over HuBERT and XLS-R on Arabic downstream tasks, with the compressed student models staying competitive even under substantial structural reduction.
Overall, HArnESS is presented as a practical, accessible foundation for real-world Arabic speech applications requiring strong accuracy-efficiency trade-offs.

Abstract

Large self-supervised speech (SSL) models achieve strong downstream performance, but their size limits deployment in resource-constrained settings. We present HArnESS, an Arabic-centric self-supervised speech model family trained from scratch with iterative self-distillation, together with lightweight student variants that offer strong accuracy-efficiency trade-offs on Automatic Speech Recognition (ASR), Dialect Identification (DID), and Speech Emotion Recognition (SER). Our approach begins with a large bilingual Arabic-English teacher and progressively distills its knowledge into compressed student models while preserving Arabic-relevant acoustic and paralinguistic representations. We further study PCA-based compression of the teacher supervision signal to better match the capacity of shallow and thin students. Compared with HuBERT and XLS-R, HArnESS consistently improves performance on Arabic downstream tasks, while the compressed models remain competitive under substantial structural reduction. These results position HArnESS as a practical and accessible Arabic-centric SSL foundation for real-world speech applications.
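
To make the PCA-based compression idea concrete, here is a minimal sketch of projecting frame-level teacher features onto a smaller number of principal components before they are used as supervision for a student. The abstract does not say which layer is compressed, what dimensions are used, or how the compressed signal enters the distillation loss, so the feature sizes (64 reduced to 16), the synthetic data, and the helper names `fit_pca` and `compress` are all assumptions made for illustration.

```python
import numpy as np


def fit_pca(features, n_components):
    """Fit PCA on frame-level features of shape (num_frames, feature_dim).

    Returns the feature mean, the top principal directions, and the
    fraction of total variance those directions retain.
    """
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]  # (n_components, feature_dim)
    variance = singular_values ** 2
    retained = variance[:n_components].sum() / variance.sum()
    return mean, components, retained


def compress(features, mean, components):
    """Project features onto the fitted principal directions."""
    return (features - mean) @ components.T


# Synthetic stand-in for teacher outputs (hypothetical sizes, not from the paper):
# 500 frames of 64-dim features that mostly live in an 8-dim subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 64))
teacher_features = latent @ mixing + 0.1 * rng.normal(size=(500, 64))

mean, components, retained = fit_pca(teacher_features, n_components=16)
student_targets = compress(teacher_features, mean, components)

print(student_targets.shape)
print(f"variance retained: {retained:.3f}")
```

In a setup like this, `student_targets` would stand in for the full-width teacher features as the signal a shallow or thin student is trained to match. The retained-variance figure gives a rough check on how much of the teacher signal survives the reduction.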