Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages

arXiv cs.CL / April 13, 2026


Key Points

  • The paper studies abusive speech detection for multilingual social media voice interactions, focusing on low-resource Indic languages where typical ASR→text pipelines can fail due to transcription errors and loss of prosody.
  • It evaluates Contrastive Language-Audio Pre-training (CLAP) representations for detecting abuse directly from audio, using the ADIMA dataset.
  • Experiments include few-shot supervised contrastive adaptation with cross-lingual learning and a leave-one-language-out setup, alongside zero-shot prompting for comparison.
  • Results show CLAP provides strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation can, in some cases, match the performance of fully supervised models trained on all available data.
  • The gains from few-shot adaptation vary by language and do not increase monotonically with the number of labeled examples, indicating that transfer remains incomplete and language-specific.
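The zero-shot prompting baseline mentioned above can be illustrated with a short sketch: CLAP embeds audio and text into a shared space, so a clip can be scored against class prompts by cosine similarity. The encoder functions and prompt wording below are placeholders, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_audio(_clip):
    # Placeholder for a real CLAP audio encoder (e.g. laion-clap);
    # returns a random unit vector of the joint embedding dimension.
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def embed_text(_prompt):
    # Placeholder for the matching CLAP text encoder.
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

# Hypothetical class prompts; the paper's exact prompt wording is not given here.
prompts = ["abusive speech", "non-abusive speech"]
text_embs = np.stack([embed_text(p) for p in prompts])

def classify_zero_shot(clip):
    a = embed_audio(clip)
    sims = text_embs @ a                        # cosine similarity (unit vectors)
    probs = np.exp(sims) / np.exp(sims).sum()   # softmax over class prompts
    return prompts[int(np.argmax(probs))], probs

label, probs = classify_zero_shot("clip.wav")
```

With real CLAP encoders, the only moving parts are the prompt strings, which is what makes this baseline attractive for languages with no labeled audio.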

Abstract

Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classification, but this pipeline is vulnerable to transcription errors and discards prosodic information carried in speech. We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. Using the ADIMA dataset, we evaluate CLAP-based representations under few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting included as an auxiliary analysis. Our results show that CLAP yields strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation achieves performance competitive with fully supervised systems trained on the complete training data. However, the benefits of few-shot adaptation are language-dependent and not monotonic with shot size. These findings suggest that contrastive audio-text models provide a promising basis for cross-lingual audio abuse detection in low-resource settings, while also indicating that transfer remains incomplete and language-specific in important ways.
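The "projection-only adaptation" in the abstract can be sketched as training a small head on top of frozen CLAP embeddings with a supervised contrastive (SupCon-style) loss, which pulls same-label clips together and pushes different-label clips apart. The embeddings, labels, and projection below are synthetic stand-ins; this shows only the loss computation, not the paper's training recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen CLAP-style audio embeddings for a few-shot batch (8 clips, dim 512),
# with binary abuse labels; both are synthetic stand-ins.
embs = rng.normal(size=(8, 512))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Lightweight projection head: the only trainable part in projection-only adaptation.
W = rng.normal(size=(512, 128)) * 0.02

def supcon_loss(z, y, temperature=0.1):
    """Per-anchor supervised contrastive loss over a batch of projections."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize features
    sim = z @ z.T / temperature                        # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos = (y[:, None] == y[None, :]) & ~np.eye(len(y), dtype=bool)
    # Average log-probability over each anchor's positives, negated.
    return -np.where(pos, log_prob, 0.0).sum(axis=1) / pos.sum(axis=1)

loss = supcon_loss(embs @ W, labels).mean()
```

Only `W` would receive gradients in such a setup, which keeps the number of trainable parameters small enough for few-shot regimes.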