Robust Audio-Text Retrieval via Cross-Modal Attention and Hybrid Loss

arXiv cs.CL / 4/28/2026


Key Points

  • The paper targets audio-text retrieval, aiming to better match noisy, long audio inputs to natural-language queries for tasks like multimedia search and accessibility.
  • It introduces a multimodal framework that refines audio and text embeddings with a cross-modal refinement module built from transformer-based projection, linear mapping, and bidirectional attention (see the sketch after this list).
  • To improve training stability without relying on large batches, it proposes a hybrid loss that combines cosine-similarity, L1, and contrastive objectives.
  • The method handles long-form and noisy audio (reported SNR range of 5 to 15) through silence-aware chunking and attention-based pooling.
  • Experiments on benchmark datasets show improved retrieval results compared with prior state-of-the-art approaches.
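
The refinement module is only named in the points above. The PyTorch snippet below is a minimal sketch of one plausible reading, assuming each modality is first linearly mapped into a shared space, refined by a small transformer encoder, and then passed through bidirectional cross-attention; the class name, dimensions, and single-layer depth are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

class CrossModalRefiner(nn.Module):
    """Sketch of a cross-modal embedding refinement module:
    linear mapping into a shared space, transformer-based projection,
    and bidirectional cross-attention between audio and text."""

    def __init__(self, d_audio=768, d_text=768, d_model=512, n_heads=8):
        super().__init__()
        # Linear mapping of each modality into a shared embedding space
        self.audio_proj = nn.Linear(d_audio, d_model)
        self.text_proj = nn.Linear(d_text, d_model)
        # Transformer-based projection (one encoder layer per modality; assumed depth)
        self.audio_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_enc = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Bidirectional cross-modal attention
        self.audio_to_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio_tokens, text_tokens):
        # audio_tokens: (B, Ta, d_audio), text_tokens: (B, Tt, d_text)
        a = self.audio_enc(self.audio_proj(audio_tokens))
        t = self.text_enc(self.text_proj(text_tokens))
        # Each modality queries the other; residual connections keep the original signal
        a_refined, _ = self.audio_to_text(query=a, key=t, value=t)
        t_refined, _ = self.text_to_audio(query=t, key=a, value=a)
        return a + a_refined, t + t_refined
```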

Abstract

Audio-text retrieval enables semantic alignment between audio content and natural language queries, supporting applications in multimedia search, accessibility, and surveillance. However, current state-of-the-art approaches struggle with long, noisy, and weakly labeled audio due to their reliance on contrastive learning and large-batch training. We propose a novel multimodal retrieval framework that refines audio and text embeddings using a cross-modal embedding refinement module combining transformer-based projection, linear mapping, and bidirectional attention. To further improve robustness, we introduce a hybrid loss function blending cosine similarity, L1, and contrastive objectives, enabling stable training even under small-batch constraints. Our approach efficiently handles long-form and noisy audio (SNR 5 to 15) via silence-aware chunking and attention-based pooling. Experiments on benchmark datasets demonstrate improvements over prior methods.
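
The abstract names the ingredients of the hybrid loss (cosine similarity, L1, and a contrastive objective) but not how they are combined. The sketch below shows one plausible combination, assuming equal term weights, a symmetric InfoNCE-style contrastive term, and a temperature of 0.07; none of these values come from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(audio_emb, text_emb, lambda_cos=1.0, lambda_l1=1.0,
                lambda_con=1.0, temperature=0.07):
    """Sketch of a hybrid loss: cosine-similarity + L1 + contrastive terms.
    Weights and temperature are illustrative assumptions."""
    a = F.normalize(audio_emb, dim=-1)   # (B, D) audio embeddings
    t = F.normalize(text_emb, dim=-1)    # (B, D) text embeddings

    # Cosine term: push matched pairs toward similarity 1
    cos_loss = (1.0 - (a * t).sum(dim=-1)).mean()

    # L1 term: element-wise distance between matched embeddings
    l1_loss = F.l1_loss(a, t)

    # Contrastive term: symmetric InfoNCE over the in-batch similarity matrix
    logits = a @ t.t() / temperature     # (B, B)
    targets = torch.arange(a.size(0), device=a.device)
    con_loss = 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

    return lambda_cos * cos_loss + lambda_l1 * l1_loss + lambda_con * con_loss
```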
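
Likewise, silence-aware chunking and attention-based pooling are only named in the abstract. One plausible reading is sketched below: long audio is split at low-energy windows, each chunk is embedded separately, and chunk embeddings are aggregated with a learned attention pool. The energy threshold, window length, maximum chunk duration, and pooling layer are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def silence_aware_chunks(waveform, sr=16000, win_s=0.5, energy_thresh=1e-3,
                         max_chunk_s=10.0):
    """Sketch: split a 1-D waveform tensor at low-energy (silent) windows,
    keeping each chunk under max_chunk_s seconds. Thresholds are assumed."""
    win_len = int(win_s * sr)
    max_len = int(max_chunk_s * sr)
    chunks, start = [], 0
    for i in range(0, len(waveform) - win_len, win_len):
        window = waveform[i:i + win_len]
        is_silent = window.pow(2).mean().item() < energy_thresh
        # Cut at a silent window, or force a cut once the chunk hits max length
        if (is_silent and i > start) or (i + win_len - start) >= max_len:
            chunks.append(waveform[start:i + win_len])
            start = i + win_len
    if start < len(waveform):
        chunks.append(waveform[start:])
    return chunks

class AttentionPool(nn.Module):
    """Sketch: pool per-chunk embeddings into one clip-level embedding
    with a learned attention weighting."""
    def __init__(self, d_model=512):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, chunk_embs):            # (num_chunks, d_model)
        weights = torch.softmax(self.score(chunk_embs), dim=0)
        return (weights * chunk_embs).sum(dim=0)
```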