SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts

arXiv cs.CV · March 25, 2026

Key Points

  • The paper argues that applying CLIP-style audio-visual models to localization and segmentation is difficult because simple token replacement and fixed prompts do not properly link audio embeddings to semantic context.
  • It introduces SOUPLE (Sound-aware Prompt Learning), which learns context tokens that are conditioned on visual features to better bridge audio semantics with the mask decoder.
  • In place of the fixed template, SOUPLE's learnable prompt context tokens aim to establish stronger correspondence between the audio-embedded token [V_A] and the surrounding visual context.
  • Experiments on VGGSound, SoundNet, and AVSBench show improved audio-visual localization and segmentation performance compared with prior prompt/token approaches.

Abstract

Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the prompt "a photo of a [V_A]" fails to establish meaningful connections between audio embeddings and context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces fixed prompts with learnable context tokens. These tokens incorporate visual features to generate conditional context for a mask decoder, effectively bridging semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
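The core idea of the abstract — swapping the fixed "a photo of a [V_A]" template for learnable context tokens that are shifted by visual features before the audio-embedded token is appended — can be sketched numerically. This is a minimal illustration, not the paper's implementation: the embedding width `D`, context length `M`, and the linear "meta-net" that conditions the context on the image are all assumptions for the sketch, and the random weights stand in for parameters that would be trained by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 512  # token embedding width (assumed, e.g. CLIP ViT-B text width)
M = 4    # number of learnable context tokens (assumed)

# Learnable context tokens replacing the fixed words "a photo of a";
# in training these would be optimized, here they are random stand-ins.
ctx = rng.normal(scale=0.02, size=(M, D))

# Audio-embedded token [V_A]: audio features projected into token space.
audio_token = rng.normal(size=(D,))

# Visual features condition the context via a hypothetical linear meta-net,
# producing an image-dependent shift added to every context token.
visual_feat = rng.normal(size=(D,))
W_meta = rng.normal(scale=0.02, size=(D, D))
visual_bias = visual_feat @ W_meta  # shape (D,)

# Conditional prompt: shifted context tokens followed by [V_A],
# which takes the slot a class name would occupy in a CLIP prompt.
prompt_tokens = np.concatenate([ctx + visual_bias, audio_token[None, :]], axis=0)

print(prompt_tokens.shape)  # (M + 1, D) = (5, 512)
```

The resulting `(M + 1, D)` token sequence is what a text/prompt encoder would consume in place of the tokenized fixed template, so the context seen by the encoder varies with the input image rather than being static.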