Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces a training-free sound source localization (SSL) framework that leverages the intrinsic reasoning abilities of multimodal large language models (MLLMs), rather than relying on contrastive feature matching alone.
  • It proposes a three-stage Generation–Analysis–Refinement (GAR) pipeline: the Generation stage produces initial bounding boxes and an audio classification, and the Analysis stage then scores audio-visual consistency via open-set role tagging and anchor voting.
  • In the refinement step, the method uses adaptive gating to avoid unnecessary updates, aiming to improve reliability in complex acoustic scenes.
  • Experiments on single-source and multi-source benchmarks show competitive localization performance, and the authors release the source code on GitHub.
  • The work positions explicit reasoning and verification as key missing components in prior self-supervised sound localization approaches and demonstrates how MLLMs can supply that capability.

Abstract

The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between the audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.
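To make the three-stage control flow concrete, here is a minimal sketch of the GAR pipeline as described in the abstract. Every function body below is a hypothetical stub: the actual method queries an MLLM at each stage, and the box coordinates, consistency scores, and gating threshold here are illustrative placeholders, not values from the paper.

```python
def generate(frame, audio):
    # Stage 1 (Generation): an MLLM would propose bounding boxes for
    # candidate sound sources and classify the audio. Stubbed here with
    # two dummy boxes (x1, y1, x2, y2) and a dummy audio label.
    boxes = [(40, 60, 120, 160), (200, 30, 260, 90)]
    audio_label = "dog barking"
    return boxes, audio_label

def analyze(boxes, audio_label, frame):
    # Stage 2 (Analysis): score audio-visual consistency per box via
    # open-set role tagging and anchor voting. Simulated here as a fixed
    # score per box index (pretend box 0 contains the barking dog).
    return {0: 0.9, 1: 0.3}

def refine(boxes, scores, threshold=0.5):
    # Stage 3 (Refinement): adaptive gating -- boxes whose consistency
    # already clears the threshold are kept unchanged; only low-scoring
    # boxes would be re-queried, avoiding unnecessary updates.
    kept = [b for i, b in enumerate(boxes) if scores[i] >= threshold]
    needs_update = [b for i, b in enumerate(boxes) if scores[i] < threshold]
    return kept, needs_update

def gar_pipeline(frame, audio):
    boxes, audio_label = generate(frame, audio)
    scores = analyze(boxes, audio_label, frame)
    return refine(boxes, scores)

kept, needs_update = gar_pipeline(frame=None, audio=None)
print(kept)          # high-consistency boxes pass the gate untouched
print(needs_update)  # low-consistency boxes would trigger refinement
```

The key design point the gating illustrates: refinement is conditional, so confident localizations are never perturbed by a second MLLM pass, which is what the authors mean by "preventing unnecessary adjustments."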