Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

arXiv cs.CV / 4/16/2026

Key Points

  • The paper addresses a reliability bottleneck in audio-visual language models caused by cross-modal hallucination: models often rely on video “shortcuts” to report sounds that are not actually present in the audio track.
  • It proposes Audio-Contrastive Preference Optimization (ACPO), a dual-axis preference learning method whose output-contrastive objective penalizes visual descriptions masquerading as audio facts.
  • ACPO also applies an input-contrastive objective that swaps audio tracks and penalizes generations that remain invariant to the true auditory signal (a minimal sketch of both pair types follows this list).
  • Experiments reported in the paper show that ACPO improves faithful audio grounding and reduces video-driven audio hallucination while preserving broader multimodal performance.
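
To make the two contrastive axes concrete, here is a minimal sketch of how such preference pairs could be constructed. All names here (`AVSample`, the caption fields, `output_contrastive_pair`, `input_contrastive_pair`) are hypothetical illustrations, not the paper's actual data pipeline.

```python
from dataclasses import dataclass
import random

@dataclass
class AVSample:
    video: str             # path to the video clip (hypothetical field names)
    audio: str             # path to its paired audio track
    faithful_caption: str  # audio description grounded in the real track
    visual_caption: str    # plausible sounds inferred from visuals alone

def output_contrastive_pair(s: AVSample) -> dict:
    """Same (video, audio) input; prefer the audio-grounded answer so that
    visual descriptions masquerading as audio facts are penalized."""
    return {"video": s.video, "audio": s.audio,
            "chosen": s.faithful_caption, "rejected": s.visual_caption}

def input_contrastive_pair(s: AVSample, pool: list[AVSample]) -> dict:
    """Swap in a mismatched audio track; the caption that ignores the swap
    (i.e., stays invariant to what is actually heard) becomes rejected.
    Assumes the pool contains at least one sample with a different track."""
    swapped = random.choice([p for p in pool if p.audio != s.audio])
    return {"video": s.video,
            "audio": swapped.audio,              # audio no longer matches the video
            "chosen": swapped.faithful_caption,  # grounded in the swapped track
            "rejected": s.faithful_caption}      # invariant to the audio swap
```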

Abstract

While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.
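
The abstract casts both axes as preference objectives but does not spell out the loss in this summary. Assuming ACPO builds on a standard DPO-style formulation (Rafailov et al., 2023), a minimal sketch over pairs like those above would be:

```python
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    ref_chosen_logps: torch.Tensor,
                    ref_rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """DPO-style objective: raise the policy's margin for the chosen
    (audio-grounded) response over the rejected (video-shortcut or
    audio-invariant) one, relative to a frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Under this framing, the output-contrastive and input-contrastive pairs can share a single loss; only the construction of the chosen and rejected responses differs between the two axes.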