3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio

arXiv cs.CV / 3/25/2026



Key Points

The paper introduces VIRST-Audio, a framework for Audio-based Referring Video Object Segmentation (ARVOS) that grounds audio queries into pixel-level, time-consistent object masks.
Instead of training on audio directly, it transcribes the audio query with an ASR module and segments with a pretrained RVOS model built on a vision-language architecture, relying on text-based supervision.
To enhance robustness, VIRST-Audio adds an existence-aware gating mechanism that detects whether the target is present in the video and suppresses segmentation when absent to reduce hallucinated masks.
The method is evaluated on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, indicating strong generalization to audio-driven referring scenarios.
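
The flow described in the points above (transcribe, segment from text, then gate on target existence) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `segment_from_audio` and `gate_masks`, the three callables, the scalar video-level existence score, and the 0.5 threshold are all hypothetical, since the summary does not say how the gate is computed or applied.

```python
from typing import Callable, List, Sequence

# One binary mask per frame, stored as rows of 0/1 pixels (assumed layout).
Mask = List[List[int]]


def gate_masks(masks: Sequence[Mask], existence_score: float,
               threshold: float = 0.5) -> List[Mask]:
    """Keep the masks if the target is judged present, else blank them.

    The scalar score and the threshold are assumptions made for this sketch.
    """
    if existence_score >= threshold:
        return [[list(row) for row in frame] for frame in masks]
    # Target judged absent: emit all-zero masks of the same shape.
    return [[[0 for _ in row] for row in frame] for frame in masks]


def segment_from_audio(
    audio: bytes,
    frames: Sequence[object],
    transcribe: Callable[[bytes], str],                       # hypothetical ASR module
    segment: Callable[[Sequence[object], str], List[Mask]],   # hypothetical text-based RVOS model
    score_existence: Callable[[Sequence[object], str], float],  # hypothetical existence estimator
    threshold: float = 0.5,
) -> List[Mask]:
    """Audio query -> transcript -> text-conditioned masks -> existence gate."""
    text = transcribe(audio)
    masks = segment(frames, text)
    score = score_existence(frames, text)
    return gate_masks(masks, score, threshold)
```

With stub callables standing in for the three models, a score below the threshold yields empty masks for every frame, which is the behavior intended to reduce hallucinated masks when the referred object never appears.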

Abstract

Audio-based Referring Video Object Segmentation (ARVOS) requires grounding audio queries into pixel-level object masks over time, posing challenges in bridging acoustic signals with spatio-temporal visual representations. In this report, we present VIRST-Audio, a practical framework built upon a pretrained RVOS model integrated with a vision-language architecture. Instead of relying on audio-specific training, we convert input audio into text using an ASR module and perform segmentation using text-based supervision, enabling effective transfer from text-based reasoning to audio-driven scenarios. To improve robustness, we further incorporate an existence-aware gating mechanism that estimates whether the referred target object is present in the video and suppresses predictions when it is absent, reducing hallucinated masks and stabilizing segmentation behavior. We evaluate our approach on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio achieves 3rd place, demonstrating strong generalization and reliable performance in audio-based referring video segmentation.