3rd Place of MeViS-Audio Track of the 5th PVUW: VIRST-Audio
arXiv cs.CV / 3/25/2026
Key Points
- The paper introduces VIRST-Audio, a framework for Audio-based Referring Video Object Segmentation (ARVOS) that grounds audio queries into pixel-level, time-consistent object masks.
- Instead of training directly on audio, it converts audio to text via an ASR module and uses a pretrained RVOS model with a vision-language architecture for text-supervised segmentation.
- To enhance robustness, VIRST-Audio adds an existence-aware gating mechanism that detects whether the target is present in the video and suppresses segmentation when absent to reduce hallucinated masks.
- The method is evaluated on the MeViS-Audio track of the 5th PVUW Challenge, where VIRST-Audio places 3rd, suggesting that the text-mediated approach generalizes to audio-driven referring scenarios.
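
The pipeline described in the points above (ASR, then text-based RVOS, then existence-aware gating) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `RvosOutput`, `gate_masks`, `segment_from_audio`, the scalar `existence_score`, and the `0.5` threshold are all assumptions made for the example, and the ASR and RVOS models are replaced by stand-in stub functions.

```python
from dataclasses import dataclass
from typing import Callable, List

Mask = List[List[int]]  # one binary mask per frame, stored as rows of 0/1


@dataclass
class RvosOutput:
    masks: List[Mask]       # per-frame masks predicted for the text query
    existence_score: float  # hypothetical confidence in [0, 1] that the target appears


def gate_masks(output: RvosOutput, threshold: float = 0.5) -> List[Mask]:
    """Return the masks unchanged if the target is judged present, else all-zero masks."""
    if output.existence_score >= threshold:
        return output.masks
    # Target judged absent: suppress segmentation to avoid hallucinated masks.
    return [[[0] * len(row) for row in mask] for mask in output.masks]


def segment_from_audio(
    audio: bytes,
    frames: List[str],
    transcribe: Callable[[bytes], str],
    segment: Callable[[List[str], str], RvosOutput],
    threshold: float = 0.5,
) -> List[Mask]:
    """Audio query -> transcript -> text-based RVOS -> existence gating."""
    text = transcribe(audio)        # ASR step (any speech-to-text model)
    output = segment(frames, text)  # pretrained text-based RVOS step
    return gate_masks(output, threshold)


# Stand-in stubs so the sketch runs without any real models (assumptions only).
def fake_asr_dog(audio: bytes) -> str:
    return "the dog running to the left"


def fake_asr_cat(audio: bytes) -> str:
    return "the cat sitting on the sofa"


def fake_rvos(frames: List[str], text: str) -> RvosOutput:
    # Pretend the video contains a dog and nothing else.
    score = 0.9 if "dog" in text else 0.1
    return RvosOutput(masks=[[[1, 0], [0, 0]] for _ in frames], existence_score=score)


frames = ["frame0", "frame1"]
present = segment_from_audio(b"", frames, fake_asr_dog, fake_rvos)
absent = segment_from_audio(b"", frames, fake_asr_cat, fake_rvos)
```

With these stubs, `present` keeps the predicted masks for both frames, while `absent` comes back as all-zero masks because the gate judges the queried object to be missing. How the actual system computes its existence signal is not specified in the summary above.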