The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation

arXiv cs.CV / 4/2/2026


Key Points

  • The winning solution for the 5th PVUW MeViS-Text Challenge tackles referring video object segmentation using motion-centric language expressions by jointly modeling appearance, temporal behavior, and object interactions.
  • It proposes a fully training-free, three-stage pipeline that combines multimodal LLMs with SAM3: Gemini-3.1 Pro generates instance-level grounding targets and selects the clearest frame; SAM3-agent then creates a seed mask, and the SAM3 tracker propagates it across the video.
  • A final refinement step uses Qwen3.5-Plus with behavior-level verification to fix ambiguous or semantically inconsistent mask predictions, without any task-specific fine-tuning.
  • The approach reportedly achieves first place on the PVUW 2026 MeViS-Text test set with a Final score of 0.909064 and a J&F score of 0.7897, and the code is released publicly.
  • The work demonstrates that strong multimodal LLM prompting combined with SAM3-style segmentation/tracking can yield top performance without specialized training for the task.

Abstract

This report presents our winning solution to the 5th PVUW MeViS-Text Challenge. The track studies referring video object segmentation under motion-centric language expressions, where the model must jointly understand appearance, temporal behavior, and object interactions. To address this problem, we build a fully training-free pipeline that combines strong multimodal large language models with SAM3. Our method contains three stages. First, Gemini-3.1 Pro decomposes each target event into instance-level grounding targets, selects the frame where the target is most clearly visible, and generates a discriminative description. Second, SAM3-agent produces a precise seed mask on the selected frame, and the official SAM3 tracker propagates the mask through the whole video. Third, a refinement stage uses Qwen3.5-Plus and behavior-level verification to correct ambiguous or semantically inconsistent predictions. Without task-specific fine-tuning, our method ranks first on the PVUW 2026 MeViS-Text test set, achieving a Final score of 0.909064 and a J&F score of 0.7897. The code is available at https://github.com/Moujuruo/MeViSv2_Track_Solution_2026.
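The three-stage pipeline in the abstract can be sketched as a simple orchestration loop. The function names, data shapes, and stub behaviors below are all hypothetical placeholders for illustration; they do not reflect the authors' actual code or the real Gemini/SAM3/Qwen APIs.

```python
# Hypothetical sketch of the training-free, three-stage pipeline described in
# the abstract. All names and return values are illustrative stubs, not the
# authors' API.
from dataclasses import dataclass


@dataclass
class GroundingPlan:
    """Stage 1 output: instance-level target, clearest frame, description."""
    target: str
    best_frame: int
    description: str


def stage1_ground(expression: str, num_frames: int) -> GroundingPlan:
    # Placeholder for Gemini-3.1 Pro: decompose the target event into an
    # instance-level grounding target, pick the frame where it is most
    # clearly visible, and write a discriminative description.
    return GroundingPlan(target=expression.split()[0],
                         best_frame=num_frames // 2,
                         description=f"the {expression}")


def stage2_segment_and_track(plan: GroundingPlan,
                             num_frames: int) -> list[set[int]]:
    # Placeholder for SAM3-agent (precise seed mask on the selected frame)
    # followed by the official SAM3 tracker (propagate the mask through the
    # whole video). Masks here are dummy sets of pixel indices.
    seed_mask = {0, 1, 2}
    return [set(seed_mask) for _ in range(num_frames)]


def stage3_refine(masks: list[set[int]],
                  plan: GroundingPlan) -> list[set[int]]:
    # Placeholder for Qwen3.5-Plus behavior-level verification: correct or
    # drop ambiguous / semantically inconsistent predictions. The real check
    # is semantic; this stub trivially keeps every mask.
    return [m for m in masks]


def run_pipeline(expression: str, num_frames: int) -> list[set[int]]:
    plan = stage1_ground(expression, num_frames)
    masks = stage2_segment_and_track(plan, num_frames)
    return stage3_refine(masks, plan)


masks = run_pipeline("dog running to the left", num_frames=8)
print(len(masks))  # one mask per frame
```

The key design point the paper emphasizes is that each stage is a frozen, off-the-shelf component: no stage is fine-tuned on MeViS, so all task adaptation happens through prompting and verification.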