The 1st Winner for 5th PVUW MeViS-Text Challenge: Strong MLLMs Meet SAM3 for Referring Video Object Segmentation
arXiv cs.CV / 4/2/2026
Key Points
- The winning solution for the 5th PVUW MeViS-Text Challenge tackles referring video object segmentation from motion-centric language expressions, a setting that requires jointly modeling appearance, temporal behavior, and object interactions.
- It proposes a fully training-free, three-stage pipeline that combines multimodal LLMs with SAM3: Gemini-3.1 Pro generates instance-level grounding targets and selects the clearest frame, while SAM3-agent creates a seed mask and the SAM3 tracker propagates it across the video.
- A final refinement step uses Qwen3.5-Plus together with behavior-level verification to fix ambiguous or semantically inconsistent mask predictions, without any task-specific fine-tuning.
- The approach reportedly achieves first place on the PVUW 2026 MeViS-Text test set with a Final score of 0.909064 and a J&F score of 0.7897, and the code is released publicly.
- The work demonstrates that strong multimodal LLM prompting combined with SAM3-style segmentation/tracking can yield top performance without specialized training for the task.
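The three-stage pipeline described above can be sketched as a simple orchestration loop. This is a minimal illustrative sketch: every function below is a hypothetical stub standing in for the real components (Gemini-3.1 Pro for grounding and frame selection, the SAM3-agent for seeding, the SAM3 tracker for propagation, and Qwen3.5-Plus for verification), none of which are real APIs here.

```python
from dataclasses import dataclass

@dataclass
class Grounding:
    target: str          # instance-level target parsed from the expression
    clearest_frame: int  # index of the frame chosen for mask seeding

def mllm_ground(expression: str, num_frames: int) -> Grounding:
    """Stage 1 (hypothetical stub): an MLLM turns the motion-centric
    expression into a concrete instance target and picks the clearest frame.
    Here we trivially pick the middle frame."""
    return Grounding(target=expression, clearest_frame=num_frames // 2)

def seed_mask(frame, target: str):
    """Stage 2a (hypothetical stub): a SAM3-style agent produces a seed
    mask on the chosen frame. A 'mask' here is just a set of pixel indices."""
    return {i for i, v in enumerate(frame) if v > 0}

def propagate(frames, start: int, mask):
    """Stage 2b (hypothetical stub): the tracker propagates the seed mask
    to every frame, forward and backward from the seed frame."""
    return [set(mask) for _ in frames]

def verify_and_refine(masks, expression: str):
    """Stage 3 (hypothetical stub): behavior-level verification rejects
    masks inconsistent with the expression (here: empty masks)."""
    kept = [m for m in masks if m]
    return kept if kept else masks

def rvos_pipeline(frames, expression: str):
    """Training-free pipeline: ground -> seed -> propagate -> verify."""
    g = mllm_ground(expression, len(frames))
    m0 = seed_mask(frames[g.clearest_frame], g.target)
    masks = propagate(frames, g.clearest_frame, m0)
    return verify_and_refine(masks, expression)
```

For example, `rvos_pipeline([[0, 1, 1], [1, 1, 0], [0, 0, 1]], "the cat turning left")` seeds on the middle frame and returns one mask per input frame. The point of the sketch is the control flow the paper reports: all learning lives in the pretrained components, and the pipeline itself is pure orchestration.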