X2SAM: Any Segmentation in Images and Videos
arXiv cs.CV / 5/5/2026
Key Points
- X2SAM is a unified multimodal segmentation LLM that extends any-segmentation from images to videos while supporting both conversational instructions and visual prompts.
- It pairs an LLM with a Mask Memory module that carries mask information across video frames, keeping mask generation temporally consistent (see the sketch after this list).
- The model is designed to handle a wide range of segmentation tasks, including open-vocabulary, referring, grounded conversation generation, interactive, and visual-grounded segmentation for both images and videos.
- The authors introduce the Video Visual Grounded (V-VGD) benchmark to evaluate whether models can segment object tracks in videos using interactive visual prompts.
- According to the abstract, joint training on heterogeneous image and video datasets yields strong video segmentation while keeping image segmentation competitive and preserving general image/video chat ability.
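
The abstract does not describe how the Mask Memory module works internally. As a purely illustrative sketch (not the paper's actual design; all names hypothetical), one simple way a per-object mask memory could enforce temporal consistency is to smooth each object's mask logits across frames with an exponential moving average:

```python
import numpy as np

class MaskMemory:
    """Toy mask memory: remembers the latest mask logits per object and
    blends them with the current frame's prediction to reduce flicker.
    Illustrative only; not X2SAM's actual module."""

    def __init__(self, momentum: float = 0.8):
        self.momentum = momentum  # weight on the remembered state
        self.state = {}           # object_id -> mask logits, shape (H, W)

    def update(self, object_id: int, mask_logits: np.ndarray) -> np.ndarray:
        prev = self.state.get(object_id)
        if prev is not None:
            # Exponential moving average over frames smooths per-pixel logits.
            mask_logits = self.momentum * prev + (1 - self.momentum) * mask_logits
        self.state[object_id] = mask_logits
        return mask_logits

# Usage over a short synthetic "video": per-frame logits get smoothed,
# then thresholded into a binary mask for the tracked object.
memory = MaskMemory(momentum=0.8)
rng = np.random.default_rng(0)
for frame_idx in range(3):
    raw_logits = rng.normal(size=(4, 4))  # stand-in for the model's output
    smoothed = memory.update(object_id=1, mask_logits=raw_logits)
    binary_mask = smoothed > 0
    print(frame_idx, int(binary_mask.sum()))
```
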