X2SAM: Any Segmentation in Images and Videos

arXiv cs.CV / May 5, 2026

Key Points

  • X2SAM is a unified multimodal segmentation LLM that extends any-segmentation from images to videos while supporting both conversational instructions and visual prompts.
  • It couples an LLM with a Mask Memory module that stores guided vision features, keeping mask generation temporally consistent across video frames (see the sketch after this list).
  • The model is designed to handle a wide range of segmentation tasks, including generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation, for both images and videos.
  • The authors introduce the Video Visual Grounded (V-VGD) benchmark to evaluate whether models can segment object tracks in videos using interactive visual prompts.
  • According to the abstract, joint training over heterogeneous image and video datasets yields strong video segmentation results while remaining competitive on image segmentation benchmarks and preserving general image/video chat ability.
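
The abstract describes the Mask Memory only as storing "guided vision features" for temporally consistent video mask generation, without giving its internals. The sketch below shows one plausible reading in PyTorch: prompt-guided features from past frames sit in a bounded queue and are fused into the current frame by cross-attention before mask decoding. Everything here (class name, dimensions, the attention-based fusion) is an illustrative assumption, not X2SAM's published implementation.

```python
import torch
import torch.nn as nn

class MaskMemory(nn.Module):
    """Hypothetical sketch of a mask-memory mechanism: keep prompt-guided
    vision features from past frames and fuse them into the current frame
    via cross-attention. Names and sizes are illustrative only."""

    def __init__(self, dim: int = 256, max_frames: int = 8, num_heads: int = 8):
        super().__init__()
        self.max_frames = max_frames
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory: list[torch.Tensor] = []  # one (B, N, dim) entry per past frame

    def reset(self) -> None:
        """Clear the memory at the start of each new video clip."""
        self.memory.clear()

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, dim) prompt-guided tokens for the current frame.
        if self.memory:
            mem = torch.cat(self.memory, dim=1)          # (B, T*N, dim)
            fused, _ = self.attn(frame_feats, mem, mem)  # cross-attend to the past
            frame_feats = self.norm(frame_feats + fused)
        # Store a detached copy and keep only a fixed temporal window.
        self.memory.append(frame_feats.detach())
        if len(self.memory) > self.max_frames:
            self.memory.pop(0)
        return frame_feats
```

Under this reading, the mask decoder consumes the fused features, so evidence from earlier frames of the same object flows into each new frame's prediction, which is what keeps masks consistent across the clip.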

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
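
The abstract does not spell out V-VGD's protocol beyond "segment object tracks in videos from interactive visual prompts," but a benchmark of that shape typically prompts the model on one frame and scores the predicted mask track against a ground-truth track. The harness below is a hypothetical sketch of such an evaluation; `model.segment_track`, the clip fields, and the Jaccard-style track metric are assumptions, not the paper's published interface.

```python
import numpy as np

def track_iou(pred_masks: list[np.ndarray], gt_masks: list[np.ndarray]) -> float:
    """Mean per-frame region similarity (Jaccard) over one object track."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

def evaluate_v_vgd(model, clips) -> float:
    """Hypothetical V-VGD-style loop: give the model a box prompt on the
    first frame, ask for the full mask track, and average track IoU."""
    scores = []
    for clip in clips:
        # Assumed fields: clip.frames is a list of images, clip.prompt_box
        # is (x0, y0, x1, y1) on frame 0, clip.gt_masks holds one boolean
        # mask per frame for the prompted object.
        pred_masks = model.segment_track(clip.frames, box=clip.prompt_box)
        scores.append(track_iou(pred_masks, clip.gt_masks))
    return float(np.mean(scores))
```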