VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
arXiv cs.CV / 3/31/2026
Key Points
- The paper introduces VIRST, an end-to-end Video-Instructed Reasoning Assistant for Referring Video Object Segmentation (RVOS). It targets the failures of keyframe-based RVOS pipelines on fast motion and reasoning-heavy queries.
- VIRST unifies global video reasoning with pixel-level mask prediction in a single model rather than coupling a vision-language model with a separate propagation module.
- The Spatio-Temporal Fusion (STF) module bridges semantic and segmentation representations by injecting segmentation-aware video features into the vision-language backbone.
- A Temporal Dynamic Anchor Updater maintains temporally adjacent anchor frames to provide stable temporal cues despite large motion, occlusion, and object reappearance.
- The authors report state-of-the-art performance on multiple RVOS benchmarks and strong generalization across both referring and reasoning-oriented settings, with code and checkpoints released on GitHub.
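
To make the anchor-frame idea concrete, here is a minimal Python sketch of a rolling anchor buffer. It is not the paper's Temporal Dynamic Anchor Updater: the summary above does not give the actual update rule, so the class name `AnchorBuffer`, the per-frame `confidence` score, and the `max_anchors` and `min_confidence` parameters are all hypothetical. The sketch assumes that a frame becomes an anchor only when its predicted mask is trusted, and that the oldest anchor is dropped once the buffer is full, so the anchors stay close in time to the current frame.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Anchor:
    frame_index: int
    confidence: float  # hypothetical mask-quality score in [0, 1]


class AnchorBuffer:
    """Illustrative rolling set of recent, trusted anchor frames.

    Assumption: this update rule is invented for illustration and is
    not taken from the VIRST paper or its released code.
    """

    def __init__(self, max_anchors: int = 3, min_confidence: float = 0.5):
        self.min_confidence = min_confidence
        # deque(maxlen=...) evicts the oldest anchor automatically.
        self._anchors = deque(maxlen=max_anchors)

    def update(self, frame_index: int, confidence: float) -> None:
        # Skip frames with an untrusted mask (e.g. during occlusion),
        # so the previous anchors are retained until the object reappears.
        if confidence >= self.min_confidence:
            self._anchors.append(Anchor(frame_index, confidence))

    def anchors(self) -> list[int]:
        # Frame indices of the current anchors, oldest first.
        return [a.frame_index for a in self._anchors]


if __name__ == "__main__":
    # Simulated per-frame scores; frames 3 and 4 mimic an occlusion.
    scores = [0.9, 0.8, 0.85, 0.1, 0.2, 0.7, 0.9]
    buffer = AnchorBuffer(max_anchors=3, min_confidence=0.5)
    for index, score in enumerate(scores):
        buffer.update(index, score)
        print(index, buffer.anchors())
```

In this toy run the buffer keeps its pre-occlusion anchors through frames 3 and 4 and resumes updating at frame 5, which is the kind of behaviour the key point describes for occlusion and reappearance. A real system would presumably derive the trust signal from the model itself rather than from a fixed threshold.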



