SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
arXiv cs.AI / 3/16/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- SPARROW introduces Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and a dual-prompt design that decodes box and segmentation tokens, fusing geometric priors with semantic grounding in pixel-grounded video MLLMs.
- It operates end-to-end without external detectors, leveraging a SAM2-based proposer, and has been integrated into three open-source video MLLMs (UniPixel, GLUS, VideoGLaMM) with consistent performance gains.
- The approach is evaluated on a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs, achieving improvements such as up to +8.9 J&F on RVOS, +5 mIoU on visual grounding, and +5.4 CLAIR on GCG.
- Overall, SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding, signaling stronger temporally consistent grounding for video AI systems.
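The dual-prompt idea above — combining a decoded box with per-pixel segmentation logits — can be illustrated with a minimal sketch. This is not SPARROW's actual fusion mechanism (the paper's internals are not given here); the function name, the additive-logit scheme, and the `box_weight` knob are all hypothetical, chosen only to show one simple way a geometric box prior could gate a mask prediction.

```python
import numpy as np

def fuse_box_and_mask(mask_logits, box, box_weight=2.0):
    """Hypothetical fusion of a box prior with mask logits.

    mask_logits: (H, W) raw segmentation logits for one referent.
    box: (x1, y1, x2, y2) predicted box in pixel coordinates.
    box_weight: additive logit bonus inside the box (illustrative knob).
    Returns a boolean (H, W) mask after sigmoid thresholding at 0.5.
    """
    h, w = mask_logits.shape
    prior = np.zeros((h, w), dtype=mask_logits.dtype)
    x1, y1, x2, y2 = box
    prior[y1:y2, x1:x2] = box_weight   # boost logits inside the box
    fused = mask_logits + prior
    probs = 1.0 / (1.0 + np.exp(-fused))  # sigmoid
    return probs > 0.5

# Toy example: weak mask evidence (-1 logit everywhere) becomes
# confident only where the box prior agrees.
logits = np.full((8, 8), -1.0)
mask = fuse_box_and_mask(logits, box=(2, 2, 6, 6))
print(int(mask.sum()))  # prints 16 (the 4x4 box interior)
```

A soft additive prior like this, rather than a hard crop, lets mask evidence outside the box still survive if it is strong enough, which is one plausible way to trade off geometric and semantic cues.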