SPARROW: Learning Spatial Precision and Temporal Referential Consistency in Pixel-Grounded Video MLLMs
arXiv cs.AI · March 16, 2026
Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- SPARROW introduces Target-Specific Tracked Features (TSF), which inject temporally aligned referent cues during training, and a dual-prompt design that decodes box and segmentation tokens to fuse geometric priors with semantic grounding in pixel-grounded video MLLMs (see the structural sketch after this list).
- It operates end-to-end, relying on a SAM2-based proposer rather than external detectors, and has been integrated into three open-source video MLLMs (UniPixel, GLUS, VideoGLaMM) with consistent performance gains across all three.
- The approach is evaluated on a curated referential video dataset of 30,646 videos and 45,231 Q&A pairs, achieving gains of up to +8.9 J&F on referring video object segmentation (RVOS), +5 mIoU on visual grounding, and +5.4 CLAIR on grounded conversation generation (GCG); a simplified J&F computation is sketched below.
- Overall, SPARROW substantially improves referential stability, spatial precision, and temporal coherence in pixel-grounded video understanding, pointing toward more temporally consistent grounding for video AI systems.
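
To make the dual-prompt idea concrete, here is a minimal PyTorch sketch of how box and segmentation token hidden states could be decoded and fused with temporally aligned tracked-feature cues. This is an illustration under stated assumptions, not the paper's actual implementation: the class name `DualPromptHead`, the dimensions `llm_dim`/`prompt_dim`, and the attention-based fusion are all hypothetical choices.

```python
import torch
import torch.nn as nn

class DualPromptHead(nn.Module):
    """Hypothetical sketch of a dual-prompt decoder: the LLM's hidden states
    for a <BOX> token and a <SEG> token are projected into (a) normalized box
    coordinates, used as a geometric prior, and (b) a mask-prompt embedding
    for a SAM2-style mask decoder, fused with per-frame tracked cues."""

    def __init__(self, llm_dim: int = 4096, prompt_dim: int = 256):
        super().__init__()
        # <BOX> token -> normalized (x1, y1, x2, y2) box regression
        self.box_head = nn.Sequential(
            nn.Linear(llm_dim, prompt_dim), nn.GELU(),
            nn.Linear(prompt_dim, 4), nn.Sigmoid(),
        )
        # <SEG> token -> semantic mask-prompt embedding
        self.seg_proj = nn.Linear(llm_dim, prompt_dim)
        # target-specific tracked features -> temporally aligned cues
        self.tsf_proj = nn.Linear(prompt_dim, prompt_dim)
        self.fuse = nn.MultiheadAttention(prompt_dim, num_heads=8,
                                          batch_first=True)

    def forward(self, box_hidden, seg_hidden, tracked_feats):
        # box_hidden, seg_hidden: (B, llm_dim) hidden states of the two tokens
        # tracked_feats: (B, T, prompt_dim) per-frame target-specific features
        boxes = self.box_head(box_hidden)        # (B, 4) geometric prior
        seg_prompt = self.seg_proj(seg_hidden)   # (B, prompt_dim)
        cues = self.tsf_proj(tracked_feats)      # (B, T, prompt_dim)
        # Attend the per-frame cues over the semantic prompt so each frame's
        # mask prompt stays locked onto the same referent across time.
        kv = seg_prompt.unsqueeze(1)             # (B, 1, prompt_dim)
        fused, _ = self.fuse(cues, kv, kv)       # (B, T, prompt_dim)
        return boxes, fused

# Usage: 2 videos, 16 frames each
head = DualPromptHead()
boxes, prompts = head(torch.randn(2, 4096), torch.randn(2, 4096),
                      torch.randn(2, 16, 256))
```

The fused per-frame prompts would then condition a SAM2-style mask decoder, while the regressed boxes supply the geometric prior mentioned in the key points.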
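
The headline RVOS metric, J&F, averages region similarity J (mask IoU) and contour accuracy F (a boundary F-measure), as in DAVIS-style benchmarks. The sketch below is a simplified stand-in for the official evaluation code; the pixel tolerance `tol` and the morphological boundary extraction are assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: IoU of two boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else np.logical_and(pred, gt).sum() / union

def _boundary(mask: np.ndarray) -> np.ndarray:
    """One-pixel-wide boundary: the mask minus its erosion."""
    mask = mask.astype(bool)
    return mask & ~binary_erosion(mask)

def f_measure(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Contour accuracy F: boundary precision/recall within `tol` pixels."""
    pb, gb = _boundary(pred), _boundary(gt)
    if not pb.any() and not gb.any():
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    precision = (pb & binary_dilation(gb, struct)).sum() / max(pb.sum(), 1)
    recall = (gb & binary_dilation(pb, struct)).sum() / max(gb.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred_masks, gt_masks) -> float:
    """Per-video J&F: mean of frame-averaged J and frame-averaged F."""
    j = np.mean([jaccard(p, g) for p, g in zip(pred_masks, gt_masks)])
    f = np.mean([f_measure(p, g) for p, g in zip(pred_masks, gt_masks)])
    return (j + f) / 2
```

Under this convention, the reported +8.9 J&F gain means the average of mask-overlap and boundary-accuracy scores rose by 8.9 points on the RVOS benchmark.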