MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding
arXiv cs.CV / 5/6/2026
Key Points
- MASRA is a training-time framework for Video Temporal Grounding (VTG) that tackles the cross-modal semantic gap and sharpens the alignment between video moments and text queries.
- It uses an MLLM during training to generate textual priors in two forms, event-level descriptions with temporal spans and clip-level captions, and then performs two MLLM-assisted alignments.
- The first alignment, ESTA, strengthens span-level separability by aligning temporal context with event semantics; the second, LRCA, improves temporal consistency by matching a relation matrix derived from the captions to the model's temporal feature similarity matrix (both are sketched in code after this list).
- MASRA also adds a semantic-guided enhancement module and second-order relational attention, plus Decoupled Alignment Interaction (DAI) with a context-aware codebook to suppress query-irrelevant semantics.
- The approach is reported to outperform prior methods in extensive experiments, and because the MLLM is used only during training, it adds no cost at inference, improving deployability.
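To make the two alignments concrete, here is a minimal PyTorch sketch of the ESTA-style objective, assuming temporal features are mean-pooled over each MLLM-annotated event span and contrasted against the matching event-description embeddings with an InfoNCE-style loss. All names, shapes, the pooling choice, and the temperature are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def span_semantic_alignment_loss(clip_feats, event_embs, spans, tau=0.07):
    """Hypothetical ESTA-style span-level alignment.
       clip_feats: (T, D) temporal features from the grounding model;
       event_embs: (N, D) event-description embeddings (assumed projected
       to the same dimension D upstream); spans: list of N (start, end)
       clip indices for the MLLM-annotated events."""
    # Mean-pool the clip features inside each event's temporal span.
    pooled = torch.stack([clip_feats[s:e + 1].mean(dim=0) for s, e in spans])
    pooled = F.normalize(pooled, dim=-1)        # (N, D)
    texts = F.normalize(event_embs, dim=-1)     # (N, D)
    logits = pooled @ texts.T / tau             # (N, N) span-description scores
    targets = torch.arange(len(spans))          # i-th span matches i-th event
    # Cross-entropy over the score matrix pulls matching pairs together
    # and pushes each span away from the other events' descriptions.
    return F.cross_entropy(logits, targets)

# Toy usage: 32 clips, 4 annotated event spans.
loss = span_semantic_alignment_loss(
    torch.randn(32, 256), torch.randn(4, 256),
    spans=[(0, 7), (8, 15), (16, 23), (24, 31)])
```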
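And a matching sketch of the LRCA-style relational consistency term: build a relation matrix from the per-clip caption embeddings, build the model's own temporal feature similarity matrix, and penalize the discrepancy between the two structures. The MSE choice and all tensor names are again assumptions.

```python
import torch
import torch.nn.functional as F

def relational_consistency_loss(clip_feats, caption_embs):
    """Hypothetical LRCA-style relational consistency.
       clip_feats:   (T, D)  temporal features from the grounding model;
       caption_embs: (T, D') text embeddings of the per-clip captions."""
    v = F.normalize(clip_feats, dim=-1)
    t = F.normalize(caption_embs, dim=-1)
    sim_video = v @ v.T   # (T, T) model-side temporal similarity matrix
    sim_text = t @ t.T    # (T, T) caption-derived relation matrix
    # Penalize structural disagreement between the two matrices; MSE here,
    # though a KL divergence over row-softmaxes would also be reasonable.
    return F.mse_loss(sim_video, sim_text)

# Toy usage: 32 clips. Both auxiliary terms would be added to the grounding
# loss only during training, so nothing changes at inference.
loss = relational_consistency_loss(torch.randn(32, 256), torch.randn(32, 512))
```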