MASRA: MLLM-Assisted Semantic-Relational Consistent Alignment for Video Temporal Grounding

arXiv cs.CV / 5/6/2026

Key Points

  • MASRA is a training-time framework for Video Temporal Grounding (VTG) that tackles the cross-modal semantic gap and improves how video moments are aligned to queries.
  • It uses a multimodal large language model (MLLM) during training to generate textual priors in two forms, event-level descriptions with temporal spans and clip-level captions, and then performs two MLLM-assisted alignments.
  • Event Semantic Temporal Alignment (ESTA) strengthens span-level separability by aligning temporal context with event semantics, while Local Relational Consistency Alignment (LRCA) improves temporal consistency by matching a relation matrix built from the captions to the model’s temporal feature similarity matrix (see the sketch after these points).
  • MASRA also adds two supporting modules, semantic-guided enhancement and second-order relational attention, plus Decoupled Alignment Interaction (DAI) with a context-aware codebook that absorbs query-irrelevant semantics and helps close the cross-modal gap.
  • The approach is reported to outperform prior methods in extensive experiments, and the MLLM is not used at inference, improving deployability.
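
The relation-matrix matching in LRCA can be pictured as comparing two clip-by-clip similarity matrices. The sketch below assumes cosine similarities and a KL-divergence loss on row-normalized relations; the paper's exact construction, embedding spaces, and loss may differ, and the names (`lrca_loss`, `clip_feats`, `caption_embs`) are illustrative.

```python
import torch
import torch.nn.functional as F

def lrca_loss(clip_feats: torch.Tensor, caption_embs: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Relational-consistency loss in the spirit of LRCA (assumed form).

    clip_feats:   (T, D)  temporal clip features from the grounding model
    caption_embs: (T, D') text embeddings of MLLM-generated clip captions
    """
    v = F.normalize(clip_feats, dim=-1)
    t = F.normalize(caption_embs, dim=-1)
    sim_v = v @ v.t() / tau               # (T, T) model-side clip relations
    sim_t = t @ t.t() / tau               # (T, T) caption-side relations (target)

    # Treat each row as a distribution over clips and match the model's
    # relations to the textual prior, which is detached as fixed supervision.
    log_p_v = F.log_softmax(sim_v, dim=-1)
    p_t = F.softmax(sim_t, dim=-1).detach()
    return F.kl_div(log_p_v, p_t, reduction="batchmean")
```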

Abstract

Video Temporal Grounding (VTG) faces a cross-modal semantic gap that often leads to background features being incorrectly aligned with the query, while directly matching the query to moments yields temporal semantics with insufficient discriminability and consistency. To address these issues, we propose MLLM-Assisted Semantic-Relational Consistent Alignment (MASRA), a training-time MLLM-based optimization framework for VTG. MASRA leverages an MLLM during training to produce two forms of textual priors, namely event-level descriptions with temporal spans and clip-level captions, and instantiates two MLLM-assisted alignments. Event Semantic Temporal Alignment (ESTA) aligns temporal context with event semantics to explicitly strengthen the correspondence between semantics and temporal events and improve span-level separability. Local Relational Consistency Alignment (LRCA) constructs a textual relation matrix from the clip-level captions and aligns it with the model's temporal feature similarity matrix, enhancing temporal consistency while capturing local structural information. MASRA also includes two simple supporting modules, semantic-guided enhancement and second-order relational attention, to better exploit the learned semantic context and relational structure. Moreover, we introduce Decoupled Alignment Interaction (DAI) with a context-aware codebook to adaptively absorb query-irrelevant semantics and alleviate the cross-modal gap. The MLLM is invoked only during training and is not used at inference. Extensive experiments show that MASRA outperforms existing methods, and ablation studies validate its effectiveness.
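
To make the span-level alignment in ESTA concrete, the sketch below pools clip features inside each MLLM-provided event span and contrasts them against the corresponding event-description embeddings with a symmetric InfoNCE loss. This is an assumed instantiation only: the pooling, projection, and loss used in the paper may differ, and all names (`esta_loss`, `event_spans`, `event_embs`) are hypothetical.

```python
import torch
import torch.nn.functional as F

def esta_loss(clip_feats, event_embs, event_spans, tau=0.07):
    """Span-level event-semantic alignment in the spirit of ESTA (assumed form).

    clip_feats:  (T, D) temporal clip features
    event_embs:  (E, D) embeddings of MLLM event descriptions, assumed to be
                 projected into the same space as clip_feats
    event_spans: list of E (start, end) clip-index pairs from the MLLM
    """
    # Mean-pool the clip features inside each event span.
    span_feats = torch.stack([clip_feats[s:e].mean(dim=0) for s, e in event_spans])

    # Symmetric InfoNCE: each span should match its own event description.
    v = F.normalize(span_feats, dim=-1)
    t = F.normalize(event_embs, dim=-1)
    logits = v @ t.t() / tau                              # (E, E)
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```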