A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

arXiv cs.CV / 4/6/2026



Key Points

The paper addresses temporal sentence grounding in videos (TSGV), where a system must localize the time segment matching a natural-language query in an untrimmed video.
It argues that prior approaches suffer from a task discrepancy: they freeze pre-trained visual backbones and rely on offline, query-agnostic features that were optimized for classification rather than for TSGV.
The authors propose a fully end-to-end training framework that jointly optimizes the video backbone and the temporal localization head, showing empirically that end-to-end learning beats frozen baselines across model scales.
They introduce SCADA (Sentence Conditioned Adapter), which uses sentence features to adaptively train a small subset of backbone parameters. This reduces memory use, making deeper backbones practical, and lets the query modulate the visual feature maps.
Experiments on two benchmarks report better performance than state-of-the-art methods, and the authors plan to release the code and models.
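
To make the task in the first point concrete: a TSGV prediction is a (start, end) segment, and such predictions are commonly scored by their temporal intersection-over-union (IoU) with the annotated segment. The summary does not say which metric the paper reports, so the sketch below is an illustrative assumption, and the function name `temporal_iou` is hypothetical.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, both in seconds.

    Illustrative helper only; not taken from the paper.
    """
    # Overlap is the gap between the later start and the earlier end.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union is the sum of both lengths minus the overlap counted twice.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# A predicted segment of 10-20 s against an annotated segment of 15-25 s:
# 5 s of overlap over a 15 s union, i.e. one third.
score = temporal_iou((10.0, 20.0), (15.0, 25.0))
print(round(score, 3))
```

A prediction is then usually counted as correct when this score exceeds a chosen threshold.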

Abstract

Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the deployment of deeper network backbones with reduced memory and significantly enhances visual representation by modulating feature maps through precise integration of linguistic embeddings. Experiments on two benchmarks show that our method outperforms state-of-the-art approaches. The code and models will be released.
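
The abstract does not give SCADA's exact architecture, so the following is only a rough sketch of the general idea it describes: a small trainable module that sits beside frozen backbone weights and modulates visual features with a sentence embedding. The bottleneck layout, the scale-and-shift conditioning, the zero initialization, and every name and size here (`SentenceConditionedAdapter`, `to_scale`, `to_shift`, the dimensions) are assumptions made for illustration, not the authors' design.

```python
import numpy as np

rng = np.random.default_rng(0)


class SentenceConditionedAdapter:
    """Hypothetical bottleneck adapter conditioned on a sentence embedding."""

    def __init__(self, channels, sent_dim, bottleneck=8):
        # Only these four small matrices would be trained; the backbone
        # block that produces `feats` is assumed to stay frozen.
        self.down = rng.normal(0.0, 0.02, (channels, bottleneck))
        self.up = np.zeros((bottleneck, channels))  # zero init: starts as identity
        self.to_scale = rng.normal(0.0, 0.02, (sent_dim, bottleneck))
        self.to_shift = rng.normal(0.0, 0.02, (sent_dim, bottleneck))

    def __call__(self, feats, sent):
        # feats: (T, C) per-frame visual features; sent: (D,) sentence embedding.
        scale = 1.0 + sent @ self.to_scale   # (bottleneck,)
        shift = sent @ self.to_shift         # (bottleneck,)
        hidden = np.maximum(feats @ self.down, 0.0)  # down-project + ReLU
        hidden = hidden * scale + shift      # sentence-dependent modulation
        return feats + hidden @ self.up      # residual connection

    def num_params(self):
        weights = (self.down, self.up, self.to_scale, self.to_shift)
        return sum(w.size for w in weights)


feats = rng.normal(size=(16, 64))   # 16 frames, 64 channels (toy sizes)
sent = rng.normal(size=(32,))       # toy sentence embedding
adapter = SentenceConditionedAdapter(channels=64, sent_dim=32)
out = adapter(feats, sent)
print(out.shape, adapter.num_params())
```

Because the up-projection starts at zero, the adapter initially passes features through unchanged, and the query only begins to influence them as the adapter weights are trained. Keeping the trainable part this small is one plausible way to obtain the memory savings the abstract mentions.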
