A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
arXiv cs.CV / 4/6/2026
Key Points
- The paper addresses temporal sentence grounding in videos (TSGV), where a system must localize the time segment matching a natural-language query in an untrimmed video.
- It argues that prior approaches suffer from a task-discrepancy problem: they freeze pre-trained visual backbones and rely on offline, query-agnostic features optimized for classification rather than for TSGV.
- The authors propose a fully end-to-end training framework that jointly optimizes the video backbone and the temporal localization head, demonstrating empirically that end-to-end learning outperforms frozen-backbone baselines across model scales.
- They introduce SCADA (Sentence Conditioned Adapter), which adaptively updates a small subset of backbone parameters conditioned on sentence features, enabling deeper backbones at lower memory cost and stronger linguistic modulation of visual features (see the sketch after this list).
- Experiments on two benchmarks show improved performance over state-of-the-art methods, and the authors plan to release code and models.
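The summary does not detail SCADA's internals, so here is a minimal PyTorch sketch of one plausible design, assuming a bottleneck adapter whose hidden activations are modulated FiLM-style by a pooled sentence embedding. All module names, dimensions, and the conditioning scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SentenceConditionedAdapter(nn.Module):
    """Bottleneck adapter modulated by a sentence embedding (hypothetical).

    Only this small module would be trained; the surrounding pre-trained
    backbone layers could stay frozen, matching the idea of updating a
    small, query-conditioned subset of parameters.
    """

    def __init__(self, visual_dim: int, sentence_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(visual_dim, bottleneck_dim)      # down-projection
        self.up = nn.Linear(bottleneck_dim, visual_dim)        # up-projection
        # Sentence features predict a per-channel scale and shift for the
        # bottleneck activations -- the "sentence conditioned" part.
        self.to_scale = nn.Linear(sentence_dim, bottleneck_dim)
        self.to_shift = nn.Linear(sentence_dim, bottleneck_dim)
        self.act = nn.GELU()

    def forward(self, visual: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        # visual: (batch, num_clips, visual_dim); sentence: (batch, sentence_dim)
        h = self.act(self.down(visual))
        scale = self.to_scale(sentence).unsqueeze(1)           # (batch, 1, bottleneck)
        shift = self.to_shift(sentence).unsqueeze(1)
        h = h * (1 + scale) + shift                            # linguistic modulation
        return visual + self.up(h)                             # residual connection


# Toy usage: 2 videos, 128 clip features each, with a pooled query embedding.
adapter = SentenceConditionedAdapter(visual_dim=768, sentence_dim=512)
visual_feats = torch.randn(2, 128, 768)
sentence_feat = torch.randn(2, 512)
out = adapter(visual_feats, sentence_feat)  # shape: (2, 128, 768)
```

Inserted between backbone blocks and trained jointly with the localization head, an adapter like this keeps the trainable parameter count and optimizer state small while still letting the query reshape visual features, which is consistent with the memory and modulation benefits the paper claims.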