Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning
arXiv cs.CV / 3/13/2026
Key Points
- Introduces role-specific queries to decouple localization and captioning in dense video captioning (DVC), reducing cross-task interference.
- Adds a suppression mechanism that penalizes mutual temporal overlaps across queries to learn non-overlapping, more precise event regions.
- Applies contrastive alignment to ensure semantic consistency between the separated localization and captioning outputs.
- Proposes a lightweight core-concept module to enrich captions with concept-level representations for improved semantic richness.
- Validates the approach on two major DVC benchmarks, YouCook2 and ActivityNet Captions, and reports performance gains.
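
To make the overlap-suppression idea concrete, here is a minimal sketch of a penalty that grows as predicted event segments overlap in time. The summary does not give the paper's actual loss, so this is an assumption: it uses the mean pairwise temporal IoU across query predictions as a stand-in, and the names `temporal_iou` and `overlap_suppression_penalty` are hypothetical.

```python
from typing import List, Tuple

# A predicted event segment as (start, end) in seconds.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def overlap_suppression_penalty(segments: List[Segment]) -> float:
    """Mean temporal IoU over all distinct pairs of predicted segments.

    Assumed stand-in for the paper's suppression term: it is 0 when no
    two segments overlap and 1 when all segments coincide.
    """
    n = len(segments)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += temporal_iou(segments[i], segments[j])
    return total / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Two of the three segments overlap, so the penalty is nonzero.
    print(overlap_suppression_penalty([(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]))
```

In training, a term like this would be added to the localization loss with some weight so that queries are pushed toward distinct event regions. A differentiable version would operate on tensors rather than Python floats.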