Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
arXiv cs.CV / 2026/3/26
Key Points
- The paper introduces a new Phase-wise Decomposition and Alignment (PDA) framework for Open-Vocabulary Temporal Action Detection (OV-TAD), aiming to better transfer temporally consistent visual knowledge from seen to unseen action categories.
- It proposes a CoT-Prompting Semantic Decomposition (CSD) module that uses large language model chain-of-thought reasoning to automatically break action labels into coherent phase-level descriptions.
- It adds a Text-infused Foreground Filtering (TIF) module that uses phase-wise semantic cues to filter action-relevant video segments and produce more semantically aligned visual representations.
- An Adaptive Phase-wise Alignment (APA) module performs phase-level visual-text matching and adaptively aggregates phase alignment results for final predictions.
- Experiments on two OV-TAD benchmarks reportedly show that the approach improves generalization to unseen actions over prior methods relying mainly on global alignment.
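
The filtering and phase-wise matching steps described above can be illustrated with a small sketch. This is not the paper's implementation: the summary does not specify how TIF scores segments or how APA weights phases, so the cosine-similarity threshold and the softmax weighting below are stand-in assumptions. The function names (`filter_foreground`, `phase_alignment_score`), the `threshold` and `temperature` parameters, and the toy 3-dimensional embeddings are all hypothetical. In a real pipeline the phase embeddings would come from encoding LLM-generated phase descriptions with a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def filter_foreground(segments, phase_texts, threshold=0.5):
    """Assumed stand-in for text-infused filtering: keep a segment only if
    it is similar enough to at least one phase description."""
    return [
        seg for seg in segments
        if max(cosine(seg, phase) for phase in phase_texts) >= threshold
    ]


def phase_alignment_score(segments, phase_texts, temperature=0.1):
    """Assumed stand-in for adaptive phase-wise alignment: score each phase by
    its best-matching segment, then combine the phase scores with softmax weights."""
    if not segments:
        return 0.0
    per_phase = [max(cosine(seg, phase) for seg in segments) for phase in phase_texts]
    weights = softmax(per_phase, temperature)
    return sum(w * s for w, s in zip(weights, per_phase))


# Toy example: two phase embeddings and three segment embeddings,
# the last of which matches neither phase (background).
phases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
segments = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

kept = filter_foreground(segments, phases)
score = phase_alignment_score(kept, phases)
background_score = phase_alignment_score(filter_foreground([[0.0, 0.0, 1.0]], phases), phases)
```

In this toy case the background segment is dropped, both phases find a perfectly matching segment, and a clip containing only background scores zero. A learned weighting, as the word "adaptively" in the summary suggests, would replace the fixed softmax used here.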



