CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
arXiv cs.CV / 5/4/2026
📰 News · Tools & Practical Usage · Models & Research
Key Points
- The paper argues that existing AI-generated video (AIGV) detectors rely mostly on uni-modal or spatiotemporal cues and miss cross-modal signals, especially how visual and textual semantics align over time.
- It proposes a new detection fingerprint, the cross-modal temporal artifact (CMTA): real videos exhibit natural, fluctuating semantic alignment over time, while AIGVs tend to hold unnaturally stable, prompt-driven semantic trajectories.
- The CMTA framework uses BLIP to caption each frame and CLIP to extract visual-textual representations, then applies two temporal modeling branches, a GRU-based coarse-grained branch and a Transformer-based fine-grained branch, to capture temporal alignment artifacts (see the sketch after this list).
- Experiments on 40 subsets drawn from multiple datasets (GenVideo, EvalCrafter, VideoPhy, VidProM) show the method achieving state-of-the-art performance and stronger generalization across different video generators.
- The authors plan to release the code and models for CMTA on GitHub.