CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection

arXiv cs.CV / 5/4/2026


Key Points

  • The paper highlights that existing AI-generated video (AIGV) detection methods are often limited to uni-modal or spatiotemporal cues, missing cross-modal signals, especially how visual and textual semantics align over time.
  • It proposes a new detection fingerprint called CMTA (cross-modal temporal artifact), arguing that real videos show natural, fluctuating semantic alignment while AIGVs tend to maintain unnaturally stable semantic trajectories driven by prompts.
  • The CMTA framework uses BLIP to produce frame-level image captions and CLIP to extract visual-textual representations, then applies two temporal modeling branches (a GRU-based coarse-grained branch and a Transformer-based fine-grained branch) to capture temporal alignment artifacts (see the sketches after this list and after the abstract).
  • Experiments across 40 subsets drawn from four datasets (GenVideo, EvalCrafter, VideoPhy, and VidProM) show that the method achieves state-of-the-art performance and generalizes better across different video generators.
  • The authors plan to release the code and models for CMTA on GitHub.
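
The per-frame feature extraction described above can be illustrated with a minimal sketch. It assumes off-the-shelf HuggingFace checkpoints ("Salesforce/blip-image-captioning-base" and "openai/clip-vit-base-patch32") and a hypothetical helper `frame_features`; the paper does not specify these exact models or prompting details, so treat them as assumptions rather than the authors' released code.

```python
# Sketch: caption each frame with BLIP, then embed frame and caption with CLIP.
import torch
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPProcessor, CLIPModel)

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_features(frames: list[Image.Image]):
    """Return L2-normalized per-frame visual and textual CLIP embeddings (T, D)."""
    # 1) BLIP: frame-level image captions
    blip_in = blip_proc(images=frames, return_tensors="pt")
    caption_ids = blip.generate(**blip_in, max_new_tokens=30)
    captions = blip_proc.batch_decode(caption_ids, skip_special_tokens=True)

    # 2) CLIP: visual and textual embeddings in a joint space
    clip_in = clip_proc(text=captions, images=frames,
                        return_tensors="pt", padding=True, truncation=True)
    img_emb = clip.get_image_features(pixel_values=clip_in["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=clip_in["input_ids"],
                                     attention_mask=clip_in["attention_mask"])
    # Normalize so cosine similarity reduces to a dot product
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return img_emb, txt_emb, captions
```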

Abstract

The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To exploit this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA leverages BLIP to generate frame-level image captions and utilizes CLIP to extract corresponding visual-textual representations. A coarse-grained temporal modeling branch is then designed to characterize temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch is constructed to capture intricate inter-frame variations from integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets, including GenVideo, EvalCrafter, VideoPhy, and VidProM, validate that our approach sets a new state-of-the-art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at https://github.com/hwang-cs-ime/CMTA.
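
The two temporal branches in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hidden sizes, fusion by concatenation, mean pooling, and the classification head are all assumptions; the paper only specifies a GRU-based coarse-grained branch over cross-modal alignment and a Transformer-based fine-grained branch over integrated visual-textual features.

```python
# Sketch of the two-branch temporal model consuming per-frame CLIP embeddings.
import torch
import torch.nn as nn

class CMTASketch(nn.Module):
    def __init__(self, clip_dim: int = 512, hidden: int = 256, heads: int = 4):
        super().__init__()
        # Coarse branch: GRU over the frame-wise image-caption cosine-similarity trace
        self.coarse_gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        # Fine branch: Transformer encoder over fused visual + textual features
        self.fuse = nn.Linear(2 * clip_dim, hidden)
        enc_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                               batch_first=True)
        self.fine_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Real-vs-generated classifier on the concatenated branch summaries
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb, txt_emb: (B, T, clip_dim), L2-normalized CLIP embeddings
        align = (img_emb * txt_emb).sum(-1, keepdim=True)        # (B, T, 1) alignment trace
        _, h_coarse = self.coarse_gru(align)                     # (1, B, hidden)
        fused = self.fuse(torch.cat([img_emb, txt_emb], dim=-1)) # (B, T, hidden)
        fine = self.fine_encoder(fused).mean(dim=1)              # (B, hidden)
        feats = torch.cat([h_coarse.squeeze(0), fine], dim=-1)   # (B, 2*hidden)
        return self.head(feats)                                  # logit: generated vs. real
```

The intuition this sketch encodes matches the stated fingerprint: for real videos the alignment trace fluctuates as scene semantics drift, while prompt-driven AIGVs keep it unnaturally flat, and the coarse and fine branches summarize that behavior at two temporal granularities.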