MSCT: Differential Cross-Modal Attention for Deepfake Detection

arXiv cs.CV / 4/10/2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes MSCT, a multi-scale cross-modal transformer encoder aimed at improving audio-visual deepfake detection by better extracting forgery traces across modalities.
  • It addresses shortcomings of prior approaches by introducing multi-scale self-attention to integrate adjacent embeddings and differential cross-modal attention to more effectively fuse audio and video features (illustrative sketches of both mechanisms appear below).
  • The method targets common failure modes in alignment-based detectors, including insufficient feature extraction and modal alignment deviation between audio and video.
  • Experiments on the FakeAVCeleb dataset show competitive performance, supporting the effectiveness of the proposed architecture.
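
The paper's internal details are not reproduced in this digest, so the following is a minimal PyTorch sketch of one plausible reading of the multi-scale self-attention step: adjacent token embeddings are average-pooled at several window sizes, and the finest-scale tokens attend over the concatenated multi-scale sequence. The class name, the pooling choice, and the `scales` values are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSelfAttention(nn.Module):
    """Hypothetical multi-scale self-attention: NOT the paper's exact design."""

    def __init__(self, dim, num_heads=4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (batch, seq_len, dim)
        tokens = []
        for s in self.scales:
            if s == 1:
                tokens.append(x)
            else:
                # Merge s adjacent embeddings into one coarser token
                # (assumed reading of "integrating adjacent embeddings").
                pooled = F.avg_pool1d(
                    x.transpose(1, 2), kernel_size=s, stride=s
                ).transpose(1, 2)
                tokens.append(pooled)
        multi = torch.cat(tokens, dim=1)  # fine and coarse tokens together
        # Queries stay at the finest scale; keys/values span all scales.
        out, _ = self.attn(x, multi, multi)
        return out
```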

Abstract

Audio-visual deepfake detection typically employs a complementary multi-modal model to check for forgery traces in a video. These methods primarily extract forgery traces through audio-visual alignment, exploiting the inconsistency between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach includes a multi-scale self-attention mechanism that integrates the features of adjacent embeddings and a differential cross-modal attention mechanism that fuses multi-modal features. Our experiments demonstrate competitive performance on the FakeAVCeleb dataset, validating the effectiveness of the proposed structure.
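
Likewise, the differential cross-modal attention is only named, not specified, in this summary. The sketch below assumes a fusion step in which video queries attend to audio keys and values, with a second attention map subtracted from the first to suppress common-mode responses; the class name, the layer layout, and the `lam` weight are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialCrossModalAttention(nn.Module):
    """Hedged sketch: one possible form of differential cross-modal attention."""

    def __init__(self, dim, lam=0.5):
        super().__init__()
        self.q1, self.q2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.k1, self.k2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.lam = lam  # weight of the subtracted attention map (assumption)

    def forward(self, video, audio):
        # video: (batch, Lv, dim); audio: (batch, La, dim)
        scale = video.size(-1) ** -0.5
        a1 = F.softmax(self.q1(video) @ self.k1(audio).transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(self.q2(video) @ self.k2(audio).transpose(-2, -1) * scale, dim=-1)
        attn = a1 - self.lam * a2      # differential attention map
        return attn @ self.v(audio)   # audio features fused into video tokens
```

A symmetric audio-to-video branch would typically mirror this block before classification; whether MSCT fuses in one direction or both is not stated in the abstract.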