SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment

arXiv cs.LG / 3/27/2026



Key Points

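SAVe is a self-supervised detection framework that captures the subtle visual artifacts and audio-visual inconsistencies found in multimodal deepfakes.
To avoid the conventional reliance on training with synthetic forgeries, SAVe trains only on authentic videos and generates pseudo-manipulations on the fly by deliberately self-blending local regions while preserving identity.
On the visual side, it learns complementary cues at multiple facial granularities; on the audio side, an alignment component detects offsets in lip-speech synchronization to capture cross-modal evidence.
Experiments on FakeAVCeleb and AV-LipSync-TIMIT show competitive in-domain performance and strong generalization to other datasets.
Overall, the work presents a self-supervised paradigm for multimodal detection that reduces training bias toward synthetic forgeries and scales more readily to unseen manipulations.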
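
The self-blending idea in the points above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the rectangular `box` region, the brightness gain, and the pixel shift are all assumptions chosen for illustration. A real pipeline would presumably derive regions from facial landmarks and use richer perturbations. The sketch only shows the core step of blending a slightly altered copy of an authentic frame back into itself within one region, so the identity is unchanged and the only "forgery" signal is the blending boundary.

```python
import numpy as np

def soft_region_mask(height, width, box, feather=4):
    """Float mask in [0, 1]: 0 outside `box`, ramping up to 1 over `feather` pixels inside it."""
    top, left, bottom, right = box  # hypothetical (top, left, bottom, right) region
    ys = np.arange(height)[:, None]
    xs = np.arange(width)[None, :]
    # Depth inside the box along each axis (negative when outside).
    dy = np.minimum(ys - top, bottom - 1 - ys)
    dx = np.minimum(xs - left, right - 1 - xs)
    inside = np.minimum(dy, dx)  # broadcasts to (height, width)
    return np.clip((inside + 1) / feather, 0.0, 1.0)

def self_blend(frame, box, rng, gain_range=(0.9, 1.1), max_shift=2):
    """Blend a perturbed copy of `frame` into itself inside `box` (assumed perturbations)."""
    height, width = frame.shape[:2]
    target = frame.astype(np.float32)
    # Source is the same frame, so identity is preserved; only brightness and position change.
    gain = rng.uniform(*gain_range)
    shift_y, shift_x = rng.integers(-max_shift, max_shift + 1, size=2)
    source = np.roll(target * gain, (int(shift_y), int(shift_x)), axis=(0, 1))
    mask = soft_region_mask(height, width, box)[..., None]
    blended = mask * source + (1.0 - mask) * target
    return np.clip(blended, 0, 255).astype(np.uint8), mask[..., 0]

# Usage: an authentic frame yields a (real, pseudo-fake) training pair.
rng = np.random.default_rng(0)
real = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in for a face crop
pseudo_fake, mask = self_blend(real, box=(20, 16, 44, 48), rng=rng)
```

In a training loop under these assumptions, `real` would be labeled 0 and `pseudo_fake` labeled 1, and varying `box` across facial regions would be one way to approximate the "region-aware" aspect described above.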

Abstract

Multimodal deepfakes can exhibit subtle visual artifacts and cross-modal inconsistencies, which remain challenging to detect, especially when detectors are trained primarily on curated synthetic forgeries. Such synthetic dependence can introduce dataset and generator bias, limiting scalability and robustness to unseen manipulations. We propose SAVe, a self-supervised audio-visual deepfake detection framework that learns entirely from authentic videos. SAVe generates on-the-fly, identity-preserving, region-aware self-blended pseudo-manipulations to emulate tampering artifacts, enabling the model to learn complementary visual cues across multiple facial granularities. To capture cross-modal evidence, SAVe also models lip-speech synchronization via an audio-visual alignment component that detects temporal misalignment patterns characteristic of audio-visual forgeries. Experiments on FakeAVCeleb and AV-LipSync-TIMIT demonstrate competitive in-domain performance and strong cross-dataset generalization, highlighting self-supervised learning as a scalable paradigm for multimodal deepfake detection.
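
To make the notion of temporal misalignment concrete, the sketch below estimates the offset between a per-frame mouth-motion signal and a per-frame audio-energy signal by normalized cross-correlation. This is a hand-crafted stand-in, not the learned alignment component the abstract describes: the two 1-D input signals, the function name `estimate_lag`, and the `max_lag` search window are assumptions made for illustration only.

```python
import numpy as np

def estimate_lag(visual, audio, max_lag):
    """Return (lag, score): the number of frames by which `audio` trails `visual`,
    chosen to maximize the normalized correlation within +/- `max_lag` frames."""
    v = (visual - visual.mean()) / (visual.std() + 1e-8)
    a = (audio - audio.mean()) / (audio.std() + 1e-8)
    n = len(v)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            # Compare visual[t] with audio[t + lag].
            score = np.mean(v[: n - lag] * a[lag:])
        else:
            # Compare visual[t - lag] with audio[t].
            score = np.mean(v[-lag:] * a[: n + lag])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

# Usage with synthetic signals (assumed stand-ins for mouth opening and audio energy).
rng = np.random.default_rng(0)
mouth_motion = rng.standard_normal(200)
audio_in_sync = mouth_motion.copy()
audio_delayed = np.roll(mouth_motion, 3)  # audio trails the video by 3 frames

lag_sync, _ = estimate_lag(mouth_motion, audio_in_sync, max_lag=5)
lag_delayed, _ = estimate_lag(mouth_motion, audio_delayed, max_lag=5)
```

Under these assumptions, a nonzero estimated lag or a low peak score would flag a clip as possibly misaligned. Shifting the audio of an authentic clip, as `np.roll` does here, is also one simple way to create misaligned training examples without any synthetic forgeries.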