Attribution-Guided Multimodal Deepfake Detection via Cross-Modal Forensic Fingerprints

arXiv cs.CV / 4/30/2026


Key Points

  • The paper argues that audio-visual deepfake detectors based on simple binary classification often pick up dataset-specific artifacts rather than true generator forensic traces, limiting robustness.
  • It introduces the AMDD framework, which performs both detection and generator attribution by using attribution-guided learning as structured regularization on the shared embedding space.
  • The proposed Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss aligns generator-induced artifacts across visual and audio streams, leveraging correlated traces created by coherent manipulations.
  • Experiments on FakeAVCeleb report very high results (99.7% balanced accuracy, 99.8% AUC) along with strong attribution accuracy (95.9%), and cross-dataset tests show robust real-video detection, while fake detection on unseen generators remains challenging.
  • The architecture pairs a ResNet50 video encoder with temporal attention against a ResNet18-based audio encoder operating on mel spectrograms, addressing the encoder capacity imbalance found in prior multimodal detectors.
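The CMFFC loss described in the key points above can be illustrated with a minimal cosine-alignment sketch between modality-specific fingerprint embeddings. The paper's exact formulation is not reproduced here; the function name and the cosine form are assumptions:

```python
import numpy as np

def cmffc_loss(video_fp: np.ndarray, audio_fp: np.ndarray, eps: float = 1e-8) -> float:
    """Hypothetical Cross-Modal Forensic Fingerprint Consistency loss.

    video_fp, audio_fp: (batch, dim) fingerprint embeddings from the visual
    and audio streams of the same clips. Coherent manipulation should leave
    correlated traces across modalities, so aligned fingerprints (cosine
    similarity near 1) incur low loss, while misaligned ones are penalized.
    """
    v = video_fp / (np.linalg.norm(video_fp, axis=1, keepdims=True) + eps)
    a = audio_fp / (np.linalg.norm(audio_fp, axis=1, keepdims=True) + eps)
    cos = np.sum(v * a, axis=1)        # per-sample cosine similarity
    return float(np.mean(1.0 - cos))   # 0 when perfectly aligned

rng = np.random.default_rng(0)
fp = rng.normal(size=(4, 16))
print(cmffc_loss(fp, fp))  # identical fingerprints give a loss of (approximately) 0
```

In a full training pipeline this term would be added, with a weighting coefficient, to the detection and attribution objectives so that it acts as the structured regularizer the paper describes.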

Abstract

Audio-visual deepfakes have reached a level of realism that makes perceptual detection unreliable, threatening media integrity and biometric security. While multimodal detection has shown promise, most approaches are binary classification tasks that often latch onto dataset-specific artifacts rather than genuine generative traces. We argue that a detector incapable of identifying how a video was forged is likely learning the wrong signal. Unlike binary detection, attribution-guided learning imposes a stronger geometric constraint on the shared embedding space, forcing the model to encode generator-specific forensic content rather than shortcuts. We propose the Attribution-Guided Multimodal Deepfake Detection (AMDD) framework, which jointly learns to detect and attribute manipulation. AMDD treats generator attribution as a structured regularization that constrains representation geometry toward forensically meaningful features. We introduce a Cross-Modal Forensic Fingerprint Consistency (CMFFC) loss to enforce alignment between generator-induced artifacts in visual and audio streams. This exploits the fact that coherent manipulation leaves correlated traces across modalities, grounded in the physical coupling between speech and facial articulation that synthetic pipelines routinely disrupt. Architecturally, we pair a ResNet50 with temporal attention for visual encoding against a pretrained ResNet18 for mel spectrograms, closing the encoder capacity gap found in prior models. On FakeAVCeleb, AMDD achieves 99.7% balanced accuracy and 99.8% AUC with 95.9% attribution accuracy. Cross-dataset evaluation on DeepfakeTIMIT, DFDM, and LAV-DF confirms that real video detection generalizes robustly, while fake detection on unseen generators remains an open challenge that we analyze in depth.
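The temporal attention mentioned for the visual branch can be sketched as attention-weighted pooling over per-frame features. This is a generic illustration in numpy rather than the paper's implementation; the learned score vector `w` and the pooling form are assumptions:

```python
import numpy as np

def temporal_attention_pool(frame_feats: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Attention-weighted pooling over T per-frame feature vectors.

    frame_feats: (T, d) features, e.g. from a ResNet50 backbone.
    w: (d,) learned scoring vector (random here, for illustration only).
    Returns a single (d,) clip-level embedding.
    """
    scores = frame_feats @ w                        # (T,) raw attention logits
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over time
    return alpha @ frame_feats                      # weighted sum of frames

rng = np.random.default_rng(1)
feats = rng.normal(size=(8, 32))   # 8 frames, 32-dim features each
w = rng.normal(size=32)
clip_emb = temporal_attention_pool(feats, w)
print(clip_emb.shape)  # -> (32,)
```

Because the softmax weights sum to 1, the pooled embedding is a convex combination of the frame features, letting the model emphasize frames where forensic artifacts are most visible.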
