Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

arXiv cs.CV / 5/5/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper introduces Omni-Fake, a unified multimodal benchmark designed to evaluate deepfake detection performance under realistic social-media conditions.
  • Omni-Fake includes two datasets: Omni-Fake-Set (1M+ high-quality samples) and Omni-Fake-OOD (200k+ out-of-distribution samples excluded from training to test generalization).
  • The benchmark covers four modalities—image, audio, video, and audio-video talking heads—and supports a joint detection, localization, and explanation protocol.
  • The authors propose Omni-Fake-R1, a reinforcement-learning-based detector that adaptively fuses visual and auditory cues and produces structured outputs including localization and natural-language explanations (a sketch of one such output follows this list).
  • Experimental results report substantial improvements in detection accuracy, cross-modal generalization, and explainability compared with existing state-of-the-art baselines.
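
The summary does not specify the exact format of the detector's structured output, so the following is a minimal, hypothetical sketch of what a joint detection-localization-explanation record could look like. All field names and types here are assumptions for illustration, not the authors' schema.

```python
# Hypothetical sketch of a detection-localization-explanation record.
# Field names and types are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FakeRegion:
    """A localized manipulation: a spatial box for image/video frames,
    or a time span (in seconds) for audio."""
    modality: str                                     # "image" | "audio" | "video"
    box: Optional[Tuple[int, int, int, int]] = None   # x1, y1, x2, y2
    time_span: Optional[Tuple[float, float]] = None   # start, end


@dataclass
class DetectionResult:
    is_fake: bool                        # binary detection verdict
    confidence: float                    # detector score in [0, 1]
    regions: List[FakeRegion] = field(default_factory=list)
    explanation: str = ""                # natural-language rationale


# Example record for a manipulated talking-head clip.
result = DetectionResult(
    is_fake=True,
    confidence=0.97,
    regions=[
        FakeRegion(modality="video", box=(120, 64, 310, 255)),
        FakeRegion(modality="audio", time_span=(2.4, 5.1)),
    ],
    explanation="Lip motion is desynchronized from the audio track, "
                "and the mouth region shows blending artifacts.",
)
```

Keeping spatial boxes and audio time spans as separate optional fields lets a single record cover the audio-video talking-head case, where both streams may be manipulated independently.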

Abstract

Multimodal deepfakes are proliferating on social media, threatening content authenticity and information integrity and complicating digital forensics. Existing benchmarks are constrained by single-modality scope, simplified manipulations, or unrealistic distributions, limiting their ability to assess real-world robustness. To address these limitations, we present Omni-Fake, a unified omni-dataset for comprehensive multimodal deepfake detection in social-media settings. It comprises Omni-Fake-Set, a large-scale, high-quality dataset with 1M+ samples, and Omni-Fake-OOD, an out-of-distribution benchmark with 200k+ samples intentionally excluded from training to evaluate generalization. Omni-Fake spans four modalities (image, audio, video, and audio-video talking head) and supports a joint detection-localization-explanation protocol. On top of Omni-Fake, we further propose Omni-Fake-R1, a reinforcement-learning-driven multimodal detector that adaptively integrates visual and auditory cues and outputs structured decisions, localization, and natural-language explanations. Extensive experiments show significant gains in detection accuracy, cross-modal generalization, and explainability over state-of-the-art baselines. Project page: https://tianxiao1201.github.io/omni-fake-project-page/
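
The abstract says Omni-Fake-R1 adaptively integrates visual and auditory cues but gives no detail on the mechanism. As a generic illustration of adaptive cue weighting, and explicitly not the authors' RL-driven method, here is a minimal gated-fusion sketch in PyTorch; the class name, feature dimension, and two-way classifier head are all assumptions.

```python
# Minimal sketch of gated audio-visual fusion (an illustrative assumption;
# Omni-Fake-R1's actual RL-driven mechanism is not described in this summary).
# All names and dimensions are hypothetical.
import torch
import torch.nn as nn


class GatedAVFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # The gate sees both streams and decides, per feature, how much
        # to trust the visual cue versus the audio cue.
        self.gate = nn.Linear(2 * dim, dim)
        self.classifier = nn.Linear(dim, 2)  # real / fake logits

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([visual, audio], dim=-1)))
        fused = g * visual + (1.0 - g) * audio  # convex per-feature mix
        return self.classifier(fused)


# Usage with a batch of 4 precomputed per-clip embeddings.
model = GatedAVFusion(dim=512)
logits = model(torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```

A learned per-feature gate is one simple way to let the detector lean on the more reliable stream, for example trusting audio when the face is heavily compressed; the paper's approach replaces this with a reinforcement-learning-driven policy.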