Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

arXiv cs.CV / April 29, 2026


Key Points

  • The paper highlights that deepfake detectors can achieve top results on clean datasets but fail in the real world due to spatial attention drift caused by compound degradations like blur and severe lossy compression.
  • It proposes a forensic “foundation-driven” framework that pairs an extreme compound degradation engine (see the sketch after this list) with a structurally constrained, multi-stream architecture, so that the DINOv2-Giant backbone learns invariant geometric and semantic priors.
  • The method routes images through three pathways—Global Texture, Localized Facial, and Hybrid Semantic Fusion (with CLIP)—then evaluates spatial attribution stability using Score-CAM and feature stability via cosine similarity.
  • A calibrated, discretized voting ensemble is used to suppress background attention drift and improve robustness, with the approach reportedly achieving 4th place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR.
  • The authors provide accompanying code on GitHub to support reproducibility.
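
The degradation engine is the most directly reusable idea. The sketch below chains Gaussian blur, heavy JPEG re-encoding, and a resize round trip the way such a pipeline typically would; the operator set, probabilities, and parameter ranges here are illustrative assumptions, since the paper's exact settings are not reproduced in this summary.

```python
import io
import random

from PIL import Image, ImageFilter


def compound_degrade(img: Image.Image) -> Image.Image:
    """Randomly chain blur, lossy compression, and rescaling degradations."""
    # Gaussian blur (the radius range is an illustrative assumption).
    if random.random() < 0.7:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 4.0)))

    # Severe JPEG re-encoding destroys the high-frequency artifacts that many
    # detectors rely on, pushing the model toward degradation-invariant cues.
    if random.random() < 0.9:
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=random.randint(10, 50))
        buf.seek(0)
        img = Image.open(buf).convert("RGB")

    # Downscale/upscale round trip, another common real-world degradation.
    if random.random() < 0.5:
        w, h = img.size
        s = random.uniform(0.25, 0.75)
        img = img.resize((max(1, int(w * s)), max(1, int(h * s)))).resize((w, h))

    return img
```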

Abstract

Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, forcing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. By analyzing spatial attribution via Score-CAM and feature stability via cosine similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. By aggregating these predictions via a calibrated, discretized voting mechanism, our ensemble successfully suppresses background attention drift while acting as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving fourth place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at https://github.com/khoalephanminh/ntire26-deepfake-challenge.
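
To make the robustness analysis concrete, here is a minimal sketch of the feature-stability measurement the abstract describes, assuming each stream is a callable that maps an image batch to a feature tensor. `stream`, `clean`, and `degraded` are hypothetical names; the authors' actual evaluation code is in the linked repository.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def feature_stability(stream, clean: torch.Tensor, degraded: torch.Tensor) -> float:
    """Mean cosine similarity between a stream's features on clean inputs
    and on their degraded counterparts; values near 1.0 indicate the stream
    has learned degradation-invariant representations."""
    f_clean = stream(clean)        # (batch, feature_dim)
    f_degraded = stream(degraded)  # same shape as f_clean
    return F.cosine_similarity(f_clean, f_degraded, dim=-1).mean().item()
```

The same metric, computed across streams rather than across degradation levels (after projecting features to a common dimension), is one way to quantify the non-redundancy claim: low cross-stream similarity would indicate the pathways attend to complementary evidence.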
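
The final aggregation step can be sketched similarly. Temperature scaling and simple majority voting are stand-ins here: the abstract specifies only that the votes are calibrated and discretized, not the exact scheme.

```python
import torch


def calibrated_discretized_vote(stream_logits, temperatures, threshold=0.5):
    """Calibrate each stream's logits, discretize to hard 0/1 votes
    (1 = fake), then take the majority as the ensemble decision."""
    votes = []
    for logits, t in zip(stream_logits, temperatures):
        probs = torch.sigmoid(logits / t)          # per-stream calibration
        votes.append((probs >= threshold).long())  # discretized hard vote
    votes = torch.stack(votes, dim=0).float()      # (num_streams, batch)
    return (votes.mean(dim=0) >= 0.5).long()       # majority rule
```

Discretizing before aggregation means a single stream whose attention has drifted onto the background cannot drag the ensemble's score with one extreme probability; it gets exactly one vote, which is plausibly how the ensemble suppresses background attention drift.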