Bridging the Micro–Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

arXiv cs.CV / 4/15/2026


Key Points

  • The paper introduces FASA (Frequency-Aware Semantic Alignment), a unified framework to localize both traditional image manipulations and diffusion-generated edits that look locally realistic.
  • It bridges the “micro–macro gap” by combining manipulation-sensitive frequency cues (via an adaptive dual-band DCT module) with manipulation-aware semantic priors (learned through patch-level contrastive alignment on frozen CLIP features).
  • FASA injects semantic priors into a hierarchical frequency pathway using a semantic-frequency side adapter to enable multi-scale feature interactions.
  • A prototype-guided, frequency-gated mask decoder integrates semantic consistency with boundary-aware localization to predict tampered regions more accurately.
  • Experiments on OpenSDI and several traditional manipulation benchmarks show state-of-the-art results, strong cross-generator/cross-dataset generalization, and robustness under common image degradations.
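
To make the frequency-cue idea concrete, the following is a minimal sketch of a dual-band DCT split on a single image block. It is not the paper's module: the summary does not describe how the bands are adapted, so the fixed `cutoff`, the 8x8 block size, and the helper names `dct2` and `band_energies` are all assumptions made for illustration.

```python
import math

def dct2(block):
    """Orthonormal 2D DCT-II of a square block given as a list of rows."""
    n = len(block)
    # basis[k][i] = a(k) * cos(pi * (2i + 1) * k / (2n)),
    # with a(0) = sqrt(1/n) and a(k) = sqrt(2/n) otherwise.
    basis = [
        [math.sqrt((1.0 if k == 0 else 2.0) / n)
         * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
         for i in range(n)]
        for k in range(n)
    ]
    # Separable transform: coeffs = basis @ block @ basis^T.
    tmp = [[sum(basis[u][i] * block[i][j] for i in range(n)) for j in range(n)]
           for u in range(n)]
    return [[sum(tmp[u][j] * basis[v][j] for j in range(n)) for v in range(n)]
            for u in range(n)]

def band_energies(block, cutoff=4):
    """Return (low, high) DCT energy, where low covers u + v < cutoff.

    The fixed cutoff is a hypothetical stand-in for the adaptive band
    selection mentioned in the summary.
    """
    coeffs = dct2(block)
    n = len(coeffs)
    low = high = 0.0
    for u in range(n):
        for v in range(n):
            energy = coeffs[u][v] ** 2
            if u + v < cutoff:
                low += energy
            else:
                high += energy
    return low, high

# A flat block versus a checkerboard block (values are synthetic).
flat = [[1.0] * 8 for _ in range(8)]
checker = [[(-1.0) ** (i + j) for j in range(8)] for i in range(8)]
print("flat:", band_energies(flat))
print("checker:", band_energies(checker))
```

The flat block puts all of its energy in the low band (the DC coefficient), while the checkerboard puts most of its energy in the high band. A forensic model could treat the two bands as separate feature channels, which is the general intuition behind band-split frequency cues.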

Abstract

As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro–macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.
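
The patch-level contrastive alignment mentioned above can be pictured with a generic supervised contrastive loss over patch embeddings. The abstract does not give the paper's actual objective, so this sketch substitutes a standard formulation: patches with the same label (authentic or tampered) are pulled together and the rest pushed apart. The function names, the `temperature` value, and the toy 2D embeddings are assumptions. In practice the embeddings would come from a trainable projection on top of frozen CLIP patch features.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def patch_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over patch embeddings.

    embeddings: list of equal-length float vectors, one per patch.
    labels: 0 for authentic patches, 1 for tampered patches.
    This is an illustrative stand-in, not the loss used in the paper.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other patch.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor patches.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        # Average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0

# Toy embeddings: two patches near the x-axis, two near the y-axis.
patches = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = patch_contrastive_loss(patches, [0, 0, 1, 1])
misaligned = patch_contrastive_loss(patches, [0, 1, 0, 1])
print("aligned labels:", aligned)
print("misaligned labels:", misaligned)
```

When the labels agree with the geometry of the embeddings, the loss is close to zero. When they disagree, it is much larger, and that gap is the signal a projection head would be trained to reduce.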