Off-the-shelf Vision Models Benefit Image Manipulation Localization

arXiv cs.CV / 4/13/2026


Key Points

  • The paper argues that image manipulation localization (IML) and general vision tasks should be treated as connected directions, with semantic priors potentially helping IML performance.
  • It introduces ReVi, a trainable adapter designed to repurpose off-the-shelf vision models (including image generation and segmentation networks) for IML without altering the base models.
  • ReVi uses an approach inspired by robust principal component analysis to separate semantic redundancy from manipulation-specific signals and then amplify the manipulation-relevant components.
  • The method is efficient to deploy because it freezes the original vision model parameters and fine-tunes only the lightweight adapter, avoiding extensive redesign and full retraining.
  • Experiments indicate improved IML results and suggest that scalable IML frameworks can be built by plugging adapters into existing general-purpose vision backbones.
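The summary does not give ReVi's actual formulation, but the robust-PCA idea it alludes to can be sketched with standard principal component pursuit: decompose a matrix into a low-rank part (semantic redundancy, in ReVi's framing) plus a sparse part (manipulation-specific signal). The minimal solver below uses the usual alternating singular-value thresholding and soft thresholding; the function and parameter names (`rpca`, `lam`, `mu`) are our own, not from the paper.

```python
import numpy as np

def rpca(M, lam=None, mu=None, n_iter=200):
    """Split M ~= L + S, with L low-rank and S sparse, via a simple
    ADMM-style principal component pursuit (illustrative sketch)."""
    m, n = M.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else (m * n) / (4.0 * np.abs(M).sum())
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # dual variable enforcing M = L + S
    for _ in range(n_iter):
        # Low-rank update: singular value thresholding
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Sparse update: entrywise soft thresholding
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Dual ascent on the equality constraint
        Y += mu * (M - L - S)
    return L, S

# Tiny demo: a rank-2 "semantic" matrix plus a few large sparse spikes.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 40))
sparse = np.zeros((40, 40))
sparse.flat[rng.choice(1600, 40, replace=False)] = 10.0
L_hat, S_hat = rpca(low_rank + sparse)
```

In the paper's framing (per the key points above), the low-rank term would correspond to the semantic redundancy to discard and the sparse term to the manipulation-relevant signal the adapter amplifies; the actual adapter presumably operates on deep backbone features rather than raw pixels.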

Abstract

Image manipulation localization (IML) and general vision tasks are typically treated as two separate research directions due to the fundamental differences between manipulation-specific and semantic features. In this paper, however, we bridge this gap by introducing a fresh perspective: these two directions are intrinsically connected, and general semantic priors can benefit IML. Building on this insight, we propose a novel trainable adapter (named ReVi) that repurposes existing off-the-shelf general-purpose vision models (e.g., image generation and segmentation networks) for IML. Inspired by robust principal component analysis, the adapter disentangles semantic redundancy from manipulation-specific information embedded in these models and selectively enhances the latter. Unlike existing IML methods that require extensive model redesign and full retraining, our method keeps the parameters of the off-the-shelf vision models frozen and fine-tunes only the proposed adapter. The experimental results demonstrate the superiority of our method, showing the potential for scalable IML frameworks.
