HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models

arXiv cs.CV / 3/26/2026


Key Points

  • The paper proposes HAM (Heterogeneous Attention Modulation), a training-free diffusion-model method for image/text-guided style reference transfer that aims to resolve the style–content identity trade-off.
  • It introduces a style noise initialization strategy to set latent noise for the diffusion process, followed by HAM that modulates different attention mechanisms to better preserve user content identity.
  • HAM includes two components—Global Attention Regulation (GAR) and Local Attention Transplantation (LAT)—intended to balance global style adherence with local detail retention.
  • Experiments (qualitative and quantitative) reportedly achieve state-of-the-art results across multiple metrics on style transfer tasks.
  • Overall, the work suggests that carefully designed attention modulation during inference can improve identity preservation without additional model training.
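The paper does not publish its exact equations in this summary, but the two components can be sketched at a high level. In the toy code below, a GAR-like step globally rescales the attention logits with a scalar `gamma` (an assumed form of "regulation"), while a LAT-like step transplants attention maps computed from the content branch onto the style branch's values, so spatial correspondences follow the content while features follow the style. All function names, the `gamma` parameter, and the blending scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention; returns (probs, output)."""
    d = q.shape[-1]
    probs = softmax(q @ k.T / np.sqrt(d))
    return probs, probs @ v

def global_attention_regulation(q, k, v, gamma=1.5):
    """GAR-like step (illustrative): uniformly temper the attention
    logits with a scalar gamma to strengthen or relax style guidance
    across all tokens at once."""
    d = q.shape[-1]
    probs = softmax(q @ k.T * gamma / np.sqrt(d))
    return probs @ v

def local_attention_transplantation(q_content, k_content, v_style):
    """LAT-like step (illustrative): compute attention maps from the
    content branch, then apply them to the style branch's values, so
    local structure is kept while stylized features flow in."""
    content_probs, _ = attention(q_content, k_content, v_style)
    return content_probs @ v_style

# Tiny demo with random token features (4 tokens, dim 8).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v_style = rng.normal(size=(4, 8))
out_gar = global_attention_regulation(q, k, v_style)
out_lat = local_attention_transplantation(q, k, v_style)
```

Because both steps only reweight or reroute existing attention computations, they can run at inference time inside a frozen diffusion model, which is what makes the approach training-free.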

Abstract

Diffusion models have demonstrated remarkable performance in image generation, particularly in the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models' robust feature-extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style references or to retain the identity of user-provided content images, falling into the trap of the style–content balance. We therefore propose a training-free style transfer approach via **H**eterogeneous **A**ttention **M**odulation (**HAM**) that protects identity information during image/text-guided style reference transfer, thereby addressing the style–content trade-off. Specifically, we first introduce style noise initialization to initialize the latent noise for diffusion. Then, during the diffusion process, HAM modulates different attention mechanisms through Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserve the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments, achieving state-of-the-art performance on multiple quantitative metrics.
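The abstract does not specify how the style noise initialization works. One common pattern in training-free diffusion editing is to blend a latent derived from the style image (for example via DDIM inversion) with fresh Gaussian noise and renormalize to unit variance, as standard samplers expect. The sketch below assumes that pattern; `style_noise_init`, the `alpha` mixing weight, and the renormalization are illustrative assumptions, not the paper's stated procedure.

```python
import numpy as np

def style_noise_init(style_latent, alpha=0.7, seed=0):
    """Hypothetical style noise initialization: blend a style-derived
    latent (e.g. obtained by inverting the style image) with fresh
    Gaussian noise, then rescale so the result has unit standard
    deviation, matching the distribution diffusion samplers start from."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(style_latent.shape)
    mixed = alpha * style_latent + (1.0 - alpha) * noise
    return mixed / mixed.std()

# Example: initialize from a random stand-in for an inverted style latent.
style_latent = np.random.default_rng(1).standard_normal((4, 64))
z0 = style_noise_init(style_latent, alpha=0.6)
```

Seeding the sampler this way biases the whole trajectory toward the style reference before any attention modulation occurs, which is presumably why the paper applies it before HAM.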