A training-free framework for high-fidelity appearance transfer via diffusion transformers

arXiv cs.CV / 3/31/2026


Key Points

  • The paper introduces a training-free framework to enable high-fidelity appearance transfer with Diffusion Transformers, addressing the difficulty of controllable reference-image-based editing caused by DiTs’ global self-attention.
  • It disentangles structure and appearance by using high-fidelity inversion to build a rich content prior for the source image, capturing lighting and micro-texture details.
  • A new attention-sharing mechanism fuses purified appearance features from a reference image, with the fusion guided by geometric priors to preserve overall scene structure (see the sketch after this list).
  • The method operates at 1024px resolution and reportedly outperforms specialized approaches on tasks ranging from semantic attribute transfer to fine-grained material application, improving both structural preservation and appearance fidelity.

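In the U-Net editing literature (e.g., Cross-Image Attention, StyleAligned), attention sharing of the kind named in the third bullet is typically realized by letting the source image's queries attend over keys and values drawn from the reference image. The paper's exact formulation is not given here, so the snippet below is a minimal, hypothetical sketch of masked key/value injection; `share_kv_attention` and the boolean mask standing in for the geometric prior are illustrative assumptions, not the authors' implementation.

```python
import torch

def share_kv_attention(q_src, k_src, v_src, k_ref, v_ref, ref_mask=None):
    """Hypothetical attention-sharing step (illustrative, not the paper's code).

    Queries come from the source (structure) stream; keys/values are the
    concatenation of source and reference (appearance) tokens, so the source
    layout can "pull" appearance from the reference. The optional boolean
    ref_mask (True = keep) stands in for the geometric prior, silencing
    reference tokens outside the region being edited.

    Shapes: q_src (B, H, Nq, d); k_src/v_src (B, H, Ns, d);
            k_ref/v_ref (B, H, Nr, d); ref_mask (B, Nr).
    """
    scale = q_src.shape[-1] ** -0.5
    k = torch.cat([k_src, k_ref], dim=2)              # (B, H, Ns+Nr, d)
    v = torch.cat([v_src, v_ref], dim=2)
    logits = (q_src @ k.transpose(-2, -1)) * scale    # (B, H, Nq, Ns+Nr)
    if ref_mask is not None:
        B, _, Ns, _ = k_src.shape
        # Additive bias: -inf on reference tokens excluded by the mask.
        bias = torch.zeros(B, 1, 1, logits.shape[-1], device=logits.device)
        bias[..., Ns:] = bias[..., Ns:].masked_fill(
            ~ref_mask.view(B, 1, 1, -1), float("-inf"))
        logits = logits + bias
    return logits.softmax(dim=-1) @ v

# Toy usage: 2 heads, 16 source tokens, 16 reference tokens, head dim 8.
B, H, N, d = 1, 2, 16, 8
q = torch.randn(B, H, N, d)
k_s, v_s = torch.randn(B, H, N, d), torch.randn(B, H, N, d)
k_r, v_r = torch.randn(B, H, N, d), torch.randn(B, H, N, d)
mask = torch.rand(B, N) > 0.5                         # fake geometric mask
out = share_kv_attention(q, k_s, v_s, k_r, v_r, mask)
print(out.shape)  # torch.Size([1, 2, 16, 8])
```
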
Abstract

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike in U-Nets, naively injecting local appearance into a DiT can disrupt its holistic scene structure. We address this by proposing the first training-free framework specifically designed to tame DiTs for high-fidelity appearance transfer. At its core is a synergistic system that disentangles structure and appearance. We leverage high-fidelity inversion to establish a rich content prior for the source image, capturing its lighting and micro-textures. A novel attention-sharing mechanism then dynamically fuses purified appearance features from a reference, guided by geometric priors. Our unified approach operates at 1024px and outperforms specialized methods on tasks ranging from semantic attribute transfer to fine-grained material application. Extensive experiments confirm our state-of-the-art performance in both structural preservation and appearance fidelity.
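The abstract names "high-fidelity inversion" without specifying the sampler. One common, concrete instantiation is deterministic DDIM inversion, which runs the sampling update in reverse to recover the latent noise from which the source image can be faithfully reconstructed; modern DiTs often use flow-matching samplers instead, but the DDIM form is the simplest to illustrate. The sketch below shows that generic technique under those assumptions; `eps_model` is a stand-in for the DiT's noise predictor, not an API from the paper.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, num_steps=50):
    """Generic DDIM inversion, a stand-in for the paper's unspecified
    'high-fidelity inversion'. Starting from a clean latent x0, the
    deterministic DDIM update is run in reverse so that re-running the
    forward sampler from the returned noise approximately reconstructs
    x0, yielding a content prior that preserves lighting and texture.

    x0: clean latent (B, C, H, W); eps_model(x_t, t) -> predicted noise;
    alphas_cumprod: 1-D tensor of cumulative alphas, one per train step.
    """
    T = alphas_cumprod.shape[0]
    timesteps = torch.linspace(0, T - 1, num_steps).long()
    x = x0
    for i in range(num_steps - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
        eps = eps_model(x, t)
        # Recover the x0 estimate implied by the current latent ...
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # ... then step *up* the noise schedule (reverse of DDIM sampling).
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # approximately the initial noise for faithful reconstruction

# Toy usage with a dummy noise predictor and a descending alpha schedule.
alphas = torch.linspace(0.9999, 0.01, 1000)
dummy_eps = lambda x, t: torch.zeros_like(x)
noise = ddim_invert(torch.randn(1, 4, 8, 8), dummy_eps, alphas, num_steps=10)
print(noise.shape)  # torch.Size([1, 4, 8, 8])
```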