LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation

arXiv cs.CV / 4/24/2026


Key Points

  • LatRef-Diff is a new diffusion-based framework aimed at more precise facial attribute editing and controllable style manipulation for applications like virtual avatars and photo editing.
  • Instead of using traditional semantic directions, it introduces “style codes” generated via latent and reference guidance, which are then used to modulate the target image through a dedicated style modulation module.
  • The style modulation module uses learnable vectors, cross-attention, and a hierarchical design to improve accuracy and overall image quality, supporting both random and user-customized style changes.
  • To improve training stability without requiring paired data (before/after images), the paper proposes a forward-backward consistency strategy: the target attribute is first approximately removed and then restored, with the restoration guided by perceptual and classification losses.
  • Experiments on CelebA-HQ report state-of-the-art results in both qualitative and quantitative metrics, with ablation studies confirming the contributions of key components.
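
The paper does not publish implementation details beyond the description above, but the core idea of the style modulation module — image features cross-attending to a set of style tokens — can be sketched as follows. All names, shapes, and the residual injection are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def style_modulate(feats, style_tokens, w_q, w_k, w_v):
    """Illustrative cross-attention style modulation (not the paper's code).

    feats:        (N, d)  image feature vectors acting as queries
    style_tokens: (M, d)  learnable/style-code-derived vectors (keys, values)
    w_q, w_k:     (d, d_a) projections into the attention space
    w_v:          (d, d)   value projection back into feature space
    Returns modulated features with the same shape as `feats`.
    """
    q = feats @ w_q                                          # (N, d_a)
    k = style_tokens @ w_k                                   # (M, d_a)
    v = style_tokens @ w_v                                   # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    # Residual injection: each feature absorbs an attention-weighted
    # mixture of style information without being overwritten.
    return feats + attn @ v
```

In the paper's hierarchical design, a block like this would presumably run at several feature resolutions of the diffusion U-Net; here it is shown at a single scale for clarity.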

Abstract

Facial attribute editing and style manipulation are crucial for applications like virtual avatars and photo editing. However, achieving precise control over facial attributes without altering unrelated features is challenging due to the complexity of facial structures and the strong correlations between attributes. While conditional GANs have shown progress, they are limited by accuracy issues and training instability. Diffusion models, though promising, face challenges in style manipulation due to the limited expressiveness of semantic directions. In this paper, we propose LatRef-Diff, a novel diffusion-based framework that addresses these limitations. We replace the traditional semantic directions in diffusion models with style codes and propose two methods for generating them: latent and reference guidance. Based on these style codes, we design a style modulation module that integrates them into the target image, enabling both random and customized style manipulation. This module incorporates learnable vectors, cross-attention mechanisms, and a hierarchical design to improve accuracy and image quality. Additionally, to enhance training stability while eliminating the need for paired images (e.g., before and after editing), we propose a forward-backward consistency training strategy. This strategy first removes the target attribute approximately using image-specific semantic directions and then restores it via style modulation, guided by perceptual and classification losses. Extensive experiments on CelebA-HQ demonstrate that LatRef-Diff achieves state-of-the-art performance in both qualitative and quantitative evaluations. Ablation studies validate the effectiveness of our model's design choices.
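
The forward-backward consistency objective described above can be made concrete with a small sketch. Everything here is a hypothetical reading of the abstract: the edit operators, the perceptual feature extractor, the classifier, and the loss weighting are stand-ins, not the paper's actual components:

```python
import numpy as np

def l2(a, b):
    # Mean squared error between two feature arrays.
    return float(np.mean((a - b) ** 2))

def forward_backward_loss(x, edit_remove, edit_restore,
                          perceptual, classifier, y_attr, lam=1.0):
    """Hypothetical forward-backward consistency objective.

    x:            original (unpaired) image array
    edit_remove:  approximately removes the target attribute
                  (e.g. a step along an image-specific semantic direction)
    edit_restore: restores the attribute via style modulation
    perceptual:   feature extractor for the perceptual loss
    classifier:   returns attribute probabilities for an image
    y_attr:       index of the target attribute class
    lam:          assumed weighting between the two loss terms
    """
    x_removed = edit_remove(x)            # forward pass: strip the attribute
    x_restored = edit_restore(x_removed)  # backward pass: put it back
    # Perceptual consistency: the restored image should match the original.
    loss_perc = l2(perceptual(x_restored), perceptual(x))
    # Classification: the restored image should carry the target attribute.
    loss_cls = -float(np.log(classifier(x_restored)[y_attr] + 1e-8))
    return loss_perc + lam * loss_cls
```

The appeal of this cycle is that it needs no before/after image pairs: the original image itself serves as the reconstruction target after the remove-then-restore round trip.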