MD-Face: MoE-Enhanced Label-Free Disentangled Representation for Interactive Facial Attribute Editing

arXiv cs.CV / 4/23/2026


Key Points

  • The paper introduces MD-Face, a label-free method to learn disentangled facial representations for more reliable GAN-based attribute editing without unintended attribute changes.
  • MD-Face uses a Mixture of Experts (MoE) backbone with a gating mechanism to assign experts dynamically, aiming to learn more independent semantic vectors.
  • To reduce attribute entanglement further, it proposes a geometry-aware loss that aligns each semantic vector with a corresponding Semantic Boundary Vector (SBV) using a Jacobian-based pushforward approach.
  • Experiments on ProGAN and StyleGAN indicate MD-Face outperforms unsupervised baselines and is competitive with supervised disentanglement methods.
  • Compared with diffusion-based editing methods, the approach reports better image quality and lower inference latency, supporting interactive facial editing use cases.
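The MoE-with-gating idea in the key points above can be illustrated with a minimal sketch. The paper does not specify the expert architecture, so the sketch below assumes each expert is a simple linear map and the gate is a learned linear scorer followed by a softmax; `moe_forward`, `expert_weights`, and `gate_weights` are hypothetical names, not the paper's API.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(z, expert_weights, gate_weights):
    """Hypothetical MoE layer: gate scores a latent z, then mixes expert outputs.

    z              : latent vector, shape (d,)
    expert_weights : list of K expert matrices, each shape (d, d)
    gate_weights   : gating matrix, shape (K, d)
    """
    gates = softmax(gate_weights @ z)            # (K,) mixing probabilities
    outputs = np.stack([W @ z for W in expert_weights])  # (K, d) per-expert outputs
    return gates @ outputs                       # convex combination, shape (d,)
```

With a zero gate the mixture is uniform over experts, which makes the behavior easy to check by hand; in the paper's setting the gate would instead be trained so that different experts specialize in different semantic directions.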

Abstract

GAN-based facial attribute editing is widely used in virtual avatars and social media but often suffers from attribute entanglement, where modifying one face attribute unintentionally alters others. While supervised disentangled representation learning can address this, it relies heavily on labeled data, incurring high annotation costs. To address these challenges, we propose MD-Face, a label-free disentangled representation learning framework based on Mixture of Experts (MoE). MD-Face uses a MoE backbone with a gating mechanism that dynamically allocates experts, enabling the model to learn semantic vectors with greater independence. To further reduce attribute entanglement, we introduce a geometry-aware loss, which aligns each semantic vector with its corresponding Semantic Boundary Vector (SBV) through a Jacobian-based pushforward method. Experiments with ProGAN and StyleGAN show that MD-Face outperforms unsupervised baselines and competes with supervised ones. Compared to diffusion-based methods, it offers better image quality and lower inference latency, making it well suited to interactive editing.
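The geometry-aware loss described in the abstract can be sketched as follows. The paper's exact formulation is not given here, so this is a minimal illustration under stated assumptions: the pushforward is approximated by a finite-difference Jacobian-vector product `J_G(z) v ≈ (G(z + εv) − G(z)) / ε`, the SBV is assumed to live in the same space as the pushforward output, and alignment is scored with cosine similarity. The names `pushforward` and `alignment_loss` are illustrative, not the paper's.

```python
import numpy as np

def pushforward(G, z, v, eps=1e-3):
    # Finite-difference Jacobian-vector product: J_G(z) v ~ (G(z + eps*v) - G(z)) / eps
    return (G(z + eps * v) - G(z)) / eps

def alignment_loss(G, z, v, sbv):
    """Penalize misalignment between the pushed-forward semantic vector and its SBV.

    G   : mapping (e.g., a generator stage), callable on a latent vector
    z   : base latent vector
    v   : semantic vector to be disentangled
    sbv : Semantic Boundary Vector (assumed same space as pushforward output)
    """
    jv = pushforward(G, z, v)
    cos = jv @ sbv / (np.linalg.norm(jv) * np.linalg.norm(sbv) + 1e-8)
    return 1.0 - cos  # zero when perfectly aligned
```

For a linear `G` the finite-difference pushforward is exact, so a semantic vector aligned with its SBV yields a loss of approximately zero; in training, minimizing this term would pull each learned semantic direction toward its attribute boundary.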