PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment

arXiv cs.CV / 4/22/2026


Key Points

  • PortraitDirector proposes a hierarchical, compositional approach to facial reenactment to resolve the common trade-off between expressiveness and fine-grained controllability.
  • The framework disentangles facial motion into a Spatial Layer (global head pose plus local expressions with emotional cues filtered out) and a Semantic Layer (global emotion), then recomposes them into an expressive motion latent; a toy composition sketch follows this list.
  • An Emotion-Filtering Module based on an information bottleneck removes emotional signals from the local expression components, improving disentanglement quality; a generic bottleneck sketch also follows the list.
  • To enable real-time use, the method applies optimizations such as diffusion distillation, causal attention, and VAE acceleration.
  • The paper reports streaming 512×512 reenactment at 20 FPS with end-to-end ~800 ms latency on a single NVIDIA 5090 GPU.
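
To make the layered design concrete, here is a minimal PyTorch-style sketch of a disentangle-then-compose step. Every module name and dimension below is an illustrative assumption, not the paper's published architecture.

```python
import torch
import torch.nn as nn

class HierarchicalMotionComposer(nn.Module):
    """Illustrative disentangle-then-compose step (not the paper's code).

    Spatial Layer: global head pose + per-region local expressions.
    Semantic Layer: a derived global emotion code.
    All dimensions are assumptions made for exposition.
    """

    def __init__(self, pose_dim=64, expr_dim=128, emo_dim=32, latent_dim=256):
        super().__init__()
        self.pose_encoder = nn.Linear(6, pose_dim)    # head rotation + translation
        self.expr_encoder = nn.Linear(512, expr_dim)  # features from cropped regions
        self.emo_encoder = nn.Linear(512, emo_dim)    # global emotion features
        self.compose = nn.Linear(pose_dim + expr_dim + emo_dim, latent_dim)

    def forward(self, head_pose, region_feats, emotion_feats):
        z_pose = self.pose_encoder(head_pose)       # (B, pose_dim)
        z_expr = self.expr_encoder(region_feats)    # (B, expr_dim)
        z_emo = self.emo_encoder(emotion_feats)     # (B, emo_dim)
        # Recompose the disentangled components into one motion latent
        return self.compose(torch.cat([z_pose, z_expr, z_emo], dim=-1))
```

In the paper's pipeline, the local expression features would first pass through the Emotion-Filtering Module (sketched next) so that emotional content enters only through the Semantic Layer.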
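
The abstract describes the Emotion-Filtering Module only as "leveraging an information bottleneck." A common way to realize such a bottleneck is the variational information bottleneck: squeeze the feature through a narrow stochastic code and penalize its KL divergence from a unit Gaussian, which caps how much information, including emotion cues, can leak through. The sketch below shows that generic mechanism under assumed dimensions, not the paper's concrete module.

```python
import torch
import torch.nn as nn

class EmotionFilterVIB(nn.Module):
    """Generic variational-information-bottleneck filter (illustrative only).

    The KL penalty on the stochastic code limits its information content;
    an auxiliary loss (not shown) would additionally penalize recovering
    emotion labels from the code. Dimensions are assumptions.
    """

    def __init__(self, in_dim=128, bottleneck_dim=16):
        super().__init__()
        self.to_stats = nn.Linear(in_dim, 2 * bottleneck_dim)  # mu and log-variance
        self.decode = nn.Linear(bottleneck_dim, in_dim)

    def forward(self, expr_feats):
        mu, logvar = self.to_stats(expr_feats).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        # KL( q(z|x) || N(0, I) ): the bottleneck pressure term
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
        return self.decode(z), kl
```

Training would weight the KL term against a reconstruction objective so the code keeps fine expression detail while discarding emotional content.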

Abstract

Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control often fall short on fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises (i) global head pose, managed via a dedicated representation and injection pathway, and (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via an Emotion-Filtering Module that leverages an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention, and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512×512 face reenactment at 20 FPS with an end-to-end latency of 800 ms on a single 5090 GPU.
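
Of the three real-time optimizations named in the abstract, causal attention is the most architecture-visible: if each frame attends only to past frames, keys and values can be cached and the model can run incrementally on a live stream instead of waiting for a whole clip. Below is a minimal sketch using PyTorch's built-in causal attention; the paper's exact attention layout is not specified in the abstract.

```python
import torch
import torch.nn.functional as F

def causal_frame_attention(q, k, v):
    """Frame t attends only to frames <= t (illustrative sketch).

    q, k, v: (batch, num_frames, dim). The causal mask is what makes
    streaming possible: past keys/values can be cached, and each new
    frame adds one query over the cached history at bounded cost.
    """
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Example: 16 frames of 256-d tokens
q = k = v = torch.randn(1, 16, 256)
out = causal_frame_attention(q, k, v)  # (1, 16, 256)
```

For scale: at 20 FPS the frame interval is 50 ms, so the reported 800 ms end-to-end latency corresponds to roughly 16 frames of pipeline depth between capture and display.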