PerformRecast: Expression and Head Pose Disentanglement for Portrait Video Editing

arXiv cs.CV / 3/23/2026

💬 Opinion · Tools & Practical Usage · Models & Research

Key Points

  • PerformRecast proposes an expression-only portrait video editing approach that disentangles facial expression from head pose using a 3D Morphable Face Model (3DMM).
  • The method improves the keypoints transformation to align with the 3DMM, enabling finer control over expressions while preserving identity and head motion.
  • It decouples facial and non-facial regions to reduce boundary misalignment and employs a teacher model to provide region-specific supervision, boosting result quality and stability.
  • Extensive experiments show higher fidelity to the driving video, better controllability, and improved efficiency compared with existing methods, with code, data, and trained models released online.
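The disentanglement described above follows the 3DMM factorization, in which expression deforms a canonical identity mesh before a rigid head pose is applied. A minimal sketch of the resulting keypoint recomposition is shown below; the function name, argument layout, and the exact transformation formula are illustrative assumptions (in the spirit of face-vid2vid-style keypoint transforms), not the paper's actual implementation.

```python
import numpy as np

def recompose_keypoints(k_canonical, exp_driving, R_source, t_source, s_source):
    """Illustrative sketch: combine the driving video's expression offsets
    with the source clip's head pose, mirroring the 3DMM ordering
    (identity keypoints deformed by expression, then rigidly posed).

    k_canonical : (N, 3) canonical identity keypoints
    exp_driving : (N, 3) per-keypoint expression deformation from the driving video
    R_source    : (3, 3) head rotation taken from the source clip
    t_source    : (3,)   head translation taken from the source clip
    s_source    : float  global scale taken from the source clip
    """
    # Expression deforms the canonical (identity) keypoints first ...
    deformed = k_canonical + exp_driving
    # ... then the source head pose is applied as a rigid transform, so
    # editing exp_driving changes the expression without touching head motion.
    return s_source * deformed @ R_source.T + t_source
```

Because pose parameters come only from the source clip, swapping in a different driving video changes the expression term alone, which is the expression-only editing behavior the paper targets.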

Abstract

This paper investigates the task of expression-only portrait video performance editing based on a driving video, which plays a crucial role in the animation and film industries. Most existing research focuses on portrait animation, which animates a static portrait image according to the facial motion of a driving video. As a consequence, such methods struggle to disentangle facial expression from head pose rotation and thus lack the ability to edit facial expression independently. In this paper, we propose PerformRecast, a versatile expression-only video editing method dedicated to recasting performances in existing film and animation. The key insight of our method comes from the characteristics of the 3D Morphable Face Model (3DMM), which models the face identity, facial expression, and head pose of a 3D face mesh with separate parameters. We therefore improve the keypoint transformation formula used in previous methods to make it more consistent with the 3DMM, achieving better disentanglement and providing users with much finer-grained control. Furthermore, to avoid misalignment around the face boundary in generated results, we decouple the facial and non-facial regions of input portrait images and pre-train a teacher model to provide separate supervision for each. Extensive experiments show that our method produces high-quality results that are more faithful to the driving video, outperforming existing methods in both controllability and efficiency. Our code, data, and trained models are available at https://youku-aigc.github.io/PerformRecast.
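The region-decoupled teacher supervision mentioned in the abstract can be sketched as a pair of masked losses, one per region, each compared against a pre-trained teacher's output. The loss form, normalization, and all names below are illustrative assumptions rather than the paper's actual training objective.

```python
import numpy as np

def region_decoupled_loss(pred, teacher_face, teacher_bg, face_mask):
    """Illustrative sketch (not the paper's code): supervise the facial and
    non-facial regions of a generated frame separately, each against a
    pre-trained teacher's prediction for that region, so the two regions
    receive independent gradients and boundary misalignment is reduced.

    pred, teacher_face, teacher_bg : image-shaped arrays
    face_mask : same shape, 1.0 inside the facial region, 0.0 outside
    """
    face_err = np.abs(pred - teacher_face) * face_mask
    bg_err = np.abs(pred - teacher_bg) * (1.0 - face_mask)
    # Normalize each term by its region's area so neither region dominates
    # simply because it covers more pixels.
    eps = 1e-8
    return (face_err.sum() / (face_mask.sum() + eps)
            + bg_err.sum() / ((1.0 - face_mask).sum() + eps))
```

When the prediction matches both teachers inside their respective regions, the loss vanishes, so each region is judged only against the supervision signal intended for it.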