DiffMagicFace: Identity Consistent Facial Editing of Real Videos

arXiv cs.CV / 4/16/2026


Key Points

  • DiffMagicFace is a framework that extends text-conditioned diffusion models to facial editing of real videos, aiming to preserve the subject's facial identity after editing while keeping the edits semantically aligned.
  • Two fine-tuned models (one for text control, one for image control) run concurrently at inference, a design that keeps the edited subject consistently aligned across frames while preserving the same person's identity.
  • To improve editing consistency, a dataset showing diverse facial viewpoints for each edited subject is constructed using rendering techniques and optimization methods.
  • Despite not relying on any video dataset, the method delivers high-quality results in both consistency and content on complex tasks such as talking-head videos, and is claimed to be on par with videos produced with rendering software.
  • In comparisons with existing state-of-the-art methods, the framework reports superior performance in both visual appeal and quantitative metrics.

Abstract

Text-conditioned image editing has greatly benefited from advancements in image diffusion models. However, extending these techniques to facial video editing introduces challenges in preserving facial identity throughout the source video and ensuring consistency of the edited subject across frames. In this paper, we introduce DiffMagicFace, a video editing framework that integrates two fine-tuned models for text and image control. These models operate concurrently during inference to produce video frames that maintain identity features while seamlessly aligning with the editing semantics. To ensure the consistency of the edited videos, we develop a dataset comprising images showcasing various facial perspectives for each edited subject. This dataset is created through rendering techniques followed by optimization algorithms. Remarkably, our approach does not depend on video datasets, yet still delivers high-quality results in both consistency and content. This quality holds even for complex tasks such as talking-head videos and distinguishing between closely related categories. Videos edited with our framework are on par with videos produced using traditional rendering software. Through comparative analysis with current state-of-the-art methods, our framework demonstrates superior performance in both visual appeal and quantitative metrics.
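The idea of running a text-controlled and an image-controlled model concurrently at inference can be sketched as below. This is a minimal illustration under assumptions: the paper does not specify its fusion rule, so the linear blend of per-step noise predictions, the blending weights, and the simplified deterministic denoising loop are all hypothetical stand-ins, with the two models replaced by toy functions.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_model_eps(x, t):
    # Stand-in for the fine-tuned text-controlled model's noise
    # prediction (hypothetical; the real model conditions on the
    # editing prompt).
    return 0.1 * x

def image_model_eps(x, t):
    # Stand-in for the fine-tuned image-controlled model's noise
    # prediction (hypothetical; the real model conditions on
    # identity-reference imagery of the subject).
    return 0.05 * x

def combined_eps(x, t, w_text=0.5, w_image=0.5):
    # One plausible way to run both models "concurrently": blend
    # their per-step noise predictions with scalar weights.
    # The weights are assumptions, not values from the paper.
    return w_text * text_model_eps(x, t) + w_image * image_model_eps(x, t)

def denoise(x, steps=10, step_size=0.5):
    # Simplified deterministic denoising loop; the paper's actual
    # sampler and noise schedule are not specified here.
    for t in reversed(range(steps)):
        x = x - step_size * combined_eps(x, t)
    return x

frame = rng.standard_normal((4, 4))   # toy latent for one video frame
edited = denoise(frame)
```

In practice each frame's latent would pass through such a fused sampler, so identity cues from the image branch and edit semantics from the text branch both shape every denoising step.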