Micro-Expression-Aware Avatar Fingerprinting via Inter-Frame Feature Differencing

arXiv cs.CV, April 28, 2026


Key Points

  • The paper introduces an end-to-end avatar fingerprinting approach that verifies who generated a synthetic talking-head video, focusing on driver identity rather than real-vs-fake authenticity.
  • It replaces the fixed, non-differentiable landmark extraction stage with a preprocessing-free pipeline that applies a micro-expression-aware backbone directly to raw video frames.
  • The core method performs inter-frame feature differencing: consecutive feature maps are subtracted in the learned deep feature space, so temporally stable appearance cues cancel while driver-specific motion dynamics are preserved (a minimal sketch follows this list).
  • Ablation experiments on NVFAIR show that temporal motion provides most of the discriminative power and that raw appearance features can harm identity separation.
  • The proposed system reports an overall AUC of 0.877 on NVFAIR and matches or outperforms landmark-based baselines on most cross-generator evaluation pairs.
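The differencing step cancels appearance by construction: if each frame's deep feature decomposes as f_t = a + m_t, with a a temporally stable appearance component and m_t the frame's motion component, then f_{t+1} - f_t = m_{t+1} - m_t, and a contributes exactly zero. The snippet below is a minimal, hypothetical PyTorch sketch of that operation, not the paper's implementation; `backbone` stands in for the micro-expression-aware F5C encoder, and all tensor shapes are illustrative assumptions.

```python
import torch

def interframe_feature_differencing(frames: torch.Tensor,
                                    backbone: torch.nn.Module) -> torch.Tensor:
    """Subtract consecutive deep feature maps so that temporally stable
    appearance dimensions cancel and motion dynamics remain.

    frames: (B, T, C, H, W) raw video clip; backbone is assumed to map a
    batch of frames (N, C, H, W) to feature maps (N, D, h, w).
    """
    B, T, C, H, W = frames.shape
    # Encode every frame independently with the shared backbone.
    feats = backbone(frames.reshape(B * T, C, H, W))       # (B*T, D, h, w)
    feats = feats.reshape(B, T, *feats.shape[1:])          # (B, T, D, h, w)
    # Difference adjacent frames: static appearance channels subtract to
    # ~0, while driver-specific micro-motion survives in the residual.
    diffs = feats[:, 1:] - feats[:, :-1]                   # (B, T-1, D, h, w)
    # Downstream, these motion residuals would be pooled into a clip-level
    # identity embedding (pooling strategy not specified by the source).
    return diffs
```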

Abstract

Avatar fingerprinting, i.e., verifying who drives a synthetic talking-head video rather than whether it is real, is a critical safeguard for authorized use of face-reenactment technology. Existing methods rely on a fixed, non-differentiable landmark extraction stage that prevents the fingerprinting model from being optimized end-to-end from raw pixels. We propose a preprocessing-free system built on a micro-expression-aware backbone operating on raw video frames, with inter-frame feature differencing as the core design principle: consecutive feature maps are subtracted in the learned deep feature space, so that temporally stable appearance dimensions contribute zero to the output while driver-specific motion dynamics are preserved. A controlled ablation on NVFAIR confirms that temporal motion accounts for the large majority of discriminative performance, and that raw appearance features actively degrade identity separation. Both the choice of backbone and the differencing principle are essential: differencing alone is insufficient when applied to a generic encoder, as appearance-dominated features collapse to near-identical representations across adjacent frames, while the micro-expression-aware F5C backbone retains measurable motion variation that the differencing operation can exploit. Without any external preprocessing, our model achieves an overall AUC of 0.877 on NVFAIR and matches or exceeds the landmark-based baseline on the majority of cross-generator pairs.
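The reported AUC refers to a verification setting: pairs of synthetic clips are scored for whether the same driver produced both. As a rough illustration only, the sketch below computes such an AUC from clip-level embeddings, assuming cosine similarity as the match score and scikit-learn for the metric; neither choice is stated in the source.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def verification_auc(emb_a: np.ndarray, emb_b: np.ndarray,
                     same_driver: np.ndarray) -> float:
    """Compute verification AUC from paired clip embeddings.

    emb_a, emb_b:  (N, D) embeddings for the two clips in each pair.
    same_driver:   (N,) binary labels, 1 if both clips share a driver.
    Cosine similarity as the match score is an assumption for illustration.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    scores = np.sum(a * b, axis=1)  # cosine similarity per pair
    return roc_auc_score(same_driver, scores)
```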
