Modeling Biomechanical Constraint Violations for Language-Agnostic Lip-Sync Deepfake Detection

arXiv cs.CV / 4/21/2026


Key Points

  • The paper argues that existing lip-sync deepfake detectors overfit to pixel-level or audio-visual cues that don’t generalize across languages.
  • It proposes a language-agnostic detection signal based on a biomechanical violation: generative models fail to enforce natural orofacial articulation constraints, leading to elevated temporal lip variance.
  • They term this elevated variance “temporal lip jitter,” showing it remains consistent across language, ethnicity, and recording conditions.
  • They introduce BioLip, a lightweight framework that operationalizes this principle by operating on 64 perioral landmark coordinates extracted by MediaPipe rather than on raw pixels (see the sketch after this list).
  • The approach is positioned as more universal than artifact-based methods by tying detection to physical plausibility rather than data-dependent patterns.
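
To make the mechanism concrete, here is a minimal sketch of how such a jitter score might be computed. This is not the paper's implementation: the exact 64-point perioral index set and the precise variance statistic are not given in this summary, so the sketch stands in the unique indices from MediaPipe's FACEMESH_LIPS connectivity set and uses the mean per-landmark variance of frame-to-frame displacement as one plausible reading of "temporal lip variance."

```python
# Hedged sketch of a temporal lip-jitter score from MediaPipe Face Mesh.
# Assumptions (not from the paper): the lip-index set below (~40 points from
# FACEMESH_LIPS) stands in for the paper's 64 perioral landmarks, and the
# statistic is one plausible operationalization, not BioLip's exact feature.
import cv2
import numpy as np
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
# Unique lip-landmark indices derived from the FACEMESH_LIPS edge set.
LIP_IDX = sorted({i for edge in mp_face_mesh.FACEMESH_LIPS for i in edge})

def lip_jitter_score(video_path: str) -> float:
    """Return a scalar jitter score; higher suggests synthetic lip motion."""
    cap = cv2.VideoCapture(video_path)
    tracks = []  # per-frame (n_landmarks, 2) arrays of normalized lip coords
    with mp_face_mesh.FaceMesh(static_image_mode=False,
                               max_num_faces=1,
                               refine_landmarks=True) as mesh:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not result.multi_face_landmarks:
                continue  # skip frames where no face is tracked
            lm = result.multi_face_landmarks[0].landmark
            tracks.append(np.array([[lm[i].x, lm[i].y] for i in LIP_IDX]))
    cap.release()
    if len(tracks) < 3:
        raise ValueError("too few tracked frames for a jitter estimate")
    coords = np.stack(tracks)           # (T, n_landmarks, 2)
    velocity = np.diff(coords, axis=0)  # frame-to-frame displacement
    # Variance of displacement per landmark/axis, averaged to one score.
    return float(velocity.var(axis=0).mean())
```

In use, a clip scoring above a threshold calibrated on known real and synthetic videos would be flagged as likely lip-synced; the paper's central claim is that, because the signal reflects a biomechanical constraint rather than dataset-specific artifacts, such a threshold should transfer across languages and recording conditions.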

Abstract

Current lip-sync deepfake detectors rely on pixel-level artifacts or audio-visual correspondence, failing to generalize across languages because these cues encode data-dependent patterns rather than universal physical laws. We identify a more fundamental principle: generative models do not enforce the biomechanical constraints of authentic orofacial articulation, producing measurably elevated temporal lip variance -- a signal we term temporal lip jitter -- that is empirically consistent across the speaker's language, ethnicity, and recording conditions. We instantiate this principle through BioLip, a lightweight framework operating on 64 perioral landmark coordinates extracted by MediaPipe.