Modeling Biomechanical Constraint Violations for Language-Agnostic Lip-Sync Deepfake Detection

arXiv cs.CV / 4/21/2026


Key Points

  • The paper argues that existing lip-sync deepfake detectors overfit to pixel-level or audio-visual cues that don’t generalize across languages.
  • It proposes a language-agnostic detection signal based on a biomechanical violation: generative models fail to enforce natural orofacial articulation constraints, leading to elevated temporal lip variance.
  • They term this elevated variance “temporal lip jitter,” showing it remains consistent across language, ethnicity, and recording conditions.
  • They introduce BioLip, a lightweight framework that operationalizes this principle by operating on 64 perioral landmark coordinates extracted by MediaPipe rather than on raw pixels (see the sketch after this list).
  • The approach is positioned as more universal than artifact-based methods by tying detection to physical plausibility rather than data-dependent patterns.
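
To make the mechanism concrete, here is a minimal sketch of how such a jitter score might be computed. This is not the paper's implementation: the exact 64-point perioral index set and the precise variance statistic are not given in this summary, so the sketch stands in the unique indices from MediaPipe's FACEMESH_LIPS connectivity set and uses the mean per-landmark variance of frame-to-frame displacement as one plausible reading of "temporal lip variance."

```python
# Hedged sketch of a temporal lip-jitter score from MediaPipe Face Mesh.
# Assumptions (not from the paper): the lip-index set below (~40 points from
# FACEMESH_LIPS) stands in for the paper's 64 perioral landmarks, and the
# statistic is one plausible operationalization, not BioLip's exact feature.
import cv2
import numpy as np
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh
# Unique lip-landmark indices derived from the FACEMESH_LIPS edge set.
LIP_IDX = sorted({i for edge in mp_face_mesh.FACEMESH_LIPS for i in edge})

def lip_jitter_score(video_path: str) -> float:
    """Return a scalar jitter score; higher suggests synthetic lip motion."""
    cap = cv2.VideoCapture(video_path)
    tracks = []  # per-frame (n_landmarks, 2) arrays of normalized lip coords
    with mp_face_mesh.FaceMesh(static_image_mode=False,
                               max_num_faces=1,
                               refine_landmarks=True) as mesh:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not result.multi_face_landmarks:
                continue  # skip frames where no face is tracked
            lm = result.multi_face_landmarks[0].landmark
            tracks.append(np.array([[lm[i].x, lm[i].y] for i in LIP_IDX]))
    cap.release()
    if len(tracks) < 3:
        raise ValueError("too few tracked frames for a jitter estimate")
    coords = np.stack(tracks)           # (T, n_landmarks, 2)
    velocity = np.diff(coords, axis=0)  # frame-to-frame displacement
    # Variance of displacement per landmark/axis, averaged to one score.
    return float(velocity.var(axis=0).mean())
```

In use, a clip scoring above a threshold calibrated on known real and synthetic videos would be flagged as likely lip-synced; the paper's central claim is that, because the signal reflects a biomechanical constraint rather than dataset-specific artifacts, such a threshold should transfer across languages and recording conditions.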

Abstract

Current lip-sync deepfake detectors rely on pixel-level artifacts or audio-visual correspondence, failing to generalize across languages because these cues encode data-dependent patterns rather than universal physical laws. We identify a more fundamental principle: generative models do not enforce the biomechanical constraints of authentic orofacial articulation, producing measurably elevated temporal lip variance -- a signal we term temporal lip jitter -- that is empirically consistent across the speaker's language, ethnicity, and recording conditions. We instantiate this principle through BioLip, a lightweight framework operating on 64 perioral landmark coordinates extracted by MediaPipe.