Energy-Based Constraint Networks: Learning Structural Coherence Across Modalities

arXiv cs.CV / 5/5/2026


Key Points

  • The paper proposes an energy-based constraint network that learns structural coherence in a modality-agnostic way using contrastive pairs and outputs both a global energy score and per-position violation localization.
  • It applies the same architecture to text and vision by freezing pretrained encoders (BERT for text, DINOv2 for vision) and training only a small number of parameters in an energy/state-space + dual-head attention setup.
  • The model performs strongly in text corruption detection (93.4% accuracy on trained corruption types and 87.2% on nine unseen types) and achieves competitive deepfake detection without any Celeb-DF training data (AUC 0.959 on FaceForensics++ Deepfakes, 0.870 on Celeb-DF).
  • Multiple independently trained branches can detect different violation types and be composed at inference, provided their representations are compatible; the authors report that five incompatible approaches failed before a compatible design succeeded.
  • The framework is flexible and reusable across domains and encoders: changing domains mainly requires new corruption strategies, while changing encoders can be done via a new input projection layer, enabling cross-modal transfer through “corruption respecification.”
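The architecture described in the key points can be caricatured in a few lines. The sketch below is purely illustrative and not the paper's code: `EnergyBranch`, its weight names, and the tanh projection are hypothetical stand-ins for the paper's state-space + dual-head attention block. It shows the two outputs the paper describes (a global scalar energy and per-position energies) and how independently trained branches could compose by summing energies.

```python
import numpy as np

rng = np.random.default_rng(0)

class EnergyBranch:
    """Minimal sketch of one energy branch (hypothetical names, not the paper's code).

    A frozen encoder is assumed to have produced embeddings x of shape (T, d_in).
    A small trainable projection maps them to a hidden state; two heads read out
    (a) per-position energies and (b) an attention-pooled global scalar energy.
    """

    def __init__(self, d_in, d_hidden, rng):
        # input projection: per the paper, the only part swapped when changing encoders
        self.W_in = rng.normal(0, 0.02, (d_in, d_hidden))
        # per-position ("local") energy head
        self.w_local = rng.normal(0, 0.02, (d_hidden,))
        # attention weights and global energy head
        self.w_attn = rng.normal(0, 0.02, (d_hidden,))
        self.w_global = rng.normal(0, 0.02, (d_hidden,))

    def __call__(self, x):
        h = np.tanh(x @ self.W_in)                  # (T, d_hidden)
        e_pos = np.log1p(np.exp(h @ self.w_local))  # softplus: per-position energies >= 0
        a = np.exp(h @ self.w_attn)
        a /= a.sum()                                # attention pooling weights
        e_global = float(np.log1p(np.exp(a @ h @ self.w_global)))
        return e_global, e_pos

def compose(branches, x):
    """Compose independently trained branches by summing their energies."""
    globals_, locals_ = zip(*(b(x) for b in branches))
    return sum(globals_), np.sum(locals_, axis=0)

x = rng.normal(size=(16, 768))  # stand-in for frozen-encoder embeddings (e.g. BERT/DINOv2)
branches = [EnergyBranch(768, 64, rng) for _ in range(2)]
e, e_pos = compose(branches, x)  # scalar energy plus one energy per position
```

The per-position vector is what would localize a violation; the scalar would drive the detection decision. How the paper actually guarantees the "representation compatibility" needed for composition is not captured here.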

Abstract

We introduce energy-based constraint networks -- a modality-agnostic architecture that learns structural coherence from contrastive pairs. The system processes frozen encoder embeddings through a state-space model with dual-head attention, producing a scalar energy measuring structural consistency alongside per-position energy scores that localize violations. Multiple independently trained branches detect different violation types and compose at inference without interference. We demonstrate the framework in two domains. In text, the system achieves 93.4% accuracy on trained corruption types and 87.2% on 9 unseen types, using frozen BERT and 7.4M trainable parameters. In vision, the same architecture achieves competitive deepfake detection: 0.959 AUC on FaceForensics++ Deepfakes and 0.870 on Celeb-DF without any Celeb-DF training data, using frozen DINOv2 and 3.6M parameters per branch. The framework supports flexible training: branches learn from designer-specified corruptions, real-world paired data, or both. Composable branches require representation compatibility -- a finding validated through extensive experimentation where five incompatible approaches failed before the compatible one succeeded. The architecture is encoder-agnostic and domain-agnostic: changing the domain requires only new corruption strategies; changing the encoder requires only a new input projection layer. To our knowledge, this is the first architecture to learn within-modality structural coherence as an explicit energy landscape with per-position decomposition, and to demonstrate that the same architecture transfers across modalities via corruption respecification alone.
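The contrastive training the abstract describes (clean versus designer-corrupted pairs) can be illustrated with a toy margin objective. This is a sketch under assumptions: `margin_energy_loss` and `swap_corruption` are hypothetical names, and the paper's actual objective and corruption suite are not specified here.

```python
import numpy as np

def margin_energy_loss(e_clean, e_corrupt, margin=1.0):
    """Contrastive margin loss: clean samples should sit at least `margin`
    lower on the energy landscape than their corrupted counterparts.
    (Illustrative; not necessarily the paper's exact objective.)"""
    return float(np.mean(np.maximum(0.0, margin + e_clean - e_corrupt)))

def swap_corruption(tokens, rng):
    """One designer-specified text corruption: swap two random positions,
    breaking structural coherence while preserving content."""
    t = list(tokens)
    i, j = rng.choice(len(t), size=2, replace=False)
    t[i], t[j] = t[j], t[i]
    return t

rng = np.random.default_rng(0)
clean = "the quick brown fox jumps over the lazy dog".split()
corrupt = swap_corruption(clean, rng)

# Pretend energies from a trained branch: the corrupted sequence is
# assigned higher energy, so this pair contributes zero loss.
e_clean, e_corrupt = np.array([0.2]), np.array([1.5])
loss = margin_energy_loss(e_clean, e_corrupt)
```

Under this framing, "changing the domain requires only new corruption strategies" amounts to swapping `swap_corruption` for, say, a face-warping corruption, while the branch and loss stay unchanged.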