Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
arXiv cs.LG / 4/20/2026
Key Points
- The paper introduces HILBERT, a cross-attentive multimodal framework for learning document-level audio–text representations from long, segmented sequences in low-resource settings.
- It uses frozen pre-trained speech and language encoders to extract segment features, then aggregates them via cross-modal attention and self-attentive pooling to produce both modality-specific and joint embeddings.
- To better handle severe audio–text dimensional imbalance, HILBERT trains with a reciprocal dual contrastive objective that aligns audio-to-joint and text-to-joint representations instead of directly contrasting audio and text.
- Two additional regularizers improve stability during long-sequence fusion: a Centered Kernel Alignment (CKA) loss to preserve structural consistency and a mutual-information balancing loss to prevent one modality from dominating the joint space.
- For prediction, HILBERT uses a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations, and reports improved results—especially on highly imbalanced multi-class downstream tasks.
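The self-attentive pooling step described above collapses a variable-length sequence of segment features into a single document-level vector. A minimal NumPy sketch of one common formulation (the paper's exact parameterization is not given, so the projection `w` and scoring vector `v` here are hypothetical):

```python
import numpy as np

def self_attentive_pool(segments, w, v):
    """Pool a (T, d) sequence of segment features into one (d,) vector.

    w: (d, h) projection and v: (h,) scoring vector are hypothetical
    learned parameters; scores are softmax-normalized over segments.
    """
    scores = np.tanh(segments @ w) @ v      # (T,) one attention score per segment
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()             # softmax over the T segments
    return alpha @ segments                 # attention-weighted average, shape (d,)
```

With zero parameters the attention weights are uniform, so the pooled vector reduces to the plain mean of the segments, which is a useful sanity check.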
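The reciprocal dual contrastive objective aligns each modality with the joint embedding rather than contrasting audio against text directly. A sketch under the assumption that each direction is a standard InfoNCE loss with in-batch negatives (temperature and batch construction here are illustrative, not the paper's settings):

```python
import numpy as np

def info_nce(queries, keys, temperature=0.07):
    """InfoNCE where row i of `queries` should match row i of `keys`.

    Positives sit on the diagonal of the (B, B) cosine-similarity matrix;
    all other rows in the batch act as negatives.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    m = logits.max(axis=1, keepdims=True)   # stable log-sum-exp normalizer
    log_norm = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logits - log_norm)))

def dual_contrastive_loss(audio, text, joint, temperature=0.07):
    # Align audio-to-joint and text-to-joint instead of audio-to-text,
    # mirroring the reciprocal dual objective the summary describes.
    return info_nce(audio, joint, temperature) + info_nce(text, joint, temperature)
```

Because both terms share the joint embedding as the anchor, neither raw modality has to be directly comparable to the other, which is the point of routing the alignment through the joint space.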
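The structural-consistency regularizer is based on Centered Kernel Alignment. A sketch of the standard linear-kernel CKA between two batches of representations (the mutual-information balancing term is not sketched here, since the summary gives no concrete form for it); a structure-preserving loss would then penalize `1 - linear_cka(...)`:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n, d1) and Y (n, d2).

    Returns a similarity in [0, 1]: 1 means the two sets of features induce
    the same pairwise structure over the n examples, 0 means none is shared.
    """
    Xc = X - X.mean(axis=0)                 # center features per dimension
    Yc = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    denom = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return float(hsic / denom)
```

Linear CKA is invariant to isotropic scaling and orthogonal rotation of either representation, which makes it a natural choice for checking that fusion preserves the geometry of the pre-fusion features rather than their exact coordinates.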
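The MoE prediction head operates on the concatenation of the audio, text, and joint embeddings. A minimal soft-gated sketch, assuming linear experts and a linear gate (the paper's expert architecture, count, and routing scheme are not specified, so these are placeholders):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_classify(fused, gate_w, expert_ws):
    """Soft mixture-of-experts over fused features.

    fused:     (B, d) concatenation of audio, text, and joint embeddings
    gate_w:    (d, E) linear gate producing per-example expert weights
    expert_ws: list of E (d, C) linear experts mapping to class logits
    """
    gates = softmax(fused @ gate_w)                                    # (B, E)
    expert_logits = np.stack([fused @ w for w in expert_ws], axis=1)   # (B, E, C)
    return np.einsum("be,bec->bc", gates, expert_logits)               # (B, C)
```

Soft gating lets different experts specialize on different regions of the fused space, which is one plausible reason an MoE head helps on the highly imbalanced multi-class tasks the summary mentions.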