Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

arXiv cs.LG · April 20, 2026


Key Points

  • The paper introduces HILBERT, a cross-attentive multimodal framework for learning document-level audio–text representations from long, segmented sequences in low-resource settings.
  • It uses frozen pre-trained speech and language encoders to extract segment features, then aggregates them via cross-modal attention and self-attentive pooling to produce both modality-specific and joint embeddings.
  • To better handle severe audio–text dimensional imbalance, HILBERT trains with a reciprocal dual contrastive objective that aligns audio-to-joint and text-to-joint representations instead of directly contrasting audio and text.
  • Two additional regularizers improve stability during long-sequence fusion: a Centered Kernel Alignment (CKA) loss to preserve structural consistency and a mutual-information balancing loss to prevent one modality from dominating the joint space.
  • For prediction, HILBERT uses a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations, and reports improved results—especially on highly imbalanced multi-class downstream tasks.
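The joint-centric alignment described above can be sketched as a pair of symmetric InfoNCE terms, one pulling audio toward the joint embedding and one pulling text toward it. This is a minimal NumPy illustration under my own assumptions (function names, batch-diagonal positives, and the 0.07 temperature are hypothetical, not taken from the paper):

```python
import numpy as np

def _cross_entropy(logits, labels):
    """Row-wise softmax cross-entropy against integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE: each anchor's positive is the same-index row of `positives`."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature  # (B, B) cosine similarities
    labels = np.arange(len(a))        # matched pairs sit on the diagonal
    return _cross_entropy(logits, labels)

def reciprocal_dual_contrastive(z_audio, z_text, z_joint, temperature=0.07):
    """Align each modality with the joint embedding, in both directions,
    rather than contrasting audio against text directly."""
    loss_aj = 0.5 * (info_nce(z_audio, z_joint, temperature)
                     + info_nce(z_joint, z_audio, temperature))
    loss_tj = 0.5 * (info_nce(z_text, z_joint, temperature)
                     + info_nce(z_joint, z_text, temperature))
    return loss_aj + loss_tj
```

Routing both modalities through the joint embedding, instead of pitting audio against text, is what lets the objective tolerate the audio-text imbalance the paper targets: neither modality has to match the other's geometry directly.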

Abstract

We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, HILBERT employs a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations to accommodate heterogeneous label regimes. Extensive evaluation across multiple audio-text backbone combinations demonstrates that HILBERT learns semantically meaningful long-sequence representations and achieves superior performance on highly imbalanced multi-class settings.
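The CKA regularizer in the abstract measures how well the joint embedding preserves each modality's similarity structure. A minimal sketch of the standard linear CKA, with a 1 - CKA penalty (the function names and the exact form of the penalty are my assumptions; the paper may use a kernel variant or different weighting):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between (batch, dim) feature sets.
    Returns a value in [0, 1]; 1 means identical similarity structure."""
    Xc = X - X.mean(axis=0, keepdims=True)  # center over the batch
    Yc = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Yc.T @ Xc, 'fro') ** 2
    return hsic / (np.linalg.norm(Xc.T @ Xc, 'fro')
                   * np.linalg.norm(Yc.T @ Yc, 'fro'))

def cka_structure_loss(z_modality, z_joint):
    """Structure-preservation penalty: grows as the joint embedding's
    pairwise-similarity structure drifts away from the modality's."""
    return 1.0 - linear_cka(z_modality, z_joint)
```

Because CKA compares Gram matrices rather than raw coordinates, it is invariant to rotation and isotropic scaling, so the two spaces need not share a dimensionality, which suits the audio-text dimensional imbalance the paper emphasizes.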