AI Navigate

AlignMamba-2: Enhancing Multimodal Fusion and Sentiment Analysis with Modality-Aware Mamba

arXiv cs.AI / 3/20/2026

💬 Opinion · Models & Research

Key Points

  • AlignMamba-2 tackles the quadratic complexity of Transformer-based multimodal models and the limited global cross-modal interactions of sequential Mamba architectures by introducing a dual-alignment and modality-aware fusion framework.
  • The method employs dual regularization using Optimal Transport distance and Maximum Mean Discrepancy to enforce geometric and statistical consistency across modalities without adding any inference-time overhead.
  • It introduces a Modality-Aware Mamba layer based on a Mixture-of-Experts design with modality-specific and modality-shared experts to better handle data heterogeneity during fusion.
  • Experiments on dynamic time-series benchmarks (CMU-MOSI, CMU-MOSEI) and static image-text tasks (NYU-Depth V2, MVSA-Single) demonstrate state-of-the-art performance and improved efficiency across diverse tasks.
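The dual-alignment idea above can be illustrated with a minimal NumPy sketch: a kernel-based Maximum Mean Discrepancy term plus an entropic Optimal Transport distance computed with a few Sinkhorn iterations, summed into a single training-time regularizer. This is not the authors' implementation; the function names, RBF kernel choice, regularization strength, and toy embeddings are all illustrative assumptions.

```python
import numpy as np

def rbf_mmd(x, y, sigma=1.0):
    """Biased MMD estimator with an RBF kernel (illustrative choice)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def sinkhorn_ot(x, y, eps=0.1, n_iter=200):
    """Entropic OT distance between two uniform point clouds via Sinkhorn."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    C = C / C.max()                      # normalize cost to avoid underflow
    K = np.exp(-C / eps)
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    u = np.ones_like(a)
    for _ in range(n_iter):              # alternating scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # transport plan
    return (P * C).sum()

# Toy "modality" embeddings standing in for text/audio token features.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(32, 16))
audio_emb = rng.normal(size=(32, 16)) + 0.5
align_loss = sinkhorn_ot(text_emb, audio_emb) + rbf_mmd(text_emb, audio_emb)
```

Because both terms act only as a training loss on intermediate embeddings, they can be dropped at inference time, which is consistent with the paper's claim of zero inference-time overhead.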

Abstract

In the era of large-scale pre-trained models, effectively adapting general knowledge to specific affective computing tasks remains challenging, particularly with respect to computational efficiency and multimodal heterogeneity. While Transformer-based methods excel at modeling inter-modal dependencies, their quadratic computational complexity limits their use on long-sequence data. Mamba-based models have emerged as a computationally efficient alternative; however, their inherent sequential scanning mechanism struggles to capture the global, non-sequential relationships that are crucial for effective cross-modal alignment. To address these limitations, we propose AlignMamba-2, an effective and efficient framework for multimodal fusion and sentiment analysis. Our approach introduces a dual alignment strategy that regularizes the model with both Optimal Transport distance and Maximum Mean Discrepancy, promoting geometric and statistical consistency between modalities without incurring any inference-time overhead. More importantly, we design a Modality-Aware Mamba layer, which employs a Mixture-of-Experts architecture with modality-specific and modality-shared experts to explicitly handle data heterogeneity during fusion. Extensive experiments on four challenging benchmarks, spanning dynamic time-series analysis (CMU-MOSI, CMU-MOSEI) and static image-text tasks (NYU-Depth V2, MVSA-Single), demonstrate that AlignMamba-2 establishes a new state of the art in both effectiveness and efficiency.
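The modality-specific/modality-shared expert design can be sketched as follows: each token is routed to a linear expert chosen by its modality tag and, in parallel, through a shared expert, with the two paths combined before the fusion backbone. This is a simplified stand-in, not the paper's architecture; the expert weights, the fixed 0.5/0.5 mixing, and the modality names are illustrative assumptions (the actual layer sits inside a Mamba block).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy hidden size

# One linear "expert" per modality plus one modality-shared expert.
specific = {"text": rng.normal(size=(D, D)), "image": rng.normal(size=(D, D))}
shared = rng.normal(size=(D, D))

def modality_aware_layer(tokens, modality_ids):
    """Route each token through its modality-specific expert and the shared
    expert, then average the two paths (fixed mixing for illustration)."""
    out = np.empty_like(tokens)
    for i, (tok, mod) in enumerate(zip(tokens, modality_ids)):
        out[i] = 0.5 * (tok @ specific[mod]) + 0.5 * (tok @ shared)
    return out

tokens = rng.normal(size=(4, D))
mods = ["text", "text", "image", "image"]
fused = modality_aware_layer(tokens, mods)
```

The intuition mirrors the abstract: specific experts absorb per-modality statistics (handling heterogeneity), while the shared expert keeps a common representation space so heterogeneous tokens can still be fused by one sequence model.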