Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI

arXiv cs.CV / 4/8/2026


Key Points

  • The paper introduces NeuroQuant, a modality-aware, anatomically grounded 3D vector-quantized VAE designed to reconstruct multi-modal brain MRI rather than single-modality (e.g., only T1) data.
  • NeuroQuant learns a shared latent representation across MRI modalities using factorized multi-axis attention, aiming to model relationships between distant brain regions.
  • It uses a dual-stream 3D encoder to separate modality-invariant anatomical structure from modality-dependent appearance, improving controllability and robustness.
  • The anatomical component is discretized with a shared codebook and merged with modality-specific features via FiLM during decoding to better handle cross-modal differences.
  • Experiments on two multi-modal brain MRI datasets show improved reconstruction quality over existing VAE approaches, supporting scalable downstream generative modeling and cross-modal analysis.
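The factorized multi-axis attention mentioned above can be illustrated with a short sketch: instead of full attention over all voxels of a 3D feature volume, self-attention is applied separately along each spatial axis, which keeps cost manageable while still letting distant regions interact. The layer layout, sizes, and class name below are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class FactorizedAxisAttention(nn.Module):
    """Self-attention applied separately along each spatial axis of a
    3D feature volume -- a common way to factorize full 3D attention.
    The exact design in NeuroQuant may differ; this is a sketch."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def _attend_axis(self, x: torch.Tensor, axis: int) -> torch.Tensor:
        # x: (B, C, D, H, W); attend along one spatial axis in {2, 3, 4}.
        x = x.movedim(1, -1)          # channels last: (B, D, H, W, C)
        x = x.movedim(axis - 1, -2)   # chosen axis becomes the sequence dim
        shape = x.shape
        seq = x.reshape(-1, shape[-2], shape[-1])   # (B * rest, L, C)
        q = self.norm(seq)
        out, _ = self.attn(q, q, q)
        out = (seq + out).reshape(shape)            # residual connection
        out = out.movedim(-2, axis - 1)
        return out.movedim(-1, 1)                   # back to (B, C, D, H, W)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for axis in (2, 3, 4):        # depth, height, width in turn
            x = self._attend_axis(x, axis)
        return x
```

For a volume of side length `n`, each axis pass attends over sequences of length `n` rather than `n^3` voxels, which is what makes this factorization tractable at 3D MRI resolutions.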

Abstract

Learning a robust Variational Autoencoder (VAE) is a fundamental step for many deep learning applications in medical image analysis, such as MRI synthesis. Existing brain VAEs predominantly focus on single-modality data (e.g., T1-weighted MRI), overlooking the complementary diagnostic value of other modalities such as T2-weighted MRI. Here, we propose a modality-aware and anatomically grounded 3D vector-quantized VAE (VQ-VAE) for reconstructing multi-modal brain MRIs. Our method, NeuroQuant, first learns a shared latent representation across modalities using factorized multi-axis attention, which can capture relationships between distant brain regions. It then employs a dual-stream 3D encoder that explicitly separates the encoding of modality-invariant anatomical structure from modality-dependent appearance. Next, the anatomical encoding is discretized using a shared codebook and combined with modality-specific appearance features via Feature-wise Linear Modulation (FiLM) during decoding. The model is trained with a joint 2D/3D strategy to account for the slice-based acquisition of 3D MRI data. Extensive experiments on two multi-modal brain MRI datasets demonstrate that NeuroQuant achieves superior reconstruction fidelity compared to existing VAEs, providing a scalable foundation for downstream generative modeling and cross-modal brain image analysis.
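The quantize-then-modulate step described in the abstract can be sketched as follows: anatomical features are snapped to their nearest entries in a shared codebook (with a straight-through gradient, as in standard VQ-VAEs), and the quantized features are then rescaled and shifted by FiLM parameters predicted from the modality-specific appearance code. All dimensions, layer choices, and names here are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VQFiLMStep(nn.Module):
    """Sketch: discretize anatomical features with a shared codebook,
    then apply FiLM conditioning from a modality-specific appearance
    vector before decoding. Sizes and names are illustrative."""

    def __init__(self, dim: int = 8, codebook_size: int = 32,
                 appearance_dim: int = 16):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        # One linear layer predicts both FiLM parameters (gamma, beta).
        self.film = nn.Linear(appearance_dim, 2 * dim)

    def quantize(self, z: torch.Tensor):
        # z: (B, N, dim) anatomical features; pick nearest codebook entry.
        dists = torch.cdist(z, self.codebook.weight)   # (B, N, K)
        idx = dists.argmin(dim=-1)                     # (B, N)
        z_q = self.codebook(idx)                       # (B, N, dim)
        # Straight-through estimator: forward uses z_q, gradient flows to z.
        return z + (z_q - z).detach(), idx

    def forward(self, z_anat: torch.Tensor, a_mod: torch.Tensor):
        # z_anat: (B, N, dim) anatomy; a_mod: (B, appearance_dim) appearance.
        z_q, idx = self.quantize(z_anat)
        gamma, beta = self.film(a_mod).chunk(2, dim=-1)  # (B, dim) each
        out = gamma.unsqueeze(1) * z_q + beta.unsqueeze(1)
        return out, idx
```

Because the codebook is shared across modalities while gamma and beta depend on the appearance code, the same discrete anatomical tokens can be decoded into different modality appearances, which is the controllability the abstract describes.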