ViBE: Visual-to-M/EEG Brain Encoding via Spatio-Temporal VAE and Distribution-Aligned Projection

arXiv cs.CV / 4/30/2026


Key Points

  • The paper introduces ViBE, a brain encoding framework that generates M/EEG signals from visual stimuli, supporting both neuroscience research and potential visual prosthesis applications.
  • ViBE uses a spatio-temporal convolutional variational autoencoder (TSC-VAE) that learns the spatio-temporal structure of M/EEG signals for faithful reconstruction of neural responses.
  • To align visual and neural modalities, the method employs Q-Former to map CLIP image embeddings into the TSC-VAE latent space as neural proxy embeddings.
  • For cross-modal alignment, ViBE combines an MSE loss for point-wise feature matching with the sliced Wasserstein distance (SWD) for distribution-level alignment (illustrative sketches of these components follow the abstract below).
  • Experiments on the THINGS-EEG2 and THINGS-MEG datasets show that the approach produces high-quality M/EEG signals from images.

Abstract

Brain encoding models not only serve to decipher how visual stimuli are transformed into neural responses, but also represent a critical step toward visual prostheses that restore vision for patients with severe vision disorders. Brain encoding involves two fundamental steps: achieving faithful reconstruction of neural responses and establishing cross-modal alignment between visual stimuli and neural responses. To this end, we propose ViBE, a novel brain encoding framework for generating magnetoencephalography (MEG) and electroencephalography (EEG) signals from visual stimuli. Specifically, we first design a spatio-temporal convolutional variational autoencoder (TSC-VAE) that captures the spatio-temporal characteristics of M/EEG signals for effective neural response reconstruction. To bridge the modality gap between visual features and neural representations, we employ Q-Former to map CLIP image embeddings to the TSC-VAE latent space, producing neural proxy embeddings. For comprehensive cross-modal alignment, we combine mean squared error (MSE) loss for point-wise feature matching with sliced Wasserstein distance (SWD) for probability distribution alignment between the neural proxy embeddings and TSC-VAE latent embeddings. We conduct extensive experiments on the THINGS-EEG2 and THINGS-MEG datasets, demonstrating the effectiveness of our approach in generating high-quality M/EEG signals from visual stimuli.
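
To ground the three components described above, the sketches below reconstruct each step from the abstract alone. The paper's actual layer configurations, dimensions, and hyperparameters are not given in this summary, so all such values are assumptions. First, a minimal spatio-temporal convolutional VAE in the spirit of TSC-VAE, using a factorized temporal-then-spatial convolution (an EEGNet-style choice for illustration, not necessarily the paper's architecture):

```python
# Minimal sketch of a spatio-temporal convolutional VAE for M/EEG.
# Channel counts, kernel sizes, and the latent dimension are illustrative.
import torch
import torch.nn as nn

class TSCVAE(nn.Module):
    def __init__(self, n_channels=63, n_times=100, latent_dim=128):
        super().__init__()
        # Encoder: temporal conv along time, then spatial conv across sensors.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 25), padding=(0, 12)),   # temporal
            nn.Conv2d(16, 32, kernel_size=(n_channels, 1)),           # spatial
            nn.BatchNorm2d(32),
            nn.ELU(),
            nn.AvgPool2d((1, 4)),
            nn.Flatten(),
        )
        feat = 32 * (n_times // 4)
        self.fc_mu = nn.Linear(feat, latent_dim)
        self.fc_logvar = nn.Linear(feat, latent_dim)
        # Decoder mirrors the encoder back to a (channels, time) signal.
        self.fc_dec = nn.Linear(latent_dim, feat)
        self.decoder = nn.Sequential(
            nn.Unflatten(1, (32, 1, n_times // 4)),
            nn.Upsample(scale_factor=(1, 4)),
            nn.ConvTranspose2d(32, 16, kernel_size=(n_channels, 1)),  # spatial
            nn.ConvTranspose2d(16, 1, kernel_size=(1, 25), padding=(0, 12)),
        )

    def forward(self, x):                      # x: (batch, channels, time)
        h = self.encoder(x.unsqueeze(1))       # add a singleton "image" channel
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        recon = self.decoder(self.fc_dec(z)).squeeze(1)
        return recon, mu, logvar               # train with recon loss + KL term
```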
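
Second, the Q-Former projection can be approximated by a single block of learned queries cross-attending to CLIP image tokens, in the spirit of BLIP-2's Q-Former; the query count, hidden width, and mean-pooling readout here are illustrative assumptions:

```python
# Simplified stand-in for the Q-Former projection: learned queries attend
# over CLIP image tokens and are pooled into a "neural proxy embedding"
# in the TSC-VAE latent space.
import torch
import torch.nn as nn

class QFormerProjector(nn.Module):
    def __init__(self, clip_dim=768, hidden=512, n_queries=32, latent_dim=128):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, hidden) * 0.02)
        self.kv_proj = nn.Linear(clip_dim, hidden)
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, hidden))
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.out = nn.Linear(hidden, latent_dim)

    def forward(self, clip_tokens):            # clip_tokens: (batch, n_tokens, clip_dim)
        kv = self.kv_proj(clip_tokens)
        q = self.queries.expand(clip_tokens.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, kv, kv)   # queries read the image tokens
        h = self.norm1(q + attn_out)
        h = self.norm2(h + self.ffn(h))
        return self.out(h.mean(dim=1))         # pooled neural proxy embedding
```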
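
Third, a sketch of the combined alignment objective: an MSE term matching each neural proxy embedding to its TSC-VAE latent embedding point-wise, plus a sliced Wasserstein term aligning the two batch distributions. The number of random projections and the loss weight are assumptions:

```python
# Point-wise MSE plus distribution-level sliced Wasserstein distance (SWD).
import torch
import torch.nn.functional as F

def sliced_wasserstein(x, y, n_projections=64):
    """Monte-Carlo SWD between two point clouds of shape (batch, dim)."""
    theta = torch.randn(x.size(1), n_projections, device=x.device)
    theta = theta / theta.norm(dim=0, keepdim=True)   # random unit directions
    x_proj = (x @ theta).sort(dim=0).values           # sorted 1-D projections
    y_proj = (y @ theta).sort(dim=0).values
    return ((x_proj - y_proj) ** 2).mean()            # mean squared 1-D transport cost

def alignment_loss(proxy, latent, swd_weight=1.0):
    """Point-wise matching (MSE) plus distribution alignment (SWD)."""
    return F.mse_loss(proxy, latent) + swd_weight * sliced_wasserstein(proxy, latent)
```

Slicing is what keeps the distribution-level term cheap: each random projection reduces the comparison to one-dimensional distributions, where optimal transport has a closed form via sorting, so the SWD term adds little overhead on top of the point-wise MSE.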
