Can Natural Image Autoencoders Compactly Tokenize fMRI Volumes for Long-Range Dynamics Modeling?

arXiv cs.CV / 4/7/2026


Key Points

  • The paper proposes TABLeT, a Two-dimensionally Autoencoded Brain Latent Transformer that tokenizes 3D fMRI volumes into compact continuous tokens to make long-range spatiotemporal modeling feasible under limited memory.
  • By leveraging a pre-trained 2D natural image autoencoder, each fMRI volume is compressed into tokens that can be processed by a simple Transformer encoder while reducing VRAM requirements compared with voxel-based approaches.
  • Experiments on large benchmarks (UK-Biobank, HCP, and ADHD-200) show TABLeT outperforming existing models across multiple tasks.
  • The authors introduce a self-supervised masked token modeling pre-training method for TABLeT, which further improves downstream performance.
  • The work claims gains in computational and memory efficiency while aiming to preserve interpretability for scalable brain-activity dynamics modeling, with code released on GitHub.
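The tokenization idea in the points above can be sketched concretely: treat each 2D slice of a 3D fMRI volume as an image, run it through a (frozen) pre-trained 2D encoder, and collect one compact token per slice. The sketch below is illustrative only — it uses a toy linear projection as a stand-in for the pre-trained natural image autoencoder, and all shapes and names are assumptions, not the authors' implementation.

```python
import numpy as np

def encode_slice(img2d, W_enc):
    """Toy stand-in for a pre-trained 2D autoencoder's encoder:
    flatten the slice and project it to a low-dimensional latent."""
    return img2d.reshape(-1) @ W_enc  # -> (latent_dim,)

def tokenize_volume(volume, W_enc):
    """Compress a 3D fMRI volume (D, H, W) into D continuous tokens,
    one per 2D slice, each of dimension latent_dim."""
    return np.stack([encode_slice(s, W_enc) for s in volume])

rng = np.random.default_rng(0)
D, H, W, latent_dim = 8, 16, 16, 4          # illustrative sizes
W_enc = rng.normal(size=(H * W, latent_dim))  # hypothetical frozen encoder weights
vol = rng.normal(size=(D, H, W))              # one 3D fMRI volume
tokens = tokenize_volume(vol, W_enc)
print(tokens.shape)  # (8, 4): 8 slice tokens of dimension 4
```

A full scan of T timepoints then becomes a sequence of T × D tokens, which is what makes long temporal windows tractable for a standard Transformer encoder under limited VRAM.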

Abstract

Modeling long-range spatiotemporal dynamics in functional Magnetic Resonance Imaging (fMRI) remains a key challenge due to the high dimensionality of the four-dimensional signals. Prior voxel-based models, although demonstrating excellent performance and interpretability, are constrained by prohibitive memory demands and thus can only capture limited temporal windows. To address this, we propose TABLeT (Two-dimensionally Autoencoded Brain Latent Transformer), a novel approach that tokenizes fMRI volumes using a pre-trained 2D natural image autoencoder. Each 3D fMRI volume is compressed into a compact set of continuous tokens, enabling long-sequence modeling with a simple Transformer encoder under limited VRAM. Across large-scale benchmarks including the UK-Biobank (UKB), Human Connectome Project (HCP), and ADHD-200 datasets, TABLeT outperforms existing models in multiple tasks, while demonstrating substantial gains in computational and memory efficiency over the state-of-the-art voxel-based method given the same input. Furthermore, we develop a self-supervised masked token modeling approach to pre-train TABLeT, which improves the model's performance on various downstream tasks. Our findings suggest a promising approach for scalable and interpretable spatiotemporal modeling of brain activity. Our code is available at https://github.com/beotborry/TABLeT.
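The masked token modeling pre-training mentioned in the abstract follows a familiar recipe: hide a random subset of token positions, substitute a learned mask embedding, and train the model to reconstruct the originals at only the masked positions. The sketch below illustrates that recipe in a minimal form; the mask ratio, replacement scheme, and loss are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, mask_embedding, rng):
    """tokens: (N, d). Randomly mask round(mask_ratio * N) positions and
    replace them with mask_embedding. Returns (masked_tokens, bool mask)."""
    N = tokens.shape[0]
    n_mask = int(round(mask_ratio * N))
    idx = rng.choice(N, size=n_mask, replace=False)
    mask = np.zeros(N, dtype=bool)
    mask[idx] = True
    out = tokens.copy()
    out[mask] = mask_embedding
    return out, mask

def masked_reconstruction_loss(pred, target, mask):
    """MSE computed only at masked positions, as in masked token modeling."""
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(1)
tokens = rng.normal(size=(10, 4))        # 10 continuous slice tokens
mask_emb = np.zeros(4)                   # hypothetical learned mask embedding
masked, mask = mask_tokens(tokens, 0.5, mask_emb, rng)
print(mask.sum())  # 5 positions masked
```

In practice the masked sequence would be fed to the Transformer encoder, whose predictions at masked positions are scored against the original tokens; unmasked positions contribute no loss.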