Brain-Inspired Multimodal Spiking Neural Network for Image-Text Retrieval

arXiv cs.CV · March 31, 2026


Key Points

  • The paper introduces a brain-inspired Cross-Modal Spike Fusion (CMSF) spiking neural network designed for multimodal image-text retrieval, aiming to jointly address energy efficiency and cross-modal interaction.
  • CMSF performs spike-level fusion of unimodal features and uses the fused representation to provide soft supervisory signals that refine unimodal spike embeddings and reduce semantic loss.
  • The method achieves top-tier image-text retrieval accuracy using only two time steps, positioning it as both fast and low-energy compared with typical ANN-based approaches.
  • It is presented as the first application of a directly trained, low-energy multimodal SNN framework to image-text retrieval, with code released on GitHub.
  • The work points to a design direction for future spiking-based multimodal research and systems: unifying temporal dynamics with cross-modal alignment.
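To make the idea concrete, here is a minimal, hypothetical sketch of spike-level fusion with soft supervision. The paper does not specify its mechanism in this summary, so every detail below is an assumption: a simple LIF encoder over two time steps, elementwise OR as the fusion rule, and an MSE loss between unimodal and fused firing rates standing in for the soft supervisory signal.

```python
import numpy as np

def heaviside(v, thresh=1.0):
    # Spike when membrane potential crosses threshold (forward pass only)
    return (v >= thresh).astype(np.float32)

def lif_encode(features, time_steps=2, thresh=1.0, decay=0.5):
    """Encode real-valued features into a spike train with a toy LIF neuron.

    Returns an array of shape (time_steps, dim) of binary spikes.
    All dynamics here are illustrative, not the paper's actual encoder.
    """
    v = np.zeros_like(features, dtype=np.float32)
    spikes = []
    for _ in range(time_steps):
        v = decay * v + features          # leaky integration of input current
        s = heaviside(v, thresh)          # fire where threshold is crossed
        v = v - s * thresh                # soft reset of firing neurons
        spikes.append(s)
    return np.stack(spikes)

def spike_fusion(img_spikes, txt_spikes):
    # Hypothetical spike-level fusion: elementwise OR of the two spike trains
    return np.maximum(img_spikes, txt_spikes)

def soft_supervision_loss(unimodal_spikes, fused_spikes):
    # Fused firing rates act as soft targets for the unimodal firing rates;
    # MSE is a stand-in for whatever distillation loss the paper uses.
    return float(np.mean((unimodal_spikes.mean(0) - fused_spikes.mean(0)) ** 2))
```

With only two time steps, each modality contributes at most two binary spikes per feature dimension, which is why the fused representation (and the soft targets derived from it) can recover semantics that either unimodal spike embedding alone would lose.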

Abstract

Spiking neural networks (SNNs) have recently shown strong potential in unimodal visual and textual tasks, yet building a directly trained, low-energy, and high-performance SNN for multimodal applications such as image-text retrieval (ITR) remains highly challenging. Existing artificial neural network (ANN)-based methods often pursue richer unimodal semantics using deeper and more complex architectures, while overlooking cross-modal interaction, retrieval latency, and energy efficiency. To address these limitations, we present a brain-inspired Cross-Modal Spike Fusion network (CMSF) and apply it to ITR for the first time. The proposed spike fusion mechanism integrates unimodal features at the spike level, generating enhanced multimodal representations that act as soft supervisory signals to refine unimodal spike embeddings, effectively mitigating semantic loss within CMSF. Despite requiring only two time steps, CMSF achieves top-tier retrieval accuracy, surpassing state-of-the-art ANN counterparts while maintaining exceptionally low energy consumption and high retrieval speed. This work marks a significant step toward multimodal SNNs, offering a brain-inspired framework that unifies temporal dynamics with cross-modal alignment and provides new insights for future spiking-based multimodal research. The code is available at https://github.com/zxt6174/CMSF.