BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning

arXiv cs.CV · April 28, 2026


Key Points

  • The paper introduces BSViT, a Burst Spiking Vision Transformer designed to improve energy-efficient visual representation learning within spiking vision transformer frameworks.
  • It addresses key limitations of prior S-ViTs with Dual-Channel Burst Spiking Self-Attention (DBSSA), which increases information capacity by encoding queries as binary spikes and keys as burst spikes.
  • BSViT uses a dual excitatory/inhibitory value pathway for signed modulation, aiming for richer and more expressive spike interactions.
  • The approach keeps attention computation addition-only, making it more compatible with energy-efficient neuromorphic hardware.
  • A patch adjacency masking strategy further adds spatial priors by restricting attention to local neighborhoods, reducing spike activity and computational overhead while boosting performance on static and event-based benchmarks.
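The addition-only property described above follows from the spike formats themselves: with binary queries and integer burst-count keys, the query-key score is just a masked sum of burst counts, and the dual excitatory/inhibitory value channels yield signed integers without any floating-point multiplies. The NumPy sketch below illustrates this arithmetic on toy shapes; it is not the paper's implementation, and the token count, feature dimension, and burst cap are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8          # tokens and feature dim (illustrative)
B_max = 3            # assumed cap on burst spikes per key

Q = rng.integers(0, 2, size=(N, D))          # binary spike queries {0,1}
K = rng.integers(0, B_max + 1, size=(N, D))  # burst spike keys {0..B_max}
V_exc = rng.integers(0, 2, size=(N, D))      # excitatory binary channel
V_inh = rng.integers(0, 2, size=(N, D))      # inhibitory binary channel
V = V_exc - V_inh                            # signed values in {-1, 0, 1}

# Because Q is binary, each score is a sum of burst counts over the
# dimensions where the query fired -- additions only, no multiplies.
attn = np.zeros((N, N), dtype=int)
for i in range(N):
    for j in range(N):
        attn[i, j] = K[j, Q[i] == 1].sum()

# The masked sum is equivalent to the usual matmul form.
assert np.array_equal(attn, Q @ K.T)

out = attn @ V  # integer accumulation with signed value spikes
```

The same accumulate-only structure is what makes the mechanism amenable to neuromorphic hardware, where multiply-accumulate units are replaced by event-driven additions.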

Abstract

Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
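The patch adjacency masking described in the abstract can be pictured as a boolean mask over patch tokens arranged on a 2-D grid, where each token may only attend to spatially nearby patches. The sketch below assumes a toy 4x4 patch grid and a Chebyshev-distance-1 (3x3) neighborhood; the paper's exact adjacency definition may differ.

```python
import numpy as np

H = W = 4                      # assumed patch grid size
N = H * W                      # number of patch tokens
coords = np.array([(i // W, i % W) for i in range(N)])

# Token j is visible to token i only if their patches are within
# Chebyshev distance 1 on the grid (a 3x3 local neighborhood).
dist = np.abs(coords[:, None, :] - coords[None, :, :]).max(-1)
mask = dist <= 1

# An interior patch attends to 9 neighbors instead of all 16 tokens;
# corner patches see only a 2x2 neighborhood.
assert mask[5].sum() == 9      # token (1,1) has a full 3x3 neighborhood
assert mask[0].sum() == 4      # corner token (0,0) sees 2x2
```

Zeroing out attention entries outside this mask is what produces the structure-aware sparsity the abstract mentions: fewer attended pairs means fewer spike events and less accumulation work per layer.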