BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning

arXiv cs.CV · April 28, 2026


Key Points

  • The paper introduces BSViT, a Burst Spiking Vision Transformer designed to improve energy-efficient visual representation learning within spiking vision transformer frameworks.
  • It addresses key limitations of prior S-ViTs with Dual-Channel Burst Spiking Self-Attention (DBSSA), which increases information capacity by encoding queries as binary spikes and keys as burst spikes.
  • BSViT uses a dual excitatory/inhibitory value pathway for signed modulation, aiming for richer and more expressive spike interactions.
  • The approach keeps attention computation addition-only, making it more compatible with energy-efficient neuromorphic hardware.
  • A patch adjacency masking strategy further adds spatial priors by restricting attention to local neighborhoods, reducing spike activity and computational overhead while boosting performance on static and event-based benchmarks.
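The addition-only property described above follows from the spike formats themselves: with binary queries and integer burst-count keys, the query-key score is just a masked sum of burst counts, and the dual excitatory/inhibitory value channels yield signed integers without any floating-point multiplies. The NumPy sketch below illustrates this arithmetic on toy shapes; it is not the paper's implementation, and the token count, feature dimension, and burst cap are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8          # tokens and feature dim (illustrative)
B_max = 3            # assumed cap on burst spikes per key

Q = rng.integers(0, 2, size=(N, D))          # binary spike queries {0,1}
K = rng.integers(0, B_max + 1, size=(N, D))  # burst spike keys {0..B_max}
V_exc = rng.integers(0, 2, size=(N, D))      # excitatory binary channel
V_inh = rng.integers(0, 2, size=(N, D))      # inhibitory binary channel
V = V_exc - V_inh                            # signed values in {-1, 0, 1}

# Because Q is binary, each score is a sum of burst counts over the
# dimensions where the query fired -- additions only, no multiplies.
attn = np.zeros((N, N), dtype=int)
for i in range(N):
    for j in range(N):
        attn[i, j] = K[j, Q[i] == 1].sum()

# The masked sum is equivalent to the usual matmul form.
assert np.array_equal(attn, Q @ K.T)

out = attn @ V  # integer accumulation with signed value spikes
```

The same accumulate-only structure is what makes the mechanism amenable to neuromorphic hardware, where multiply-accumulate units are replaced by event-driven additions.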

Abstract

Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
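The patch adjacency masking described in the abstract can be pictured as a boolean mask over patch tokens arranged on a 2-D grid, where each token may only attend to spatially nearby patches. The sketch below assumes a toy 4x4 patch grid and a Chebyshev-distance-1 (3x3) neighborhood; the paper's exact adjacency definition may differ.

```python
import numpy as np

H = W = 4                      # assumed patch grid size
N = H * W                      # number of patch tokens
coords = np.array([(i // W, i % W) for i in range(N)])

# Token j is visible to token i only if their patches are within
# Chebyshev distance 1 on the grid (a 3x3 local neighborhood).
dist = np.abs(coords[:, None, :] - coords[None, :, :]).max(-1)
mask = dist <= 1

# An interior patch attends to 9 neighbors instead of all 16 tokens;
# corner patches see only a 2x2 neighborhood.
assert mask[5].sum() == 9      # token (1,1) has a full 3x3 neighborhood
assert mask[0].sum() == 4      # corner token (0,0) sees 2x2
```

Zeroing out attention entries outside this mask is what produces the structure-aware sparsity the abstract mentions: fewer attended pairs means fewer spike events and less accumulation work per layer.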