Where to Bind Matters: Hebbian Fast Weights in Vision Transformers for Few-Shot Character Recognition

arXiv cs.CV / 5/6/2026

💬 Opinion · Models & Research

Key Points

  • The paper studies how adding Hebbian Fast-Weight (HFW) modules to vision transformers enables rapid, episode-level adaptation that standard slow-weight transformers lack at inference time (a minimal sketch of such a module follows this list).
  • Experiments integrate HFW into multiple transformer backbones (ViT-Small, DeiT-Small, and Swin-Tiny) and evaluate six variants on Omniglot 5-way 1-shot and 5-way 5-shot tasks under a prototypical network meta-learning setup.
  • For Swin-Tiny, the authors find that applying a single HFW module to the final stage feature map (after hierarchical stages complete) avoids training instability seen when placing Hebbian modules at multiple stages.
  • This placement achieves the best accuracy across all evaluated models, reaching 96.2% (1-shot) and 99.2% (5-shot), with a reported +0.3 percentage point improvement over the non-Hebbian baseline at 1-shot.
  • The study analyzes why Swin's shifted-window inductive bias interacts effectively with episode-level Hebbian binding, why per-block HFW placement fails for ViT and DeiT in low-data regimes, and connects the findings to the fast- and slow-weight meta-learning literature.
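
To make the mechanism concrete, here is a minimal PyTorch sketch of what an HFW module could look like. It assumes an outer-product Hebbian rule over learned key/value projections; the class name, the key normalization, and the learning rate `eta` are illustrative assumptions, since the summary does not specify the paper's exact update rule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HebbianFastWeight(nn.Module):
    """Illustrative Hebbian fast-weight (HFW) module.

    The key/value projections are ordinary slow weights, trained by
    gradient descent across episodes. The fast-weight matrix ``A`` is an
    episode-local associative memory: reset each episode, written with
    outer products of (key, value) pairs, and read at inference time.
    """

    def __init__(self, dim: int, eta: float = 0.1):
        super().__init__()
        self.key = nn.Linear(dim, dim, bias=False)    # slow weights
        self.value = nn.Linear(dim, dim, bias=False)  # slow weights
        self.eta = eta                                # Hebbian rate (assumed value)
        self.A = None                                 # fast weights, one episode's lifetime

    def write(self, feats: torch.Tensor) -> None:
        """Hebbian write over all tokens: A = eta * sum_t v_t k_t^T."""
        k = F.normalize(self.key(feats), dim=-1)      # (tokens, dim)
        v = self.value(feats)                         # (tokens, dim)
        self.A = self.eta * v.t() @ k                 # (dim, dim) associative memory

    def read(self, feats: torch.Tensor) -> torch.Tensor:
        """Associative read-out, added residually to the slow features."""
        k = F.normalize(self.key(feats), dim=-1)
        return feats + k @ self.A.t()                 # retrieve bound values
```

The write/read split reflects the episodic setting described above: the transient memory is built once per episode and discarded afterwards, while only the slow projections persist across episodes.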

Abstract

Standard transformer architectures learn fixed slow-weight representations during training and lack mechanisms for rapid adaptation within an episode. In contrast, biological neural systems address this through fast synaptic updates that form transient associative memories during inference, a property known as Hebbian plasticity. In this paper, we conduct an empirical study of Hebbian Fast-Weight (HFW) modules integrated into multiple transformer backbones, including ViT-Small, DeiT-Small, and Swin-Tiny. We evaluate six model variants (ViT, DeiT, Swin, ViT-Hebbian, DeiT-Hebbian, and Swin-Hebbian) on 5-way 1-shot and 5-way 5-shot classification tasks using the Omniglot benchmark under a Prototypical Network meta-learning framework. We propose a single-module placement strategy for Swin-Tiny in which one HFW module is applied to the final-stage feature map after all hierarchical stages have completed. This design avoids the training instability caused by placing separate Hebbian modules at each stage and achieves the highest test accuracy across all six models (96.2% at 1-shot; 99.2% at 5-shot), outperforming its non-Hebbian baseline by +0.3 percentage points at 1-shot. We analyze the interaction between Swin's shifted-window inductive bias and episode-level Hebbian binding, discuss why per-block placement fails for the ViT and DeiT variants in a low-data regime, and situate the results within the wider literature on fast- and slow-weight meta-learning.
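
As a companion to the abstract, the sketch below shows how the single-module placement and the Prototypical Network evaluation could fit together in one episode: support features from the final backbone stage write the fast weights, queries read them, and classification is by nearest class prototype. `backbone` stands in for Swin-Tiny truncated after its last stage, `hfw` is the module sketched earlier, and `episode_accuracy` with its mean-pooling choice is hypothetical glue code, not the authors' implementation.

```python
import torch


def episode_accuracy(backbone, hfw, support, support_y, query, query_y, n_way):
    """One N-way episode with a single HFW module on the final-stage
    feature map, classified by a Prototypical Network head. ``backbone``
    is assumed to return (images, tokens, dim) final-stage features."""
    s_feats = backbone(support)                       # (n_support, tokens, dim)
    hfw.write(s_feats.flatten(0, 1))                  # bind the whole support set

    def embed(f):                                     # HFW read, then token pooling
        return hfw.read(f.flatten(0, 1)).view(*f.shape[:2], -1).mean(dim=1)

    s_emb, q_emb = embed(s_feats), embed(backbone(query))
    # Class prototypes = mean support embedding per class (Snell et al., 2017).
    protos = torch.stack([s_emb[support_y == c].mean(dim=0) for c in range(n_way)])
    pred = torch.cdist(q_emb, protos).argmin(dim=1)   # nearest-prototype decision
    return (pred == query_y).float().mean().item()
```

Placing `hfw` only after the backbone's final stage mirrors the paper's single-module strategy: the hierarchical Swin stages run unchanged, and the Hebbian binding operates once, on the completed feature map, rather than inside every block.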