GradAttn: Replacing Fixed Residual Connections with Task-Modulated Attention Pathways

arXiv cs.CV / 3/31/2026

Key Points

  • The paper argues that fixed residual connections in deep ConvNets can limit learning because they cannot adapt gradient flow or feature emphasis to input complexity and task relevance across depth.
  • It proposes GradAttn, a hybrid CNN–transformer approach that replaces fixed residual shortcuts with self-attention–controlled gradient flow using multi-scale CNN features (a simplified sketch follows this list).
  • Experiments on eight datasets (spanning natural images, medical imaging, and fashion recognition) show that GradAttn variants outperform ResNet-18 on five of the eight, with an accuracy gain of up to 11.07% on FashionMNIST at comparable model size.
  • Gradient flow analysis suggests that some attention-induced controlled instabilities may correlate with better generalization, contradicting the idea that maximal stability is always optimal.
  • The study also finds positional encoding effectiveness is dataset-dependent, with CNN hierarchies sometimes providing sufficient spatial structure on their own.
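To make the contrast with a fixed shortcut concrete, below is a minimal PyTorch-style sketch of a residual block whose identity path is scaled by an input-dependent gate rather than passed through unchanged. The gating module, its structure, and all names here are hypothetical simplifications for illustration, not the paper's actual GradAttn block.

```python
import torch
import torch.nn as nn

class AttnGatedResidualBlock(nn.Module):
    """Residual block whose identity shortcut is scaled by a learned,
    input-dependent gate (a stand-in for attention-controlled gradient
    flow). All module and parameter names are hypothetical."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        # Squeeze-and-excitation-style gate: global pool -> MLP -> sigmoid.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
            nn.Sigmoid(),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard ResNet: out = body(x) + x, a fixed, input-independent shortcut.
        # Here the shortcut is modulated per channel by gate(x) in [0, 1], so both
        # the forward signal and the gradient through the skip path depend on the input.
        return self.act(self.body(x) + self.gate(x) * x)
```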

Abstract

Deep ConvNets suffer from gradient signal degradation as network depth increases, limiting effective feature learning in complex architectures. ResNet addressed this through residual connections, but these fixed shortcuts cannot adapt to varying input complexity or selectively emphasize task-relevant features across network hierarchies. This study introduces GradAttn, a hybrid CNN-transformer framework that replaces fixed residual connections with attention-controlled gradient flow. By extracting multi-scale CNN features at different depths and regulating them through self-attention, GradAttn dynamically weights shallow texture features and deep semantic representations. For representational analysis, we evaluated three GradAttn variants across eight diverse datasets spanning natural images, medical imaging, and fashion recognition. Results demonstrate that GradAttn outperforms ResNet-18 on five of the eight datasets, achieving up to an 11.07% accuracy improvement on FashionMNIST while maintaining comparable network size. Gradient flow analysis reveals that controlled instabilities introduced by attention often coincide with improved generalization, challenging the assumption that perfect stability is optimal. Furthermore, positional encoding effectiveness proves dataset-dependent, with CNN hierarchies frequently encoding sufficient spatial structure on their own. These findings position attention mechanisms as enablers of learnable gradient control, offering a new paradigm for adaptive representation learning in deep neural architectures.
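As a rough illustration of how multi-scale CNN features might be regulated through self-attention in the way the abstract describes, the sketch below pools feature maps from several depths into per-stage tokens, lets multi-head self-attention re-weight them, and fuses the result for classification. This is an assumption-laden reading of the abstract; the actual GradAttn variants, their layer structure, and their dimensions are not specified in this summary.

```python
import torch
import torch.nn as nn

class MultiScaleAttnFusion(nn.Module):
    """Pools feature maps taken at several CNN depths into per-stage tokens,
    re-weights them with multi-head self-attention, and fuses the result.
    A sketch of one possible design, not the paper's implementation."""

    def __init__(self, stage_channels, embed_dim=256, num_heads=4, num_classes=10):
        super().__init__()
        # One linear projection per CNN stage, mapping pooled features to a shared width.
        self.proj = nn.ModuleList([nn.Linear(c, embed_dim) for c in stage_channels])
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, stage_feats):
        # stage_feats: list of (B, C_i, H_i, W_i) maps from shallow to deep stages.
        tokens = []
        for feat, proj in zip(stage_feats, self.proj):
            pooled = feat.mean(dim=(2, 3))   # (B, C_i) global average pool
            tokens.append(proj(pooled))      # (B, embed_dim)
        tokens = torch.stack(tokens, dim=1)  # (B, num_stages, embed_dim)
        # Self-attention decides how much each depth contributes per input,
        # in place of the uniform mixing imposed by fixed residual shortcuts.
        mixed, _ = self.attn(tokens, tokens, tokens)
        return self.head(mixed.mean(dim=1))  # (B, num_classes)
```

In use, stage_feats would come from hooks on the backbone's stages (e.g., the four stages of a ResNet-18), so the attention learns how strongly shallow texture features and deep semantic features should each contribute for a given input.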