XAttnRes: Cross-Stage Attention Residuals for Medical Image Segmentation

arXiv cs.CV / 4/7/2026


Key Points

  • The paper introduces Cross-Stage Attention Residuals (XAttnRes), a new mechanism that keeps a global feature-history pool of prior encoder and decoder stage outputs for medical image segmentation.
  • XAttnRes uses lightweight pseudo-query attention so each stage can selectively aggregate information from all preceding representations, improving over fixed residual connections.
  • It adds spatial alignment and channel projection steps to bridge the resolution and channel-dimension gap between the uniform-dimension Transformer layers of LLMs and the multi-scale stages of encoder-decoder segmentation networks, with minimal added overhead.
  • Experiments on four datasets across three imaging modalities show consistent segmentation gains when XAttnRes is incorporated into existing models.
  • The authors report that XAttnRes achieves baseline-competitive results even without traditional skip connections, suggesting that learned attention-based aggregation can replace some of the inter-stage information flow those connections normally provide.
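
The spatial-alignment and channel-projection step described above can be sketched roughly as follows. This is a minimal PyTorch-style illustration, not the paper's implementation: the module name `AlignProject` and its interface are assumptions, and the paper may use different resampling or projection choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignProject(nn.Module):
    """Hypothetical sketch: resample a stored stage output to the current
    stage's spatial resolution, then project its channels with a 1x1 conv
    so features from different stages become directly aggregable."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # channel projection

    def forward(self, feat: torch.Tensor, target_hw: tuple) -> torch.Tensor:
        # spatial alignment: bilinear resampling to the target resolution
        feat = F.interpolate(feat, size=target_hw, mode="bilinear",
                             align_corners=False)
        return self.proj(feat)

# e.g. a stored encoder output (64 ch at 32x32) aligned to a deeper
# stage's shape (128 ch at 16x16)
stored = torch.randn(1, 64, 32, 32)
aligned = AlignProject(64, 128)(stored, (16, 16))
print(tuple(aligned.shape))  # (1, 128, 16, 16)
```

A 1x1 convolution keeps the projection cheap, which is consistent with the "negligible overhead" claim, though the exact operators used are not specified in this summary.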

Abstract

In the field of Large Language Models (LLMs), Attention Residuals have recently demonstrated that learned, selective aggregation over all preceding layer outputs can outperform fixed residual connections. We propose Cross-Stage Attention Residuals (XAttnRes), a mechanism that maintains a global feature history pool accumulating both encoder and decoder stage outputs. Through lightweight pseudo-query attention, each stage selectively aggregates from all preceding representations. To bridge the gap between the same-dimensional Transformer layers in LLMs and the multi-scale encoder-decoder stages in segmentation networks, XAttnRes introduces spatial alignment and channel projection steps that handle cross-resolution features with negligible overhead. When added to existing segmentation networks, XAttnRes consistently improves performance across four datasets and three imaging modalities. We further observe that XAttnRes alone, even without skip connections, achieves performance on par with the baseline, suggesting that learned aggregation can recover the inter-stage information flow traditionally provided by predetermined connections.
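
The pseudo-query attention over the feature-history pool might look roughly like the sketch below. This is an assumed reading of the abstract, not the authors' code: the class name `PseudoQueryAggregate`, the pooled-descriptor keys, and the scaled dot-product scoring are all illustrative choices, and the pool entries are assumed to have already passed through alignment and projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PseudoQueryAggregate(nn.Module):
    """Hypothetical sketch: a learned pseudo-query scores each entry in the
    global feature-history pool; the softmax weights then mix the entries
    into a single residual for the current stage."""
    def __init__(self, ch: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(ch))  # learned pseudo-query

    def forward(self, history: list) -> torch.Tensor:
        # history: list of tensors, each (B, C, H, W), already spatially
        # aligned and channel-projected to the current stage's shape
        stack = torch.stack(history, dim=1)        # (B, S, C, H, W)
        keys = stack.mean(dim=(3, 4))              # (B, S, C) pooled descriptors
        scores = keys @ self.query / keys.shape[-1] ** 0.5  # (B, S)
        w = F.softmax(scores, dim=1)               # attention over prior stages
        # weighted sum over the stage axis -> residual for the current stage
        return (w[:, :, None, None, None] * stack).sum(dim=1)

# three prior stage outputs, already brought to a common shape
pool = [torch.randn(1, 128, 16, 16) for _ in range(3)]
residual = PseudoQueryAggregate(128)(pool)
print(tuple(residual.shape))  # (1, 128, 16, 16)
```

The key property this illustrates is that the mixing weights are learned and input-dependent, in contrast to a fixed residual connection that always adds the immediately preceding output with weight one.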