A3-FPN: Asymptotic Content-Aware Pyramid Attention Network for Dense Visual Prediction

arXiv cs.CV / 4/14/2026


Key Points

  • The paper introduces A3-FPN, an Asymptotic Content-Aware Pyramid Attention Network designed to improve dense visual prediction by better capturing discriminative multi-scale features, especially for small objects.
  • A3-FPN uses a horizontally spread column network with an asymptotically disentangled framework to enable asymptotically global feature interaction and disentangle each pyramid level from hierarchical representations.
  • For feature fusion, it introduces content-aware attention that collects supplementary adjacent-level context to compute position-wise offsets/weights for context-aware resampling and applies deep context reweighting to enhance intra-category similarity.
  • For feature reassembly, it strengthens intra-scale discriminative learning and reassembles redundant features using information content and spatial variation of feature maps.
  • Experiments on MS COCO, VisDrone2019-DET, and Cityscapes show that A3-FPN can be plugged into both CNN- and Transformer-based SOTA architectures, reporting strong results including 49.6 mask AP on MS COCO with OneFormer + Swin-L and 85.6 mIoU on Cityscapes.
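The content-aware resampling described above (position-wise offsets/weights computed from adjacent-level context) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, shapes, and the assumption that weight logits are already predicted by some small conv head are all hypothetical; the sketch only shows the core idea of reweighting each position's neighborhood with a content-dependent softmax kernel.

```python
import numpy as np

def content_aware_resample(feat, weight_logits, k=3):
    """Toy content-aware reassembly (hypothetical, not the paper's code).

    Each output position is a weighted sum over its k x k neighborhood,
    with the weights given by a position-wise softmax over predicted
    logits (in A3-FPN these would come from adjacent-level context).

    feat:          (H, W, C) feature map
    weight_logits: (H, W, k*k) per-position kernel logits
    """
    H, W, C = feat.shape
    pad = k // 2
    padded = np.pad(feat, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(feat)
    for y in range(H):
        for x in range(W):
            logits = weight_logits[y, x]
            w = np.exp(logits - logits.max())
            w /= w.sum()                          # position-wise softmax weights
            patch = padded[y:y + k, x:x + k].reshape(k * k, C)
            out[y, x] = w @ patch                 # content-aware reassembly
    return out
```

If the predicted logits strongly favor the center of each kernel, the operator degenerates to identity; uniform logits give local average pooling, so the content branch effectively interpolates between the two per position.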

Abstract

Learning multi-scale representations is the common strategy for tackling object scale variation in dense prediction tasks. Although existing feature pyramid networks have greatly advanced visual recognition, inherent design defects inhibit them from capturing discriminative features and recognizing small objects. In this work, we propose the Asymptotic Content-Aware Pyramid Attention Network (A3-FPN) to augment multi-scale feature representation via an asymptotically disentangled framework and content-aware attention modules. Specifically, A3-FPN employs a horizontally-spread column network that enables asymptotically global feature interaction and disentangles each level from all hierarchical representations. In feature fusion, it collects supplementary content from the adjacent level to generate position-wise offsets and weights for context-aware resampling, and learns deep context reweights to improve intra-category similarity. In feature reassembly, it further strengthens intra-scale discriminative feature learning and reassembles redundant features based on the information content and spatial variation of feature maps. Extensive experiments on MS COCO, VisDrone2019-DET, and Cityscapes demonstrate that A3-FPN can be easily integrated into state-of-the-art CNN- and Transformer-based architectures, yielding remarkable performance gains. Notably, when paired with OneFormer and a Swin-L backbone, A3-FPN achieves 49.6 mask AP on MS COCO and 85.6 mIoU on Cityscapes. Code is available at https://github.com/mason-ching/A3-FPN.
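The reassembly criterion in the abstract, scoring features by information content and spatial variation, can be sketched with two standard proxies: Shannon entropy of a channel's value histogram (information content) and mean absolute spatial gradient (spatial variation). The function names, the additive combination of the two scores, and the hard top-k selection are all illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def channel_scores(feat, bins=16):
    """Score each channel by histogram entropy (information content)
    plus mean absolute spatial gradient (spatial variation).
    Both criteria are hypothetical proxies for the paper's measures.

    feat: (H, W, C). Returns an array of C scores.
    """
    H, W, C = feat.shape
    scores = np.zeros(C)
    for c in range(C):
        ch = feat[:, :, c]
        hist, _ = np.histogram(ch, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]
        entropy = -(p * np.log2(p)).sum()         # information content
        grad = (np.abs(np.diff(ch, axis=0)).mean()
                + np.abs(np.diff(ch, axis=1)).mean())  # spatial variation
        scores[c] = entropy + grad
    return scores

def reassemble(feat, keep):
    """Keep the `keep` highest-scoring channels, dropping redundant
    (low-entropy, spatially flat) ones."""
    idx = np.argsort(channel_scores(feat))[::-1][:keep]
    return feat[:, :, np.sort(idx)]
```

A constant channel gets zero entropy and zero gradient, so it is pruned first; a textured channel survives. A learned module would instead reweight features softly, but the ranking intuition is the same.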