ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction

arXiv cs.AI / March 30, 2026


Key Points

  • ARTA (Adaptive Mixed-Resolution Token Allocation) is a coarse-to-fine vision transformer that begins with low-resolution tokens and selectively allocates additional fine tokens to image regions that need higher detail.
  • A lightweight allocator iteratively predicts semantic boundary scores and adds fine tokens wherever boundary evidence exceeds a deliberately low threshold, concentrating compute near class boundaries while remaining sensitive to weak cues and avoiding redundant processing in homogeneous areas.
  • Mixed-resolution attention lets coarse and fine tokens interact directly, so fine detail near boundaries is processed jointly with coarse context from homogeneous regions.
  • Experiments report state-of-the-art performance on ADE20K and COCO-Stuff with substantially fewer FLOPs, and competitive results on Cityscapes at markedly lower compute (e.g., ARTA-Base at 54.6 mIoU on ADE20K in the ~100M-parameter range).
  • The method is designed to improve semantic consistency by encouraging tokens to represent a single class rather than mixing semantics across boundaries.
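The coarse-to-fine allocation loop described above can be sketched as a quadtree-style refinement: start from a coarse patch grid, score each patch for boundary evidence, and split only the patches whose score clears a low threshold. The sketch below uses patch standard deviation as a toy stand-in for ARTA's learned allocator network (the `score_fn` interface, threshold value, and grid sizes are all illustrative assumptions, not the paper's implementation).

```python
import numpy as np

def allocate_tokens(image, score_fn, thresh=0.1, levels=2, base=8):
    """Coarse-to-fine token allocation sketch.

    Start with a coarse grid of `base`-sized patches; at each level,
    patches whose predicted boundary score exceeds a low threshold are
    split into four finer children. `score_fn(patch)` stands in for
    ARTA's lightweight allocator (hypothetical interface).
    Returns a list of (y, x, size) token descriptors.
    """
    H, W = image.shape[:2]
    tokens = [(y, x, base) for y in range(0, H, base)
                           for x in range(0, W, base)]
    for _ in range(levels):
        refined = []
        for (y, x, s) in tokens:
            patch = image[y:y + s, x:x + s]
            if s > 1 and score_fn(patch) > thresh:
                h = s // 2  # split into 4 finer tokens near boundaries
                refined += [(y, x, h), (y, x + h, h),
                            (y + h, x, h), (y + h, x + h, h)]
            else:
                refined.append((y, x, s))  # homogeneous: keep coarse
        tokens = refined
    return tokens

# Toy example: a vertical class boundary at column 6.
img = np.zeros((16, 16))
img[:, 6:] = 1.0
tokens = allocate_tokens(img, lambda p: p.std())
```

On this toy image, only patches straddling the boundary get refined to the finest size, while uniform regions stay as single coarse tokens, which is the intended compute distribution.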

Abstract

We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.
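Mixed-resolution attention, as described in the abstract, amounts to running attention over a single sequence that mixes coarse and fine tokens, so each token can attend to context at any resolution. A minimal single-head sketch is below; the random projection weights and embedding dimension are placeholders for learned parameters, not ARTA's actual parameterization.

```python
import numpy as np

def mixed_res_attention(feats, d=16, seed=0):
    """Single-head scaled dot-product attention over a mixed token set.

    `feats` is an (N, d) array where each row embeds one token,
    regardless of its spatial size: coarse and fine tokens share one
    sequence, so fine tokens near boundaries attend to coarse context
    and vice versa. Projection weights are random stand-ins for
    learned parameters (hypothetical).
    """
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d)
                  for _ in range(3))
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax rows
    return attn @ V
```

Because the sequence length equals the number of allocated tokens rather than the number of fine-resolution patches, attention cost scales with how many regions actually needed refinement, which is where the FLOP savings come from.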