ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction

arXiv cs.AI / March 30, 2026


Key Points

  • ARTA (Adaptive Mixed-Resolution Token Allocation) is a coarse-to-fine vision transformer that begins with low-resolution tokens and selectively allocates additional fine tokens to image regions that need higher detail.
  • A lightweight allocator iteratively predicts semantic boundary scores and adds fine tokens wherever boundary evidence exceeds a deliberately low threshold, concentrating compute near class boundaries while remaining sensitive to weak cues and avoiding redundant processing in homogeneous areas.
  • Mixed-resolution attention lets coarse and fine tokens interact directly, so fine detail near boundaries is processed jointly with coarse context from homogeneous regions.
  • Experiments report state-of-the-art performance on ADE20K and COCO-Stuff with substantially fewer FLOPs, and competitive results on Cityscapes at markedly lower compute (e.g., ARTA-Base at 54.6 mIoU on ADE20K in the ~100M-parameter range).
  • The method is designed to improve semantic consistency by encouraging tokens to represent a single class rather than mixing semantics across boundaries.
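The coarse-to-fine allocation loop described above can be sketched as a quadtree-style refinement: start from a coarse patch grid, score each patch for boundary evidence, and split only the patches whose score clears a low threshold. The sketch below uses patch standard deviation as a toy stand-in for ARTA's learned allocator network (the `score_fn` interface, threshold value, and grid sizes are all illustrative assumptions, not the paper's implementation).

```python
import numpy as np

def allocate_tokens(image, score_fn, thresh=0.1, levels=2, base=8):
    """Coarse-to-fine token allocation sketch.

    Start with a coarse grid of `base`-sized patches; at each level,
    patches whose predicted boundary score exceeds a low threshold are
    split into four finer children. `score_fn(patch)` stands in for
    ARTA's lightweight allocator (hypothetical interface).
    Returns a list of (y, x, size) token descriptors.
    """
    H, W = image.shape[:2]
    tokens = [(y, x, base) for y in range(0, H, base)
                           for x in range(0, W, base)]
    for _ in range(levels):
        refined = []
        for (y, x, s) in tokens:
            patch = image[y:y + s, x:x + s]
            if s > 1 and score_fn(patch) > thresh:
                h = s // 2  # split into 4 finer tokens near boundaries
                refined += [(y, x, h), (y, x + h, h),
                            (y + h, x, h), (y + h, x + h, h)]
            else:
                refined.append((y, x, s))  # homogeneous: keep coarse
        tokens = refined
    return tokens

# Toy example: a vertical class boundary at column 6.
img = np.zeros((16, 16))
img[:, 6:] = 1.0
tokens = allocate_tokens(img, lambda p: p.std())
```

On this toy image, only patches straddling the boundary get refined to the finest size, while uniform regions stay as single coarse tokens, which is the intended compute distribution.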

Abstract

We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves state-of-the-art results on ADE20K and COCO-Stuff with substantially fewer FLOPs, and delivers competitive performance on Cityscapes at markedly lower compute. For example, ARTA-Base attains 54.6 mIoU on ADE20K in the ~100M-parameter class while using fewer FLOPs and less memory than comparable backbones.
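Mixed-resolution attention, as described in the abstract, amounts to running attention over a single sequence that mixes coarse and fine tokens, so each token can attend to context at any resolution. A minimal single-head sketch is below; the random projection weights and embedding dimension are placeholders for learned parameters, not ARTA's actual parameterization.

```python
import numpy as np

def mixed_res_attention(feats, d=16, seed=0):
    """Single-head scaled dot-product attention over a mixed token set.

    `feats` is an (N, d) array where each row embeds one token,
    regardless of its spatial size: coarse and fine tokens share one
    sequence, so fine tokens near boundaries attend to coarse context
    and vice versa. Projection weights are random stand-ins for
    learned parameters (hypothetical).
    """
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d)
                  for _ in range(3))
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax rows
    return attn @ V
```

Because the sequence length equals the number of allocated tokens rather than the number of fine-resolution patches, attention cost scales with how many regions actually needed refinement, which is where the FLOP savings come from.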