Self-supervised pretraining for an iterative image size agnostic vision transformer
arXiv cs.CV / 4/23/2026
Key Points
- The paper addresses the computational inefficiency and poor image-size scaling of existing self-supervised Vision Transformers (e.g., DINO), which often force pretraining at low resolution.
- It builds on a foveal-inspired, resolution-agnostic ViT that iteratively processes a fixed-size context of multi-zoom patches using a sequential, recurrent-like procedure without backpropagation through time.
- The authors propose a new sequential-to-global self-supervised learning framework that adapts DINO’s self-distillation objective to unlock the model’s potential as a foundational backbone.
- With an efficient integral-image patch extraction method, the approach supports large-scale pretraining while keeping computational cost constant regardless of input resolution.
- Experiments show competitive results on ImageNet-1K and downstream classification tasks, indicating practical value for flexible, image-size-agnostic vision encoders.
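The constant-cost patch extraction mentioned above relies on an integral image (summed-area table): once the table is built, the sum — and hence the mean — of any axis-aligned rectangle can be read in O(1), so averaging a large "zoomed-out" region costs the same as reading a small one. The sketch below illustrates the general technique only; the function names and the fixed 4×4 output size are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row / left column for easy indexing."""
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def box_mean(sat, top, left, h, w):
    """Mean of img[top:top+h, left:left+w] in O(1), independent of h and w."""
    total = (sat[top + h, left + w] - sat[top, left + w]
             - sat[top + h, left] + sat[top, left])
    return total / (h * w)

def multi_zoom_patch(sat, top, left, out=4, zoom=1):
    """Hypothetical helper: reduce an (out*zoom x out*zoom) window starting at
    (top, left) to a fixed out x out patch by averaging zoom x zoom cells,
    each in O(1) via the integral image."""
    patch = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            patch[i, j] = box_mean(sat, top + i * zoom, left + j * zoom,
                                   zoom, zoom)
    return patch
```

Because each output cell is a single O(1) lookup, a patch covering a quarter of a 4K image is no more expensive than one covering a few pixels, which is what lets a fixed-size multi-zoom context keep cost flat as input resolution grows.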