Self-supervised pretraining for an iterative image size agnostic vision transformer
arXiv cs.CV / 4/23/2026
Key Points
- The paper addresses the computational inefficiency and poor image-size scaling of existing self-supervised Vision Transformers (e.g., DINO), which often force pretraining at low resolution.
- It builds on a foveal-inspired, resolution-agnostic ViT that iteratively processes a fixed-size context of multi-zoom patches using a sequential, recurrent-like procedure without backpropagation through time.
- The authors propose a new sequential-to-global self-supervised learning framework that adapts DINO’s self-distillation objective to unlock the model’s potential as a foundational backbone.
- With an efficient integral-image patch extraction method, the approach supports large-scale pretraining while keeping computational cost constant regardless of input resolution.
- Experiments show competitive results on ImageNet-1K and downstream classification tasks, indicating practical value for flexible, image-size-agnostic vision encoders.
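The constant-cost patch extraction mentioned above relies on an integral image (summed-area table): once the table is built, the sum — and hence the mean — of any axis-aligned rectangle can be read in O(1), so averaging a large "zoomed-out" region costs the same as reading a small one. The sketch below illustrates the general technique only; the function names and the fixed 4×4 output size are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero top row / left column for easy indexing."""
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def box_mean(sat, top, left, h, w):
    """Mean of img[top:top+h, left:left+w] in O(1), independent of h and w."""
    total = (sat[top + h, left + w] - sat[top, left + w]
             - sat[top + h, left] + sat[top, left])
    return total / (h * w)

def multi_zoom_patch(sat, top, left, out=4, zoom=1):
    """Hypothetical helper: reduce an (out*zoom x out*zoom) window starting at
    (top, left) to a fixed out x out patch by averaging zoom x zoom cells,
    each in O(1) via the integral image."""
    patch = np.empty((out, out))
    for i in range(out):
        for j in range(out):
            patch[i, j] = box_mean(sat, top + i * zoom, left + j * zoom,
                                   zoom, zoom)
    return patch
```

Because each output cell is a single O(1) lookup, a patch covering a quarter of a 4K image is no more expensive than one covering a few pixels, which is what lets a fixed-size multi-zoom context keep cost flat as input resolution grows.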