Distilling Vision Transformers for Distortion-Robust Representation Learning

arXiv cs.CV · April 27, 2026


Key Points

  • The paper addresses self-supervised visual representation learning when clean images are unavailable or extremely limited by using pretrained vision models to build distortion-robust representations.
  • It proposes an asymmetric knowledge distillation setup where both teacher and student start from the same pretrained Vision Transformer, but the teacher is trained on clean views while the student is trained on distorted views.
  • Multi-level distillation is introduced to align multiple representation types, including global embeddings, patch-level features, and attention maps, enabling the student to mimic clean-image representations without ever seeing clean data.
  • Experiments on image classification across multiple datasets and distortion types show consistent gains over prior methods given the same amount of human supervision.

Abstract

Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.
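The multi-level distillation objective described above can be sketched as a weighted sum of three alignment terms, one per representation type. Below is a minimal NumPy illustration; the mean-squared-error alignment, the dictionary layout, the array shapes, and the equal default weights are assumptions for clarity, not the paper's exact formulation:

```python
import numpy as np

def multilevel_distill_loss(teacher, student, weights=(1.0, 1.0, 1.0)):
    """Align student (distorted-view) outputs with teacher (clean-view)
    outputs of the same pretrained ViT at three levels.

    teacher / student: dicts with hypothetical keys and shapes
      'cls'   : (d,)       global embedding (e.g. the CLS token)
      'patch' : (n, d)     patch-level features
      'attn'  : (h, n, n)  per-head attention maps
    """
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    loss_global = mse(student['cls'],   teacher['cls'])    # global embeddings
    loss_patch  = mse(student['patch'], teacher['patch'])  # patch features
    loss_attn   = mse(student['attn'],  teacher['attn'])   # attention maps
    return (weights[0] * loss_global
            + weights[1] * loss_patch
            + weights[2] * loss_attn)

# Toy usage: teacher sees a "clean" view, student a "distorted" one.
rng = np.random.default_rng(0)
teacher_out = {'cls': rng.normal(size=8),
               'patch': rng.normal(size=(4, 8)),
               'attn': rng.normal(size=(2, 4, 4))}
student_out = {k: v + 0.1 * rng.normal(size=v.shape)
               for k, v in teacher_out.items()}
loss = multilevel_distill_loss(teacher_out, student_out)
```

In this sketch only the student would receive gradients; the teacher branch is treated as a fixed target, which matches the asymmetric teacher–student setup the paper describes.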