DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification

arXiv cs.CV / 4/9/2026


Key Points

  • The paper introduces DINO-QPM, a lightweight interpretability adapter that transforms DINOv2-like visual foundation model features into globally interpretable, class-independent representations via the Quadratic Programming Enhanced Model (QPM).
  • Instead of relying on the standard CLS-token pathway, DINO-QPM uses average pooling to connect patch embeddings to interpretable features, enabling spatial localization of explanations in the input.
  • It adds a sparsity loss to reduce spatial scatter and background noise, aiming to ground explanations in relevant object parts rather than irrelevant regions.
  • The method adapts QPM to run on strictly frozen DINO backbones and reports improved results versus DINOv2 linear probing in both classification accuracy and explanation quality.
  • Evaluation includes a newly introduced Plausibility metric alongside other interpretability metrics to demonstrate that DINO-QPM yields higher-quality explanations while maintaining strong performance.
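The pooling pathway described above can be sketched in a few lines. This is a hypothetical illustration (the tensor names, grid size, and readout weight `w` are assumptions, not taken from the paper): because the pooled vector is just the mean of the patch embeddings, any linear interpretable feature decomposes into per-patch contributions that can be reshaped into a spatial heatmap.

```python
import numpy as np

def pool_patch_tokens(tokens):
    """Average-pool patch embeddings of shape (N, D) into one feature vector.

    Stand-in for the CLS-free pathway: the pooled vector feeds the
    interpretable feature layer, while each patch's contribution can
    be mapped back to its spatial location.
    """
    return tokens.mean(axis=0)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 768))  # hypothetical 14x14 patch grid, D=768
pooled = pool_patch_tokens(tokens)

# Per-patch contribution to one (hypothetical) linear interpretable feature:
w = rng.standard_normal(768)
contrib = tokens @ w / tokens.shape[0]    # reshape to (14, 14) for a heatmap

# By linearity, the contributions sum exactly to the pooled feature's value,
# which is what makes the global feature spatially localisable.
assert np.isclose(contrib.sum(), pooled @ w)
```

The key design point is that average pooling, unlike the CLS token, is a linear function of the patch embeddings, so the attribution above is exact rather than approximate.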

Abstract

Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average pooling, we directly connect the patch embeddings to the model's features and thereby enable spatial localisation of DINO-QPM's globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of a DINOv2 linear probe. Extensive experiments, evaluated through a newly introduced Plausibility metric alongside other interpretability metrics, demonstrate that DINO-QPM is superior to other methods applicable to frozen visual foundation models in both classification accuracy and explanation quality.
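The abstract does not spell out the sparsity loss, so the following is only an illustrative sketch of one common choice, an L1/L2 ratio penalty on a feature's spatial activation map (the function name, grid size, and exact formulation are assumptions, not the paper's): it is small when activation concentrates on a few patches and large when it scatters across the background.

```python
import numpy as np

def sparsity_loss(act_map, eps=1e-8):
    """Illustrative sparsity penalty (L1/L2 ratio) on a spatial
    activation map. Low for a peaked map, high for a diffuse one;
    NOT the paper's exact loss, just a common sparsity surrogate."""
    a = np.abs(act_map).ravel()
    return float(a.sum() / (np.linalg.norm(a) + eps))

# A map concentrated on one patch vs. one spread uniformly over 14x14 patches:
peaked = np.zeros((14, 14)); peaked[7, 7] = 1.0
scattered = np.ones((14, 14)) / 196

# The peaked map incurs a much smaller penalty, so minimising this loss
# pushes explanations toward compact object parts rather than background.
assert sparsity_loss(peaked) < sparsity_loss(scattered)
```

Whatever the paper's concrete formulation, the intent stated in the abstract is the same: penalise spatially scattered activations so that each interpretable feature localises to a coherent region.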