HSFM: Hard-Set-Guided Feature-Space Meta-Learning for Robust Classification under Spurious Correlations

arXiv cs.CV / 4/1/2026


Key Points

  • The paper studies why deep neural networks perform poorly under distribution shift: they exploit spurious correlations, which fail on minority-group (hard) samples.
  • It argues that the classifier head is a major source of failure and builds on the idea of freezing a strong feature extractor/backbone while improving a lightweight head.
  • HSFM (Hard-Set-Guided Feature-Space Meta-Learning) is introduced as a bilevel meta-learning approach that performs targeted feature-space augmentations (feature edits) to improve worst-group performance with few inner-loop updates.
  • By editing features at the backbone output rather than in pixel space or via end-to-end optimization, the method is reported to be efficient and stable, training in minutes on a single GPU.
  • The authors provide CLIP-based visualizations suggesting that the learned feature-space updates correspond to semantically meaningful changes aligned with spurious attributes.
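To make the frozen-backbone idea in the second bullet concrete, here is a minimal numpy sketch of retraining only a lightweight linear head on fixed features, using a group-balanced subset so the head stops leaning on the spurious dimension. Everything here (the synthetic 2-D "features", the 90/10 group split, and the balancing recipe) is an illustrative assumption, not the paper's actual setup or data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen-backbone features: dim 0 carries the core
# (label-aligned) signal; dim 1 is a spurious attribute that agrees with
# the label only on majority-group samples and flips on the minority.
n = 200
y = rng.integers(0, 2, n)                      # binary labels in {0, 1}
core = np.where(y == 1, 1.0, -1.0) + 0.3 * rng.standard_normal(n)
majority = rng.random(n) < 0.9                 # ~90% majority group
spur = np.where(majority,
                np.where(y == 1, 1.0, -1.0),   # spurious cue agrees ...
                np.where(y == 1, -1.0, 1.0))   # ... except on the minority
X = np.stack([core, spur + 0.3 * rng.standard_normal(n)], axis=1)

def train_head(X, y, lr=0.5, steps=500):
    """Logistic-regression head on frozen features (backbone untouched)."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
        g = p - y                               # logistic-loss residual
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# Retrain only the head on a small group-balanced subset: equal numbers
# of majority and minority samples, so the spurious cue is uninformative.
k = int(min(majority.sum(), (~majority).sum()))
idx = np.concatenate([np.flatnonzero(majority)[:k], np.flatnonzero(~majority)[:k]])
w, b = train_head(X[idx], y[idx])

pred = (X @ w + b) > 0
worst_group_acc = float((pred[~majority] == y[~majority]).mean())
print(f"minority-group accuracy: {worst_group_acc:.2f}")
```

Because the balanced subset decorrelates the spurious dimension from the label, the retrained head puts its weight on the core feature and minority-group accuracy recovers, which is the effect the bullet attributes to head retraining over a frozen backbone.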

Abstract

Deep neural networks often rely on spurious features to make predictions, which makes them brittle under distribution shift and on samples where the spurious correlation does not hold (e.g., minority-group examples). Recent studies have shown that, even in such settings, the feature extractor of an Empirical Risk Minimization (ERM)-trained model can learn rich and informative representations, and that much of the failure may be attributed to the classifier head. In particular, retraining a lightweight head while keeping the backbone frozen can substantially improve performance on shifted distributions and minority groups. Motivated by this observation, we propose a bilevel meta-learning method that performs augmentation directly in feature space to improve spurious correlation handling in the classifier head. Our method learns support-side feature edits such that, after a small number of inner-loop updates on the edited features, the classifier achieves lower loss on hard examples and improved worst-group performance. By operating at the backbone output rather than in pixel space or through end-to-end optimization, the method is highly efficient and stable, requiring only a few minutes of training on a single GPU. We further validate our method with CLIP-based visualizations, showing that the learned feature-space updates induce semantically meaningful shifts aligned with spurious attributes.
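To make the bilevel structure in the abstract concrete, the following numpy sketch learns support-side feature edits by differentiating the hard-set loss through a single inner-loop head update. It is a simplified stand-in under stated assumptions: synthetic 2-D features, a linear head with squared loss, one inner gradient step, and a closed-form outer gradient in place of automatic differentiation; names like `edit_grad` and `inner_step` are hypothetical, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy frozen-backbone features: dim 0 is the core signal, dim 1 a spurious
# cue that agrees with the label on the support set but flips on the hard
# (minority-group) set.  Illustrative data, not the paper's benchmarks.
d, n_sup, n_hard = 2, 64, 16
y_sup = rng.integers(0, 2, n_sup) * 2.0 - 1.0           # labels in {-1, +1}
X_sup = np.stack([y_sup, y_sup], axis=1) + 0.2 * rng.standard_normal((n_sup, d))
y_hard = rng.integers(0, 2, n_hard) * 2.0 - 1.0
X_hard = np.stack([y_hard, -y_hard], axis=1) + 0.2 * rng.standard_normal((n_hard, d))

lr_in = 0.5  # inner-loop step size on the linear head (squared loss)

def inner_step(delta, w):
    """One inner-loop head update on the edited support features."""
    Xe = X_sup + delta                          # feature-space edit, not pixels
    return w - lr_in * Xe.T @ (Xe @ w - y_sup) / n_sup

def hard_loss(delta, w):
    """Outer objective: hard-set loss after the inner update."""
    r = X_hard @ inner_step(delta, w) - y_hard
    return 0.5 * float((r ** 2).mean())

def edit_grad(delta, w):
    """Gradient of hard_loss w.r.t. the per-example edits, i.e. backprop
    through the single inner step, written in closed form."""
    Xe = X_sup + delta
    e = Xe @ w - y_sup                          # support residuals
    w1 = w - lr_in * Xe.T @ e / n_sup           # head after the inner step
    u = X_hard.T @ (X_hard @ w1 - y_hard) / n_hard
    return -(lr_in / n_sup) * (np.outer(e, u) + np.outer(Xe @ u, w))

# Outer loop: learn edits delta so that the *adapted* head does better on
# the hard set.  The initial head is biased toward the spurious dimension.
w0 = np.array([0.0, 1.0])
delta = np.zeros((n_sup, d))
before = hard_loss(delta, w0)
for _ in range(100):
    delta -= 0.5 * edit_grad(delta, w0)
after = hard_loss(delta, w0)
print(f"hard-set loss: {before:.3f} -> {after:.3f}")
```

The point of the sketch is the shape of the computation, which matches the abstract: the outer loop only ever touches features at the backbone output, so each outer step costs a handful of small matrix products rather than a backward pass through the network, which is why this kind of feature-space bilevel scheme can train in minutes on one GPU.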