Learning from Imperfect Text Guidance: Robust Long-Tail Visual Recognition with High-Noise Label

arXiv cs.CV / 4/28/2026

📰 NewsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper addresses a common real-world problem where long-tailed datasets contain many high-noise (inaccurate) labels that significantly degrade deep model performance.
  • It argues that prior methods miss a key issue in high-noise settings: severe label–image mismatch, and proposes explicitly correcting this inconsistency.
  • The proposed approach uses auxiliary text from the noisy labels and exploits cross-modal alignment in pre-trained vision-language models to generate a supervision signal called Weak Teacher Supervision (WTS).
  • WTS is selectively activated by measuring disagreement between text-predicted labels and the observed noisy labels, aiming to reduce the impact of label noise and distribution bias.
  • Experiments on both synthetic and real-world datasets show that WTS improves robustness, with the largest gains in high-noise conditions, and the authors release code publicly.

Abstract

Real-world data often exhibit long-tailed distributions with numerous noisy labels, substantially degrading the performance of deep models. While prior research has made progress in addressing this combined challenge, it overlooks the severe label-image mismatch inherent to high-noise settings, thereby limiting their effectiveness. Given that observed labels, though mismatched with images, still retain category information, we propose employing auxiliary text information from labels to address label-image inconsistencies in long-tailed noisy data. Specifically, we leverage the intrinsic cross-modal alignment in pre-trained visual-language models to correct the label-image inconsistencies. This supervisory signal, referred to as Weak Teacher Supervision (WTS), is unaffected by label noise and data distribution biases, albeit exhibits limited accuracy. Therefore, the activation of WTS is determined by evaluating the discrepancy between text-predicted labels and observed labels. Extensive experiments demonstrate the superior performance of WTS across synthetic and real-world datasets, particularly under high-noise conditions. The source code is available at https://anonymous.4open.science/r/WTS-0F3C.