Spectral Signatures of Data Quality: Eigenvalue Tail Index as a Diagnostic for Label Noise in Neural Networks

arXiv cs.LG · March 31, 2026


Key Points

  • The study tests whether spectral properties of neural network weight matrices can predict test accuracy, and finds that the eigenvalue tail index α at the bottleneck layer strongly tracks accuracy under controlled label-noise variation (leave-one-out R² = 0.984), far outperforming conventional metrics such as the Frobenius norm (LOO R² = 0.149).
  • This predictive relationship is reported to generalize across three architectures (MLP, CNN, ResNet-18) and two datasets (MNIST, CIFAR-10) when the dominant factor is label corruption.
  • When hyperparameters are varied while data quality is held fixed, both spectral measures (including tail α) and conventional measures become weak predictors of accuracy (R² < 0.25), and simple baselines slightly outperform spectral ones.
  • The authors therefore position the tail index as a diagnostic for data quality—detecting label noise and training-set degradation—rather than a universal generalization predictor.
  • A calibrated detector trained on synthetic noise is reported to identify real human annotation errors in CIFAR-10N (detecting 9% noise with 3% error). The work localizes the signature to the information-processing bottleneck layer, connects it to BBP phase-transition concepts from spiked random matrix models, and reports the eigenvalue level spacing ratio ⟨r⟩ as uninformative due to Wishart universality.
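The headline quantity above, the tail index α, can be sketched as follows: form the layer correlation matrix X = WᵀW / n, take its eigenvalues, and fit a power law p(λ) ∝ λ^(−α) to the upper tail. The snippet below is a minimal illustration using the standard continuous power-law maximum-likelihood estimator; the helper name `tail_alpha`, the fixed tail fraction, and the eigenvalue cutoff are assumptions for illustration, not the authors' actual fitting procedure.

```python
import numpy as np

def tail_alpha(W, k_frac=0.2):
    """Estimate the tail index alpha of the eigenvalue spectrum of the
    layer correlation matrix X = W^T W / n, via the continuous power-law
    MLE applied to the largest k eigenvalues. Illustrative sketch only;
    the paper's exact estimator may differ (e.g., in how x_min is chosen)."""
    n = W.shape[0]
    # Eigenvalues of the symmetric correlation matrix, sorted descending.
    eig = np.sort(np.linalg.eigvalsh(W.T @ W / n))[::-1]
    eig = eig[eig > 1e-12]                   # drop numerical zeros
    k = max(2, int(k_frac * len(eig)))       # size of the fitted tail
    top, x_min = eig[:k], eig[k - 1]
    # MLE for p(x) ~ x^(-alpha), x >= x_min:
    #   alpha = 1 + k / sum_i ln(x_i / x_min)
    return 1.0 + k / np.sum(np.log(top / x_min))
```

Under the study's framing, lower α (heavier eigenvalue tail) would accompany cleaner labels at the bottleneck layer, so α could be monitored per layer during training as a data-quality probe.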

Abstract

We investigate whether spectral properties of neural network weight matrices can predict test accuracy. Under controlled label-noise variation, the tail index α of the eigenvalue distribution at the network's bottleneck layer predicts test accuracy with leave-one-out R² = 0.984 (21 noise levels, 3 seeds per level), far exceeding all baselines: the best conventional metric (Frobenius norm of the optimal layer) achieves LOO R² = 0.149. This relationship holds across three architectures (MLP, CNN, ResNet-18) and two datasets (MNIST, CIFAR-10). However, under hyperparameter variation at fixed data quality (180 configurations varying width, depth, learning rate, and weight decay), all spectral and conventional measures are weak predictors (R² < 0.25), with simple baselines (global L₂ norm, LOO R² = 0.219) slightly outperforming spectral measures (tail α, LOO R² = 0.167). We therefore frame the tail index as a data-quality diagnostic: a powerful detector of label corruption and training-set degradation, rather than a universal generalization predictor. A noise detector calibrated on synthetic noise successfully identifies real human annotation errors in CIFAR-10N (9% noise detected with 3% error). We identify the information-processing bottleneck layer as the locus of this signature and connect the observations to the BBP phase transition in spiked random matrix models. We also report a negative result: the level spacing ratio ⟨r⟩ is uninformative for weight matrices due to Wishart universality.
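The negative result at the end of the abstract concerns the consecutive level-spacing ratio ⟨r⟩, defined from adjacent gaps s_i between sorted eigenvalues as r_i = min(s_i, s_{i+1}) / max(s_i, s_{i+1}). Because Wishart-type matrices fall in the same random-matrix universality class regardless of training details, ⟨r⟩ stays pinned near the universal GOE value (≈ 0.536) and carries no data-quality signal. A minimal sketch of the computation, assuming the same X = WᵀW / n correlation matrix as above (the helper name and normalization are illustrative, not the authors' code):

```python
import numpy as np

def mean_spacing_ratio(W):
    """Mean consecutive level-spacing ratio <r> of the eigenvalues of
    X = W^T W / n.  The ratio statistic needs no spectral unfolding,
    which is why it is a popular universality probe.  Illustrative
    sketch, not the authors' implementation."""
    n = W.shape[0]
    eig = np.sort(np.linalg.eigvalsh(W.T @ W / n))
    s = np.diff(eig)                                   # adjacent gaps
    r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
    return float(np.mean(r))
```

For a random Gaussian layer this returns a value near the GOE prediction; the paper's point is that trained layers do too, so unlike tail α, ⟨r⟩ cannot separate clean from noisy training data.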