VISTA: Validation-Informed Trajectory Adaptation via Self-Distillation

arXiv cs.AI / 4/15/2026


Key Points

  • The paper identifies a failure mode called Trajectory Deviation, where deep models maintain strong validation accuracy yet still converge to suboptimal solutions by abandoning earlier high-generalization states, without triggering classical overfitting signals.
  • It proposes VISTA, an online self-distillation framework that enforces consistency along the model’s optimization trajectory using a validation-informed Marginal Coverage score to select “expert anchor” model states.
  • VISTA builds a coverage-weighted ensemble of these expert anchors during training, using it to regularize the loss landscape and preserve previously learned latent features.
  • Experiments across multiple benchmarks show VISTA improves robustness and generalization compared with standard training and prior self-distillation approaches.
  • The authors report that a lightweight implementation cuts storage overhead by about 90% while maintaining performance, making the method more practical.
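To make the mechanics above concrete, here is a minimal NumPy sketch of the core ideas: scoring model states by per-subpopulation validation accuracy (a hypothetical proxy for the paper's Marginal Coverage score), blending anchor predictions with coverage-derived weights, and computing a distillation consistency term. All function names and the exact scoring rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def marginal_coverage(probs, labels, groups):
    # Per-group validation accuracy; a stand-in for the paper's
    # validation-informed Marginal Coverage score (assumption).
    scores = {}
    for g in np.unique(groups):
        mask = groups == g
        scores[g] = float((probs[mask].argmax(1) == labels[mask]).mean())
    return scores

def coverage_weighted_targets(anchor_probs, anchor_cov, group):
    # Blend expert-anchor predictions for one data region, weighting
    # each anchor by its coverage score on that region.
    w = np.array([cov[group] for cov in anchor_cov])
    w = w / w.sum()
    return np.einsum('a,anc->nc', w, np.asarray(anchor_probs))

def distill_loss(student_probs, targets, eps=1e-12):
    # KL(targets || student): the online consistency term that would be
    # added to the task loss to regularize the trajectory.
    return float(np.mean(np.sum(
        targets * (np.log(targets + eps) - np.log(student_probs + eps)),
        axis=1)))
```

In a training loop, one would periodically snapshot model states whose coverage on some subpopulation exceeds the current model's, keep them as anchors, and add `distill_loss` between the student's predictions and the coverage-weighted ensemble targets to the standard objective.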

Abstract

Deep learning models may converge to suboptimal solutions despite strong validation accuracy, masking an optimization failure we term Trajectory Deviation. As training proceeds, models can abandon high-generalization states for specific data sub-populations, discarding previously learned latent features without triggering classical overfitting signals. To address this problem, we introduce VISTA, an online self-distillation framework that enforces consistency along the optimization trajectory. Using a validation-informed Marginal Coverage score, VISTA identifies expert anchors: earlier model states that retain specialized competence over distinct data regions. A coverage-weighted ensemble of these anchors is integrated online during training, regularizing the loss landscape and preserving mastered knowledge. Evaluated across multiple benchmarks, VISTA demonstrates improved robustness and generalization over standard training and prior self-distillation methods, while a lightweight implementation reduces storage overhead by 90% without performance loss.