Pushing the Limits of Distillation-Based Continual Learning via Classifier-Proximal Lightweight Plugins

arXiv stat.ML / 4/6/2026


Key Points

  • The paper tackles distillation-based continual learning, focusing on the stability-plasticity dilemma that limits how well coupled distillation and learning objectives can preserve old knowledge while learning new data.
  • It introduces Distillation-aware Lightweight Components (DLC), a plugin-based extension that inserts lightweight residual plugins into the classifier-proximal layer to apply semantic-level corrections without heavily disturbing the base feature extractor.
  • For inference, DLC aggregates plugin-enhanced representations to form predictions, and it adds a lightweight weighting unit to down-rank non-target plugin representations and reduce interference.
  • Experiments report roughly an 8% accuracy improvement on large-scale benchmarks with only a 4% increase in backbone parameters, indicating strong parameter efficiency with minimal disruption to the base model.
  • The approach is designed to be compatible with other plug-and-play continual learning enhancements and can provide additional gains when combined with them.
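The residual-plugin idea from the key points above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the bottleneck dimensions, zero-initialized up-projection, and class names are all assumptions chosen to show how a lightweight component can add a semantic-level correction at the classifier-proximal layer while leaving the frozen backbone features untouched.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ResidualPlugin:
    """Hypothetical bottleneck plugin: adds a small residual correction
    to a classifier-proximal feature vector; the base extractor is untouched."""
    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down- and up-projections form a low-parameter bottleneck,
        # keeping the plugin's size a small fraction of the backbone's.
        self.w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        # Zero-initialized up-projection makes the plugin an identity at init,
        # so inserting it does not disturb the pretrained predictions.
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, h):
        # Semantic-level residual correction: h + f(h)
        return h + relu(h @ self.w_down) @ self.w_up

features = np.ones((2, 16))              # toy frozen-backbone features
plugin = ResidualPlugin(dim=16, bottleneck=4)
corrected = plugin(features)
print(np.allclose(corrected, features))  # identity at initialization -> True
```

With a 16-dimensional feature and a bottleneck of 4, the plugin adds only 128 parameters per task, which is the kind of budget that keeps the reported overhead at a few percent of the backbone.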

Abstract

Continual learning (CL) requires models to learn continuously while preserving prior knowledge under evolving data streams. Distillation-based methods are appealing for retaining past knowledge in a shared single-model framework with low storage overhead. However, they remain constrained by the stability-plasticity dilemma: knowledge acquisition and preservation are still optimized through coupled objectives, and existing enhancement methods do not alter this underlying bottleneck. To address this issue, we propose a plugin extension paradigm termed Distillation-aware Lightweight Components (DLC) for distillation-based CL. DLC deploys lightweight residual plugins into the base feature extractor's classifier-proximal layer, enabling semantic-level residual correction for better classification accuracy while minimizing disruption to the overall feature extraction process. During inference, plugin-enhanced representations are aggregated to produce classification predictions. To mitigate interference from non-target plugins, we further introduce a lightweight weighting unit that learns to assign importance scores to different plugin-enhanced representations. DLC delivers a significant 8% accuracy gain on large-scale benchmarks while introducing only a 4% increase in backbone parameters, highlighting its efficiency. Moreover, DLC is compatible with other plug-and-play CL enhancements and delivers additional gains when combined with them.
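The inference-time aggregation described in the abstract can be illustrated with a minimal sketch. The softmax weighting, the `aggregate` helper, and the toy inputs below are assumptions for exposition; the paper's weighting unit is a learned module, whereas here fixed scores stand in for its output.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over importance scores.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate(plugin_reps, scores):
    """Weight each plugin-enhanced representation by an importance score,
    down-ranking non-target plugins before the shared classifier.

    plugin_reps: (num_plugins, dim) array of plugin outputs
    scores:      (num_plugins,) importance scores (here fixed; learned in DLC)
    """
    w = softmax(scores)                              # normalized weights
    return np.tensordot(w, plugin_reps, axes=(0, 0)) # weighted sum -> (dim,)

reps = np.stack([np.full(8, 1.0), np.full(8, 3.0)])  # two plugin outputs
scores = np.array([0.0, 0.0])                        # uniform importance
print(aggregate(reps, scores))                       # mean of the two: all 2.0
```

Raising one plugin's score pushes the aggregate toward its representation, which is how a learned weighting unit can suppress interference from plugins trained on non-target tasks.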