AI Navigate

A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

arXiv cs.CL / 3/20/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper presents a comparative empirical study of catastrophic forgetting mitigation for sequential task adaptation in continual natural language processing, using a 10-task label-disjoint CLINC150 setup for intent classification.
  • It evaluates three backbones—ANN, GRU, and Transformer—and three continual learning strategies—MIR (replay), LwF (regularization), and HAT (parameter isolation)—in various combinations.
  • Results show that naive sequential fine-tuning suffers severe forgetting across architectures, while replay-based MIR is the most reliable single strategy, and combinations including MIR achieve high final performance with near-zero or mildly positive backward transfer.
  • The optimal CL configuration is architecture-dependent (e.g., MIR+HAT for ANN/Transformer, MIR+LwF+HAT for GRU), and in some cases CL methods even surpass joint training, highlighting the importance of jointly selecting backbone and CL mechanism for continual intent classification systems.
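The 10-task label-disjoint setup above can be illustrated with a short sketch. Assuming the paper partitions the 150 in-scope CLINC150 intents into 10 disjoint groups of 15 (a natural reading, though the exact split and seed are not given here), the construction looks like:

```python
import random

NUM_INTENTS = 150  # in-scope intents in CLINC150
NUM_TASKS = 10     # length of the task sequence in the paper

def make_label_disjoint_tasks(num_intents=NUM_INTENTS, num_tasks=NUM_TASKS, seed=0):
    """Partition intent labels into disjoint groups, one group per task.

    Each task sees only its own 15 intent classes, so later tasks
    introduce entirely new labels -- the setting in which naive
    sequential fine-tuning forgets earlier intents.
    """
    labels = list(range(num_intents))
    random.Random(seed).shuffle(labels)  # hypothetical seed; the paper's split may differ
    per_task = num_intents // num_tasks
    return [labels[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]

tasks = make_label_disjoint_tasks()
assert len(tasks) == NUM_TASKS
assert all(len(t) == 15 for t in tasks)
# the label sets are pairwise disjoint and cover all 150 intents
assert len({lab for t in tasks for lab in t}) == NUM_INTENTS
```

The shuffle seed and even-split policy are illustrative assumptions; any disjoint partition of the label space yields the same class-incremental structure.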

Abstract

Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent: MIR+HAT yields the best results for the ANN and Transformer, whereas MIR+LwF+HAT works best for the GRU; in several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
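The evaluation metrics mentioned in the abstract follow standard continual-learning definitions. A minimal sketch, assuming the usual accuracy matrix `R[i][j]` (test accuracy on task `j` after finishing training on task `i`): average accuracy is the mean of the last row, and backward transfer (BWT) compares each task's final accuracy to its accuracy right after it was learned, so negative BWT quantifies forgetting.

```python
def average_accuracy(R):
    """Mean accuracy over all tasks after training on the final task.

    R[i][j] = test accuracy on task j after finishing training on task i.
    """
    T = len(R)
    return sum(R[T - 1]) / T

def backward_transfer(R):
    """BWT = mean over earlier tasks of (final accuracy minus accuracy
    immediately after that task was learned). Negative values indicate
    forgetting; near-zero or positive values indicate retention or
    improvement of old tasks, as reported for the MIR-based combinations.
    """
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

# Toy 3-task example (illustrative numbers, not from the paper):
# task 0 is partially forgotten by the end of the sequence.
R = [
    [0.90, 0.00, 0.00],
    [0.80, 0.92, 0.00],
    [0.70, 0.90, 0.94],
]
print(round(average_accuracy(R), 4))   # (0.70 + 0.90 + 0.94) / 3
print(round(backward_transfer(R), 4))  # ((0.70 - 0.90) + (0.90 - 0.92)) / 2
```

These are the definitions popularized by the GEM line of work; the paper may use a minor variant, but the stability-plasticity reading is the same: high final average accuracy with BWT near zero means new tasks were learned without erasing old ones.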