Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair

arXiv cs.AI / 4/27/2026

Key Points

  • The paper identifies a hidden failure mode that arises when continual-learning methods modify gradients upstream while treating Adam as a neutral backend: the combination can cause near-collapse behavior on high-overlap domain streams.
  • In an 8-domain continual LM setting, shared-routing projection baselines land close to vanilla forgetting; a 0.5% replay buffer is the strongest shared alternative yet still falls well short, and fixed-strength decoupling can do worse than vanilla.
  • The authors trace the problem to Adam’s second-moment pathway: projection can inflate the effective learning rate along old gradient directions by a factor of about 1/(1-alpha), a failure that stays largely invisible on clean benchmarks (a numeric sketch follows this list).
  • They propose “Adaptive Decoupled Moment Routing,” which routes the modified gradient only to Adam’s first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength.
  • Across tested scales and setups (including a 16-domain stream and LoRA at ~7B), the proposed routing is the only configuration that consistently avoids collapse and yields large improvements over the strongest shared-routing baselines.
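
A rough sanity check on the claimed factor, under one reading of the mechanism: if projection attenuates the old-direction gradient component by (1 - alpha) before it reaches Adam, the second moment settles near ((1 - alpha) * g)^2, so a full-magnitude old-direction gradient (arriving via overlap or replay) is divided by (1 - alpha) * |g|. The snippet below is our reconstruction, not the paper's derivation.

```python
# Hypothetical reconstruction of the 1/(1 - alpha) effective-LR inflation.
# Assumption: projection scales the old-direction gradient by (1 - alpha),
# so Adam's second moment converges to ((1 - alpha) * g)^2 along that axis.
for alpha in (0.1, 0.3, 0.5, 0.7, 0.9):
    g = 1.0                           # raw old-direction gradient magnitude
    v = ((1.0 - alpha) * g) ** 2      # steady-state second moment under projection
    inflation = g / (v ** 0.5)        # step scale when a raw gradient arrives
    print(f"alpha={alpha:.1f}  predicted={1.0 / (1.0 - alpha):.2f}  "
          f"measured={inflation:.2f}")
```

Here the predicted and measured columns agree by construction; the paper reports the analogous agreement empirically, within 8% across eight alpha values.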

Abstract

Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual-LM setting, all shared-routing projection baselines collapse to near vanilla forgetting (12.5–12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still only reaches 11.6, while fixed-strength decoupling does worse than vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5–4.8 units. The failure is largely invisible on clean benchmarks. We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, with replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scales.
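
To make the fix concrete, here is a minimal single-step sketch of decoupled moment routing as we read the abstract: the upstream-modified gradient drives Adam's first moment (direction), while the raw gradient drives the second moment (magnitude statistics). The function name decoupled_adam_step and the scalar strength knob are illustrative; the paper's overlap-aware estimator for the adaptive strength is not reproduced here.

```python
import numpy as np

def decoupled_adam_step(param, m, v, g_raw, g_mod, t,
                        lr=1e-3, beta1=0.9, beta2=0.999,
                        eps=1e-8, strength=1.0):
    """One Adam update with decoupled moment routing (illustrative sketch).

    g_mod    : upstream-modified gradient (e.g., projected); sets direction.
    g_raw    : unmodified gradient; keeps second-moment statistics
               magnitude-faithful.
    strength : weight on the modified gradient in the first-moment input;
               the paper adapts this from domain overlap (not modeled here).
    """
    g_dir = strength * g_mod + (1.0 - strength) * g_raw
    m = beta1 * m + (1.0 - beta1) * g_dir         # first moment: modified grad
    v = beta2 * v + (1.0 - beta2) * g_raw ** 2    # second moment: raw grad
    m_hat = m / (1.0 - beta1 ** t)                # standard bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

With g_mod equal to g_raw this reduces exactly to standard Adam, which is what makes the routing a drop-in change around existing gradient-modification methods.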