Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair
arXiv cs.AI / April 27, 2026
Key Points
- The paper identifies a hidden failure mode that arises when continual-learning methods modify gradients upstream while treating Adam as a neutral backend: the combination can drive near-collapse on high-overlap domain streams.
- In an 8-domain continual LM setting, shared-routing projection baselines forget nearly as much as vanilla fine-tuning, even with a 0.5% replay buffer, and fixed-strength decoupling can perform worse than vanilla.
- The authors trace the problem to Adam’s second-moment pathway: because Adam normalizes each coordinate by the magnitude of its gradient history, scaling down an old gradient direction via projection is largely cancelled out, inflating the effective learning rate along that direction by a factor of roughly 1/(1 - alpha); the conflict therefore stays largely invisible on clean benchmarks (see the numeric sketch after this list).
- They propose “Adaptive Decoupled Moment Routing,” which routes the modified gradient only into Adam’s first moment, keeps the second-moment statistics faithful to raw gradient magnitudes, and sets the decoupling strength adaptively from domain overlap (sketched after this list).
- Across tested scales and setups (including a 16-domain stream and LoRA at ~7B), the proposed routing is the only configuration that consistently avoids collapse and yields large improvements over the strongest shared-routing baselines.
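To make the second-moment claim concrete, here is a minimal numeric sketch, not taken from the paper, assuming the upstream method simply scales an old-direction gradient component by (1 - alpha). Because Adam divides the first moment by the square root of the second moment, uniformly rescaling a coordinate's gradient cancels out of the update, so the realized step is roughly 1/(1 - alpha) times the step the projection intended.

```python
import math

def adam_step(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam over a stream of gradients; return the final update."""
    m = v = step = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g           # first moment
        v = beta2 * v + (1 - beta2) * g * g       # second moment
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        step = lr * m_hat / (math.sqrt(v_hat) + eps)
    return step

alpha = 0.8                                  # hypothetical projection strength
raw = [0.5] * 100                            # gradient along an old-task direction
projected = [(1 - alpha) * g for g in raw]   # what the projection hands to Adam

full_step = adam_step(raw)
proj_step = adam_step(projected)
intended = (1 - alpha) * full_step           # the step the projection meant to take

print(f"step on raw grads:       {full_step:.6f}")
print(f"step on projected grads: {proj_step:.6f}")             # nearly unchanged
print(f"inflation vs intended:   {proj_step / intended:.2f}x") # ~1/(1-alpha) = 5x
```

With alpha = 0.8, the step on the projected gradients comes out essentially identical to the step on the raw gradients, i.e., about 5x the (1 - alpha)-scaled step the projection intended, matching the 1/(1 - alpha) factor in the key points.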
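And a minimal sketch of the proposed routing under the same scalar setup, with hypothetical names throughout: the modified gradient drives only the first moment, the raw gradient keeps the second moment magnitude-faithful, and the decoupling strength is set from an assumed overlap score in [0, 1] (the paper's actual overlap measure is not described here).

```python
import math

def decoupled_adam_step(state, g_raw, g_mod, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam step with decoupled moment routing (sketch).

    g_mod (the projected gradient) updates only the first moment;
    g_raw keeps the second moment faithful to true gradient magnitudes.
    """
    state["t"] += 1
    t = state["t"]
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_mod      # direction: modified grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_raw ** 2 # scale: raw grad
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def adaptive_strength(overlap, alpha_max=0.9):
    """Hypothetical overlap-aware schedule: decouple harder when the
    current gradient overlaps old-task directions more (overlap in [0, 1])."""
    return alpha_max * overlap

state = {"m": 0.0, "v": 0.0, "t": 0}
g = 0.5                          # current gradient along an old-task direction
overlap = 0.9                    # assumed overlap score with old-task subspace
alpha = adaptive_strength(overlap)
g_mod = (1 - alpha) * g          # projection scaled by the adaptive strength
print(decoupled_adam_step(state, g_raw=g, g_mod=g_mod))
```

Because the second moment still reflects the raw gradient's magnitude, the (1 - alpha) damping applied to g_mod now actually shrinks the step along the old direction instead of being normalized away.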