Learning to Forget: Continual Learning with Adaptive Weight Decay
arXiv cs.LG / 5/1/2026
Key Points
- The paper addresses continual learning under finite capacity by proposing controlled forgetting to free up model capacity for new knowledge.
- It argues that standard weight decay amounts to uniform forgetting, which is inefficient when some parameters encode stable knowledge while others must track rapidly changing targets.
- The authors introduce FADE (Forgetting through Adaptive Decay), which adapts weight-decay rates per parameter online using approximate meta-gradient descent.
- They derive FADE in an online linear setting and test it by applying the method to the final layer of neural networks (a sketch of the idea follows after this list).
- Experiments show FADE learns distinct decay rates automatically, works well alongside step-size adaptation, and improves performance over fixed weight decay on online tracking and streaming classification tasks.
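This summary does not give the paper's exact update rules, but the core idea of adapting per-parameter decay rates by meta-gradient descent can be sketched in an IDBD-style online linear regression: each weight gets its own decay rate, and a trace of how each weight depends on its decay parameter drives a meta-gradient step on that parameter. Everything below (the names `fade_sketch`, `beta`, `h`, `theta`, the sigmoid parameterization, and the diagonal approximation) is an illustrative assumption, not the authors' notation or their precise derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fade_sketch(stream, n_features, alpha=0.1, theta=0.01):
    """Online linear regression with per-parameter adaptive weight decay.

    IDBD-style meta-gradient sketch: each weight w[i] has a decay rate
    lam[i] = sigmoid(beta[i]) that is itself adapted online to reduce
    the squared prediction error. `theta` is the meta step size.
    """
    w = np.zeros(n_features)          # weights
    beta = np.full(n_features, -4.0)  # log-odds of decay rates (small init)
    h = np.zeros(n_features)          # trace of dw_i/dbeta_i (diagonal approx.)

    for x, y in stream:               # stream yields (features, target) pairs
        delta = y - w @ x             # prediction error

        # Meta-gradient step on beta: d(0.5*delta^2)/dbeta_i ~ -delta*x_i*h_i,
        # so descending the loss adds +theta*delta*x_i*h_i.
        beta += theta * delta * x * h
        lam = sigmoid(beta)           # per-parameter decay rates in (0, 1)

        # Update the trace h_i = dw_i/dbeta_i using the pre-update weights:
        # h <- (1 - lam - alpha*x^2) * h - lam*(1 - lam) * w
        h = (1.0 - lam - alpha * x * x) * h - lam * (1.0 - lam) * w
        h = np.clip(h, -1e3, 1e3)     # keep the trace bounded

        # Weight update: per-parameter decay toward zero, then gradient step
        w = (1.0 - lam) * w + alpha * delta * x

    return w, sigmoid(beta)
```

Under this sketch, weights that track a stable target accumulate meta-gradient pressure toward small decay rates, while weights chasing a drifting target are pushed toward larger ones, which matches the qualitative behavior the key points describe.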