Delve into the Applicability of Advanced Optimizers for Multi-Task Learning

arXiv cs.LG / 4/13/2026


Key Points

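  • The paper finds that optimization-based multi-task learning (MTL) methods can underperform when paired with advanced optimizers, because the gradients computed at the current step (the "instant-derived" gradients) contribute only marginally to the actual parameter updates, which limits how much these methods can shape the learning dynamics.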
  • It observes that Muon, a recently introduced advanced optimizer, inherently behaves like a multi-task learner, so the choice of gradients fed into its orthogonalization step is critically important.
  • To address these issues, the authors introduce APT (Applicability of advanced oPTimizers), which adds a simple adaptive momentum mechanism to balance advanced-optimizer behavior with multi-task needs.
  • The framework also includes a lightweight direction-preservation technique to improve Muon’s orthogonalization process.
  • Experiments on four mainstream MTL datasets show APT consistently improves multiple existing MTL approaches with substantial performance gains.
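
The first key point can be illustrated with a small numerical sketch. It assumes an Adam-style exponential moving average (EMA) for the first moment with a decay of 0.9, which is a common default and not a value taken from the paper. It does not reproduce APT's adaptive momentum mechanism, which this summary does not specify. The function names are hypothetical.

```python
# Sketch under stated assumptions: Adam-style first-moment EMA with beta = 0.9.
# This is NOT the paper's APT mechanism; it only shows how little weight the
# newest ("instant-derived") gradient carries in a momentum buffer.

def ema_update(m_prev, grad, beta=0.9):
    """One EMA step: m_t = beta * m_{t-1} + (1 - beta) * g_t."""
    return beta * m_prev + (1.0 - beta) * grad


def run_ema(grads, beta=0.9):
    """Fold a sequence of scalar gradients into a momentum buffer, starting at 0."""
    m = 0.0
    for g in grads:
        m = ema_update(m, g, beta)
    return m


# Fifty past steps all pushed the parameter in the +1 direction.
history = [1.0] * 50
m_before = run_ema(history)

# Hypothetical scenario: an MTL method now rebalances the tasks and flips
# the current gradient to -1. The buffer barely moves.
m_after = ema_update(m_before, -1.0)

# Weight of the newest gradient inside the buffer.
instant_share = 1.0 - 0.9

print(f"momentum before: {m_before:.3f}")
print(f"momentum after:  {m_after:.3f}")
print(f"share of instant gradient: {instant_share:.0%}")
```

In this toy setting the buffer stays near 0.8 even though the current gradient points the opposite way, so a single re-balanced gradient changes the update direction very little.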

Abstract

Multi-Task Learning (MTL) is a foundational machine learning problem that has seen extensive development over the past decade. Recently, various optimization-based MTL approaches have been proposed to learn multiple tasks simultaneously by altering the optimization trajectory. Although these methods strive to de-conflict and re-balance tasks, we empirically find that their effectiveness is often undermined by an overlooked factor when advanced optimizers are employed: the instant-derived gradients play only a marginal role in the actual parameter updates. This discrepancy prevents MTL frameworks from fully exerting their influence on the learning dynamics. Furthermore, we observe that Muon, a recently emerged advanced optimizer, inherently functions as a multi-task learner, which underscores the critical importance of the gradients used for its orthogonalization. To address these issues, we propose APT (Applicability of advanced oPTimizers), a framework featuring a simple adaptive momentum mechanism designed to balance the strengths of advanced optimizers and MTL. Additionally, we introduce a lightweight direction-preservation method to facilitate Muon's orthogonalization. Extensive experiments across four mainstream MTL datasets demonstrate that APT consistently augments existing MTL approaches, yielding substantial performance improvements.
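
For readers unfamiliar with the orthogonalization step the abstract refers to, the following is a minimal pure-Python sketch of the idea. It assumes the textbook cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 X X^T X, applied to a Frobenius-normalized matrix. Muon implementations typically use a tuned higher-order variant on the momentum matrix, so this is a simpler stand-in. It is not the paper's direction-preservation method, and the helper names are hypothetical.

```python
import math

# Sketch under stated assumptions: cubic Newton-Schulz iteration as a stand-in
# for Muon's orthogonalization. NOT the paper's direction-preservation method.


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix g.

    Normalizing by the Frobenius norm puts every singular value in (0, 1],
    where the iteration X <- 1.5 X - 0.5 X X^T X drives each one toward 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
            for row_x, row_c in zip(x, xxtx)
        ]
    return x


# A toy update matrix whose two directions differ in scale by a factor of 6.
g_diag = [[3.0, 0.0], [0.0, 0.5]]
o_diag = orthogonalize(g_diag)

# A non-diagonal example; the result should satisfy O O^T ~= I.
g_mixed = [[2.0, 1.0], [0.0, 0.5]]
o_mixed = orthogonalize(g_mixed)
gram = matmul(o_mixed, transpose(o_mixed))

print(o_diag)
print(gram)
```

The output depends only on the directions present in the input matrix, not on their magnitudes: the 6:1 scale imbalance in `g_diag` is flattened to equal weight. This is one way to see why the gradients supplied to the orthogonalization step matter.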