MAny: Merge Anything for Multimodal Continual Instruction Tuning

arXiv cs.LG / 4/16/2026


Key Points

  • Multimodal Continual Instruction Tuning (MCIT) for Multimodal LLMs is limited by catastrophic forgetting; the paper argues this arises through a dual-forgetting mechanism: perception drift in the cross-modal projection space and reasoning collapse in the low-rank parameter space.
  • The proposed MAny (Merge Anything) framework addresses both issues using Cross-modal Projection Merging (CPM) to maintain perceptual alignment with visual-prototype guidance during inference.
  • It also uses Low-rank Parameter Merging (LPM) to reduce interference among task-specific low-rank modules by recursively merging low-rank weight matrices, with a closed-form solution derived using recursive least squares for stable reasoning.
  • MAny is presented as training-free for the merging step, relying on efficient CPU-based algebraic operations rather than additional gradient-based optimization beyond initial tuning.
  • Experiments on multiple MLLMs and benchmarks report improved final average accuracy, including gains of up to 8.57% and 2.85% over state-of-the-art methods on the UCIT benchmark for two different MLLMs, respectively.
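
The low-rank merging idea described above can be illustrated with a toy sketch. The paper's actual closed-form recursive-least-squares solution is not reproduced here; the function names and the merging scheme (a running mean of each task's low-rank delta, re-factored to low rank via truncated SVD) are assumptions made for illustration only.

```python
# Hypothetical sketch of recursively merging task-specific low-rank
# (LoRA-style) updates, loosely in the spirit of MAny's LPM.  Each task t
# contributes a delta W_t = B_t @ A_t; we fold deltas into a running mean
# and re-factor the result back to a fixed rank.  This is NOT the paper's
# exact formulation.
import numpy as np

def merge_low_rank(deltas, rank):
    """Recursively merge a sequence of (B, A) low-rank factors and return
    a single rank-`rank` factorization of the merged update."""
    merged = None
    for t, (B, A) in enumerate(deltas, start=1):
        delta = B @ A                       # full view of this task's update
        if merged is None:
            merged = delta
        else:
            # recursive running mean: M_t = M_{t-1} + (delta - M_{t-1}) / t
            merged = merged + (delta - merged) / t
    # re-factor the merged update back to rank `rank` via truncated SVD
    U, s, Vt = np.linalg.svd(merged, full_matrices=False)
    B_m = U[:, :rank] * s[:rank]
    A_m = Vt[:rank, :]
    return B_m, A_m

# toy usage: two tasks, an 8x6 weight, rank-2 adapters
rng = np.random.default_rng(0)
deltas = [(rng.standard_normal((8, 2)), rng.standard_normal((2, 6)))
          for _ in range(2)]
B_m, A_m = merge_low_rank(deltas, rank=2)
print(B_m.shape, A_m.shape)  # → (8, 2) (2, 6)
```

Note that, like the method described in the paper, this merging step needs no gradients: it is a few CPU-side matrix products and one SVD.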

Abstract

Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present **MAny** (**M**erge **Any**thing), a framework that merges task-specific knowledge through **C**ross-modal **P**rojection **M**erging (**CPM**) and **L**ow-rank **P**arameter **M**erging (**LPM**). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.
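
The CPM side of the framework, adaptively merging projections under visual-prototype guidance, can be sketched in a simplified form. The abstract does not specify the merging rule, so the cosine-similarity-plus-softmax weighting below, and all names in it, are assumptions chosen to illustrate the general pattern: each task stores a visual prototype, and at inference the task-specific projections of an input feature are blended according to how close that feature is to each prototype.

```python
# Hypothetical illustration of prototype-guided projection merging in the
# spirit of CPM.  The weighting scheme (cosine similarity -> softmax) is an
# assumption, not the paper's exact formulation.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cpm_project(v, projectors, prototypes, temperature=1.0):
    """Blend task-specific linear projections of visual feature `v`,
    weighted by cosine similarity between `v` and each task's prototype."""
    sims = np.array([
        v @ p / (np.linalg.norm(v) * np.linalg.norm(p))
        for p in prototypes
    ])
    w = softmax(sims / temperature)          # adaptive merging weights
    projected = np.stack([W @ v for W in projectors])
    return w @ projected                     # weighted sum over tasks

# toy usage: 3 tasks, 4-dim visual features projected to 5 dims
prototypes = [np.eye(4)[i] for i in range(3)]   # orthogonal toy prototypes
rng = np.random.default_rng(1)
projectors = [rng.standard_normal((5, 4)) for _ in range(3)]
out = cpm_project(prototypes[0], projectors, prototypes, temperature=0.01)
print(out.shape)  # → (5,)
```

With a low temperature, an input that matches one task's prototype is routed almost entirely through that task's projector; higher temperatures blend the projections more evenly, which is the "adaptive" part of the merge.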