MAny: Merge Anything for Multimodal Continual Instruction Tuning
arXiv cs.LG / 4/16/2026
Key Points
- Multimodal Continual Instruction Tuning (MCIT) for Multimodal LLMs is limited by catastrophic forgetting, which the paper attributes to two coupled mechanisms: perception drift in the cross-modal projection space and reasoning collapse in the low-rank parameter space.
- The proposed MAny (Merge Anything) framework addresses both issues. Cross-modal Projection Merging (CPM) maintains perceptual alignment by using visual-prototype guidance during inference (see the first sketch below).
- Low-rank Parameter Merging (LPM) reduces interference among task-specific low-rank modules by recursively merging their weight matrices, with a closed-form solution derived via recursive least squares to keep reasoning stable (see the second sketch below).
- MAny is presented as training-free for the merging step, relying on efficient CPU-based algebraic operations rather than additional gradient-based optimization beyond initial tuning.
- Experiments on multiple MLLMs and benchmarks report improved final average accuracy, including gains of up to 8.57% and 2.85% over state-of-the-art methods on the UCIT benchmark.
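
The key points describe CPM only at a high level, so here is a minimal, illustrative sketch of what prototype-guided merging of cross-modal projectors could look like. The function name, the cosine-similarity/softmax weighting, and the `temperature` parameter are assumptions for illustration, not the paper's exact formulation: the idea is simply that per-task projector weights are combined at inference time according to how close the input's visual feature is to each task's visual prototype.

```python
import numpy as np

def prototype_guided_projection(img_feat, prototypes, projectors, temperature=0.1):
    """Illustrative CPM-style merge (hypothetical formulation).

    img_feat   : (d_v,)          visual feature of the current input
    prototypes : (T, d_v)        one visual prototype (e.g. mean feature) per task
    projectors : (T, d_out, d_v) per-task cross-modal projection matrices
    """
    # Cosine similarity between the input feature and each task prototype.
    sims = prototypes @ img_feat / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(img_feat) + 1e-8
    )
    # Softmax over tasks turns similarities into merging weights.
    weights = np.exp(sims / temperature)
    weights /= weights.sum()
    # Weighted combination of the per-task projectors, then project the feature.
    merged = np.einsum("t,tij->ij", weights, projectors)
    return merged @ img_feat
```

Because the merge happens per input at inference time, no gradient step is needed: the only cost is a few matrix multiplications, which is consistent with the training-free, CPU-friendly framing above.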
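
For LPM, the paper reports a closed-form recursive-least-squares merge of task-specific low-rank updates. The exact objective and recursion are not reproduced in this summary, so the sketch below uses a RegMean-style least-squares stand-in: each task contributes its LoRA update ΔW_t = B_t A_t and an input Gram matrix C_t, and the merged weight is the closed-form minimizer of Σ_t ||(W − ΔW_t) X_t||_F², accumulated recursively as tasks arrive. The class and argument names are hypothetical.

```python
import numpy as np

class RecursiveLowRankMerger:
    """Running closed-form merge of task-specific low-rank updates (sketch)."""

    def __init__(self, d_in, d_out):
        self.num = np.zeros((d_out, d_in))  # running sum of delta_w_t @ C_t
        self.den = np.zeros((d_in, d_in))   # running sum of C_t

    def update(self, B, A, cov):
        """Fold in one task: LoRA factors B (d_out x r), A (r x d_in),
        and that task's input Gram/covariance matrix C_t (d_in x d_in)."""
        delta_w = B @ A
        self.num += delta_w @ cov
        self.den += cov

    def merged(self, ridge=1e-6):
        """Closed-form minimizer of sum_t ||(W - delta_w_t) X_t||_F^2,
        with a small ridge term for numerical stability."""
        d_in = self.den.shape[0]
        return self.num @ np.linalg.inv(self.den + ridge * np.eye(d_in))
```

The recursion only accumulates two matrices per layer and solves one linear system when the merged weight is needed, which again matches the claim that the merging step avoids any additional gradient-based optimization.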