PivotMerge: Bridging Heterogeneous Multimodal Pre-training via Post-Alignment Model Merging
arXiv cs.CV / 4/28/2026
Key Points
- The paper proposes PivotMerge to bridge heterogeneous multimodal pre-training by merging multimodal large language model (MLLM) components at the "post-alignment" stage, rather than only after fine-tuning.
- It frames multimodal pre-training as the problem of building cross-modal alignment between visual and textual representations, motivating the new post-alignment merging task.
- PivotMerge addresses two main issues in merging heterogeneous models: cross-domain parameter interference and uneven alignment contributions across layers and projectors.
- The method uses Shared-space Decomposition and Filtering to separate shared alignment from domain-specific differences and to suppress conflicting update directions (a sketch of this pipeline follows the list).
- In post-alignment merging scenarios built on CC12M and evaluated across multiple multimodal benchmarks, PivotMerge consistently outperforms prior merging baselines, indicating strong performance and generalization.
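
The summary above only names the components, so the following is a minimal sketch of what such a merge could look like, assuming an SVD-based shared-space decomposition, TIES-style sign-agreement filtering for the conflicting directions, and a scalar per-layer weight standing in for the uneven layer-wise alignment contribution. All function names and parameters here are illustrative assumptions, not the authors' implementation.

```python
import torch

def shared_space_decompose(deltas: torch.Tensor, rank: int):
    """Split stacked parameter deltas (num_models, num_params) into a
    shared low-rank component and per-model residuals via SVD.
    A hypothetical reading of "Shared-space Decomposition"."""
    _, _, Vh = torch.linalg.svd(deltas, full_matrices=False)
    V = Vh[:rank]                    # top-r directions span the shared space
    shared = deltas @ V.T @ V        # projection onto the shared subspace
    return shared, deltas - shared   # residual = domain-specific part

def filter_conflicts(residual: torch.Tensor):
    """Zero out residual entries whose sign disagrees with the majority
    direction across models (a TIES-style stand-in for "Filtering")."""
    majority = torch.sign(residual.sum(dim=0))
    return residual * (torch.sign(residual) == majority)

def pivot_merge_layer(base: torch.Tensor, finetuned: list[torch.Tensor],
                      rank: int = 8, layer_weight: float = 1.0):
    """Merge one layer's weights from several post-alignment models.
    layer_weight is a hypothetical scalar for the per-layer alignment
    contribution the paper describes as uneven across layers/projectors."""
    deltas = torch.stack([(w - base).flatten() for w in finetuned])
    shared, residual = shared_space_decompose(deltas, rank)
    merged = shared.mean(dim=0) + filter_conflicts(residual).mean(dim=0)
    return base + layer_weight * merged.view_as(base)
```

Applied layer by layer, with larger weights on layers that contribute more to cross-modal alignment (e.g., projectors), such a scheme would keep the alignment directions the heterogeneous pre-trainings agree on while suppressing interfering domain-specific updates, under the stated assumptions.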