A Step Toward Federated Pretraining of Multimodal Large Language Models
arXiv cs.LG / 3/31/2026
Key Points
- The paper argues that multimodal LLM pre-training is bottlenecked by the saturation of public data and proposes federated learning as a privacy-preserving way to tap multimodal data silos.
- It introduces the Federated MLLM Alignment (Fed-MA) task, freezing the vision encoder and LLM while only collaboratively training the cross-modal projector during a lightweight pre-training stage.
- The authors identify two key obstacles to federated pre-training: parameter interference when aggregating local projectors, and gradient oscillations under one-pass collaborative SGD.
- To address these, they propose Fed-CMP, which combines Canonical Reliability-Aware Aggregation, fusing decomposed client projectors through a shared alignment basis with reliability weighting, and Orthogonality-Preserved Momentum, which stabilizes optimization while preserving the projector's geometric structure.
- Experiments across four federated pre-training scenarios using public datasets show Fed-CMP significantly outperforms existing federated pre-training baselines.
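The Fed-MA setup described above can be sketched in a few lines: the vision encoder and LLM are frozen and identical on every client, so only the cross-modal projector is trained locally and aggregated by the server. This is a minimal toy sketch, not the paper's implementation; the linear projector, least-squares objective, and plain FedAvg baseline aggregation are all simplifying assumptions.

```python
import numpy as np

def client_update(projector, vision_feats, text_targets, lr=0.1, steps=5):
    """One client's local training of the projector (toy least-squares task).

    vision_feats stand in for frozen vision-encoder outputs and
    text_targets for targets in the frozen LLM's embedding space.
    """
    W = projector.copy()
    for _ in range(steps):
        pred = vision_feats @ W  # project frozen vision features
        grad = vision_feats.T @ (pred - text_targets) / len(vision_feats)
        W -= lr * grad           # only the projector is updated
    return W

def fedavg(projectors, weights):
    """Plain FedAvg over client projectors (the baseline aggregation)."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * P for w, P in zip(weights, projectors))

rng = np.random.default_rng(0)
d_v, d_t = 8, 4                              # hypothetical feature dims
global_proj = rng.normal(size=(d_v, d_t)) * 0.01

# Two clients with private multimodal data that never leaves the client;
# only projector parameters are communicated.
clients = []
for _ in range(2):
    X = rng.normal(size=(32, d_v))           # frozen vision features
    Y = rng.normal(size=(32, d_t))           # frozen-LLM-space targets
    clients.append((X, Y))

for _ in range(3):                           # a few communication rounds
    local = [client_update(global_proj, X, Y) for X, Y in clients]
    global_proj = fedavg(local, [len(X) for X, _ in clients])

print(global_proj.shape)                     # → (8, 4)
```

The point of the sketch is the communication pattern: clients exchange only the small projector, which is what makes the pre-training stage lightweight.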
