Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization

arXiv cs.CV / 4/28/2026


Key Points

  • The paper introduces RCSR, a personalization-friendly federated learning framework for cross-modal retrieval that targets two key real-world challenges: non-IID client data and missing modalities.
  • RCSR builds on a frozen CLIP backbone and uses lightweight shared adapters to transfer global cross-modal knowledge while optionally adding client-specific adapters for efficient local personalization.
  • It improves unimodal clients’ alignment with global semantics through prototype anchoring, helping them better map into the shared cross-modal space.
  • A server-side semantic router assigns aggregation weights based on retrieval consistency, aiming to reduce alignment drift caused by heterogeneous client updates.
  • Experiments on MS-COCO, Flickr30K, and other benchmarks indicate that RCSR improves global retrieval accuracy and training stability as well as client-level retrieval performance, particularly for clients with incomplete modalities.
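
The server-side routing idea in the key points can be illustrated with a minimal sketch. It assumes each client reports a scalar retrieval-consistency score and a flat vector of shared-adapter parameters. The softmax weighting, the temperature parameter, and the names routing_weights and aggregate_adapters are hypothetical choices made for illustration; the summary does not specify how RCSR actually computes its weights.

```python
import math


def routing_weights(consistency_scores, temperature=1.0):
    """Turn per-client consistency scores into aggregation weights.

    Assumption: a softmax over scores; higher consistency gets more weight.
    """
    top = max(consistency_scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in consistency_scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_adapters(client_params, weights):
    """Weighted average of flat adapter parameter vectors, one per client."""
    dim = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(dim)
    ]


# Toy round: three clients, two adapter parameters each (made-up numbers).
scores = [0.9, 0.5, 0.1]
client_params = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
weights = routing_weights(scores, temperature=0.5)
global_params = aggregate_adapters(client_params, weights)
```

With equal scores this reduces to a plain unweighted average, as in standard federated averaging over equally sized clients. A lower temperature concentrates weight on the most consistent clients.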

Abstract

Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.
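
To make prototype anchoring concrete, here is a minimal sketch of one possible anchoring loss for a unimodal client. It assumes the server shares one global prototype vector per semantic label and that the client penalizes the cosine distance between each local embedding and its label's prototype. The loss form, the label-keyed prototype dictionary, and the name prototype_anchor_loss are illustrative assumptions, not the formulation given in the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def prototype_anchor_loss(embeddings, labels, prototypes):
    """Mean (1 - cosine similarity) between each embedding and its prototype.

    Assumption: `prototypes` maps a label to a global prototype vector
    received from the server.
    """
    losses = [1.0 - cosine(e, prototypes[y]) for e, y in zip(embeddings, labels)]
    return sum(losses) / len(losses)


# Toy example with made-up two-dimensional prototypes.
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
aligned = prototype_anchor_loss([[2.0, 0.0]], ["cat"], prototypes)
misaligned = prototype_anchor_loss([[0.0, 3.0]], ["cat"], prototypes)
```

In this sketch the loss is near zero when a local embedding points in the same direction as its prototype and grows as the embedding drifts away, which is the behaviour an anchoring term would need in order to pull image-only or text-only clients toward the shared space.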