CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization
arXiv cs.CV / 4/21/2026
Key Points
- Domain generalization (DG) targets robust performance under domain shift, where computer vision models often overfit to style cues rather than class semantics.
- Existing multimodal DG methods use text anchors but can suffer a “modality gap,” keeping image and text embeddings geometrically separated despite semantic alignment.
- CrossFlowDG introduces noise-free cross-modal flow matching that learns continuous transformations in a joint Euclidean latent space to transport domain-biased image embeddings toward domain-invariant text embeddings for the correct class.
- The approach uses a VMamba-based image encoder and CLIP text encoder, and reports competitive results on multiple DG benchmarks with state-of-the-art performance on TerraIncognita.
- The authors provide an open-source implementation via the linked GitHub repository.
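The core idea in the key points above can be sketched as follows. This is an illustrative toy, not the authors' implementation: all names, shapes, and the Euler integrator are assumptions. In noise-free flow matching with a straight-line path, a domain-biased image embedding `x_img` moves toward the matching class's text embedding `x_txt` along `x_t = (1 - t) * x_img + t * x_txt`, and a velocity network would be regressed onto the constant target velocity `x_txt - x_img`:

```python
import numpy as np

def interpolate(x_img, x_txt, t):
    """Point on the straight-line path at time t in [0, 1] (assumed linear path)."""
    return (1.0 - t) * x_img + t * x_txt

def target_velocity(x_img, x_txt):
    """Ground-truth velocity of the linear path; constant in t."""
    return x_txt - x_img

def fm_loss(pred_v, x_img, x_txt):
    """Flow-matching regression loss: mean squared error to the target velocity."""
    return np.mean(np.sum((pred_v - target_velocity(x_img, x_txt)) ** 2, axis=-1))

def transport(x_img, velocity_fn, steps=10):
    """Euler integration of dx/dt = v(x, t) from t=0 to t=1,
    transporting image embeddings toward text embeddings."""
    x, dt = x_img.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
x_img = rng.normal(size=(4, 8))   # stand-in for VMamba image embeddings
x_txt = rng.normal(size=(4, 8))   # stand-in for CLIP text embeddings

# With an oracle velocity field, Euler transport lands exactly on x_txt,
# since the target velocity of a linear path is constant.
oracle = lambda x, t: target_velocity(x_img, x_txt)
x_out = transport(x_img, oracle)
```

In training, `oracle` would be replaced by a learned network evaluated at sampled `(x_t, t)` pairs and fit with `fm_loss`; at inference, the learned field transports unseen-domain image embeddings toward the domain-invariant text side of the joint space.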