CrossFlowDG: Bridging the Modality Gap with Cross-modal Flow Matching for Domain Generalization
arXiv cs.CV / 4/21/2026
Key Points
- Domain generalization (DG) targets robust performance under domain shift, where computer vision models often overfit to style cues rather than class semantics.
- Existing multimodal DG methods use text anchors but can suffer a “modality gap,” keeping image and text embeddings geometrically separated despite semantic alignment.
- CrossFlowDG introduces noise-free cross-modal flow matching that learns continuous transformations in a joint Euclidean latent space to transport domain-biased image embeddings toward domain-invariant text embeddings for the correct class.
- The approach uses a VMamba-based image encoder and CLIP text encoder, and reports competitive results on multiple DG benchmarks with state-of-the-art performance on TerraIncognita.
- The authors provide an open-source implementation via the linked GitHub repository.
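The core idea in the key points above can be sketched as follows. This is an illustrative toy, not the authors' implementation: all names, shapes, and the Euler integrator are assumptions. In noise-free flow matching with a straight-line path, a domain-biased image embedding `x_img` moves toward the matching class's text embedding `x_txt` along `x_t = (1 - t) * x_img + t * x_txt`, and a velocity network would be regressed onto the constant target velocity `x_txt - x_img`:

```python
import numpy as np

def interpolate(x_img, x_txt, t):
    """Point on the straight-line path at time t in [0, 1] (assumed linear path)."""
    return (1.0 - t) * x_img + t * x_txt

def target_velocity(x_img, x_txt):
    """Ground-truth velocity of the linear path; constant in t."""
    return x_txt - x_img

def fm_loss(pred_v, x_img, x_txt):
    """Flow-matching regression loss: mean squared error to the target velocity."""
    return np.mean(np.sum((pred_v - target_velocity(x_img, x_txt)) ** 2, axis=-1))

def transport(x_img, velocity_fn, steps=10):
    """Euler integration of dx/dt = v(x, t) from t=0 to t=1,
    transporting image embeddings toward text embeddings."""
    x, dt = x_img.copy(), 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
x_img = rng.normal(size=(4, 8))   # stand-in for VMamba image embeddings
x_txt = rng.normal(size=(4, 8))   # stand-in for CLIP text embeddings

# With an oracle velocity field, Euler transport lands exactly on x_txt,
# since the target velocity of a linear path is constant.
oracle = lambda x, t: target_velocity(x_img, x_txt)
x_out = transport(x_img, oracle)
```

In training, `oracle` would be replaced by a learned network evaluated at sampled `(x_t, t)` pairs and fit with `fm_loss`; at inference, the learned field transports unseen-domain image embeddings toward the domain-invariant text side of the joint space.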