MM-OVSeg: Multimodal Optical-SAR Fusion for Open-Vocabulary Segmentation in Remote Sensing
arXiv cs.CV / 3/19/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- MM-OVSeg is introduced as a multimodal Optical-SAR fusion framework designed for resilient open-vocabulary segmentation in remote sensing, capable of operating under cloudy or haze-contaminated conditions.
- The method features a cross-modal unification process to align representations across sensors and a dual-encoder fusion module that integrates hierarchical features from multiple vision foundation models for text-aligned segmentation.
- Extensive experiments show improved robustness and generalization across diverse cloud conditions, addressing the cross-modal domain gap and dense prediction challenges of current vision-language models.
- The framework leverages optical imagery for rich spectral semantics while exploiting SAR's cloud-penetrating structural cues, and the authors release both the dataset and the source code.
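The dual-encoder fusion and text-aligned scoring described above can be sketched in miniature. This is a hypothetical illustration, not the authors' implementation: the cross-attention fusion, feature dimensions, and function names are all assumptions. The idea is that pixel tokens from the optical encoder attend to co-registered SAR tokens (borrowing SAR's cloud-robust structure), and the fused features are scored against text embeddings of class names, which is what makes the segmentation open-vocabulary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, key_value):
    # query: (N, d) tokens from one modality; key_value: (M, d) from the other.
    # Scaled dot-product attention with a residual connection (assumed fusion rule).
    attn = softmax(query @ key_value.T / np.sqrt(query.shape[1]))
    return query + attn @ key_value

def open_vocab_logits(optical_feats, sar_feats, text_embeds):
    # Fuse modalities: optical pixel tokens attend to SAR structural cues.
    fused = cross_attend(optical_feats, sar_feats)
    # L2-normalize both sides and score each pixel token against each
    # class-name prompt embedding -> cosine-similarity logits.
    fused = fused / np.linalg.norm(fused, axis=1, keepdims=True)
    text = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return fused @ text.T  # (num_pixels, num_classes)

rng = np.random.default_rng(0)
opt = rng.normal(size=(16, 32))   # 16 pixel tokens, 32-dim optical features
sar = rng.normal(size=(16, 32))   # co-registered SAR tokens, same grid
txt = rng.normal(size=(3, 32))    # embeddings of 3 arbitrary class names
logits = open_vocab_logits(opt, sar, txt)
labels = logits.argmax(axis=1)    # per-pixel class assignment
print(logits.shape, labels.shape)
```

Because the class set enters only through `text_embeds`, new categories can be queried at inference time without retraining the fusion module, which is the essential open-vocabulary property the paper targets.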
Related Articles
The massive shift toward edge computing and local processing
Dev.to
Self-Refining Agents in Spec-Driven Development
Dev.to
Week 3: Why I'm Learning 'Boring' ML Before Building with LLMs
Dev.to
The Three-Agent Protocol Is Transferable. The Discipline Isn't.
Dev.to

has anyone tried this? Flash-MoE: Running a 397B Parameter Model on a Laptop
Reddit r/LocalLLaMA