Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

arXiv cs.CV / 4/16/2026


Key Points

  • The paper argues that current multimodal large language models struggle with remote sensing change understanding due to “temporal blindness,” lacking mechanisms for multi-temporal contrastive reasoning and precise spatial grounding.
  • It introduces Delta-QA, a benchmark of 180k visual question-answering samples that unifies pixel-level segmentation and question answering for change interpretation across bi- and tri-temporal settings.
  • It proposes Delta-LLaVA, a remote-sensing-specific MLLM architecture that improves over naive feature concatenation using Change-Enhanced Attention, Change-SEG with Change Prior Embedding, and Local Causal Attention to reduce cross-temporal leakage.
  • Experiments reportedly show Delta-LLaVA outperforms both generalist MLLMs and specialized segmentation models on change deduction and high-precision boundary localization, positioning it as a unified earth observation framework for “change understanding.”
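The "isolate and amplify visual differences" idea behind Change-Enhanced Attention can be sketched in a few lines. The snippet below is an illustrative guess at the mechanism, not the paper's implementation; the function name and the norm-based weighting are assumptions introduced here for clarity:

```python
import numpy as np

def change_enhanced_features(feat_t1, feat_t2):
    """Illustrative sketch (not the paper's code): emphasize tokens
    that changed between two temporal feature maps of shape
    (num_tokens, dim), while leaving unchanged context intact."""
    diff = feat_t2 - feat_t1                           # raw temporal difference
    # per-token change magnitude, used as an attention-like weight
    score = np.linalg.norm(diff, axis=-1, keepdims=True)
    weight = score / (score.max() + 1e-6)              # normalize to [0, 1]
    # amplify changed regions on top of the later-epoch features
    return feat_t2 + weight * diff
```

In this toy form, tokens with no temporal difference pass through unchanged, while strongly changed tokens are boosted before being handed to the language model.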

Abstract

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.
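The Local Causal Attention constraint described in the abstract, where tokens attend causally only within their own temporal segment so that one epoch's context cannot leak into another, can be pictured as a block-diagonal causal mask. The helper below is a hypothetical NumPy sketch of that mask (the paper does not publish code, and `local_causal_mask` is a name invented here):

```python
import numpy as np

def local_causal_mask(seg_lengths):
    """Hypothetical sketch: boolean attention mask (True = allowed)
    where each temporal segment is causal within itself and fully
    blocked from every other segment, preventing cross-temporal
    contextual leakage."""
    n = sum(seg_lengths)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for length in seg_lengths:
        for i in range(length):
            # token i in this segment sees only tokens 0..i of the
            # same segment
            mask[start + i, start:start + i + 1] = True
        start += length
    return mask
```

For two segments of two tokens each, the result is two independent 2x2 lower-triangular blocks, so no query in the second epoch can ever attend to a key from the first.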