When Earth Foundation Models Meet Diffusion: An Application to Land Surface Temperature Super-Resolution

arXiv cs.CV / 4/21/2026

Key Points

  • The paper introduces Earth Foundation Model-guided Diffusion (EFDiff), a new framework for land surface temperature (LST) super-resolution from extremely coarse thermal observations.
  • EFDiff leverages the Prithvi-EO-2.0 Earth foundation model to encode high-resolution multispectral reflectance into geospatial embeddings, which are injected into a diffusion denoising network using cross-attention for guided reconstruction.
  • The authors propose two variants, EFDiff-ε and EFDiff-x0, providing different trade-offs between perceptual realism and pixel-level fidelity.
  • On a large, globally diverse Landsat benchmark of 242,416 co-registered thermal-reflectance patches with a challenging 32× scale gap, EFDiff consistently outperforms baseline methods, and cross-attention conditioning on the Earth foundation model embeddings works better than simple channel concatenation of the reflectance bands.
  • While demonstrated for LST super-resolution, the framework is presented as broadly applicable to other remote-sensing tasks where pretrained geospatial representations can guide generative reconstruction.
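
To make the conditioning mechanism in the second key point more concrete, the following is a minimal NumPy sketch of single-head cross-attention, in which flattened denoiser features act as queries and foundation-model embeddings act as keys and values. It is not the authors' architecture: the function name `cross_attention`, the tensor shapes, the random projection weights, and the residual connection are all assumptions made for illustration, and random arrays stand in for real Prithvi-EO-2.0 embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(feats, embeds, w_q, w_k, w_v, w_o):
    """Single-head cross-attention (hypothetical helper, not from the paper).

    feats:  (N, C) flattened denoiser feature map, one row per spatial location
    embeds: (M, D) conditioning tokens, e.g. geospatial embeddings
    Returns the guided features (N, C) and the attention weights (N, M).
    """
    q = feats @ w_q                              # queries from the denoiser, (N, H)
    k = embeds @ w_k                             # keys from the embeddings, (M, H)
    v = embeds @ w_v                             # values from the embeddings, (M, H)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # scaled dot products, (N, M)
    attn = softmax(scores, axis=-1)              # each row sums to 1
    out = attn @ v                               # embedding-weighted summary, (N, H)
    return feats + out @ w_o, attn               # assumed residual connection

# Illustrative sizes only; none of these come from the paper.
rng = np.random.default_rng(0)
N, C, M, D, H = 64, 32, 16, 48, 24
feats = rng.standard_normal((N, C))              # stand-in for denoiser features
embeds = rng.standard_normal((M, D))             # stand-in for foundation-model tokens
w_q = rng.standard_normal((C, H)) * 0.1
w_k = rng.standard_normal((D, H)) * 0.1
w_v = rng.standard_normal((D, H)) * 0.1
w_o = rng.standard_normal((H, C)) * 0.1

guided, attn = cross_attention(feats, embeds, w_q, w_k, w_v, w_o)
```

Unlike channel concatenation, which requires the conditioning input to share the feature map's spatial grid, this form lets every spatial location attend to any conditioning token, so the number of tokens `M` and their width `D` are independent of the feature map's size.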

Abstract

Land surface temperature (LST) super-resolution is important for environmental monitoring. However, it remains challenging because coarse thermal observations severely underdetermine fine-scale structure. In this paper, we propose Earth Foundation Model-guided Diffusion (EFDiff), a novel framework for super-resolution under extreme spatial degradation. EFDiff uses the Prithvi-EO-2.0 Earth foundation model (EFM) to encode high-resolution multispectral reflectance into geospatial embeddings, which are injected into the denoising network via cross-attention to guide fine-scale reconstruction from highly degraded observations. We study two variants, EFDiff-ε and EFDiff-x0, which offer complementary trade-offs between perceptual realism and pixel-level fidelity. We evaluate EFDiff under an extreme 32× scale gap using a globally diverse benchmark comprising 242,416 co-registered Landsat thermal-reflectance patches. Results show that EFDiff consistently outperforms baseline methods and that cross-attention conditioning on EFM embeddings is more effective than HLS channel concatenation. Although we present EFDiff in the context of LST super-resolution, the framework is broadly applicable to remote sensing problems in which pretrained geospatial representations can guide generative reconstruction.
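
The abstract does not spell out how the two variants differ. Assuming their names refer to the two standard diffusion training targets (predicting the added noise ε versus predicting the clean sample x0), the sketch below shows how the two predictions relate under the usual forward process x_t = sqrt(ᾱ_t)·x0 + sqrt(1 − ᾱ_t)·ε. The helper names, the value of `alpha_bar`, and the 8×8 array size are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_sample(x0, eps, alpha_bar):
    # Standard forward (noising) step: mix the clean sample with Gaussian noise.
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def x0_from_eps(x_t, eps_hat, alpha_bar):
    # Clean-sample estimate implied by a noise prediction.
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)

def eps_from_x0(x_t, x0_hat, alpha_bar):
    # Noise estimate implied by a clean-sample prediction.
    return (x_t - np.sqrt(alpha_bar) * x0_hat) / np.sqrt(1.0 - alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))      # stand-in for a clean fine-scale patch
eps = rng.standard_normal((8, 8))     # Gaussian noise
alpha_bar = 0.3                       # assumed cumulative noise-schedule value

x_t = q_sample(x0, eps, alpha_bar)

# With perfect predictions, each target can be recovered from the other.
x0_rec = x0_from_eps(x_t, eps, alpha_bar)
eps_rec = eps_from_x0(x_t, x0, alpha_bar)
```

Given x_t and ᾱ_t, the two targets carry the same information, but a loss on ε and a loss on x0 weight errors differently across noise levels. That difference is one plausible source of the realism-versus-fidelity trade-off the abstract reports, though the abstract itself does not confirm this.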