Are Pretrained Image Matchers Good Enough for SAR-Optical Satellite Registration?

arXiv cs.CV / 4/14/2026


Key Points

  • The paper tests 24 pretrained image matcher families for cross-modal optical–SAR satellite registration in a strict zero-shot setup (no fine-tuning or SAR-domain adaptation) on SpaceNet9 plus two additional benchmarks.
  • Results show asymmetric domain transfer: matchers with explicit cross-modal training do not consistently outperform those without it, with top performance around 3.0 px mean error on labeled SpaceNet9 scenes.
  • XoFTR (trained for visible-thermal matching) and RoMa tie for the lowest reported mean error (~3.0 px); since RoMa achieves this without any cross-modal training, foundation-model features (e.g., DINOv2) may partially provide modality invariance.
  • Protocol and deployment choices strongly affect accuracy: geometry model selection, tile size, and inlier gating can change mean error by as much as 33×, sometimes more than switching matchers.
  • 3D-reconstruction-focused matchers (MASt3R, DUSt3R) are found to be highly sensitive to the evaluation protocol and remain fragile under default settings, indicating they may not be reliable “out of the box” for traditional 2D registration pipelines.
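The protocol levers named above (geometry model selection and inlier gating) can be illustrated with a minimal pure-NumPy sketch: a RANSAC loop that fits an affine model to tie-point correspondences, gates the result on inlier ratio, and reports mean residual error in pixels. This is not the paper's code; the function names, threshold, and ratio values are illustrative assumptions.

```python
import numpy as np

def fit_affine(src, dst):
    # Least-squares affine fit: dst ≈ [x, y, 1] @ A, with src, dst of shape (N, 2).
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])          # (N, 3) homogeneous coords
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)    # A is (3, 2)
    return A

def ransac_affine(src, dst, iters=200, thresh=3.0, min_inlier_ratio=0.3, seed=0):
    """RANSAC over minimal 3-point affine samples, with an inlier-ratio gate.
    Returns (A, mean_px_error) or None if the registration is rejected."""
    rng = np.random.default_rng(seed)
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(n, size=3, replace=False)  # minimal affine sample
        A = fit_affine(src[idx], dst[idx])
        err = np.linalg.norm(X @ A - dst, axis=1)   # per-point residual (px)
        inliers = err < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < min_inlier_ratio * n:
        return None                                 # inlier gate: reject unreliable fits
    A = fit_affine(src[best_inliers], dst[best_inliers])  # refit on all inliers
    mean_err = np.linalg.norm(X @ A - dst, axis=1)[best_inliers].mean()
    return A, mean_err
```

Swapping `fit_affine` for a homography estimator changes the geometry model; tightening `thresh` or `min_inlier_ratio` changes the gating, which is exactly the kind of deployment choice the paper finds can dominate matcher choice.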

Abstract

Cross-modal optical-SAR (Synthetic Aperture Radar) registration is a bottleneck for disaster response via remote sensing, yet modern image matchers are developed and benchmarked almost exclusively on natural-image domains. We evaluate twenty-four pretrained matcher families, in a zero-shot setting with no fine-tuning or domain adaptation on satellite or SAR data, on SpaceNet9 and two additional cross-modal benchmarks under a deterministic protocol with tiled large-image inference, robust geometric filtering, and tie-point-grounded metrics. Our results reveal asymmetric transfer: matchers with explicit cross-modal training do not uniformly outperform those without it. While XoFTR (trained for visible-thermal matching) and RoMa achieve the lowest reported mean error at 3.0 px on the labeled SpaceNet9 training scenes, RoMa achieves this without any cross-modal training, and MatchAnything-ELoFTR (3.4 px), trained on synthetic cross-modal pairs, follows closely, suggesting (as a working hypothesis) that foundation-model features (DINOv2) may contribute a modality invariance that partially substitutes for explicit cross-modal supervision. 3D-reconstruction matchers (MASt3R, DUSt3R), which are not designed for traditional 2D image matching, are highly protocol-sensitive and remain fragile under default settings. Deployment protocol choices (geometry model, tile size, inlier gating) shift accuracy by up to 33× for a single matcher, sometimes exceeding the effect of swapping matchers entirely within the evaluated sweep; affine geometry alone reduces mean error from 12.34 to 9.74 px. These findings inform both practical deployment of existing matchers and future matcher design for cross-modal satellite registration.
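The tiled large-image inference mentioned in the protocol can be sketched as follows: split each large scene into overlapping tiles, run the matcher per tile pair, and lift tile-local matches back to global pixel coordinates. This is a hypothetical sketch, not the paper's pipeline; `match_fn` is a placeholder for any pretrained matcher, and it assumes the two images are already coarsely co-registered so corresponding tiles overlap.

```python
import numpy as np

def tile_origins(size, tile, overlap):
    """1-D tile start positions covering [0, size) with the given overlap."""
    if size <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, size - tile, step))
    starts.append(size - tile)  # final tile flush with the image edge
    return starts

def run_tiled(match_fn, img_a, img_b, tile=1024, overlap=128):
    """Run a pairwise matcher tile-by-tile and lift matches to global coords.
    match_fn(a, b) is assumed to return (pts_a, pts_b) as (N, 2) arrays of
    (x, y) points in tile-local coordinates."""
    h, w = img_a.shape[:2]
    all_a, all_b = [], []
    for y in tile_origins(h, tile, overlap):
        for x in tile_origins(w, tile, overlap):
            pa, pb = match_fn(img_a[y:y + tile, x:x + tile],
                              img_b[y:y + tile, x:x + tile])
            if len(pa):
                offset = np.array([x, y], dtype=float)  # tile origin in (x, y)
                all_a.append(pa + offset)
                all_b.append(pb + offset)
    if not all_a:
        return np.empty((0, 2)), np.empty((0, 2))
    return np.vstack(all_a), np.vstack(all_b)
```

The pooled global correspondences would then feed the robust geometric filtering stage; note that tile size is one of the protocol knobs the paper finds can shift accuracy substantially.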