Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching

arXiv cs.CV / 4/13/2026


Key Points

  • The arXiv paper introduces GREATEN, a stereo-matching framework aimed at improving synthetic-to-real (Syn-to-Real) generalization by using surface normals as domain-invariant geometric cues.
  • It proposes a Gated Contextual-Geometric Fusion (GCGF) module to suppress unreliable texture/context features and fuse them with normal-driven geometry for more discriminative representations.
  • To handle non-Lambertian regions (e.g., specular/transparent surfaces), it adds a Specular-Transparent Augmentation (STA) strategy that makes the fusion more robust to misleading visual cues.
  • The method uses sparse attention variants (SSA, SDMA, SVA) to preserve fine-grained global feature extraction for occlusions while reducing computational cost, improving inference speed and enabling high-resolution (3K) disparity estimation.
  • Experiments with training only on synthetic data show substantial error reductions, including 30% fewer errors on ETH3D, a 19.2% runtime speedup over the GREAT-IGEV baseline, and support for disparity ranges up to 768 on Middlebury.
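The sparse attention idea in the key points above can be illustrated with a generic top-k sparsification of single-head attention: each query attends only to its strongest few keys, cutting the softmax/weighting cost that dominates dense attention. This is an illustrative sketch of the general technique, not the paper's exact SSA, SDMA, or SVA designs, whose selection rules are not specified here.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Single-head attention where each query attends only to its top_k keys.

    q: (Nq, d), k: (Nk, d), v: (Nk, dv). Generic top-k sparse attention
    (hypothetical stand-in for the paper's SSA/SDMA/SVA variants).
    """
    scores = (q @ k.T) / np.sqrt(q.shape[-1])           # (Nq, Nk) scaled dot products
    # k-th largest score in each row; everything below it is masked out.
    kth = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)   # ties may keep a few extra keys
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                            # exp(-inf) -> 0 for masked keys
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                  # (Nq, dv)
```

With `top_k` equal to the number of keys this reduces to ordinary dense softmax attention, which makes the sparsification easy to sanity-check.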

Abstract

Despite remarkable advances in image-driven stereo matching over the past decade, Synthetic-to-Realistic Zero-Shot (Syn-to-Real) generalization remains an open challenge. This suboptimal generalization performance mainly stems from cross-domain shifts and ill-posed ambiguities inherent in image textures, particularly in occluded, textureless, repetitive, and non-Lambertian (specular/transparent) regions. To improve Syn-to-Real generalization, we propose GREATEN, a framework that incorporates surface normals as domain-invariant, object-intrinsic, and discriminative geometric cues to compensate for the limitations of image textures. The proposed framework consists of three key components. First, a Gated Contextual-Geometric Fusion (GCGF) module adaptively suppresses unreliable contextual cues in image features and fuses the filtered image features with normal-driven geometric features to construct domain-invariant and discriminative contextual-geometric representations. Second, a Specular-Transparent Augmentation (STA) strategy improves the robustness of GCGF against misleading visual cues in non-Lambertian regions. Third, sparse attention designs, namely Sparse Spatial (SSA), Sparse Dual-Matching (SDMA), and Simple Volume (SVA) attentions, preserve the fine-grained global feature extraction capability of GREAT-Stereo for handling occlusion and texture-related ambiguities while substantially reducing computational overhead. Trained exclusively on synthetic data such as SceneFlow, GREATEN-IGEV achieves outstanding Syn-to-Real performance. Specifically, it reduces errors by 30% on ETH3D, 8.5% on the non-Lambertian Booster, and 14.1% on KITTI-2015, compared to FoundationStereo, Monster-Stereo, and DEFOM-Stereo, respectively. In addition, GREATEN-IGEV runs 19.2% faster than GREAT-IGEV and supports high-resolution (3K) inference on Middlebury with disparity ranges up to 768.
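The gated fusion the abstract describes can be sketched as a tiny per-pixel module: a sigmoid gate, computed from both feature streams, down-weights unreliable channels of the image-context features before they are concatenated with the normal-driven geometric features. This is a minimal toy sketch under assumed shapes; the function, parameter names, and exact fusion rule are hypothetical, not the paper's GCGF implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_contextual_geometric_fusion(context_feat, normal_feat, w_gate, b_gate):
    """Toy gated fusion of image-context and normal-derived geometric features.

    context_feat: (N, C) image/context features (may carry misleading texture cues)
    normal_feat:  (N, C) geometric features derived from surface normals
    w_gate:       (2C, C) hypothetical learned gate weights
    b_gate:       (C,)    hypothetical learned gate bias
    """
    joint = np.concatenate([context_feat, normal_feat], axis=-1)  # (N, 2C)
    gate = sigmoid(joint @ w_gate + b_gate)                       # per-channel gate in (0, 1)
    filtered_context = gate * context_feat                        # suppress unreliable context cues
    # Fuse filtered context with geometry into one joint representation.
    return np.concatenate([filtered_context, normal_feat], axis=-1)  # (N, 2C)
```

In a real network the gate weights would be learned end-to-end, so the module can drive the gate toward zero wherever texture cues conflict with the normal-derived geometry (e.g., on specular or transparent surfaces).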