Normalized Matching Transformer

arXiv cs.CV / 5/6/2026


Key Points

  • The Normalized Matching Transformer (NMT) is presented as a deep-learning method for efficient, high-accuracy sparse semantic keypoint matching between image pairs.
  • NMT uses a visual backbone, geometric refinement via SplineCNN, and a normalized Transformer to produce matching features.
  • A core contribution is hyperspherical normalization, which enforces unit-norm embeddings at every Transformer layer and trains with a combined loss (InfoNCE contrastive loss plus a hyperspherical uniformity loss).
  • This architecture and loss combination pulls matching features closer together and pushes non-matching features further apart at every layer, not only at the output.
  • NMT sets a new state of the art on PascalVOC and SPair-71k, outperforming BBGM, ASAR, COMMON and GMTR by 5.1% and 2.2%, respectively, and converges in at least 1.7× fewer epochs than other state-of-the-art baselines.
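
To make the per-layer unit-norm idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the update rule (step toward the normalized sublayer output, then re-project onto the sphere) and the step size `alpha` are assumptions chosen for illustration, since the summary only states that embeddings are kept at unit norm after every Transformer layer.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean norm (project it onto the hypersphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def normalized_layer_step(h, sublayer_out, alpha=0.5):
    """Move h toward the normalized sublayer output, then re-project to unit norm.

    alpha is a hypothetical step size; the source does not specify the update rule.
    """
    target = l2_normalize(sublayer_out)
    moved = [hi + alpha * (ti - hi) for hi, ti in zip(h, target)]
    return l2_normalize(moved)

# Stand-in sublayer outputs (attention/MLP results would go here in a real model).
h = l2_normalize([3.0, 4.0])
for sublayer_out in ([1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]):
    h = normalized_layer_step(h, sublayer_out)
    # The embedding stays on the unit hypersphere after every layer.
    assert abs(math.sqrt(sum(x * x for x in h)) - 1.0) < 1e-9
```

Because every intermediate embedding has unit norm, dot products between keypoint features are cosine similarities at any depth, which is what lets a similarity-based loss act on each layer and not only on the output.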

Abstract

We introduce the Normalized Matching Transformer (NMT), a deep learning approach for efficient and accurate sparse semantic keypoint matching between image pairs. NMT consists of a strong visual backbone, geometric feature refinement via SplineCNN, and a normalized Transformer for computing matching features. Central to NMT is our hyperspherical normalization strategy: we enforce unit-norm embeddings at every Transformer layer and train with a combined contrastive InfoNCE and hyperspherical uniformity loss to yield more discriminative keypoint representations. This novel architecture/loss combination encourages close alignment of matching image features and large distances between non-matching ones not only at the output level, but at every layer. Despite its architectural simplicity, NMT sets a new state of the art on PascalVOC and SPair-71k, outperforming BBGM, ASAR, COMMON and GMTR by 5.1% and 2.2%, respectively, while converging in at least 1.7× fewer epochs than other state-of-the-art baselines. These results underscore the power of combining pervasive normalization with hyperspherical learning for matching tasks.
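
The combined objective described in the abstract can be sketched as follows. This is an illustrative reading, not the paper's exact formulation: the temperature `tau`, the weight `lam`, the kernel width `t`, the Gaussian-potential form of the uniformity term, and the choice to apply it within each image separately are all assumptions. Features are assumed to be unit-norm already, with `src[i]` and `tgt[i]` forming a ground-truth match.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(src, tgt, tau=0.1):
    """Mean InfoNCE over keypoints: tgt[i] is the positive for src[i], other tgt[j] are negatives."""
    total = 0.0
    for i, s in enumerate(src):
        logits = [dot(s, t) / tau for t in tgt]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(src)

def uniformity(feats, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (needs at least two features).

    Lower values mean the features are spread more evenly over the hypersphere.
    """
    vals = []
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            d2 = sum((a - b) ** 2 for a, b in zip(feats[i], feats[j]))
            vals.append(math.exp(-t * d2))
    return math.log(sum(vals) / len(vals))

def combined_loss(src, tgt, lam=1.0, tau=0.1, t=2.0):
    """InfoNCE across the image pair plus a uniformity term within each image (hypothetical weighting)."""
    return info_nce(src, tgt, tau) + lam * (uniformity(src, t) + uniformity(tgt, t))

# Toy example: two orthogonal unit-norm keypoint features per image.
src = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # tgt[i] matches src[i]
swapped = [[0.0, 1.0], [1.0, 0.0]]   # correspondences are wrong
assert combined_loss(src, aligned) < combined_loss(src, swapped)
```

In this sketch the InfoNCE term rewards high similarity for matching keypoints relative to all other candidates, while the uniformity term penalizes keypoints of the same image for clustering together. Under the per-layer normalization described above, the same function could be evaluated on the embeddings of any layer.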