SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

arXiv cs.CV / 3/20/2026

Key Points

  • SEAR is a simple fine-tuning strategy that adapts a pretrained visual geometric transformer to multimodal RGB-T inputs for improved 3D reconstruction and camera pose estimation.
  • Despite being fine-tuned on a relatively small RGB-T dataset, SEAR significantly outperforms state-of-the-art methods, including a gain of over 29% in AUC@30.
  • The approach delivers higher detail and better alignment between RGB and thermal modalities with negligible inference-time overhead compared to the original pretrained model, even under challenging conditions like low lighting and dense smoke.
  • The authors introduce a new RGB-T dataset with sequences captured across varying times, viewpoints, and illumination to serve as a robust benchmark for multimodal 3D scene reconstruction.
  • Code and pretrained models are publicly available on GitHub, facilitating replication and practical adoption.
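
The headline number above is AUC@30, the area under the pose-accuracy curve up to a 30° error threshold. As a rough illustration only — a generic formulation commonly used in pose-estimation benchmarks, not the authors' evaluation code — such a metric can be computed like this:

```python
import numpy as np

def pose_auc(errors_deg, max_threshold=30.0):
    """Area under the accuracy-vs-threshold curve for pose errors.

    `errors_deg`: one angular pose error (degrees) per image pair.
    For every threshold t in (0, max_threshold], accuracy(t) is the
    fraction of pairs with error <= t; the AUC averages that accuracy
    over the threshold range, so a perfect model scores 1.0.
    """
    errors = np.sort(np.asarray(errors_deg, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    # Prepend the origin so the curve starts at (0, 0).
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    # Clip the curve at the threshold and close it with a flat segment.
    last = np.searchsorted(errors, max_threshold)
    e = np.concatenate((errors[:last], [max_threshold]))
    r = np.concatenate((recall[:last], [recall[last - 1]]))
    # Trapezoidal integration, normalized by the threshold.
    area = np.sum((e[1:] - e[:-1]) * (r[1:] + r[:-1]) / 2.0)
    return area / max_threshold
```

Under this formulation, `pose_auc([5.0, 12.0, 40.0])` rewards the two pairs under 30° in proportion to how small their errors are, while the 40° outlier contributes nothing.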

Abstract

Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, achieving consistent improvements across all metrics (e.g., over 29% in AUC@30) and delivering higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.