A Dual-Stream Transformer Architecture for Illumination-Invariant TIR-LiDAR Person Tracking

arXiv cs.RO / 4/2/2026

Key Points

  • The paper introduces a dual-stream Transformer-based architecture for all-weather person tracking using Thermal-Infrared (TIR) and LiDAR/Depth sensors, targeting the failure cases of RGB-D tracking under extreme lighting such as total darkness and intense backlighting.
  • It leverages standard SLAM-capable robot sensor suites (LiDAR and TIR cameras) to build a practical TIR-D tracking system intended for autonomous mobile robots performing reliable human-following.
  • A key bottleneck is the scarcity of annotated multi-modal TIR-D datasets, which the authors address with a sequential knowledge transfer method that carries structural priors from a large-scale thermal-trained model into the TIR-D domain.
  • The method uses a “Fine-grained Differential Learning Rate Strategy” to retain pre-trained feature extraction while rapidly adapting to geometric depth cues for the tracking task.
  • Experiments report improved performance over RGB-transfer and single-modality baselines, including an Average Overlap (AO) of 0.700 and a Success Rate (SR) of 58.7%.
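
The differential learning rate idea in the key points can be illustrated with a minimal sketch. The summary does not give the paper's actual grouping or rates, so everything here is an assumption for illustration: the parameter-name prefixes (`tir_backbone.`, `depth_stream.`, `fusion_head.`), the scale factors, and the helper name `build_param_groups` are all hypothetical. The sketch only shows the general pattern of giving pre-trained layers a smaller learning rate than newly added ones.

```python
def build_param_groups(param_names, base_lr=1e-4, group_scales=None):
    """Assign each parameter name to a group with its own learning rate.

    All prefixes and scale factors are illustrative assumptions,
    not values taken from the paper.
    """
    if group_scales is None:
        group_scales = {
            "tir_backbone.": 0.1,  # pre-trained thermal features: small updates
            "depth_stream.": 1.0,  # new depth branch: full learning rate
            "fusion_head.": 1.0,   # new fusion/tracking head: full learning rate
        }
    groups = {prefix: [] for prefix in group_scales}
    for name in param_names:
        for prefix in group_scales:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
        else:
            # Fail loudly rather than silently training a parameter at a default rate.
            raise ValueError(f"no learning-rate group for parameter {name!r}")
    return [
        {"prefix": prefix, "params": names, "lr": base_lr * group_scales[prefix]}
        for prefix, names in groups.items()
        if names  # skip groups that matched no parameter
    ]


if __name__ == "__main__":
    names = [
        "tir_backbone.block0.weight",
        "depth_stream.conv.weight",
        "fusion_head.fc.bias",
    ]
    for group in build_param_groups(names):
        print(group["prefix"], group["lr"], group["params"])
```

The list-of-dicts shape mirrors the per-parameter-group options that PyTorch optimizers accept, although there the `params` entries would have to be the parameter tensors rather than their names.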

Abstract

Robust person tracking is a critical capability for autonomous mobile robots operating in diverse and unpredictable environments. While RGB-D tracking has shown high precision, its performance severely degrades under challenging illumination conditions, such as total darkness or intense backlighting. To achieve all-weather robustness, this paper proposes a novel Thermal-Infrared and Depth (TIR-D) tracking architecture that leverages the standard sensor suite of SLAM-capable robots, namely LiDAR and TIR cameras. A major challenge in TIR-D tracking is the scarcity of annotated multi-modal datasets. To address this, we introduce a sequential knowledge transfer strategy that evolves structural priors from a large-scale thermal-trained model into the TIR-D domain. By employing a differential learning rate strategy, referred to as the “Fine-grained Differential Learning Rate Strategy”, we effectively preserve pre-trained feature extraction capabilities while enabling rapid adaptation to geometric depth cues. Experimental results demonstrate that our proposed TIR-D tracker achieves superior performance, with an Average Overlap (AO) of 0.700 and a Success Rate (SR) of 58.7%, significantly outperforming conventional RGB-transfer and single-modality baselines. Our approach provides a practical and resource-efficient solution for robust human-following in all-weather robotics applications.
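
For readers unfamiliar with the two reported metrics, the following sketch shows how Average Overlap and Success Rate are commonly computed in tracking benchmarks: AO as the mean intersection-over-union (IoU) between predicted and ground-truth boxes, and SR as the fraction of frames whose IoU exceeds a threshold. The abstract does not state the paper's evaluation protocol, so the `(x, y, width, height)` box format, the 0.5 threshold, and the function names are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(pred_boxes, gt_boxes):
    """Mean IoU over all frames (AO)."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)


def success_rate(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of frames with IoU above the threshold (SR).

    The 0.5 default is an assumed threshold, not one stated in the abstract.
    """
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(o > threshold for o in overlaps) / len(overlaps)


if __name__ == "__main__":
    # Two frames: a perfect match and a half-shifted box (IoU = 1/3).
    pred = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gt = [(0, 0, 10, 10), (5, 0, 10, 10)]
    print(average_overlap(pred, gt), success_rate(pred, gt))
```

Under these definitions AO lies in [0, 1] while SR is usually quoted as a percentage, which matches the way the two numbers are reported here.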