Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

arXiv cs.CV / 4/24/2026


Key Points

  • The paper addresses unsupervised video-based visible–infrared person re-identification (VI-ReID), which is more realistic for surveillance than existing approaches that are mostly image-focused or supervised.
  • It proposes HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework that avoids explicit hard pseudo-label assignment while learning from RGB and infrared tracklets.
  • HiTPro uses a temporal-aware feature encoder to produce both discriminative frame-level features and robust tracklet-level representations.
  • It introduces hierarchical alignment with two-stage positive mining (within-modality first, then cross-modality) using dynamic thresholds and soft weight assignment, followed by hierarchical contrastive learning across three alignment levels.
  • Experiments on HITSZ-VCM and BUPTCampus show that HiTPro achieves state-of-the-art results in fully unsupervised settings and sets a strong baseline for future work.
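To make the first encoder step concrete, here is a minimal sketch of attention-weighted temporal pooling: frame-level features are scored, the scores are softmax-normalized, and the weighted average becomes the tracklet-level representation. The function name and the single learned scoring vector are illustrative assumptions, not the paper's exact encoder.

```python
import numpy as np

def temporal_pool(frame_feats: np.ndarray, score_w: np.ndarray) -> np.ndarray:
    """Aggregate frame-level features (T, D) into one tracklet-level
    representation (D,) via softmax-weighted temporal pooling.
    Illustrative sketch only; the paper's encoder may differ."""
    scores = frame_feats @ score_w                 # (T,) per-frame importance
    scores -= scores.max()                         # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ frame_feats                   # (D,) weighted average

# Usage: pool 8 frame features of dimension 16 into one tracklet feature.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
tracklet = temporal_pool(feats, rng.normal(size=16))  # shape (16,)
```

Because the weights form a convex combination, a tracklet of identical frames pools back to that frame feature, which is the sanity property one would want from any such aggregator.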

Abstract

Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework without explicit hard pseudo-label assignment for unsupervised video-based VI-ReID. HiTPro begins with an efficient Temporal-aware Feature Encoder that first extracts discriminative frame-level features and then aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro first constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping by aggregating features from temporally partitioned sub-tracklets. Through Hierarchical Cross-Prototype Alignment, we perform a two-stage positive mining process, progressing from within-modality associations to cross-modality matching, enhanced by a Dynamic Threshold Strategy and Soft Weight Assignment. Finally, Hierarchical Contrastive Learning progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishing a strong baseline for future research.
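The positive-mining step described above can be sketched as follows: cosine similarities between a tracklet feature and the current prototypes are thresholded dynamically (here, mean plus one standard deviation of the similarity distribution, an assumed form since the abstract does not give the exact rule), and the surviving candidates receive soft weights via a temperature-scaled softmax rather than a hard one-hot pseudo-label.

```python
import numpy as np

def mine_positives(query: np.ndarray, prototypes: np.ndarray, tau: float = 0.1):
    """Two-stage-style soft positive mining against a prototype bank.
    Returns indices of prototypes whose cosine similarity to `query`
    clears a dynamic threshold, plus softmax weights over them.
    The threshold (mean + std) and temperature are illustrative assumptions."""
    q = query / np.linalg.norm(query)
    P = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = P @ q                                   # cosine similarities
    thresh = sims.mean() + sims.std()              # dynamic threshold (assumed)
    idx = np.flatnonzero(sims >= thresh)           # candidate positives
    logits = sims[idx] / tau
    logits -= logits.max()                         # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()  # soft weight assignment
    return idx, weights

# Usage: the query matches exactly one of four orthogonal prototypes,
# so only that prototype clears the threshold and gets all the weight.
idx, w = mine_positives(np.array([1.0, 0.0, 0.0, 0.0]), np.eye(4))
# → idx == [0], w == [1.0]
```

In the full framework this mining would run first among same-modality prototypes and then across modalities, with the resulting soft weights feeding the three-level contrastive objectives; this sketch only shows the shared thresholding-plus-weighting mechanism.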