Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

arXiv cs.CV / 4/10/2026


Key Points

  • The paper addresses weaknesses in video-based person re-identification (ReID) under high-difficulty conditions (e.g., sports and dance), where multiple people with similar clothing move dynamically across cameras.
  • It proposes CG-CLIP, a caption-guided CLIP framework that uses textual descriptions generated by multimodal LLMs to refine identity-specific features via Caption-guided Memory Refinement (CMR).
  • CG-CLIP also introduces Token-based Feature Extraction (TFE), which uses cross-attention with fixed-length learnable tokens to aggregate spatiotemporal features efficiently and reduce computation.
  • Experiments on two standard datasets (MARS, iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID, DanceVReID) show consistent gains over state-of-the-art methods.
  • By combining caption guidance and tokenized spatiotemporal aggregation, the work aims to improve robustness for ReID scenarios that go beyond typical pedestrian footage.

Abstract

In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.
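The TFE idea described above — cross-attending a small, fixed set of learnable tokens over a variable number of frame features so the output size (and downstream cost) no longer grows with clip length — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token count `K = 8`, feature dimension `d = 64`, and single-head dot-product attention are all assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_cross_attention(frame_feats, query_tokens):
    """Aggregate variable-length spatiotemporal features into K tokens.

    frame_feats:  (T*P, d) — T frames × P patch features, flattened
    query_tokens: (K, d)   — fixed-length learnable tokens (random here;
                             in training they would be optimized)
    returns:      (K, d)   — fixed-size summary, independent of T
    """
    d = frame_feats.shape[-1]
    # Scaled dot-product cross-attention: tokens are queries,
    # frame features are keys and values.
    attn = softmax(query_tokens @ frame_feats.T / np.sqrt(d), axis=-1)
    return attn @ frame_feats

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 64))          # K=8 learnable tokens (assumed)
short_clip = rng.standard_normal((4 * 49, 64))   # 4 frames × 49 patches
long_clip = rng.standard_normal((16 * 49, 64))   # 16 frames × 49 patches

# Output shape is fixed regardless of how many frames come in.
out_short = token_cross_attention(short_clip, tokens)
out_long = token_cross_attention(long_clip, tokens)
assert out_short.shape == out_long.shape == (8, 64)
```

The point of the fixed token length is visible in the last two lines: a 4-frame clip and a 16-frame clip both compress to the same (K, d) representation, which is what keeps later layers' computation constant as clip length varies.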