Beyond Pedestrians: Caption-Guided CLIP Framework for High-Difficulty Video-based Person Re-Identification

arXiv cs.CV / 4/10/2026


Key Points

  • The paper addresses weaknesses in video-based person re-identification (ReID) under high-difficulty conditions (e.g., sports and dance), where multiple people with similar clothing move dynamically across cameras.
  • It proposes CG-CLIP, a caption-guided CLIP framework that uses textual descriptions generated by multimodal LLMs to refine identity-specific features via Caption-guided Memory Refinement (CMR).
  • CG-CLIP also introduces Token-based Feature Extraction (TFE), which uses cross-attention with fixed-length learnable tokens to aggregate spatiotemporal features efficiently and reduce computation.
  • Experiments on two standard datasets (MARS, iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID, DanceVReID) show consistent gains over state-of-the-art methods.
  • By combining caption guidance and tokenized spatiotemporal aggregation, the work aims to improve robustness for ReID scenarios that go beyond typical pedestrian footage.

Abstract

In recent years, video-based person Re-Identification (ReID) has gained attention for its ability to leverage spatiotemporal cues to match individuals across non-overlapping cameras. However, current methods struggle with high-difficulty scenarios, such as sports and dance performances, where multiple individuals wear similar clothing while performing dynamic movements. To overcome these challenges, we propose CG-CLIP, a novel caption-guided CLIP framework that leverages explicit textual descriptions and learnable tokens. Our method introduces two key components: Caption-guided Memory Refinement (CMR) and Token-based Feature Extraction (TFE). CMR utilizes captions generated by Multi-modal Large Language Models (MLLMs) to refine identity-specific features, capturing fine-grained details. TFE employs a cross-attention mechanism with fixed-length learnable tokens to efficiently aggregate spatiotemporal features, reducing computational overhead. We evaluate our approach on two standard datasets (MARS and iLIDS-VID) and two newly constructed high-difficulty datasets (SportsVReID and DanceVReID). Experimental results demonstrate that our method outperforms current state-of-the-art approaches, achieving significant improvements across all benchmarks.
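The TFE idea described above — cross-attending a small, fixed set of learnable tokens over a variable number of frame features so the output size (and downstream cost) no longer grows with clip length — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token count `K = 8`, feature dimension `d = 64`, and single-head dot-product attention are all assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_cross_attention(frame_feats, query_tokens):
    """Aggregate variable-length spatiotemporal features into K tokens.

    frame_feats:  (T*P, d) — T frames × P patch features, flattened
    query_tokens: (K, d)   — fixed-length learnable tokens (random here;
                             in training they would be optimized)
    returns:      (K, d)   — fixed-size summary, independent of T
    """
    d = frame_feats.shape[-1]
    # Scaled dot-product cross-attention: tokens are queries,
    # frame features are keys and values.
    attn = softmax(query_tokens @ frame_feats.T / np.sqrt(d), axis=-1)
    return attn @ frame_feats

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 64))          # K=8 learnable tokens (assumed)
short_clip = rng.standard_normal((4 * 49, 64))   # 4 frames × 49 patches
long_clip = rng.standard_normal((16 * 49, 64))   # 16 frames × 49 patches

# Output shape is fixed regardless of how many frames come in.
out_short = token_cross_attention(short_clip, tokens)
out_long = token_cross_attention(long_clip, tokens)
assert out_short.shape == out_long.shape == (8, 64)
```

The point of the fixed token length is visible in the last two lines: a 4-frame clip and a 16-frame clip both compress to the same (K, d) representation, which is what keeps later layers' computation constant as clip length varies.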